101
1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther Rheinbay 1 *, Morten Muhlig Nielsen 2 *, Federico Abascal 3 *, Grace Tiao 1# , Henrik Hornshøj 2# , Julian M. Hess 1# , Randi Istrup Pedersen 2# , Lars Feuerbach 4 , Radhakrishnan Sabarinathan 5,6 , Tobias Madsen 2 , Jaegil Kim 1 , Loris Mularoni 5,6 , Shimin Shuai 18 , Andrés Lanzós 8,9 , Carl Herrmann 10,11 , Yosef E. Maruvka 1,12 , Ciyue Shen 13,14 , Samirkumar B. Amin 15,16 , Johanna Bertl 2 , Priyanka Dhingra 17,18 , Klev Diamanti 19 , Abel Gonzalez-Perez 5,6 , Qianyun Guo 20 , Nicholas J. Haradhvala 1,12 , Keren Isaev 21,22 , Malene Juul 2 , Jan Komorowski 19,23 , Sushant Kumar 24 , Donghoon Lee 24 , Lucas Lochovsky 25 , Eric Minwei Liu 17,18 , Oriol Pich 5,6 , David Tamborero 5,6 , Husen M. Umer 19 , Liis Uusküla-Reimand 26,27 , Claes Wadelius 28 , Lina Wadi 21 , Jing Zhang 24 , Keith A. Boroevich 29 , Joana Carlevaro-Fita 8,9 , Dimple Chakravarty 30 , Calvin W.Y. Chan 10,33 , Nuno A. Fonseca 31 , Mark P. Hamilton 32 , Chen Hong 4,33 , Andre Kahles 34 , Youngwook Kim 35 , Kjong-Van Lehmann 34 , Todd A. Johnson 29 , Abdullah Kahraman 36 , Keunchil Park 35 , Gordon Saksena 1 , Lina Sieverling 4,33 , Nicholas A. Sinnott-Armstrong 37 , Peter J. Campbell 3,38 , Asger Hobolth 20 , Manolis Kellis 1,39 , Michael S. Lawrence 1,12 , Ben Raphael 40 , Mark A. Rubin 41,42,43 , Chris Sander 13,14 , Lincoln Stein 7,21,44 , Josh Stuart 45 , Tatsuhiko Tsunoda 46,47,48 , David A. Wheeler 49 , Rory Johnson 50 , Jüri Reimand 21,22 , Mark B. Gerstein 51,52,53 , Ekta Khurana 17,18,42,43 , Núria López-Bigas 5,6,54 , Iñigo Martincorena 3& , Jakob Skou Pedersen 2,20& , Gad Getz 1,12,55,56& on behalf of the PCAWG Drivers and Functional Interpretation Group and the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Network * These authors contributed equally # These authors contributed equally & These authors contributed equally Abstract Discovery of cancer drivers has traditionally focused on the identification of protein- coding genes. Here we present a comprehensive analysis of putative cancer driver mutations in both protein-coding and non-coding genomic regions across >2,500 whole cancer genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. We developed a statistically rigorous strategy for combining significance levels from multiple driver discovery methods and demonstrate that the integrated results overcome limitations of individual methods. We combined this strategy with careful filtering and applied it to protein-coding genes, promoters, untranslated regions (UTRs), distal enhancers and non-coding RNAs. These analyses redefine the . CC-BY-NC-ND 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted December 23, 2017. ; https://doi.org/10.1101/237313 doi: bioRxiv preprint

was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

1

Discovery and characterization of coding and non-coding driver mutations in more

than 2,500 whole cancer genomes

Esther Rheinbay1*, Morten Muhlig Nielsen2*, Federico Abascal3*, Grace Tiao1#, Henrik

Hornshøj2#, Julian M. Hess1#, Randi Istrup Pedersen2#, Lars Feuerbach4, Radhakrishnan

Sabarinathan5,6, Tobias Madsen2, Jaegil Kim1, Loris Mularoni5,6, Shimin Shuai18, Andrés

Lanzós8,9, Carl Herrmann10,11, Yosef E. Maruvka1,12, Ciyue Shen13,14, Samirkumar B.

Amin15,16, Johanna Bertl2, Priyanka Dhingra17,18, Klev Diamanti19, Abel Gonzalez-Perez5,6,

Qianyun Guo20, Nicholas J. Haradhvala1,12, Keren Isaev21,22, Malene Juul2, Jan

Komorowski19,23, Sushant Kumar24, Donghoon Lee24, Lucas Lochovsky25, Eric Minwei

Liu17,18, Oriol Pich5,6, David Tamborero5,6, Husen M. Umer19, Liis Uusküla-Reimand26,27,

Claes Wadelius28, Lina Wadi21, Jing Zhang24, Keith A. Boroevich29, Joana Carlevaro-Fita8,9,

Dimple Chakravarty30, Calvin W.Y. Chan10,33, Nuno A. Fonseca31, Mark P. Hamilton32, Chen

Hong4,33, Andre Kahles34, Youngwook Kim35, Kjong-Van Lehmann34, Todd A. Johnson29,

Abdullah Kahraman36, Keunchil Park35, Gordon Saksena1, Lina Sieverling4,33, Nicholas A.

Sinnott-Armstrong37, Peter J. Campbell3,38, Asger Hobolth20, Manolis Kellis1,39, Michael S.

Lawrence1,12, Ben Raphael40, Mark A. Rubin41,42,43, Chris Sander13,14, Lincoln Stein7,21,44,

Josh Stuart45, Tatsuhiko Tsunoda46,47,48, David A. Wheeler49, Rory Johnson50, Jüri

Reimand21,22, Mark B. Gerstein51,52,53, Ekta Khurana17,18,42,43, Núria López-Bigas5,6,54, Iñigo

Martincorena3&, Jakob Skou Pedersen2,20&, Gad Getz1,12,55,56& on behalf of the PCAWG

Drivers and Functional Interpretation Group and the ICGC/TCGA Pan-Cancer Analysis of

Whole Genomes Network

* These authors contributed equally

# These authors contributed equally

& These authors contributed equally

Abstract

Discovery of cancer drivers has traditionally focused on the identification of protein-

coding genes. Here we present a comprehensive analysis of putative cancer driver

mutations in both protein-coding and non-coding genomic regions across >2,500

whole cancer genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG)

Consortium. We developed a statistically rigorous strategy for combining significance

levels from multiple driver discovery methods and demonstrate that the integrated

results overcome limitations of individual methods. We combined this strategy with

careful filtering and applied it to protein-coding genes, promoters, untranslated

regions (UTRs), distal enhancers and non-coding RNAs. These analyses redefine the

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 2: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

2

landscape of non-coding driver mutations in cancer genomes, confirming a few

previously reported elements and raising doubts about others, while identifying novel

candidate elements across 27 cancer types. Novel recurrent events were found in the

promoters or 5’UTRs of TP53, RFTN1, RNF34, and MTG2, in the 3’UTRs of NFKBIZ and

TOB1, and in the non-coding RNA RMRP. We provide evidence that the previously

reported non-coding RNAs NEAT1 and MALAT1 may be subject to a localized

mutational process. Perhaps the most striking finding is the relative paucity of point

mutations driving cancer in non-coding genes and regulatory elements. Though we

have limited power to discover infrequent non-coding drivers in individual cohorts,

combined analysis of promoters of known cancer genes show little excess of

mutations beyond TERT.

Introduction

Discovery of cancer drivers has traditionally focused on the identification of

recurrently mutated protein-coding genes. Large-scale projects such as The Cancer

Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have

profiled the genomes of a number of different cancer types to date, leading to the discovery

of many putative cancer genes. Whole-genome sequencing (WGS) has made it possible to

systematically survey non-coding regions for potential driver events. Recent studies of

single-nucleotide variants (SNVs) and insertions/deletions (indels) detected from WGS data

from smaller cohorts of patients have revealed putative candidates for regulatory driver

events1–8. However, to date, only a few events have been functionally validated as regulatory

drivers, affecting expression of one or more target genes by changing their regulation, RNA

stability or disruption of normal genome topology6,9–12. Non-coding RNAs play diverse

regulatory roles and are enzymatically involved in key steps of transcription and protein

synthesis. Though functional evidence for non-coding RNAs in cancer is accumulating13,14,

only few examples have been shown to be the target of recurrent mutation15,16. In contrast to

protein-coding regions, the lower coverage and complexity of DNA sequence in these non-

coding regions pose additional challenges for high-quality mutation calling, mutation rate

estimation and identification of driver events. Adequate statistical methods that address

these issues are needed to identify non-coding drivers from these data.

The ICGC and TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) effort, which has

collected and systematically analyzed >2,700 cancer genome sequences from >2,500

patients representing a variety of cancer types17, offers an unprecedented opportunity to

perform a comprehensive analysis of putative coding and non-coding driver events. Here,

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 3: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

3

we describe results from multiple methods for detecting such drivers based on somatic

SNVs and indels and a framework for integrating them. Using this approach, we identify

significantly mutated genomic elements of various types in individual cancer types and

across cancer types. We then use additional sources of information to remove from the initial

list of significant elements those likely to result from mapping issues caused by repetitive

regions or by inaccurate estimation of the local background mutation rate, and classify

recurrently mutated elements as putative cancer-drivers or as likely false positives. Finally,

we estimate our power to discover novel drivers in the PCAWG data set and evaluate the

overall density of non-coding regulatory driver mutations around known cancer genes.

Overview of recurrent mutations across cancer types

Many protein-coding driver mutations and the two well-studied TERT promoter mutations

occur in frequently mutated single base genomic “hotspots”. In order to obtain an initial view

of these in the PCAWG data set, we generated a genome-wide list of highly mutated

hotspots by ranking all sites in the genome based on the number of cancers that harbor

somatic mutations in them. Only 12 genomic sites were recurrently somatically mutated in

more than 1% (26) of cancers and 106 in more than 0.5%, while the vast majority (93.19%)

of somatic mutations were private to a single patient’s cancer (Methods; Extended Data

Fig. 1). Interestingly, although protein-coding regions span only ~1% of the genome, 15

(30%) of the 50 most frequently mutated sites were known amino acid altering hotspots in

known cancer genes (in KRAS, BRAF, PIK3CA, TP53, and IDH1) (Fig. 1a). Also among the

top hits were the two TERT promoter hotspots. The top 50 list further contained a high

proportion of mutations almost exclusively contributed by lymphoid malignancies (Lymph-

BNHL and Lymph-CLL) (18 hotspots) and melanomas (6). The melanoma-specific hotspots

were located in regions occupied by transcription factors (5/6 in promoters), a phenomenon

that has recently been attributed to reduced nucleotide excision repair (NER) of UV-induced

DNA damage17–19. Indeed, signature analysis of the melanoma hotspot mutations attributed

them to UV radiation (Fig. 1b). The hotspots in lymphoid malignancies reside in the IGH

locus on chromosome 14, a known phenomenon in B-cell-derived cancers where IGH has

undergone somatic hypermutation by the activation-induced cytosine deaminase (AID), an

observation confirmed by mutational signature analysis (Fig. 1b). Hotspots in an intron of

GPR126 and in the promoter of PLEKHS13,20 were located inside palindromic DNA that may

fold into hairpin structures and expose single-stranded DNA loops to APOBEC enzymes3;

accordingly, these mutations had high APOBEC signature probabilities. The six remaining

non-coding hotspots could be attributed to a combination of mutational processes and

presumed technical artifacts (Supplementary Note). In contrast to these non-coding and

cancer-specific sites, mutations in protein-coding hotspots were attributed to the aging

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 4: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

4

signatures, or to a mixture of signatures. Our observation that, with the exception of the

TERT promoter hotspots, all of the top non-coding hotspots are generated by highly

localized, cancer-specific mutational processes suggests that these may be passengers.

Next, we sought to systematically identify genomic elements that drive cancer by

comprehensively analyzing SNVs and indels in transcribed and regulatory genomic

elements, including protein-coding genes, promoters, 5’UTRs, 3’UTRs, splice sites, distal

regulatory elements/enhancers, lncRNA genes, short ncRNAs, and miRNAs (in total ~4% of

the genome) (Methods; Fig. 1c; Extended Data Fig. 2; Supplementary Table 1). Drivers

can be common across many cancer types (such as TP53, PIK3CA, PTEN, and TERT

promoter) or highly specific (e.g. CIC and FUBP1 in oligodendroglioma, BCR-ABL in chronic

myelogenous leukemia, and FOXA1 in breast and prostate cancers). Analysis of large mixed

tumor cohorts increases the power to discover common drivers but dilutes the signal from

cancer-specific ones. In order to maximize our ability to detect common and cancer-specific

drivers, we performed analyses of individual tumor types, tumors grouped by their tissue of

origin or organ site (“meta-cohorts”), as well as a pan-cancer set (Fig. 1d; Methods). This

aggregation of tumors by tissue or organ site further allowed us to take advantage of

patients from cohorts too small to analyze individually (<20 patients). Overall, we analyzed

2,583 unique patient samples in 27 individual tumor types and 15 meta-cohorts.

Discovery of driver elements

In order to identify bona fide drivers in each cohort, we collected and integrated results from

multiple driver discovery methods. These methods identify statistically significant elements

based on SNVs and indels and evidence from one or more of three criteria: (i) mutational

burden, (ii) functional impact of mutation changes, and (iii) clustering of mutation sites in

hotspots (Fig. 2a; Methods; Supplementary Table 2). Generally, mutational burden and

clustering of events are independent of the type of genomic element that is tested for

significance. The ability to use measures of functional impact, however, may vary greatly

depending on the element type. In protein-coding genes, functional impact is assessed

through the amino acid change introduced by a given mutation. For some non-coding

elements, the impact of mutations on transcription factor binding sites, miRNA binding sites,

or functional RNA structure may be predicted. However, in general, the exact consequences

of somatic mutations are harder to assess in non-coding regions than in protein-coding

regions, and may even vary substantially across cell types.

Integration of results from multiple driver discovery methods

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 5: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

5

Most cancer genomic studies to date have employed a single method to identify putative

cancer driver elements. This can introduce biases because the number and type of drivers

detected by individual methods depends on their underlying assumptions and statistical

model. To reach a comprehensive consensus list of candidate drivers, we collected the p-

values calculated by up to 16 driver discovery methods for each genomic element

(Supplementary Table 2, 3) and derived a general strategy to integrate them (Fig. 2a). Our

strategy accounts for the correlation of p-values among driver discovery methods that use

similar approaches (burden, clustering, functional impact) when applied to the same data,

while accumulating the independent evidence provided by different methods (Fig. 2b). To

integrate potentially correlated statistical results, we used Brown’s method for combining p-

values21, an extension of the popular Fisher’s method22 (Fig. 2c). We first used simulated

data sets that preserve local mutation rates and signatures but do not contain drivers, to

assess the specificity of each method and the correlation structure among different methods

when no genomic element is expected to have a significant excess or pattern of mutations

(Methods; Extended Data Fig. 3). As anticipated, p-values were correlated among methods

that share similar approaches. We then used the resulting correlation structure to integrate

p-values from the observed (real) mutation data from different methods into a single p-value

for each genomic element. Since simulated data sets are based on assumptions, we tested

whether we can directly estimate the correlation structure from p-values calculated on the

observed data. As this simpler approach yielded analogous results, it was used throughout

this study (Methods, Extended Data Fig. 3). After additional filtering steps (see below), we

conservatively controlled for multiple hypothesis testing using the Benjamini-Hochberg

procedure (BH)23 across the 27 individual and 15 meta-cohorts, for each element type

separately (Fig. 2e). Cohort-element combinations (hits) with q<0.1 (10% false discovery

rate [FDR]) were considered significant. Overall, the union of all hits from the individual

driver discovery methods contained 3,048 cohort-element pairs involving 1,694 unique

elements, whereas the integrated results contained 1,406 significant hits with 635 unique

elements (Supplementary Table 4, 5).

Flagging and removal of potential false positive hits

Even after careful variant calling, false positive identification of driver loci can arise from

residual inaccuracies in background models, residual sequencing and mapping artifacts or

from local increases in the burden of mutations generated by as-yet unmodeled mutational

processes. To minimize technical issues, we carefully reviewed and filtered all candidate

driver elements based on low-confidence mappability regions and site-specific noise in a

panel of normal samples from PCAWG (Fig. 2d; Fig. 3a,b; Methods). For example, the

lncRNA RN7SK is a small nuclear RNA with many homologous regions in the genome,

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 6: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

6

making it prone to read mapping errors (Fig. 3a), while a fraction of LEPROTL1 mutations

fell in positions flagged by the site-specific noise filter (Fig. 3b).

In addition, mutational processes not properly captured by current background models can

cause local accumulation of mutations. As shown in the hotspot analysis above, events in

the promoters of PIM1 and RPL13A are strongly associated with the AID and UV mutational

processes, respectively, as previously noted18,19,24 (Fig. 3c). Moreover, DNA palindromes

may form hairpin structures that expose loop bases to APOBEC enzymes, as has been

proposed for PLEKHS1, GPR126, TBC1D12 and LEPROTL12,3 (Fig. 3d). Non-coding RNAs

that form secondary structures often contain palindromic sequences, which could be

preferential targets for APOBEC enzymes. However, only a small fraction of the mutations in

structural RNAs appear to be APOBEC derived overall (4.8%) (Supplementary Note). In

total, 569 out of 1,406 hits (element-cohort combinations with q<0.1) and 383 out of 635

unique elements that were initially significant, were filtered out, resulting in 837 hits (Fig. 3e;

Supplementary Tables 4, 5). Finally, multiple hypothesis correction was repeated after

assigning a non-significant p-value (P = 1) to the filtered out hits, resulting in further removal

of 91 candidate element-cohort combinations.

Performance of the integration method

To evaluate the sensitivity and validity of the integration approach, we compared the

performance of individual methods and of the integration on protein-coding genes using, as

a gold-standard, a list of 603 known recurrently-mutated cancer genes in the Cancer Gene

Census25 (CGC v80). These analyses revealed that, typically, the integration of different

methods outperformed individual methods, both in terms of sensitivity and specificity,

yielding lists of significant genes that were longer and more enriched in high-confidence

cancer genes (Extended Data Fig. 4).

Discovery of recurrently mutated elements

Our conservative integration and filtering strategy yielded 746 total hits, which include 646

hits in 157 protein-coding genes, 30 hits in seven protein-coding promoters, 26 hits in eight

long non-coding RNAs, 18 hits in six non-coding RNA promoters, ten hits in six 3’UTRs, six

hits in four 5’UTRs, six hits in one enhancer, three hits in one microRNA and one hit in a

small RNA candidate (Fig 4a; Supplementary Table 5). The number of significant elements

varied from just one in clear-cell renal cancer to 78 in the Carcinoma meta-cohort (Fig. 4b)

and depended strongly on cohort size (r = 0.84; P = 1.9x10-8; also see power analysis

below), reflecting that the landscape of driver candidates, particularly in small tumor cohorts,

is still incomplete.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 7: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

7

Comparison of the p-values obtained from individual tumor types and meta-cohorts revealed

that although most candidate drivers gained significance in larger meta-cohorts, the tumor-

type specific genes DAXX (Panc-Endocrine), BRAF (Skin-Melanoma), NRAS (Skin-

Melanoma), SPOP (Prost-AdenoCa) and hsa-mir-142 (Lymph-BNHL) scored higher in their

respective tumor types (Extended Data Fig. 5). These results emphasize the need for

careful tradeoff between tumor-type specificity and cohort size when searching for drivers.

Protein-coding mutations. The ability to discover known protein-coding cancer genes from

WGS data provided the opportunity to evaluate our strategy and to put non-coding driver hits

in context (Extended Data Fig. 6). Overall, candidate coding drivers were concordant with

previous results: of the 157 genes significant in at least one patient cohort, 64% are listed in

the CGC and 87% are in a more comprehensive list of cancer genes

(https://doi.org/10.1101/190330). In contrast to prior studies based on large exome

sequencing data sets, the moderate number of patients per cancer type in this data set

provided sufficient power to detect only the genes with the strongest signal. Indeed, when

we relaxed the significance threshold (q<0.25) to detect hits “near significance”, we found 93

additional hits in 62 unique genes. Half (31) of these genes were already discovered as

significant (q<0.1) in at least one cohort in this study, and now became significant in

additional cohorts. Of the 31 genes not previously identified by our analyses, 19% were in

the CGC and 32% in the more comprehensive list of drivers. These results confirm that

many cancer genes were just beyond our significance threshold and would likely have been

discovered with larger cohorts (Supplementary Table 4).

Non-coding driver candidates

In contrast to protein-coding elements, there was far less agreement among significance

methods when analyzing non-coding elements, and most hits were strongly supported by

only a few methods (Extended Data Fig. 7). The limited agreement between driver

discovery methods in non-coding elements is likely due to a weaker signal of selection and

different underlying assumptions that imperfectly model background mutation frequencies in

non-coding regions. Thus, in order to nominate a significant element as a candidate driver,

we carefully reviewed the supporting evidence from the genomic data and sought additional

evidence from PCAWG (chromosomal breakpoints, copy-number, loss-of-heterozygosity and

expression data), cancer gene databases and the literature (Supplementary Table 6).

Promoters and 5’UTRs. Due to the overlap between the downstream regions of promoters

and 5’UTRs, we reviewed the 11 significant elements from our promoter and 5’UTR

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 8: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

8

analyses together, yielding several interesting candidates. Our integrated results confirmed

that the TERT promoter is the most significantly mutated promoter in cancer, being

significant in 8 individual tumor types and 11 meta-cohorts (Fig. 4c; Supplementary Table

5). As previously reported, we observed significantly higher TERT transcript expression

levels in mutated compared to non-mutated cases (Extended Data Fig. 8).

Among the remaining candidates, we found recurrent mutations in the promoters of PAX5

and RFTN1 in lymphoma. PAX5 is a known off-target of AID hypermutation24 and, indeed,

the percentage of its mutations attributable to AID activity (46%) was just below our filtering

threshold. Consistently, mutations in its promoter were not associated with gene expression

changes, suggesting that these mutations may be passengers (Extended Data Fig. 8). In

contrast, mutations in the promoter of RFTN1 were weakly associated with an increase in

RFTN1 expression levels (P = 0.03; mutant/wild type fold difference (FD) = 1.2; P = 0.02,

after excluding 8/21 mutations attributed to the AID signature) (Fig. 5a). The protein

encoded by RFTN1, Raftlin, associates with B-cell receptor complexes26 but its function in

lymphoma is unclear.

Mutations called in the 5’UTR of RNF34 in Liver-HCC overlap an intronic DNAse I

hypersensitive region in multiple cell types27, suggesting that events may affect expression

of RNF34 or a neighboring gene through an intragenic enhancer. Indeed, expression of the

histone demethylase KDM2B located downstream of RNF34 shows weak correlation with

these mutations (P = 0.03; FD = 1.3; Fig 5b) with no effect on RNF34 mRNA (P = 0.41).

Mutations in the 5’ region of MTG2 were concentrated in a hotspot and showed marginally

significant decreased expression in the Pan-cancer (P = 0.036; FD = 0.8) and Carcinoma (P

= 0.029; FD = 0.8) meta-cohorts (Fig. 5c; Extended Data Fig. 8). MTG2 encodes a little-

studied GTPase that associates with the mitochondrial ribosome28. HES1 promoter

mutations were significant in Carcinoma and Pan-cancer, but showed no association with

gene expression (Extended Data Fig. 8). HES1 is a NOTCH signalling target29, and is

focally amplified in gastric cancers (Extended Data Fig. 9).

PTDSS1, DTL (and INTS7, which shares a bidirectional promoter), IFI44L and POLR3E

showed trends towards increased or decreased expression in mutated tumors, although the

small number of samples prevented us from drawing definitive conclusions (Extended Data

Fig. 8). None of these genes were present in significantly amplified or deleted focal peaks

(Methods). Validation of these hits with additional data in further studies will be needed to

evaluate whether they are genuine drivers.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 9: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

9

As previously reported in a number of studies1,3,7, we also found recurrent mutations in the

promoter of WDR74, which was significant in multiple cohorts. The mutations concentrate in

a small region overlapping an evolutionarily conserved spliceosomal U2 snRNA and

therefore potentially affect splice patterns. However, no significant associations with either

WDR74 or transcriptome-wide splicing were seen (Supplementary Note; Methods). We

did, however, find that the repetitive U2 sequence was frequently affected by recurrent

artifacts in SNP databases, raising concerns about potential mapping errors for this

repetitive RNA type (Extended Data Fig. 10).

Finally, restricted hypothesis testing of the promoters of CGC genes (n = 603) revealed

significant recurrence of mutations in the promoter region of TP53 (11 patients in the pan-

cancer cohort; q = 0.044, NBR method). Most of these mutations were substitutions and

deletions affecting either the TSS or the donor splice site of the first non-coding exon of

TP53. For the 6 samples with expression data, these mutations were associated with a

dramatic reduction of expression (Fig. 5d). In 8 of the 11 mutant cases, the mutation

occurred in combination with copy-neutral loss of heterozygosity. This is the first report of a

relatively infrequent, but impactful, form of TP53 inactivation by non-coding mutations.

Enhancers. Two enhancer regions were found significant in our integration analysis; one at

the IGH locus in Lymph-CLL and the other near TP53TG1 in multiple cohorts. However,

many of the recurrent mutations in the enhancer overlapping the IGH locus are due to

characteristic somatic hypermutation; 48% of mutations in this enhancer element matched

the AID signature, which is just below our filtering threshold. Nine of 18 mutations in the

enhancer near TP53TG1 were contributed by esophageal cancers. However, 89% of

mutations in these esophageal tumors were attributed to a mutational signature common in

this cancer type (mostly T>G substitutions in NTT context), raising the concern that these

reflect a localized mutational process rather than a driver event30 (PCAWG7 publication).

These mutations were concentrated around a conserved region (Fig. 5e) and overlapped

sites bound by NFIC and ZBTB7 transcription factors in HepG2 cells27. TP53TG1 is a non-

coding RNA suggested as a tumor suppressor involved in p53 response to DNA damage

and has been reported to be epigenetically silenced in cancer31.

3’UTRs. Recurrent somatic events were identified in the 3’UTRs of six genes: FOXA1 in

prostate cancer, ALB in liver cancer, SFTPB in lung adenocarcinoma, NFKBIZ and

TMEM107 in lymphomas, and TOB1 in Carcinoma, most of which contained a large number

of indels (Fig. 4c). High rates of indels in the 3’UTR of ALB in liver cancer and SFTPB in

lung cancer have been recently reported8 but it is unclear whether they are cancer drivers or

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 10: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

10

the result of a poorly-understood mechanism of localized indel hypermutation. To determine

whether indels affect function and accumulate specifically in the 3’UTRs, we compared indel

rates in 3’UTRs with other regions around these genes. Consistent with local hypermutation,

rather than selection, we observed similar indel rates in 3’UTR and 1 kb downstream of the

polyadenylation site in ALB, SFTPB and FOXA1 (Fig. 6a). This, together with additional

observations presented below, strongly suggest that indel recurrence at these loci is caused

by a novel localized hypermutation process.

On the other hand, although ALB is subject to localized hypermutation, protein-coding SNVs

in ALB are enriched in missense and truncating events relative to synonymous mutations in

liver cancer (P = 1.5x10-7; Fig. 6b). Furthermore, the ALB locus is lost in a fraction of liver

cases (Fig. 6b), with copy losses having a tendency to occur in samples without somatic

mutations (Fisher’s exact test, P = 0.0099). These findings, suggest that loss of ALB may be

a genuine driver event in liver cancer. Similarly, the well-known driver role of FOXA1 in

prostate tumors is supported by the functional impact of coding mutations (P = 4.4x10-7) and

focal amplifications, raising the possibility that the mutation enrichment observed in the

3'UTR might also be functional.

In contrast to the genes discussed above, the number of indels in the 3’UTRs of NFKBIZ and

TOB1 was significantly higher than in other parts of these genes, suggesting that indels in

the 3’UTRs could be under functional selection (Fig. 6a). TOB1 encodes an anti-proliferation

regulator that associates with ERBB2, and also affects migration and invasion in gastric

cancer32. As part of the CCR4-NOT complex, TOB1 regulates other mRNAs through binding

to their 3’UTR and promoting deadenylation33. Tumors with 3’UTR mutations in TOB1

showed a trend towards decreased expression (P = 0.053; FD = 0.7). Mutations did not

concentrate in known miRNA binding sites, possibly due to incomplete annotation (Fig. 6c).

However, the extreme conservation of this region (5th most conserved among all tested

3’UTRs with an average PhyloP score of 5.6) indicates that these events likely disrupt

functionally important sites (Fig. 6c). Interestingly, TOB1 and its neighboring gene WFIKKN2

are focally amplified in breast cancer and pan-cancer, suggesting a complex role in cancer

(Extended Data Fig. 9). NFKBIZ is a transcription factor that is mutated in

recurrent/relapsed DLBCL34 and amplified in primary lymphomas34,35. 3’UTR mutations

accumulated in a hotspot proximal to the stop codon and upstream of conserved miRNA

binding sites which might be affected by these mutations (Fig. 6d). Only two events in these

regions are SNVs, suggesting that mutations in the NFKBIZ 3’UTR are not the consequence

of AID off-target activity. Lymphomas with 3’UTR mutations showed a trend towards

increased mRNA levels of NFKBIZ (P = 0.035; FD = 3.2; the trend remains after correction

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 11: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

11

for copy number, P = 0.03; Fig. 6d). Functional studies will be required to understand the

exact function of 3’UTR mutations on transcript regulation of TOB1 and NFKBIZ.

Mutations in the significant 3’UTR of TMEM107 in lymphoma overlap the U8 small RNA

SNORD118 (Extended Data Fig. 10). Similar to WDR74, overlap with a repetitive RNA

element suggests that these mutations might be caused by mapping artifacts.

Non-coding RNAs. ncRNAs are often part of large gene families with multiple copies in the

genome. Sequence similarity with other genomic regions may therefore complicate their

analysis. Similarly to the U2 RNA upstream of WDR74, we observed high levels of germline

polymorphisms in normal samples in some of the significant ncRNAs and their promoters

(Extended Data Fig. 10). Though interesting candidates, caution should thus be exercised

in interpreting these hits.

The non-coding RNA RMRP is significantly mutated in multiple cancer types, in both its gene

body and promoter (Fig 4c; Fig 6e; Supplementary Table 5). RMRP is the RNA

component of the endoribonuclease RNase MRP, an enzymatically active

ribonucleoprotein36,37,38. Its catalytic function depends on the RNA secondary and tertiary

structure and its interactions with proteins39. Germline mutations in RMRP cause cartilage-

hair hypoplasia, and somatic promoter mutations have been reported to be functional9. In

addition, the RMRP locus is focally amplified in several tumor types, including epithelial

cancer (Extended Data Fig. 9c). SNVs in the gene body (7 in pan-cancer) are significantly

biased towards secondary structure impact (P = 0.011, permutation test)40,41. Three of these

are individually significant (each with P < 0.1, sample level permutation tests), with two

affecting the same position and one located in a tertiary interaction site (Fig. 6e). Of the four

gene-body indels, three are located in or near protein-binding sites (P = 0.08), including a

deletion that is predicted to affect the secondary structure (Extended Data Fig. 9d). Given

RMRP's role in replication of the mitochodrial genome36, we tested whether mutations in this

locus were associated with altered mitochondrial genome copy number. Indeed, mutated

samples showed a trend towards higher mitochondrial copy number (two-sided rank-sum

test, P = 0.1). However, RMRP also appears to be a target of potential artifacts (Extended

Data Fig. 10) and so its relevance to cancer requires further scrutiny.

The microRNA precursor (pre-miRNA) for miR-142 was found significant in Lymph-BNHL,

Lymphatic system and Hematopoietic system (Fig. 4c; Supplementary Table 5). The locus

is a known AID off-target in lymphoma5,42(Extended Data Fig. 10). However, five of seven

mutations (71%) in the dominantly expressed mature miRNA mir-142-p3, where the largest

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 12: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

12

functional impact is expected, were not assigned to AID, raising the possibility that they may

be under selection5.

Eleven additional ncRNA-related elements (RNU6-573P, RPPH1, RNU12, TRAM2-AS1,

G025135, G029190,RP11-92C4.6 and promoters of RNU12, MIR663A, RP11-440L14.1,

and LINC00963) passed our stringent post-filtering, but had limited supporting evidence after

manual inspection. These are discussed in a Supplementary Note. We further checked

whether any of our non-coding candidates were associated with known germline cancer risk

variants, and found no significant associations, with the exception of TERT (Methods).

NEAT1 and MALAT1, two neighboring lncRNAs previously reported as recurrently mutated

in liver43 and breast cancer3, were found significant in multiple cohorts (Fig. 4c), mutated in a

large number of patients (369 for NEAT1 and 158 for MALAT1; Fig 7a). In addition, these

genes are significantly associated with nearby chromosomal breakpoints (NEAT1: P =

5.3x10-9; MALAT1: P = 0.0012; Methods; Extended Data Fig. 11). The pattern of

breakpoints, indels and SNVs scattered throughout their sequence would suggest a possible

tumor suppressor role. To evaluate this hypothesis, we tested whether MALAT1 and NEAT1

mutations are associated to loss of heterozygosity (LOH). Neither gene showed biallelic loss,

in contrast to many known canonical tumor suppressors (Fig. 7b; Extended Data Fig. 11).

Mutations in MALAT1 and NEAT1 also do not exhibit higher cancer allele fractions

compared to mutations in flanking regions, a feature of early driver mutations (Fig 7c;

Extended Data Fig. 11). Furthermore, NEAT1 and MALAT1 mutations are not associated

with altered expression levels, suggesting that they do not affect post-transcriptional stability

nor show increased expression like many mutated oncogenes (Fig 7d; Extended Data Fig.

11).

The high enrichment of indels throughout the gene body of NEAT1 and MALAT1, which

have very high expression levels across many cancer types, resembles the phenomenon

described above for ALB and SFTPB. If these indels are generated by a specific mutational

process, we might expect distinct features from indels found elsewhere in the genome.

Indeed, we find that indels in NEAT1, MALAT1, ALB and SFTPB are strongly enriched in

events longer than 1 bp and particularly in indels of length 2-5 bp, compared to the genomic

background (least significant Fisher’s P = 6.8x10-5, for MALAT1; Fig. 7e). The association of

these indels with highly expressed genes suggests a transcription-coupled mutagenic

mechanism, possibly transcription-replication collisions44–46. A systematic search of genes

with increased rates of 2-5 bp indels reveals that this yet-unknown mutational process

affects other highly expressed genes (Extended Data Fig. 11), some of which were reported

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 13: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

13

in a recent study8, indicating that these indel sizes are a feature of the reported mutational

process. Interestingly, SNVs also occur at higher frequencies in these genes, suggesting

that they may be generated by the same mechanism, although less frequently (Fig. 7f).

Overall, the discovery of a localized mutational signature and the lack of association of the

mutations in MALAT1 and NEAT1 with loss of heterozygosity or higher allele frequencies

suggests that indels in these genes are most likely passenger events. The previously

reported oncogenic phenotypes associated with both ncRNAs43,47,48 may thus be related to

other types of alterations.

Localized lack of power to call mutations

Certain genomic regions, especially those with high GC content, are subject to systematic

low coverage in next-generation sequencing49. We used power analysis to quantify the

number of possibly missed mutations, especially in non-coding regions. These calculations

attempt to account for variations in local sequencing depth, purity and overall ploidy of

individual tumor samples, background mutation rates across cohorts and elements, and the

size of patient cohorts9,50–52. We found that mutation detection sensitivity was high across the

element types, (median d.s. = 0.98) but slightly lower for the typically GC-rich promoters

(median d.s. = 0.96) and 5’UTR elements (0.96; Fig. 8a; Supplementary Table 4,5).

However, for some individual genomic regions the detection sensitivity is dramatically

reduced and it is therefore important to evaluate it for elements of interest. For example, only

43% of tested cases have at least 50% average detection sensitivity across the third exon of

TCF7L2 and 20% in the 5’UTR of AKT1 (Extended Data Fig. 12a). Positional detection

sensitivities for the two canonical TERT promoter hotspot sites were highly variable among

patients and cohorts, ranging from 4% of sufficiently powered (≥90%) patients in CNS-

PiloAstro to 100% of patients in Thy-AdenoCa (Fig. 8b; Extended Data Fig. 12b). This

means that we have nearly no information about possible TERT promoter mutations in CNS-

PiloAstro and incomplete information in other cohorts. Using the detection sensitivity and

observed mutation count at the TERT hotspot sites, we inferred the expected total number of

events in each tumor type (Fig 8c). This analysis revealed that ~216 (CI95 = [188,245])

TERT hotspot mutations were likely missed in this study due to lack of power. Likewise,

about four additional mutations would be expected at the recurrent FOXA1 promoter hotspot

shown to be mutated in hormone-receptor positive breast cancers (Extended Data Fig.

12)9.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 14: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

14

These calculations suggest that a considerable number of potentially impactful somatic

mutations were not detected due to systematic low sequencing coverage, and that specific

positions can be unpowered in elements with overall sufficient coverage.

We further evaluated the discovery power for recurrent events identified in mutational burden

tests in the tumor cohorts analyzed in this study9,50. As expected, discovery power was

highest in cohorts with many patients and low background mutation densities allowing

discovery of typical-sized driver elements mutated in fewer than 1% of patients in the

Carcinoma, Adenocarcinoma and pan-cancer meta-cohorts with 90% power (Fig. 8d). In

contrast, the small Bladder-TCC cohort (n=23) with relatively high background mutation

density (~2.7 mutations/Mb) is only powered to discover drivers that occur in at least 25% of

patients (Fig. 8d). Overall, power differences between element types for individual tumor

cohorts were small, suggesting that lack of power alone cannot explain the observed paucity

of regulatory driver elements.

Relative paucity of non-coding driver point mutations

To further analyze the relative paucity of driver mutations in non-coding elements, we sought

to estimate the overall number of driver mutations, in coding and non-coding regions of

known cancer genes. Given the limited statistical power to detect individually significant

elements, we combined the signal from multiple elements, as recently described53. The

difference between the total number of mutations observed across a combined set of

elements and the number of passenger mutations that is expected by chance, i.e. the

excess of mutations, approximates the total number of driver mutations in the set of

elements. To estimate the expected number of background mutations, we fit the NBR model

to presumed passenger genes, controlling for sequence composition, element size and

regional mutation densities (Methods; Supplementary Table 7; Supplementary Note)53.

We focused on 142 known cancer genes that include the significantly mutated cancer genes

in this study and frequently amplified or deleted cancer genes (Methods). We specifically

excluded TERT from this analysis because of its high frequency of promoter driver mutations

and the incomplete detection sensitivity reported above.

Overall, this approach predicted an excess of 3,133 driver mutations (CI95% [2,987-3,273];

2,258 SNVs and 875 indels) in the protein-coding sequences of these genes across the pan-

cancer meta-cohort (Fig. 8e). In contrast to coding regions, the observed number of

mutations across the combined set of non-coding elements associated with these genes is

very close to the expected number of passenger mutations, , with an excess of 71 (CI95%:22-

133) mutations in promoters, 32 (0-79) in 5’UTRs, and 103 (25-184) in 3’UTRs. These

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 15: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

15

results indicate that coding genes contribute the vast majority (>90%) of driver point

mutations in these 142 cancer genes. Importantly, these estimates are conservative, since

the estimation of the background number of mutations from putative passenger genes may

include yet undetected driver mutations. In addition, non-coding mutations in promoters of

cancer genes were also not generally associated with LOH nor with altered expression, as

one would expect if they were enriched with driver mutations (Supplementary Note).

Altogether, our results suggest that mutations in protein-coding regions dominate the

landscape of driver point mutations in known cancer genes.

Discussion

If we are to fulfill the ambitions of precision medicine, we need a detailed understanding of

the genetic changes that drive each person’s cancer, including those in non-coding regions

(https://doi.org/10.1101/190330). Most cancer genomic studies have focused on protein-

coding genes, leaving non-coding regulatory regions and non-coding RNA genes largely

unexplored. The unprecedented availability of high-quality mutation calls from whole-

genomes of >2,500 patients across 27 cancer types has enabled us to comprehensively

search for functional elements with cancer-driver mutations across the genome. To obtain

reliable results and avoid common pitfalls in detecting drivers54, we have benchmarked a

large number of methods and developed a novel and rigorous statistical strategy for

integrating their results.

Among the most interesting candidate non-coding driver elements identified in this analysis,

we have uncovered promoter or 5’UTR mutations in TP53, RFTN1 and MTG2, a putative

intragenic regulatory element in RNF34 that possibly regulates KDM2B; 3’UTR mutations in

NFKBIZ and TOB1; and recurrent mutations in the non-coding RNA RMRP. We have also

found evidence suggesting that a number of previously reported and frequently mutated

non-coding elements may not be genuine cancer drivers. Particularly, the non-coding RNAs

NEAT1 and MALAT1 are subject to a high density of passenger indels, seemingly due to a

transcription-associated mutational process that targets some of the most highly expressed

genes.

Our study has yielded an unexpectedly low number of non-coding driver candidates. The

results from four analyses - genomic hotspot recurrence, driver element discovery, discovery

power and mutational excess - suggest that the regulatory elements studied here contribute

a much smaller number of recurrent cancer-driving mutations than protein-coding genes.

This contrasts the distribution of germline polymorphisms associated with heritability of

complex traits, which are most frequently located outside of protein-coding genes 55.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 16: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

16

Technical shortcomings, such as severe localized lack of sequence coverage in GC-rich

promoters (as in the case of the TERT promoter), may lead to considerable underestimation

of true drivers in certain regions. The yield of driver events will thus benefit from technical

improvements, including less variable sequence coverage (e.g. PCR-free library

preparation), longer sequencing reads and advanced methods for read mapping and variant

calling that can overcome common artifacts. Moreover, studying larger tumor cohorts will

provide increased power to discover infrequently mutated driver elements.

Our analyses have focused on currently-annotated non-coding regulatory elements and non-

coding genes, comprising 4% of the genome, and exclusively on SNVs and indels. Although

our genome-wide hotspot analysis did not detect novel highly recurrent non-coding mutation

sites, it is possible that additional non-coding drivers reside outside of the regions tested

here. Improvements in the functional annotation of non-coding elements, our understanding

of their tissue-specificity, and the impact of mutations in them, will refine current driver

discovery models and enable more comprehensive screens.

A challenge in the identification of non-coding drivers is distinguishing real novel driver

events from yet-unidentified mutational mechanisms targeting certain genomic regions, such

as the recently described hypermutation of transcription factor binding sites in

melanoma56,57. Better understanding of mutational processes and their activity along the

genome will be critical to improve the sensitivity and specificity of statistical methods for

driver discovery.

One potential explanation for the relative paucity of non-coding drivers is the smaller

functional territory size of many regulatory elements, and hence smaller chance of being

mutated, compared to protein-coding genes (Extended Data Fig.1; Fig. 8d). The presence

of TERT promoter mutations at just two sites suggests that non-coding driver mutations may

be confined to specific positions, similar to the small number of impactful sites in the

oncogenes BRAF, KRAS and IDH1.

SNVs and small indels may not easily alter the function of non-coding regulatory elements.

Directly mutating protein-coding sequences or altering expression levels by copy number

changes, epigenetic changes or repurposing of distal enhancers through genomic

rearrangements58,59 may be more likely to provide large phenotypic effects. While protein-

truncating or frameshift mutations can be created by a single nucleotide change, silencing

regulatory regions of tumor suppressors may depend on large deletions to achieve a similar

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 17: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

17

effect. Although the very high frequency of TERT promoter mutations in cancer

demonstrates that some promoter point mutations can be potent driver events, our analyses

suggest that the number of regulatory sites where a point mutation can lead to major

phenotypic effects may be smaller than anticipated across the genome. This highlights

TERT as an unusual example, perhaps because even a modest increase in the expression

level of TERT may be sufficient for circumventing normal telomere shortening in cancer

cells.

Comprehensive and reliable discovery of non-coding driver mutations in cancer genomes

will be an integral part of cancer research and precision medicine in the coming years. We

anticipate that the approaches developed here will provide a solid foundation for the incipient

era of driver discovery from ever-larger numbers of cancer whole-genome sequences.

Acknowledgements

We thank Kirsten Kübler for assistance with meta cohort generation. We are grateful to

Matthew Meyerson, Eric S. Lander and the PCAWG steering committee for helpful feedback

on the manuscript.

Author contributions

This work was carried out by the Discovery of Driver Elements Subgroup of the PCAWG

Drivers and Functional Interpretation Group based on data from the ICGC/TCGA Pan-

Cancer Analysis of Whole Genomes Network. Contributions spanned a range of categories,

as defined in Supplementary Note. The Driver discovery category includes method

development, driver significance evaluation and results integration. The Candidate vetting

and filtering category covers removal of potential false positive hits. The Case-based

analysis category includes driver potential and performed follow-up studies using additional

evidence. The Power analysis and driver mutations at known cancer genes category

includes power evaluations and estimation of the number of true drivers. The Leadership

and organisational work category includes organization and workgroup leadership. E.R.,

M.M.N., F.A., G.T., H.H., J.M.H., R.I.P., I.M., J.S.P. and G.G. wrote the manuscript. All

authors have had access to read and comment on the manuscript.

Author affiliations 1The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02124, USA 2Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 18: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

18

3Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambridgeshire, UK 4Division of Applied Bioinformatics, German Cancer Research Center, Im Neuenheimer Feld

280, Heidelberg 69120, Germany 5Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science

and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain. 6Research Program on Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain 7Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada 8Centre for Genomic Regulation (CRG), Barcelona 08003, Spain 9Universitat Pompeu Fabra (UPF), Barcelona, Spain 10Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), 69120

Heidelberg, Germany 11Institute of Pharmacy and Molecular Biotechnology, and Bioquant Center, University of

Heidelberg, 69120 Heidelberg, Germany 12Massachusetts General Hospital, Center for Cancer Research, Charlestown,

Massachusetts 02129, USA 13Department of Cell Biology, Harvard Medical School, Boston, MA, USA 14cBio Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, 15Department of Genomic Medicine, the University of Texas MD Anderson Cancer Center,

Houston, Texas, USA 16Graduate Program in Structural and Computational Biology and Molecular Biophysics,

Baylor College of Medicine, Houston, Texas, USA 17Department of Physiology and Biophysics, Weill Cornell Medicine, 1300 York Avenue, New

York, NY, 10065, USA 18Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New

York 10021 USA 19Computational Biology and Bioinformatics, Department of Cell and Molecular Biology,

Uppsala University, Uppsala, Sweden 20Bioinformatics Research Centre (BiRC), Aarhus University, Aarhus, Denmark. 21Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario,

Canada 22Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada 23Institute of Computer Science, Polish Academy of Sciences, Warsaw, 01-248, Poland 24Program in Computational Biology and Bioinformatics, Yale University, New Haven,

Connecticut 06520, USA 25Department of Molecular Biophysics and Biochemistry, Yale University, New Haven,

Connecticut 06520, USA

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 19: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

19

26Genetics and Genome Biology Program, SickKids Research Institute, Toronto, ON,

Canada 27Department of Gene Technology, Tallinn University of Technology, Estonia 28Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala

University, Uppsala, Sweden 29Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical

Sciences, Yokohama, Japan 30Department of Genitourinary Medical Oncology - Research, Division of Cancer Medicine,

The University of Texas MD Anderson Cancer Center, Houston, TX, USA 31European Bioinformatics Institute. European Molecular Biology Laboratory. Wellcome Trust

Genome Campus, Hinxton, Cambridgeshire, UK 32Department of Molecular and Cellular Biology, Baylor College of Medicine, 1 Baylor Plaza

Houston M822, Houston, Texas 77030, USA 33Faculty of Biosciences, Heidelberg University, Im Neuenheimer Feld 234, 69120

Heidelberg, Germany 34Division of Computational Biology, Memorial Sloan Kettering Cancer Center, New York, NY

10065, USA 35Samsung Medical Center, Sungkyunkwan University School of Medicine

South Korea 36Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of

Zurich, CH-8057 Zurich, Switzerland 37Department of Genetics, Stanford University School of Medicine, Stanford, California, USA 38Department of Haematology, University of Cambridge, Cambridge CB2 2XY, UK 39MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts,

USA 40Department of Computer Science, Princeton University, Princeton, New Jersey 08540,

USA 41Weill Cornell Medicine and New York Presbyterian, Department of Pathology and

Laboratory Medicine, New York, NY, USA 42Weill Cornell Medicine and New York Presbyterian, Caryl and Israel Englander Institute for

Precision Medicine, New York, NY, USA 43Weill Cornell Medicine and New York Presbyterian, Sandra and Edward Meyer Cancer

Center, New York, NY, USA 44Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada 45Center for Biomolecular Science and Engineering, University of California at Santa Cruz,

Santa Cruz, CA 95064, USA

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 20: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

20

46Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical

and Dental University, Tokyo, Japan 47Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical

Sciences, Yokohama, Japan 48CREST, Japan Science and Technology Agency, Tokyo, Japan 49Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA 50Department of Medical Oncology, Inselspital, University Hospital and University of Bern,

3010 Bern, Switzerland 51Program in Computational Biology and Bioinformatics, Yale University, New Haven,

Connecticut, USA 52Department of Molecular Biophysics and Biochemistry, Yale University, New Haven,

Connecticut, USA 53Department of Computer Science, Yale University, New Haven, Connecticut, USA 54Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain 55Harvard Medical School, 250 Longwood Avenue, Boston, 02115, MA, USA

MA, USA 56Massachusetts General Hospital, Department of Pathology, Boston, Massachusetts, USA

Figure legends

Figure 1: Mutational hotspots and overview of functional elements and meta-cohorts.

a, Characteristics of top 50 single-site hotspots in PCAWG SNV data. The stacked bar chart

(left) shows the total number of patients mutated across the PCAWG cohorts colored by

mutation type. Gene names are given when hotspots overlap functional elements (colour-

coded). Known somatic driver sites are marked with a hashtag. Amino acid change is

indicated for protein-coding genes. The genomic location (chr:position) is colored by overlap

with center of palindromes (orange) or immunoglobulin loci (brown). The table (right) shows

the distribution of hotspot SNVs in the PCAWG cohorts sorted by number of hotspots.

Tumor-type specific hotspots are indicated with a box. This only includes cohorts with at

least 20 patients, and at least 10 patients or 10% of patients with a SNV. The Lymph-BNHL

and Lymph-CLL cohorts are shown together as Lymphoid malignancies. See Extended

Data Fig. 1 for a complete set of individual and meta cohorts. b, Mutational signature

attributions for mutations in each hotspot site. c, Schematic describing definition of functional

element types from GENCODE and ENCODE annotation resources. Functional elements

(black) are defined based on transcript annotations from various databases. Elements

arising from multiple transcripts with the same gene ID are collapsed, as seen here for the

protein-coding isoforms. Promoter elements are defined as 200 bases upstream and

downstream of the transcription start sites of a gene's transcripts (marked with red). Splice

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 21: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

21

site elements extend 6 and 20 bases from 3' and 5' exonic ends into intronic regions

respectively, as indicated. Regions overlapping protein-coding bases and protein-coding

splice sites are subtracted from other regions, as indicated for the lncRNA promoter element

in gray. d, Organization of meta-cohorts defined by tissue of origin and organ system

(Methods). Pan-cancer contains all cancers excluding Skin-Melanoma and lymphoid

malignancies.

Figure 2: Filtering and integration process for nomination of driver candidates. a,

Overview of driver discovery method and their lines of evidence to evaluate candidate gene

drivers. Methods employing each feature are marked with blue boxes next to the appropriate

track. b, Spearman’s correlation of p-values across the different driver discovery algorithms

based on simulated (null model) mutational data. Dendrogram illustrates relatedness of

method p-values and algorithm approaches marked by colored boxes on dendrogram

leaves. c, P-values are combined with Brown’s method based on the correlation structure

calculated in (b). Individual method (left) and integrated (right) log-transformed p-values are

shown in a heatmap (grey: missing data). d, Post-filtering used several criteria to identify

likely suspicious candidates (shaded grey rectangle). e, Significant driver candidates were

identified after controlling for multiple hypothesis testing based on an FDR q-value threshold

of 0.1 (blue asterisk). Candidates with q-values below 0.25 (blue dash) were also considered

of interest.

Figure 3: Technical and biological confounders used in candidate filtering. a, Read

mappability based on alignability, uniqueness and blacklisted regions. Example shown for

RN7SK, which was removed due to low alignability. b, Site-specific noise evaluated in

normal samples from PCAWG. c, Increased local mutation density caused by AID and UV

mutational processes in lymphoma and melanoma, respectively. d, Palindromic DNA can

expose bases to APOBEC enzymes. f, Number of significant unique hits (left) and unique

elements (right) removed by each filter.

Figure 4: Protein-coding and non-coding driver elements identified in PCAWG

cohorts. a, Number of total hits (elements significant in a certain patient cohort; left) and

number of unique significant elements (right). b, Number of significant elements by type and

by cohort. c, Significant non-coding elements identified in this study. Element types are

indicated with colors as in Fig. 1c.

Figure 5: Novel non-coding promoter and enhancer driver candidates. a, RFTN1

promoter locus (left) and gene expression for mutant (mut.) and wild-type (wt) (right) in

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 22: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

22

Lymph-BNHL tumors. Expression values are colored by copy number status; AID mutations

are highlighted in yellow. b, A mutational hotspot inside RNF34 overlapping a DNAse I site

(left) correlates with gene expression changes of KDM2B in Liver-HCC (right). Coloring of

mutations as in a. c, MTG2 promoter locus (left) and associated gene expression changes in

Carcinoma tumors (right). Coloring of mutations as in a. d, An enhancer associated with

TP53TG1 (Methods) (left) contains mutations mostly attributed to an esophageal cancer-

associated signature.

Figure 6: Novel 3’UTR and non-coding RNA driver candidates. a, Quantification of indel

rates for functional elements associated with significant 3’UTRs. b, Overview of ALB

genomic locus and PCAWG variants illustrates local density of indels. Coloring of enlarged

ALB gene elements corresponds to functional elements in barplots in a. Copy number

changes from 314 Liver-HCC cases are compared to other samples from PCAWG. c,

Genomic locus of TOB1 3’UTR. Note extreme conservation (PhyloP) for the 3’UTR. d,

Genomic locus of NFKBIZ 3’UTR (left) and associated gene expression changes in Lymph-

BNHL (right). e, Genomic locus of the RMRP transcript and promoter region (left), and its

RNA secondary structure, tertiary structure interactions, protein and substrate interactions,

and mutations with their predicted structural impact (right; Extended Data Figure 9d);

lymphoma and melanoma are excluded.

Figure 7: Characterization of a mutational process in NEAT1 and MALAT1. a, Genomic

locus overview showing the distribution of indels and SNVs in NEAT1 and MALAT1. b,

Analysis of enrichment in LOH associated with mutation (“double-hit”) for canonical tumor

suppressors (TSG), oncogenes (OG), and NEAT1/MALAT1. c, Comparison of cancer allelic

fractions within those same genes and within flanking regions (including 2 kbp upstream, 2

kbp downstream and introns). d, Difference in expression between mutated and wild-type

alleles of these genes. e, Percentages of different groups of indel sizes for all protein-coding

and lncRNA genes, ALB, NEAT1, MALAT1 and the set of genes enriched in 2-5 bp indels. f,

SNV and indel rates (total events/bp) in different functional regions of 18 protein-coding

genes enriched in 2-5 bps indels (without ALB, which contributed 47% of indels). Red lines

indicate background indel and SNV rates estimated from all protein-coding genes.

Figure 8: Missed mutations and paucity of non-coding drivers. a, Cumulative

distribution of average detection sensitivity (d.s.) from 60 PCAWG pilot samples in different

functional element types. b, Distribution of TERT promoter hotspot (chr5:1295228; hg19)

detection sensitivity for each each patient, by cohort. Grey dots indicate values for individual

patients inside estimated distribution (areas colored by cohort). Horizontal black bars mark

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 23: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

23

the medians. Numbers above distributions indicate the percentage of patients powered (d.s.

≥90%) in each cohort. See Extended Data Fig. 12b for the second TERT hotspot. c,

Percentage of patients with observed (blue) and inferred missed (red) mutations at the

chr5:1,295,228 and chr5:1,295,250 TERT promoter hotspot sites. Error bars indicate 95%

confidence interval. Numbers above bars show the total inferred number of TERT promoter

mutations for each site in this cohort. Red numbers indicate the absolute number of inferred

missed mutations (due to lack of read coverage). d, Heatmap shows minimal frequency of a

driver element in a cohort that is powered (≥90%) to be discovered. Cell numbers indicate

the number of patients in a given cohort that would need to harbor a mutation in a given

element. For example, the pan-cancer cohort is powered to discover a driver gene (CDS)

present in <1% or 18 patients, while the Bladder-TCC cohort is only powered to discover

drivers present in at least 27% or 6 patients. Power to discover driver elements is dependent

on the background mutation frequency (shown above the heatmap) and element length

(shown to the right). e, The ratio between observed numbers of mutations (SNVs and indels)

in regulatory and coding regions of 142 protein-coding cancer genes. The absolute number

of driver mutations predicted, with CI95% in brackets, is shown above each bar.

Extended Data Fig. 1: Mutational hotspots in additional tumor types. a, Barplot of

positions vs. patients. The stacked barcharts under the barplot show the proportion of

protein-coding (dark grey) and non-coding (light grey) positions, respectively, in each of the

bars in the barplot. b, Distribution of SNVs in top 50 single-site hotspots across all analyzed

individual cohorts and meta-cohorts.

Extended Data Fig. 2: Genomic element statistics. a, Percentage of genomic coverage

for each element type. b, Distribution of element lengths for each element type.

Extended Data Fig. 3: Integration details. a Quantile-quantile plots of p-values reported by

various driver detection algorithms on the three simulated data sets (Broad, DKFZ, and

Sanger; shown for coding regions in the meta-carcinoma cohort) showed no major

enrichment of mutations above the background rate. Results generally followed the expected

null (uniform) distribution, and the p-values reported on simulated data were subsequently

used to assess the covariance of method results. b, Quantile-quantile plots of integrated p-

values using the Brown and Fisher methods for combining p-values across the different

driver detection algorithm results were generated for a few representative tumor cohorts

(shown here for coding regions). Brown combined p-values (light blue) generally followed the

null distribution as expected, while Fisher combined p-values were significantly inflated (dark

blue), confirming that dependencies existed between the results reported by the various

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 24: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

24

driver detection algorithms. To simplify the integration procedure, we calculated covariances

using p-values from the observed data instead of simulated data and found that the

integrated results based on the observed covariances (first column of plots) were essentially

the same as the results obtained using the simulated covariances (second, third, and fourth

columns of plots). c, Triangular heatmaps showing the Spearman correlations of p-values

among the various driver detection methods in observed versus simulated data (coding

regions, meta-carcinoma cohort) are highly similar. Differences in the observed and

simulated correlation values (shown in the far-right heatmaps) were minimal, and thus the

final integration of p-values across methods was performed using covariances estimated on

observed data. d, Integrated p-values based on observed and simulated covariance

estimations (shown on the right, top heatmap, for coding regions in glioblastoma) did not

differ noticeably. In cases where individual methods reported results that yielded

substantially fewer hits than the median across all methods (bottom heatmap, methods in

light grey with results in dashed box), removing the methods from the integration did not

affect the number of significant genes identified (right column of results in bottom heatmap,

shown for coding regions in lung adenocarcinoma).

Extended Data Fig. 4: Sensitivity of driver discovery methods. The performance of

different driver discovery methods and of the integration method on protein-coding genes

was evaluated using known cancer genes. A list of 603 Cancer Gene Census genes (CGC,

v80) was used as a reference gold-standard set of known cancer genes. a-c, For each

method, genes with q-value<0.10 were sorted according to their p-value and the fraction of

CGC genes shown in the y-axis. The total number of significant genes identified by each

method is shown in the x-axis. The integration approach tends to outperform most methods

across cohorts, yielding longer lists of significant genes and more strongly enriched in known

cancer genes. d, Heatmap depicting the number of known cancer genes identified by each

method and by the integration approach in each cohort. The color of each cell reflects the

relative sensitivity of each method in each cohort, measured as the number of cancer genes

detected by a method (also shown as a number inside each cell) divided by the maximum

number detected by any of the methods. Methods are sorted from top to bottom according to

their mean relative sensitivity across datasets.

Extended Data Fig. 5: Sensitivity vs. specificity in individual cohorts vs. meta cohorts

for candidate drivers. q-values for the most significant individual cohort (x-axis) vs. meta

cohort (y-axis) are shown. Driver elements are colored by their element type.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 25: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

25

Extended Data Fig. 6: Protein-coding driver elements identified in PCAWG cohorts.

Significant protein-coding elements identified in this study. Compare to figure 4.

Extended Data Fig. 7: Lack of concordance for non-coding results. Heatmaps depicting

p-values for methods included in integration for Liver-HCC cohorts. a, Most methods agree

on the top protein-coding hits. b, c: Agreement between methods on non-coding hits is far

lower for Liver-HCC promoters (b) and 3’UTRs (c).

Extended Data Fig. 8: Mutation to expression correlation. Expression is compared

between mutated and non-mutated samples. For each element, the z-score of the

expression values for mutated and wild type in the significant cohort is plotted. For copy

numbers, SCNA amplification indicates SCNA > 10, SCNA gain indicates SCNA ≥ 3, SCNA

loss indicates SCNA ≤ 1 and no events indicates SCNA < 3 and SCNA >1. If a patient is

mutated with multiple point mutation types, indels are indicated over SNVs.

Extended Data Fig. 9: Focal amplifications including TOB1, HES1 and RMRP in TCGA

tumors. a, Copy number profiles of 55 of 441 stomach adenocarcinomas from TCGA show

copy number gains around HES1. b, TOB1 and its gene neighbor WIFKKN2 are focally

amplified in cancer. 172 of 10844 total samples from 33 cancer types are shown. c, RMRP

focal amplifications in TCGA cancers (160 of 10844 total tumors shown). d, RMRP

secondary structure (RF00030 in Rfam) labeled with P4 tertiary interactions as well as

protein and substrate interactions (reported as ± 3bp windows). Mutations and their

predicted structural impact are indicated. Lymphoma and melanoma mutations are excluded.

Extended Data Fig. 10: Density of potential germline polymorphisms and somatic

mutations in several candidates. Genomic overviews showing mapping quality masks

from the 1,000 Genomes Project (pilot/strict), and densities of germline SNP and somatic

SNV calls on PCAWG normal and tumor samples, respectively. Excess of germline SNP

calls can be indicative of artifacts, thus an increased number of somatic mutations in such a

region raises concerns about true mutational recurrence. a, WDR74 promoter, U2 RNA

element. b, TMEM107 3’UTR. c, RNU12 small RNA. d, RMRP promoter and ncRNA

transcript. e, RPPH1 ncRNA. f, Genomic overview showing SNVs in mir-142. SNVs

attributed to the AID mutational signature are marked in orange.

Extended Data Fig. 11: Supporting information for NEAT1 and MALAT1. a, Mutation

density in the NEAT1 and MALAT1 genomic loci. Co-localized peaks of densities in the loci

are seen for structural variants, SNVs and indels. b, Levels of expression of the genes

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 26: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

26

enriched in 2-5 bp indels in their respective tissues. Background gene expression levels are

shown to the left for comparison. c, Heatmap showing the levels of expression across

tissues for the genes enriched in 2-5 bp indels. d, The average cancer allelic fraction (CAF)

is compared between each genomic element and the corresponding flanking regions (+/-

2Kb and introns; overlapping coding exons were excluded). The size of the points represent

the number of mutated samples for each particular element. e, The relative rate of loss of

heterozygosity (LOH) is compared between mutated and wild-type samples, coloured by

element type and highlighting significant LOH enrichments with an outside black circle. f,

Expression comparison between mutated and wild-type samples. For each element, the

median log2 fold expression difference of cohorts found significant in the driver discovery is

plotted. If elements were significant in multiple cohorts, the one with the most significant

expression difference is shown. Number of mutated samples refers to tumors for which

RNA-seq data could be obtained. Element types are colored as indicated in the legend.

Elements with a significant expression difference (q < 0.1 or q < 0.01) are indicated by

circles. g, Number samples with structural variants (SV) near or in each element is plotted

against the number of patients with a point mutation. Number of SVs is the sum of event

counts in bins of 50 kb regions that overlap with a given element. Elements with a

significantly higher than expected number of samples with SVs are indicated with black

circles (q < 0.1).

Extended Data Fig. 12: Lack of detection power in specific elements. a, Cumulative

Average d.s. for selected elements across 60 patients. Exon 3 of the colon cancer gene

TCF7L2 is 90% powered in only 14/60 (23%) of patients. The TERT promoter and 5’UTR

element of AKT1 show lack of detection power in the majority of patients. b, Distribution of

TERT promoter hotspot (chr5:1295250; hg19) detection sensitivity for each each patient, by

cohort. Grey dots indicate values for individual patients inside estimated distribution (areas

colored by cohort). Horizontal black bars mark the medians. Numbers above distributions

indicate the percentage of patients powered (d.s. ≥90%) in each cohort.

c, Left: distribution (pink) of detection sensitivity across all breast cancer patients (Breast-

AdenoCa, Breast-LobularCa, Breast-DCIS; grey dots) at the FOXA1 promoter hotspot site

(chr14:38064406; hg19). The number above the distribution plot indicates the percentage of

patients powered (d.s. ≥90%). Black horizontal bar indicates the median of the distribution.

Right: percentage of patients with observed (blue) and inferred missed (red) mutations. Error

bar indicates 95% confidence interval.

Supplementary Tables

Supplementary Table 1: Genomic element type summary statistics

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 27: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

27

Supplementary Table 2: Driver discovery method description

Supplementary Table 3: Driver discovery methods included in integration

Supplementary Table 4: Summary and annotation of protein-coding driver candidates

Supplementary Table 5: Summary and annotation of non-coding driver candidates

Supplementary Table 6: Summary of additional evidence for non-coding driver candidates

Supplementary Table 7: List of cancer genes used in this study

Supplementary Table 8: List of cases used in detection sensitivity analysis

Supplementary Table 9: Impact of covariates on obs/exp ratio

Supplementary Note

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 28: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

28

References

1. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).

2. Fredriksson, N. J., Ny, L., Nilsson, J. A. & Larsson, E. Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nat. Genet. 46, 1258–1263 (2014).

3. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

4. Melton, C., Reuter, J. A., Spacek, D. V. & Snyder, M. Recurrent somatic mutations in regulatory regions of human cancer genomes. Nat. Genet. 47, 710–716 (2015).

5. Puente, X. S. et al. Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature 526, 519–524 (2015).

6. Northcott, P. A. et al. Enhancer hijacking activates GFI1 family oncogenes in medulloblastoma. Nature 511, 428–434 (2014).

7. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165 (2014).

8. Imielinski, M., Guo, G. & Meyerson, M. Insertions and Deletions Target Lineage-Defining Genes in Human Cancers. Cell 168, 460–472.e14 (2017).

9. Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature (2017). doi:10.1038/nature22992

10. Huang, F. W. et al. Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959 (2013).

11. Horn, S. et al. TERT promoter mutations in familial and sporadic melanoma. Science 339, 959–961 (2013).

12. Flavahan, W. A. et al. Insulator dysfunction and oncogene activation in IDH mutant gliomas. Nature 529, 110–114 (2016).

13. Huarte, M. The emerging role of lncRNAs in cancer. Nat. Med. 21, 1253–1261 (2015). 14. Bhan, A., Soleimani, M. & Mandal, S. S. Long Noncoding RNA and Cancer: A New

Paradigm. Cancer Res. 77, 3965–3981 (2017). 15. Lanzós, A. et al. Discovery of Cancer Driver Long Noncoding RNAs across 1112

Tumour Genomes: New Candidates and Distinguishing Features. Sci. Rep. 7, 41544 (2017).

16. Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).

17. Campbell, P. J. et al. Pan-cancer analysis of whole genomes. (2017). doi:10.1101/162784

18. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016).

19. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).

20. Ellis, M. J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature 486, 353–360 (2012).

21. Brown, M. B. 400: A Method for Combining Non-Independent, One-Sided Tests of Significance. Biometrics 31, 987 (1975).

22. Fisher, R. A. Statistical Methods For Research Workers. (Genesis Publishing Pvt Ltd, 1925).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 29: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

29

23. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).

24. Pasqualucci, L. et al. Hypermutation of multiple proto-oncogenes in B-cell diffuse large-cell lymphomas. Nature 412, 341–346 (2001).

25. Forbes, S. A. et al. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res. 38, D652–7 (2010).

26. Saeki, K. The B cell-specific major raft protein, Raftlin, is necessary for the integrity of lipid raft and BCR signal transduction. EMBO J. 22, 3015–3026 (2003).

27. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

28. Kotani, T., Akabane, S., Takeyasu, K., Ueda, T. & Takeuchi, N. Human G-proteins, ObgH1 and Mtg1, associate with the large mitochondrial ribosome subunit and are involved in translation and assembly of respiratory complexes. Nucleic Acids Res. 41, 3713–3722 (2013).

29. Kopan, R. & Ilagan, M. X. G. The canonical Notch signaling pathway: unfolding the activation mechanism. Cell 137, 216–233 (2009).

30. Dulak, A. M. et al. Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity. Nat. Genet. 45, 478–486 (2013).

31. Diaz-Lagares, A. et al. Epigenetic inactivation of the p53-induced long noncoding RNA TP53 target 1 in human cancer. Proc. Natl. Acad. Sci. U. S. A. 113, E7535–E7544 (2016).

32. Li, B.-S. et al. MicroRNA-25 promotes gastric cancer migration, invasion and proliferation by directly targeting transducer of ERBB2, 1 and correlates with poor survival. Oncogene 34, 2556–2565 (2015).

33. Hosoda, N. et al. Anti-proliferative protein Tob negatively regulates CPEB3 target by recruiting Caf1 deadenylase. EMBO J. 30, 1311–1323 (2011).

34. Morin, R. D. et al. Genetic Landscapes of Relapsed and Refractory Diffuse Large B-Cell Lymphomas. Clin. Cancer Res. 22, 2290–2300 (2016).

35. Chapuy, B. et al. Targetable genetic features of primary testicular and primary central nervous system lymphomas. Blood 127, 869–881 (2016).

36. Shadel, G. S. & Clayton, D. A. Mitochondrial DNA maintenance in vertebrates. Annu. Rev. Biochem. 66, 409–435 (1997).

37. Schmitt, M. E. & Clayton, D. A. Nuclear RNase MRP is required for correct processing of pre-5.8S rRNA in Saccharomyces cerevisiae. Mol. Cell. Biol. 13, 7935–7941 (1993).

38. Gill, T., Cai, T., Aulds, J., Wierzbicki, S. & Schmitt, M. E. RNase MRP cleaves the CLB2 mRNA to promote cell cycle progression: novel method of mRNA degradation. Mol. Cell. Biol. 24, 945–953 (2004).

39. Esakova, O. & Krasilnikov, A. S. Of proteins and RNA: the RNase P/MRP family. RNA 16, 1725–1747 (2010).

40. Sabarinathan R, E. al. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. - PubMed - NCBI. Available at: https://www.ncbi.nlm.nih.gov/pubmed/23315997. (Accessed: 7th July 2017)

41. Mularoni, L., Sabarinathan, R., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol. 17, 128 (2016).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 30: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

30

42. Robbiani, D. F. et al. AID produces DNA double-strand breaks in non-Ig genes and mature B cell lymphomas with reciprocal chromosome translocations. Mol. Cell 36, 631–641 (2009).

43. Fujimoto, A. et al. Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509 (2016).

44. Lippert, M. J. et al. Role for topoisomerase 1 in transcription-associated mutagenesis in yeast. Proc. Natl. Acad. Sci. U. S. A. 108, 698–703 (2011).

45. Sankar, T. S., Wastuwidyaningtyas, B. D., Dong, Y., Lewis, S. A. & Wang, J. D. The nature of mutations induced by replication–transcription collisions. Nature 535, 178–181 (2016).

46. Jinks-Robertson, S. & Bhagwat, A. S. Transcription-associated mutagenesis. Annu. Rev. Genet. 48, 341–359 (2014).

47. Ke, H. et al. NEAT1 is Required for Survival of Breast Cancer Cells Through FUS and miR-548. Gene Regul. Syst. Bio. 10, 11–17 (2016).

48. Han, Y., Liu, Y., Nie, L., Gui, Y. & Cai, Z. Inducing Cell Proliferation Inhibition, Apoptosis, and Motility Reduction by Silencing Long Noncoding Ribonucleic Acid Metastasis-associated Lung Adenocarcinoma Transcript 1 in Urothelial Carcinoma of the Bladder. Urology 81, 209.e1–209.e7 (2013).

49. Dabney, J. & Meyer, M. Length and GC-biases during sequencing library amplification: a comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques 52, 87–94 (2012).

50. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).

51. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

52. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).

53. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell (2017). doi:10.1016/j.cell.2017.09.042

54. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).

55. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

56. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016).

57. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).

58. Flavahan, W. A. et al. Insulator dysfunction and oncogene activation in IDH mutant gliomas. Nature 529, 110–114 (2016).

59. Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 31: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

31

Methods

Table of contents:

1. Patient cohorts

2. Mutational hotspot analysis

3. Mutational signatures

4. Genomic element definition

5. Driver discovery methods

6. Simulated data sets

7. Statistical framework for the integration of results from multiple driver discovery

methods

8. Post-filtering of candidates

9. Sensitivity and specificity analysis

10. Gene expression analyses

11. Normalization for copy number variation

12. Mutation to expression association

13. Copy number analyses

14. Power analysis

15. Associations between mutation and signatures of selection: loss of heterozygosity

and cancer allelic fractions

16. Signals of selection in aggregates of non-coding regions of known cancer genes

17. Mutational process and indel enrichment

18. Mutation association with splicing

19. Structural variation analysis

20. RNA structural analysis

21. Cancer associated germline variant distance to non-coding driver candidates

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 32: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

32

1. Patient cohorts

Generation of high-quality tumor set (Loris Mularoni)

We selected a total of 2583 samples to be included in the driver detection analyses. This list

contains all the samples that were not flagged as problematic by the PCAWG-TECH group.

A single aliquot was assigned to each sample; in cases where multiple aliquots were

present, we selected a single aliquot based on the following criteria, in order of importance:

- we prioritized primary tumors over metastatic or recurrent tumors

- we selected aliquots with an OxoG score higher than 40

- we prioritized aliquots with the highest quality (as indicated by the Stars values)

- we prioritized aliquots with RNA-seq data availability

- we prioritized aliquots with the lowest contamination (as indicated by the ContEst values)

- if a selection could not be made after applying the above filters we selected an aliquot

randomly

Selection of tumor cohorts for analysis (Esther Rheinbay)

Individual tumor type cohorts from the high-quality tumor set were selected for analysis if

they met a minimum size. This size was determined based on the cumulative number of

patients, such that no more than 2.5% of total patients were excluded. This led to a minimum

cohort size criterion of 20 patients, and removed the Bone-Cart (9 donors), Bone-Epith (11),

Bone-Osteoblast (5), Breast-DCIS (3), Breast-LobularCa (13), Cervix-AdenoCa (2), Cervix-

SCC (18), CNS-Oligo (18), Lymph-NOS (2), Myeloid-AML (13) and Myeloid-MDS (2)

individual cohorts. Samples from these cohorts were still included in meta-cohort analysis

(see below).

Tumor meta-cohorts (Esther Rheinbay)

Tumor meta-cohorts were assembled for identification of drivers and increase of discovery

power across cell lineages and organ systems. The following meta cohorts were used in

driver analyses:

By cell type of origin:

Epithelial: Carcinoma (comprised of tumor cohorts Bladder-TCC, Biliary-AdenoCa,

Breast-AdenoCa, Breast-LobularCa, Cervix-AdenoCa, ColoRect-AdenoCa, Eso-

AdenoCa, Kidney-ChRCC, Kidney-RCC, Liver-HCC, Lung-AdenoCa, Ovary-

AdenoCa, Panc-AdenoCa, Panc-Endocrine, Prost-AdenoCa, Stomach-AdenoCa,

Thy-AdenoCa, Uterus-AdenoCa,Head-SCC, Cervix-SCC, Lung-SCC),

Adenocarcinoma (Biliary-AdenoCa, Breast-AdenoCa, Breast-LobularCa, Cervix-

AdenoCa, ColoRect-AdenoCa, Eso-AdenoCa, Kidney-ChRCC, Kidney-RCC, Liver-

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 33: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

33

HCC, Lung-AdenoCa, Ovary-AdenoCa, Panc-AdenoCa, Prost-AdenoCa, Stomach-

AdenoCa, Thy-AdenoCa, Uterus-AdenoCa), squamous epithelium (Head-SCC,

Cervix-SCC, Lung-SCC)

Mesenchymal cells/sarcoma (Bone-Cart, Bone-Epith, Bone-Leiomyo, Bone-

Osteosarc)

Glioma (CNS-PiloAstro, CNS-Oligo, CNS-GBM)

Hematopoietic system (Lymph-BNHL, Lymph-CLL, Lymph-NOS, Myeloid-AML,

Myeloid-MDS, Myeloid-MPN)

By organ system:

digestive tract (Liver-HCC, ColoRect-AdenoCa, Panc-AdenoCa, Eso-AdenoCa,

Stomach-AdenoCa, Biliary-AdenoCa), kidney (Kidney-RCC, Kidney-ChRCC), lung

(Lung-AdenoCa, Lung-SCC), lymphatic system (Lymph-BNHL, Lymph-CLL,

Lymph-NOS), myeloid (Myeloid-AML, Myeloid-MDS, Myeloid-MPN), breast (Breast-

AdenoCa, Breast-LobularCa), female_reproductive_system (Breast-AdenoCa,

Breast-LobularCa, Cervix-AdenoCa, Cervix-SCC, Ovary-AdenoCa, Uterus-

AdenoCa), central nervous system (CNS-PiloAstro, CNS-Oligo, CNS-Medullo,

CNS-GBM)

Pan-cancer:

Two “Pan-cancer” cohorts were created: “Pancan-no-skin-melanoma” containing all

tumor types with the exception of Skin-Melanoma to remove issues caused by very high

mutation rate tumors; and “Pancan-no-skin-melanoma-lymph” with the additional removal of

lymphoid tumors (Lymph-BNHL, Lymph-CLL, Lymph-NOS) that have local somatic

hypermutation caused by AID.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 34: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

34

2. Mutational hotspot analysis (Randi Istrup Pedersen)

We selected the top 50 single position hotspots based on the number of patients with an

SNV mutation. The individual positions marked as problematic by the site-specific noise filter

(see below) analysis were excluded.

Each hotspot was defined by its genomic position and annotated by the number of patients

with an SNV mutation in the given hotspot. We also annotated each hotspot with whether it

falled in one of the genomic element types analyzed in the driver discovery. We further

overlapped with loop-regions of palindromes, which are hypothesized to fold into DNA-level

hairpins, and with location in immunoglobulin loci. When a hotspot overlapped a protein-

coding gene we extracted the corresponding amino acids changes from Oncotator1

(http://portals.broadinstitute.org/oncotator/).

We identified known driver hotspots, by overlap with the somatic driver positions compiled in

the Cancer Genome Interpreter repository

(https://www.cancergenomeinterpreter.org/mutations), which among others include

mutations from ClinVar, DoCM and the literature (ref: https://doi.org/10.1101/140475).

For each hotspot we calculated the proportion of mutations in the defined cohorts and meta-

cohorts. Only cohorts with at least 20 patients, and at least 10 patients or 10% of patients

with an SNV were included in Fig. 1a (for the distribution in all cohorts and meta-cohorts see

Extended Data Fig. 1). Lymph-BNHL and Lymph-CLL were shown together as Lymphoid

malignancies.

Based on mutational signature analysis of all the cancer samples, we extracted the posterior

probability that each hotspot mutation from a given patient was generated by one of 37

identified mutational signatures. In lymphoid malignancies somatic hyper-mutations

generated by AID come in clusters along the genome. Posterior probabilities for the ten

signatures relevant for the lymph cohorts were therefore derived from models that consider

the correlation of AID mutations along the genome. For each hotspot the collected posterior

probabilities were averaged.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 35: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

35

3. Mutational Signatures (Jaegil Kim)

We performed a de novo global signature discovery to identify mutational signatures

operating in PCAWG WGS cohort (N = 2,583). All single nucleotide variants (SNVs) were

stratified into M = 4608 mutation channels according to the combination of six base

substitutions at pyrimidine bases (C>A/G/T and T>A/C/G) in penta-nucleotide sequence

contexts and a transcription strand direction, transcribed (-), non-transcribed (+), non-coding

regions. All insertions and deletions were classified into additional 20 channels depending on

the number of inserted or deleted bases and any indels beyond 9 bases were grouped into a

single channel. The resulting mutation frequency matrix X (4,628 channels across 2,583

samples) were ingested to the Broad Institute’s signature analysis pipeline,

SignatureAnalyzer, to determine the optimal number of signatures (K) and the signature

profiles by de-convoluting X into a product of two non-negative matrices as X ~ WH, W (M ×

K) and H (K × N) being a signature-loading and an activity-loading matrix, respectively. The

most distinguished feature of SignatureAnalyzer is that it suggests an optimal number of

signatures best explaining X at the balance between the error measure and the model

complexity exploiting Bayesian non-negative matrix algorithm (NMF)2,3,4. Our de-novo

signature discovery identified 37 signatures including a split of 4 UV-related signatures, 3

APOBEC-related signatures, 5 POLE-related signatures, 3 MSI-related signatures, 2

COSMIC17 signatures, and other 21 singleton signatures (see PCAWG7 paper for details).

Although the matrix H contains the most representative activity across samples it also has

spurious activity assignments intrinsic to the global signature discovery (i.e., applying

SignatureAnalyzer on samples with different histological subtypes and mutational

processes). To minimize this contamination and interference we re-assigned activity

separately in each histological subtype using a more refined projection approach. We first

identified a subset of ultra-mutant samples with a dominant activity of POLE (n = 8), MSI (n =

18), and alkylating signatures (n = 1) from the original H, which are usually very exclusive to

samples with a specific DNA repair or replication defect, or a treatment. To prevent a false-

positive assignment of these signatures in remaining samples we introduced a binary matrix

Z (37 × N), where the row elements corresponding to POLE, MSI, and alkylating signatures,

are set to zero except for samples with a dominant activity, while all other elements are set

to one. More specifically, the projection was done by keeping the signature-loading matrix W'

(M × 37) frozen (W' represents the normalized signature profiles of 37 global signatures),

while allowing SignatureAnalyzer to determine a subset of signatures and infer the activity-

loading matrix H' (37 × N') that best approximates the mutation frequency matrix X' (M × N'),

where N' represents the number of samples in each histological subtype, as X' ~ W' (Z'¤H'),

Here “¤” denotes element-wise multiplications.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 36: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

36

4. Definition of genomic elements (Morten Muhlig Nielsen; Nicholas Sinnott-Armstrong)

For coding elements and elements pertaining to protein coding genes (3’UTR, 5’UTR and

protein coding promoters) regions were defined based on GENCODE annotations (v.19)5:

Coding elements (CDS): The set of coding bases collapsed across all coding transcripts

with a given GENCODE gene ID.

Protein coding splice site elements (pcSS): Intronic regions extending 6 bases from

donor splice sites and 20 bases from acceptor splice sites were collected for all coding

transcripts. Bases were collapsed across all coding transcripts with a given GENCODE gene

ID. The global set of CDS bases were subtracted.

5’UTR elements (5UTR): The set of 5’UTR bases collapsed across all coding transcripts

with a given GENCODE gene ID. The global set of CDS and pcSS bases were subtracted.

3’UTR elements (3UTR): The set of 3’UTR bases collapsed across all coding transcripts

with a given GENCODE gene ID. The global set of CDS, pcSS and 5UTR bases were

subtracted.

Protein coding promoter elements (pcPROM): Regions extending 200 bases in both

directions from all protein coding transcripts’ transcription start sites (5’ ends). Bases were

collapsed across all coding transcripts with a given GENCODE gene ID. The global set of

CDS and pcSS bases were subtracted.

lncRNA elements: lncRNA transcripts were defined based on annotations from GENCODE

(v.19) and MiTranscriptome (v.2)6 not overlapping GENCODE. Transcripts were included if

fulfilling criteria 1-5 and 6 or 7 below:

1) No sense overlap to protein coding gene regions

2) More than 5kb away from protein coding genes on sense strand.

3) Longer than 200 bases.

4) Not annotated as the following biotypes: immunoglobulin, T-cell receptor, Mt_rRNA,

Mt_tRNA, miRNA, misc_RNA, rRNA, scRNA, snRNA, snoRNA, ribozyme, sRNA or

scaRNA.

5) Not overlapping genomic regions aligning back to the human genome (self chained

regions).

6) More than 20% of bases overlap conserved elements (except if annotated as

pseudogene)

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 37: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

37

7) Expressed in more than 10% of PCAWG samples with RNAseq data.

Genes corresponding to the selected transcripts were supplemented with a set of known

functional lncRNA genes from the literature in addition to GENCODE annotated non-coding

snoRNA and miRNA host genes. The elements were made by collapsing bases across

transcripts with given gene ID. The global set of CDS, pc_SS, 5UTR, 3UTR, pcPROM and

lncRNA_SS bases were subtracted.

lncRNA splice site elements (lncRNASS): Intronic regions extending 6 bases from donor

splice sites and 20 bases from acceptor splice sites were collected for all lncRNA transcripts.

Bases were collapsed across all lncRNA transcripts with a given gene ID. The global set of

CDS, pc_SS, 5UTR, 3UTR and pcPROM bases were subtracted.

lncRNA promoter elements: Regions extending 200 bases in both directions from all

lncRNA transcripts’ transcription start sites (5’ends). Bases were collapsed across all

lncRNA transcripts with a given gene ID. The global set of CDS, pc_SS, 5UTR, 3UTR,

pcPROM and lncRNA_SS bases were subtracted.

Short RNA elements: Short RNA transcripts were defined based on annotations from

databases Rfam (v.11)7, tRNAscanSE (v.2.0)8 and snoRNAdb (v.3)9 in addition to

GENCODE transcripts with biotype annotations mt_rRNA, mt_tRNA, misc_RNA, rRNA and

snoRNA.

Bases were collapsed across all smallRNA transcripts with a given gene ID. The global set

of CDS, pc_SS, 5UTR, 3UTR and pcPROM bases were subtracted.

microRNA elements: Mature miRNAs were defined based on mirBase (v.20)10 and a set of

potential novel miRNAs11

Enhancers: Contiguous 15-state ChromHMM called enhancers correlated between

H3K4me1 and RNA-seq across 57 human tissues were downloaded from Roadmap

Epigenomics Consortium extended data12. Associated links, defined by co-occurring activity

in a given cell type, were merged across cell types at FDR = 0.1. HoneyBadger213 p10 calls

for all DNase I sites were filtered to peaks with signal strength 0.8 or greater and intersected

with enhancer elements. The union of all DNase I peaks which overlapped with a given

element, with all CDS regions filtered out, were used as the input to driver detection.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 38: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

38

5. Candidate driver identification methods

A summary of approaches used by each method is listed in Supplementary Table 2.

ActiveDriverWGS (Juri Reimand)

Driver analysis with ActiveDriverWGS (ref) was performed after discarding hypermutated

samples (>90,000 mutations) from the PCAWG cancer cohort.To avoid leakage of signals

from known cancer drivers, we removed missense mutations in analyses of non-coding

regions. ActiveDriverWGS is a local mutation enrichment method for genome-wide discovery

of cancer driver mutations with increased mutation burden of single nucleotide variants

(SNVs) and indels. ActiveDriverWGS performs a model-based test whether a given genomic

element is significantly more mutated than adjacent background genomic sequence (+/-

10kb and introns). Statistical significance of mutations is computed with a Poisson-linked

generalised linear regression model. The null model treats all SNVs with trinucleotide

context as cofactor, while indels are modelled with a separate cofactor for all nucleotides.

Mutation counts per nucleotide are presented as the response variable The alternative

model tests whether the element has different mutation burden than the background

sequence. The null and alternative models are compared with chi-square tests and

confidence intervals of expected mutations were derived from the null model using

resampling. If the confidence intervals indicated significant excess of mutation in the

background and depletion in the element of interest, we inverted corresponding small p-

values (p=1-p if p<0.5). Elements with no mutations were automatically assigned p=1.

CompositeDriver0.2 (Eric Minwei Liu, Ekta Khurana)

We have developed CompositeDriver (ref) – a computational method that combines signals

of mutation recurrence and the functional impact score derived from FunSeq2 scheme14 to

identify coding and non-coding elements under positive selection. CompositeDriver assigns

a score to each region of interest (i.e., CDS, promoter, UTR, enhancer or ncRNA) through

summation of positional mutation recurrence multiplied by the functional impact score for all

mutations within the region. A null CompositeDriver score distribution is built to calculate the

p-value for a region of interest. Mutations in the same element type but outside the region of

interest are defined as background mutations. To build the null distribution, the same

numbers of mutated positions are repeatedly drawn (default is 105 times) from background

mutations with similar replication timing and similar mutation context15. By drawing random

mutations from the same element type, CompositeDriver incorporates DNase I

hypersensitive sites and histone modification marks as covariates into the null model16.

Finally, the Benjamini-Hochberg method is used for multiple hypothesis correction17.

dNdScv (Inigo Martincorena)

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 39: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

39

dNdScv is a maximum-likelihood algorithm designed to test for positive or negative selection

in cancer genomes or other sparse resequencing studies. dNdScv models somatic

mutations in a given gene as a Poisson process, accounting for sequence composition and

mutational signatures using 192 trinucleotide substitution rates. Mutation rates are also

known to vary across genes, often co-varying with functional features of the human genome,

such as replication time and chromatin state. This information is exploited by dNdScv to

refine the estimates of the background mutation rate of each gene, using a negative

binomial regression. This regression removes known sources of variation of the mutation

rates and models the remaining unexplained variation of the mutation rate across genes as

being Gamma distributed, which protects the method against overconfidence in the

estimated background mutation rate for a gene. Overall, the local mutation rate for a gene is

estimated accounting for mutational signatures in the samples analysed, the sequence

composition of a gene in a trinucleotide context, 20 epigenomic covariates and the local

number of synonymous mutations in the gene. Inferences on selection are carried out

separately for missense substitutions, truncating substitutions (nonsense and essential

splice site mutations) and indels, and then combined into a global P-value per gene. dNdScv

has been described in much greater detail elsewhere18.

DriverPower (Shimin Shuai)

DriverPower is a combined burden and functional impact test for coding and non-coding

cancer driver elements. In the DriverPower framework, randomized non-coding genome

elements are used as training set. In total 1373 reference features covering nucleotide

compositions, conservation, replication timing, expression levels, epigenomic marks and

compartments are collected for downstream modelling. For the modelling, a feature selection

step by randomized Lasso is performed at first. Then the expected background mutation rate

is estimated with selected highly important features by binomial generalized linear model.

The predicted mutation rate is further calibrated with functional impact scores measured by

CADD and Eigen scores. Finally, a p-value is generated for each test element by binomial

test with the alternative hypothesis that the observed mutation rate is higher than the

adjusted mutation rate.

ExInAtor (Andres Lanzos; Rory Johnson)

ExInAtor was specifically created for prediction of cancer driver lncRNAs, but is agnostic to

gene type and can also be used for protein-coding genes. The exons of each gene are

identified and collapsed across transcript isoforms. For each gene, the trinucleotide content

of the exonic region is calculated. The remaining intronic regions, along with 10 kb of

sequence upstream and downstream, are defined as the background region. From this

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 40: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

40

background, a new background region is created by randomly sampling the maximum

number of nucleotides, such that the trinucleotide content exactly matches that of the exonic

region. Next, the number of mutations in the exonic and sampled background regions are

compared by hypergeometric test. Genes with elevated exonic mutational density are

considered candidate driver genes. ExInAtor was used with randomisation seed of 256.

Otherwise ExInAtor was run exactly as described in Lanzós et al, 201719.

LARVA (Jing Zhang; Lucas Lochovsky)

LARVA20, or Large-scale Analysis of Recurrent Variants in noncoding Annotations, is a

computational method that detects significantly elevated somatic mutation burdens in

genomic elements—both coding and non-coding—to identify putative cancer-driving

elements. Given a cancer cohort variant call set, and a list of genomic elements, LARVA

models the expected background somatic mutation rate by fitting a beta-binomial distribution

to the elements' variant counts. This model properly accounts for the high mutation rate

variability seen throughout the genome, which improves over some previous models'

assumption of a constant mutation rate. LARVA's model also incorporates the influence of

mutation rate covariates, such as DNA replication timing. LARVA's output lists each genomic

element from the input, along with a p-value based on the deviation of the element's

observed variant count from the expected variant count under LARVA's model.

MutSig (Julian Hess, Esther Rheinbay)

The MutSig suite21 classifies whether genomics features, both coding and non-coding, are

highly mutated relative to a predicted background mutation rate (BMR), which varies on a

macroscopic-level across patients (patient-specific mutation rates can span orders of

magnitude across pan-cancer cohorts) and genes (known covariates such as replication

timing are strongly correlated with mutation rate) and on a microscopic level across

sequence contexts (since mutational signatures are heterogeneous across a cohort and

highly context-dependent). MutSig accounts for all three of these to compute the joint BMR

distribution across genes/patients/contexts, and then convolves across the latter two

dimensions to estimate the expected distribution of total background burden for a given gene

across a whole cohort. Genes are then scored by how their total non-background burden

exceeds this null distribution.

MutSig estimates a gene’s BMR by its synonymous mutation rate for coding genes, and by

its mutation rate at nonconserved positions for non-coding genes. If the number of

background mutations in a given gene is insufficient to provide a confident estimate of its

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 41: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

41

BMR, MutSig will incorporate the background counts from other genes with similar covariate

profiles into its estimator.

MutSig (MutSig2CV) was originally designed for coding regions only21. Modifications to this

version of the algorithm to run on non-coding regions are novel to PCAWG. Coding MutSig

also incorporates per-gene functional impact and clustering tests, which were not run on

non-coding regions.

NBR (Inigo Martincorena)

NBR is a method that test for evidence of higher mutation density than expected by chance

in a given region of the genome, while accounting for trinucleotide mutational signatures,

sequence composition and the local density of mutations around each element. This method

has been described in detail in a previous publication22, where it was used to identify

candidate driver noncoding elements across 560 breast cancer whole-genomes.

Based on some of the features of dNdScv, NBR involves two main steps. First, all mutations

across all elements tested are used to obtain maximum-likelihood estimates for the 192 rate

parameters (rj) describing each of the possible trinucleotide substitutions in a strand-specific

manner. rj = nj/Lj , where nj is the total number of mutations observed across samples of a

given trinucleotide class (j) and Lj is the number of available sites for each trinucleotide.

These rates are used to estimate the total number of mutations across samples expected

under neutrality in each element considering the mutational signatures active in the cohort

and the sequence of the elements (Eh = Sj rjLj,h). This estimate assumes no variation of the

mutation rate across elements in the genome. Second, a negative binomial regression is

used to refine this estimate of the background mutation rate of an element, using covariates

and Eh as an offset. In this study, the local density of somatic mutations (normalized by

sequence composition) was used as a covariate, using a window around the element of a

variable size across cohorts to ensure sufficient numbers of mutations in each window

around each element and excluding coding sequences and previously identified candidate

noncoding driver regions. Replication time and average gene expression level for 100 kb

genomic bins were also used as covariates. The negative binomial regression models

mutation counts as Poisson-distributed within an element with mutation rates varying across

elements according to a Gamma distribution. As in dNdScv, this provides a refined estimate

of the background mutation rate for each element (Eh*) as well as a data-driven measure of

uncertainty around this estimate (-q- the overdispersion parameter of the negative binomial

regression). P-values for each element are calculated using a cumulative negative binomial

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 42: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

42

distribution with the mean (Eh*) and dispersion (q) parameters estimated by the negative

binomial regression.

To protect against neutral indel hotspots or indel artefacts, unique indel sites rather than total

indels per element were used. To protect against misannotation of a mutation clusters as

sets of independent events, a maximum of two mutations per region and per sample were

considered in the analysis.

ncdDetect (Malene Juul)

ncdDetect23 is a driver detection method tailored for the non-coding part of the genome. It

uses a burden based approach, in which the frequency of mutations is considered, to reveal

signs of recurrent positive selection across cancer genomes. For each candidate region, the

observed mutation frequency is compared to a sample- and position-specific background

mutation rate. A scoring scheme is applied to further account for functional impact in the

significance evaluation of a candidate cancer driver element. In the present application, the

scoring scheme is defined as log-likelihoods, i.e. minus the natural logarithm of the sample-

and position-specific probabilities of mutation.

The position- and sample-specific probabilities of mutation used by ncdDetect are obtained

by a statistical null model, inferred from somatic mutation calls of a collection of cancer

samples (https://doi.org/10.1101/122879). The model includes a set of genomic annotations,

known to correlate with the mutation rate in cancer. These are replication timing,

trinucleotides (the nucleotide under consideration and its left and right flanking bases),

genomic segment (a variable segmenting the genome into regulatory element types) and a

position-specific measure of the local mutation rate (a weighted average of the mutation rate,

calculated across samples in a 40 kb window flanking each specific position plus/minus 10

kb).

ncDriver (Henrik Hornshøj)

The ncDriver method (doi: https://doi.org/10.1101/182642) provides separate evaluations

of the significance for two functional mutation properties, the level of conservation and the

level of cancer type specificity. In the ncDriverConservation test, the conservation level of

mutated positions were evaluated locally for being surprisingly high, given the distribution of

conservation within the element. The p-value of the mean mutation phyloP conservation

score for an element was obtained by Monte Carlo simulation of 100,000 mean phyloP

scores based on the same number of mutations. Each mutated element was also evaluated

globally by looking up the rank of the element mean phyloP conservation score among all

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 43: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

43

elements annotated as the same type. This provided p-values for both local and global

mutation conservation level, which were combined into a single conservation p-value using

Fisher’s method. In the ncDriverCancerType test, the distribution of observed mutation

counts of an element across the cancer types were evaluated for being surprising compared

to expected counts estimated from a background null model (as described for the ncdDetect

method) that accounts for cancer type specific mutation signatures and other co-variates. A

Goodness-of-fit test with Monte Carlo simulation was used to determine whether the

distribution of observed mutation counts across cancer types within the element is surprising

given the expected mutation counts based on cancer types, mutation contexts and element

type. For indels, the expected mutation counts were estimated solely from the mutation rates

calculated from the mutation context, cancer type and element type.

OncodriveFML (Loris Mularoni)

OncodriveFML24 is a method designed to estimate the accumulated functional impact bias of

tumor somatic mutations in genomic regions of interest, both coding and non-coding, based

on a local simulation of the mutational process affecting it. The rationale behind

OncodriveFML is that the observation of somatic mutations on a genomic element across

tumors, whose average impact score is significantly greater than expected for said element

constitutes a signal that these mutations have undergone positive selection during

tumorigenesis. This, in turn is considered as a direct indication that this element drives

tumorigenesis.

OncodriveFML first computes the average functional impact score of the observed mutations

in the element of interest. The functional impact scores of mutations have been calculated

using both CADD25 (coding and non-coding regions) and VEST326 (only coding regions).

Then the method randomly samples the same number of observed mutations following the

probability of mutation of different tri-nucleotides, computed from the mutations observed in

each cohort. The randomization step is repeated many times (1,000,000 in these analyses)

and each time an average functional impact score is calculated. Finally, OncodriveFML

derives an empirical p-value for each element by comparing the average functional impact

score observed in the element to its local expected average functional impact score resulting

from the random sampling. The empirical p-values are then corrected for false discovery rate

and genomic elements that after the correction are still significant are considered candidate

drivers.

regDriver (Husen M. Umer)

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 44: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

44

regDriver assesses the significance of mutations affecting transcription factor motifs using

tissue-specific functional annotations (https://doi.org/10.1002/humu.23014). For each tumor

cohort, functional annotations from the cell lines most similar to the respective tumor type

are gathered. A functionality score is computed for each mutation based on its overlapping

functional annotations. regDriver, collects highly-scored mutations in each of the defined

elements and assesses the elements’ significance by comparing its accumulative score to a

background score distribution obtained from the simulated sets. Therefore, only candidate

regulatory mutations are considered in evaluating mutation enrichment per element.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 45: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

45

6. Simulated datasets

Broad simulations (Yosef Maruvka, Gad Getz)

Due to their differing context characteristics, we simulated SNVs and indels with different

approaches. For SNVs, we divided the genome into 50kb regions. For each region, we

counted the number of mutations in it across all the PCAWG patients and divided it by the

total number of mutations. Every mutation was randomly assigned into a new region based

on the region's rate. The position inside the region was chosen to maintain the trinucleotide

context of each mutation (the 5’ and 3’ nearest neighbors and the mutated position itself)

and the alternate allele. In addition, for every base we counted how many times it was

covered sufficiently, in 401 tumor and normal WGS pairs, in order to enable calling of a

mutation27. The fraction of patients with enough coverage at a given site was used as the

position’s probability for being mutated inside the new current region.

For indels, a new, randomized position was chosen in a region of 50,000 bases around the

indel. The position of the new indel was chosen to match the indel 5’ and 3’ neighboring

reference bases. For insertions, the inserted motif was the same as the original insertion, but

for deletions only the length of the indels was kept but not the exact sequence.

DKFZ simulations (Carl Hermann, Calvin Chan)

This simulation utilizes the SNV calls to perform a localised randomisation. The original SNV

entries which do not map to chromosome 1-22, X, Y are first filtered and excluded from

randomization. All SNVs located in the protein coding regions (CDS) corresponding to

GENCODE19 definition are erased before performing randomisation. The trinucleotide

centered at each SNV position is determined and an identical trinucleotide is randomly

sampled within the 50kb window. In case of insertion, instead of the mutated trinucleotide,

the neighboring nucleotide of the insertion site is scanned within the randomisation window.

For deletion and multi-nucleotides variants, the altered sequence is scanned within the

randomization window with a ranked probability assigned for each position. The randomised

sample is then selected from the top 100 matched positions with scaled probability.

Sanger simulations (Inigo Martincorena)

This simulation aimed to generate datasets of neutral somatic mutations that retain key

sources of variation in mutation rates known to exist in cancer genomes, including

mutational signatures, and variable mutation rates across the genome and among

individuals and cancer types. To do so while minimizing the number of assumptions in the

simulation, we used a simple local randomization approach. First, all coding mutations as

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 46: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

46

well as mutations in the TERT promoter, MALAT1 or NEAT1 were excluded. Second, each

mutation in each patient was randomly moved to an identical trinucleotide within a 50 kb

window, while retaining the patient ID. Third, mutations falling within 50 bp of their original

position were filtered out. This simple randomization retains the variation of the mutation rate

and mutational signatures across large regions of the genome, across individuals and across

cancer types.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 47: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

47

7. Statistical framework for the integration of results from multiple driver discovery

methods (Grace Tiao, Gad Getz)

The classical approach for combining p-values obtained from independent tests of a given

null hypothesis was described by R. A. Fisher in 1948. He noted that for a set of k p-values,

the sum X of the transformed p-values, where

X = -2 Σki=1 ln(pi)

and pi is the p-value for the ith test, follows a chi-square distribution with 2k degrees of

freedom28. Thus, to obtain a single combined p-value for a set of independent tests, the new

test statistic X is computed from the p-values obtained from the tests and scored against a

chi-square distribution with 2k degrees of freedom. Fisher’s test is asymptotically optimal

among all methods of combining independent tests29; however, in cases where tests exhibit

dependence, the Fisher combined p-value is generally too small (anti-conservative).

In this study, we combine p-values from several driver detection methods, many of

which share similar approaches and whose results are therefore not independent. To

address this issue, we used an extension of the Fisher method developed by Morten Brown

for cases in which there is dependence among a set of tests29. Using the same test statistic,

renamed Ψ to indicate the difference in the independence assumption, Brown observed that

if Ψ were assumed to have a scaled chi-square distribution – i.e.,

ψ ~ c X22f

then

f = E[ψ]2/var(ψ) and c = var(ψ)/ 2E[ψ]

Note that E[Ψ] = 2k irrespective of the independence requirement, and that

var(ψ) = 4k + 2 Σi<j cov(-2 lnpi, -2 lnpj)

Thus when the pi are independent, var(ψ) = 4k, which gives f = k and c = 1, and the test

statistic follows the chi-square distribution with 2k degrees of freedom described by Fisher.

However, when the independence condition is relaxed, var(ψ) ≠ 4k, and the test statistic

generally follows a different, scaled, chi-square distribution whose scaling parameter c and

degrees of freedom 2f are determined by the covariances of the pi’s. The covariances can

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 48: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

48

be computed via numerical integration over the joint distributions of all pi and pj pairs, but

this requires knowledge of the joint distribution; and even in cases where the joint distribution

is known, the integration may not be computationally feasible for large and complex

datasets30.

In this study, following the example of Poole et. al31, we computed the empirical

covariance of pi and pj, using the samples wi and wj , where wi is the set of all reported p-

values for method i, and used the empirical covariance to approximate the Brown scaled chi-

square distribution. The advantage to this approach is that the empirical covariance

estimation is non-parametric – it does not assume an underlying joint distribution of pi and pj

– and is thus applicable to complex and interrelated biological datasets where data is noisy

and not regularly Gaussian. Poole et. al showed that the empirical covariance estimation

approach is accurate, robust, and efficient for such datasets.

Implementing and evaluating the integration method on simulated and observed data

To evaluate the efficacy of the empirical Brown’s method of dependent p-value integration,

we generated three sets of simulated mutation data (see above) and ran the driver detection

algorithms on each of the simulated datasets. We checked that the p-value results from the

various driver detection algorithms followed the expected null (uniform) distribution

(Extended Data Fig. 3a). Then, for each simulated data set, we calculated the empirical

covariance for each pair of driver algorithm results. We then used these covariance values

over simulated datasets to compute the combined Brown p-values on observed data: for

each gene in the observed PCAWG somatic mutation dataset, we computed the Brown test

statistic from the set of p-values reported by the various driver detection algorithms. The

Brown test statistic was then evaluated against the appropriate chi-square distribution,

whose scale and degree parameters were approximated by the covariance values calculated

on the simulated data (see above).

We ran this procedure, as well as the Fisher method, for six representative tumor-

type cohorts (Breast-AdenoCa, CNS-GBM, ColoRect-AdenoCa, Lung-AdenoCa, Uterus-

AdenoCa, and meta-Carcinoma) and found that the Brown combined p-values generally

followed the null distribution as expected (Extended Data Fig. 3b). The Fisher combined p-

values were significantly inflated (Extended Data Fig. 3b), confirming that dependencies

existed between the results reported by the various driver detection algorithms.

We next explored whether it was possible to reduce the number of algorithm runs

required to complete these calculations for all tumor-type cohorts by computing the

covariance values on observed data instead of simulated data. In each of the six

representative tumor-type cohorts, we calculated the empirical covariances on the observed

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 49: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

49

data only and then computed the integrated Brown p-values on the observed data using the

observed covariances. Significant genes identified using only observed covariances

remained mostly unchanged from the significant genes identified using the simulated

covariances (Extended Data Fig. 3d), and examination of the differences in the covariance

values between the simulated estimations and the observed estimations revealed only minor

differences in values (Extended Data Fig. 3c). The significant drivers presented in this study

were identified using this final approach – e.g., by computing integrated Brown p-values

using estimations of covariance on observed data only.

Integration of p-values from observed data was performed for 42 tumor-type cohorts

and 13 target element types. Methods were selected for each given data set (see “Selecting

methods to include in the integration of observed p-values”, below) and raw p-values smaller

than 10-16 were trimmed to that value before proceeding with the integration. Methods with

missing data for a given element (i.e., ones that failed to report a p-value for a given

element) were excluded from the calculation for that element, and therefore in some cases

the integrated Brown p-value was computed from p-values reported by only a subset of all

the driver detection algorithms contributing results for that data set.

Selecting methods to include in the integration of observed p-values

In some cases, individual driver detection algorithms reported p-values for a given data set

that deviated strongly from the expected uniform null distribution. These were methods for

which the quantile-quantile (QQ) plots demonstrated considerable inflation. We removed

results that reported an unusual number of significant hits by calculating, for each set of

results, the number of significant elements found by each individual method using the

Benjamini-Hochberg FDR with q<0.1 as the significance threshold. Any single method that

reported four times the median number of significant elements identified by individual

methods was discarded from the integration. In a separate analysis, we found that removing

methods that yielded fewer hits than the median (i.e., methods with deflated QQ-plots) did

not affect the number of significant genes identified through the integration of the reported p-

values (Extended Data Fig. 3d); hence we did not remove such methods.

8. Post-filtering of candidates (Esther Rheinbay, Morten Muhlig Nielsen, Lars Feuerbach,

Henrik Tobias Madsen)

Post-filtering of significant hits was performed to remove those with accumulation of

mutations caused by sequencing problems or mutational processes. In particular, we applied

the following: (i) at least three mutations are present in the element, (ii) mutations are

present in at least three patients of the tested cohort, (iii) less than 50% of mutations are

located in palindromic DNA sequence22, (iv) more than 50% of mutations are located in

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 50: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

50

mappable genomic regions (CRG alignability, DAC blacklisted regions and DUKE

uniqueness32; (v), less than 50% of mutations occur near indels, (vi) a site-specific noise

filter (see below); and (vii) manual review of sequence evidence for novel drivers. For

lymphomas, which contain regions of somatic hypermutation caused by AID enzyme activity,

we (viii) further required less than 50% of mutations contributed by this process; and for

Skin-melanoma, we (ix) excluded mutations occurring in the extended (CTTCCG) context

that contributes to promoter hotspot mutations in this tumor type33,34,35. Even after this motif-

based filter, a large number of promoter and 5’UTR regions remained in the list of

candidates. We, therefore, marked these as likely due to failed repair in TF occupied sites

since it is unclear how many of them are true or false drivers. Additionally, elements that

failed manual mutation review were filtered out at this stage.

Site-specific noise filter

Genomic positions of mutations in each significant hit were analyzed in all normal control

samples to assess the position-specific noise-level. Therefore, for each of the three non-

reference nucleotides a and cohort c ϵ C the relative frequency of normal samples which had

at least two reads supporting the alternative allele a were calculated as p(a,c) ϵ [0;100]. The

noise score was then computed as !∈! 𝑙𝑜𝑔!"(𝑝(𝑎, 𝑐) + 1) . Mutations at positions with a

score > 20 for at least one of the non-reference nucleotides were flagged. Elements for

which the number of mutations at flagged positions exceeded 20% were removed.

DNA palindromes

We define a palindrome as a sequence of DNA followed by its complementary reverse with a

sequence of variable length in between (Fig. 3d). It is hypothesized that these palindromes

can temporarily form DNA hairpins36. While in the hairpin state the loop region is single-

stranded and open to attack by APOBEC enzymes. Based on observations in breast cancer

whole genome sequences37, we decided to consider palindromes with a minimal repeat

length of 6 bp and an intervening sequence (loop) length of 4-8 bp. We call these regions

genome-wide using the algorithm described in Ye et al.38 however using our own

implementation (https://github.com/TobiasMadsen/detectIR). In total, we find 7.3 M

palindrome regions covering a total of 135.2 MB of which 33.6 MB are loop sequence.

Computing the false discovery rate

We controlled the false discovery rate (FDR) within each of the sets of tested genomic

elements by concatenating all integrated Brown p-values from across all tumor-type cohorts

and applying the Benjamini-Hochberg procedure17 to the integrated Brown p-values. A q-

value threshold of 0.1 was chosen to designate cohort-element combinations as significant

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 51: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

51

hits. In addition, we defined cohort-element combinations in the range 0.1≤ q<0.25 as “near

significance”. We next applied several additional, mutation-based filtering criteria to each

significant or near-significant candidate and assigned p-values of 1 to candidates that failed

these filtering criteria. Final Benjamini-Hochberg FDR values were then re-calculated on the

adjusted sets of integrated Brown p-values to arrive at a list of candidate driver cohort-

element combinations.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 52: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

52

9. Sensitivity and precision analysis of driver predictions (Iñigo Martincorena)

All methods employed in this study were shown to have a low rate of false positives when

run on a series of neutral simulated datasets without driver mutations. To evaluate the

sensitivity and precision of different methods and particularly of our approach for p-value

integration, we compared their relative performance in detecting known cancer genes when

applied to protein-coding genes. As a reference gold-standard set of known cancer genes

we used a list of 603 genes from the manually-curated Cancer Gene Census v80 database.

Results are shown in Extended Data Fig. 4.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 53: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

53

10. Gene expression analyses (Samir B. Amin, Morten M. Nielsen, Andre Kahles, Nuno

Fonseca, Lehmann Kjong, members of the PCAWG Transcriptome Working Group and

Jakob Skou Pedersen)

To extend the RNAseq-based expression profiling of GENCODE annotations provided by

the PCAWG Transcriptome Working Group39, we profiled an extended set of gene

annotations, including a comprehensive set of non-coding RNAs (described above and at

Synapse:syn5325435).

The profiling used the docker-based workflow described in 39 for 1,180 RNA-seq donor

libraries, matched to WGS data across 27 different cancer types39. In brief, raw sequence

reads from donor libraries were uniformly evaluated for QC using FastQC tool, and

subsequent alignment was performed on QC-passed libraries using two methods: STAR

(v2.4.0i)40 and TopHat2 (v2.0.12)41. Resulting QC-passed bam files were independently

used to quantify extended RNA-seq annotations at the gene-level counts using htseq-count

method with following parameters: -m intersection-nonempty --stranded=no --idattr gene_id.

This step resulted in two sets of gene-level counts files per donor library which were

independently normalized using FPKM normalization and upper quartile normalization

(FPKM-UQ). The final expression values were provided as a gene-centric table (rows as

genes, columns as samples) with each value representing an average of the TopHat2 and

STAR-based alignments FPKM values. Gene-centric tables based on both, GENCODE and

extended RNA-seq annotations are available at Synapse data portal: syn2364711. Docker-

based workflow for quantifying extended RNA-seq annotations is at

https://github.com/dyndna/pcawg14_htseq

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 54: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

54

11. Normalization for copy number variation (Henrik Tobias Madsen, Morten Muhlig

Nielsen, and Jakob Skou Pedersen)

To account for the effects of somatic copy number alterations (SCNAs) on expression, we

used two different approaches to create two additional versions of the expression profiles:

First a conservative approach where we remove all samples not having the regular bi-allelic

copy number for the gene in question. Second a less conservative approach where we build

a regression model of expression data based oncopy number (CN) data, and then tested for

an effect of somatic mutations on the residual (i.e. the expression that is not explained by

copy number).

Generally the higher the copy number of a particular gene the higher expression.

The relationship between copy number and gene expression is not strictly linear, as various

feedback mechanisms in the cell try to compensate for the mostly deleterious effects of

SCNAs. This is known as dosage compensation and has been studied extensively in the

context of mammalian sex chromosomes, but also in evolution of yeast and in diseases

caused by aneuploidy42–45. We therefore fit a linear regression model between the logarithm

of expression and the logarithm of CN. This effectively amounts to a power-regression

model.

A number of factors makes it hard to learn the regression parameters for each gene and

cancer type in isolation. (i) for some cancer types we have only a limited number of samples;

(ii) for some genes there is not much variation in CN; and (iii) the variation in expression

between samples is generally high. We overcome these problems by employing a mixed

model strategy, that allows sharing of information between genes, effectively regularizing the

parameter estimates for gene cancer-type combinations that carry little information on their

own.

Let and denote the expression and SCNA measurement respectively

for gene, , in cancer-type, , and sample respectively. We then define:

,

where and are fixed effects, whereas and are random effects, with

and . Finally the residual is .

Using this model we infer a global SCNA to expression regression, , but allow some

regularized gene-specific and gene/cancer type-specific variation: and .

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 55: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

55

Thus we exploit the similarity across genes and similarity within genes across cancer types.

Since the variance increases with the absolute value of the explanatory variable associated

with a random slope, this kind of mixed model display heteroskedasticity. Furthermore the

model is not invariant under scaling of the explanatory variable, in this case SCNA. We

centralise SCNA, such that normal diploid regions have the least variance.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 56: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

56

12. Mutation to expression association (Morten Muhlig Nielsen, Henrik Tobias Madsen,

and Jakob Skou Pedersen)

Mutation to expression association was calculated using non-parametric rank sum based

statistics on z-score normalized expression values. This equalizes expression mean and

variance for each cancer type and is a way to allow for comparison across cancer types. For

comparisons of expression in mutated vs non-mutated (wild-type) cases within the same

cancer type, the test reduces to the Wilcoxon rank sum test. For comparisons involving

samples in multiple cancer types, such as meta-cohorts and pan-cancer cohorts, the statistic

and associated p-value is still the non-parametric Wilcoxon rank sum test, and thus no

assumptions regarding distribution of z-score expressions are made. Tied expression values

were broken by adding a small random rank robust value. Association estimates were

performed based on original expression values as well as the two copy-number normalized

expression sets mentioned above. Fold difference values were calculated per mutation as

the log2 ratio of the expression of the mutated tumor to the median of all wild-type tumors of

the same cancer type. Reported Fold Difference values for an element with multiple

mutations represent the median fold difference of all mutations in that element.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 57: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

57

13. Copy number analyses (Esther Rheinbay)

We surveyed significant focal copy number alterations for candidate driver genes as

orthogonal evidence for their “driverness”. Significant copy number alterations were obtained

from the TCGA Copy Number Portal (http://portals.broadinstitute.org/tcga/home), analysis

2015-06-01-stddata-2015_04_02 regular peel-off, a database of recurrent copy number

alterations calculated by the GISTIC2 algorithm46 across >10,000 samples and 33 tumor

types from TCGA. GISTIC2 results were included for candidate drivers if a gene was

significant (residual q<0.1) and was located within a peak with ≤ 10 genes. Visualization was

performed with the Integrative Genomics Viewer (IGV)47.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 58: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

58

14. Power Calculations (Esther Rheinbay)

Calculation of mutation detection sensitivity. Average detection sensitivity, the power to

detect a true somatic variant, was calculated using a binomial model across all exon-like

regions for different genomic element types as previously published27,48. Detection sensitivity

was based on sequencing coverage and estimated clonal variant allele fraction from 60

representative PCAWG alignments from the pilot cases (https://doi.org/10.1101/161562;

Supplementary Table 7). Clonal allele fraction was estimated based on PCAWG consensus

purity and ploidy estimates (doi: https://doi.org/10.1101/161562) as purity/(average ploidy)48.

Element-wise averages were calculated as average across all exons for a given element.

Estimation of total number of promoter hotspot mutations. Detection sensitivity for all

patients was calculated for the two most recurrent TERT promoter hotspot sites

(chr5:1295228, chr5:1295250; hg19) using total read depth at these positions, sample purity

and average ploidy. For each cohort, the number and percentage of powered (≥90%)

patients was obtained. The number of total expected mutations was then inferred as number

of observed (called) mutations divided by the fraction of patients powered. The number of

“missed” mutations is the difference between the total expected and observed mutations.

Percentages of these numbers were calculated relative to the size of individual patient

cohorts. Confidence intervals (95%) on the total percentage of patients with a TERT hotspot

mutation were calculated using the beta distribution. Poisson confidence intervals were

calculated for the number of missed mutations in the PCAWG cohort. Note that the inference

of TERT mutations assumes exactly one mutation per patient. Estimates for the FOXA1

promoter hotspot mutation (chr14:38064406; hg19) were conducted using the same

procedure.

Calculation of the minimum powered mutation frequency in a population. Power to

discover driver elements mutated at a certain frequency in the population were conducted as

described before21,49, but solving for the lowest frequency for a driver element in the patient

population that is powered (≥90%) for discovery. The calculation of this lowest frequency

takes into account the average background mutation frequencies for each cohort/element

combination, the median length and average detection sensitivity for each element type,

patient cohort size, and a global desired false positive rate of 10%.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 59: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

59

15. Associations between mutation and signatures of selection: loss of

heterozygosity and cancer allelic fractions (Federico Abascal, Iñigo Martincorena)

For protein-coding sequences mutation recurrence can be analysed in the context of the

functional impact of mutations (e.g. missense, truncating) to better distinguish the signal of

selection. In contrast, estimating the functional impact of mutations in non-coding elements

of the genome is a difficult, yet unsolved problem. To overcome this limitation and be able to

compare selection signatures for both coding and non-coding elements under a similar

framework, we developed two measures of selection which are agnostic to the functional

impact of mutations.

Association between mutation and loss of heterozygosity

When a tumour carries a driver mutation in one allele of a given gene, it may be the case

that a second hit on the other allele confers a growth advantage and is positively selected.

When one of the events involves the loss of one of the alleles the process is referred to as

loss of heterozygosity (LOH). This kind of biallelic losses are typical of, but not exclusive to,

tumour suppressor genes (TSGs).

For each gene, we build a 2x2 contingency table indicating the number of cases in which the

gene was mutated or not and the number of cases in which the gene was subject to LOH or

not. We applied a Fisher’s exact test of proportions to identify which genes showed an

excess of LOH associated to mutation. P-values were corrected with the FDR method to

account for multiple hypotheses testing. This analysis was applied to each tumour type and

cohort separately and proved very successful in identifying TSG as well as some oncogenes

(OGs) as well.

Association between mutation and cancer allelic fractions

Driver mutations that provide an advantage for tumour cells are expected to show higher

allelic fractions based on different interacting processes, including: early selection;

amplification of the locus carrying the driver mutation; loss of the non-mutated locus (LOH).

Comparing cancer allelic fractions (CAF) can be informative to detect signatures of selection,

both for TSGs and OGs.

CAFs are defined here as the proportion of reads coming from the tumour and carrying the

mutation. To transform observed fractions (VAFs) into CAFs, tumour purity and local ploidy

needs to be taken into account according to the following formula:

CAF = VAF * (Lp * Pt + 2 * (1 - Pt)) / (Lp * Pt)

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 60: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

60

Where Lp corresponds to the local ploidy for the mutated locus, and Pt denotes the tumour

purity. Ploidy and tumour purity predictions were obtained from Dentro et al (in preparation).

To determine whether CAFs for a given gene or element were higher than expected we

compared them to the CAFs observed in flanking regions. To define flanking regions we took

2 kb at each side of the gene/element, excluding any eventually overlapping coding exon,

and also included introns (if present). The two sets of CAFs associated to each

gene/element, i.e. those CAFs lying within the gene and those flanking it, were compared

with a t-test to detect significant deviations. P-values were corrected with the FDR method.

This approach was able to identify most known TSGs and OGs.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 61: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

61

16. Signals of selection in aggregates of non-coding regions of known cancer genes

(Federico Abascal, Iñigo Martincorena)

We conducted a series of analyses on regions combined across genes to determine whether

the paucity of driver mutations found in non-coding regions was related to lack of statistical

power in single-gene analyses. This analysis also aimed to estimate how many driver

mutations present in cancer genes were missed in this study. For protein-coding sequences,

the number of driver mutations was estimated using dN/dS ratios as described in (50). For

non-coding, regulatory regions of protein-coding genes (Promoter and UTRs), we relied on a

modified version of the NBR negative binomial regression model described above (Section

5) to quantify the overall excess of driver mutations. We applied a second approach to

determine whether there was an enrichment of LOH associated to mutations in the different

types of non-coding regions associated to protein-coding genes.

Observed vs. expected numbers of mutations based on the NBR mutation model

NBR was used to estimate the background mutation rate expected across cancer genes,

using a conservative list of 19,082 putative passenger genes as background. The resulting

model is used to predict the numbers of expected SNVs and indels per element type per

gene, and aggregate sums across genes. For this analysis we selected a reduced but

diverse set of 142 cancer genes, encompassing 27 focally amplified and 22 deleted genes

found by TCGA and present in the Cancer Gene Census51, and 112 genes with a significant

signal of selection in this study (q<0.1) and present in the Cancer Gene Census. The list of

142 genes can be found in Supplementary. Table 7. To be as accurate as possible, we

used a diverse set of covariates in the NBR model, including: local mutation rate (estimated

on neutral regions +/- 100 kb around each gene), detection sensitivity (defined as the

element-average proportion of callable samples per site according to MuTect), gene

expression covariates (first 8 principal components of the matrix of average gene expression

values in each tumour type, as well as two binary variables marking the 500 genes with

highest expression values in any tumour and 1,229 genes with a maximum FPKM lower than

0.1 across tumour types), and averaged copy-number calls for each gene across all samples

(see Supplementary Note for more details). For each element type, the sum of observed

mutations across the 142 cancer genes were compared to the sum of the expected rates to

estimate the excess of mutations in regulatory and coding regions of cancer genes. An

excess of observed mutations provides an estimate of the number of driver events18.

Confidence intervals were calculated using the equation for the ratio of two Poisson

observations, which are the number of mutations in the list of known cancer genes and in the

list of passenger genes. It is important to know that these confidence intervals do not capture

uncertainty in the background model and should be interpreted with caution. For this reason,

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 62: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

62

we systematically evaluated the impact of a diverse array of covariates on our estimates

(see Supplementary Note). We also note that this test can underestimate the number of

non-coding drivers since some driver mutations can be present in the list of putative

passenger genes, although this effect is expected to be quantitatively small if the density of

driver mutations in regulatory regions of known cancer genes is higher than in those of

putative passenger genes.

Mutation-LOH association for aggregates of genes

For this analysis we combined data across known cancer genes, including 603 genes in the

Cancer Gene Census v80 and 154 additional significantly mutated genes found by exome

studies18,21. To estimate whether there was an excess of LOH associated to mutation in

regulatory and coding regions of cancer genes, we calculated the fold change in LOH for the

aggregate of cancer genes and normalized it dividing by the fold change observed in

passenger genes. Confidence intervals were estimated using parametric bootstrapping

(100,000 pseudoreplicates) for both cancer and passenger genes.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 63: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

63

17. Mutational process and indel enrichment (Federico Abascal, Iñigo Martincorena)

After noticing the skewed distribution of indel lengths in genes like ALB, SFTBP, NEAT1 and

MALAT1, we carried a search for other genes showing the same pattern, which may be

subject to the same mutational process. For every gene we record the proportion of indels of

length 2-5 bp out of the total number of indels and compared this proportion with the

background proportion using a binomial test. The background proportion was calculated

using all protein-coding and lncRNAs genes. For every gene we also calculated the indel

rate and compared it to the background indel rate using a binomial test. Both sets of p-

values were independently corrected with the FDR method. The analysis was done for each

tumour type separately. Genes with a q-value < 0.1 both for enrichment in 2-5 bp indels and

for higher indel rates were further analyzed as candidates to be under the process of

localized indel hypermutation described in this study. The levels of expression of these

genes were analysed across all tumour types.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 64: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

64

18. Mutations association to splicing (Andre Kahles)

For the assessment of the relationship between mutations in the U2 locus upstream of

WDR74 and local changes in alternative splicing, we analysed changes in the percent-

spliced-in (PSI) value52 of alternative splicing events located in WDR74 relative to the given

genotype. The analysis was based on four different alternative splicing event types extracted

and quantified by the PCAWG Transcriptome Working Group (exon skip, intron retention,

alternative 3’ splice site, alternative 5’ splice site)39. In total we analysed 116, 103, 27 and 98

events, respectively, for the above event types. We then filtered the events for a minimum of

expression evidence (non-NaN PSI) in at least 10 samples, presence of the alternate allele

together with expression evidence in at least 5 samples, and a minimum absolute distance

of mean PSI values of alternate and reference group of 0.05. Based on the selected events,

we retained 3, 3, 1 and 0 events for analysis. Using an ordinary least squares model with the

genotypes as factors and the the PSI values as responses, we computed p-values and the

variance explained by presence of mutations. Multiple testing correction followed the

Bonferroni method.

To assess the relationship between mutations in the U2 locus upstream of WDR74 and

global changes in splicing, we used two global splicing statistics. We measured the amount

of splicing as the number of edges in the splicing graph of a gene in a given samples. All

splicing graphs were taken from the analyses of the PCAWG Transcriptome Working

Group39. For each sample, we computed the mean number of splicing edges over all genes,

resulting in the extent of splicing per sample. We measured the extent of splicing as how far

each event is outlying from the mean over all event in the same cancer type (encoded by

project code x histotype). The splicing outlier values per sample and gene were taken from

the PCAWG Transcriptome Working Group analyses39. The mean over all genes of a

sample was taken as a statistic for the extent of splicing.

For both statistics, we again used an ordinary least squares model as described above to

model the relationship between genotype and splicing. As splicing is highly tissue

dependent, we included the project codes as additional factors, accounting for both tissue

specificity as well as possible underlying batch effects. To extract the amount of additional

variance explained through the presence of mutations, we computed one model on project

codes and presence of mutations and one model on project codes alone. For the model, we

only included samples of project codes, where we had at least one sample with a mutation

present, resulting in a total of 618 samples over 14 project codes.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 65: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

65

19. Structural variation analysis (Morten Muhlig Nielsen, Lars Feuerbach)

Structural variant data was provided by the PCAWG Structural Variation Working Group.

The data provide p-values for the observed breakpoint counts in 50kb bins along the

genome. Candidate elements were overlapped with the bins, and fisher’s method was used

to calculate a single p-value for each element. The set of element p-values were corrected

with the FDR method.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 66: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

66

20. RNA structural analysis (Radhakrishnan Sabarinathan, Ciyue Shen, Chris Sander,

Jakob Skou Pedersen)

In order to test if the observed mutations (SNVs) in the RMRP gene are biased towards high

RNA secondary structure impact, we performed a permutation test by following the steps

used in oncodriveFML24 together with the predicted structural impact scores from RNAsnp53.

At first, the RNAsnp was run with the options -m 1 -w 300 and other default parameters to

obtain the minimum correlation coefficient (r_min) score for each possible mutations in the

RMRP gene. The r_min scores were then transformed, 1-((r_min+1)/2), to range between 0

and 1, where 1 indicates high structural impact score. Further, we followed the steps of

oncodriveFML (see section ‘oncodriveFML’ for more details) with 1,000,000 randomizations

and by using per sample mutational signatures (i.e., the probability of observing a mutation

in a particular tri-nucleotide context in a given sample) to compute the p-value at the cohort

and sample level.

Furthermore, the RNA secondary structure impact scores (r_min) of indels

(insertions/deletion) were computed by using a modified version of RNAsnp (since the

current version of RNAsnp is limited to substitutions only). Briefly, we first computed the

base pair probability matrices of wild-type and mutant sequences (by taking into account the

insertion or deletion) and then adjusted the size of matrices to be equal (by introducing

additional rows and columns with zeros in one of the matrices with respect to insertion or

deletion). Further, by following the steps of RNAsnp, we computed the r_min score. The

structure shown in Fig. 6d is based on the conserved secondary structure annotation

obtained from Rfam (RF00030)54.

Tertiary structure contacts in RMRP were predicted using evolutionary couplings co-variation

analysis (EC analysis 55) of the multiple sequence alignment of 933 eukaryotic RMRP

sequences from Rfam (RF00030). The EC analysis (software available at

https://github.com/debbiemarkslab/plmc) was run with the options -le 20.0, -lh 0.01, -t 0.2, -m

100 and the top 100 interactions were chosen as predicted contacts, either in secondary or

tertiary structure, depending on local context. As no experimental 3D structure or cross-

linking experiments of the mammalian RMRP are available, interaction sites were inferred by

homology to the partially known yeast RMRP crystal structure. We (1) aligned the human

RMRP sequence with the Saccharomyces cerevisiae RMRP sequence using the sequence

family covariance model from Rfam and (2) mapped the locations of RNA-protein

interactions within 4Å56 from the crystal structure and the experimentally determined RNA-

protein crosslinking sites57, and RNA substrate crosslinking sites58 from the yeast sequence

to the human RMRP sequence. For the crosslinking sites, a ±3 nucleotide window is

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 67: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

67

reported as the interaction site. In order to test if the locations of the observed indels are

biased towards tertiary structure, protein- or substrate-interaction sites, 1,000,000

randomizations of five indels were performed assuming uniform distribution of indels across

the RMRP gene body, and an empirical p-value was calculated (P = .08), showing a

potential for functional impact.

Two different overlapping deletion calls in the RMRP gene body were observed in the same

thyroid cancer patient. After manual inspection of the tumor and normal bam files, it was

found that these calls were based on the same mutational event.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 68: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

68

21. Cancer associated germline variant distance to non-coding driver candidates.

(Morten Muhlig Nielsen)

We used a set of genome-wide significant cancer associated germline SNPs (n=650) from

the NHGRI-EBI GWAS catalog59 as collected by Sud et al60. We evaluated the genomic

distance from candidate non-coding drivers to the closest germline variant. All distances

were above 50 kb with the exception of the TERT promoter which was 1 kb away from a

coding variant (rs2736098) in the TERT gene.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 69: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

69

References

1. Ramos, A. H. et al. Oncotator: cancer variant annotation tool. Hum. Mutat. 36, E2423–9 (2015).

2. Kasar, S. et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat. Commun. 6, 8866 (2015).

3. Kim, J. et al. Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors. Nat. Genet. 48, 600–606 (2016).

4. Polak, P. et al. A mutational signature reveals alterations underlying deficient homologous recombination repair in breast cancer. Nat. Genet. 49, 1476–1486 (2017).

5. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).

6. Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015).

7. Burge, S. W. et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–32 (2013).

8. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

9. Lestrade, L. & Weber, M. J. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 34, D158–62 (2006).

10. Kozomara, A. & Griffiths-Jones, S. miRBase: annotating high confidence miRNAs using deep sequencing data. Nucleic Acids Res. 42, D68–D73 (2014).

11. Londin, E. et al. Analysis of 13 cell types reveals evidence for the expression of numerous novel primate- and tissue-specific microRNAs. Proc. Natl. Acad. Sci. U. S. A. 112, E1106–15 (2015).

12. Fingerman, I. M. et al. NCBI Epigenomics: a new public resource for exploring epigenomic data sets. Nucleic Acids Res. 39, D908–12 (2011).

13. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

14. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).

15. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

16. Polak, P. et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364 (2015).

17. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).

18. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell (2017). doi:10.1016/j.cell.2017.09.042

19. Lanzos, A. et al. Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes: New Candidates and Distinguishing Features. (2016). doi:10.1101/065805

20. Lochovsky, L., Zhang, J., Fu, Y., Khurana, E. & Gerstein, M. LARVA: an integrative framework for large-scale analysis of recurrent variants in noncoding annotations. Nucleic Acids Res. 43, 8123–8134 (2015).

21. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).

22. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

23. Juul, M. et al. Non-coding cancer driver candidates identified with a sample- and position-specific model of the somatic mutation rate. Elife 6, (2017).

24. Mularoni, L., Sabarinathan, R., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. OncodriveFML: a general framework to identify coding and non-coding regions with

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 70: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

70

cancer driver mutations. Genome Biol. 17, 128 (2016). 25. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human

genetic variants. Nat. Genet. 46, 310–315 (2014). 26. Douville, C. et al. Assessing the Pathogenicity of Insertion and Deletion Variants with

the Variant Effect Scoring Tool (VEST-Indel). Hum. Mutat. 37, 28–35 (2016). 27. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and

heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 28. Anscombe, F. J. & Fisher, R. A. Statistical Methods for Research Workers. J. R. Stat.

Soc. Ser. A 118, 486 (1955). 29. Brown, M. B. 400: A Method for Combining Non-Independent, One-Sided Tests of

Significance. Biometrics 31, 987 (1975). 30. Poole, W., Gibbs, D. L., Shmulevich, I., Bernard, B. & Knijnenburg, T. A. Combining

dependentP-values with an empirical adaptation of Brown’s method. Bioinformatics 32, i430–i436 (2016).

31. Poole, W., Gibbs, D. L., Shmulevich, I., Bernard, B. & Knijnenburg, T. Combining Dependent P-values with an Empirical Adaptation of Brown’s Method. (2015). doi:10.1101/029637

32. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012).

33. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).

34. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016).

35. Fredriksson, J., Eliot, K., Filges, S., Stahlberg, A. & Larsson, E. Recurrent promoter mutations in melanoma are defined by an extended context-specific mutational signature. (2016). doi:10.1101/069351

36. Pearson, C. E., Zorbas, H., Price, G. B. & Zannis-Hadjopoulos, M. Inverted repeats, stem-loops, and cruciforms: significance for initiation of DNA replication. J. Cell. Biochem. 63, 1–22 (1996).

37. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

38. Ye, C., Ji, G., Li, L. & Liang, C. detectIR: a novel program for detecting perfect and imperfect inverted repeats using complex numbers and vector calculation. PLoS One 9, e113349 (2014).

39. Fonseca, N. A. et al. Pan-cancer study of heterogeneous RNA aberrations. bioRxiv 183889 (2017). doi:10.1101/183889

40. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

41. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

42. Hose, J. et al. Dosage compensation can buffer copy-number variation in wild yeast. Elife 4, (2015).

43. Lockstone, H. E. et al. Gene expression profiling in the adult Down syndrome brain. Genomics 90, 647–660 (2007).

44. Birchler, J. A. Reflections on studies of gene expression in aneuploids. Biochem. J 426, 119–123 (2010).

45. Heard, E. & Disteche, C. M. Dosage compensation in mammals: fine-tuning the expression of the X chromosome. Genes Dev. 20, 1848–1867 (2006).

46. Mermel, C. H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 12, R41 (2011).

47. Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 71: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

71

48. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).

49. Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature 547, 55–60 (2017).

50. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell (2017). doi:10.1016/j.cell.2017.09.042

51. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140 (2013).

52. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).

53. Sabarinathan, R. et al. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum. Mutat. 34, 546–556 (2013).

54. Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–7 (2015).

55. Weinreb, C. et al. 3D RNA and Functional Interactions from Evolutionary Couplings. Cell 165, 963–975 (2016).

56. Perederina, A., Esakova, O., Quan, C., Khanova, E. & Krasilnikov, A. S. Eukaryotic ribonucleases P/MRP: the crystal structure of the P3 domain. EMBO J. 29, 761–769 (2010).

57. Khanova, E., Esakova, O., Perederina, A., Berezin, I. & Krasilnikov, A. S. Structural organizations of yeast RNase P and RNase MRP holoenzymes as revealed by UV-crosslinking studies of RNA–protein interactions. RNA 18, 720–728 (2012).

58. Esakova, O., Perederina, A., Berezin, I. & Krasilnikov, A. S. Conserved regions of ribonucleoprotein ribonuclease MRP are involved in interactions with its substrate. Nucleic Acids Res. 41, 7084–7091 (2013).

59. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–6 (2014).

60. Sud, A., Kinnersley, B. & Houlston, R. S. Genome-wide association studies of cancer: current insights and future perspectives. Nat. Rev. Cancer 17, 692–704 (2017).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 72: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

72

Supplementary Note

1. Overview of non-coding hotspots in top 50 not categorized in main text

2. Discussion of additional significant non-coding elements

3. Evaluation of splicing association of U2 mutations upstream of WDR74

4. Enrichment of protein-coding drivers

5. Impact of covariates on the estimation of driver mutations in functional regions

of cancer genes

6. Author contributions by category

7. References

1. Overview of non-coding hotspots in top 50 not categorized in main text

Six of the 35 non-coding hotspots in top 50 are not assigned to the mutational processes

described in the main text and highlighted in Fig. 1b. Below we describe these in detail.

Four of the six remaining non-coding hotspots were found on the X chromosome

(X:116579329, X:7791111, X:83966025, and X:83967552). The mutations in these hotspots

were mainly contributed by males (61-94% of mutations per hotspot), suggesting the

possibility of uncaught noise. One of these hotspots is located in palindromic DNA

(X:7791111). The DNA sequence around this position is composed of two mononucleotide

repeats of pairing bases (ATTTTTAAAAAAAAAT), which fits our definition of palindromic

structures (Methods). The sequence context around this hotspot do not match the sequence

recognized by APOBEC enzymes1 and the hotspot had a low APOBEC signature probability

(Fig. 1b). It is therefore unlikely to be caused by the same mutational processes as the other

palindromic hotspots in the top 50 hotspots, but rather likely represents noise associated

with homopolymer runs (PCAWG variants paper).

A hotspot (3:164903710) contained mutations in multiple cancer types with most mutations

contributed by Liver-HCC (6/17). It is located about 1 kb downstream of SLITRK3, two base

pairs from an annotated CTCF transcription factor binding site. CTCF sites and the base

pairs immediately downstream are known to be highly mutated in several cancer types

including Liver-HCC. This position is lowly conserved, suggesting that it would not have a

functional impact2.

The last non-coding hotspot (1:103599442) had a high proportion of mutations attributed to

the COSMIC5 signature. It overlaps a repetitive LINE region, suggesting potential mapping

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 73: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

73

issues. Moreover, the position is flanked by two other positions, two base pairs away in both

directions, which also have mutations in the same samples and reads, however these

positions were called problematic in the Panel-of-Normals (PoN) filter, further suggesting

mapping issues.

2. Discussion of additional significant non-coding elements

WDR74 promoter

The WDR74 promoter has already been suggested as a potential driver in several studies3–6.

However, we found that mutations are concentrated inside a U2 RNA where the density of

putative polymorphisms is abnormally high. This outstanding level of diversity could in

principle be due to higher mutability in the germline. However, given the repetitive nature of

the U2 element (there are hundreds of copies in the genome) and the extreme levels of

putative diversity, it is more likely that this region is a source of mapping artefacts. This

observation raise concerns about the validity of WDR74 promoter and, hence, on its

potential as a cancer driver (Extended Data Fig. 10).

TMEM107 3’UTR

The case of the TMEM107 3’UTR is very similar to that of the WDR74 promoter. There is a

small RNA (SNORD118) embedded in the 3’UTR, and is there where most mutations

concentrate. In PCAWG normals data there is a remarkable excess of putative SNP diversity

in this element. (Extended Data Fig. 10).

Additional non-coding RNAs hits

Eleven additional ncRNAs or ncRNA promoter hits, not described in detail in the main text,

passed the flagging of potential false positive hits (RNU6-573P, RPPH1, RNU12, TRAM2-

AS1, G025135, G029190,RP11-92C4.6 and promoters of RNU12, MIR663A, RP11-

440L14.1, and LINC00963). The driver role of these were generally not supported by

additional lines of evidence and lacked functional evidence or appeared to be affected by

technical artefacts. They are described individually below.

RNU6-573P

The small RNA RNU6-573P is detected in Endocrine pancreas (q = 7.3x10-4) with three

mutations. However, it appears to be a nonfunctional pseudogene recently inserted in the

human lineage. Not only the locus, but the wider region is subject to increased mutational

burden, further supporting that mutational mechanisms or technical issues rather than

selection underlies the mutational recurrence.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 74: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

74

RNU12

RNU12 is a spliceosomal RNA, which was found significant in Lymph-BNHL (q = 5.0x10-2)

and in the Hematopoietic system meta cohort (q = 7.0x10-2). The mutation rate is 1.6 times

higher inside the gene in pan-cancer compared to the flanking regions. However, the

mutation rate is 4 times higher in gnomAD, which renders it as a likely problematic region.

Significance in the promoter of RNU12 was also identified in Lymphomas (q = 3.5x10-2) and

the Hematopoietic system meta cohort (q = 1.8x10-6). But this region is overlapping the

promoter of POLDIP3 (Polymerase delta-interacting protein 3), which makes interpretation

difficult.

MIR663A

The promoter of MIR663A was recurrently mutated in the carcinoma meta cohort (q =

7.2x10-4). It is the primary transcript of MiR-663, which has tumorigenic functions in gastric

cancer and nasopharyngeal carcinoma7. The mutation rate is 1.2 times higher inside the

element compared to the flanking regions, but the mutation rate is 2.4 times higher in

gnomAD. The mutations were also not correlated with expression.

RPPH1

RPPH1 forms the RNA component of the RNase P ribonucleoprotein, which matures

precursor-tRNAs by cleaving their 5’ end8. It is transcribed from a divergent promoter

together with the protein-coding gene PARP2, however, no association with expression was

observed for either gene. The region has a high level of germline polymorphisms in normal

samples indicative of a problematic region to map.

TRAM2-AS1

The promoter element of the antisense non-coding RNA TRAM2-AS1 is recurrently mutated

in the Female reproductive system meta cohort (q = 0.10). The promoter is shared with the

divergently transcribed protein-coding gene TRAM2. Interestingly, the promoter has 8

mutations in other cohorts, which collectively associate with the expression of TRAM2 (P =

0.03; carcinoma) but not TRAM2-AS1 (P = 0.9). It is uncertain which gene is controlled by

this significant promoter element.

G025135

G025135 is a ncRNA element from MiTranscriptome9 identified here as a significant driver

candidate in Lymph-CLL (q = 0.01). There are patients with many mutations, indicating that

an unknown mutation process is at play leading to false driver prediction.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 75: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

75

G029190

The G029190 ncRNA is also from MiTranscriptome with significance in Kidney-RCC (q =

0.04) and in the Kidney meta cohort (q = 0.05). It is located downstream of RAB11FIP3 and

upstream of CAPN15. A possible function of this gene has not been established.

LINC00963

LINC00963 was found significant in the Kidney meta cohort (q = 0.04). It has many

alternative transcripts with different TSSs and thus a large promoter. LINC00963 is known to

be involved in the prostate cancer transition from androgen-dependent to androgen-

independent and metastasis via the EGFR signaling pathway10. The SNV mutation rate is

generally high (0.05 mutations per position), although lower than the flanking regions (0.08).

RP11-92C4.6

RP11-92C4.6 is a predicted antisense ncRNA, which was identified as a significant driver

candidate in the Breast meta cohort (q = 0.08). It is a short region with low conservation

containing four SNVs. It is overlapping the COL15A1 protein-coding gene with the upstream

promoter region. COL15A1 has previously been linked to ovarian cancer. It is unclear

whether this ncRNA annotation is valid and whether these reflect true functional mutations.

RP11-440L14.1

The promoter of RP11-440L14.1 was identified as a candidate driver in the Carcinoma meta

cohort with 14 mutations (q = 5.4x10-3). It has a hotspot position with four mutations

overlapping two different deletions. The lncRNA is located between CPLX1 and PCGF3. The

validity of the annotation and possible function for this lncRNA has not been established.

3. Evaluation of splicing association of U2 mutations upstream of WDR74

We hypothesized that the mutations observed in the evolutionarily conserved spliceosomal

U2 RNA upstream of WDR74 may affect splicing. As prior sequencing evidence,

summarised in the GENCODE version 19 gene annotations11, shows that the U2 is co-

transcribed as an alternative 5’ exon in some cases, it might play a role in splicing and

regulation of the WDR74 transcript. We therefore both evaluated (A) whether the mutations

associate with alternative splicing of the WDR74 gene; and (B) whether they associate with

transcriptome-wide changes in splicing.

A) We modelled the association of the the somatic mutations with the amount of alternative

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 76: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

76

splicing, as measured by percent-spliced-in values12, using ordinary least squares

regression, which did not reveal any significant associations. For none of the seven tested

alternative splicing events (three exon skips, three intron retentions, one alternative 3-prime

splice site) the genotype was able to explain a substantial fraction of the observed variation.

The largest R-squared value was 0.01 and no p-value reached nominal significance.

B) A similar result was obtained for the relationship between the presence of U2 mutations

and global changes in alternative splicing. Modeling the amount of alternative splicing on the

ICGC project codes alone, which reflect cancer types and contributing institutions, we

reached an R-squared value of 0.612, reflecting the strong relationship between splicing and

tissue identity. Including the presence of somatic mutations in the model did not change this

value. We obtained an analogous result for the extent of alternative splicing, reaching an R-

squared of 0.414 on the project codes alone and the same value when including the

genotype into the model.

In conclusion, we found no evidence for an association between mutations in the U2

upstream of WDR74 and alternative splicing of WDR74 transcripts or global changes in

splicing.

4. Enrichment of protein-coding drivers

Collectively, mutations occurring in the promoter region of the 757 cancer genes did not

have a significantly different association with expression than synonymous mutations

(Supplementary fig. 1a). Similarly, promoter and UTR mutations in cancer genes are not

significantly enriched in LOH with respect to mutations in putative passenger genes

(Supplementary fig. 1b). This is consistent with the prediction above that only a very small

fraction of the mutations observed in the promoters and UTRs of known cancer genes are

genuine driver events.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 77: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

77

Supplementary figure 1: a, Expression associated with mutations in coding and promoter

regions of cancer genes. Z-score expressions associated with non-sense mutations deviate

significantly from silent mutations, likely through nonsense mediated decay, whereas

expressions associated with promoter mutations do not differ from that of silent mutations.

Only mutations in diploid positions were used. b, Excess of LOH associated to mutation in

regulatory and coding regions of cancer genes. The y-axes shows the ratio of fold changes

in cancer vs. passenger genes, with fold changes representing the excess or depletion of

LOH associated with mutation.

5. Impact of covariates on the estimation of driver mutations in functional regions of

cancer genes

As described in Methods, we estimated the abundance of driver mutations in coding and

regulatory regions (promoter and UTRs) of 142 known cancer genes using the NBR

background model fitted on putative passenger genes.

Different covariates were included to improve the fit of the model. The local mutation rate,

calculated on neutral regions within +/-100 kb around each element, was included to account

for regional variation of mutation rates. Detection sensitivity (d.s.) was averaged for each

element using the MuTect estimates of callable sites from each sample. For genes in

chromosomes X and Y, we did not have MuTect estimates and d.s. was imputed using a

linear regression model of d.s. as a function of GC content. Detection sensitivity accounts for

elements where the mutation rate is lower than expected just because of poor sequencing

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 78: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

78

coverage. A third type of covariate was included in the model to account for associations

between gene expression levels and mutation rates. Starting with a matrix of mean FPKM

expression values for each gene and tumour type, we log-transformed and scaled the

expression matrix using pseudocounts and applied Principal Component Analysis to reduce

the dimensionality. We selected the first 8 components as covariates, which together

explained 95.5% of the variance. In addition, we added two additional covariates to account

for non-linearity between expression and mutation rate in the tails of the expression

spectrum. To accomplish this, we created two binary variables, one marking the 500 genes

with highest maximum expression values across tumour types, the other marking 1,229

genes whose expression did not exceed FPKM values of 0.1 in any tumour type. Finally,

since tumours are rich in amplifications and deletions and these events may result in

seemingly increased or decreased mutation rates, we included a copy-number covariate,

calculated as the average copy number of each gene across all PCAWG samples.

We intentionally selected a small set of cancer genes to estimate the abundance of driver

mutations because with larger numbers of genes the signal of selection becomes weaker

while any systematic bias may become more prominent. We noticed these biases becoming

increasingly apparent when analyzing larger sets of genes. In an attempt to capture a

diverse but relatively small group of cancer genes, we included Cancer Gene Census genes

recurrently mutated by coding mutations in PCAWG and genes frequently altered by copy

number gains or losses, as described in Methods.

Supplementary Table 9 shows the impact of using different covariates on the 142 genes

selected for this analysis. Reassuringly, this shows that the estimates are broadly consistent

across models with different covariates, with variations typically within the confidence

intervals of alternative models. This confirms that the overall conclusions are largely

unaffected by the use of different models. Supplementary Figure 2 shows the results for

the 142 genes and for a larger set of 603 genes from the Cancer Gene Census.

To evaluate the performance of the NBR model, we compared the number of driver

substitutions predicted by NBR in the CDS regions of the 142 cancer genes to the number

predicted by dN/dS (calculated by dNdScv). dN/dS offers an independent estimate of the

number of driver substitutions in a group of genes using the local density of synonymous

mutations to estimate the neutral expectation13, instead of predicting the background

mutation rate by extrapolation from putative passenger genes using a regression model.

Reassuringly, in these 142 genes, NBR predicts 2,258 (CI95%: 2,127-2,382) driver

substitutions and dN/dS predicts 2,384 (CI95%: 2,043-2,759).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 79: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

79

Supplementary figure 2: Estimation of the excess substitutions (left) and indels (right) in

regulatory and protein-coding regions of cancer genes and for the 142 (left) genes and a

larger set of genes from the Cancer Gene Census (right). Ratios of observed vs. expected

number of mutations (top), the percentage of mutations predicted to be drivers (middle), and

the total number of predicted drivers in all cancers and in each patient (bottom) are shown.

6. Author contributions by category

Driver discovery

A.L., C.H., C.W., D.A.W., E.K., E.M.L., E.R., G.G., G.T., H.M.U., I.M., J.K., J.R., J.S.P.,

K.A.B., K.D., K.I., L.M., L.U.-R., M.M.N., M.P.H., N.A.S., P.D., P.J.C., R.J., S.B.A., T.A.J.,

T.T. and , Y.F. contributed and curated genomic annotations. C.H., C.W.Y.C., I.M., S.B.A.

and Y.E.M. contributed randomized mutational data sets for driver discovery. A.L., A.G.-P.,

A.H., D.L., D.T., E.K., E.M.L., E.R., H.H., H.M., I.M., I.R., J.B., J.C.-F., J.D., J.F., J.M.H.,

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 80: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

80

J.R., J.Z., K.C., K.D., K.I., L.L., L.M., L.S., L.U.-R., L.W., M.B.G., M.J., N.L.-B., O.P., P.D.,

Q.G., R.S., S.K. and S.S. contributed driver methods and results. E.R., G.G. and G.T.

implemented results integration. A.K., C.v.M., C.V., G.T., H.H., I.M., J.R., L.F. and M.M.N.

contributed driver results integration. C.H., C.W.Y.C., E.K., E.R., G.G., J.K., J.M.H., J.S.P.,

M.M.N. and R.I.P. contributed single site recurrence analysis.

Candidate vetting and filtering

E.R., F.A., H.H., H.T.M., J.K., L.F. and M.M.N. contributed individual candidate filters. A.L.,

C.H., C.W., E.K., E.M.L., E.R., F.A., G.G., G.T., H.H., H.M., H.M.U., J.K., J.M.H., J.S.P.,

K.D., L.F., L.S., M.M.N., M.S.L., N.A.S. and R.J. performed candidate vetting.

Case-based analysis

E.R., F.A., G.G., H.H., I.M., J.M.H., J.R., J.S.P., K.P., M.M.N. and M.P.H. contributed case-

based analysis. A.G.-P., A.H., A.L., C.H., D.C., D.T., E.K., E.R., F.A., G.G., G.T., H.H., H.K.,

I.M., J.C.-F, J.R., J.S.P., K.I., K.P., L.M., L.S., L.U.-R., L.W., M.A.R., M.B.G., M.M.N., M.S.L.,

N.A.S., N.L.-B., O.P., R.I.P., R.S., S.K. and Y.K. contributed results interpretation. A.K.,

J.S.P., K.A.B., K.-V.L., M.M.N., N.A.F., S.B.A., T.A.J. and T.T. contributed expression

profiling (extended GENCODE set). A.K., C.V., D.C., H.M., H.T.M., J.R., J.S.P., K.I., L.W.,

M.A.R., M.M.N., M.S.L. and S.B.A. contributed mutation-to-expression correlation analysis.

A.K., C.V., D.C., J.R., L.W., M.A.R., N.A.S. and Z.Z. contributed network or pathway

analysis. R.S, Ciyue S., Chris S., and J.S.P. contributed structural RNA analysis.

Power analysis and driver mutations at known cancer genes

E.R. analysed SNV detection and driver discovery power. I.M. evaluated excess number of

mutations at known drivers. F.A. and M.M.N. integrated additional evidence.

Leadership and organizational work

E.R., G.G. and J.S.P. contributed working group leadership. A.G.-P., D.A.W., E.K., E.R.,

G.G., G.T., I.M., J.R., J.S.P., L.F., L.M., M.B.G., N.L.-B., O.P., P.J.C., R.S. and S.K.

contributed organization.

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 81: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

81

7. References

1. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

2. Katainen, R. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nat. Genet. 47, 818–821 (2015).

3. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165 (2014).

4. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

5. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).

6. Fujimoto, A. et al. Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509 (2016).

7. Yi, C. et al. MiR-663, a microRNA targeting p21(WAF1/CIP1), promotes the proliferation and tumorigenesis of nasopharyngeal carcinoma. Oncogene 31, 4421–4433 (2012).

8. Baer M, E. al. Structure and transcription of a human gene for H1 RNA, the RNA component of human RNase P. - PubMed - NCBI. Available at: https://www.ncbi.nlm.nih.gov/pubmed/2308839. (Accessed: 7th July 2017)

9. Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015).

10. Wang, L. et al. Linc00963: a novel, long non-coding RNA involved in the transition of prostate cancer from androgen-dependence to androgen-independence. Int. J. Oncol. 44, 2041–2049 (2014).

11. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).

12. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).

13. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell 171, 1029–1041.e21 (2017).

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 82: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

0088880000009890009199999110000101120120131001700201010

Figure 1a

d

cProtein-coding (pc) gene

lncRNA gene

Promoter (pc)5'UTR

Promoter (lncRNA)Splice sites (lncRNA)

Splice sites (pc)Protein-coding

3'UTR

Enhancer miRNA snoRNA

lncRNA

miRNASmall RNA

Enhancer

with two isoforms

anno

tatio

nsD

eriv

ed e

lem

ents

for

driv

er d

isco

very

Inpu

t

b

C > A

C > G

C > T

T > A

T > C

T > G

Mutation types:

# Known driver hotspot

Protein-coding

Promoter (pc)

Intron

Gene annotation:

Promoter (lncRNA)

Lym

phoi

d m

alig

nanc

ies

000000000032000020010000010002010120203020051130000

Bre

ast−

Ade

noC

a

015000001511150001016000000000000180000010000271022100290151

Ski

n−M

elan

oma

000000002200000002000000020000020080000060064228220

Col

oRec

t−A

deno

Ca

0000001000100000000300000200000203205020400041021060

Pan

c−A

deno

Ca

10000000100000000201000000060001011010000001110040

Live

r−H

CC

0000000000

272200000000000005000

40050500005000

1450000

480

Bla

dder

−TC

C

60000080000000006100000000000012000010000000110001

Pro

st−A

deno

Ca

0000000000000000000000000000000400002000604

15220062

Hea

d−S

CC

000000000009000000000000000000000300003000030006314

Lung

−Ade

noC

a

0000000030550000000000000000050000000000000000210230

Thy

−Ade

noC

a

000000000000000000000000000000800000006000

1603000

520

CN

S−G

BM

1:10359944215:9093138314:10632741714:10632935014:10632923614:106326877X:8396755220:325809271:11525653019:215179310:11551159310:11551159014:10632955014:10632985214:10632919612:498776X:839660253:16490371014:10624024317:757712114:10632755914:10632711514:10632661814:10632661914:106326887X:779111114:106329192X:1165793298:569871416:1427062062:20911311217:757753914:10632894217:757819017:757821214:10632671317:757709414:10632697217:757753819:4999069417:757840614:1063269445:12952503:17893609117:75771203:1789520857:14045313612:253982855:129522812:25398284

Gen

omic

pos

ition

IQGAP1

NRAS #PLEKHS1PLEKHS1

KDM5A

TP53 #

RPS20GPR126IDH1 #TP53 #TP53 #TP53 #TP53 #TP53 #RPL13ATP53 #TERT #PIK3CA #TP53 #PIK3CA #BRAF # KRAS # TERT #KRAS #

Gen

e

RP5-1125A11.1

G12D/A/V

G12S/R/CV600EH1047L/RR273L/HE545Q/K

R175H

R248P/Q

R282W

R213*Y220C

R248G/WR132H

R273S/C

Q61K

Pro

tein

cha

nge

Center of palindrome

IGH locusPositional annotation:

0 10 20 30 40 50 60Percentage of cohort mutated

Tumor type specific(all or all but one mutation in tumor type)

Pan-ca

ncer

Carcin

oma

Adeno

carci

noma

Hemato

poiet

ic

Glioma

Squam

ous

Sarco

ma

Digesti

ve tr

act

Female

repr

oduc

tive t

ract

CNS

Breas

t

Lymph

oid

Kidney

Lung

Myeloi

d

Num

ber

of p

atie

nts

0

500

1000

1500

2000

2500 Cell type of origin Organ system

− 20

0 (7

.7 %

)−

150

(5.8

%)

− 10

0 (3

.9 %

)−

50 (1

.9 %

)−

0 (0

.0 %

)

16697

1515151515151616161616161616161717171717171717171718181919192020202122222424272828333538394146

56

− 0%

− − − − − 10

0%

Age APOBEC

AID COSMIC5

UV Other

Mutational processes:

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 83: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Figure 2a

Mutational burden

Clustering

Functional impact

Distal TF binding site Promoter

ActiveDrive

rWGS

Com

posite

Drive

r

dNdScv

ExInAto

r

LARVAPlus

MutS

ig suite

NBR

ncd

Dete

ct

ncD

rive

r

Onco

drive

FM

L

regDrive

r

ERCC4

EEF1A1

SLC30A3

ATRX

PIK3CA

RB1

IDH1

PTEN

EGFR

TP53

Com

posite

Drive

r

ncd

Dete

ct

onco

drive

FM

L_ve

st3

ExIn

Ato

r

Active

Drive

rWGS

MutS

ig

LARVA

Drive

rPower

onco

drive

FM

L_ca

dd

NBR

dNdScv

ncD

riverC

onse

rvatio

n_indel

ncD

riverC

onse

rvatio

n_sn

v

Inte

gra

tion

* Significant

0 5 10 15

−log10

P

− Near-significant

Corr

ela

tion e

stim

ation

Fals

e d

iscovery

rate

corr

ection

Inte

gra

tion o

f m

eth

ods

ERCC4

ATRX

PIK3CA

RB1

IDH1

PTEN

EGFR

TP53

−****

Indiv

idual driver

dis

covery

meth

ods

Active

Drive

rWGS

NBR

ExIn

Ato

r

ncd

Dete

ct

ncD

riverC

ance

rType_indel

ncD

riverC

ance

rType_sn

v

ncD

riverC

onse

rvatio

n_indel

ncD

riverC

onse

rvatio

n_sn

v

onco

drive

FM

L_ve

st3

onco

drive

FM

L_ca

dd

dNdScv

MutS

ig

Drive

rPower

Burden

Functional impact

Clustering

Correlation

-1 0

ERCC4

EEF1A1

SLC30A3

ATRX

PIK3CA

RB1

IDH1

PTEN

EGFR

TP53

Coding genes

Promoters

lncRNAs

Ele

ments

Cohorts

Lists of

significant

elements

Post-

filtering o

f candid

ate

s

# mutations

# patients

Sequence properties

Signatures

ERCC4

ATRX

PIK3CA

RB1

IDH1

PTEN

EGFR

TP53

1

b

c

d

e

Drive

rPower

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 84: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

c

Figure 3

Sequence

palindrome

Alignment artifact

(mappability)

Technic

al

Bio

logic

al

Site-specific noise

UV + Lack of NER

b

Mutational process

AID somatic hypermutation

high

low

RNS7K

Alignability

Region of low alignability

a

d

Tumor reads

Patient 1

Patient 2

Patient n

...

...

Flagged mutation Pass mutation

Alternative allele

No

rma

l re

ad

s

fro

m a

ll p

atie

nts

e

Number of mutations

and patients

Mutational process

Mappability

% removed by filter

0 20 40 60 80 100

Sequence palindrome

Site−specific noise

Unique elements

0 20 40 60 80 100

Protein−codingPromoter (protein−coding)5'UTR

3'UTR

Enhancer

lncRNA

lncRNA promoter

miRNA

Small RNA

Indels

SNVs

Bladder-TCC

Breast-AdenoCa

Breast-LobularCa

Cervix-SCC

Lung-AdenoCa

Lung-SCC

Panc-AdenoCa

Panc-Endocrine

Thy-AdenoCa

GCTTTTTTGCAATTGAACAATTGCAAAATTGG

GC T

TT T T T G C A A T T

GA

AC

AATTGCAAAAT

TGG

*

*APOBEC mutation

**************

**

**

PLEKHS1 promoter

Element−cohort hits

% removed by filter

0

AP

OB

EC

Oth

er

% o

f m

uta

tions

PIM1

AID

Other

Mutational

signature

UV

RPL13A

DNAse signal

100

*

*

*

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 85: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

aFigure 4

b

−log10 q-value0 1 2 3 4 5

# patients with mutation in element

Prom

oter

(pro

tein

-cod

ing)

5’U

TR3’

UTR

Enhancer

lncR

NA

Prom

oter

(lncR

NA)

miRNA

small RNA

Biliary

−Ade

noC

aBl

adde

r−TC

CBo

ne−L

eiom

yoBo

ne−O

steo

sarc

Brea

st−A

deno

Ca

CN

S−G

BMC

NS−

Med

ullo

CN

S−Pi

loAs

troC

oloR

ect−

Aden

oCa

Eso−

Aden

oCa

Hea

d−SC

CKi

dney

−ChR

CC

Kidn

ey−R

CC

Live

r−H

CC

Lung

−Ade

noC

aLu

ng−S

CC

Lym

ph−B

NH

LLy

mph

−CLL

Mye

loid

−MPN

Ovary

−Ade

noC

aPa

nc−A

deno

Ca

Panc

−End

ocrin

ePr

ost−

Aden

oCa

Skin

−Mel

anom

aSt

omac

h−Ad

enoC

aTh

y−Ad

enoC

aU

teru

s−Ad

enoC

a

Aden

ocar

cino

ma

Car

cino

ma

Panc

ance

rMye

loid

Lym

phat

ic_s

yste

mH

emat

opoi

etic

_sys

tem

Glio

ma

CN

SBr

east

Fem

ale_

repr

oduc

tibve

_tra

ctKi

dney

Dig

estiv

e_tra

ctLu

ngSq

uam

ous

Sarc

oma

IFI44LPAX5

POLR3ERFTN1

HES1WDR74

TERT

0 20 60 100 140 0 20 60 100 140

DTLPTDSS1

RNF34MTG2

0 5 10 15 0 5 10 15 20

FOXA1SFTPB

TMEM107ALB

TOB1NFKBIZ

0 5 10 20 30 0 5 10 20 30

IGH locusTP53TG1

0 10 20 30 40 50 0 20 40 60 80

G025135RP11−92C4.6

G029190RNU12RMRP

RPPH1MALAT1

NEAT1

0 50 150 250 350 0 100 300 500

LINC00963MIR663A

RP11−440L14.1TRAM2−AS1

RNU12RMRP

0 20 40 60 80 0 20 40 60 80

hsa−mir−1420 5 10 15 0 5 10 15

RNU6−573P0 1 2 3 4 0 1 2 3 4

# mutations in all patients

SNV Indel

# pa

tient

s0

350

02000

Small RNAmiRNA

Enhancer3’UTR5’UTR

Promoter (protein-coding)Protein-coding

Number of element-cohort hits0 600 700

Number of unique elements0 150

lncRNA promoterlncRNA

0 10 20 30 0 2 4 6 8 10

c

Biliary−

Adeno

Ca

Bladde

r−TCC

Bone−

Leiomyo

Bone−

Osteos

arc

Breast−

Adeno

Ca

CNS−GBM

CNS−Med

ullo

CNS−Pilo

Astro

ColoRec

t−Ade

noCa

Eso−A

deno

Ca

Head−

SCC

Kidney

−ChR

CC

Kidney

−RCC

Liver−

HCC

Lung

−Ade

noCa

Lung

−SCC

Lymph

−BNHL

Lymph

−CLL

Myelo

id−MPN

Ovary

−Ade

noCa

Panc

−Ade

noCa

Panc

−End

ocrin

e

Prost−A

deno

Ca

Skin−M

elano

ma

Stomac

h−Ade

noCa

Thy−A

deno

Ca

Uterus

−Ade

noCa

Small RNAmiRNAlncRNA promoterlncRNAEnhancer3’UTR5’UTRPromoterProtein-coding

Num

ber o

f hits

0

10

20

30

40

50

Adeno

carci

noma

Carcino

ma

Panc

ance

r

Myelo

id

Lymph

atic_

syste

m

Hemato

poiet

ic_sy

stemGlio

maCNS

Breast

Female

_repro

ductive

_trac

t

Kidney

Digestive

_trac

tLu

ng

Squam

ous

Sarcom

a0

20

40

60

80

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 86: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

aFigure 5

b

c

P = 0.03

−2

0

2

4

wtmut.

RFT

N1

exp

ress

ion

z−sc

ore

Lym

ph−B

NH

L

Mutation type

SNVIndel

No event

AID mutation

AID

Non−AID

e

0

2.5

5

MTG

2 exp

ress

ion

z-sc

ore

Car

cino

ma

P = 0.029

wtmut.

CNAAmplificationGain

Loss

No event

2

0

2

KD

M2B

exp

ress

ion

z-sc

ore

Liv

er-H

CC

wtmut.

RNF34

P = 0.034

RNF34 KDM2BIntronic regulator

DNase Clusters

Indels

SNVs

Liver-HCC events

RFTN1

MTG2

% Eso-AdenoCa mutations

0 100

COSMIC17

Other

Lymph-BNHL events

Promoter

Conservation (PhyloP)

H3K4me3 (HepG2)

Indels

SNVs

Conservation (PhyloP)

Conservation (PhyloP)

H3K4me3 (HepG2)

Indels

SNVs

PromoterSS18L1 3’UTR

Transcription factor ChIP(ENCODE)

TEAD4NR2F2

CTCFIRF1HNF4A

SMC3RAD21ZNF143RCOR1

EP300CEBPBATF3JUND

Eso-AdenoCa events

Other digestive tract cancer events

Conservation (PhyloP)

H3K4me3 (HepG2)

Indels

SNVs

DNAse clusters

H3K4me3 (HepG2)

Enhancer regions

H3K4me1 (HepG2)

TP53TG1

d

Conservation (PhyloP)

H3K4me3 (GM12878)Indels

SNVs

IndelsSNVs

TP53

TP53 5’UTR

4

1 23 5 6

01

00

Expre

ssio

nle

vel (

FPK

M)

Kidney

-ChR

CC

Panc-A

deno

Ca

Breast-

Adeno

Ca

Lung

-SCC

Ova

ry-A

denoCa

Biliary-

Adeno

Ca

P = 0.0002

41 2 3 5 6

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 87: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Figure 6

ALB FOXA1 NFKBIZ TOB1SFTPB

Indel rate (mutations/Mb/patient)

0

2

4

6N

FK

BIZ

exp

ress

ion

z-sc

ore

Lym

ph-B

NH

L

wtmut.

P = 0.035

Conservation (PhyloP)

TOB1 3’UTR

Exons

Introns

3’UTR

1 kb downstream

36

425

12

41

0

4

7

3

9

4

10

9

2

31

8

0

2

3

9

0

b

c

ALB

Indels

SNVs

a

dConservation (PhyloP)

IndelsSNVs

NFKBIZ 3’UTR

miRNA sites (TargetScan)

Lymph-BNHL events

IndelsSNVs

miRNA sites (TargetScan)miRNA sites (Ago-Clip)

wtmut.

Mutation typeSNVIndelNo event

CNAAmplificationGainLossNo event

Indels

SNVs

e

GainLossNo event

0 30 0 15 0 8 0 4 0 10

P = 0.053

2

0

2

4

TO

B1 e

xpre

ssio

n z-

scor

eC

arci

nom

a

chr4: 74,250,000 74,300,000 74,350,000ALB AFP AFM

Liver-HCC(n = 314)

0

25

50

75

100

Perc

enta

ge o

f sam

ples

P = 1.72 x10-14

Other(n = 2,200)

Indels

SNVs

RMRP transcript RMRP promoter

Conservation (PhyloP)

Panc-Endocrine: G>C

ins. G>ACGTGG: Liver-HCC

C>UStomach-AdenoCA

Panc-AdenoCA: C>ULung-AdenoCA: C>A

G G U U

C

G

U

G

C

U

G

AA

G

GC

CU

GU

GC

AG

GC

A

GUGCGUG

U

CC

G C G C A C

C

AA

CC A C A

CG

G

G

G

C

U

C

A

UU

CU

C

A

G

C

G

C

G

G C U G U

1

1020

220

230

240

250

260

268

Panc-AdenoCA: C>ULung-AdenoCA: C>A

Panc-Endocrine: G>C

C>UStomach-AdenoCA

ins. G>ACGTGG: Liver-HCC

ins. G>AGLiver-HCC

G>U: Liver-HCC

A>G:Panc-AdenoCA

del. G:Breast-AdenoCa

A>U: Liver-HCC

del. UCCUCThy-AdenoCA

G G U U

C

G

U

G

C

U

G

AA

G

GC

CU

GU

A

U

CC

UA

GG

CU A C

AC

A

C

U

GA

GG

A CU

C

UGU

UC

CU

CCCC

UU

U

C

C

GC

CU

AG

GG

GA

AA

GUCCCCGGACCU

CG

G

G

C

A

GA

G

A

GU

G

C

C

AC

GU

GC

AU

A

C GC

AC

GU

A

GACAU

U

CCCCGCUU

C

C

C

A

CUC

C

A

A

AG U C

C

G

C

C

A

AG A

AG C G

U A

U

CCCGC

UGAG

C

G

G

C

G

U

G

G C

G C G G G GG C

G U C

A

U

C

C

G

UCAGCU C C C U C U A

GUUACG

CA

GG

C

A

GUGCGUG

U

CC

G C G C A C

C

AA

CC A C A

CG

G

G

G

C

U

C

A

UU

CU

C

A

G

C

G

C

G

G C U G U

1

1020

30

40

50

60

70

80

90

100

110

120

130

140

150 160

170

180

190

200

210

220

230

240

250

260

268

Predicted structural impact of mutations

Low HighP4 tertiary interaction sites RNA-protein interaction sites RNA-substrate interaction sites

Mutation Mutation with predicted structural impact (P < 0.1)

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 88: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

in out in out in out in out in out in out

0.0

1.0

Can

cer a

llele

frac

tions

NEAT1 MALAT1TP53 VHL KRAS BRAF

* ** ***** n.s. n.s.

TP53

VHL

KRAS

BRAF

NEAT1

MALAT1

LOH

ass

ocia

tion

fold

diff

eren

ce (l

og2)

−2

0

2

4

6

**

**

n.s.

a

b dc

chr11 NEAT1 MALAT1 10 kb

NEAT1MALAT1

IndelsSNVs

IndelsSNVs

Tumor suppressor

Oncogene

.s.n .s.n .s.n

e

Genom

e

(23,50

9 gen

es)

ALB

NEAT1

MALAT1

2-5 bp

enric

hed

(18 ge

nes)

0

50

100n=1027341 n=563

1 bp2−5 bp6−9 bp10+ bp

Perc

enta

ge o

f ind

els

Indel length

f

Figure 7

Exp

ress

ion

fold

diff

eren

ce (l

og2)

−4

−3

−2

−1

0

1

2

3

TP53

VHL

KRAS

BRAF

NEAT1

MALAT1

** ** ** ** n.s. n.s.

n=31n=139n=485

ExonsInt

rons

3’UTR

1 kb d

owns

tream

5’UTR

1kb u

pstre

am

Mut

atio

n ra

te

SNVsIndels

0246810

024681012

ExonsInt

rons

3’UTR

1 kb d

owns

tream

5’UTR

1kb u

pstre

am

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 89: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

a bFigure 8

Average detection sensitivity

Cum

ulat

ive

prob

abili

ty

c

0.5 1.00.0

1.0CDSPromoters5'UTRs3'UTRsEnhancerslncRNAs

% p

atie

nts

with

TE

RT

prom

oter

mut

atio

n

chr5:1,295,228 chr5:1,295,250

02468

10

Mut

atio

ns/M

b

Pan

can-

no-s

kin-

mel

anom

a-ly

mph

(n=2

278)

Car

cino

ma

(n=1

856)

Ade

noca

rcin

oma

(n=1

631)

Dig

estiv

e_tra

ct (n

=797

)Fe

mal

e_re

prod

uctiv

e_tra

ct (n

=382

)Li

ver-

HC

C (n

=314

)C

NS

(n=2

87)

Hem

atop

oiet

ic_s

yste

m (n

=235

)P

anc-

Ade

noC

a (n

=232

)B

reas

t (n=

208)

Pro

st-A

deno

Ca

(n=1

99)

Lym

phat

ic_s

yste

m (n

=197

)B

reas

t-Ade

noC

a (n

=195

)K

idne

y (n

=186

)G

liom

a (n

=146

)K

idne

y-R

CC

(n=1

43)

CN

S-M

edul

lo (n

=141

)S

quam

ous

(n=1

21)

Ova

ry-A

deno

Ca

(n=1

10)

Ski

n-M

elan

oma

(n=1

07)

Lym

ph-B

NH

L (n

=105

)E

so-A

deno

Ca

(n=9

7)S

arco

ma

(n=9

5)Ly

mph

-CLL

(n=9

0)C

NS

-Pilo

Ast

ro (n

=89)

Lung

(n=8

4)P

anc-

End

ocrin

e (n

=81)

Sto

mac

h-A

deno

Ca

(n=6

8)H

ead-

SC

C (n

=56)

Col

oRec

t-Ade

noC

a (n

=52)

Thy-

Ade

noC

a (n

=48)

Lung

-SC

C (n

=47)

Ute

rus-

Ade

noC

a (n

=44)

Kid

ney-

ChR

CC

(n=4

3)B

one-

Ost

eosa

rc (n

=41)

CN

S-G

BM

(n=3

9)M

yelo

id (n

=38)

Lung

-Ade

noC

a (n

=37)

Bili

ary-

Ade

noC

a (n

=34)

Bon

e-Le

iom

yo (n

=34)

Mye

loid

-MP

N (n

=23)

Bla

dder

-TC

C (n

=23)

Promoter5’UTR3’UTR

lncRNAEnhancer

CDS

15 14 14 12 9 9 7 7 7 8 7 7 8 7 7 7 6 8 7 10 7 7 6 6 4 8 6 7 7 9 5 7 7 5 5 6 4 7 6 5 4 612 12 11 10 8 8 6 7 7 7 6 7 7 7 6 6 5 7 6 9 7 7 6 5 4 7 5 7 6 8 4 7 6 4 5 5 4 6 5 5 4 516 16 15 14 9 9 7 7 8 8 7 7 8 7 7 7 5 8 7 11 7 8 6 5 4 8 5 8 7 11 5 7 6 5 5 6 4 6 6 5 4 616 16 15 14 9 9 7 7 7 8 6 7 7 7 7 7 5 8 7 11 7 8 5 5 4 8 5 7 6 11 5 7 6 5 5 6 4 6 6 5 4 612 11 11 10 8 8 6 6 6 7 5 6 7 6 6 6 5 7 6 9 6 7 5 5 4 7 5 6 5 8 4 6 6 4 5 5 4 6 5 5 4 618 17 16 14 10 9 8 8 8 8 7 8 8 8 7 8 6 9 8 12 8 8 6 5 4 9 6 8 7 11 5 8 7 5 5 6 5 7 6 5 4 6

0

600

1200

Median element length

0 5 10 15 20 25 30Minimum % powered (≥90%)

d

e

0.0

1.0 32 70 94 83 52 77 34 4 21 60 59 16 20 7 78 74 27 68 65 91 66 93 60 61 49 100 48

Det

ectio

n se

nsiti

vity

at c

hr5:

1,29

5,22

8

% of patients in cohort powered (≥90%)

Bili

ary-

Ade

noC

a

Bla

dder

-TC

C

Bon

e-Le

iom

yo

Bon

e-O

steo

sarc

Bre

ast-A

deno

Ca

CN

S-G

BM

CN

S-M

edul

lo

CN

S-P

iloA

stro

Col

oRec

t-Ade

noC

a

Eso

-Ade

noC

a

Hea

d-S

CC

Kid

ney-

ChR

CC

Kid

ney-

RC

C

Live

r-H

CC

Lung

-Ade

noC

a

Lung

-SC

C

Lym

ph-B

NH

L

Lym

ph-C

LL

Mye

loid

-MP

N

Ova

ry-A

deno

Ca

Pan

c-A

deno

Ca

Pan

c-E

ndoc

rine

Pro

st-A

deno

Ca

Ski

n-M

elan

oma

Sto

mac

h-A

deno

Ca

Thy-

Ade

noC

a

Ute

rus-

Ade

noC

a

Bili

ary-

Ade

noC

a

Bla

dder

-TC

C

CN

S-G

BM

CN

S-M

edul

lo

Col

oRec

t-Ade

noC

a

Hea

d-S

CC

Kid

ney-

RC

C

Live

r-H

CC

Lung

-Ade

noC

a

Lym

ph-C

LL

Ova

ry-A

deno

Ca

Ski

n-M

elan

oma

Thy-

Ade

noC

a0

20

40

60

80

10023

415

525

914

34

25

79

159171

01

01

01

1026

011

MissedCalled

MissedTotal

Inferred mutations

% of patients with mutation

Bla

dder

-TC

C

CN

S-G

BM

Hea

d-S

CC

Ski

n-M

elan

oma

03

17

13

1336

Poi

nt m

utat

ions

ob

serv

ed/e

xpec

ted

ratio

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Promote

r

5’UTR

Coding

3’UTR

3’UTR

71 [2

2-13

3]

32 [0

-79]

103

[25-

184]

37 [0

-88]

3,133[2,987-3,273]

downs

tream

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 90: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Extended Data Figure 1a

b

9

1010

8

10

10

Number of patients

Num

ber o

f pos

ition

s

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 160 165

KR

AS

TER

T

KR

AS

BR

AF

PIK

3CA

TP53RP

L13A

, TP

53

PIK

3CA

TER

T

IGH

locu

s, T

P53

TP53

TP53

IGH

locu

s

IGH

locu

s

TP53

0

10

100

103

104

105

106

107

0 %

100 %

Frac

tion

ofpo

sitio

ns

Protein-coding Non-coding

0 5 10 15 20 25 30 35 40 45 50 55 60 65Percentage of cohort mutated

0

0

14

13

11

12

0

0

0

0

0

0

11

12

12

0

0

0

15

1

14

17

11

13

12

1

12

0

0

0

1

0

17

0

0

17

0

19

1

0

0

29

0

0

3

0

1

0

0

0

Lym

ph−B

NH

L

0

0

2

3

54

0

0

0

0

0

0

6

4

5

0

0

0

3

0

4

0

7

5

6

0

7

0

0

0

0

0

4

3

0

6

0

6

0

0

0

4

0

0

0

0

0

0

2

0

Lym

ph−C

LL

0

0

0

0

0

0

0

0

0

0

3

2

0

0

0

0

2

0

0

1

0

0

0

0

0

1

0

0

0

2

0

1

0

1

2

0

2

0

3

0

2

0

0

51

13

0

0

0

0

Brea

st−A

deno

Ca

0

15

0

0

0

0

0

15

11

15

0

0

0

1

0

16

0

0

0

0

0

0

0

0

0

0

0

0

18

0

0

0

0

0

1

0

0

0

0

27

1

0

22

1

0

0

29

0

15

1

Skin

−Mel

anom

a

0

0

0

0

0

0

0

0

2

2

0

0

0

0

0

0

0

2

0

0

0

0

0

0

0

2

0

0

0

0

0

2

0

0

8

0

0

0

0

0

6

0

0

6

4

2

2

8

2

20

Col

oRec

t−Ad

enoC

a

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

5

0

3

0

0

0

0

0

7

0

0

0

4

0

3

0

2

4

0

4

0

5

0

5

0

0

5

6

0

0

3

0

3

Eso−

Aden

oCa

0

0

0

0

0

0

1

0

0

0

1

0

0

0

0

0

0

0

0

3

0

0

0

0

0

2

0

0

0

0

0

2

0

3

2

0

5

0

2

0

4

0

0

0

4

1

0

21

0

60

Pan

c−Ad

enoC

a

1

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

2

0

1

0

0

0

0

0

0

0

6

0

0

0

1

0

1

1

0

1

0

0

0

0

0

0

1

1

1

0

0

4

0

Live

r−H

CC

2

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

8

0

2

0

0

0

0

0

6

0

0

0

0

0

0

0

0

3

0

2

0

3

0

3

0

0

2

2

3

0

2

0

0

Stom

ach−

Aden

oCa

0

0

0

0

0

0

0

0

0

0

27

22

0

0

0

0

0

0

0

0

0

0

0

0

0

5

0

0

0

40

0

5

0

5

0

0

0

0

5

0

0

0

14

5

0

0

0

0

48

0

Blad

der−

TCC

6

0

0

0

0

0

8

0

0

0

0

0

0

0

0

0

6

1

0

0

0

0

0

0

0

0

0

0

0

0

1

2

0

0

0

0

1

0

0

0

0

0

0

0

1

1

0

0

0

1

Pros

t−Ad

enoC

a

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

5

0

0

0

0

0

0

0

0

0

0

0

5

0

5

0

0

3

0

5

0

3

0

0

5

7

3

0

0

0

5

Ute

rus−

Aden

oCa

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

4

0

0

0

0

2

0

0

0

6

0

4

15

2

2

0

0

6

2

Hea

d−SC

C

0

0

0

0

0

0

0

0

0

0

0

3

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

3

0

0

0

3

3

0

0

0

5

0

3

0

0

7

0

3

0

0

0

3

Lung

−SC

C

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

0

4

0

0

2

0

3

0

2

0

0

0

9

0

0

0

1

1

Ova

ry−A

deno

Ca

3

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

3

0

0

0

0

0

0

0

9

0

0

0

0

0

0

0

0

0

0

0

0

3

0

0

0

6

0

0

0

3

6

Bilia

ry−A

deno

Ca

0

0

0

0

0

0

0

0

0

0

0

9

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

3

0

0

0

0

3

0

0

0

0

3

0

0

0

6

3

14

Lung

−Ade

noC

a

0

0

0

0

0

0

0

0

3

0

5

5

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

5

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

21

0

23

0

Thy−

Aden

oCa

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

8

0

0

0

0

0

0

0

6

0

0

0

16

0

3

0

0

0

52

0

CN

S−G

BM

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

2

0

Kid

ney−

RC

C

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

0

0

4

0C

NS−

Med

ullo

0

0

0

0

0

0

0

0

0

0

0

2

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

0

0

0

0

0

0

0

0

0

0

0P

anc−

Endo

crin

e

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

4

0

0

0

CN

S−Pi

loAs

tro

0

0

0

0

0

0

0

0

3

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Kid

ney−

ChR

CC

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Mye

loid

−MPN

1

0

0

0

0

0

1

0

1

1

1

1

0

0

0

0

1

1

0

1

0

0

0

0

0

1

0

2

0

2

1

2

0

2

2

0

2

0

2

0

2

0

1

2

2

2

1

4

3

9

Car

cino

ma

0

0

7

7

7

7

0

0

1

0

0

0

7

7

7

0

0

0

8

1

8

8

8

8

8

1

8

0

0

0

1

0

9

1

0

10

0

11

1

0

0

15

0

0

2

0

1

0

1

0

Hem

atop

oiet

ic s

yste

m

1

0

0

0

0

0

1

0

1

1

1

1

0

0

0

0

1

2

0

1

0

0

0

0

0

1

0

2

0

1

1

1

0

2

2

0

2

0

2

0

2

0

0

2

2

3

1

4

2

10

Aden

ocar

cino

ma

0

0

8

8

8

8

0

0

0

0

0

0

9

8

9

0

0

0

9

1

9

9

9

9

9

1

10

0

0

0

1

0

11

2

0

12

0

13

1

0

0

17

0

0

2

0

1

0

1

0

Lym

phat

ic s

yste

m

1

0

0

0

0

0

1

0

1

1

1

0

0

0

0

0

0

3

0

2

0

0

0

0

0

2

0

3

0

1

0

2

0

2

3

0

2

0

2

0

3

0

0

2

3

1

1

7

2

20

Dig

estiv

e tra

ct

0

0

0

0

0

0

0

0

0

0

2

2

0

0

0

0

2

0

0

1

0

0

0

0

0

1

0

0

0

2

0

2

0

2

1

0

2

0

3

0

2

0

0

4

4

8

0

0

1

2

Fem

ale

repr

oduc

tive

syst

em

0

0

0

0

0

0

0

0

0

0

1

1

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

0

1

0

2

0

1

1

0

1

0

2

0

4

0

2

11

1

2

0

0

4

3

Squ

amou

s

0

0

0

0

0

0

0

0

0

0

3

2

0

0

0

0

2

0

0

1

0

0

0

0

0

1

0

0

0

2

0

1

0

1

2

0

2

0

2

0

2

0

0

51

14

0

0

0

0

Brea

st

0

0

0

0

0

0

0

0

0

0

0

5

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

0

0

0

3

2

0

0

0

4

0

2

0

0

5

0

2

0

3

2

8

Lung

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

7

0

0

0

1

0

1

0

1

0

0

0

3

0

1

1

2

0

13

0

CN

S

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

13

0

0

0

1

0

1

0

2

0

0

0

5

0

1

0

3

0

21

0

Glio

ma

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

0

2

0

Kid

ney

0

0

0

0

0

0

0

0

3

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Mye

loid

1:10359944215:9093138314:10632741714:10632935014:10632923614:106326877X:8396755220:325809271:11525653019:215179310:11551159310:11551159014:10632955014:10632985214:10632919612:498776X:839660253:16490371014:10624024317:757712114:10632755914:10632711514:10632661814:10632661914:106326887X:779111114:106329192X:1165793298:569871416:1427062062:20911311217:757753914:10632894217:757819017:757821214:10632671317:757709414:10632697217:757753819:4999069417:757840614:1063269445:12952503:17893609117:75771203:1789520857:14045313612:253982855:129522812:25398284

Gen

omic

pos

ition

IQGAP1

NRAS #

PLEKHS1PLEKHS1

KDM5A

TP53 #

RPS20GPR126IDH1 #TP53 #

TP53 # TP53 #

TP53 #

TP53 #RPL13ATP53 #

TERT #PIK3CA # TP53 #PIK3CA #BRAF #KRAS # TERT #KRAS #

Gen

e

RP5-1125A11.1

− 20

0 (7

.7 %

)−

150

(5.8

%)

− 10

0 (3

.9 %

)−

50 (1

.9 %

)−

0 (0

.0 %

)

1515151515151616161616161616161717171717171717171718181919192020202122222424272828

333538394146

5697

166

T > CC > AC > GC > T

T > A

T > G

Mutation types:

IntronProtein-codingPromoter (pc)

# Known driver hotspot

Gene annotation:Promoter (lncRNA) Center of palindrome

IGH locus

Positional annotation:

G12D/A/V

G12S/R/CV600EH1047L/RR273L/HE545Q/K

R175H

R248P/Q

R282W

R213*Y220C

R248G/WR132H

R273S/C

Q61K

Pro

tein

cha

nge

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 91: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

geno

mic

cov

erag

e (%

) 1.2

0.8

0.4

0.0

Coverage of tested genomic elements

enha

ncers 3'u

tr5'u

tr

promote

rslnc

RNAs

miRNAs

short

RNAs

protei

n-cod

ing

protei

n-cod

ing

gene

s lncRNA

promote

rs

10 1

10 2

10 3

10 4

10 5

1

-

-

-

--

--

-

-

37350855 4610258102881919639161803 6448 8872number

(bas

es)

elem

ent l

engt

h

Length distribution of tested genomic elements

a

b

Extended Data Figure 2

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 92: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

a

compo

siteDriv

er

ncdD

etect

onco

driveF

ML_ves

t3

ExInAtor

ActiveD

riverW

GSMutS

igLA

RVA

DriverP

ower

onco

driveF

ML_ca

dd NBR

dNdS

cv

ncDriv

erCon

serva

tion_

snv

ncDriv

erCon

serva

tion_

indel

Integ

rated

Integ

rated

after

method

remov

al

TTC24SPARCATP5BOR4E2EGFRDDRGK1TBL1XR1CTNNB1STK11RBM10KEAP1KRASTP53

−*− −

**

* ** − * * *

− − * * * ** * * * * * * *

* * * * * * * * * * ** * * * * * * * * * *

Lung

-Ade

noCa

TTC24SPARCATP5BOR4E2EGFRDDRGK1TBL1XR1CTNNB1STK11RBM10KEAP1KRASTP53

* Significant gene − Near-significant gene

filtered method

compo

siteDriv

er

ncdD

etect

onco

driveF

ML_ves

t3

ExInAtor

ActiveD

riverW

GSMutS

igLA

RVA

DriverP

ower

onco

driveF

ML_ca

ddNBR

dNdS

cv

Integ

rated

(sim

ulatio

ns)

Integ

rated

(obs

erved

)ERCC4BAG1EEF1A1MYSM1SLC30A3ATRXPIK3CARB1IDH1PTENEGFRTP53

* ** − −* * * * * *

* * * * * * * * * ** * * * * * * * * * * * ** * * * * * * * * * * * *

CNS-

GBM

0 5 10 15−log10 P

0 1 2 3 4

01

23

45

Broad simulation

Expected p−value (−log10)

Obs

erve

d p−

valu

e (−

log1

0)

ncdDetectoncodriveFML_vest3ExInAtorActiveDriverWGSDriverPoweroncodriveFML_cadddNdScvNBRMutSigncDriverConservation_snvncDriverCancerType_snv

0 1 2 3 40

12

34

Sanger simulation

Expected p−value (−log10)

ncdDetectoncodriveFML_vest3ExInAtorActiveDriverWGSDriverPoweroncodriveFML_cadddNdScvNBRMutSigncDriverConservation_indelncDriverConservation_snvncDriverCancerType_snvncDriverCancerType_indel

0 1 2 3 4

01

23

45

DKFZ simulation

Expected p−value (−log10)

ncdDetectoncodriveFML_vest3ExInAtorActiveDriverWGSDriverPoweroncodriveFML_caddNBRdNdScvncDriverConservation_indelncDriverConservation_snvMutSig

0 1 2 3 4

05

1015

2025

30

Observed

Expected p−value (−log10)

Obs

erve

d p−

valu

e (−

log1

0)

0 1 2 3 4

05

1015

2025

30

Broad simulation

Expected p−value (−log10)

0 1 2 3 4

05

1015

2025

30

DKFZ simulation

Expected p−value (−log10)

0 1 2 3 4

05

1015

2025

30

Sanger simulation

Expected p−value (−log10)

Colo

rect

al A

deno

carc

inom

aUt

erus

Ade

noca

rcin

oma

0 1 2 3 4

05

1015

20

0 1 2 3 4

05

1015

2025

0 1 2 3 4

05

1015

2025

0 1 2 3 4

05

1015

2025

0 1 2 3 4

05

1015

20

0 1 2 3 4

05

1015

20

0 1 2 3 4

05

1015

20

0 1 2 3 4

05

1015

20

Obs

erve

d p−

valu

e (−

log1

0)O

bser

ved

p−va

lue

(−lo

g10)

Lung

Ade

noca

rcin

oma

Fisher combined p-values Brown combined p-values

b c

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCan

cerTy

pe_s

nv

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

Color Key

Observed Broad simulation Difference

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCan

cerTy

pe_s

nv

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCan

cerTy

pe_s

nv

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCan

cerTy

pe_in

del

ncDriv

erCan

cerTy

pe_s

nv

ncDriv

erCon

serva

tion_

indel

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCan

cerTy

pe_in

del

ncDriv

erCan

cerTy

pe_s

nv

ncDriv

erCon

serva

tion_

indel

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCan

cerTy

pe_in

del

ncDriv

erCan

cerTy

pe_s

nv

ncDriv

erCon

serva

tion_

indel

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

Observed Sanger simulation Difference

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCon

serva

tion_

indel

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

Observed DifferenceDKFZ simulation

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCon

serva

tion_

indel

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

ActiveD

riverW

GS

DriverP

ower

ExInAtor

MutSig

NBR

dNdS

cv

ncDriv

erCon

serva

tion_

indel

ncDriv

erCon

serva

tion_

snv

ncdD

etect

onco

driveF

ML_ca

dd

onco

driveF

ML_ves

t3

-1 0 0.5 1-0.5Spearmancorrelation

-

-

-

=

=

=

d

Extended Data Figure 3

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 93: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Biliary

−Ade

noC

aBl

adde

r−TC

CBo

ne−L

eiom

yoBo

ne−O

steo

sarc

Brea

st−A

deno

Ca

CN

S−G

BMC

NS−

Med

ullo

CN

S−Pi

loAs

troC

oloR

ect−

Aden

oCa

Eso−

Aden

oCa

Hea

d−SC

CKi

dney

−ChR

CC

Kidn

ey−R

CC

Live

r−H

CC

Lung

−Ade

noC

aLu

ng−S

CC

Lym

ph−B

NH

LLy

mph

−CLL

Mye

loid

−MPN

Ovary

−Ade

noC

aPa

nc−A

deno

Ca

Panc

−End

ocrin

ePr

ost−

Aden

oCa

Skin

−Mel

anom

aSt

omac

h−Ad

enoC

aTh

y−Ad

enoC

aU

teru

s−Ad

enoC

a

Aden

ocar

cino

ma

Brea

stC

arci

nom

aC

NS

Dig

estiv

e_tra

ctFe

mal

e_re

prod

uctiv

e_tra

ctG

liom

aH

emat

opoi

etic

_sys

tem

Kidn

eyLu

ngLy

mph

atic

_sys

tem

Mye

loid

Sarc

oma

Squa

mou

sPa

ncan

cer

Panc

ance

r (no

lym

phom

a)Pa

ncan

cer (

no m

elan

oma)

Panc

ance

r (no

mel

anom

a/lym

phom

a)

ncDriverConservation_indelncDriverConservation_snvncDriverCancerType_indelncDriverCancerType_snvcompositeDriverLARVAoncodriveFML_caddoncodriveFML_vest3ExInAtorncdDetectActiveDriver2NBRDriverPowerMutSigdNdScvBrown_observed

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2

7 1 8 2 5 0 0 0 0 0 0 0 0 0 4 3 4 210 0 11 6 7 0 5 0 0 0 0 0 1 0 20 14 20 14

2 0 0 1 4 2 3 1 4 4 3 0 3 10 2 2 6 2 0 1 3 2 2 4 0 62 1 0 1 4 3 3 0 2 2 5 1 2 3 2 1 5 4 1 1 3 2 3 1 3 0 50 3 0 0 4 2 3 0 5 5 2 0 3 7 1 3 7 1 0 1 6 4 0 4 2 0 2 34 6 43 7 21 8 3 9 3 6 9 0 2 6 50 44 52 420 1 0 1 5 2 6 0 3 5 2 0 3 6 1 4 10 0 0 1 5 3 1 6 2 1 4 37 6 45 11 20 9 3 10 3 7 9 0 1 8 53 47 53 511 2 0 1 8 3 3 0 4 5 2 0 4 6 2 2 26 1 0 1 10 3 3 5 2 1 4 22 9 28 9 12 9 7 32 5 4 29 0 2 6 42 29 47 322 2 1 1 5 4 2 3 3 3 1 2 3 3 2 38 6 2 1 5 2 4 3 2 4 9 5 9 10 7 5 8 40 3 3 38 3 5 16 11 18 103 2 2 1 8 3 7 3 8 5 5 0 6 10 4 4 28 6 0 1 11 5 4 4 3 1 5 33 8 38 14 20 16 9 35 7 6 35 3 3 8 58 58 431 3 2 2 10 4 5 4 4 5 5 1 5 11 3 4 28 3 1 1 9 5 5 4 5 1 6 34 13 40 10 19 13 8 31 6 6 30 3 7 49 545 6 2 1 8 4 9 3 11 9 5 1 6 11 4 5 23 3 0 1 12 4 4 5 1 8 39 14 43 15 26 16 11 27 8 6 26 3 3 7 52 524 3 2 2 9 4 6 2 11 8 6 0 5 13 5 5 38 7 0 2 12 3 5 6 5 2 11 50 9 55 17 30 18 11 41 5 8 42 3 4 124 5 2 2 12 6 9 4 7 4 5 1 6 11 5 5 26 4 1 4 12 5 5 6 4 1 6 46 14 54 22 26 18 13 35 6 9 30 3 11 75 763 4 2 2 11 4 10 5 13 8 5 1 5 13 5 7 33 6 2 2 14 5 4 7 5 2 9 50 16 58 21 30 19 13 39 7 9 37 2 4 11 80 58 79 62

Meta-cohortsIndividual tumor types

Genes ranked by p−value Genes ranked by p−value Genes ranked by p−value0 20 40 60 80

0

20

40

60

80

100

0 2 4 6 8 10 12 14

0

20

40

60

80

100

compositeDriver (n=4/4)ncdDetect (n=4/5)oncodriveFML_vest3 (n=4/6)ExInAtor (n=6/10)ActiveDriver2 (n=6/8)MutSig (n=7/9)LARVA (n=3/9)DriverPower (n=6/8)oncodriveFML_cadd (n=3/4)

dNdScv (n=9/12)NBR (n=8/11)ncDriverConservation_indel (n=0/0)ncDriverConservation_snv (n=0/0)Brown_observed (n=8/11)

0 5 10 15

0

20

40

60

80

100

compositeDriver (n=4/7)ncdDetect (n=3/3)oncodriveFML_vest3 (n=3/6)ExInAtor (n=4/4)ActiveDriver2 (n=7/10)MutSig (n=7/14)LARVA (n=2/2)DriverPower (n=9/15)oncodriveFML_cadd (n=5/8)NBR (n=4/6)

dNdScv (n=6/9)ncDriverConservation_indel (n=0/0)ncDriverConservation_snv (n=0/0)Brown_observed (n=9/16)

Carcinoma Breast−AdenoCa ColoRect−AdenoCaa b c

Perc

enta

ge o

f wel

l-kno

wn

canc

er g

enes

ncdDetect (n=9/11)oncodriveFML_vest3 (n=34/53)ExInAtor (n=23/38)ActiveDriver2 (n=32/48)MutSig (n=42/90)DriverPower (n=36/62)oncodriveFML_cadd (n=35/54)NBR (n=31/47)

dNdScv (n=43/79)ncDriverConservation_snv (n=2/5)ncDriverConservation_indel (n=0/0)ncDriverCancerType_indel (n=7/9)ncDriverCancerType_snv (n=11/16)Brown_observed (n=47/74)

d

Extended Data Figure 4

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 94: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Extended Data Figure 5

0 5 10 15 20 25

05

1015

2025

Most significant individual cohort (−log10 q-value)

Mos

t sig

nific

ant m

eta

coho

rt (−

log 10

q-va

lue)

ALB

FOXA1

NFKBIZSFTPB

TMEM107

TOB1

DTL

MTG2

PTDSS1

RNF34

ACVR1B

ACVR2A

AJUBA

AKT1

ALB

APC

APOB

ARHGAP35

ARID1A

ARID2

ATMATRX

AXIN1

B2M

BAP1

BAXBCOR

BRAF

BRD7

BRPF1

BTG1

CAMK1

CARD11

CASP8

CBFB

CCND3

CDH1

CDK12

CDKN1A

CDKN1B

CDKN2A

CIC

CREBBP

CTC−512J12.6

CTCF

CTDNEP1

CTNNB1

DAXX

DDX3X

DNMT3A

A1KRYD 1I1CNYD

EBF1

EEF1A1

EGFR

ELF3EPHA2

EZH2

FAT1

FBXO11

FBXW7

FGFR1

FGFR3

FOXA1

FUBP1

GATA3

GNA13

GNAI2

GNAS

GRB2H3F3A

H3F3B

HIST1H1B

HIST1H1C

HIST1H2AG

HIST1H2BC

HIST1H4D

HLA−B

HNF4A

HRAS

IDH1

IDH2

IRF4

ITPKB

KBTBD4

KDM5C

KDM6A

KEAP1

KLHL6

KMT2C

KMT2D

KRAS

LATS1

LDB1

MAP2K4

MAP2K7

MAP3K1

MCL1

MEF2B

MEN1

MTOR

MYD88

NCOR1

NF1

NFE2L2

NFKBIE

NOTCH1

NOTCH2

NRAS

NXF1

PA2G4

PAX5

PBRM1

PCBP1

PHF6

PIK3CA

PIK3R1

PLK1

PPP2R1APRKAR1APRKCD

PTCH1

PTEN

PTPN11

RAC1

RB1

RBM10

RELA

RFTN1

RHOA

RIPK4

RNF43

RPL22

RPL5

RPS6KA3

RRAGC

SETD2

SETDB1

SF3B1

SMAD2

SMAD4

SMARCA4

SMARCB1

SMOSOX9

SPOP

SRSF7

STAT3

STAT6

STK11

TBL1XR1

TBR1

TBX3

2L7FCT 4FCT

TET2

TGFBR2

TMEM30A

TMSB4X

TNFRSF14

TP53

TSC1U2AF1

VHL

ZFP36L2

ZNF292IGH locus

TP53TG1

G025135

G029190

MALAT1

NEAT1

RMRP

RNU12RP11−92C4.6

RPPH1

LINC00963

MIR663A

RMRP

RP11−440L14.1

TRAM2−AS1G085970

hsa−mir−142

HES1

IFI44L

PAX5

POLR3E RFTN1

TERT

WDR74

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 95: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Extended Data Figure 6#

patie

nts

030

0

020

00

APOBBAX

BCORBRPF1CAMK1DYNC1I1DYRK1AEEF1A1

FAT1FGFR3GNASH3F3AH3F3B

HIST1H4DHNF4AIDH2IRF4

LATS1NCOR1NOTCH2PA2G4PCBP1PHF6PLK1RAC1RELARIPK4SMAD2

SMARCB1TBR1TBX3TCF4

TCF7L2TSC1U2AF1ZNF292AJUBACDK12

CICCTC−512J12.6

CTDNEP1DAXX

DNMT3AELF3

FUBP1HIST1H2AG

ITPKBKBTBD4KDM5CLDB1MCL1MTORPAX5

PPP2R1APRKAR1APTPN11RPL5

SETDB1SMOTET2ATRXBTG1

CARD11CCND3CDKN1B

EBF1EPHA2FBXO11FGFR1GNA13GNAI2GRB2

HIST1H1BHIST1H1CHIST1H2BC

HLA−BKLHL6MEF2BMEN1NXF1

PRKCDPTCH1RFTN1RHOA

RRAGCSRSF7STAT3STAT6STK11

TMEM30ATMSB4X

TNFRSF14BRD7

CREBBPCTCFEGFREZH2

HLA−AHRASIDH1

MYD88NFKBIE

RPS6KA3SOX9SPOP

TBL1XR1ACVR1B

AKT1ALB

ARHGAP35ARID2AXIN1CASP8FOXA1

MAP2K7RNF43RPL22

SMARCA4TGFBR2

VHLAPCCBFBCDH1

CDKN1ADDX3XGATA3KEAP1MAP3K1NOTCH1PBRM1PIK3R1SF3B1

ZFP36L2ACVR2A

ATMBAP1

KMT2CNF1

RBM10SETD2B2M

CTNNB1FBXW7KDM6AMAP2K4NFE2L2NRASKMT2DBRAF

CDKN2ARB1

SMAD4KRAS

ARID1APIK3CAPTENTP53

0 200 400 600 800 0 200 400 600 800# patients with

mutation in element# mutations in all patients

−log10 q-value0 1 2 3 4 5

SNV Indel

Bilia

ry−A

denoCa

Bladder−TC

CBo

ne−Leiom

Bone−O

steosarc

Breast−A

denoCa

CNS−

GBM

CNS−

Medullo

CNS−

PiloAstro

ColoR

ect−Ad

enoC

aEso−Ad

enoC

aHead−SC

CKi

dney−C

hRCC

Kidn

ey−R

CC

Liver−HCC

Lung−A

denoCa

Lung−S

CC

Lymph−B

NHL

Lymph−C

LLM

yeloid−M

PNO

vary−A

denoCa

Panc−A

denoCa

Panc−E

nocrine

Prost−Ad

enoC

aSkin−M

elanom

aStom

ach−Ad

enoC

aThy−Ad

enoC

aU

terus−Ad

enoC

a

Adenocarcinoma

Carcinoma

Pancancer

Myeloid

L ymphatic_system

Hem

atopoietic_system

Glioma

CNS

Brea

stFemale_reproductive_tract

Kidn

eyDigestive_tract

Lung

Squa

mou

sSa

rcom

a

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 96: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

a

ALBKRTAP4−3ANKRD39PSMB7DTWD2OGDHZSCAN5BERP27CD14GLE1XRCC2CASC5TMEM183AAL590714.1PCF11KRT222HIST1H2BKAC008914.1HIST1H4ATTRHES1IER3IFI44LWDR74TERT

*

* − −* * * * ** * * * * ** * * * * * *

ncdD

etec

t

Activ

eDriv

erW

GS

Mut

Sig

LARV

A

Driv

erPo

wer

onco

driv

eFM

L_ca

dd

ncD

river

Con

serv

atio

n_sn

v

ncD

river

Con

serv

atio

n_in

del

Inte

grat

ion

LEMD3HIST1H2BGNR4A2BPTFPRR19GLI2TRIML2VAV3GPRC5AMEA1CTD−2021H9.3GLYR1ARPC2SPAG5CALM2FGAHSP90AB1AHSA2AFF4TMEM107DBPRFC4USP34SPHK2ALB

−−

−*

−−

−− −− −

−* * *

−log10 p-value

com

posi

teD

river

ncdD

etec

t on

codr

iveF

ML_

vest

3 Ex

InAt

or

Activ

eDriv

erW

GS

Mut

Sig

LARV

AD

river

Pow

eron

codr

iveF

ML_

cadd

NBR

dN

dScv

nc

Driv

erC

onse

rvat

ion_

snv

ncD

river

Con

serv

atio

n_in

de

UBBP4ALDOBTRAF6TNRC6BSEC23ASDHBCDKN2ATYW5USP15TSC2VIMPRKACAC13orf45CHPT1VAV3TSC1HIST1H2BDIDH1G6PCCRIP3C11orf57HNF4ARPL5ACVR2ADYRK1ACAMK1SETDB1EEF1A1RPL22APOBKRTAP5−11PTENCDKN1AKEAP1BRD7RB1BAP1RPS6KA3NFE2L2ALBARID1AARID2AXIN1TP53CTNNB1

* −

−− −

−*

* −

**

* − −* −

* −* * −

* −* * −

* − * − *− * * − *

− ** ** * * * *− − − ** − − * * *

− * − * ** * * * * *

* − * * * * ** * * − − * * *

− * * * * * *− − * * * * *

* * * * * * * ** * * * * * − * *

− − * * * * * * ** * * * * * *

− * * * * * * ** * − * * * * * * *

* * * * − * ** * * * * * * * ** * * * * * * * * ** − * * * * * * * * ** * * * * * * * * * * ** * * * * * * * * * * *

3’UTR

Promoter

Protein-coding

com

posi

teD

river

nc

dDet

ect

Activ

eDriv

erW

GS

Mut

Sig

LARV

AD

river

Pow

eron

codr

iveF

ML_

cadd

N

BR

ncD

river

Con

serv

atio

n_sn

v nc

Driv

erC

onse

rvat

ion_

inde

l In

tegr

atio

n

Included methods: 8Included methods: 10

Included methods: 13

Inte

grat

ion

Liver−HCC

Liver−HCC

Liver−HCC

0 5 10 15

0 5 10 15

0 5 10

b c

No p-value calculated

* FDR < 0.1- FDR < 0.25

Extended Data Figure 7

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 97: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

0

5

10

15

TERT

expr

essi

on z−s

core

CN

S_t

umor

s

Mutation type

Copy number event

SNVno event

CNA gainno event

promoter mut. wt

P=5.3e−04

0

20

40

60

80

100

TER

T ex

pres

sion

z−s

core

Thy−

Ade

noC

A

Mutation type

Copy number event

SNVno event

CNA gainno event

promoter mut. wt

P=7.3e−08

0

20

40

60

80

100

TER

T ex

pres

sion

z−s

core

Car

cino

ma_

tum

ors Mutation type

Copy number event

SNVno event

CNA amplificationCNA gainCNA lossno event

promoter mut. wt

P=0.025

−2

−1

0

1

2

3

4

PAX5

expr

essi

on z−s

core

Lym

ph_t

umor

s

Mutation type

Copy number event

indelSNVno event

CNA gainCNA lossno event

promoter mut. wt

P=0.27

−2

0

2

4

6

8

HES1

expr

essi

on z−s

core

Car

cino

ma_

tum

ors Mutation type

Copy number event

SNVno event

CNA amplificationCNA gainCNA lossno event

promoter mut. wt

P=0.45

−2

0

2

4

6

8

10

HES1

expr

essi

on z−s

core

panc

an

Mutation type

Copy number event

SNVno event

CNA amplificationCNA gainCNA lossno event

promoter mut. wt

P=0.42

−1

0

1

2

3

4

5

PTDSS1

expr

essi

on z−s

core

Dig

estiv

e_tra

ct_t

umor

s

Mutation type

Copy number event

SNVno event

CNA gainCNA lossno event

5'UTR mut. wt

P=0.15

−1

0

1

2

3

4

DTL

exp

ress

ion

z−sc

ore

Lung−A

deno

CA Mutation type

Copy number event

SNVno event

CNA gainCNA lossno event

5'UTR mut. wt

P=0.81

−1

0

1

2

3

INTS7

expr

essi

on z−s

core

Lung−A

deno

CA Mutation type

Copy number event

SNVno event

CNA gainCNA lossno event

5'UTR mut. wt

P=0.16

0

1

2

3

4

IFI44L

exp

ress

ion

z−sc

ore

Live

r−H

CC

Mutation type

Copy number event

SNVno event

CNA gainCNA lossno event

promoter mut. wt

P=0.53

−1

0

1

2

3

4

5

POLR3E

exp

ress

ion

z−sc

ore

Ova

ry−A

deno

CA Mutation type

Copy number event

SNVno event

CNA gainCNA lossno event

promoter mut. wt

P=0.25

Extended Data Figure 8.CC-BY-NC-ND 4.0 International licenseavailable under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 98: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

192,400 kb

TCGA-AO-A0JM-01A-21D-A059-01TCGA-AC-A23H-01A-11D-A160-01TCGA-A2-A3XZ-01A-42D-A238-01TCGA-D8-A27N-01A-11D-A16C-01TCGA-C8-A12M-01A-11D-A111-01TCGA-A2-A1FW-01A-11D-A13J-01TCGA-C8-A137-01A-11D-A111-01TCGA-A8-A076-01A-21D-A011-01TCGA-GM-A2DA-01A-11D-A18N-01TCGA-DD-A73D-01A-12D-A32F-01TCGA-L5-A8NG-01A-11D-A37B-01TCGA-C8-A12T-01A-11D-A111-01TCGA-21-A5DI-01A-31D-A26L-01TCGA-EY-A1GJ-01A-12D-A141-01TCGA-AR-A24U-01A-11D-A166-01TCGA-BH-A0DG-01A-21D-A12N-01TCGA-EW-A6SC-01A-12D-A32H-01TCGA-D8-A1XZ-01A-11D-A14J-01TCGA-A2-A25B-01A-11D-A166-01TCGA-AN-A0FW-01A-11D-A036-01TCGA-ZF-A9R3-01A-11D-A38F-01TCGA-NA-A4R1-01A-11D-A28Q-01TCGA-C8-A12U-01A-11D-A111-01TCGA-E2-A1LB-01A-11D-A141-01TCGA-A8-A07I-01A-11D-A011-01TCGA-DD-AACJ-01A-11D-A40Q-01TCGA-ND-A4WA-01A-12D-A28Q-01TCGA-29-2429-01A-01D-1043-01TCGA-BR-4369-01A-01D-1155-01TCGA-D5-6537-01A-11D-1717-01TCGA-BH-A0DD-01A-31D-A12N-01TCGA-W3-AA21-06A-11D-A38F-01TCGA-AN-A0XW-01A-11D-A107-01TCGA-AO-A0JG-01A-31D-A087-01TCGA-BH-A0BP-01A-11D-A111-01TCGA-09-1665-01B-01D-0533-01TCGA-IF-A3RQ-01A-11D-A227-01TCGA-BH-A0HW-01A-11D-A036-01TCGA-C8-A1HI-01A-11D-A134-01TCGA-D1-A0ZP-01A-21D-A10L-01TCGA-E9-A1RF-01A-11D-A160-01TCGA-56-A4BY-01A-11D-A24C-01TCGA-AR-A0TY-01A-12D-A111-01TCGA-AO-A0J3-01A-11D-A036-01TCGA-BR-4357-01A-01D-1155-01TCGA-EO-A2CH-01A-11D-A17U-01TCGA-BH-A0HB-01A-11D-A059-01TCGA-DX-AB3A-01A-11D-A416-01TCGA-A8-A08J-01A-11D-A011-01TCGA-BL-A0C8-01B-04D-A273-01TCGA-D3-A8GP-06A-11D-A371-01TCGA-10-0931-01A-01D-0399-01TCGA-S3-AA14-01A-11D-A41E-01TCGA-AP-A053-01A-21D-A00X-01TCGA-N7-A59B-01A-11D-A28Q-01TCGA-B9-A5W9-01A-11D-A28F-01TCGA-2Y-A9H0-01A-11D-A381-01TCGA-NF-A4WX-01A-11D-A28Q-01TCGA-A2-A0T3-01A-21D-A111-01TCGA-04-1356-01A-01D-0452-01TCGA-AQ-A0Y5-01A-11D-A14J-01TCGA-BH-A18R-01A-11D-A12A-01TCGA-A2-A0D1-01A-11D-A036-01TCGA-EW-A1OZ-01A-11D-A141-01TCGA-UD-AAC4-01A-11D-A39Q-01TCGA-EW-A2FV-01A-11D-A17C-01TCGA-A8-A09W-01A-11D-A011-01TCGA-G3-A25S-01A-11D-A16U-01TCGA-E9-A5UP-01A-11D-A28A-01TCGA-A7-A26H-01A-11D-A166-01TCGA-EA-A5ZF-01A-11D-A28A-01TCGA-50-5944-01A-11D-1752-01TCGA-E2-A14T-01A-11D-A111-01TCGA-3C-AALK-01A-11D-A41E-01TCGA-BH-A0EE-01A-11D-A036-01TCGA-UT-A88E-01B-11D-A39Q-01TCGA-12-0818-01A-01D-0384-01TCGA-AG-3609-01A-02D-0824-01TCGA-D8-A27V-01A-12D-A17C-01TCGA-AP-A5FX-01A-11D-A27O-01TCGA-E9-A1NA-01A-11D-A141-01TCGA-B3-4104-01A-02D-1348-01TCGA-QQ-A5VC-01A-31D-A32H-01TCGA-FJ-A3ZE-01A-11D-A23L-01TCGA-AN-A0AJ-01A-11D-A011-01TCGA-E9-A2JS-01A-11D-A17U-01TCGA-D8-A140-01A-11D-A111-01TCGA-A8-A07P-01A-11D-A011-01TCGA-H8-A6C1-01A-11D-A32M-01TCGA-WB-A81J-01A-11D-A35H-01TCGA-LN-A9FQ-01A-31D-A386-01TCGA-OR-A5LT-01A-11D-A29H-01TCGA-VQ-AA6D-01A-11D-A40Z-01TCGA-A8-A08F-01A-11D-A011-01TCGA-13-1487-01A-01D-0472-01TCGA-DI-A2QY-01A-12D-A19X-01TCGA-BR-8289-01A-11D-2338-01TCGA-AN-A04A-01A-21D-A036-01TCGA-52-7810-01A-11D-2121-01TCGA-AA-3556-01A-01D-0819-01TCGA-55-6712-01A-11D-1854-01TCGA-XF-AAMH-01A-11D-A42D-01TCGA-95-7043-01A-11D-1943-01TCGA-KV-A74V-01A-11D-A33P-01TCGA-D8-A27G-01A-11D-A16C-01TCGA-A7-A0CD-01A-11D-A011-01TCGA-3A-A9IV-01A-11D-A40V-01TCGA-IB-7891-01A-11D-2200-01TCGA-C8-A1HM-01A-12D-A134-01TCGA-AR-A0TW-01A-11D-A087-01TCGA-G3-AAUZ-01A-11D-A381-01TCGA-BQ-5891-01A-11D-1588-01TCGA-BH-A0HX-01A-21D-A059-01TCGA-HS-A5N8-01A-11D-A26F-01TCGA-55-1595-01A-01D-0944-01TCGA-G7-6790-01A-11D-1959-01TCGA-E9-A22A-01A-11D-A160-01TCGA-63-7023-01A-11D-1943-01TCGA-JU-AAVI-01A-11D-A402-01TCGA-GL-A59T-01A-21D-A28F-01TCGA-BH-A0DK-01A-21D-A059-01TCGA-EW-A1OY-01A-11D-A141-01TCGA-G7-6789-01A-11D-1959-01TCGA-29-1693-01A-01D-0559-01TCGA-UY-A8OD-01A-11D-A363-01TCGA-2L-AAQM-01A-11D-A396-01TCGA-E2-A1IG-01A-11D-A141-01TCGA-E9-A54Y-01A-11D-A25N-01TCGA-E2-A1B1-01A-21D-A12N-01TCGA-13-0921-01A-01D-0399-01TCGA-41-2573-01A-01D-0911-01TCGA-BC-A10Y-01A-11D-A12Y-01TCGA-EW-A6SA-01A-21D-A32H-01TCGA-58-A46N-01A-11D-A24C-01TCGA-BH-A0BC-01A-22D-A087-01TCGA-FR-A3YN-06A-11D-A238-01TCGA-B6-A0IG-01A-11D-A036-01TCGA-P3-A6T6-01A-11D-A34I-01TCGA-NK-A7XE-01A-12D-A400-01TCGA-G7-6795-01A-11D-1959-01TCGA-G7-A8LD-01A-11D-A35Y-01TCGA-DD-A3A7-01A-11D-A22E-01TCGA-AR-A2LL-01A-11D-A17U-01TCGA-10-0936-01A-01D-0399-01TCGA-VR-AA4D-01A-11D-A37B-01TCGA-FB-A545-01A-11D-A26H-01TCGA-AH-6897-01A-11D-1923-01TCGA-AN-A04C-01A-21D-A036-01TCGA-N6-A4V9-01A-11D-A28Q-01TCGA-61-1722-01A-01D-0577-01TCGA-85-8070-01A-11D-2243-01TCGA-GV-A3JZ-01A-11D-A219-01TCGA-B9-4117-01A-02D-1348-01TCGA-B4-5835-01A-11D-1668-01TCGA-EY-A2OQ-01A-11D-A19X-01TCGA-N5-A4RD-01A-11D-A28Q-01TCGA-AA-A00L-01A-01D-A003-01TCGA-BH-A0HY-01A-11D-A059-01

48,600 kb

p24.1 p22.3 p21.3 p21.1 p13.2 p11.2 q12 q13 q21.13 q21.32 q22.2 q22.33 q31.2 q32 q33.2 q34.11 q34.3

193,000 kb 194,000 kbchr3

CACNA1G ABCC3 ANKRD40 LUC7L3 LINC00483 WFIKKN2 TOB1 SPAG9

48,700 kb 49,000 kb

TESK1 SIT1 RMRP CA9 TPM2 TLN1 CREB3

MB21D2 HRASLS ATP13A4 OPA1 LOC647323 HES1 LINC00887 GP5 LINC00884 TMEM44 FAM43ATCGA-VQ-A922-01A-11D-A40Z-01TCGA-BR-7723-01A-11D-2052-01TCGA-BR-4357-01A-01D-1155-01TCGA-CD-8527-01A-11D-2338-01TCGA-BR-8360-01A-11D-2338-01TCGA-VQ-A8PJ-01A-11D-A40Z-01TCGA-VQ-A91Z-01A-11D-A40Z-01TCGA-B7-5818-01A-11D-1599-01TCGA-VQ-A91Q-01A-12D-A40Z-01TCGA-CD-A487-01A-21D-A24C-01TCGA-BR-4191-01A-02D-1130-01TCGA-D7-A6F0-01A-11D-A31K-01TCGA-IN-A6RN-01A-12D-A33S-01TCGA-FP-8211-01A-11D-2338-01TCGA-EQ-5647-01A-01D-1599-01TCGA-CG-4437-01A-01D-1799-01TCGA-RD-A7BS-01A-11D-A32M-01TCGA-CD-5800-01A-11D-1599-01TCGA-CG-4472-01A-01D-1155-01TCGA-BR-8060-01A-11D-2338-01TCGA-CD-A48C-01A-11D-A24C-01TCGA-B7-A5TN-01A-21D-A31K-01TCGA-B7-5816-01A-21D-1599-01TCGA-FP-7735-01A-11D-2052-01TCGA-VQ-A925-01A-11D-A40Z-01TCGA-CG-5725-01A-11D-1599-01TCGA-BR-6801-01A-11D-1881-01TCGA-BR-8484-01A-11D-2390-01TCGA-BR-7958-01A-21D-2338-01TCGA-IN-A7NT-01A-21D-A34T-01TCGA-HU-A4H5-01A-21D-A253-01TCGA-HU-A4G9-01A-11D-A24C-01TCGA-VQ-A91S-01A-11D-A40Z-01TCGA-VQ-A8P5-01A-11D-A396-01TCGA-D7-6519-01A-11D-1799-01TCGA-R5-A7ZE-01B-11D-A34T-01TCGA-VQ-AA68-01A-11D-A40Z-01TCGA-HU-A4H4-01A-21D-A253-01TCGA-BR-4366-01A-01D-1155-01TCGA-D7-A6EV-01A-11D-A31K-01TCGA-HU-A4H3-01A-21D-A253-01TCGA-BR-A4PE-01A-31D-A253-01TCGA-3M-AB46-01A-11D-A40Z-01TCGA-CG-4305-01A-01D-1155-01TCGA-IN-A6RR-01A-12D-A32M-01TCGA-VQ-A8E0-01A-11D-A40Z-01TCGA-CG-4449-01A-01D-1155-01TCGA-VQ-A923-01A-11D-A40Z-01TCGA-VQ-A8PP-01A-21D-A40Z-01TCGA-D7-A6EZ-01A-11D-A31K-01TCGA-HU-A4GF-01A-11D-A24C-01TCGA-BR-4267-01A-01D-1130-01TCGA-B7-A5TI-01A-11D-A31K-01TCGA-IN-A6RL-01A-11D-A32M-01TCGA-VQ-AA6A-01A-11D-A40Z-01

TCGA Samples

TCGA-BH-A0AW-01A-11D-A059-01TCGA-A8-A0A7-01A-11D-A011-01TCGA-AR-A250-01A-31D-A166-01TCGA-EW-A1OX-01A-11D-A141-01TCGA-BH-A1F2-01A-31D-A13J-01TCGA-D8-A1JA-01A-11D-A13J-01TCGA-EW-A6SD-01A-12D-A33D-01TCGA-3C-AALI-01A-11D-A41E-01TCGA-C8-A275-01A-21D-A16C-01TCGA-BH-A0DZ-01A-11D-A011-01TCGA-C8-A1HO-01A-11D-A13J-01TCGA-A2-A04X-01A-21D-A036-01TCGA-E2-A109-01A-11D-A10L-01TCGA-C8-A1HK-01A-21D-A13J-01TCGA-AO-A0JM-01A-21D-A059-01TCGA-AC-A23H-01A-11D-A160-01TCGA-A2-A3XZ-01A-42D-A238-01TCGA-D8-A27N-01A-11D-A16C-01TCGA-C8-A12M-01A-11D-A111-01TCGA-A2-A1FW-01A-11D-A13J-01TCGA-C8-A137-01A-11D-A111-01TCGA-A8-A076-01A-21D-A011-01TCGA-GM-A2DA-01A-11D-A18N-01TCGA-DD-A73D-01A-12D-A32F-01TCGA-L5-A8NG-01A-11D-A37B-01TCGA-C8-A12T-01A-11D-A111-01TCGA-21-A5DI-01A-31D-A26L-01TCGA-EY-A1GJ-01A-12D-A141-01TCGA-AR-A24U-01A-11D-A166-01TCGA-BH-A0DG-01A-21D-A12N-01TCGA-EW-A6SC-01A-12D-A32H-01TCGA-D8-A1XZ-01A-11D-A14J-01TCGA-A2-A25B-01A-11D-A166-01TCGA-AN-A0FW-01A-11D-A036-01TCGA-ZF-A9R3-01A-11D-A38F-01TCGA-NA-A4R1-01A-11D-A28Q-01TCGA-C8-A12U-01A-11D-A111-01TCGA-E2-A1LB-01A-11D-A141-01TCGA-A8-A07I-01A-11D-A011-01TCGA-DD-AACJ-01A-11D-A40Q-01TCGA-ND-A4WA-01A-12D-A28Q-01TCGA-29-2429-01A-01D-1043-01TCGA-BR-4369-01A-01D-1155-01TCGA-D5-6537-01A-11D-1717-01TCGA-BH-A0DD-01A-31D-A12N-01TCGA-W3-AA21-06A-11D-A38F-01TCGA-AN-A0XW-01A-11D-A107-01TCGA-AO-A0JG-01A-31D-A087-01TCGA-BH-A0BP-01A-11D-A111-01TCGA-09-1665-01B-01D-0533-01TCGA-IF-A3RQ-01A-11D-A227-01TCGA-BH-A0HW-01A-11D-A036-01TCGA-C8-A1HI-01A-11D-A134-01TCGA-D1-A0ZP-01A-21D-A10L-01TCGA-E9-A1RF-01A-11D-A160-01TCGA-56-A4BY-01A-11D-A24C-01TCGA-AR-A0TY-01A-12D-A111-01TCGA-AO-A0J3-01A-11D-A036-01TCGA-BR-4357-01A-01D-1155-01TCGA-EO-A2CH-01A-11D-A17U-01TCGA-BH-A0HB-01A-11D-A059-01TCGA-DX-AB3A-01A-11D-A416-01TCGA-A8-A08J-01A-11D-A011-01TCGA-BL-A0C8-01B-04D-A273-01TCGA-D3-A8GP-06A-11D-A371-01TCGA-10-0931-01A-01D-0399-01TCGA-S3-AA14-01A-11D-A41E-01TCGA-AP-A053-01A-21D-A00X-01TCGA-N7-A59B-01A-11D-A28Q-01TCGA-B9-A5W9-01A-11D-A28F-01TCGA-2Y-A9H0-01A-11D-A381-01TCGA-NF-A4WX-01A-11D-A28Q-01TCGA-A2-A0T3-01A-21D-A111-01TCGA-04-1356-01A-01D-0452-01TCGA-AQ-A0Y5-01A-11D-A14J-01TCGA-BH-A18R-01A-11D-A12A-01TCGA-A2-A0D1-01A-11D-A036-01TCGA-EW-A1OZ-01A-11D-A141-01TCGA-UD-AAC4-01A-11D-A39Q-01TCGA-EW-A2FV-01A-11D-A17C-01TCGA-A8-A09W-01A-11D-A011-01TCGA-G3-A25S-01A-11D-A16U-01TCGA-E9-A5UP-01A-11D-A28A-01TCGA-A7-A26H-01A-11D-A166-01TCGA-EA-A5ZF-01A-11D-A28A-01TCGA-50-5944-01A-11D-1752-01TCGA-E2-A14T-01A-11D-A111-01TCGA-3C-AALK-01A-11D-A41E-01TCGA-BH-A0EE-01A-11D-A036-01TCGA-UT-A88E-01B-11D-A39Q-01TCGA-12-0818-01A-01D-0384-01TCGA-AG-3609-01A-02D-0824-01TCGA-D8-A27V-01A-12D-A17C-01TCGA-AP-A5FX-01A-11D-A27O-01TCGA-E9-A1NA-01A-11D-A141-01TCGA-B3-4104-01A-02D-1348-01TCGA-QQ-A5VC-01A-31D-A32H-01TCGA-FJ-A3ZE-01A-11D-A23L-01TCGA-AN-A0AJ-01A-11D-A011-01TCGA-E9-A2JS-01A-11D-A17U-01TCGA-D8-A140-01A-11D-A111-01TCGA-A8-A07P-01A-11D-A011-01TCGA-H8-A6C1-01A-11D-A32M-01TCGA-WB-A81J-01A-11D-A35H-01TCGA-LN-A9FQ-01A-31D-A386-01TCGA-OR-A5LT-01A-11D-A29H-01TCGA-VQ-AA6D-01A-11D-A40Z-01TCGA-A8-A08F-01A-11D-A011-01TCGA-13-1487-01A-01D-0472-01TCGA-DI-A2QY-01A-12D-A19X-01TCGA-BR-8289-01A-11D-2338-01TCGA-AN-A04A-01A-21D-A036-01TCGA-52-7810-01A-11D-2121-01TCGA-AA-3556-01A-01D-0819-01TCGA-55-6712-01A-11D-1854-01TCGA-XF-AAMH-01A-11D-A42D-01TCGA-95-7043-01A-11D-1943-01TCGA-KV-A74V-01A-11D-A33P-01TCGA-D8-A27G-01A-11D-A16C-01TCGA-A7-A0CD-01A-11D-A011-01TCGA-3A-A9IV-01A-11D-A40V-01TCGA-IB-7891-01A-11D-2200-01TCGA-C8-A1HM-01A-12D-A134-01TCGA-AR-A0TW-01A-11D-A087-01TCGA-G3-AAUZ-01A-11D-A381-01TCGA-BQ-5891-01A-11D-1588-01TCGA-BH-A0HX-01A-21D-A059-01TCGA-HS-A5N8-01A-11D-A26F-01TCGA-55-1595-01A-01D-0944-01TCGA-G7-6790-01A-11D-1959-01TCGA-E9-A22A-01A-11D-A160-01TCGA-63-7023-01A-11D-1943-01TCGA-JU-AAVI-01A-11D-A402-01TCGA-GL-A59T-01A-21D-A28F-01TCGA-BH-A0DK-01A-21D-A059-01TCGA-EW-A1OY-01A-11D-A141-01TCGA-G7-6789-01A-11D-1959-01TCGA-29-1693-01A-01D-0559-01TCGA-UY-A8OD-01A-11D-A363-01TCGA-2L-AAQM-01A-11D-A396-01TCGA-E2-A1IG-01A-11D-A141-01TCGA-E9-A54Y-01A-11D-A25N-01TCGA-E2-A1B1-01A-21D-A12N-01TCGA-13-0921-01A-01D-0399-01TCGA-41-2573-01A-01D-0911-01TCGA-BC-A10Y-01A-11D-A12Y-01TCGA-EW-A6SA-01A-21D-A32H-01TCGA-58-A46N-01A-11D-A24C-01TCGA-BH-A0BC-01A-22D-A087-01TCGA-FR-A3YN-06A-11D-A238-01TCGA-B6-A0IG-01A-11D-A036-01TCGA-P3-A6T6-01A-11D-A34I-01TCGA-NK-A7XE-01A-12D-A400-01TCGA-G7-6795-01A-11D-1959-01TCGA-G7-A8LD-01A-11D-A35Y-01TCGA-DD-A3A7-01A-11D-A22E-01TCGA-AR-A2LL-01A-11D-A17U-01TCGA-10-0936-01A-01D-0399-01TCGA-VR-AA4D-01A-11D-A37B-01TCGA-FB-A545-01A-11D-A26H-01TCGA-AH-6897-01A-11D-1923-01TCGA-AN-A04C-01A-21D-A036-01TCGA-N6-A4V9-01A-11D-A28Q-01TCGA-61-1722-01A-01D-0577-01TCGA-85-8070-01A-11D-2243-01TCGA-GV-A3JZ-01A-11D-A219-01TCGA-B9-4117-01A-02D-1348-01TCGA-B4-5835-01A-11D-1668-01TCGA-EY-A2OQ-01A-11D-A19X-01TCGA-N5-A4RD-01A-11D-A28Q-01TCGA-AA-A00L-01A-01D-A003-01TCGA-BH-A0HY-01A-11D-A059-01

chr3

35,600 kb 35,700 kbchr9

TCGA Samples

160

TCG

A S

ampl

es

a

b

c

d

Extended Data Figure 9

Panc-AdenoCA: C>ULung-AdenoCA: C>A

Panc-Endocrine: G>C

C>UStomach-AdenoCA

ins. G>ACGTGG: Liver-HCC

ins. G>AGLiver-HCC

G>U: Liver-HCC

A>G:Panc-AdenoCA

del. G:Breast-AdenoCa

A>U: Liver-HCC

del. UCCUCThy-AdenoCA

G G U U

C

G

U

G

C

U

G

AA

G

GC

CU

GU

A

U

CC

UA

GG

CU A C

AC

A

C

U

GA

GG

A CU

C

UGU

UC

CU

CCCC

UU

U

C

C

GC

CU

AG

GG

GA

AA

GUCCCCGGACCU

CG

G

G

C

A

GA

G

A

GU

G

C

C

AC

GU

GC

AU

A

C GC

AC

GU

A

GACAU

U

CCCCGCUU

C

C

C

A

CUC

C

A

A

AG U C

C

G

C

C

A

AG A

AG C G

U A

U

CCCGC

UGAG

C

G

G

C

G

U

G

G C

G C G G G GG C

G U C

A

U

C

C

G

UCAGCU C C C U C U A

GUUACG

CA

GG

C

A

GUGCGUG

U

CC

G C G C A C

C

AA

CC A C A

CG

G

G

G

C

U

C

A

UU

CU

C

A

G

C

G

C

G

G C U G U

1

1020

30

40

50

60

70

80

90

100

110

120

130

140

150 160

170

180

190

200

210

220

230

240

250

260

268

Predicted structural impact of mutations

Low High P4 tertiary interaction sites RNA-protein interaction sites RNA-substrate interaction sites

Mutation Mutation with predicted structural impact (P < 0.1)

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 99: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Scalechr17:

1000G Accs Pilot1000G Accs Strict

2 kb hg198,076,000 8,076,500 8,077,000 8,077,500 8,078,000 8,078,500 8,079,000 8,079,500 8,080,000 8,080,500

TMEM107SNORD118

TMEM107TMEM107

TMEM107TMEM107TMEM107

RP11-599B13.7

100 Vert. Cons

4.88 -

-4.5 _

0 -

Scalechr22:

1000G Accs Pilot1000G Accs Strict

500 bases hg1943,011,000 43,011,500 43,012,000

POLDIP3POLDIP3POLDIP3

RNU12

100 Vert. Cons

4.88 -

-4.5 _

0 -

Scalechr14:

1000G Accs Pilot1000G Accs Strict

2 kb hg1920,809,500 20,810,000 20,810,500 20,811,000 20,811,500 20,812,000 20,812,500 20,813,000 20,813,500

RPPH1PARP2

PARP2PARP2

RP11-203M5.2

100 Vert. Cons

4.88 -

-4.5 _

0 -

Scalechr9:

1000G Accs Pilot1000G Accs Strict

1 kb hg1935,656,500 35,657,000 35,657,500 35,658,000 35,658,500 35,659,000 35,659,500 35,660,000

RMRPCCDC107CCDC107CCDC107CCDC107CCDC107

CCDC107

ARHGEF39ARHGEF39ARHGEF39

100 Vert. Cons

4.88 -

-4.5 _

0 -

Scalechr11:

1000G Accs Pilot1000G Accs Strict

5 kb hg1962,601,000 62,602,000 62,603,000 62,604,000 62,605,000 62,606,000 62,607,000 62,608,000 62,609,000 62,610,000

RP11-727F15.9WDR74WDR74WDR74WDR74

RP11-727F15.9WDR74

WDR74 RNU2-2P

100 Vert. Cons

4.88 -

-4.5 _

0 -

PCAWG germline

PCAWG somatic

PCAWG germline

PCAWG somatic

PCAWG germlinePCAWG somatic

PCAWG germlinePCAWG somatic

PCAWG germline

PCAWG somatic

WDR74a

b

c

d

e

Extended Data Figure 10

f

primary miRNApre-miRNA

SNVsmature miRNA mir-142-p5mir-142-p3

pre-miRNAmature miRNASNVs

primary miRNAchr 7 100 bp

AIDNon-AID

Mutational signature

TMEM107

RNU12

RMRP

RPPH1

mir-142

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 100: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

Lym

ph.N

OS

Head

.SCC

Lym

ph.B

NHL

Cerv

ix.Ad

enoC

ABl

adde

r.TCC

Pros

t.Ade

noCA

Ova

ry.A

deno

CABr

east

.Lob

ular

CaBo

ne.L

eiom

yoSk

in.M

elan

oma

Cerv

ix.SC

CCN

S.O

ligo

CNS.

GBM

Lym

ph.C

LLUt

erus

. Ade

noCA

Colo

Rect

.Ade

noCA

Kidn

ey.C

hRCC

Eso.

Aden

oCa

Brea

st.A

deno

CaLu

ng.S

CCTh

y.Ad

enoC

ALu

ng.A

deno

CAKi

dney

.RCC

Stom

ach .

Aden

oCA

Panc

.Ade

noCA

Bilia

ry.A

deno

CALi

ver.H

CC

APOBADH4CYP2C8RP11-1151B14.3HPRFGAFGBALBADH1BCPS1ALDOBCYP3A5PNLIPCPB1LIPFTGCCDC152SPRNSFTPBNEAT1

Extended Data Figure 11D

ensi

ty

65000000 65100000 65200000 65300000 65400000 65500000

Structuralvariants (n = 379)Indels (n = 454)SNVs (n = 4324)

NEAT1 MALAT1

a

d e

b c

Expression level

Background

ALB

(Liver-H

CC)

TG(T

hy-AdenoCa)

FGA

(Liver-H

CC)

FGB

(Liver-H

CC)

SFTPB

(Lung-A

denoCa)

ALDOB

(Liver-H

CC)

APOB

(Liver-H

CC)

ADH4

(Liver-H

CC)

ADH1B

(Liver-H

CC)

CPS1

(Liver-H

CC)

HPR

(Liver-H

CC)

CYP2C8

(Liver-H

CC)

CYP3A5

(Liver-H

CC)

NEAT1 (L

iver-HCC)

NEAT1 (E

so-AdenoCa)

RP11-1151B14.3

(Liver-H

CC)

CCDC152 (L

iver-HCC)

SPRN

(Liver-H

CC)

PNLIP

(Panc-A

denoCa)

CPB1

(Panc-A

denoCa)

LIPF

(Stomach-A

denoCa)− 2

− 1

0

1

2

3

4

5

0.4 0.5 0.6 0.7 0.8 0.9

0.4

0.5

0.6

0.7

0.8

0.9

●●

●●

●●●

●●

TP53

KRASBRAF

VHL

SMAD4PBRM1

NEAT1MALAT1

CDKN2AMEN1

TERT

Protein−coding promoter5'UTRProtein−coding

lncRNAlncRNA promoterSmall RNAEnhancersmiRNABackground

3'UTR

q < 0.1

Mea

n C

AF

gene

/ele

men

t

Mean CAF flanking

0 50 100 150

−4−2

02

46

8

●●

●●

●●

●●

● ●● ●

● ●

●●●

SMAD4CDKN2A

MEN1

PBRM1 PTEN

APCRB1

DAXX

AXIN1KEAP1B2M

BAP1

CBFBSTK11

CASP8

ARID1A

PRKAR1ATNFRSF14

CDH1 SETD2MAP2K4 CIC

HRAS

TGFBR2ATM

ACVR1BFUBP1

NF1NOTCH1

TSC1 PIK3CASMARCA4

CREBBPFAT1 KMT2C

ARID2APOBATRX KMT2DBRAF

GNASCTNNB1

IRF4FOXA1

SF3B1FGFR1

BTG1H3F3BHIST1H2AG

NXF1

PA2G4PLK1

RPL22 RRAGC

U2AF1

POLR3E

TERT

IFI44LRFTN1

PAX5

WDR74

MTG2

DTL

PTDSS1

RNF34

ALBTOB1NFKBIZ

TMEM107

FOXA1

SFTPB

Enhancer:IGH

Enhancer:TP53TG1

RNU12G029190RMRP MALAT1

G025135

RP11−92C4.6RPPH1

RMRPLINC00963

TRAM2−AS1RNU12

MIR663A

RNU6−573P

KRAS

1000

NEAT1

200

●RP11−440L14.1

HES1

mir-142

400 600 800

Number of mutated samples

)2gol( ecner effi d dl of H

OLTP53

VHL

f g

Exp

ress

ion

fold

diff

eren

ce (l

og2)

Number of mutated samples with RNAseq data

-2

-1

0

1

2

3

0 50 100 300

CDKN2A

KRAS

VHL APC

RB1

TERT

NEAT1MALAT1

TP53

Number of mutated samples

Num

ber o

f stru

ctur

al e

vent

s

0

50

100

150

0 100 200 300 400

CDKN2ABRAF

TERT

PTEN

RB1

NEAT1

EGFR

ARID1A

MALAT1

NF1

APC

KRASVHL

FAT1

TP53

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint

Page 101: was not certified by peer review) is the author/funder. It ...1 Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes Esther

a c

Average detection sensitivity

TCF7L2 (exon 3) AKT1 5’UTR

0.0 1.0 0.0 1.0

% p

atie

nts

0

100

% p

atie

nts

0.0

1.0

Det

ectio

n se

nsiti

vity

All breast cancersFOXA1 promoter(chr14:38,064,406)

0

2

4

6

64

MissedCalled

MissedTotal

Inferred mutations

31%

b35 91 94 85 48 79 35 4 23 61 61 19 22 8 81 77 30 69 61 91 66 94 61 64 51 100 48

0.0

1.0

Det

ectio

n se

nsiti

vity

at c

hr5:

1,29

5,25

0

% of patients in cohort powered (≥90%)

Bili

ary-

Ade

noC

a

Bla

dder

-TC

C

Bon

e-Le

iom

yo

Bon

e-O

steo

sarc

Bre

ast-A

deno

Ca

CN

S-G

BM

CN

S-M

edul

lo

CN

S-P

iloA

stro

Col

oRec

t-Ade

noC

a

Eso

-Ade

noC

a

Hea

d-S

CC

Kid

ney-

ChR

CC

Kid

ney-

RC

C

Live

r-H

CC

Lung

-Ade

noC

a

Lung

-SC

C

Lym

ph-B

NH

L

Lym

ph-C

LL

Mye

loid

-MP

N

Ova

ry-A

deno

Ca

Pan

c-A

deno

Ca

Pan

c-E

ndoc

rine

Pro

st-A

deno

Ca

Ski

n-M

elan

oma

Sto

mac

h-A

deno

Ca

Thy-

Ade

noC

a

Ute

rus-

Ade

noC

a

Extended Data Figure 12

.CC-BY-NC-ND 4.0 International licenseavailable under awas not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (whichthis version posted December 23, 2017. ; https://doi.org/10.1101/237313doi: bioRxiv preprint