10
Cancer-mutation network and the number and specificity of driver mutations Jaime Iranzo a,1 , Iñigo Martincorena b , and Eugene V. Koonin a,1 a National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894; and b Wellcome Trust Sanger Institute, CB10 1SA Hinxton, Cambridgeshire, United Kingdom Contributed by Eugene V. Koonin, May 9, 2018 (sent for review February 21, 2018; reviewed by Katerina Gurova and Sergei Maslov) Cancer genomics has produced extensive information on cancer- associated genes, but the number and specificity of cancer-driver mutations remains a matter of debate. We constructed a bipartite network in which 7,665 tumors from 30 cancer types are connected via shared mutations in 198 previously identified cancer genes. We show that about 27% of the tumors can be assigned to statistically supported modules, most of which encompass one or two cancer types. The rest of the tumors belong to a diffuse network component suggesting lower gene specificity of driver mutations. Linear re- gression of the mutational loads in cancer genes was used to estimate the number of drivers required for the onset of different cancers. The mean number of drivers in known cancer genes is approximately two, with a range of one to five. Cancers that are associated with modules had more drivers than those from the diffuse network component, suggesting that unidentified and/or interchangeable drivers exist in the latter. cancer types | driver mutations | passenger mutations | bipartite networks | community detection C ancer develops as a result of the accumulation of somatic mutations and other genetic alterations that impair cell- division checkpoints, resulting in abnormal cell proliferation and eventually tumorigenesis (1, 2). Such mutations are known as drivers,and their characterization is central to understanding the early steps of tumor progression (3, 4). During the last de- cade, comparative analyses of large collections of cancer ge- nomes have led to the identification of overlapping sets of genes that are typically associated to cancer, i.e., that harbor a significant excess of mutations in tumors and show signatures of positive se- lection (511). The continued identification of cancer genes provides insights into the processes and pathways involved in tumorigenesis as well as possible therapy targets (1217). Like any other gene in the genome, cancer genes are expected to accumulate passenger mu- tations that do not contribute to or even hinder cancer progression (1820). Therefore, although cancer genes often harbor driver mutations, only a fraction of the mutations found in these genes are actual drivers (911). Distinguishing driver mutations from passenger ones poses a formidable challenge for cancer genomics. The number of driver mutations required for the onset of cancer is a fundamental question that remains a matter of debate (3, 9, 2123). Classical approaches to this problem use ageincidence curves to infer the number of rate-limiting steps in tumorigenesis, each of which is assumed to be associated with a unique driver mutation; these estimates, however, are sensitive to changes in mutation and replication rates during tumor progression (2225). A modified method has been proposed that compares the incidence of cancer across risk groups with different mutation rates, but this approach applies only to cancers with relatively well-defined risk groups, such as lung and colorectal cancer (21). Recent mea- surements of selection in cancer genomes have provided quan- titative estimates of the number of positively selected mutations, i.e., drivers, per tumor, ranging from less than one in thyroid and testicular cancers to more than 10 in endometrial and colorectal cancers (9). Given the novelty of these findings, a comparison with independent inference approaches appears highly desirable. A second major question about cancer-driver mutations refers to their specificity in different cancer types. Some tumors show recurrent mutation patterns, such as the oncogenic fusion BCR- ABL in chronic myeloid leukemia (26) or the inactivation of specific tumor suppressors, e.g., RB1 in retinoblastoma (27). Other tumors appear to result from interchangeable mutations in a pool of genes involved in key signaling pathways, such as the receptor tyrosine kinase/RAS/RAF pathway in lung adenocarci- noma (28). Between these two extremes, intermediate degrees of specificity are observed in many cancer types (8, 29, 30). Fur- thermore, although numerous recent studies on cancer muta- tional landscapes have yielded extensive lists of genes that are mutated in various cancers (3133), a quantitative understanding of the extent to which the current tumor classification captures the existence of specific sets of driver mutations is lacking. Here, we combine tools for network analysis and multivariate statistics to assess the number and specificity of cancer-driver mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network of somatic mutations in cancer recovers modules consisting of mutu- ally specific tumors and genes that (i ) are consistent with the tumor histology and (ii ) are enriched in putative driver mutations. We used multivariate statistical analysis to estimate the characteristic number of driver mutations in known cancer genes required for the onset of each cancer type. Notably, the average age of onset for different cancer types correlates with the predicted number of drivers. Furthermore, cancers that are not associated with the specific Significance Cancer genomics yields a wealth of information on cancer- associated mutations in various cancer types, but current un- derstanding of the number and tissue specificity of the driver mutations remains limited. We applied mathematical methods for network analysis to identify distinct modules linking tumors to driver mutations. About 27% of the tumors belong to such mod- ules, whereas the rest form a diffuse component of the genetumor network. The cancers from the diffuse component show an onset later in life than those in the modules and have fewer associated known drivers, implying the existence of multiple un- identified and/or interchangeable drivers in the former. Author contributions: J.I. and E.V.K. designed research; J.I. performed research; J.I., I.M., and E.V.K. analyzed data; and J.I. and E.V.K. wrote the paper. Reviewers: K.G., Roswell Park Cancer Institute; and S.M., University of Illinois at UrbanaChampaign. The authors declare no conflict of interest. Published under the PNAS license. 1 To whom correspondence may be addressed. Email: [email protected] or koonin@ ncbi.nlm.nih.gov. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1803155115/-/DCSupplemental. Published online June 12, 2018. E6010E6019 | PNAS | vol. 115 | no. 26 www.pnas.org/cgi/doi/10.1073/pnas.1803155115 Downloaded by guest on February 2, 2021

Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

Cancer-mutation network and the number andspecificity of driver mutationsJaime Iranzoa,1, Iñigo Martincorenab, and Eugene V. Koonina,1

aNational Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894; and bWellcome TrustSanger Institute, CB10 1SA Hinxton, Cambridgeshire, United Kingdom

Contributed by Eugene V. Koonin, May 9, 2018 (sent for review February 21, 2018; reviewed by Katerina Gurova and Sergei Maslov)

Cancer genomics has produced extensive information on cancer-associated genes, but the number and specificity of cancer-drivermutations remains a matter of debate. We constructed a bipartitenetwork in which 7,665 tumors from 30 cancer types are connectedvia shared mutations in 198 previously identified cancer genes. Weshow that about 27% of the tumors can be assigned to statisticallysupported modules, most of which encompass one or two cancertypes. The rest of the tumors belong to a diffuse network componentsuggesting lower gene specificity of driver mutations. Linear re-gression of themutational loads in cancer genes was used to estimatethe number of drivers required for the onset of different cancers. Themean number of drivers in known cancer genes is approximately two,with a range of one to five. Cancers that are associated with moduleshad more drivers than those from the diffuse network component,suggesting that unidentified and/or interchangeable drivers exist inthe latter.

cancer types | driver mutations | passenger mutations |bipartite networks | community detection

Cancer develops as a result of the accumulation of somaticmutations and other genetic alterations that impair cell-

division checkpoints, resulting in abnormal cell proliferation andeventually tumorigenesis (1, 2). Such mutations are known as“drivers,” and their characterization is central to understandingthe early steps of tumor progression (3, 4). During the last de-cade, comparative analyses of large collections of cancer ge-nomes have led to the identification of overlapping sets of genesthat are typically associated to cancer, i.e., that harbor a significantexcess of mutations in tumors and show signatures of positive se-lection (5–11). The continued identification of cancer genes providesinsights into the processes and pathways involved in tumorigenesis aswell as possible therapy targets (12–17). Like any other gene in thegenome, cancer genes are expected to accumulate passenger mu-tations that do not contribute to or even hinder cancer progression(18–20). Therefore, although cancer genes often harbor drivermutations, only a fraction of the mutations found in these genes areactual drivers (9–11).Distinguishing driver mutations from passenger ones poses a

formidable challenge for cancer genomics. The number of drivermutations required for the onset of cancer is a fundamentalquestion that remains a matter of debate (3, 9, 21–23). Classicalapproaches to this problem use age–incidence curves to infer thenumber of rate-limiting steps in tumorigenesis, each of which isassumed to be associated with a unique driver mutation; theseestimates, however, are sensitive to changes in mutation andreplication rates during tumor progression (22–25). A modifiedmethod has been proposed that compares the incidence ofcancer across risk groups with different mutation rates, but thisapproach applies only to cancers with relatively well-defined riskgroups, such as lung and colorectal cancer (21). Recent mea-surements of selection in cancer genomes have provided quan-titative estimates of the number of positively selected mutations,i.e., drivers, per tumor, ranging from less than one in thyroid andtesticular cancers to more than 10 in endometrial and colorectal

cancers (9). Given the novelty of these findings, a comparisonwith independent inference approaches appears highly desirable.A second major question about cancer-driver mutations refers

to their specificity in different cancer types. Some tumors showrecurrent mutation patterns, such as the oncogenic fusion BCR-ABL in chronic myeloid leukemia (26) or the inactivation ofspecific tumor suppressors, e.g., RB1 in retinoblastoma (27).Other tumors appear to result from interchangeable mutationsin a pool of genes involved in key signaling pathways, such as thereceptor tyrosine kinase/RAS/RAF pathway in lung adenocarci-noma (28). Between these two extremes, intermediate degrees ofspecificity are observed in many cancer types (8, 29, 30). Fur-thermore, although numerous recent studies on cancer muta-tional landscapes have yielded extensive lists of genes that aremutated in various cancers (31–33), a quantitative understandingof the extent to which the current tumor classification capturesthe existence of specific sets of driver mutations is lacking.Here, we combine tools for network analysis and multivariate

statistics to assess the number and specificity of cancer-drivermutations in 30 cancer types. We show that an unsupervisedcommunity-detection approach applied to the bipartite network ofsomatic mutations in cancer recovers modules consisting of mutu-ally specific tumors and genes that (i) are consistent with the tumorhistology and (ii) are enriched in putative driver mutations. Weused multivariate statistical analysis to estimate the characteristicnumber of driver mutations in known cancer genes required for theonset of each cancer type. Notably, the average age of onset fordifferent cancer types correlates with the predicted number of drivers.Furthermore, cancers that are not associated with the specific

Significance

Cancer genomics yields a wealth of information on cancer-associated mutations in various cancer types, but current un-derstanding of the number and tissue specificity of the drivermutations remains limited. We applied mathematical methods fornetwork analysis to identify distinct modules linking tumors todriver mutations. About 27% of the tumors belong to such mod-ules, whereas the rest form a diffuse component of the gene–tumor network. The cancers from the diffuse component showan onset later in life than those in the modules and have fewerassociated known drivers, implying the existence of multiple un-identified and/or interchangeable drivers in the former.

Author contributions: J.I. and E.V.K. designed research; J.I. performed research; J.I., I.M.,and E.V.K. analyzed data; and J.I. and E.V.K. wrote the paper.

Reviewers: K.G., Roswell Park Cancer Institute; and S.M., University of Illinois at Urbana–Champaign.

The authors declare no conflict of interest.

Published under the PNAS license.1To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1803155115/-/DCSupplemental.

Published online June 12, 2018.

E6010–E6019 | PNAS | vol. 115 | no. 26 www.pnas.org/cgi/doi/10.1073/pnas.1803155115

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 2: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

modules in the gene–tumor network appear later in life thanexpected based on the general trend.

ResultsCancer Mutational Landscape as a Partially Modular Network. So-matic mutations in a set of tumors can be collectively representedas a bipartite network, that is, a network with two classes of nodes.In such a network, nodes of one class correspond to tumor samples,and nodes of the other class correspond to cancer genes. Mutationsare represented as edges that connect each tumor sample with thegenes mutated in it; conversely, each gene is linked to the tumors inwhich it carries a mutation(s). Using this approach, we built thenetwork of somatic mutations from The Cancer Genome Atlas(TCGA), a collection that consists of 7,665 tumor samples from30 cancer types (Fig. 1 A and B), focusing on coding mutations in198 recurrently mutated cancer genes (Methods). The structure ofthe network is characterized by a relatively long-tailed degreedistribution (i.e., the distribution of the number of connectionsbetween a given node and other nodes in the network) (Fig. 1C),which is indicative of the pronounced heterogeneity in the fre-quencies of mutations across genes and tumors. Also, the networkhas small clustering coefficients (i.e., the fraction of neighbors of anode that share connections with a second node) (Fig. 1D), due tothe sparsity of mutations in the dataset.Within the bipartite network framework, the association be-

tween mutually specific sets of genes and cancer types becomesmanifest by the existence of groups of nodes (tumors and genes)that are densely connected with members of the same group butpoorly connected with the rest of the network. Such groups arecalled “modules,” and a network with such structure is said to be“modular” (34). We tested the modular nature of the cancermutation network by computing its Barber’s modularity index (35)and comparing it with 200 random networks with the same degreedistribution (Fig. 1E). The result of this comparison supported theexistence of a significant degree of mutual specificity betweencancer types and cancer genes (P < 10−20, Welch’s t test), whichdemonstrates the ability of the network approach to detect a well-known feature of cancer mutational landscapes (3, 8, 29).To further investigate the specificity of the mutation landscapes,

we identified the modules of the network and assessed their sta-tistical significance. To that end, we first applied a battery ofmodule-detection algorithms and then pruned all genes and tu-mors for which the specificity patterns were compatible with arandom null model (Methods). The analysis revealed the existenceof 12 modules with a significance threshold P < 0.05. Each of thesemodules contains tumors and genes that are, to a certain extent,mutually specific. Accordingly, tumors in a module typically har-bor mutations in genes from the same module, whereas the con-stituent genes are more frequently mutated in tumors that belongto the same module. Overall, the statistically significant modulescomprise 27% of the samples and 66 (33%) cancer-associatedgenes (Table 1).Before proceeding with a more detailed dissection of the

genes and cancer types represented in each module, we evalu-ated their biological relevance by characterizing the mutationsthat correspond to intra- and intermodule connections. To thatend, we split the cancer-associated genes into oncogenes andtumor-suppressor genes (TSGs) and for each of these groupscalculated the percentage of coding mutations that truncate thegene product and the percentage of copy number alterations thatinvolve gene loss. We obtained these percentages separately formutations affecting samples and genes that belong to the samemodule (intramodule mutations) and for those involving samplesand genes from different modules (intermodule mutations). Be-cause driver mutations in TSGs entail loss of function (11, 36), thepercentage of truncations and large deletions in TSGs is indicativeof the enrichment of a set of mutations in putative drivers. In thecase of oncogenes, driver mutations are typically missense sub-

stitutions in mutational hotspots and/or gene amplifications (11,36). Accordingly, the enrichment of a set of oncogene mutations inputative drivers results in a smaller percentage of truncations anddeletions. As shown in Fig. 1F, intramodule mutations in TSGsinclude a significantly higher percentage of truncations thanintermodule mutations, whereas the opposite holds for oncogenes.A similar trend is observed in copy number variation data: Com-pared with intermodule alterations, intramodule alterations en-compass a significantly higher percentage of losses when affectingTSG and, a significantly higher percentage of amplifications whenaffecting oncogenes. Overall, we conclude that intramodule mu-tations are significantly enriched in putative drivers compared withintermodule mutations. In other words, mutations that affectmutually specific genes and tumors (i.e., genes and samples fromthe same module) are more likely to be cancer drivers than thoseaffecting genes and samples that belong to different modules. Totest whether intermodule mutations in oncogenes and TSGs in-clude any drivers or only a background of passenger mutations, weobtained baseline values for the percentage of truncations andgene losses in genes that are not associated with cancer. Mutationsin those genes are expected to be passengers and thus serve as areference of the percentage of truncations and deletions expectedfor passenger mutations. The deviations with respect to thebaseline shown in Fig. 1F imply that some intermodule mutationsinvolving oncogenes and TSGs and gene losses involving TSGs arealso relevant for tumor progression. Such deviations remain afterremoving the most widespread cancer genes across tissues (TP53,PIK3CA, and ARID1A), indicating that potential intermoduledrivers are not limited to such genes (SI Appendix, Fig. S1). Acloser inspection of intermodule mutations highlights the onco-genes BRAF (primarily assigned to the melanoma/thyroid cancermodule 12; see below) and IDH1 (primarily assigned to the lower-grade glioma module 6) as major sources of intermodule drivers.Thus, mutations in BRAF are likely drivers in melanomas frommodule 11, and mutations in IDH1 are likely drivers in acutemyeloid leukemias (module 9), with missense mutations repre-senting 96% of the coding mutations in both genes compared withthe 84% baseline in genes not associated to cancer. Similarly, theTSGs STAG2, KDM6A, PIK3R1, MAP3K1, and CDH1, with 50–60% of truncating mutations (baseline is 16%), constitute prob-able intermodule drivers, with MAP3K1 and CDH1 being morerelevant in breast cancer.

Specificity Modules for Cancer Types and Cancer Genes. As shown inTable 1, the composition of the gene-tumor specificity modulesstrongly correlates with the histological classification of tumors.Most modules include tumors from one or two cancer types,together with genes for which mutation frequencies are signifi-cantly higher in cancers of those types (Fig. 2). The mutualspecificity analysis recovers some well-established features ofcancer mutational landscapes, such as the associations betweenthyroid cancer and BRAF and between colorectal cancer andAPC, KRAS, TP53, and SMAD4. The colorectal cancer moduleincludes two additional genes that are mutated in a smaller fractionof samples, namely, ubiquitin ligase FBXW7 and transcriptionfactor TCF7L2. Most samples from pancreatic adenocarcinomacluster in the same module as colorectal cancer, in agreement witha significant excess of mutations in KRAS, TP53, and SMAD4 inboth tumor types. Acute myeloid leukemia constitutes a singlemodule, with the genes DNMT3A, FLT3, and NPM1 mutatedin >30% of samples and CEBPA, IDH2, RUNX1, and WT1 mu-tated at lower frequencies. Testicular germ cell cancer, head andneck squamous cell carcinoma, and urothelial bladder carcinomaalso form separate modules, which, however, comprise a smallerfraction of the samples from each cancer type. The sets of associ-ated genes include KIT in testicular cancer; NOTCH1, CASP8,HLA-A, andHRAS in head and neck cancer; and FGFR3, KDM6A,and STAG2 in bladder cancer. Clear cell kidney carcinoma clusters

Iranzo et al. PNAS | vol. 115 | no. 26 | E6011

MED

ICALSC

IENCE

SPN

ASPL

US

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 3: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

Fig. 1. Structure of the cancer mutation network. (A) Bipartite network of somatic mutations in tumors from the TCGA. Samples are arranged by cancer type alongthe x axis (black/gray/white bars); cancer genes are sorted by module along the y axis (colors indicate module assignations). Samples from the same cancer type andgenes from the same module are sorted by number of connections. The upper and left semiaxes contain genes and samples that belong to statistically significantmodules. The rest of the nodes were assigned to the best-match extended module with which they share the highest similarity (see text); they are represented in thelower (genes) and right (samples) semiaxes. Links connect samples and genes affected by at least one nonsynonymous somatic mutation. Links between two nodesfrom the same module (intramodule links) are drawn in distinctive colors; intermodule links appear in gray. BLCA, bladder urothelial carcinoma; BRCA, breast invasivecarcinoma; COAD, colon adenocarcinoma; ESCA, esophageal carcinoma; GBM, glioblastomamultiforme; HNSC, head and neck squamous carcinoma; KIRC, kidney renalclear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LAML, acute myeloid leukemia; LGG, brain lower-grade glioma; LIHC, liver hepatocellular carcinoma;LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PRAD, prostate ad-enocarcinoma; SARC, sarcoma; SKCM, skin cutaneous melanoma; STAD, stomach adenocarcinoma; THCA, thyroid carcinoma; UCEC, uterine corpus endometrial car-cinoma. (B) Alternative representation of the cancer mutation network with a force-directed drawing method. Node colors indicate module assignations. (C) Nodedegree distribution per sample (black) and per gene (blue). (D) Clustering coefficient of samples (black) and genes (blue) as a function of the node degree (bipartiteclustering coefficient calculated as in ref. 67). (E) Modularity of the cancer mutation network, quantified by its Barber’s modularity index (Qb) and compared with200 random networks with the same degree distribution. The modularity distribution for the original network results from 200 realizations of the community-detection algorithm, each yielding slightly different sets of modules. The lack of overlap reveals a highly significant modular structure (P < 10−20, Welch’s t test). (F)Differences in the functional spectrum of mutations between intramodule and intermodule links (significant modules only). In TSGs, higher percentages of truncatingmutations (with respect to all coding mutations) and severe losses [with respect to all copy number variants (CNV)] indicate enrichment in putative drivers. For on-cogenes (OG), driver enrichment is associated with lower percentages of truncating mutations and severe losses. *P < 0.05, ***P < 10−5.

E6012 | www.pnas.org/cgi/doi/10.1073/pnas.1803155115 Iranzo et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 4: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

with genes VHL, PBRM1, SETD2, and BAP1, among others. Somesamples from mesothelioma and papillary kidney carcinoma arealso assigned to that module, mostly because of mutations inSETD2 and BAP1.Notably, some cancer types are distributed among more than

one module. Thus, glioblastoma is represented in two modules,one characterized by mutations in EGFR and PTEN and theother by mutations in IDH1, ATRX, and TP53. Such subdivisionis consistent with previous reports, which relate the second groupwith the glioblastoma-CpG island methylator phenotype (37–39).The same module also includes most samples from lower-gradeglioma and some representatives from sarcoma (although thelatter typically lack mutations in IDH1). Similarly, melanoma isdivided into tumors with mutated BRAF on a low mutationalbackground, which cluster with thyroid cancer, and samples withmutations in a larger set of genes (including NRAS, KDR, ILR7,and PTPRB), which constitute a separate module. Finally, breastcancer splits between a breast cancer-only module characterizedby mutations in GATA3 and TBX3 and a larger module thatincludes uterine (endometrial and carcinosarcoma) and prostatecancers, with PIK3CA as the signature gene. In terms of histo-logical types, the PIK3CA module is significantly enriched inlobular breast tumors (46% compared with 13% in the GATA3/TBX3 module and 17% in the entire dataset; P < 10−6, χ2 test).

In terms of the molecular subtypes, both modules include breasttumors that mostly belong to the luminal subtype (estrogenreceptor-positive), whereas most of the basal-like breast tumorsare not assigned to any significant module (P < 10−10, χ2 test).Among all the modules, the one that combines breast, uterine,

and prostate cancers stands out for its size and diversity. Thismodule contains the largest number of genes, with many of thosemutated in less than 30% of the samples. Moreover, two of itsconstituent histologies (breast cancer and uterine carcinosarcoma)are split between this and other modules. To further dissect thespecificity of the mutations affecting these cancer types, we re-analyzed the subnetwork composed by all cancer genes and samplesfrom breast, prostate, and uterine (endometrial and carcino-sarcoma) cancers. This analysis yielded four significant modules thatare dominated by each of the four cancer types (Table 2). The list ofmodule-specific genes is consistent with the findings of the globalanalysis. Notably, the reanalysis places most breast cancer samplesin a single module with genes CDH1, PIK3CA, GATA3, and TBX3,whereas uterine carcinosarcoma clusters with genes FBXW7, PPP2RIA,and TP53 (the samples without mutations in PPP2RIA were for-merly assigned to the colorectal cancer module). Specific modulesfor prostate and endometrial cancer are also clearly delineated,the former with SPOP and FOXA1 and the latter with ARID1A,CTNNB1, PI3KR1, and PTEN, among other genes.

Table 1. Composition of the 12 statistically significant modules in the cancer mutation network

Module no. Cancer typesGenes mutated in >30% tumors

from the moduleGenes mutated in <30% tumors

from the module

1 Bladder (BLCA, 10%) FGFR3, KDM6A, STAG2 ERCC2, EP3002 Breast (BRCA, 12%) GATA3, TBX33 Endometrium (UCEC, 57%], uterus-CS (UCS, 24%),

breast (BRCA, 17%), prostate (PRAD, 6%),stomach (STAD, 6%)

ARID1A, PIK3CA, PTEN BCOR, CBFB, CCND1, CDH1, CTNNB1, FGFR2,FOXA1, MAP2K4, MAP3K1, MAX, MED12,PIK3R1, PPP2RIA, RUNX1, SPOP

4 Colon (COAD, 41%), rectum (READ, 62%],pancreas (PAAD, 36%), uterus-CS (UCS, 11%)

APC*, KRAS, TP53 FBXW7, SMAD4, TCF7L2

5 Glioblastoma (GBM, 13%) EGFR PTEN6 Liver (LGG, 38%), sarcoma (SARC, 9%),

glioblastoma† (GBM, 6%)ATRX, IDH1‡, TP53

7 Head and neck (HNSC, 6%) CASP8, HLA-A, HRAS, NOTCH18 Kidney-RCC (KIRC, 40%), mesothelioma

(MESO, 16%), Kidney-RP(KIRP, 6%)PBRM1, SETD2, VHL ATM, BAP1, KDM5C, NF2, PTPN11

9 AML (LAML, 38%) DNMT3A, FLT3, NPM1 CEBPA, IDH2, RUNX1, WT110 Testis (TGCT, 25%) KIT11 Melanoma (SKCM, 14%) IL7R, KDR, NRAS, PDGFRA,

PTPRBCARD11, CBLB, MET, PPP6C, RAC1

12 Thyroid (THCA, 72%), melanoma (SKCM, 15%) BRAF AKT1

Tumor samples are grouped by cancer type, with the TCGA abbreviated name in parentheses. The percentage next to the TCGA abbreviation indicates thefraction of tumors from that class that are present in the module (only classes represented by >5% of their samples are shown). Genes are presented in twocolumns according to the prevalence of mutations within the module. AML, acute myeloid leukemia; kidney-RCC, kidney renal clear cell carcinoma; kidney-RP,kidney renal papillary cell carcinoma; READ, rectum adenocarcinoma; UCEC, uterine corpus endometrial carcinoma; uterus-CS, uterine carcinosarcoma. Othercancer abbreviations are given in the legend of Fig. 1 and in SI Appendix, Table S1.*Mutations in APC are typically absent from pancreatic cancer.†Glioblastoma samples in this module belong to the glioblastoma-CpG island methylator phenotype subtype.‡Mutations in IDH1 are typically absent from sarcoma.

Table 2. Modules in the subnetwork of breast, prostate, endometrial, and uterine cancers

Genes Breast, % Prostate, % Endometrium, % Uterus-CS, %

CBFB, CDH1, GATA3, MAP2K4, MAP3K1, PIK3CA,RUNX1, TBL1XR1, TBX3

30 2 2 0

ARID1A, BCOR, CCND1, CIC, CTNNB1, CUX1, ESR1,FGFR2, KRAS, MAX, PIK3CA, PIK3R1, PTEN

1 1 43 7

FOXA1, SPOP 3 26 2 5FBXW7, PPP2RIA, TP53 11 6 15 55

Numbers indicate the percentage of samples from a given cancer type associated with the module.

Iranzo et al. PNAS | vol. 115 | no. 26 | E6013

MED

ICALSC

IENCE

SPN

ASPL

US

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 5: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

Two Major Modes of Driver Accumulation. The statistically signifi-cant modules of mutual tumor–gene specificity include 27% of thetumors in the TCGA. There are at least two alternative explana-tions for why 73% of samples remain unassigned. The first pos-sibility is that, despite having mutational patterns compatible withone of the modules, the unassigned samples do not reach therequired threshold of statistical significance. That would be thecase if mutations affecting module-specific genes occurred innoncoding regions, involved copy number variants, or occurred infunctionally equivalent genes not included in our list of cancergenes. The second possibility is that unassigned samples accountfor the exchangeability of cancer genes. Under such a scenario,some cancer types might not be specifically associated to any setof genes.To evaluate the first possibility, we built a set of “best-match

extended” modules by attaching unassigned samples and genesto the module with which they shared the most connections (SIAppendix, Table S2). We would expect that, if the specific asso-ciation between tumors and genes held for most samples within acancer type, the extended modules would recover unassignedsamples from the same cancer types as those already assigned to theoriginal modules. Indeed, the best-match extended modules clus-ter >75% of the samples from rectum, pancreas, kidney (clear cell),acute myeloid leukemia, thyroid, and melanoma, and 50–75% ofthe samples from lower-grade glioma, mesothelioma, colon, andendometrial cancer in a tissue-specific way (Fig. 3A). In contrast,cancer types that were absent from the original modules, such asstomach and lung, appear distributed among multiple extendedmodules, with typically <25% of the samples assigned to the samemodule (Fig. 3B). The only exception is ovarian cancer, which doesnot appear in any of the significant modules although 70% of thesamples are recovered as members of the same extended module aslower-grade glioma. The apparent cause is the tight associationbetween ovarian cancer and TP53, which is mutated in almost 90%of samples, against a low background of somatic mutations (40).We confirmed the existence of a diffuse (nonspecific) mode of

driver accumulation by running the module-detection pipeline

without removal of nonsignificant members. The resulting set of“statistically relaxed” modules consists of 12 modules with acounterpart among the original (statistically significant) mod-ules, 16 minor modules with a single gene each and small samplesizes, and a giant, nonsignificant module that includes 20% ofthe samples and 90 (45%) genes (SI Appendix, Table S3). Theappearance of nonsignificant (pseudo)modules is a well-knownartifact that results from applying module detection algorithmsto large networks with partial or no modular structure (41, 42). Inthe context of cancers and the associated genes, the giant pseu-domodule accounts for the nonmodular (diffuse) component ofthe cancer mutation network, which comprises exchangeable cancergenes that are not specifically associated with particular groups oftumors. Cancer types differ with respect to their contribution to thiscomponent. Thus, only 10–20% of samples from cancer types thatare represented in the original modules are assigned to the diffusecomponent, whereas the fraction rises to more than 25% in othercancers (P < 0.01, Wilcoxon test) (Fig. 3 A and B).Taken together, these findings indicate that cancer types can

be conceptually split into two major groups: (i) those that accu-mulate driver mutations in specific sets of cancer genes and areaccordingly clustered into distinct modules (Table 1) and (ii) thosethat accumulate exchangeable driver mutations in a non–tissue-specific manner, such as stomach and lung cancers. Although mostcancer types can be clearly placed in one of those two extremecategories, there are some mixed cases (Fig. 3C). For example,bladder cancer generates its own module that includes genesKDM6A, FGFR3, and STAG2. However, only 10% of bladdertumors are assigned to that module (30% in the best-match ex-tension), whereas about 40% belong to the diffuse component andaccumulate mutations in a less specific manner (Fig. 2B). Similar,albeit less extreme, trends are observed for head and neck andtesticular cancers. Such a mixed mutational landscape, modularfor some subsets of the samples but nonspecific for others, seemsto mirror the heterogeneity of these cancer types.

Fig. 2. Cancer genes mutated at significantly distinct rates in different modules and cancer types. Tumors that do and do not belong to specificity modulesare shown in A and B, respectively. Owing to their “mixed” nature, bladder, prostate, head and neck, and testicular cancers appear twice, with samplesassigned to significant modules in A and the rest in B. Only genes that belong to specificity modules are shown. Significance was evaluated with a two-tailedFisher’s exact test; red and blue indicate that mutations in a gene are over- and underrepresented, respectively, in a group of tumors.

E6014 | www.pnas.org/cgi/doi/10.1073/pnas.1803155115 Iranzo et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 6: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

Copy Number Alterations. We further explored the specificity ofdriver events by jointly considering somatic mutations and copynumber alterations that affect cancer genes. To reduce thenumber of noninformative connections, only amplifications ofoncogenes and losses of TSGs were added to the network of so-matic mutations. The addition of copy number alterations does notsignificantly change the results described so far (SI Appendix, Ta-bles S4 and S5). The same significant modules were recovered,although their composition in terms of cancer types is slightly lessclear. Besides the modules described above, the addition of thecopy number alterations resulted in five new modules, none ofwhich showed an obvious correspondence with a particular cancertype. Four of these modules are associated with arm-level alter-ations affecting chromosome arms 7q (genes BRAF, EZH2, MET,and SMO), 9q (genes ABL1, GNAQ, and KLF4), and 17p (genesMAP2K4 andNCOR1) and the X chromosome region Xp11 (genesBCOR, GATA1, KDM5C, and KDM6A). The fifth of these newmodules accounts for the frequent loss of 56 cancer genes (mostlyTSGs) distributed across the genome. Moreover, some genes inwhich copy number alterations are more recurrent than somaticmutations, such as ERBB1 in breast cancer, become assigned tosignificant modules. Overall, the inclusion of oncogene amplifica-tions and TSG losses does not reveal the existence of specificmodules that were not already detected by the analysis of somaticmutations, although this outcome could be biased by the fact thatmost cancer genes in our study were identified through mutations.The minor changes in the network structure caused by the in-clusion of copy number alterations seem related to the chromo-somal location of cancer genes and the opposite trends towardamplification and deletion in oncogenes and TSG, respectively.

Average Number of Driver Mutations. Identification of driver muta-tions is confounded by the numerous passenger mutations that aretypically found in cancer genomes. Passenger mutations in thecoding regions appear to dominate even in cancer genes (9, 43),resulting in a strong correlation between the number of mutationsin cancer genes and noncancer (passenger) genes (R = 0.87, P <10−20) (SI Appendix, Fig. S2). To obtain an estimate of the numberof driver mutations in different tumors, we built a general linearmodel with the number of coding mutations (substitutions andsmall indels) in cancer genes as the dependent variable, the numberof coding mutations in putative passenger genes as the explanatoryvariable, and the cancer type as grouping factor. Due to the per-vasive abundance of passenger mutations, a major feature of themodel is the strong correlation between mutations in cancer genes

and noncancer genes. In this context, the intercepts (which dependon the cancer type) can be interpreted as the excess of mutationsin cancer genes that is not attributable to the same causes that leadto the accumulation of mutations in passenger genes. Thus, theseintercepts constitute a proxy for the number of driver mutations inknown cancer genes.We found that 75% of the variance in the number of muta-

tions in cancer genes is explained by the number of mutations inpassenger genes [P < 10−20, analysis of covariance (ANCOVA)],which is indicative of a common trend of mutation accumulationin both gene classes (Fig. 4A). Considering all tumors togetherand controlling for the nonuniform abundances of differentcancer types, the mean number of driver mutations per tumor inknown cancer genes was estimated as 1.72 ± 0.09 (95% CI).Differences in the intercepts across tumor classes are statisticallysignificant but explain only 4% of the total variability in the data(Fig. 4 A and B and SI Appendix, Table S6). Such low explanatorypower could be due to the heterogeneity of the samples withinthe same cancer type and the possible occurrence of driver mu-tations in noncoding regions or in genes that do not belong to ourlist of 198 cancer genes; indeed, 12% of the samples lack codingmutations in these genes. Both limitations can be mitigated byanalyzing only samples from significant modules, which representmore homogeneous subsets of tumors enriched in putative drivermutations. Thus, to assess how the number of driver mutationsvaries across tumors, we repeated the regression analysis with thesamples that were assigned to significant modules, using both theassignment to the modules and the cancer type as the groupingvariables (Fig. 4A and SI Appendix, Table S7). In samples fromsignificant modules, differences in the number of drivers acrosscancer types explain 20% of the observed variance in the numberof mutations in cancer genes. The predicted number of drivermutations in these samples ranges from values near 1 in testicularcancer, thyroid carcinoma, and the subset of melanoma with lowmutational load to values between 4 and 5 in bladder and endo-metrial cancers (Fig. 4C). The average number of drivers per tu-mor in colorectal cancer is 3.23 ± 0.26, consistent with previousestimates based on epidemiological models (21, 44) and measuresof positive selection in cancer genes (9). Notably, the estimates ofthe average number of driver mutations in the two groups ofmelanomas differ by almost 3 (3.52 ± 0.60 vs. 0.75 ± 0.45, re-spectively), which could be related to the higher mutation burdenobserved in the first group.Overall, both the number of drivers and the fraction of mu-

tations that are inferred to be drivers tend to be larger in samples

Fig. 3. Classification of cancer types according tothe gene specificity of their driver mutations. (A andB) Fraction of samples assigned to statistically signif-icant (solid bars) and best-match extended (semi-transparent bars) modules obtained by reassigningnonsignificant samples and genes to the significantmodules with which they share the largest number ofconnections. Black diamonds indicate the fraction ofsamples assigned to the largest nonsignificant pseu-domodule. Cancer types without major contributionsto any significant module are shown in B. Bar colorsrefer to the best-match extended module that con-tains most samples from each type. (C) Principalcomponent analysis of cancer types based on thefraction of samples assigned to statistically significantmodules, best-match extended modules, and thelargest nonsignificant pseudomodule. The percent-ages of the total variance explained by the first andsecond components are 88.5% and 8.6%, respectively.Special cases discussed in the text are labeled: BLCA,bladder cancer; HNSC, head and neck cancer; OV,ovarian cancer; PRAD, prostate cancer; TGCT, testicu-lar cancer.

Iranzo et al. PNAS | vol. 115 | no. 26 | E6015

MED

ICALSC

IENCE

SPN

ASPL

US

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 7: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

from the significant specificity modules. Moreover, the estimatednumber of driver mutations in these samples closely matches theaverage number of module-specific (i.e., intramodule) mutationsper tumor (P < 0.001, Spearman’s ρ = 0.815) (Fig. 4D). These twofindings indicate that module-specific genes are the principal targetof actual driver mutations in cancer types that belong to significantmodules. The only notable exception occurs for bladder cancer (seethe deviation from the 1:1 trend in Fig. 4D). In this cancer type, theaverage number of module-specific mutations falls well below theestimated number of drivers, suggesting the existence of drivermutations in cancer genes that do not belong to the module.

There is a strong positive correlation between the estimatednumber of drivers in a cancer type and the average age at whichthe respective cancers are diagnosed (Fig. 4E). This result holdsregardless of whether the number of drivers is estimated for thecomplete set of samples of the given cancer type (P = 0.003,Spearman’s ρ = 0.527) or for the samples assigned to signifi-cant modules only (P = 0.005, ρ = 0.693). After controlling forthe differences in the number of drivers, cancer types that areand are not associated with significant modules (i.e., with andwithout specific sets of cancer genes) differ significantly in theaverage age at diagnosis, the former appearing earlier in life

Fig. 4. Estimation of the average number of driver mutations per tumor in 198 cancer genes. (A) Regression between the number of coding mutations incancer genes (y axis) and noncancer genes (x axis). Colored circles correspond to samples from significant modules. The solid lines show the fit to an ANCOVAmodel with class- or module-specific slopes and intercepts, y = ðα+ αiÞ+ ðβ+ βiÞx + «, when considering all samples (gray) or samples from significant modules(colored). The global R2 of the model are 0.83 and 0.75, respectively (P < 10−20 in both cases). Vertical and horizontal axes were jointly scaled in all panels toallow comparison of slopes. (B and C) The intercepts that correspond to the estimated number of driver mutations are represented in B (all cancer types) and C(members of significant modules); error bars represent 95% CIs. (D and E) The number of drivers correlates with the number of intramodule mutations(Spearman’s rho = 0.815, P < 0.001) (D) and with the age at diagnosis (Spearman’s rho = 0.527, P = 0.003) (E).The solid line in D is a visual reference that showsthe 1:1 correspondence between the number of drivers and the number of intramodule mutations. Solid lines in E are fits to the curve y = aTx=ð1+ axÞ derivedfrom the model of Armitage and Doll (22), where T = 75 is the average lifespan in the absence of cancer and a is the proportionality constant between thenumber of drivers and rate-limiting steps (light gray, a= 2.5, all tumors; dark gray, a= 1.5, tumors from significant modules). (F) Comparison of the number ofdriver mutations in the main set of 198 cancer genes and in an extended set of 369 known cancer genes.

E6016 | www.pnas.org/cgi/doi/10.1073/pnas.1803155115 Iranzo et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 8: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

(P = 0.036, difference = 6.7 rank units, ANCOVA on rank-transformed data).Although the 198 genes included in the cancer mutation net-

work likely harbor a large fraction of driver mutations, recentstudies have extended the list of well-supported cancer genes be-yond those included in the network (45). To assess the contribu-tion of other cancer genes to tumor initiation, we extended theregression analysis to include a larger set of 369 high-confidencecancer genes (Fig. 4F). The average number of driver mutationsper tumor in the extended dataset is 1.96 ± 0.12, which constitutesa 10% increase compared with the estimate obtained with theoriginal set of 198 cancer genes (P < 0.001, Student’s t test).Larger differences were observed for bladder, head and neck,endometrial, and kidney (chromophobe) cancers, suggesting thatsome genes from the extended list have a key role in those types ofcancer. For the other cancer types, however, we conclude that the198 genes included in the cancer mutation network account formost of the driver substitutions and small indels that occur inknown cancer genes.

DiscussionIt is well established that tumors accumulate recurrent mutationsmore often in some genes than in others. This phenomenonunderlies the discovery of oncogenes and TSGs and has led tothe identification of central pathways for tumorigenesis and tu-mor progression (7, 8, 11, 13, 46). Here we went a step furtherand tested the extent to which the association between genes andcancer types is mutually specific and suffices to define coherent,biologically meaningful groups of tumors. Although related toprevious research on the detection of significantly mutated genesand tumor classification (29, 47, 48), our approach differed inthree major aspects. First, tumor samples were not classified apriori based on their histology, which enabled us to test whetherdifferent cancer types are distinguishable by comparison oftheir mutational landscapes alone. Second, genes and sampleswere jointly clustered in a single step, so that the resultingnetwork modules reflect mutual specificity. Third, because ournetwork-based clustering is conducive to rigorous statisticaltesting, we could discriminate between cancer types that do anddo not show a significant degree of specificity in their sets ofmutated genes.Previous studies involving 12 major cancer types have shown

that tissue-specific clusters can be automatically identified fromgenomic and transcriptomic features, suggesting the existence ofa consistent molecular basis for a tissue-based classification oftumors (29). Our analysis provides a generalization of that resultto a more diverse dataset that included 30 cancer types and 198cancer genes, revealing major differences among cancer types. Thus,colorectal, pancreatic, endometrial, kidney (clear cell), breast,thyroid, and brain cancers, and acute myeloid leukemia, sarcoma,mesothelioma, melanoma, and uterine carcinosarcoma are sig-nificantly associated with mutations in tissue-specific sets of genes.In contrast, stomach, esophagus, and lung cancers, among others,follow a more diffuse, less specific mode of driver accumulation.Some cancers, such as bladder, prostate, testicular (germ cell), andhead-and-neck squamous cancer, show a mixed picture, with asignificant specificity of mutations observed in only a fractionof samples.The observed specificity patterns of cancer genes could originate

from at least two, not mutually exclusive, causes. Biases in mutationand/or repair could make some tissues more prone to accumulatingmutations in certain genes (e.g., due to differences in transcriptionlevels, chromatin configuration, and exposure to mutational pro-cesses) (49), although such biases are unlikely to account for thelarge differences observed across cancer types (50). A more im-portant factor is the tissue specificity of the pathways that controlcell proliferation, which have to be overcome for tumor progres-sion through mutations in different genes (51). This view is sup-

ported by experimental research on the functional mechanisms bywhich APC and KRAS mutations lead to colorectal cancer (52, 53)and combined VHL-BAP1 mutations lead to clear cell renal cellcarcinoma (54). Along similar lines, recent analysis of synonymousand nonsynonymous substitutions in cancer genomes has shownthat positive selection promotes the fixation of somatic mutationsin a gene- and tissue-specific manner, implying that selectionpressures during tumorigenesis are tissue specific (9, 10).The absence of detectable specificity patterns in some cancers

might be affected by inherent limitations of community-detectionalgorithms on large, partially modular networks such as the one weanalyzed here. In particular, small sample size could compromisethe identification of modules for thymoma, adrenocortical, cervi-cal, and kidney (chromophobe) carcinomas. Additionally, it couldbe hard to find specificity patterns in cancers with high mutationalload, such as lung and stomach cancer, due to their low signal-to-noise (drivers-to-passengers) ratio. Nonetheless, the detection ofsignificant modules for bladder cancer and melanoma, which bothhave high mutational loads (3, 55), implies that a high mutationburden does not critically affect the performance of the moduledetection algorithm. Finally, relevant specificity modules based oncopy number variation or gene rearrangements could remain un-detected if these large-scale mutations involve genes that are notconsidered here. It should be noted that the absence of specificsets of driver genes for some cancer types does not imply that suchcancers cannot be clustered on the basis of other molecular fea-tures, as has been shown for lung adenocarcinoma and lungsquamous carcinoma based on transcriptomics (29).The second major theme of this study is the estimation of the

number of driver mutations affecting cancer genes in differentcancer types. By comparing the number of mutations in cancergenes and noncancer genes, we inferred an average of approxi-mately two driver mutations (substitutions and small indels) inknown cancer genes per tumor, with significant variation (fromless than one to more than five) across cancer types. Our resultsgenerally agree with previously reported numbers based on mu-tations under positive selection, providing independent supportfor such values (9). Remarkably, there is a connection betweendriver mutations and tissue-specific cancer genes. Specifically, thenumber of driver mutations in different cancer types shows a1:1 correspondence (some minor variations notwithstanding) withthe number of intramodule mutations, that is, with the number ofmutations in tissue-specific genes.We further show that the number of driver mutations strongly

and positively correlates with the mean age of cancer onset, as onewould expect if the number of driver mutations was proportional(yet not necessarily equal) to the number of rate-limiting steps intumorigenesis (22, 24). Supporting this view, many of the tissue-specific cancer genes detected in this study are targets of muta-tions that appear as early clonal events in the trunk of single-tumor evolutionary trees and likely reflect crucial steps in tumorprogression (56). That is, for example, the case for VHL andPBRM1 in clear cell renal cell carcinoma (often accompanied byparallel subclonal mutations in SETD2) (57), for DNMT3A andNPM1 in acute myeloid leukemia (often with parallel subclonalmutations in FLT3) (58), for KRAS, TP53, and SMAD4 in pan-creatic cancer (59), and for APC, KRAS, and TP53 in colorectalcancer (60).Based on our observations, we propose that the cancer phe-

notype can be reached in two ways: via driver mutations in tissue-specific cancer genes (the modular mode) or via mutations in abroader set of genes, some of which have not yet been identified ascancer genes (the diffuse mode). Tumors from the same tissuetypically follow one of these routes preferentially, which makes itpossible to classify cancers as modular or diffuse. We found anintriguing link between the onset of a cancer and the specificity ofcancer genes: Modular cancer types (those that carry mutations inspecific genes) tend to appear earlier in life than expected given

Iranzo et al. PNAS | vol. 115 | no. 26 | E6017

MED

ICALSC

IENCE

SPN

ASPL

US

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 9: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

their estimated number of driver mutations. We suspect that thisdifference could be explained by the requirement for additionaldriver mutations in genes that are currently not classified ascancer-associated in the case of the diffuse cancer types and/or bystronger effects of driver mutations in the modular cancers.The results of this work show that rigorous statistical methods

for community detection in bipartite networks can shed light on therelationships between different types of tumors and cancer genes.The modularity of the gene–tumor network analyzed here is rela-tively low, with only about one in four tumor samples included insignificant modules. This fraction is likely to increase as more tu-mors are sequenced and additional cancer genes are identified.Nevertheless, the good agreement between the numbers of drivermutations estimated here and those obtained by other methods, aswell as the significant difference in the age of cancer onset betweenthe modular and the diffuse components of the network, suggestthat the difference between the two modes of carcinogenesisrevealed by this analysis withstands the test of time. This distinctionbetween tumors caused by driver mutations in limited sets of tissue-specific genes and those caused by mutations in interchangeablegenes that are only weakly linked to specific tumor types couldhave important implications for understanding cancer evolution aswell as for diagnostic and therapeutic approaches.

MethodsData: Tumors and Mutations. TCGA public mutation calls were downloadedfrom TCGA, https://tcga-data.nci.nih.gov/docs/publications/tcga/, on January2016. Mutations were reannotated with the Ensembl Variant Effect Pre-dictor (VEP) software (61); information on the affected gene, type of mu-tation, and coarse-grained impact was collected. The classification of genesinto oncogenes and TSGs was extracted from ref. 62.

Network Construction. An unweighted, undirected, bipartite network ofsomatic mutations in cancers was built by connecting tumor sample nodes togene nodes whenever a small mutation in the coding region was identified ina given gene in a given sample. For this purpose, the following were con-sidered as coding mutations: missense and nonsense substitutions, loss of astop codon, mutations affecting splice donor or acceptor sites, and in-frameand out-of-frame indels. To keep the network size tractable and maximizethe signal-to-noise ratio, only genes with a known associationwith cancer wereincluded in the network. Specifically, we used the list of 198 cancer genesreported in ref. 3, which includes the curated list of 174 mutated genes fromthe COSMIC database (version 73) (31) and any other gene from the CancerGene Census database found recurrently mutated in ref. 7. Samples withoutany mutation in cancer genes and samples with >3,000 coding mutations(hypermutators) were excluded from the network. The resulting networkconsisted of a single connected component with 7,665 samples and 198 genes,with a density of connections of 0.019.

Module Detection. Following recent methodological advances in networkanalysis, a consensus clustering approach was used to identify the modules ofthe network (63). In the first step, maximization of Barber’s modularity indexwas performed in 200 replicas of the network with the MODULAR software(simulated annealing algorithm, default parameters) (64), which yielded200 alternative partitions. To test, from a global perspective, if the cancermutation network has a significant modular structure, we used the Chung–Lu model (65) to generate 200 random bipartite networks with the sameexpected gene- and sample-degree distributions, ran MODULAR on them,and compared the modularities of the resulting partitions with those of thecancer mutation network, using a Welch’s t test. In the second step, the200 alternative partitions of the cancer mutation network were used tobuild a consensus matrix by assigning to each pair of nodes a value equal tothe fraction of replicas in which both nodes were assigned to the samemodule. A distance matrix was then defined as one minus the consensusmatrix, and the consensus partition was finally obtained by performing hi-erarchical clustering on the distance matrix [the unweighted pair groupmethod with arithmetic mean (UPGMA) method], implemented by thelinkage function in MATLAB version R2015a. The number of clusters waschosen to maximize the Barber’s modularity index of the consensus partitionwith respect to the original network. We refer to the clusters in this con-sensus partition (MODULAR + UPGMA) as the statistically relaxed modules.

The composition of these clusters is described in Two Major Modes of DriverAccumulation and in SI Appendix, Table S3.

To evaluate the significance of eachmodule separately and to filter out thegenes and samples that do not follow a modular pattern, we ran the OSLOM(Order Statistics Local Optimization Method) software (42) with the options-singlet -r 0 -hr 0 -t 0.05 -hint, using the modules from the previous step as thereference partition. The significance threshold was set to 0.05. With thesesettings, OSLOM evaluates the probability that nodes from a random (non-modular) network display the connection patterns observed in the refer-ence partition and removes those nodes that do not reach the requiredsignificance threshold. Additionally, OSLOM accounts for the existence ofoverlapping modules by reassigning nodes to more than one modulewhen statistically supported. The result is a filtered partition that includesonly nodes (genes and samples) with a statistically significant modularstructure, sometimes assigned to more than one module (see for examplethe double assignation of TP53 in Table 1). The main results of our studyare based on the analysis of these 12 statistically significant modules obtainedby the combination of MODULAR + UPGMA + OSLOM (Table 1). The samepipeline was used to study the structure of the joint network of somatic mu-tations and copy number alterations (Copy Number Alterations and SI Appendix,Tables S4 and S5).

Together with the statistically relaxed and statistically significant modules,we built a set of best-match extended modules by attaching every node that ismissing from the statistically significant partition to the module that bestmatches its pattern of connections. By construction, the best-match extendedpartition has the same number of modules as the statistically significant par-tition, but it includes all the genes and samples. The best-match extendedpartition was obtained as a secondary output of OSLOM (output filetp_without_singletons when the program is run with option -singlet),which reassigns the nodes that do not reach the significance threshold tothe module that minimizes their P value. The best-match extended modules aredescribed in Two Major Modes of Driver Accumulation, Fig. 3, and SI Appendix,Table S2.

To test whether the results were robust to the choice of the networkpartitioning method, we repeated the network analysis using Infomap asthe module detection software (66), both in the first (MODULAR) andsecond (UPGMA) steps of the consensus clustering pipeline. In the secondcase, all entries with values <0.4 in the consensus matrix were set to 0, andInfomap was run on the network defined by the consensus matrix con-structed with this threshold. Both the unipartite and bipartite versions ofInfomap (options set to find hard partitions with two levels of hierarchy)were run, followed by the assessment of module significance with OSLOM.In all cases, the results were similar to those presented here (SI Appendix,Table S8).

Estimation of the Number of Driver Mutations. To estimate the number ofdriver mutations in different cancer types, we considered all small codingmutations affecting cancer genes and noncancer (passenger) genes. Thefunction aoctool in MATLAB R2015a was used to fit the number of mutationsin cancer genes to a “separate lines” linear model with cancer type andnumber of mutations in passenger genes as independent variables. Thestatistical significance of the model parameters was evaluated with anANCOVA with cancer types as factors and the number of mutations inpassenger genes as covariable. The entire analysis was carried out twice.First, all the samples in the dataset, classified into cancer types, were an-alyzed. Then, the analysis was limited to the samples that were repre-sented in any of the significant modules. Under the second approach,cancer types represented by fewer than 20 samples were excluded, andcancer types assigned to more than one significant module were split toobtain module-specific estimates. Although the interaction term betweencancer types and number of mutations in passenger genes explainsonly <3% of the observed variance, we chose to keep it in the model (andconsequently use different slopes to fit each cancer type) based on thehigh statistical significance of this term (P < 10−20) and the difference inthe Akaike Information Criterion (AIC) between models with global andfactorwise slopes (ΔAIC > 100).

The association between the number of drivers and the age of diagnosis wasinitially evaluated through a Spearman’s correlation analysis. To further assesswhether cancer types associated or not with significant modules differ in the ageat diagnosis while controlling for the number of driver mutations, an ANCOVAwas performed on the rank-converted data with the number of drivers as acovariable and the inclusion into a significant module as a binary factor.

To extend our estimates of the number of driver mutations to knowncancer genes not included in the cancer mutation network, we used the list of369 significantly mutated genes compiled in ref. 9.

E6018 | www.pnas.org/cgi/doi/10.1073/pnas.1803155115 Iranzo et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1

Page 10: Cancer-mutation network and the number and specificity of ... · mutations in 30 cancer types. We show that an unsupervised community-detection approach applied to the bipartite network

1. Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458:719–724.2. Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell 100:57–70.3. Martincorena I, Campbell PJ (2015) Somatic mutation in cancer and normal cells.

Science 349:1483–1489.4. Yates LR, Campbell PJ (2012) Evolution of the cancer genome. Nat Rev Genet 13:795–806.5. Pon JR, Marra MA (2015) Driver and passenger mutations in cancer. Annu Rev Pathol

10:25–50.6. Tokheim CJ, Papadopoulos N, Kinzler KW, Vogelstein B, Karchin R (2016) Evaluating

the evaluation of cancer driver genes. Proc Natl Acad Sci USA 113:14330–14335.7. Lawrence MS, et al. (2014) Discovery and saturation analysis of cancer genes across

21 tumour types. Nature 505:495–501.8. Kandoth C, et al. (2013) Mutational landscape and significance across 12 major cancer

types. Nature 502:333–339.9. Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic

tissues. Cell 171:1029–1041.e21.10. Weghorn D, Sunyaev S (2017) Bayesian inference of negative and positive selection in

human cancers. Nat Genet 49:1785–1788.11. Vogelstein B, et al. (2013) Cancer genome landscapes. Science 339:1546–1558.12. Vogelstein B, Kinzler KW (2004) Cancer genes and the pathways they control. Nat

Med 10:789–799.13. Garraway LA, Lander ES (2013) Lessons from the cancer genome. Cell 153:17–37.14. Paez JG, et al. (2004) EGFR mutations in lung cancer: Correlation with clinical response

to gefitinib therapy. Science 304:1497–1500.15. Druker BJ, et al. (2001) Efficacy and safety of a specific inhibitor of the BCR-ABL ty-

rosine kinase in chronic myeloid leukemia. N Engl J Med 344:1031–1037.16. Courtney KD, Corcoran RB, Engelman JA (2010) The PI3K pathway as drug target in

human cancer. J Clin Oncol 28:1075–1083.17. Van Allen EM, et al. (2014) Whole-exome sequencing and clinical interpretation of

formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medi-cine. Nat Med 20:682–688.

18. McFarland CD, Korolev KS, Kryukov GV, Sunyaev SR, Mirny LA (2013) Impact of deleteriouspassenger mutations on cancer progression. Proc Natl Acad Sci USA 110:2910–2915.

19. McFarland CD, Mirny LA, Korolev KS (2014) Tug-of-war between driver and passengermutations in cancer and other adaptive processes. Proc Natl Acad Sci USA 111:15138–15143.

20. Good BH, Desai MM (2014) Deleterious passengers in adapting populations. Genetics198:1183–1208.

21. Tomasetti C, Marchionni L, Nowak MA, Parmigiani G, Vogelstein B (2015) Only threedriver gene mutations are required for the development of lung and colorectalcancers. Proc Natl Acad Sci USA 112:118–123.

22. Armitage P, Doll R (1954) The age distribution of cancer and a multi-stage theory ofcarcinogenesis. Br J Cancer 8:1–12.

23. Knudson AG (2001) Two genetic hits (more or less) to cancer. Nat Rev Cancer 1:157–162.24. Nordling CO (1953) A new theory on cancer-inducing mechanism. Br J Cancer 7:68–72.25. Renan MJ (1993) How many mutations are required for tumorigenesis? Implications

from human cancer data. Mol Carcinog 7:139–146.26. Sawyers CL (1999) Chronic myeloid leukemia. N Engl J Med 340:1330–1340.27. Murphree AL, Benedict WF (1984) Retinoblastoma: Clues to human oncogenesis.

Science 223:1028–1033.28. Cancer Genome Atlas Research Network (2014) Comprehensive molecular profiling of

lung adenocarcinoma. Nature 511:543–550, and erratum (2014) 514:262.29. Hoadley KA, et al.; Cancer Genome Atlas Research Network (2014) Multiplatform

analysis of 12 cancer types reveals molecular classification within and across tissues oforigin. Cell 158:929–944.

30. Chang MT, et al. (2016) Identifying recurrent mutations in cancer reveals widespreadlineage diversity and mutational specificity. Nat Biotechnol 34:155–163.

31. Forbes SA, et al. (2015) COSMIC: Exploring the world’s knowledge of somatic muta-tions in human cancer. Nucleic Acids Res 43:D805–D811.

32. Zhang J, et al. (2011) International Cancer Genome Consortium Data Portal–A one-stop shop for cancer genomics data. Database (Oxford) 2011:bar026.

33. An O, Dall’Olio GM, Mourikis TP, Ciccarelli FD (2016) NCG 5.0: Updates of a manuallycurated repository of cancer genes and associated properties from cancer mutationalscreenings. Nucleic Acids Res 44:D992–D999.

34. Newman MEJ (2012) Communities, modules and large-scale structure in networks.Nat Phys 8:25–31.

35. Barber MJ (2007) Modularity and community detection in bipartite networks. PhysRev E Stat Nonlin Soft Matter Phys 76:066102.

36. Davoli T, et al. (2013) Cumulative haploinsufficiency and triplosensitivity drive an-euploidy patterns and shape the cancer genome. Cell 155:948–962.

37. Verhaak RG, et al.; Cancer Genome Atlas Research Network (2010) Integrated ge-nomic analysis identifies clinically relevant subtypes of glioblastoma characterized byabnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17:98–110.

38. Brennan CW, et al.; TCGA Research Network (2013) The somatic genomic landscape ofglioblastoma. Cell 155:462–477.

39. Shah MA, Denton EL, Arrowsmith CH, Lupien M, Schapira M (2014) A global assessmentof cancer genomic alterations in epigenetic mechanisms. Epigenetics Chromatin 7:29.

40. Cancer Genome Atlas Research Network (2011) Integrated genomic analyses ofovarian carcinoma. Nature 474:609–615.

41. Guimerà R, Sales-Pardo M, Amaral LA (2004) Modularity from fluctuations in randomgraphs and complex networks. Phys Rev E Stat Nonlin Soft Matter Phys 70:025101.

42. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically sig-nificant communities in networks. PLoS One 6:e18961.

43. Greenman C, et al. (2007) Patterns of somatic mutation in human cancer genomes.Nature 446:153–158.

44. Luebeck EG, Moolgavkar SH (2002) Multistage carcinogenesis and the incidence ofcolorectal cancer. Proc Natl Acad Sci USA 99:15095–15100.

45. Forbes SA, et al. (2017) COSMIC: Somatic cancer genetics at high-resolution. NucleicAcids Res 45:D777–D783.

46. Watson IR, Takahashi K, Futreal PA, Chin L (2013) Emerging patterns of somaticmutations in cancer. Nat Rev Genet 14:703–718.

47. Leiserson MD, et al. (2015) Pan-cancer network analysis identifies combinations of raresomatic mutations across pathways and protein complexes. Nat Genet 47:106–114.

48. Hwang T, et al. (2012) Co-clustering phenome-genome for phenotype classificationand disease gene discovery. Nucleic Acids Res 40:e146.

49. Supek F, Lehner B (2015) Differential DNA mismatch repair underlies mutation ratevariation across the human genome. Nature 521:81–84.

50. Blokzijl F, et al. (2016) Tissue-specific mutation accumulation in human adult stemcells during life. Nature 538:260–264.

51. Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194:23–28.52. Preston SL, et al. (2003) Bottom-up histogenesis of colorectal adenomas: Origin in the

monocryptal adenoma and initial expansion by crypt fission. Cancer Res 63:3819–3825.

53. Snippert HJ, Schepers AG, van Es JH, Simons BD, Clevers H (2014) Biased competitionbetween Lgr5 intestinal stem cells driven by oncogenic mutation induces clonal ex-pansion. EMBO Rep 15:62–69.

54. Wang SS, et al. (2014) Bap1 is essential for kidney function and cooperates with Vhl inrenal tumorigenesis. Proc Natl Acad Sci USA 111:16538–16543.

55. Chalmers ZR, et al. (2017) Analysis of 100,000 human cancer genomes reveals thelandscape of tumor mutational burden. Genome Med 9:34.

56. Turajlic S, McGranahan N, Swanton C (2015) Inferring mutational timing and re-constructing tumour evolutionary histories. Biochim Biophys Acta 1855:264–275.

57. Gerlinger M, et al. (2012) Intratumor heterogeneity and branched evolution revealedby multiregion sequencing. N Engl J Med 366:883–892.

58. Welch JS, et al. (2012) The origin and evolution of mutations in acute myeloid leu-kemia. Cell 150:264–278.

59. Yachida S, et al. (2010) Distant metastasis occurs late during the genetic evolution ofpancreatic cancer. Nature 467:1114–1117.

60. Sottoriva A, et al. (2015) A Big Bang model of human colorectal tumor growth. NatGenet 47:209–216.

61. McLaren W, et al. (2016) The Ensembl variant effect predictor. Genome Biol 17:122.62. Waks Z, et al. (2016) Driver gene classification reveals a substantial over-

representation of tumor suppressors among very large chromatin-regulating pro-teins. Sci Rep 6:38988.

63. Fortunato S, Hric D (2016) Community detection in networks: A user guide. Phys Rep659:1–44.

64. Marquitti FMD, Guimaraes PR, Pires MM, Bittencourt LF (2014) MODULAR: Softwarefor the autonomous computation of modularity in large network sets. Ecography 37:221–224.

65. Chung F, Lu L (2002) Connected components in random graphs with given degreesequences. Ann Comb 6:125–145.

66. Rosvall M, Bergstrom CT (2011) Multilevel compression of random walks on networksreveals hierarchical organization in large integrated systems. PLoS One 6:e18209.

67. Zhang P, et al. (2008) Clustering coefficient and community structure of bipartitenetworks. Phys A 387:6869–6875.

Iranzo et al. PNAS | vol. 115 | no. 26 | E6019

MED

ICALSC

IENCE

SPN

ASPL

US

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 2,

202

1