“alexes of all nations unite!” Epicenter Analysis in Cancer Alex Krasnitz, CSHL Search and knowledge building for biological datasets, UCLA, 11.26-30, 2007 •Input: segmented data from (ROMA) CGH. •A predictive signal: a whole-genome biomarker for survival. •Pinning algorithm. •Pins find cancer genes. •Pins predict tissue of origin. •Pins and progression.


  • alexes of all nations unite!

    Epicenter Analysis in Cancer

    Alex Krasnitz, CSHL

    Search and knowledge building for biological datasets, UCLA, 11.26-30, 2007

    Input: segmented data from (ROMA) CGH.

    A predictive signal: a whole-genome biomarker for survival.

    Pinning algorithm.

    Pins find cancer genes.

    Pins predict tissue of origin.

    Pins and progression.

  • (ROMA) CGH in vitro and in silico

    A method for measuring relative copy numbers of short fragments in a genome.

    A multistep process consisting of:

    Digestion with a restriction enzyme (BglII)

    PCR: short (0.2-1.2 kb) fragments are selected

    Hybridization to an oligonucleotide (50-mer probes) microarray (85K probe format used in the present study; higher-resolution work in progress)

    Gridding

    Normalization

    Segmentation

    Thresholding

    CNP masking

    Horizontal slicing

  • Raw and segmented ROMA profile; FISH validation of copy number variations detected by ROMA.

    Segmentation algorithm (B. Lakshmi, M. Wigler): replace a raw profile by a piecewise-constant function minimizing variance.
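
The idea of replacing a raw profile by a piecewise-constant function can be illustrated with a small dynamic program. This is a minimal sketch, not the Lakshmi/Wigler algorithm: it assumes the number of segments `k` is given in advance (the slides do not say how it is chosen) and minimizes the total within-segment sum of squared deviations. The names `segment`, `values` and `k` are hypothetical.

```python
def segment(values, k):
    """Split `values` into k contiguous segments minimizing total squared error.

    Returns a list of (start, end, mean) with end exclusive.
    Assumes 1 <= k <= len(values).
    """
    n = len(values)
    # Prefix sums give the squared error of any slice in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        # Sum of squared deviations from the mean of values[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of covering values[:j] with m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [(i, j, (s1[j] - s1[i]) / (j - i)) for i, j in bounds]


profile = [0.0, 0.1, -0.1, 1.0, 1.1, 0.9, 0.0, 0.0]
segments = segment(profile, 3)
```

On this toy profile the three segments line up with the two obvious jumps. The cubic-time loop is fine for illustration but would need a faster scheme for an 85K-probe profile.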

  • ROMA: SNPs and CNPs (Copy Number Polymorphisms)

    Figure labels: heterozygous, homozygous; cancer-free female x cancer-free reference male.

    CNPs and SNPs are genetic markers.

  • Typical tumor genomes are NOT normal. Still, they may contain CNPs that must be filtered out.

  • Genomic rearrangements in cancer (Bayani et al, Seminars in Cancer Biology 17, 5, 2007)

  • CNP masking: determine positions of frequent CNPs from a set of cancer-free genomes (~500 cases); excise these from cancer profiles in a minimally intrusive fashion.

  • Event identification: horizontal slicing

    Allow multiple events at a locus in a profile.

    Select vertically non-overlapping segments of maximal total length. These define tiers.

    Assign each remaining segment to the closest tier.

  • Breast cancer study

    257 frozen tissue samples of Scandinavian (140 Swedish, 117 Norwegian) origin.

    Accompanied by clinical documentation.

    * progesterone (PR) and estrogen (ER) receptors measured by ligand binding; pos=>0.5 fg/mg protein

    + ERBB2 amplification scored by ROMA as segmented ratio greater than 0.1 above baseline.

    Karolinska Inst. Sweden

    Group | Total | Node (pos/neg) | Median Age At Diag. | Grade I/II/III | Size (mm) 20 | PR* (+/-) | ER* (+/-) | ERBB2+ amp/norm
    Diploid (Survival >7 yr) | 60 | 28/31 | 52 | 8/11/33 | 19/41 | 41/9 | 43/7 | 3/57
    Diploid (Survival

  • A heuristic classification of breast cancer profiles: simplex, sawtooth and firestorm

    Simplex: small # of events overall & per chromosome

    Sawtooth: multiple events, no clustering

    Firestorm: multiple clustered events

  • Initial observation: firestorms lead to poor survival.

    Quantify the presence of firestorms by $F = \sum_i \frac{2}{\ell_i + \ell_{i+1}}$ (sum over inverse average lengths of adjacent segments; $\ell_i$ is the length of segment $i$).

    Is F a predictor of survival, and if so, is it independent of clinical parameters?

    Fisher's exact test: strong association with survival, no association with any clinical parameter except age at diagnosis.
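
The firestorm measure F, described on this slide as a sum over inverse average lengths of adjacent segments, can be sketched in a few lines. This is an illustration under assumptions: the input is an ordered list of segment lengths for one profile, the length unit is left to the caller, and the function name `firestorm_index` is hypothetical.

```python
def firestorm_index(segment_lengths):
    """Sum, over adjacent segment pairs, of 1 / (average length of the pair)."""
    pairs = zip(segment_lengths, segment_lengths[1:])
    return sum(2.0 / (a + b) for a, b in pairs)


# Two long segments versus a cluster of short ones (arbitrary units).
quiet = firestorm_index([50.0, 50.0])
stormy = firestorm_index([1.0, 1.0, 1.0, 1.0, 1.0])
```

Because short adjacent segments contribute large terms, a profile with tightly clustered breakpoints scores much higher than one with a few broad segments, which is the behaviour the slide asks of F.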

  • Kaplan-Meier (KM) plots for the Swedish diploid subset (no significant change when adjusted for age at diagnosis)

  • Search for epicenters

    Key assumption: observed amplifications and deletions are more likely than not to confer a selective advantage upon a neoplastic cell.

    If so, expect frequently amplified regions of the genome to be enriched in oncogenes.

    Methods for detecting such regions are required.

    A frequency plot is inadequate.

  • Potential Benefits

    Massive data reduction (O(10^5) probes to ~100 epicenters); a manageable set of predictors

    Disentanglement

    Target selection for functional studies (cancer gene finding)

  • Pinning

    Consider a smallest unit of the genome containing all its events (a chromosome).

    For a given N, find N positions within that unit that best explain the observed set of (amplification or deletion) events, i.e., N positions that are shared by the highest number k(N) of events.

    Multiple solutions occur, either due to a fuzzy pin or due to N being too low.

    Increment N until the increment I(N)=k(N)-k(N-1) reaches a pre-set minimal value. Note that I(N) is a non-increasing function of N.

    Pinning is convergent: it is guaranteed to recover the epicenters given enough data.

  • Greedy pinning is not optimal

    Figure panels: greedy, N=2 (5 out of 6); non-greedy, N=2 (6 out of 6).

    Required: exhaustive enumeration of all possible N-pin configurations.

    Pin positions: a fixed grid or determined by break points in the data.

    In the present data set: up to 5 pins per chromosome, O(100) pin positions.
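
The contrast between greedy and exhaustive pinning can be reproduced on a toy example. This is a minimal sketch under stated assumptions, not the author's implementation: events are closed intervals `(start, end)` on one chromosome, candidate pin positions are taken to be the distinct event start points, and the stopping rule reads "increment N until I(N) reaches a pre-set minimal value" as "stop once the gain drops below `min_gain`". All function names are hypothetical.

```python
from itertools import combinations


def covered(events, pins):
    """Number of events (start, end) that contain at least one pin."""
    return sum(any(s <= p <= e for p in pins) for s, e in events)


def candidates(events):
    """Candidate pin positions: distinct event start points.

    Closed intervals that share any point also share the largest of
    their start points, so no better position is missed.
    """
    return sorted({s for s, _ in events})


def exhaustive_pins(events, n):
    """Best n-pin configuration found by enumerating every n-subset."""
    best = max(combinations(candidates(events), n),
               key=lambda pins: covered(events, pins))
    return best, covered(events, best)


def greedy_pins(events, n):
    """Add one pin at a time, each maximizing the events covered so far."""
    pins = []
    for _ in range(n):
        rest = [p for p in candidates(events) if p not in pins]
        pins.append(max(rest, key=lambda p: covered(events, pins + [p])))
    return tuple(sorted(pins)), covered(events, pins)


def choose_n(events, min_gain, max_n=5):
    """Increase N while the gain I(N) = k(N) - k(N-1) is at least min_gain."""
    k_prev, n = 0, 0
    while n < min(max_n, len(candidates(events))):
        k_next = exhaustive_pins(events, n + 1)[1]
        if k_next - k_prev < min_gain:
            break
        k_prev, n = k_next, n + 1
    return n, k_prev


events = [(0, 2), (1, 5), (1, 5), (4, 8), (4, 8), (7, 9)]
greedy = greedy_pins(events, 2)
optimal = exhaustive_pins(events, 2)
```

On these six events the greedy search first takes position 4, which four events share, and can then reach only 5 of 6; the exhaustive search places pins at 1 and 7 and explains all 6. With up to 5 pins and O(100) positions per chromosome, as quoted on the slide, full enumeration stays affordable.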

  • Test of significance

    For the optimal N-pin solutions determine the event score k(N), and the gain I(N)=k(N)-k(N-1).

    Perform multiple whole-genome shuffles of the events, including those of the opposite sign. For each shuffle find its I(N). Estimate a p-value by comparison to the true I(N).
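
The shuffle test can be sketched as a permutation p-value for the gain I(N). Several simplifications here are assumptions rather than details from the slides: events are shuffled within a single unit of length `genome_length` instead of across the whole genome, each event keeps its length and receives a uniformly random start, events of the opposite sign are not modelled, and the p-value uses the common (hits + 1) / (shuffles + 1) estimate. The names `k_best`, `gain`, `shuffle_events` and `gain_p_value` are hypothetical.

```python
import random
from itertools import combinations


def k_best(events, n):
    """k(N): the largest number of events explained by any n pins."""
    if n == 0:
        return 0
    starts = sorted({s for s, _ in events})
    n = min(n, len(starts))
    return max(sum(any(s <= p <= e for p in pins) for s, e in events)
               for pins in combinations(starts, n))


def gain(events, n):
    """I(N) = k(N) - k(N-1)."""
    return k_best(events, n) - k_best(events, n - 1)


def shuffle_events(events, genome_length, rng):
    """Relocate every event to a random position, preserving its length."""
    shuffled = []
    for s, e in events:
        length = e - s
        new_start = rng.randint(0, genome_length - length)
        shuffled.append((new_start, new_start + length))
    return shuffled


def gain_p_value(events, n, genome_length, n_shuffles=200, seed=0):
    """Fraction of shuffles whose gain is at least the observed gain."""
    rng = random.Random(seed)
    observed = gain(events, n)
    hits = sum(gain(shuffle_events(events, genome_length, rng), n) >= observed
               for _ in range(n_shuffles))
    return (hits + 1) / (n_shuffles + 1)


# Eight short events that all contain position 500 on a long chromosome.
clustered = [(490 + i, 510 + i) for i in range(8)]
p = gain_p_value(clustered, n=1, genome_length=100_000)
```

Tightly clustered events of this kind are very unlikely to line up again after shuffling, so the estimated p-value is small; scattered events would give a value close to 1.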

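  • Interpretation of results: consider only the top-scoring pin configurations. Then, for pin #i in a top-scoring configuration, compute, at coordinate x, the score

    $s_i(x) = \sum_{e \in E_i,\; x \in e} \frac{1}{\ell_e}$

    (the sum is over the inverse lengths of the events pinned by #i and containing x; $E_i$ denotes the events pinned by #i and $\ell_e$ the length of event $e$)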

    Example: 17q, 5 pins
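
The per-pin score described on this slide, a sum over the inverse lengths of the events pinned by pin #i and containing x, can be sketched as follows. The assumptions: events are closed intervals `(start, end)` with `end > start`, and the caller already knows which events belong to the pin. The names `pin_score` and `score_profile` are hypothetical.

```python
def pin_score(pinned_events, x):
    """Sum of 1/length over the pinned events that contain coordinate x."""
    return sum(1.0 / (end - start)
               for start, end in pinned_events
               if start <= x <= end)


def score_profile(pinned_events, xs):
    """Evaluate the score along a list of coordinates."""
    return [pin_score(pinned_events, x) for x in xs]


# One broad and one narrow event assigned to the same pin.
pinned = [(0, 10), (4, 6)]
profile = score_profile(pinned, [1, 5, 12])
```

Short events dominate the score, so the profile peaks where narrow events overlap, which sharpens the location of the epicenter inside broader amplifications or deletions.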

  • Lung cancer deletions: known tumor suppressors and novel elements (213 cases, courtesy S. Powers)

  • Estimates of utility

    Goal: select the most promising 10% of the genome to focus functional studies on.

    Is pinning useful in this sense?

    A test: how enriched is the top-scoring 10% quantile in known genetic elements implicated in breast cancer?

    We hit major known oncogenes, so we can expect good results. More formally, perform a database search (top 10%, 17q).

  • Estimates of utility

    Database | Hits in region | Hits in top 10%
    Atlas of Genetics and Cytogenetics in Oncology and Haematology | 8 (annotated as amplified and/or overexpressed) | 8 (p=10^-8)
    NCBI map viewer | 184 (hits on breast cancer) | 64 (p=2×10^-16), likely overly conservative
    NCBI map viewer | 47 (genes implicated in breast cancer) | 10 (p=0.016), likely overly conservative
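
The first p-value in the table, 10^-8 for 8 of 8 hits, equals 0.1^8, which is what a binomial model gives if each database hit lands in the top-scoring 10% independently with probability 0.1. Whether the slide's p-values were computed this way is an assumption; the sketch below only shows that model, and the name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(hits_in_top, hits_in_region, fraction=0.10):
    """P(X >= hits_in_top) for X ~ Binomial(hits_in_region, fraction)."""
    return sum(comb(hits_in_region, j)
               * fraction ** j
               * (1.0 - fraction) ** (hits_in_region - j)
               for j in range(hits_in_top, hits_in_region + 1))


# All 8 annotated genes fall in the top-scoring 10% of the region.
p_atlas = binomial_tail(8, 8)
```

The same function can be applied to the other two rows by passing their counts, for example `binomial_tail(10, 47)`.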

  • Gene Enrichment

    Epicenters are enriched in (CCDS) genes compared to the genome and to the copy number events because (a) epicenters bracket genes and (b) genes are clustered.

  • Application: predicting tissue of origin

    Random forest classifier using joint sets of epicenters as predictors

    Organ 1 | Organ 2 | N1 (training) | N2 (training) | N1 (test) | N2 (test) | Training error 1 | Training error 2 | Test error 1 | Test error 2
    breast  | lung    | 129           | 107           | 128       | 106       | 0.18             | 0.24             | 0.18         | 0.21
    breast  | colon   | 129           | 69            | 128       | 68        | 0.02             | 0.29             | 0.02         | 0.19
    lung    | colon   | 107           | 69            | 106       | 68        | 0.09             | 0.36             | 0.08         | 0.21

  • Application: early events in breast cancer

    Compute frequency weighted by inverse number of events for contiguous groups of epicenters.

    Outliers: FISH-validated early 16p-1q translocation.
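
One possible reading of "frequency weighted by inverse number of events" is sketched below. The interpretation is an assumption: each tumor profile contributes a weight of 1 divided by the number of epicenter groups it hits, so lesions seen in otherwise quiet genomes count more. The input layout (a dict from sample name to the set of groups with an event) and the name `weighted_frequency` are hypothetical, and the sample data are invented for illustration only.

```python
from collections import defaultdict


def weighted_frequency(profiles):
    """Per-group frequency, each profile weighted by 1 / (its number of hit groups)."""
    freq = defaultdict(float)
    for groups in profiles.values():
        if not groups:
            continue  # a profile without events contributes nothing
        weight = 1.0 / len(groups)
        for group in groups:
            freq[group] += weight
    return dict(freq)


# Invented toy profiles, not data from the study.
toy_profiles = {
    "t1": {"16p"},
    "t2": {"16p", "1q"},
    "t3": {"1q", "17q", "8q", "20q"},
}
freq = weighted_frequency(toy_profiles)
```

Under this weighting a group that recurs in profiles with few events stands out as an outlier even if its raw frequency is unremarkable.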

  • Summary

    Pinning is a method for finding copy number variation epicenters in (cancer) genomes.

    Applied to: a set of 257 FISH-validated breast cancer genome profiles; lung and colon cancer sets.

    The epicenters found by pinning are significantly enriched in genes.

    Epicenters find tissue of origin.

    Epicenters detect early lesions.

  • ROMA-based Cancer Biology at CSHL

    Mike Wigler, Jim Hicks, Rob Lucito, Scott Powers, David Mu

    ROMA: Michael Riggs, Diane Esposito, Joan Alexander, Jen Troge, Evan Leibu

    Bioinformatics: Lakshmi Muthuswamy, Boris Yamrom, AK, Vlad Grubor, Yoon-Ha Lee, Tony Leotta, Jude Kendall, Deepa Pai, Andy Reiner, John Healy

    FACS/Database: Linda Rodgers

    FISH Primer Selection Program & Probes: Nicholas Navin

    FISH (Karolinska): Susanne Maner, Par Lundin

    Collaborators: Anders Zetterberg (Karolinska Inst.), Anne-Lise Borressen-Dale (Norway Radium Hosp.), Kenny Ye (Albert Einstein Sch. Med.), Thea Tlsty (UCSF), Larry Norton (MSKCC)

    Statistics: Xiaoyue Zhao, Chris Yoon