57
nCounter Assay mRNA Capture & Reporter Probes Hybridize CodeSet to RNA Automated Process Remove excess reporters Bind reporter to surface Immobilize and align reporter Image surface Count codes slides courtesy of Sean Ferree, Nanostring Technologies

nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

nCounter Assay

mRNA Capture & Reporter Probes

HybridizeCodeSet to RNA

Automated Process

Removeexcess

reporters

Bindreporter

to surface

Immobilize and align reporter

Image surface

Count codes

slides courtesy of Sean Ferree, Nanostring Technologies

Page 2: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

nCounter Assay

HybridizeReporter to RNA

Automated ProcessnCounter Prep Station

Removeexcess

reporters

Bindreporter

to surface

Immobilize and align reporter

Image surface

Count codes

Hybridized mRNA

slides courtesy of Sean Ferree, Nanostring Technologies

Page 3: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

nCounter Assay

HybridizeReporter to RNA

Automated ProcessnCounter Prep Station

Removeexcess

reporters

Bindreporter

to surface

Immobilize and align reporter

Image surface

Count codes

Hybridized Probes Bind to Cartridge

Surface of cartridge is coatedwith streptavidin

slides courtesy of Sean Ferree, Nanostring Technologies

Page 4: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

nCounter Assay

HybridizeReporter to RNA

Automated ProcessnCounter Prep Station

Removeexcess

reporters

Bindreporter

to surface

Immobilize and align reporter

Image surface

Count codes

Immobilize and align reporter for image collecting and barcode counting

slides courtesy of Sean Ferree, Nanostring Technologies

Page 5: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

nCounter Assay

One coded reporter = 1 mRNA = 1 gene

HybridizeReporter to RNA

Automated ProcessnCounter Digital Analyzer

Removeexcess

reporters

Bindreporter

to surface

Immobilize and align reporter

Image surface

Count codes

slides courtesy of Sean Ferree, Nanostring Technologies

Page 6: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 7: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 8: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Prediction of late distant recurrence after 5 years of endocrine treatment: a combined analysis of patients from the Austrian breast and colorectal cancer study group 8 and anastrozole, tamoxifen alone or in combination randomized

trials using the PAM50 risk of recurrence score.Sestak et al., JCO 2014 (PMID: 25332252)

Page 9: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

24/28 Samples assayed by qPCR (FFPE) and NanoString (FFPE) and Microarray (Fresh Frozen) were grouped together using the PAM50 genes

(Bernard, Perou and Ellis, unpublished)

Page 10: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Current Gene Expression Technologies1. DNA microarrays

- Moderate precision covering ALL genes, fast, and inexpensive- long oligos (60-70mers/Agilent/Nimblegen) – 1-2 oligos/gene - short oligos (~25mers/Affymetrix) – 5-10 oligos/gene- cDNA arrays (100-1000bp PCR products/Stanford) – 1-2 cDNA clones/gene

2. Quantitative RT-PCR - high precision, large dynamic range, fast, and inexpensive- smaller number of genes assayed (compatible with Formalin-fixed, Paraffin

Embedded Materials (FFPE), which is the medical standard of care)

3. Nanostring nCounter- high precision, large dynamic range, fast, and moderate expense- single mRNA molecule counting capabilities (compatible with FFPE)

4. Illumina-based RNA-sequencing- Gene expression via Massively Parallel Sequencing (MPS)- high precision, large dynamic range, not fast, and expensive- compatible with FFPE

Page 11: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

DNA/cDNA(0.1-1 ug total RNA)

Single molecule arraySample

preparation Cluster growth5’

5’3’

G

T

C

A

G

T

C

A

G

T

C

A

C

A

G

TC

A

T

C

A

C

C

TAG

CG

TA

GT

1 2 3 7 8 94 5 6

Image acquisition Base calling

T G C T A C G A T …

Sequencing

Illumina Sequencing TechnologyReversible Terminator Chemistry Foundation

slides courtesy of Gary Schroth, Illumina Inc.

Page 12: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

100 Microns

Sequencing with the Illumina Hiseq2500

~300-400 Million Clusters Per Lane (16 lanes / run)(2x100bp read = ~1,000,000,000,000 bases / run)

slides courtesy of Gary Schroth, Illumina Inc.

Page 13: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 14: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

1. Spot X = Gene X2. Determine signal intensity at Spot X3. Signal intensity value is relative

measure of gene expression, which comes right out of the machine

4. Process is repeated for all spots,thus giving the complete data file

1. Spot X = ??? (have sequence 50-100bp)2. Map all sequences/spots to human

genome, or mRNA transcript library, etc.3. for each gene/transcript, count the

number of reads/sequences that“mapped” to it

4. determine “RPKM” value =Reads Per Kilobase Per Million mappedreads, or RSEM value (does isoforms)

5. Process is performed for all reads, givinga value of expression for each transcript

6. Mapping can be done at exon level also

Page 15: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Read Alignment to the Genome

ATGCCATTACACAGCGA

… CGATCTAACGTAGCTAGCTAGCTAGCTAGCATGCCATTACACAGCGAACAGGGAGCTTAGGCGC…

Read

Human Genome Reference

GTAGCTAGCTAGCTAGC GAACAGGGAGCTTAGG

CTAGCTAGCTAGCTAGC ACAGGGAGCTTAGGCGC

CGATCTAACGTAGCTAGC ATTACACAGCGAACAGG

GAACAGGGAGCTAAGG

ACAGGGAGCTAAGGCGC

For mRNA this would give us “9 counts” as a measure of expression.For DNA we have now obtained the sequence of this region and found a SNP

Page 16: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Genome Sequencing Overview

Sequence all fragments, align to reference genome and determine gene sequences= Whole Genome Sequencing (WGS)

Page 17: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Genome Sequencing Overview

Sequence all fragments= Whole Genome Sequencing (WGS)

Sequence all captured fragments, align, and determine gene sequences= ExomeSequencing (~2% of total genomic DNA)

Page 18: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

UNC RNA-Seq Bioinformatics Pipeline

Alignment

MapSpliceSTARBWA

BAM

COUNTS(exons,

transcript level)

RSEM data matrix

UNC Microarray Database

UNC Tools

SPLICING (novel transcripts &

splice junctions)

SRF

UCSC TranscriptsUNC TranscriptsHuman Genomeother databases

Text

MUTATIONS FUSIONS

MapSplice SNVmix, GATK, orSamTools

Text

MAF

MapSplice

GEO:Gene

Expression Omnibus

FASTQ

Joel Parker

Page 19: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

http://www.ncbi.nlm.nih.gov/gap

Challenges and issues concerning sequencing data1. these data are genetically identifiable, therefore proper patient consent is a must2. to put data in, and to take data out, requires proper scientific credentials3. these genetic data can’t be transferred to a 3rd party, and limitations on use may exist4. sequencing data sets are huge in size, and thus transporting from one place to

another, and long term data storage, are challenges

Page 20: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Benefits of sequencing-based platforms

1. Highly reproducible, great sensitivity, and a large dynamic range

2. no prior knowledge of sequence is needed

3. provides information on alternative splicing

4. provides information on mutations and other types of structural variants (i.e. gene fusions)

5. can identify foreign RNAs or DNAs (viruses or bacteria) depending upon the library protocol

Page 21: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Illumina Hiseq vs. Illumina GA2

-5 0 5 10

-50

510

Correlation of Log2(RPKM)

TCGA-AG-3583-01A-01R-0821-07 || HiSeq | 10

TCGA

-AG-

3583

-01A

-01R

-082

1-07

|| GA

T | 1

0091

4_

cor = 0.994cor=0.994

Page 22: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

-4 -2 0 2 4 6

-6-4

-20

24

6

Correlation of log2(Ratio) All (13

Microarray || 9830-020018B-244Kv5

brea

st tu

mor

|| 9

830-

0200

18B

-MG

| 10

0902

_UN

cor = 0.872

-4 -2 0 2 4

-6-4

-20

24

6

Correlation of log2(Ratio) Intrins

Microarray || 9830-020018B-244Kv5

brea

st tu

mor

|| 9

830-

0200

18B

-MG

| 10

0902

_UN

cor = 0.905

Agilent 244K microarray versus mRNA-seq(Log2 ratio created for mRNA-seq using Common Reference)

cor=0.872 cor=0.9

Page 23: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

236 TCGA tumors assayed by Agilent microarrays + mRNA-seq and co-clustered using 13,000 genes

mRNA-seq data was “platform” normalized to Agilent data

Page 24: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

XPN (PMID: 18325927)DWD (PMID: 14693816)

Andrew NobelSteve Marron

Page 25: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Gene Expression (mRNAs) microRNAs

DNA Copy Number DNA Methylation Somatic Mutations

tissue

mut class

Myeloid

TP53-related

PIK3-related

Eclectic

DNA-damage

VHL-related

Gene Expression (mRNAs)25,000-75,000/sample

microRNAs1500/sample

DNA Copy Number1,000,000/sample

DNA Methylation450,000/sample

Somatic Mutations25,000/sample

Data Set Integration

Page 26: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Gene Expression (mRNAs)

DNA Copy Number DNA Methylation Somatic Mutations

tissue

mut class

Myeloid

TP53-related

PIK3-related

Eclectic

DNA-damage

VHL-related

Gene Expression (mRNAs)300-500 Signatures/Modules

DNA Copy Number1,000,000/sample

DNA Methylation450,000/sample

Somatic Mutations25,000/sample

Clinical Data6 features/sample

Data Set Integration

Chris Fan

Page 27: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

The Data Set was 1) 5 public microarray data sets combined (DWD combined)2) Only patients used were those who received no systemic therapy (n=550) 3) Endpoint was relapse-free survival at 7 years4) Used 323 genomic signatures/modules and 6 clinical variables

Fan et al., BMC 2011 (PMID: 21214954)

Page 28: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Gene Expression Signatures/Modules

Some gene lists are heterogeneously expressed in a dataset, for example, in this gene clusterThe samples in the left part has about 3/2 genes down regulated (green), 1/3 up regulated (red)We cannot take median or mean of the gene expression to represent this gene list, because thesignal will cancel each other. In this case, we take 1st PC of the gene expression.

Examples = Proliferation gene set

D) Correlations to Centroids

Gene Expression data matrix converted to correlation values

B) Heterogeneously expressed gene list = 1st Principle Component of gene expression

A) Homogeneously expressed gene set = take median or mean expression for each patient

C) Predetermined Models = OncotypeDX, Mammaprint, PAM50 ROR-PT, and many more

Example = Score for correlation to each Subtype Centroid

Page 29: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Fan et al., BMC 2011 (PMID: 21214954)

Page 30: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Fan et al., BMC 2011 (PMID: 21214954)

Page 31: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Fan et al., BMC 2011 (PMID: 21214954)

Page 32: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Fan et al., BMC 2011 (PMID: 21214954)

ER+ Patients (n=395)All Patients (n=550)

Page 33: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Fan et al., BMC 2011 (PMID: 21214954)

ER+ Patients (n=395)All Patients (n=550)

Page 34: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Gene Expression (mRNAs) Proteins

DNA Copy Number DNA Methylation Somatic Mutations

Data Set Integration using Clinical, RNAseq, and DNAseq (exomes)

tissue

mut class

Myeloid

TP53-related

PIK3-related

Eclectic

DNA-damage

VHL-related

Gene Expression (mRNAs)300-500 Signatures/Modules

Proteins200/sample

DNA Copy Number1,000,000/sample

DNA Methylation450,000/sample

Somatic Mutations25,000/sample

Clinical Data6 features/sample

Page 35: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 36: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Adenocarcinoma

Squamous Cell

Large Cell

SCLC

Lung Cancer Subtypes

BronchioidMagnoidSquamoid

PrimitiveClassical

SecretoryBasal

• Non-small-cell lung carcinoma (NSCLC, 85%)– adenocarcinoma (LAD, 40%)

• Most cases of adenocarcinoma are associated with smoking, Adenocarcinoma is the most common form of lung cancer for never smokers

– Bronchioid– Magnoid– Squamoid

– squamous-cell lung carcinoma (SCC, 30%)– Primitive– Classical– Secretory– Basal (worst outcome)

– large-cell lung carcinoma (LC, 9%)• Small-cell lung carcinoma (SCLC, 10%-15%)

• These cancers grow quickly and spread early in the course of the disease. This type of lung cancer is strongly associated with smoking.

• Other Subtypes• glandular tumors• carcinoid tumors• adenosquamous carcinoma• sarcomatoid carcinoma• undifferentiated carcinomas

Lung cancer subtypes

We selected Adenocarcinomas because of its sample size and clinical features predicted outcomes

Page 37: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

356 TCGA LUAD patients with OS data and OS time > 20 days

Page 38: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Split data into Training and Testing Sets

Fisher's Test PGender Female 122 ( 60.4% ) 80 ( 39.6% ) 0.9134

Male 97 ( 61.4% ) 61 ( 38.6% )Pathologic_T T1 67 ( 61.5% ) 42 ( 38.5% ) 0.4676

T2 123 ( 60.9% ) 79 ( 39.1% )T3 21 ( 67.7% ) 10 ( 32.3% )T4 7 ( 43.8% ) 9 ( 56.3% )

pathologic_N N0 135 ( 60.0% ) 90 ( 40.0% ) 0.5474N1 45 ( 66.2% ) 23 ( 33.8% )N2 32 ( 57.1% ) 24 ( 42.9% )

Total 219 ( 60.8% ) 141 ( 39.2% )

Train Test

Test Train

4050

6070

80

age_at_initial_patholog

Test Train

4050

6070

80

age_at_initial_patholog

T.test P =0.862

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Split

Months

Over

all S

urviv

al (P

roba

bil

Test 37/141Train 57/219 log rank p=0.869

Page 39: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

pathologic_T

Months

Ove

rall S

urviv

al (P

roba

bility

)

1 15/1092 61/2023 9/314 9/16 log rank p = 0.00116

0 10 20 30 40 50 600.

00.

20.

40.

60.

81.

0

Train

Months

Ove

rall S

urviv

al (P

roba

bility

)

1 9/672 38/1233 7/214 3/7 log rank p = 0.0115

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Test

Months

Ove

rall S

urviv

al (P

roba

bility

)

1 6/422 23/793 2/104 6/9 log rank p = 0.0165

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

pathologic_N

Months

Ove

rall S

urviv

al (P

roba

bility

)

0 39/2251 31/682 22/56

log rank p = 3.4e-05

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Train

Months

Ove

rall S

urviv

al (P

roba

bility

)

0 23/1351 20/452 12/32

log rank p = 0.00131

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Test

Months

Ove

rall S

urviv

al (P

roba

bility

)0 16/901 11/232 10/24

log rank p = 0.00187

Split data into Training and Testing Sets

Page 40: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Cluster of 390 Gene Expression Signatures in 356 Lung Adenocarcinoma Patients(200 signatures from Perou Lab and 190 from other publications,

see Fan et al., BMC Medical Genomics, 2011 PMID:21214954)

Page 41: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

807 DNA Copy Number

Altered Regions(segments)

174 Mutations

390 GeneExpressionSignatures

356 LUAD Patients, 1371 features (3 data types)

Elastic Net Modeling using Multiple Data Types on TCGA Lung Adenocarcinoma Samplesdone by Chris Fan of the Lineberger Bioinformatics Group

Fan and Perou 2017, unpublished data, do not distribute

Page 42: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

• We used Overall Survival at 5 years as the “Supervising” prognostic endpoint

• COX Proportional Hazards Models built using 5-fold Cross Validation (CV) on a training data set (60% used for training)

• Predict onto testing data (40% used as test set)• Report training and testing sets Kaplan-Meier

survival plot p-values and Harrell's C-Index values

COX Model building in TCGA LUAD Clinical + RNA + DNA Copy Number + DNA Mutation using Elastic Net

Fan and Perou 2017, unpublished data, do not distribute

Page 43: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

• ELASTIC NET is a modeling approach that can be used to perform both feature selection (from multiple data types) and parameter estimation. It is a hybrid of Ridge Regression and Least Absolute Shrinkage and Selection Operator (LASSO) Regression. Like the LASSO, ELASTIC NET performs automatic feature selection and shrinkage to produce sparse models with high prediction accuracy.

• LASSO sometimes fails to do grouped feature selection, and it tends to select one feature from a group of correlated features and ignore the others. ELASTIC NET does not have this limitation, and seems to strike a good balance between selecting just one correlated feature versus selecting all correlated features (Hastie, http://www-stat.stanford.edu/~hastie/TALKS/glmnet.pdf )

COX Model building in TCGA LUAD Clinical + RNA + DNA Copy Number + DNA Mutation using Elastic Net

• α = 0 ---> Ridge regression• α = 1 ---> LASSO• 0 < α < 1 ---> Elastic Net

Page 44: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

mRNA Signaturesonly

UNC_MPYMT_NEU_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.151608UNC_MM_Green12_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.0436784UNC_MUnknown_20_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.0359065UNC_IGG_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.0186588UNC_MM_Red17_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.0165798UNC_VEGF_13genes_Median_BMC.Med.2009_PMID.19291283 0.0130258UNC_HS_Green19_Median_BMC.Med.Genomics.2011_PMID.21214954 0.0253536UNC_13q14_Amplicon_Median_BMC.Med.Genomics.2011_PMID.21214954 0.0391103UNC_Scorr_Basal_Correlation_JCO.2009_PMID.19204204 0.0514695UNC_MM_p53null.Basal_MedianFrom.Adam.Pfefferle_Pending_PMID.24220145 0.0877532Extensive_Residual_Diesase_ER54_Median_JAMA.2011_PMID.21558518 0.1121707

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Train

Months

Over

all S

urviv

al (P

roba

bility

)

low 15/109high 46/109

P = 7.25e-07

C-index = 0.729

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 14/69high 24/69

P = 0.0185

C-index = 0.662

Fan and Perou 2017, unpublished data, do not distribute

Page 45: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

DNA.Cytoband_19p13.12 -0.13738DNA.Cytoband_19q11 -0.07261DNA.Cytoband_6q11.1 -0.05323DNA.Cytoband_19p12 -0.039DNA.Cytoband_6q12 -0.03579DNA.Cytoband_1p36.11 -0.01076DNA.Cytoband_10p12.33 -0.00808DNA.Cytoband_10p15.3 -0.00275DNA.Cytoband_19p13.3 -0.00263DNA.Cytoband_22q13.32 0.003424DNA.Cytoband_22q13.33 0.003443DNA.Cytoband_16q24.3 0.005053DNA.Cytoband_8p11.21 0.010792DNA.Cytoband_3q26.31 0.013839DNA.Cytoband_8q12.1 0.041234DNA.Cytoband_16q24.2 0.044561DNA.Cytoband_14q21.2 0.07531DNA.Cytoband_Xp11.1 0.089138DNA.Cytoband_22q13.2 0.115788

0 10 20 30 40 50 60

0.00.2

0.40.6

0.81.0

Train

Months

Over

all S

urviv

al (P

roba

bility

)

low 17/109high 44/109

P = 7.1e-06

C-index = 0.719

0 10 20 30 40 50 60

0.00.2

0.40.6

0.81.0

Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 16/69high 22/69

log rank p=0.733

C-index = 0.527

DNA Copy Number Changes only

Fan and Perou 2017, unpublished data, do not distribute

Page 46: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Mutation_FAM75D1 -0.8966703Mutation_ZAN -0.7062429Mutation_MUC5B -0.6747349Mutation_LRRC7 -0.4113941Mutation_TSHZ3 -0.3721812Mutation_AHNAK2 -0.3242386Mutation_ANKRD20A9P -0.2811364Mutation_CUBN -0.2718021Mutation_HMCN1 -0.2343693Mutation_RYR2 -0.2308531Mutation_EPHA5 -0.1921632Mutation_CNTNAP5 -0.1647605Mutation_LAMA2 -0.1629955Mutation_SORCS1 -0.1595584Mutation_CSMD3 -0.130134Mutation_PCDH10 -0.1001778Mutation_ZNF208 -0.0930673Mutation_TPTE -0.0702816Mutation_BC080605 -0.0304059Mutation_DNAH7 -0.028821Mutation_STK11 0.00957196Mutation_EGFR 0.01215757Mutation_TRPS1 0.01481218Mutation_SSPO 0.09275657Mutation_USH2A 0.10742406Mutation_PRUNE2 0.13405142Mutation_GPR98 0.16798704Mutation_KEAP1 0.24747142Mutation_TP53 0.38780379Mutation_PCDHGC5 0.4229068Mutation_ANKRD30A 0.49702014Mutation_COL6A3 0.6345502

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Train

Months

Over

all S

urviv

al (P

roba

bility

)

low 17/121high 44/97

P = 4.86e-11

C-index = 0.785

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 16/69high 22/69

log rank p=0.2

C-index = 0.618

DNA Mutations(Exomes only)

Fan and Perou 2017, unpublished data, do not distribute

Page 47: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Mutation_FAM75D1 -0.33819UNC_MPYMT_NEU_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.15618Mutation_MUC5B -0.10541Mutation_ANKRD20A9P -0.07875Mutation_AHNAK2 -0.04182UNC_MM_Green12_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.02901DNA.Cytoband_19q11 -0.02897DNA.Cytoband_6q12 -0.02146UNC_MM_Red17_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.02084UNC_13q14_Amplicon_Median_BMC.Med.Genomics.2011_PMID.21214954 0.001087UNC_Duke_Module18_ras_Median_Mike 0.00544UNC_Scorr_Basal_Correlation_JCO.2009_PMID.19204204 0.028295DNA.Cytoband_22q13.2 0.042591Extensive_Residual_Diesase_ER54_Median_JAMA.2011_PMID.21558518 0.045894UNC_MM_p53null.Basal_MedianFrom.Adam.Pfefferle_Pending_PMID.24220145 0.060183

0 10 20 30 40 50 60

0.00.2

0.40.6

0.81.0

Train

Months

Over

all S

urviv

al (P

roba

bility

)

low 14/109high 47/109

P = 1.04e-08

C-index = 0.762

0 10 20 30 40 50 60

0.00.2

0.40.6

0.81.0

Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 14/69high 24/69

P = 0.00946

C-index = 0.676

mRNA Signatures+ DNA Copy Number+ DNA mutations

Fan and Perou 2017, unpublished data, do not distribute

Page 48: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

gender_female_0 -0.27192age_at_initial_pathologic_diagnosis 0.019087pathologic_N 0.528502pathologic_T 0.643398

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Train

Months

Over

all S

urviv

al (P

roba

bility

)

low 16/109high 45/109

P = 2.12e-05

C-index = 0.713

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 11/69high 27/69

P = 0.0308

C-index = 0.643

Clinical Dataonly

Fan and Perou 2017, unpublished data, do not distribute

Page 49: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

UNC_MPYMT_NEU_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.11342UNC_MM_Green12_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.03019UNC_IGG_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.0027age_at_initial_pathologic_diagnosis 0.004586UNC_13q14_Amplicon_Median_BMC.Med.Genomics.2011_PMID.21214954 0.02792UNC_Scorr_Basal_Correlation_JCO.2009_PMID.19204204 0.044371UNC_MM_p53null.Basal_MedianFrom.Adam.Pfefferle_PMID.24220145 0.066309Extensive_Residual_Diesase_ER54_Median_JAMA.2011_PMID.21558518 0.082409pathologic_N 0.176769

mRNA Signatures+ DNA Copy Number+ DNA mutations+ Clinical Data

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 11/69high 27/69

P = 0.0308

C-index = 0.643

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0 Test

Months

Over

all S

urviv

al (P

roba

bility

)

low 14/69high 24/69

P = 0.00946

C-index = 0.676

mRNA Signatures+ DNA Copy Number+ DNA mutations

Clinical Data Only

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Train

Months

Ove

rall

Sur

viva

l (P

roba

bilit

y)

low 12/109high 49/109

P = 9.29e-07

C-index = 0.755

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Test

Months

Ove

rall

Sur

viva

l (P

roba

bilit

y)

low 10/69high 28/69

P = 0.00074

C-index = 0.712

Fan and Perou 2017, unpublished data, do not distribute

Page 50: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

UNC_MPYMT_NEU_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.11342UNC_MM_Green12_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.03019UNC_IGG_Cluster_Median_BMC.Med.Genomics.2011_PMID.21214954 -0.0027age_at_initial_pathologic_diagnosis 0.004586UNC_13q14_Amplicon_Median_BMC.Med.Genomics.2011_PMID.21214954 0.02792UNC_Scorr_Basal_Correlation_JCO.2009_PMID.19204204 0.044371UNC_MM_p53null.Basal_MedianFrom.Adam.Pfefferle_PMID.24220145 0.066309Extensive_Residual_Diesase_ER54_Median_JAMA.2011_PMID.21558518 0.082409pathologic_N 0.176769

Systems Biology

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Train

Months

Ove

rall

Sur

viva

l (P

roba

bilit

y)

low 12/109high 49/109

P = 9.29e-07

C-index = 0.755

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0

Test

Months

Ove

rall

Sur

viva

l (P

roba

bilit

y)

low 10/69high 28/69

P = 0.00074

C-index = 0.712

Fan and Perou 2017, unpublished data, do not distribute

Page 51: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Big Data Cancer Genomic Computational Challenges

1. Identification of patterns/structure in the data using unsupervised and supervised learning methods are key approaches for Big Data Mining. Building formal classifiers based upon this structure provides an objective measure for assessing the value of the selected features

2. Methods like Elastic Net and LASSO can be used to develop predictors of outcomes, or response (binary variable), using single platform and/or multi-platform integrated data

3. Integration of data across technology platforms for biological discovery and assay development is becoming increasing important

4. Identification (and correction) of technical batch effects, or other systematic features present within large data sets, is a must for the effective mining of Big Data Resources

Page 52: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,

Homework

Reading listEisen et al., PNAS 1998 (PMID:9843981) Perou et al., NATURE 2000 (PMID:10963602) Parker et al., JCO 2009 (PMID:19204204)

TCGA Breast Cancer Genomic Data Siteshttps://tcga-data.nci.nih.gov/docs/publications/brca_2015/ (all open access TCGA Breast Cancer Data)https://lbg.unc.edu/~hoadley/BRCA.817.rsemg.uqnorm.counts.txt (all 20,000 gene expression values)https://lbg.unc.edu/~hoadley/BRCA.817.rsemg.uqnorm.counts.intrinsic.txt (~2000 gene “classification list”)

Take the “classification list” data, and make a hierarchical cluster. 1) I recommend using “Cluster 3.0” and under “adjust data” select log2 transform and median

center the genes. 2) Next select the “hierarchical” tab and select cluster (both genes and arrays), similarity metric

“centered” and cluster method “centroid linkage”. 3) The data can then be viewed and explored using “Java Treeview”. Look for and find the

“proliferation cluster”, which was showed earlier in this presentation

2/20/17 – Chuck Perou (Department of Genetics) – Introduction to Genomics and Big Data, and Cancer Subtype Class Discovery using gene expression data- Katie Hoadley (Department of Genetics) – Introduction to TCGA Data Portal

Page 53: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 54: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 55: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 56: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,
Page 57: nCounter Assay - bd2k.web.unc.edubd2k.web.unc.edu/files/2016/...FEB...v3-post_Part2.pdf · Challenges and issues concerning sequencing data 1. these data are genetically identifiable,