USING GENE EXPRESSION DATA TO PREDICT CLINICAL ...cs229.stanford.edu/proj2016/report/Abell-UsingGene...REFERENCES [1] DHanahan,RAWeinbergHallmarksofCancer: TheNext Generation.Cell.2011;144(5):646-74

USING GENE EXPRESSION DATA TO PREDICT CLINICALINFORMATION IN SEVEN HUMAN CANCERS

Nathan AbellDepartment of Genetics

Stanford University School of Medicine

AbstractThe expansive heterogeneity of known cancers share one

key property - genetic and transcriptomic abnormalities. Dis-covering and quantifying the landscape of genomic abnor-malities is a long-standing goal of the cancer research com-munity. To this end, the Genomic Data Commons (GDC)has aggregated and standardized tens of thousands of exper-imental datasets from dozens of human cancers. Here, wedescribe our efforts to construct a pipeline for the predictionof a specific clinical feature (including a range of quantita-tive and categorical responses) on all available samples fora given cancer type. We then apply this pipeline to a set offeatures in seven human tissues, and obtain very accurateclassifiers for categorical clinical features, with comparablypoor performance for quantitative features.

INTRODUCTION AND BACKGROUNDCancer, beneath its heterogeneity, is a fundamentally ge-

netic disease. In many contexts, cancers arise from geneticdisregulation of cell growth circuits due to somatic muta-tion, which can have downstream effects on many processesincluding gene expression. [1, 2] Accordingly, significantefforts have been devoted to identifying predictive gene ex-pression signatures of human cancers, to both (a) understandcarcinogenic molecular drivers and (b) assist in diagnosisand sub-classification in a clinical context. These effortshave been fruitful, identifying molecular subtypes of manycancers that correlate well to clinical outcomes and otherdisease phenotypes. In particular, gene expression has anunsurprisingly clear effect in many cancers, and is highly pre-dictive of numerous morphological, cellular, and outcomephenotypes. However, many of these studies over the past5-10 years have only had limited datasets (relative to whatis presently available), with intensive analysis of individualcancer profiles using predictive and biomarker-identifyingunsupervised approaches [3–6]For this reason, we are interested in gene expression as

predictive of human cancer traits. The most significantmodern effort to scale-up this approach, in terms of tis-sue diversity and sample size, are consortium efforts likeThe Cancer Genome Atlas (TCGA) and Therapeutically Ap-plicable Research To Generate Effective Treatments (TAR-GET), both of which have generated massive, petabyte-scaledatasets. [9, 10] To integrate and standardize findings acrossresearch consortia, the NIH-sponsored Genomic Data Com-mons (GDC: https://gdc-portal.nci.nih.gov) was established,particularly to link clinical and genetic datasets. [11] Thisintegrative approach has enabled previously impossible stud-

ies, in particular the comparison of the genetic drivers ofcancer across tissue types, beyond their simple identifica-tion. [7, 8]

Our ultimate goal, then, is to design a pipeline capable ofobtaining, processing, and regressing/predicting outcomesfor any specific cancer type. Crucially, this pipeline must besufficiently modular that it can be automatically run for eachcancer type. Here, we present the details of the pipeline inits current state, and preliminary results for two of the largestGDC cancer types - kidney and breast.

DATASET AND CLINICAL FEATURESThe GDC, collectively, gene expression data for 29 tis-

sue/cancer types (e.g. brain) from 39 distinct projects (e.g.TCGA-lung). In the vast majority of cases, there is matchedclinical and biospecimen data for each tumor, though theattached clinical features vary substantially between tissues.In this study, we restricted our focus to seven specific hu-man cancers for practical reasons: bladder, brain, breast,kidney, lung, prostate, and skin. However, the describedprocedures could be applied to any tissue represented in theGDC database.We built a suite of functions for downloading all data of

a given tissue using the GDC API through the R langua-gae. [12] We downloaded RNA-sequencing data from theGDC Data Portal for our selected tissues, retrieving the readcounts per gene as quantified by HTSeq. [13] Gene expres-sion counts are notoriously skewed, with relatively smallnumbers of genes having very large expression counts. Forthis reason, normalizing expression data by the sum of allreads is subject to additional variability of there are largeupper outliers. Thus, we normalized the data by dividing allvalues of a given dataset by its 75th percentile value, follow-ing GDC recommendation. Notably, this is a within-samplenormalization, and does not involve sharing information (e.g.the mean) between individual gene expression profiles.

Thus, we obtain (for each tissue) a n by p matrix, where nis the sample number and p is the number of genes (60484in all tissues), where each sample corresponds to a vector of60484 real numbers. In Figure 1A, we show a hierarchicallyclustered heatmap (using average linkage and the hclustfunction in R) of the sample Pearson correlation matrix forone representative tissue (lung cancer).There are clear differences between different subsets of

tumors, and clear clusters of similar gene expression pro-files. This suggests that many of the underlying gene expres-sion measurements are highly correlated, which is stronglyexpected for biological and experimental reasons. This

Figure 1: A: Representative Clustered Sample CorrelationMatrix Heatmap - Lung Cancers; B: Sample Size for EachTissue

presents an immediate problem, as many predictors are al-most perfectly co-linear. Reducing the large set of genes toa representative set of variables is a crucial first task.Next, we assessed the available clinical data attached to

each sample, and removed those samples for which there wasinsufficient or missing clinical information. This requiredsubstantial pre-processing in a somewhat tissue-specificmanner, as we quickly learned that the standardization ef-forts of the GDC were clearly more oriented towards thegene expression measurements, rather than the clinical infor-mation. The final input datasets contained the sample countsshown in Figure 1B.

After subsetting the potential input samples, we selecteda set of outcomes for each tissue we would like to predictfrom the gene expression matrix. For each tissue, there are arange of quantitative (e.g. age at initial pathologic diagnosis),binomial (e.g. progesterone receptor positive/negative), andmultinomial (disease subtype) responses. In Figure 2, weshow a representative set of the kinds of features of interestfor kidney and breast cancers.

This reveals another issue with the dataset: many categor-ical variables do not have many samples in some categories.Thus, going forward, we made two modifications: first, forany binomial or multinomial prediction, we removed allclasses that did not contain at least 15% of the data; sec-ond, for all questions of tumor stage, we converted the StageI-IV labels to simple "Low" and "High", where stages Iand II were low, and III and IV were high. Overall, weused some features in all tissues (e.g. gender, age at diag-nosis, disease subtype) and some in a tissue-specific way(e.g. progesterone receptor status, breast specific). After thispreparation, the next step was to construct a pipeline thatcould accept a tissue-feature pair and output a set of fittedmodels and predictions.

METHODSWe begin, before any other step, by separating each tissue

into a 70/30 training/validation split. Wewill operate only on

0

200

400

histological_type

coun

t histological_typeKidney Clear Cell Renal CarcinomaKidney Papillary Renal Cell CarcinomaKidney Chromophobe

histological type: kidneyA

0

200

400

600

800

histological_type

coun

t

histological_type

Infiltrating Carcinoma NOSInfiltrating Ductal CarcinomaInfiltrating Lobular CarcinomaMedullary CarcinomaMetaplastic CarcinomaMixed Histology (please specify)Mucinous CarcinomaOther, specifyNA

histological type: breast

0

100

200

300

400

stage_event_pathologic_stage

coun

t


Stage IStage IIStage IIIStage IV

stage: kidneyB

0

100

200

300


coun

t


Stage IStage IAStage IBStage IIStage IIAStage IIBStage IIIStage IIIAStage IIIBStage IIICStage IVStage XNA

stage: breast

0

100

200

300

400

hemoglobin_result

coun

t

hemoglobin_result

ElevatedLowNormal

hemoglobin: kidneyC

0

200

400

600

breast_carcinoma_progesterone_receptor_status

coun

t

breast_carcinoma_progesterone_receptor_status

IndeterminateNegativePositiveNA

progesterone receptor: breastD

0

10

20

30

25 50 75

age_at_initial_pathologic_diagnosis

coun

t

stage: kidneyE

0

20

40

40 60 80

age_at_initial_pathologic_diagnosis

coun

t

stage: breast

Figure 2: Distribution of Several Clinical Attributes in Kid-ney (Left) and Breast Cancer (Right)

the training set (using cross validation within the training setto fit models), and evaluate the final models on the validationset.

Next, since we have set up a large set of predictors and acomparably small sample size (n < p), we expect we need tosomehow reduce the number of predictors in each dataset.Given the strong observed correlation between many fea-tures, we would like a reduced feature set that captures theimportant axes of variation, and deals appropriately withcorrelated predictors. I considered principal component re-gression and ridge regression, but discarded those since theywould not actually remove any individual predictor - thedimensions of the lower-dimensional space would still belinear combinations of many, perhaps most, of the predictors.

Thus, for a given tissue-feature pair, I applied lasso regres-sion using the glmnet package and extracted the list of allpredictors with non-zero coefficients based on the lambdawith the lowest mis-classification rate or RMSE based oncross-validation. [14, 15] For the quantitative variables, theapplication is direct using standard linear regression. Forcategorical variables, we model the prediction for a multino-mial model (or binomial, as a sub-case) as:

Pr (class = c|X = x) =exp(β0k + β

Tk

x∑Ki=1 exp(β0i | β

Ti x

)

The sparse β that minimizes the error for the lasso is then:

β = min(| |y =J∑j=1

X j β j | |22 + λ

J∑j=1| | β1 j | |2)

So, to extract a feature set and thus a model, all thatremains is to tune the λ parameter. I used 10-fold cross-validation to select a value of lambda, and see what ourmodel looks like after doing so.In Figure 3, we show an overview of our approach as

applied to each tissue-feature pair. After splitting off thevalidation data, we normalize as described previously withineach sample, then select non-zero features using the lasso.Taking that restricted subset, we then fit a set of models,automatically tune them using cross-validation, and test theresulting model fit on the held-out validation data to ob-tain either RMSE and R-squared values, for quantitativevariables, or accuracy rates and AUC (as appropriate) forcategorical variables.

Figure 3: Statistical Approach

Now that we have a substantially reduced feature space,we used the reduced expression matrix containing only pre-dictors with non-zero lasso coefficients as input to a setof classifiers. For quantitative variables, we attempted topredict the given outcome using (i) the coefficients fromthe original lasso fit, (ii) principal component regression,and (iii) support vector regression using a Gaussian kernel.These were chosen to represent a range of model complexity,from direct linear regression to the nonlinear SVMs. Forcategorical attributes, we used (i) the coefficients from theoriginal lasso fit, (ii) linear discriminant analysis, (iii) sup-port vector classification with a Gaussian kernel, and (iv)random forests. Again, this represents a mix of linear andnon-linear, more and less complex classifiers. When specifichyper-parameter tuning was necessary, a grid search wasapplied using the tune() function in R. Now, we go into somebrief detail about the non-lasso-based models we applied(since the lasso-based models simply apply the linear regres-sion coefficients obtained in the feature selection processwith no modification).

Principal Component RegressionWe implemented PCR using the pls package in R, and

used all principal components for prediction (since this wasafter variable selection, so each PC was a linear combinationof the very small, lasso-selected variable subset). [16]PCR is a standard linear regression, except we use the

principal component decomposition of the data variance-covariance matrix instead of the data directly when estimat-ing the regression coefficients. Formally, for the standard lin-ear model, coefficients are estimated by β = (XT X )−1XTY .

For PCR, instead of using XT X directly, we instead get theprincipal component decomposition, set that equal to a newmatrix, and plug in

β̂ = (XT X )−1XTY

XT X = PDPT

AT A = PDPT

ˆβPC = (AT A)−1 ATY

where D is a diagonal matrix of eigenvalues and P is theeigenvector matrix. This is equivalent to projecting all sam-ples to PC-space, and then regressing Y on all the principalcomponents.

Support Vector Regression and ClassificationTo implement support vector machines, we used the e1071

package in R, which applies the common LibSVM pack-age available in many languages. [17, 18] For categori-cal variables we used "C-classification" and for quantita-tive variables we used "epsilon-regression". Formally, C-classification SVMs solve the following optimization prob-lem:

minw,b,ε12wTw + C

l∑i

εi

s.t . yi (wT φ(xi) + b) ≥ 1 − εi,εi ≥ 0, i = 1, ..., l

Epsilon-regression solves a very similar optimizationproblem, which is formally:

minw,b,ε,ε∗12wTw + C

l∑i

εi + Cl∑i

εi∗

s.t. wT φ(xi) + b − zi ≤ ε + εi,

−wT φ(xi) − b + zi ≤ ε + εi∗,

εi, εi∗ ≥ 0, i = 1, ..., l

For each model, fitting was performed using the grid-search function tune in LibSVM, which found optimal valuesfor C and epsilon as appropriate and automatically used thosevalues for validation prediction.

Linear Discriminant AnalysisWe used the lda function in the R package MASS (Mod-

ern Applied Statistics with S). [19] As implemented, LDAassumes that the conitional probability of observing a datapoint given class, p(X |Y = 0) or p(X |Y = 1), follows a nor-mal distribution with different mean vectors and covariancematrices. Then, after fitting the parameters of the distribu-tions for the two classes, we compute the log likelihood ratiofor the probability of membership in class 1 over class 2, and

assign the data point accordingly (whether it is above or be-low some threshold). Prior probabilities were incorporatedinto the ratio by the lda function, and were the proportionsof each class label in the training data.

Random ForestsFinally, we used the randomForest package to perform

random forest classification only (though RFs can be appliedto regression problems too). [20] Briefly, the algorithm con-structs a large number of trees, each trained on a differentrandomly sub-sampled (with replacement) subset of the train-ing data. The number of trees is a tunable parameter, buthere, we simply set the number to 500 (the recommendeddefault by the software authors). Then, for a new data point,all trees are used to predict the class of the data point andthe class is determined by a majority vote or some func-tion of all individual tree predictions (possibly incorporatingweights). The software implements a variety of strategies tooptimize eventual accuracy, and these shape properties likewhich points are sub-sampled at each tree, how the trees areweighted, and so on.

RESULTS AND DISCUSSIONFor each tissue-feature pair, we computed the regulariza-

tion path for the fitting parameter lambda, and chose thelambda that produced the least complex model with the low-est error. Across tissue-feature pairs, this selected between7 and 158 genes out of over 60000. For some of the categor-ical features, to qualitatively evaluate the effect of featurereduction, we constructed principal component biplots be-fore ( 60000 features and after (7-158 features). In Figure 4,we show a representative example for one specific analysis:breast cancer disease subtype, specifically differentiatingbetween ductal carcinomas and lobular carcinomas. We firstshow the lasso regularization path, by log(lambda), withthe dashed lines representing the lambda with the lowestmis-classification rate, and the lambda with the least com-plex model with an error within one standard error of thelowest mis-classification rate. Then, using the latter valuefor lambda, we construct PC plots, showing an increasedseparation after feature selection, though dramatically fewergenes are used. This is all a good sign.Through analyzing the results from the lasso fit (so, not

using the validation data yet), we quickly realized two facts:first, some categorical properties were highly predictableacross tissues. For example, all disease subtypes (includingthe example in Figure 4) had fitted cross-validation mis-classification rates of less than 10%. Second, quantitativeresponses showed very high cross-validated mean squareerror rates - all time-based responses (like age at diagnosis)had an RMSE of at least five years, and the regularizationpaths did not look very clean (i.e. there was not a rapiddecrease from right to left in the analogous plot shown inFigure 4A). This leads to a suspicion, that turns out to becorrect, that categorical validations will perform well, whilequantitative validations will not.

Figure 4: A: Lasso Regularization Path for Breast CancerSubtype; B: PC Biplot Before Lasso Feature Selection; C:PC Biplot After Lasso Feature Selection. Points in B and Care colored by tumor subtype.

After obtaining the feature subsets for each tissue-featurecomparison, we proceeded to auto-fit each model describedin Methods (automating hyperparameter selection) and thenpredicting the feature in the validation set. For quantita-tive responses, we then computed RMSE and R2 valuescomparing actual and predicted responses. For categori-cal responses, we computed precision, recall, accuracy, andAUC values.

Given the somewhat large number of fit models, and ourspace constraints in this report, we report the validationstatistics for the top five quantitative tissue-feature pairs,where the pairs are sorted by maximum R2 (since differentquantitative features have different ranges/scales). We alsoreport one of the better and one of the mediocre categor-ical tissue-feature pairs. (Note: not shown here, we alsofit models for gender as a "positive control" while buildingthe pipeline. For all tissues, gender was perfectly separablebased on fewer than 10 genes, which is strongly expectedgiven the known biology.) As is clear from Figure 5C, mostof the quantative variables performed quite poorly, usuallybeing wrong (in the best cases!) by years, making them func-

tionally useless. While it is interesting that certain tissuestend to be in the top over others (like brain and breast), wefeel very few conclusions can be drawn from these regres-sions.

Figure 5: A: ROC Plot for Progesterone Receptor Status inBreast Cancer, with associated statistics; B: ROC Plot forStage in Bladder Cancer, with associated statistics; C: BestQuantitative Prediction Results

However, within the categorical classifiers, there was sub-stantial variation in predictability across tissue-feature pairs,and also between models within the same tissue-feature pair.As representative examples of this diversity, Figures 5A and5B show ROC plots for two tissue-feature pairs, proges-terone receptor status in breast cancer (positive/negative)and tumor stage in bladder cancer (low/high). Clearly, theprogesterone receptor status is highly predictable from geneexpression, regardless of the underlying model chosen, whilenot only is bladder stage less predictable overall, it is verymodel-sensitive, with LDA significantly outperforming thebinomial lasso and random forests, which themselves out-perform the SVM. We also note that we performed, but didnot show here, disease subtype for all seven tissues (part ofwhich was shown in Figure 4, for breast). In all of thosebinomial or multinomial models, accuracy rates were above0.5, reflecting a very strong ability to classify subtype bygene expression. Instead, we selected two other binomialanalyses that represent part of the range of what we observedwith categorical variables.

CONCLUSIONS AND FUTUREDIRECTIONS

Our general conclusion, based on these results, is that (i)gene expression, as such, may not contain much informationabout things like age, or at least not enough to be usefulin this context, and (ii) responses that are more "molecu-

lar" in nature tend to be better classified than responses thatare more "anatomical" in nature. Put differently, responseslike tumor subtype or progesterone receptor status are gen-erally obtained by a pathologist doing some experimentalprocedure to a tissue and obtaining a specific signal, whileproperties like tumor stage encompass a variety of subjec-tive inputs like tumor size, the overall health of the patient,metastasis, and so on. Thus, "molecular" responses may be"closer" to the prediction data (gene expression), since theyare obtained from experiments on cells, while the "gross"responses are a bit more distant (though, as we can see fromFigure 5B, still predictable!).

Going forward, there are a large number of additions thatcould improve both the statistical approach and the finalresults. Most simplistically, adding more tissues through thepipeline would broaden the scope of the work. Additionally,there exist other data types in the GDC, that typically assignsome value (for copy number, or methylation status) to eachbase pair in the genome. An interesting extension could beto associated each gene in the expression dataset with somevalue that represents its overall copy number/methylationstatus/other regulatory property, derived from the scores ofthe base pairs at that transcripts’ origin. Then, additionallayers of data could be integrated with the current approach.Finally, analyzing the intersection and overlap between spe-cific gene subsets selected by lasso could help to connect thecreated models to some underlying molecular biology, asopposed to classification or regression - that is a substantiallymore difficult task, as the annotations of biological functionson many genes are notoriously noisy/inaccurate, but wouldbe an interesting perspective to add.

ACKNOWLEDGEMENTSWe thank other students in the Departments of Genetics

and Chemical and Systems Biology for discussion and ideas,particularly directing us to the GDC datasets and suggest-ing some initial analyses to perform (especially to look atprogesterone receptors). This analysis was performed onpersonal computers, or on Stanford University computingclusters. All figures and documentation were prepared solelyby Nathan Abell.

REFERENCES[1] D Hanahan, RA Weinberg Hallmarks of Cancer: The Next

Generation. Cell. 2011;144(5):646-74.[2] IR Watson, K Takahashi, PA Futreal, L Chin . Emerging

patterns of somatic mutations in cancer. Nat Rev Genet.2013;14(10):703-18.

[3] K Kourou, TP Exarchos, KP Exarchos, MV Karamouzis, DIFotiadis.Machine learning applications in cancer prognosisand prediction. Comput Struct Biotechnol J. 2015;13:8-17.

[4] RG Verhaak, KA Hoadley, E Purdom et al. Integrated genomicanalysis identifies clinically relevant subtypes of glioblastomacharacterized by abnormalities in PDGFRA, IDH1, EGFR,and NF1.. Cancer Cell. 2010;17(1):98-110.

[5] S Zheng, AD Cherniack, N Dewal, et al. Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma. Can-cer Cell. 2016;30(2):363.

[6] Genomic Classification of Cutaneous Melanoma. Cell.2015;161(7):1681-96.

[7] KA Hoadley, C Yau, DMWolf, et al.Multiplatform analysisof 12 cancer types reveals molecular classification within andacross tissues of origin.. Cell. 2014;158(4):929-44.

[8] A Prat, B Adamo, C Fan, et al. Genomic Analyses across SixCancer Types Identify Basal-like Breast Cancer as a UniqueMolecular Entity. Sci Rep. 2015;5:8179

[9] https://cancergenome.nih.gov[10] https://ocg.cancer.gov/programs/target[11] https://gdc-portal.nci.nih.gov/

[12] R Core Team R: A Language and Environment for StatisticalComputing R Foundation for Statistical Computing, Vienna,Austria, 2016.

[13] S Anders, PT Pyl, W Huber HTSeq–a Python framework towork with high-throughput sequencing data. Bioinformatics.2015;31(2):166-9.

[14] R Tibshirani.Regression shrinkage and selection via the lasso.J. Royal. Statist. Soc B., 1996;58(1),267-288.

[15] J Friedman, T Hastie, R Tibshirani Regularization Paths forGeneralized Linear Models via Coordinate Descent. Coordi-nate Descent. Journal of Statistical Software, 2010;33(1),1-22.

[16] BH Mevik, R Wehrens. The pls Package: Principal Compo-nent and Partial Least Squares Regression in R Jour. of Stat.Soft. 2010;18(2):1-24.

[17] D Meyer, E Dimitriadou, K Hornik, A Weiingessel, F Leisch,C-C Chang, C-C Lin. e1071: Misc Functions of the Depart-ment of Statistics, Probability Theory Group https://cran.r-project.org/web/packages/e1071/index.html

[18] C-C Chang, C-J Lin. LIBSVM: a library for support vec-tor machines. ACM Transactions on Intelligent Systems andTechnology, 2001;2(27):1-27.

[19] WN Venables, BD Ripley. Modern Applied Statistics withS. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.2004.

[20] A Liaw, MWiener. Classification and Regression by random-Forest. R News 2002;2(3):18-22.

Documents

USING GENE EXPRESSION DATA TO PREDICT CLINICAL ...cs229.stanford.edu/proj2016/report/Abell-UsingGene...REFERENCES [1] DHanahan,RAWeinbergHallmarksofCancer: TheNext Generation.Cell.2011;144(5):646-74