Overview

1 Introduction

2 Case Studies and Modeling Approaches
- QSTAR Project
- Microbiome Project
- Joint Modeling Approach

3 High Dimensional Surrogacy and Biomarker Detection
- Single Surrogacy for High Dimensional Data
- Different Surrogacy Measures
  - Multiple Surrogacy
  - Partial Surrogacy
  - Orthogonal Surrogacy
- Computational Aspects

4 Conclusion

High Dimensional Surrogacy: A Joint Modeling Approach

Rudradev Sengupta

October 4, 2018

Research Team

Affiliation: Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Belgium
Collaborators: Ariel Alonso Abad, Geert Molenberghs, Ziv Shkedy.

Affiliation: Janssen Pharmaceutical Companies of Johnson & Johnson, Beerse, Belgium
Collaborators: Luc Bijnens, Nolen Joy Perualila-Tan, Wim Van der Elst.


Clinical Trials

- Very slow, costly and inefficient development process.
- The choice of endpoint(s) to assess the drug efficacy plays an important role.
- Measuring the endpoint(s) can become difficult, time consuming and expensive.
- Surrogacy in clinical trials.

“Surrogate” and “True” Endpoint in Clinical Trials

- A “true” endpoint can be a response, a clinical outcome, a time to event, etc.
- A “surrogate” endpoint serves as a substitute for the “true” endpoint, as it can usually be measured more cheaply and conveniently.
- Before a surrogate is used as a substitute for the true endpoint, it should be validated.
- Statistical methods for the identification and evaluation of surrogate endpoints in randomized clinical trials have been developed over the last three decades.

Biomarkers in Clinical Trials

- A biomarker is an objectively measured and evaluated indicator of normal biological or pathogenic processes, or of pharmacologic responses to a therapeutic intervention.
- A surrogate marker is a biomarker intended to substitute for a clinical endpoint.
- All surrogate markers are biomarkers, but not all biomarkers can qualify as surrogate markers.

Biomarkers in Drug Discovery Experiments

- Understanding the mechanism of action of a new compound.
- Integrating multiple data sources.
- High dimensional data.
  - High dimensional surrogacy.

The Surrogacy Framework: Graphical Illustration

[Figure: path diagram with nodes X, Y and Z.]

- The surrogacy framework for two endpoints, X and Y.
- The variable Z represents a binary grouping variable.
- Surrogacy is measured by the association between the biomarker (X) and the clinical endpoint (Y) after adjusting for the grouping variable (Z).

Three Decades of Surrogacy

[Figure: timeline of surrogacy research, with the tracks labelled "Clinical Trials" and "High Dimensional Data / Non-clinical Trials".]

- 1989: Prentice. Surrogate endpoints in clinical trials.
- 1992: Freedman et al. Statistical Validation of Intermediate Endpoints.
- 1998: Buyse and Molenberghs. Individual Level Surrogacy.
- 2000: Buyse et al. Trial Level Surrogacy.
- 2005: Burzykowski et al. Evaluation of Surrogate Endpoints.
- 2007: Alonso and Molenberghs. Information Theory Approach.
- 2010: Lin et al. Biomarkers in pre-clinical and clinical microarray experiments.
- 2012: Van Sanden et al. Genomic biomarkers in microarray experiments.
- 2015: Verbist et al. Lessons learned from the QSTAR project.
- 2016: Alonso et al. Applied Surrogate Endpoint Evaluation with SAS and R.
- 2016: Perualila et al. Joint Model.
- 2018: Sengupta et al.

Main focus: different approaches to evaluate individual level surrogacy.


QSTAR Framework

- Data integration in drug discovery.
- Why is QSTAR important?

[Figure: diagram linking compound, target and biological processes.]

- Known: the chemical structure of the new compound.
- Unknown: targets and biological processes.
- The main idea: information about gene expression will help to understand the biological processes related to the new compound (i.e. understanding the mechanism of action).

QSTAR Data Structure

An indicator variable for the $k$-th fingerprint feature (FF) and the $i$-th compound:

$$
Z_{ki} =
\begin{cases}
1, & \text{if the } k\text{-th FF is present in the } i\text{-th compound},\\
0, & \text{otherwise.}
\end{cases}
$$
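As a minimal sketch of this data structure, the indicator matrix can be built from a list of the fingerprint features present in each compound. The analysis in this talk was done in R; the Python below is only illustrative, and the compound names, the feature identifiers and the function `indicator_matrix` are hypothetical.

```python
def indicator_matrix(features, compound_features):
    """Return Z as a list of rows, with Z[k][i] = 1 if feature k is present in compound i."""
    compounds = sorted(compound_features)          # fixed column order
    return [
        [1 if ff in compound_features[c] else 0 for c in compounds]
        for ff in features
    ]


# Hypothetical toy input: three compounds and the fingerprint features each one contains.
compound_features = {
    "cmpd_1": {"ff_a", "ff_b"},
    "cmpd_2": {"ff_b"},
    "cmpd_3": set(),
}
features = ["ff_a", "ff_b"]

Z = indicator_matrix(features, compound_features)  # one row per feature, one column per compound
```

Each row of `Z` then plays the role of the binary grouping variable for one fingerprint feature.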


transPAT Data

PAT: Pulsed Antibiotic Treatment model of pediatric exposures.

Hypothesis: a series of short, therapeutic-dose pulses of antibiotic administered early in life will perturb the intestinal microbiota and lead to long-lasting alterations in metabolic and immune profiles.

Exactly the same data structure.

[Figure: study design. Two “donor” mice, one with normal microbiota development and one under pulsed antibiotic (tylosin) treatment with PAT-altered microbiota (loss of some early-life protective bacteria); microbiota transfer into germ-free mice (groups of n=7 and n=8).]

transPAT Data Structure

[Figure: data structure diagram, including the intervention variable.]

- Similar setting as before, with three different data sources.
- Main goal:
  - The association between microbiome and immunity, taking the intervention into account.
  - Development of models to identify microbiome biomarkers.

CERTIFI Case Study

- A phase II study.
- Explores the association between the fecal microbiota and its role in the therapeutic response of Crohn's disease.
- Patients are treated with ustekinumab (UST; Stelara).
- Brings us back to the biomarker framework.
- Talk by Dea Putri (Session 6a, 16:35 - 16:55).


Model Formulation


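[Figure: path diagram. Z affects $X_j$ (effect $\alpha_j$) and Y (effect $\beta$); $X_j$ and Y are linked by $\rho_j$.]

- $\alpha_j$: fingerprint effect on the $j$-th gene.
- $\rho_j$: fingerprint-adjusted association between the gene expression and bioactivity data.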

Joint Model: Estimation and Inference

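Estimation, for gene $j$ and compound $i$:

$$
\begin{pmatrix} X_{ji} \\ Y_i \end{pmatrix}
\sim N\left[
\begin{pmatrix} \mu_j + \alpha_j Z_i \\ \mu_Y + \beta Z_i \end{pmatrix},
\Sigma_j
\right],
\qquad
\Sigma_j =
\begin{pmatrix} \sigma_{jj} & \sigma_{jY} \\ \sigma_{jY} & \sigma_{YY} \end{pmatrix},
\qquad
\rho_j = \frac{\sigma_{jY}}{\sqrt{\sigma_{jj}\,\sigma_{YY}}}.
$$

The model is fitted separately for each fingerprint feature $k$, so $Z_i$ stands for $Z_{ki}$ and the index $k$ is suppressed.

Inference:

$$
H_{0j}: \alpha_j = 0 \quad \text{vs.} \quad H_{1j}: \alpha_j \neq 0,
$$

$$
H_{0j}: \rho_j = 0 \quad \text{vs.} \quad H_{1j}: \rho_j \neq 0.
$$

- Gene-specific analysis, per fingerprint feature.
- BH-FDR multiplicity adjustment is applied.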
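The quantities in this model can be illustrated with a small sketch. Because both outcomes share the same binary covariate Z, the fitted means are the Z-group means, and the estimate of $\rho_j$ should coincide with the correlation of the group-centred residuals. The code below is a Python illustration under that assumption, not the `gls()`-based R code used in the talk; the function names and the toy data are hypothetical. It also includes a plain Benjamini-Hochberg adjustment of a list of p-values.

```python
from math import sqrt


def group_center(values, z):
    """Subtract from each value the mean of its Z group (0 or 1)."""
    means = {}
    for g in set(z):
        members = [v for v, zi in zip(values, z) if zi == g]
        means[g] = sum(members) / len(members)
    return [v - means[zi] for v, zi in zip(values, z)]


def group_effect(values, z):
    """Difference in means between Z = 1 and Z = 0 (estimate of alpha_j or beta)."""
    ones = [v for v, zi in zip(values, z) if zi == 1]
    zeros = [v for v, zi in zip(values, z) if zi == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def adjusted_association(x, y, z):
    """Correlation between X and Y after removing the Z-group means from both."""
    rx, ry = group_center(x, z), group_center(y, z)
    sxy = sum(a * b for a, b in zip(rx, ry))
    sxx = sum(a * a for a in rx)
    syy = sum(b * b for b in ry)
    return sxy / sqrt(sxx * syy)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):               # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy data: 6 compounds; the feature shifts both X and Y upwards,
# while within each group X and Y move in opposite directions.
z = [0, 0, 0, 1, 1, 1]
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [3.0, 2.0, 1.0, 8.0, 7.0, 6.0]

alpha_hat = group_effect(x, z)                 # fingerprint effect on the gene
rho_hat = adjusted_association(x, y, z)        # negative once Z is accounted for
adj_p = bh_adjust([0.01, 0.04, 0.03, 0.20])    # hypothetical p-values for four genes
```

In the toy data the unadjusted correlation between `x` and `y` is positive because Z shifts both, whereas the adjusted association is negative, which is the distinction the joint model is designed to capture.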


High Dimensional Surrogacy

- Computational solutions for an upscaled analysis.
- Surrogacy setting with multiple candidates that can serve as biomarkers.

Application of the Joint Model within a High Dimensional Setting

Modeling Aspects
- Single Surrogacy
- Multiple Surrogacy
- Partial Surrogacy
- Orthogonal Surrogacy

Computational Aspects
- Optimized implementation with R
- Parallel computing using a computer cluster


Example - EGFR Project

- Biomarkers (X): a 3595 × 35 transcriptomics matrix (3595 genes, 35 compounds).
- Primary endpoint (Y): the bioassay measurements (i.e. the pIC50 values), a vector of length 35.
- Z: a 138 × 35 binary matrix of fingerprint features, each row serving as a grouping variable.
- Per fingerprint feature, there are 3595 models to be fitted.

Example of One Gene (FOSL1)

[Figure: plots for gene FOSL1, annotated with ρ = −0.76.]

- Computation time to analyze one fingerprint feature and all 3595 genes: ∼ 377 seconds (on a laptop).
- Fingerprint effect on gene expression.
- Negative association.

Top 5 Differentially Expressed Genes with High Adjusted Correlation

Verbist et al. (2015) linked cell growth activity with downregulation of genes FOSL1 and FGFBP1 for a particular chemical feature.


Single Surrogacy

[Figure: path diagram with Z, $X_j$ and Y; $X_j$ and Y are linked by $\rho_j$.]

- Models to identify one biomarker at a time.
- Reduction in computation time to find one biomarker.


Multiple Surrogacy: Introduction

- Once a primary biomarker is known, can we add something more in the context of surrogacy?
- A subset of k genes is used as a biomarker: multiple adjusted association replaces single surrogacy.

Multiple Surrogacy: Model Formulation (I)

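- Considers a subset of k genes that can be used as a joint surrogate for pIC50.
- Example: genes in the same biological pathway that was found by the joint model.
- Van der Elst et al. (2018) extended the joint model:

$$
\begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ik} \\ Y_i \end{pmatrix}
\sim N\left[
\begin{pmatrix}
\mu_1 + \alpha_1 Z_i \\
\mu_2 + \alpha_2 Z_i \\
\vdots \\
\mu_k + \alpha_k Z_i \\
\mu_Y + \beta Z_i
\end{pmatrix},
\Sigma_k
\right].
$$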

Multiple Surrogacy: Model Formulation (II)

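$$
\Sigma_k =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} & \sigma_{1y} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} & \sigma_{2y} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} & \sigma_{ky} \\
\sigma_{y1} & \sigma_{y2} & \cdots & \sigma_{yk} & \sigma_{yy}
\end{pmatrix}.
$$

Adjusted correlation between two biomarkers $X_j$ and $X_l$:

$$
\rho_{jl} = \frac{\sigma_{jl}}{\sqrt{\sigma_{jj}\,\sigma_{ll}}}.
$$

Adjusted correlation between the $j$-th biomarker and the response, pIC50:

$$
\rho_{yj} = \frac{\sigma_{yj}}{\sqrt{\sigma_{yy}\,\sigma_{jj}}}.
$$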

Multiple Surrogacy: Model Formulation (III)

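The covariance matrix:

$$
\Sigma_k =
\begin{pmatrix}
\Sigma_{X,X} & \Sigma'_{X,Y} \\
\Sigma_{X,Y} & \sigma_{Y,Y}
\end{pmatrix}.
$$

Multivariate adjusted association:

$$
\gamma^2 = \rho^2_{Y, X_1, X_2, \ldots, X_k}
= \frac{\Sigma_{X,Y}\, \Sigma_{X,X}^{-1}\, \Sigma'_{X,Y}}{\sigma_{Y,Y}}.
$$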
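The formula for $\gamma^2$ can be evaluated directly once the blocks of the covariance matrix are available. The sketch below is a plain-Python illustration (the talk's own implementation is in R); the helper names `solve` and `multiple_adjusted_association` are hypothetical. As input it uses the rounded adjusted correlations reported elsewhere in this talk for pIC50, FOSL1 and MPHOSPH9 (−0.76, −0.26 and 0.66, the last one read from the scatterplot matrix). Since $\gamma^2$ does not depend on the scale of the variables, correlations can be used in place of covariances.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented copy of a
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def multiple_adjusted_association(sigma_xx, sigma_xy, sigma_yy):
    """gamma^2 = Sigma_XY Sigma_XX^{-1} Sigma_XY' / sigma_YY."""
    w = solve(sigma_xx, sigma_xy)                       # Sigma_XX^{-1} Sigma_XY'
    return sum(s * wi for s, wi in zip(sigma_xy, w)) / sigma_yy


# Rounded adjusted correlations: X1 = FOSL1, X2 = MPHOSPH9, Y = pIC50.
sigma_xx = [[1.0, 0.66], [0.66, 1.0]]
sigma_xy = [-0.76, -0.26]

gamma2 = multiple_adjusted_association(sigma_xx, sigma_xy, 1.0)
gain = gamma2 - (-0.76) ** 2                            # gain over FOSL1 alone
```

With these rounded inputs `gamma2` comes out at about 0.68 and `gain` at about 0.10, close to the 0.69 and 0.11 reported for MPHOSPH9; the small difference is what one would expect from rounding the correlations to two decimals.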

Gene FOSL1

FOSL1 is used as a known primary biomarker.

[Figure: plots for gene FOSL1, annotated with ρ = −0.76.]

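EGFR Project: Illustration when k=2

Joint model:

$$
\begin{pmatrix} \mathrm{FOSL1}_{i1} \\ X_{i2} \\ Y_i \end{pmatrix}
\sim N\left[
\begin{pmatrix}
\mu_{\mathrm{FOSL1}} + \alpha_{\mathrm{FOSL1}} Z_i \\
\mu_2 + \alpha_2 Z_i \\
\mu_Y + \beta Z_i
\end{pmatrix},
\Sigma
\right].
$$

- $\gamma^2 = \rho^2_{Y,\mathrm{FOSL1},X_2}$: joint surrogacy value of $X_2$ and FOSL1.
- $\rho_{Y,\mathrm{FOSL1}}$ and $\rho_{Y,X_2}$ are the marginal surrogacy values for FOSL1 and $X_2$, respectively.
- Gain in surrogacy = $\rho^2_{Y,\mathrm{FOSL1},X_2} - \rho^2_{Y,\mathrm{FOSL1}}$.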

EGFR Project: Multiple Surrogacy

Top 5 genes, sorted according to their multiple adjusted association, when used together with FOSL1:

Gene        $\rho_{Y,X_2}$   $\rho^2_{Y,\mathrm{FOSL1},X_2}$   Gain in surrogacy value
MPHOSPH9    -0.26            0.69                              0.11
TOP2A       -0.35            0.69                              0.11
MYO6         0.73            0.68                              0.10
PNISR        0.76            0.68                              0.10
EREG        -0.60            0.67                              0.09

$\rho^2_{Y,\mathrm{FOSL1}} = 0.58$.

Gain in surrogacy = $\rho^2_{Y,\mathrm{FOSL1},X_2} - 0.58$.

Top Genes

Density of multiple adjusted association for the remaining genes, given FOSL1:

[Figure: density plot titled "Multiple Adjusted Association for the remaining genes | FOSL1"; x-axis $\rho^2_{Y,\mathrm{FOSL1},X_2}$ (0.58 to 0.70), y-axis density; labelled genes: MPHOSPH9, TOP2A, MYO6, PNISR, EREG, KRT10.]

Example of the top gene, MPHOSPH9:

- $\rho_{Y,\mathrm{MPHOSPH9}} = -0.26$.
- $\rho^2_{Y,\mathrm{MPHOSPH9},\mathrm{FOSL1}} = 0.69$.

[Figure: scatterplot matrix of pIC50, FOSL1 and MPHOSPH9, with pairwise correlations −0.76 (pIC50, FOSL1), −0.26 (pIC50, MPHOSPH9) and 0.66 (FOSL1, MPHOSPH9).]


Partial Surrogacy (I)

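[Figure: path diagram with Z, $X_1$, $X_2$ and Y; edges labelled $\rho_{Y,X_1}$, $\rho_{Y,X_2}$ and $\rho_{X_1,X_2}$.]

- Adjusted association between Y and $X_1$: $\rho_{Y,X_1|Z}$.
- Adjusted association between Y and $X_2$: $\rho_{Y,X_2|Z}$.
- Adjusted association between $X_1$ and $X_2$: $\rho_{X_1,X_2|Z}$.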

Partial Surrogacy (II)

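- Partial surrogacy effect: surrogacy value of $X_2$, given $X_1$ and Z.
- For k = 2, the covariance matrix:

$$
\Sigma =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{1y} \\
\sigma_{21} & \sigma_{22} & \sigma_{2y} \\
\sigma_{y1} & \sigma_{y2} & \sigma_{yy}
\end{pmatrix}.
$$

- Partial adjusted association:

$$
\rho_{Y,X_2|X_1,Z} = \rho_{Y,X_2|X_1}
= \frac{\rho_{y2} - \rho_{y1}\rho_{12}}{\sqrt{(1-\rho_{y1}^2)(1-\rho_{12}^2)}}.
$$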
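The partial adjusted association is a direct function of the three pairwise adjusted correlations, so it can be checked by hand. The Python sketch below is illustrative only (the function name is hypothetical) and plugs in the rounded adjusted correlations shown in this talk for the EGFR project, with FOSL1 as $X_1$; the correlations with FOSL1 (0.66 and 0.46) are read from the scatterplot matrices.

```python
from math import sqrt


def partial_adjusted_association(rho_y2, rho_y1, rho_12):
    """Partial correlation of Y and X2 given X1, from Z-adjusted pairwise correlations."""
    return (rho_y2 - rho_y1 * rho_12) / sqrt((1 - rho_y1 ** 2) * (1 - rho_12 ** 2))


rho_y_fosl1 = -0.76

# MPHOSPH9: weak marginal association with pIC50, but informative given FOSL1.
p_mphosph9 = partial_adjusted_association(-0.26, rho_y_fosl1, 0.66)

# TCIRG1: moderate marginal association that vanishes once FOSL1 is known.
p_tcirg1 = partial_adjusted_association(-0.35, rho_y_fosl1, 0.46)

# Standard identity linking partial and multiple association for k = 2:
# gamma^2 = rho_y1^2 + (1 - rho_y1^2) * rho_{y2|1}^2
gamma2_mphosph9 = rho_y_fosl1 ** 2 + (1 - rho_y_fosl1 ** 2) * p_mphosph9 ** 2
```

With these rounded inputs the partial association is about 0.49 for MPHOSPH9 (0.51 is reported) and essentially zero for TCIRG1, and the identity gives a joint surrogacy value of about 0.68, in line with the reported 0.69.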

Graphical Illustration: Partial Surrogacy (I)

- Low correlation between all three variables.
- Low partial adjusted correlation between Y and $X_2$, given $X_1$.

[Figure: scatterplot matrix of Y, $X_1$ and $X_2$ with pairwise correlations 0.19 (Y, $X_1$), 0.0023 (Y, $X_2$) and 0.19 ($X_1$, $X_2$); residual plot of Y* against $X_2$* with $\rho_{Y,X_2|X_1} = -0.0396$. Legend: FP 0 = absent, 1 = present.]

Graphical Illustration: Partial Surrogacy (II)

- Three correlated variables.
- Low partial adjusted correlation between Y and $X_2$, given $X_1$.

[Figure: scatterplot matrix of Y, $X_1$ and $X_2$ with pairwise correlations 0.87 (Y, $X_1$), 0.85 (Y, $X_2$) and 0.98 ($X_1$, $X_2$); residual plot of Y* against $X_2$* with $\rho_{Y,X_2|X_1} = -0.0028$.]

Graphical Illustration: Partial Surrogacy (III)

- Three correlated variables.
- Relatively high partial adjusted correlation between Y and $X_2$, given $X_1$.

[Figure: scatterplot matrix of Y, $X_1$ and $X_2$ with pairwise correlations 0.82 (Y, $X_1$), 0.67 (Y, $X_2$) and 0.54 ($X_1$, $X_2$); residual plot of Y* against $X_2$* with $\rho_{Y,X_2|X_1} = 0.4587$.]

EGFR Project: Partial Surrogacy (I)

Density of partial correlation for all the genes, excluding FOSL1:

[Figure: density plot titled "Partial Correlation for the remaining genes | FOSL1"; x-axis $\rho_{Y,X_2|\mathrm{FOSL1}}$ (−0.4 to 0.6), y-axis density; labelled genes: MPHOSPH9, TOP2A, MYO6, PNISR, EREG, TCIRG1, MYC.]

Top 5 genes:

Gene        $\rho_{Y,X_2}$   $\rho_{Y,X_2|\mathrm{FOSL1}}$
MPHOSPH9    -0.26            0.51
TOP2A       -0.35            0.51
MYO6         0.73            0.49
PNISR        0.76            0.48
EREG        -0.60            0.47

Gene TCIRG1

[Figure: two scatterplots of pIC50 against TCIRG1 gene expression for fingerprint feature −442307337, with points marked by fingerprint feature absent or present. Observed data: unadjusted association −0.3569. Residuals: adjusted association −0.3476.]

- Negatively associated with pIC50.
- $\rho_{Y,\mathrm{TCIRG1}} = -0.35$.

EGFR Project: Partial Surrogacy (II)

- Three correlated variables.
- Zero partial adjusted correlation between pIC50 and TCIRG1, given FOSL1.

[Figure: scatterplot matrix of pIC50, FOSL1 and TCIRG1 with pairwise correlations −0.76 (pIC50, FOSL1), −0.35 (pIC50, TCIRG1) and 0.46 (FOSL1, TCIRG1); residual plot of pIC50* against TCIRG1* with $\rho_{\mathrm{pIC50},\mathrm{TCIRG1}|\mathrm{FOSL1}} = 0$.]


Orthogonal Surrogacy: Introduction

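[Figure: path diagram with Z, $X_1$, $X_2$ and Y; edges labelled $\rho_{Y,X_1}$ and $\rho_{Y,X_2}$.]

- Adjusted association between Y and $X_1$: $\rho_{Y,X_1|Z}$.
- Adjusted association between Y and $X_2$: $\rho_{Y,X_2|Z}$.
- $X_1$ and $X_2$ are conditionally independent: $\rho_{X_1,X_2|Z} = 0$.
- High partial surrogacy: $\rho_{Y,X_2|X_1,Z}$.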

Orthogonal Surrogacy

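$$
\Sigma =
\begin{pmatrix}
\sigma_{11} & 0 & \sigma_{1y} \\
0 & \sigma_{22} & \sigma_{2y} \\
\sigma_{y1} & \sigma_{y2} & \sigma_{yy}
\end{pmatrix}
\quad \text{and} \quad
P =
\begin{pmatrix}
\rho_{11} & 0 & \rho_{1y} \\
0 & \rho_{22} & \rho_{2y} \\
\rho_{y1} & \rho_{y2} & \rho_{yy}
\end{pmatrix}.
$$

- It is a special case of partial surrogacy.
- $X_1$ and $X_2$ are uncorrelated, but both are correlated with Y.
- High partial adjusted association between $X_2$ and Y, since $X_1$ does not explain the variation of $X_2$:

$$
\rho_{Y,X_2|X_1,Z} = \rho_{Y,X_2|X_1} = \frac{\rho_{y2}}{\sqrt{1-\rho_{y1}^2}}.
$$

Inference:

$$
H_0: \sigma_{12} = 0 \quad \text{vs.} \quad H_1: \sigma_{12} \neq 0.
$$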
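The orthogonal case follows from the general partial-association formula by setting $\rho_{12} = 0$. The short Python sketch below checks this numerically; the function names and the correlation values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def partial_adjusted_association(rho_y2, rho_y1, rho_12):
    """General partial correlation of Y and X2 given X1."""
    return (rho_y2 - rho_y1 * rho_12) / sqrt((1 - rho_y1 ** 2) * (1 - rho_12 ** 2))


def orthogonal_partial_association(rho_y2, rho_y1):
    """Special case rho_12 = 0: the partial correlation is rho_y2 / sqrt(1 - rho_y1^2)."""
    return rho_y2 / sqrt(1 - rho_y1 ** 2)


# Hypothetical adjusted correlations of two uncorrelated biomarkers with Y.
rho_y1, rho_y2 = 0.6, 0.5

special = orthogonal_partial_association(rho_y2, rho_y1)
general = partial_adjusted_association(rho_y2, rho_y1, 0.0)

# With uncorrelated biomarkers the joint surrogacy value is the sum of the squares.
joint = rho_y1 ** 2 + rho_y2 ** 2
```

Here the partial association (0.625) is larger than the marginal one (0.5): because $X_1$ explains part of Y but none of $X_2$, the contribution of $X_2$ stands out more clearly. A valid correlation matrix requires $\rho_{y1}^2 + \rho_{y2}^2 \le 1$ in this setting.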

Graphical Illustration: Orthogonal Surrogacy

- $X_1$ and $X_2$ are independent.
- High adjusted partial correlation between Y and $X_2$, given $X_1$.

[Figure: scatterplot matrix of Y, $X_1$ and $X_2$ with pairwise correlations 0.71 (Y, $X_1$), 0.58 (Y, $X_2$) and −0.14 ($X_1$, $X_2$); residual plot of Y* against $X_2$* with $\rho_{Y,X_2|X_1} = 0.9677$.]


Computational Issues

- Computation time for one fingerprint feature: ∼ 377 seconds.
- R code with a loop over all genes.
- For all fingerprint features and all genes: a loop over all genes, nested within a loop over all fingerprint features, takes around 14.45 hours.

Main question: how can the implementation be made faster when we have more data and want to do further analysis that utilizes all the data?
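The 14.45 hours can be reproduced from the per-feature timing and the dimensions of the EGFR data (138 fingerprint features, 3595 genes). The following Python lines are just this arithmetic; the variable names are illustrative.

```python
seconds_per_feature = 377        # reported time for one fingerprint feature, all genes
n_features = 138                 # fingerprint features in the EGFR project
n_genes = 3595                   # genes in the EGFR project

n_models = n_features * n_genes                           # joint models in the full analysis
total_hours = seconds_per_feature * n_features / 3600     # sequential run time in hours
seconds_per_model = seconds_per_feature / n_genes         # average time per joint model
```

This gives 496,110 joint models, about 0.1 seconds per model, and a sequential run time of about 14.45 hours, so the cost is driven by the number of models rather than by any single fit.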

Code Structure

- Loop over genes. For each gene:
  - gls(): joint model.
  - Summarize and combine results from the model.
- Summarize and combine results for all the genes.
- Computation time for all genes and all fingerprint features: ∼ 14.45 hours.

Distribution of Computational Time for Joint Model

All the functions used for the analysis fall into three groups:

- gls(),
- anova() and summary() functions, and
- all other functions, e.g. data.frame() and cor().

[Figure: bar chart of the percentage of total computational time for each group of functions (gls, anova + summary, others).]

Parallelization for the Joint Model

Using R packages:
- foreach package: foreach().
- parallel package: clusterApply(), clusterApplyLB().

Using the worker framework:
- It is a “master-slave” framework.
- Master: divides the bigger and more complex main problem into smaller subproblems and supplies them to the slaves.
- Slaves: finish the computations, return the results to the master and check for the next jobs assigned to them, if any.
- A user-specific parallelization framework in a cluster.
- Requires small tweaks to run the code, e.g. additional files or restructuring the code.
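The division of labour described above can be sketched with Python's standard `concurrent.futures` module. This is only an analogue of the pattern, not the R `foreach`/`parallel` code or the cluster worker framework used in the talk: `fit_joint_model` is a hypothetical placeholder for the per-gene model fit, and the identifiers are invented. A thread pool is used so that the sketch runs anywhere; for CPU-bound model fitting a process pool or a cluster scheduler would be needed to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fit_joint_model(feature, gene):
    """Hypothetical placeholder for one joint-model fit; not a statistical model."""
    return len(feature) + len(gene)


def fit_chunk(job):
    """Worker: process one job, i.e. one fingerprint feature and a chunk of genes."""
    feature, genes = job
    return [(feature, gene, fit_joint_model(feature, gene)) for gene in genes]


def run_master(features, genes, genes_per_job, n_workers):
    """Master: create the jobs, hand them to the workers and combine the results."""
    jobs = [(f, c) for f in features for c in chunks(genes, genes_per_job)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(fit_chunk, jobs))   # results keep the job order
    return [row for part in partial_results for row in part]


features = ["ff_a", "ff_b"]                  # hypothetical fingerprint features
genes = ["g1", "g2", "g3", "g4", "g5"]       # hypothetical genes
results = run_master(features, genes, genes_per_job=2, n_workers=3)
```

The `genes_per_job` parameter controls the granularity of the subproblems: one job per fingerprint feature corresponds to a chunk size equal to the number of genes, while smaller chunks allow more workers to be kept busy at the price of more results to gather.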

Upscaling the Analysis for EGFR Project

- Computational time for the complete analysis with a for loop over genes and fingerprint features: ∼ 14.45 hours.
- With the worker framework:
  - With one master and 138 workers, each worker running a for loop over 3595 genes for one of the 138 fingerprint features: 259.35 seconds.
  - With 880 cores and 190 genes per core: 97.64 seconds (67 seconds to fit the models and 30.64 seconds to gather the results from different cores and combine them).
  - 880/139 = 6.33 times more cores.
  - 259.35/97.64 = 2.66 times speedup.
- Sengupta et al. (2018), accepted in Journal of Biopharmaceutical Statistics.
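The ratios on this slide, and the speedups relative to the sequential run, follow from the reported timings. The Python lines below redo the arithmetic; the variable names are illustrative, and the scaling efficiency in the last line is a derived quantity that is not reported in the talk.

```python
sequential_s = 14.45 * 3600      # reported sequential run time, converted to seconds
t_139 = 259.35                   # one master and 138 workers (139 cores)
t_880 = 97.64                    # 880 cores, 190 genes per core

core_ratio = 880 / 139                       # how many times more cores
speedup_880_vs_139 = t_139 / t_880           # speedup gained from the extra cores
speedup_139_vs_seq = sequential_s / t_139    # speedup over the nested for loop
speedup_880_vs_seq = sequential_s / t_880
scaling_efficiency = speedup_880_vs_139 / core_ratio   # derived, not from the talk
```

Both parallel set-ups reduce the run time by a factor of several hundred compared with the nested loop, while going from 139 to 880 cores yields a 2.66-fold speedup for 6.33 times the cores; a plausible reason is that the 30.64 seconds needed to gather and combine the results does not shrink as cores are added.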

Density of Adjusted Correlation

All fingerprint features for one gene (FGFBP1):

[Figure: density of the estimated adjusted correlation (ρ̂) for gene FGFBP1 across all fingerprint features (x-axis from about −0.85 to −0.75); fingerprint feature −442307337 is marked.]

Gene FGFBP1 for a Particular Fingerprint Feature (-1592278635)

α̂ = 1.24239.

[Figure: density of the estimated fingerprint effect on gene expression (α̂) for gene FGFBP1; fingerprint feature −1592278635 is marked.]


Summary (I)

[Figure: diagram linking Treatment, Biomarker and Clinical Endpoint.]

- For drug discovery, some biomarkers are often known.
- Partial and orthogonal surrogacy allow us to evaluate the surrogacy value of adding possible biomarker(s), from different sources, to the primary biomarker.

Summary (II)

A similar approach can be implemented in other experiments as well:

- Joint model to identify microbiome biomarkers (talk by Dea Putri, Session 6a, 16:35 - 16:55).
- Multiple surrogacy in the context of microbiome data has been studied by Van der Elst et al. (2018).