yunliweb.its.unc.edu · Web viewFor each block (with additional 100kb flanking both up-stream and down-stream), DISSCO imputes untyped markers based on typed markers within the block…

Supplementary material for: DISSCO: Direct imputation of summary statistics allowing covariates

Zheng Xu, Qing Duan, Song Yan, Wei Chen, Mingyao Li, Ethan Lang, Yun Li

S1 Proof of Corr (Z j , Zk)=Corr (G. j ,G.k ) in the absence of covariates

We first prove Corr ( β j , βk)=Corr (G. j ,G.k ) in S1.1 and then Corr (Z j , Zk)=Corr ( β j , βk) in S1.2.

S1.1 Proof of Corr ( β j , βk)=Corr (G. j ,G.k ) in the absence of covariates

In the absence of confounders, a GWAS single marker scan corresponds to fitting the following regression model for any marker l, l = 1, 2,…, P:

Y i=β0+β l Gil+εi,

for i=1,2 ,…, n , where ε i ' s follow independent and identical Gaussian distribution with mean zero.

In the simple linear regression framework, by least square formula, for two particular markers j and k, we have

β j=(01 ) ( [1n ×1G . j ]' [1n× 1G. j ] )−1

(1n× 1G. j )' Y , and

βk=(01 ) ( [1n× 1G. k ]' [1n×1 G.k ] )−1(1n× 1G. k )'Y.

Denote

A=(01 ) ( [1n ×1 G. j ]' [1n× 1G. j ] )−1

(1n× 1G. j )'=(0 1)( n ∑

i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2)

−1

(1n ×1'

G. j' ),

B=(01 ) ( [1n× 1G. k ]' [1n ×1 G.k ] )−1(1n× 1G. k )'=(01 )( n ∑

i=1

n

Gik

∑i=1

n

Gik ∑i=1

n

Gik2 )

−1

(1n× 1'

G. k' ) .

Then under the null hypothesis, we have

Cov ( β j , βk )=σ2 AB' , Var ( β j )=σ 2 AA ' and Var ( βk )=σ2 BB ' .

1

Thus Corr ( β j , βk)=(AB ' )/√AA ' BB '.

We have

A A'=(0 1 )( n ∑i=1

n

Gij

∑i=1

n

G ij ∑i=1

n

Gij2)

−1

(1n×1'

G. j' )( 1n× 1G. j )( n ∑

i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2)

−1

(01)

=(01 )( n ∑i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2 )

−1

( n ∑i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2)( n ∑

i=1

n

Gij

∑i=1

n

G ij ∑i=1

n

Gij2)

−1

(01)

=(01 )( n ∑i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2 )

−1

(01)=(0 1 ) 1

n∑i=1

n

Gij2−(∑

i=1

n

Gij)2 ( ∑i=1

n

Gij2 −∑

i=1

n

Gij

−∑i=1

n

Gij n )(01)

=(∑i=1

n

(Gij−G. j)2)

−1=(nVar (G. j))

−1.

Similarly, B B'=(n Var (G. j))−1.

Now, for A B',

A B'=(01 )( n ∑i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2)

−1

(1n ×1'

G. j' ) (1n ×1 Gk )( n ∑

i=1

n

Gik

∑i=1

n

Gik ∑i=1

n

Gik2 )

−1

(01)

= (0 1 )( n ∑i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2 )

−1

( n ∑i=1

n

Gik

∑i=1

n

Gij ∑i=1

n

Gij Gik)( n ∑i=1

n

Gij

∑i=1

n

Gij ∑i=1

n

Gij2)

−1

(01)

= 1

n∑i=1

n

Gij2−(∑i=1

n

Gij )2

1

n∑i=1

n

Gik2 −(∑i=1

n

Gik)2

(0 1 )( ∑i=1

n

Gij2 −∑

i=1

n

Gij

−∑i=1

n

Gij n )( n ∑i=1

n

Gik

∑i=1

n

Gij ∑i=1

n

Gij Gik)2

( ∑i=1

n

Gik2 −∑

i=1

n

Gik

−∑i=1

n

Gik n )(01)

= 1

n Var (G. j) Var (G.k )(0 1 )( G. j

2 −G. j

−G . j 1 )( 1 G. k

G . j G jk)( G.k2 −G.k

−G.k 1 )(01)

=1

n Var (G. j) Var (G.k )(0 1 )(¿ ¿

¿ G jk−G. j G.k )(01)= Cov(G. j , G.k )n Var (G. j) Var (G. k)

,

where * denote terms that will not change the final result due to zero’s to be multiplied.

Plugging in, we have

Corr ( β j , βk)= A B'

√A A ' B B' =

Cov (G. j ,G. k)n Var (G. j)Var (G.k )

√(n Var (G. j))−1(n Var (G. j))

−1=

Cov(G. j , G.k )

√ Var (G. j) Var (G. k)=Corr (G. j ,G. k)

.

S1.2 Proof of Corr (Z j , Zk)=Corr ( β j , βk) in the absence of covariates

Because Var ( β j )=σ 2 ( A A' )=σ2(n Var (G. j))−1 and Var ( βk )=σ2 ( AB )=σ2(nVar (G .k ))

−1, we have the formulae for Z statistics as

Z j=β j

√ Var ( β j )= β j √ nVar (G . j)

σ j2 and

Zk=βk

√Var ( βk )= βk√ nVar (G .k)

σ k2 ,

where σ j2 (or σ k

2) is the MLE estimator of σ 2 in the simple linear regression framework with only the

genetic marker j (or k) included. We know that σ j2→ σ2 and σ k

2→ σ2 in probability as n → ∞.

Now, we define two auxiliary random variables T j=β j

√Var ( β j )= β j√ nVar (G. j)

σ j2 and

T k=βk

√Var ( βk)= βk√ n Var (G.k )

σk2 . Employing the scale-invariant properties of the correlation function,

i.e.

3

Corr (C j R j ,C k R k )=Corr (R j ,R k) where C j, C k are scalars and R j, Rk are random variables, we have

Corr (T j ,T k )=Corr ( β j , βk ) because √ nVar (G . j)

σ j2

and √ nVar (G .k )

σ k2

are scalars.

Then by Slutsky theorem, we have ( Z j ,Zk ) and (T j ,Tk ) sharing the same asymptotic distribution, i.e.

( Z j ,Zk )=(T j √ σ j/σ , T k √σ k /σ )→ (T j ,T k ) in distribution. Thus they have the same correlations, i.e.

Corr (Z j , Zk)=Corr (T j ,T k ) . In summary, we have proved

Corr (Z j , Zk)=Corr (T j ,T k )=Corr ( β j , βk ).

4

S2 Proof of Corr (Z j , Zk)=Corr (G. j ,G.k∨C ) in the presence of covariates

We first prove Corr ( β j , βk)=Corr (G. j ,G.k∨C ) in S2.1 and then Corr (Z j , Zk)=Corr ( β j , βk) in S2.2.

S2.1 Proof of Corr ( β j , βk)=Corr (G. j ,G.k∨C ) in the presence of covariates

Considering the following single marker model with covariates for genetic markers j and k,

Y=γ ' C+β jG. j+ε and Y=γ ' C+βk G.k+ε,

where C=[1n ×1 C1 C2⋯CS ] represents the intercept vector and S covariate vectors. Thus by the formula of least square estimator, we have

β j=(01 ×(S+1)1 ) ( [C G. j ]' [C G. j ] )

−1(C G. j )

' Y and βk=(01 ×(S+1 )1 ) ( [C G.k ]' [C G.k ] )−1(C G.k )' Y

Denote

A=(01 ×(S+1)1 ) ( [C G. j ]' [C G. j ] )

−1( C G. j )

' and B=( 01 ×(S+1 )1 ) ( [C G.k ]' [C G.k ] )−1(C G.k )' .

Then under the null hypothesis, we have

Cov ( β j , βk )=σ2 AB' , Var ( β j )=σ 2 AA ' and Var ( βk )=σ2 BB ' .

Thus Corr ( β j , βk)=(AB ' )/√AA ' BB '. Next, we derive AA ', BB' and AB' .

First, notice that for an n × n matrix A=( A11 A12

A21 A22), where A11is a square matrix and suppose A, A11and

A22 are not singular, we have

A−1=( ( A11−A12 A22−1 A21)

−1 −A11−1 A12( A22−A21 A11

−1 A12)−1

−A22−1 A21( A22−A21 A11

−1 A12)−1 ( A22−A21 A11

−1 A12)−1 ).

We calculate

A A'=(01 ×(S+ 1)' 1)(C ' C C ' G. j

G. j' C G. j

' G. j)−1

(C 'G. j

' )(C G. j )(C ' C C ' G. j

G. j' C G. j

' G. j)−1

(01 ×(S+1)

1 )

5

= (01 ×(S+1 )' 1)(C ' C C ' G. j

G. j' C G. j

' G . j)−1

(C ' C C ' G. j

G. j' C G. j

' G. j)( C ' C C ' G. j

G. j' C G. j

' G. j)−1

(01 ×(S+1)

1 )=(01×(S+1 )

' 1)(C ' C C ' G. j

G. j' C G. j

' G . j)−1

(01×(S+1)

1 ) = (01×(S+1 )' 1 )(¿ ¿

¿ (G. j' G. j−G. j

' C (C ' C)−1 C ' G. j)−1)(01×(S+1)

1 )=(G. j

' G . j−G. j' C (C ' C)−1 C ' G. j)

−1 =(G. j' ( I n ×n−C (C' C )−1C ')G . j)

−1

=(G. j' ( I n ×n−PC)G. j)

−1 = (n Var (G. j∨C ))−1,

where * denote terms that will not change the final result, PC=C (C ' C )−1C ' is the projection matrix,

Var (G. j∨C ) is the sample conditional variance (conditional on C). The sample conditional variance and conditional covariance are

Var (G. j|C )=Var ( εG . j∨C )=n−1 G. j' ( I n ×n−PC )G . j ,

Var (G.k|C )=Var ( εG.k∨C )=n−1 G.k' ( I n ×n−PC )G. k ,

Cov (G. j , G.k|C )=Cov ( εG. j∨C , εG.k∨C )=n−1G. j' ( I n× n−PC ) G.k ,

where ε G. j∨C and ε G.k∨C are the residuals obtained by regressing G. j and G. kon C.

Similarly, we have B B'=(n Var (G. k∨C))−1.

We calculate

A B'=(01×(S+1)' 1)( C ' C C ' G. j

G . j' C G. j

' G. j)−1

(C 'G. j

' ) (C G. k )( C ' C C ' G.k

G. k' C G. k

' G. k)−1

(01×(S+1)

1 )=(01 ×(S+1 )

' 1 )(C ' C C ' G. j

G. j' C G. j

' G . j)−1

(C ' C C ' G.k

G. j' C G. j

' G. k)( C ' C C ' G. k

G. k' C G. k

' G. k)−1

(01×(S+1)

1 )=

(01×(S+1 )' 1 )( ¿ ¿

−(G . j' G. j)

−1(G. j' C)(C ' C−C ' G. j(G. j

' G. j)−1 G. j

' C)−1 (G. j' G . j−G. j

' C (C ' C)−1 C ' G. j)−1)

(C ' C C ' G.k

G. j' C G. j

' G. k)(¿ −(C' C)−1(C ' G. k)(G. k' G .k−G. k

' C (C ' C )−1C ' G. k )−1

¿ (G. k' G. k−G. k

' C (C ' C)−1 C ' G. k)−1 )(01× (S+1)

1 )=(−(G. j

' G. j)−1(G. j

' C )(C' C−C ' G. j(G. j' G. j)

−1G. j' C)−1(G. j

' G. j−G. j' C (C ' C)−1 C ' G. j)

−1)

6

(C ' C C ' G.k

G. j' C G. j

' G. k)(−(C ' C)−1(C ' G .k )(G.k

' G.k−G.k' C(C ' C)−1C ' G.k )

−1

(G. k' G. k−G. k

' C (C ' C)−1C ' G. k)−1 )

=(−(G. j' G. j)

−1(G. j' C)(C '( I−PG . j

)C )−1(G. j' (I−PC)G. j)

−1)

(C ' C C ' G.k

G. j' C G. j

' G. k)(−( C' C )−1

(C ' G.k )1 ) 1

G.k' ( I−PC)G. k

=[G. k' ( I−PC )G. k ]−1

¿

+(G. j' ( I−PC ) G. j )

−1G. j

' C (−( C' C )−1 (C ' G. k ))−(G. j' G. j)

−1(G. j' C) (C ' ( I−PG. j

)C )−1C ' G .k

+(G. j' ( I−PC ) G. j)

−1G. j' G .k ¿

=[G. k' ( I−PC )G. k ]−1

[ (G. j' ( I−PC ) G. j )

−1 G. j' C (−(C ' C)−1(C ' G. k))+(G. j

' ( I−PC ) G. j )−1G . j

' G.k ]

=[G. k' ( I−PC )G. k ]−1 (G. j

' (I −PC )G . j )−1 (G. j

' ( I−PC ) G. k )−1

=n−1( Var (G.k∨C))−1(Var (G. j∨C))−1Cov (G. j ,G. k∨C).

Plugging in, we have

Corr ( β j , βk)= A B'

√A A ' B B' =

Cov(G. j , G.k∨C)nVar (G. j∨C)Var (G .k∨C )

√(n Var (G. j∨C))−1(nVar (G. j∨C ))−1=

Cov (G. j ,G. k∨C)

√Var (G. j∨C) Var (G.k∨C)=Corr (G. j ,G. k∨C)

.

S2.2 Proof of Corr (Z j , Zk)=Corr ( β j , βk) in the presence of covariates

The formulae for Z statistics in the presence of covariates are

Z j=β j

√ Var ( β j )= β j √ nVar (G . j∨C)

σ j2 and Z k=

βk

√Var ( βk )= βk√ nVar (G .k∨C )

σk2 ,

where σ j2 (or σ k

2) is the MLE estimator of σ 2 in the regression model with covariates C and genetic

markers j (or k) included. We know that σ j2→ σ2 and σ k

2→ σ2 in probability as n → ∞. More specifically,

we have σ j2=(n−S−1)−1 Y ' ( I−P(C G. j))Y →σ 2 in probability as n→ ∞, and similarly for σ k

2.

7

Define two auxiliary random variables T j=β j

√Var ( β j )=β j√ nVar (G. j∨C)

σ j2 and

T k=βk

√Var ( βk)= βk√ n Var (G.k∨C)

σ k2 . Similar to the scenario in the absence of covariates, applying the

scale-invariant properties of the correlation function, i.e. Corr (C j R j ,C k R k )=Corr (R j ,R k) for fixed C j

, C kand random R j, Rk , we have Corr (T j ,T k )=Corr ( β j , βk ).

Again applying Slutsky theorem, we have ( Z j , Zk ) and (T j ,Tk ) sharing the same asymptotic distribution,

i.e. ( Z j , Zk )=(T j √ σ j/σ , T k √σ k /σ )→ (T j ,T k ) in distribution. Thus they have the same correlations, i.e.

Corr (Z j , Zk)=Corr (T j ,T k ) . In this way, we have proved that under the presence of covariates, we still have:

Corr (Z j , Zk)=Corr (T j ,T k )=Corr ( β j , βk ).

8

S3 Multinomial predictors and Gaussian confounder

To mimic the discrete genotypes observed, we simulated a categorical vector (G1G 2) with specified cell probabilities πab≡ P ( G1=a ,G2=b ) ,where both a and b take three possible values 0, 1, and 2 for the number of a particular allele, mimicking genotypes at two markers. Given the cell probabilities, the expectation values E (G1 ) , E (G2), variances Var (G1 ) ,Var (G2) and Cov (G1G2) are determined. We then simulated a confounder C=β1G1+β2G2+e, where e N (0 ,V e). So we have:

Var (C )=β12Var (G1 )+ β2

2Var (G2 )+2 β1 β2 Cov (G1 ,G2 )+V e,

Cov (C ,G1 )=β1Var (G1 )+β2Cov (G1 , G2) and

Cov (C ,G2 )=β2Var (G2 )+β1Cov (G1 , G2).

Based on the above quantities, we can calculate the correlation ρG1 G 2 and partial correlation ρG1 G 2∨C.

We generated ¿ β0+βC C+ε , where ε is an independent normal random variable with mean zero and variance V. We again fit two multiple regression models, mimicking GWAS single marker analysis, to obtain Z statistics testing the association between G1(G2) and Y controlling for C.

We considered different settings formed by combinations of (πab , β1 , β2). For each setting, we again conducted 10000 simulations each with 300 observations. Table S1 summarizes the results. We reach the same conclusion as in Section 3.1 that the correlation of the Z statistics approaches the partial correlation instead of the marginal correlation.

Table S1. Multinomial predictors and Gaussian confounder

Setting β1 β2 πab ρG1 G 2ρG1 G 2∨C ρZ1 Z2

1 1 0.5 Set 1 0 -0.673 -0.674 [-0.685, -0.663]

2 1 0.5 Set 2 0.223 -0.600 -0.600 [-0.612,-0.587]

3 0.7 0.3 Set 1 0 -0.463 -0.463 [-0.478,-0.447]

4 0.7 0.3 Set 2 0.223 -0.350 -0.350 [-0.367,-0.333]

5 0.4 0.8 Set 1 0 -0.587 -0.584 [-0.597,-0.571]

6 0.4 0.8 Set 2 0.223 -0.491 -0.486 [-0.501,-0.471]

7 0 1 Set 1 0 0 0.011 [-0.008,0.031]

8 0 1 Set 2 0.223 0.096 0.108 [0.089,0.127]

9 1 0 Set 1 0 0 -0.003[-0.022,0.017]

10 1 0 Set 2 0.223 0.090 0.086 [0.066,0.105]

11 0 0 Set 1 0 0 0.007 [-0.013,0.026]

9

12 0 0 Set 2 0.223 0.223 0.228 [0.209,0.246]

Note ρZ1 Z2include the point estimate and 95% confidence interval for the correlation between the two Z

statistics. Without loss of generality, we set β0=1, βC=1, V e=0.09, V=1. We considered two sets of values for πab’s.

In “Set 1”, we assume two genetic markers each satisfying Hardy Weinberg equilibrium (HWE) and are independent (in linkage equilibrium), with minor allele frequency (MAF) = 0.3 and 0.4, so that the (marginal) probabilities for genetic markers 1 and 2 are (0.49,0.42,0.09) and (0.36,0.48,0.16) respectively. Then the joint/cell probabilities are the products of the two vectors of marginal probabilities as shown below.

Marginal 0.36 0.48 0.16

Marginal G1\G2 G2=0 G2=1 G2=2

0.49 G1=0 0.1764 0.2352 0.07840.42 G1=1 0.1512 0.2016 0.06720.09 G2=2 0.0324 0.0432 0.0144

In “Set 2”, the joint probabilities are generated from cell probability in “Set 1”, by adding D to π00 and π11, and subtracting D from π01 and π10. We set D=0.1, resulting in the following cell probabilities:

Marginal 0.36 0.48 0.16Margina

l G1\G2 G2=0 G2=1 G2=2

0.49 G1=0 0.2764 0.1352 0.07840.42 G1=1 0.0512 0.3016 0.06720.09 G2=2 0.0324 0.0432 0.0144

10

S4. Covariate projection accuracy and its impact on partial correlation estimation

To assess the accuracy of DISSCO’s covariate projection, we considered a simplistic two random variables scenario. We simulated 10,000 datasets each with 2,000 observations of (G ,C ) BVN (0,0 ;1,1 , ρ), where G and C denote the genotype vector for a typed marker and the covariate vector respectively. Then we divided the 2000 observations into two groups: 1000 observations

(G study ,C study) as the study sample, and 1000 observations (Grefer ,C refer ) as the reference panel. The

projection method was conducted by (1) regressing C study on (1 Gstudy ), where 1 is the intercept vector, to

obtain the regression coefficient estimate β=([1G study ]' [1Gstudy ])−1 [1 Gstudy ]' C study and study sample

residual ε=C study−[1Gstudy ] β , and (2) generating the pseudo-covariate C refer=[1 Grefer] β+ε ¿, where ε ¿ was a bootstrap sample of ε . Covariate projection accuracy was gauged using the correlation between

C refer and C refer. Results from 10,000 simulated datasets were show in Table S2. Not surprisingly, we

observed that covariate projection accuracy depends on the correlation between typed marker and the covariate to be projected. Similar observations were made in slightly more complicated scenarios described below (results in Tables S3A, S3B, and S3C).

Table S2. Covariate projection performance

ρ Mean of Corr (C refer ,C refer) Standard deviation of Corr (C refer , C refer)-0.9 0.8101 0.0115-0.7 0.4899 0.0257-0.5 0.2498 0.0313-0.3 0.0899 0.0322-0.1 0.0101 0.03170.1 0.0104 0.03170.3 0.0907 0.03240.5 0.2508 0.03150.7 0.4907 0.02570.9 0.8103 0.0115

Although the covariate prediction performance itself is of some interest, partial correlation estimation accuracy based on the projected covariates is more relevant to our ultimate goal of association summary statistics imputation. We therefore considered a simple scenario of three random variables to examine the impact of covariate projection on partial correlation estimation. The setting is the same as that in section 3.1. Specifically, we simulated 10,000 datasets each with 2,000 observations (C X1 X2) following a standard trivariate Gaussian distribution with correlations ρC X 1

, ρC X2and ρX 1 X 2

. We again divided the 2,000 observations into two data sets: study sample and reference panel. Covariate projection was conducted as previously described. Results are summarized in Tables S3A, S3-B, and S3-C. We found that partial correlation among typed markers can be estimated quite accurately based on projected covariate values (comparing column ρX 1 X 2∨C with column ρX 1 X 2∨C in Table S3-A) even when the

covariate itself is not perfectly predicted (column ρC Cin Table S3-A not close to 1); while partial

11

correlation estimates (ρX 1 X 2∨C ) between typed and untyped markers (Tables S3-B and S3-C) showed a mixture results compared with the standard marginal correlation estimates. The better performer in each case for approximating true partial correlation (ρX 1 X 2∨C) is underlined.

Table S3A. Partial correlation estimation (X1and X2 both typed)

Setting ρC X 1ρC X 2

ρX 1 X 2ρX 1 X 2∨C ρX 1 X 2

ρX 1 X 2∨C ρC C1 0.5 0.9 0.8 0.927 0.800 0.927 0.9452 0.5 0.9 0.5 0.133 0.500 0.132 0.8143 0.4 0.5 0 -0.252 0.000 -0.252 0.4104 0 0.8 0.3 0.500 0.300 0.500 0.7035 0.6 0 0.5 0.625 0.500 0.625 0.4806 0 0 0.5 0.500 0.500 0.500 0.0007 0.6 0.8 0.3 -0.375 0.300 -0.375 0.7838 0.9 0.8 0.5 -0.841 0.500 -0.841 0.973

Note ρX 1 X 2 is the sample correlation of X1and X2 in the reference, ρX 1 X 2∨C is the sample correlation of X1

and X2, given pseudo-covariate C in the reference. The table reports the average of these estimators across 10,000 simulations. All standard deviations of these estimators (not shown) are less than 0.03.

Table S3B. Partial correlation estimation (X1 typed, X2 untyped)

Setting ρC X 1ρC X 2

ρX 1 X 2ρX 1 X 2∨C ρX 1 X 2

ρX 1 X 2∨C ρC CD.TruePartial

D. Est.Marginal

D. Est.Partial

1 0.5 0.9 0.8 0.927 0.800 0.756 0.250 0.299 0.314 0.3272 0.5 0.9 0.5 0.133 0.500 0.447 0.250 0.791 0.846 0.8323 0.4 0.5 0 -0.252 0.000 0.000 0.160 0.772 0.800 0.8004 0 0.8 0.3 0.500 0.300 0.300 0.000 0.691 0.707 0.7075 0.6 0 0.5 0.625 0.500 0.419 0.360 0.623 0.630 0.6426 0 0 0.5 0.500 0.500 0.500 0.000 0.691 0.691 0.6917 0.6 0.8 0.3 -0.375 0.300 0.244 0.360 0.740 0.916 0.8908 0.9 0.8 0.5 -0.841 0.500 0.244 0.810 0.432 1.148 0.963


and X2, given pseudo-covariate C in reference. The table reports the average of these estimators across 10,000 simulations. All standard deviations of these estimators (not shown) are less than 0.03.

Table S3C. Partial correlation approximation (X1 untyped, X2 typed)

Set-ting

ρC X 1ρC X 2

ρX 1 X 2ρX 1 X 2∨C ρX 1 X 2

ρX 1 X 2∨C ρC CD.TruePartial

D. Est.Marginal

D. Est.Partial

1 0.5 0.9 0.8 0.927 0.800 0.502 0.810 0.300 0.316 0.4502 0.5 0.9 0.5 0.133 0.500 0.244 0.810 0.785 0.840 0.7913 0.4 0.5 0 -0.252 0.000 0.000 0.250 0.765 0.792 0.7924 0 0.8 0.3 0.500 0.300 0.185 0.640 0.688 0.705 0.7315 0.6 0 0.5 0.625 0.500 0.499 0.000 0.622 0.629 0.6296 0 0 0.5 0.500 0.500 0.500 0.000 0.688 0.688 0.6887 0.6 0.8 0.3 -0.375 0.300 0.186 0.640 0.731 0.909 0.8588 0.9 0.8 0.5 -0.841 0.500 0.327 0.640 0.428 1.146 1.020


and X2, given pseudo-covariate C in the reference. The table reports the average of these estimators across 10,000 simulations. All standard deviations of these estimators (not shown) are less than 0.03.

12

S5. Scatter plots comparing the performance of different association summary statistics imputation methods in the CLHNS dataset

(A) DISSCO VS DIST*

(B) DISSCO VS ImpG-Summary*

13

(C) DISSCO VS ImpG-SummaryLD*

CEBU_GeneralCovariates_2014_0304_REDO

Figure S1: Scatter plots of %D via other methods VS %D via DISSCO for CLHNS data general covariates before quality filtering. Panels A, B and C in this figure correspond to %D of DISSCO versus %D of DIST*/ ImpG-Summary*/ ImpG-SummaryLD* respectively. Blue points are for markers with reference MAF bigger than 0.05 and red points are for markers with reference MAF smaller than or equal to 0.05. Blue line is 45 degree line, black line is the smooth/trend curve for the markers with reference MAF bigger than 0.05, and yellow line is the smooth/trend curve for the markers with reference MAF equal to or less than 0.05.

The advantage of our method, compared with the other methods can be found from the above scatter plots and trend/smoothing line of the scatter plot. We found that most markers are above 45% line, suggesting DISSCO might be better than other method. For example, we have 1494 observations with DISSCO relative absolute deviation<100 and at least two of the three methods (DIST*, ImpG-Summary* and ImpG-SummaryLD*) >100. However, we have only 154 observations with DISSCO relative absolute deviation>100 and at least two of the three methods (DIST*, ImpG-Summary* and ImpG-SummaryLD*) <100. After filtering by imputation quality, we have 740 observations with DISSCO relative absolute deviation<100 and at least two of the three methods (DIST*, ImpG-Summary* and ImpG-SummaryLD*) >100. However, we have only 90 observations with DISSCO relative absolute deviation>100 and at least two of the three methods (DIST*, ImpG-Summary* and ImpG-SummaryLD*) <100. The above may show some advantage of our DISSCO than other methods (DIST*, ImpG-Summary* and ImpG-SummaryLD*).

14

S6. More pronounced difference for lower frequency variants in the WHI dataset

We found that DISSCO resulted in more pronounced improvement for lower frequent variants. In addition to Figure S1 in the Supplementary Material S5 for the CLHNS dataset, we observed similar patterns in the WHI dataset.

Figures S2 and S3 summarizes the results for WHI data set accommodating admixture via PCs, and the results for WHI data set for general covariates in terms of absolute relative percentage deviation.

Figure S2. WHI data set accommodating admixture via PCs. X-axis is the minor allele frequency and Y axis is the absolute relative percentage deviation from the truth.

15

Figure S3. WHI data set accommodating general covariates. Again, X-axis is the minor allele frequency and Y axis is the absolute relative percentage deviation from the truth.

16

S7. PCs treated as general covariates

An alternative method to incorporate PCs is to treat them as general covariates and use the projection method to obtain PCs in reference, instead of performing PCA to the sample and reference together to obtain PCs in reference. Table S4 shows the performance of this alternative method. We observe almost identical performance, compared with Table 2. Nearly identical results are also obtained for the one-side Wilcoxon signed rank test on the paired difference between %D_DIST / ImpG-Summary* / ImpG-Summary LD* and %D_DISSCO*.

Table S4. WHI data set: accommodating admixture via PCs, where sample PCs are projected to obtain reference PCs.

MeasurePost

Imputation Filtering

#SNPs DIST ImpG-Summary*

ImpG-Summary

LD*DISSCO*

D None 162443 0.418(4.3%)

0.410(2.4%)

0.408(2.0%)

0.400

%D None 162443 56.2(8.7%)

54.7(6.2%)

55.1(6.9%)

51.3

R2 None 162443 0.697(2.4%)

0.700(2.0%)

0.708(0.8%)

0.714

D >0.6 150251 0.387(3.6%)

0.379(1.6%)

0.377(1.1%)

0.373

%D >0.6 150251 52.7(7.8%)

50.9(4.5%)

51.6(5.8%)

48.6

R2 >0.6 150251 0.743(1.7%)

0.750(0.8%)

0.755(0.1%)

0.756

Best performing methods are highlighted as bold and underlined. Smaller D, smaller %D and larger R2 are better. The values in the bracket are the relative improvement of DISSCO over DIST*/ ImpG-Summary* / ImpG-SummaryLD*.

S8. Impact of the number of LD tags

We suspect DISSCO’s more pronounced advantage for lower frequency variants is due to the smaller number of LD tags for these variants compared with common variants. To illustrate the influence of the number of LD tags on association summary statistics imputation, we conducted the following simulation studies. We considered s=3 confounders, 1 untyped marker and p−1typed markers. Let the correlation between any two markers, the correlation between any two confounders and the correlation between a marker and a confounder be 0.8, 0.8 and 0.6 respectively. We studied the imputation performance when the total number of markers, p changed from 2 to 30.

Let Σ be the correlation matrix with the size ( p+s)×( p+s) and partitioned into Σ=( Σ¿ ΣGC

ΣCG ΣCC), where

Σ¿ is the correlation matrix among markers (including both typed and untyped), ΣCC the correlation

matrix among, and ΣCG and ΣGC the correlation matrix between confounders and markers. Given the correlations specified above, Σ¿ and ΣCC both are square matrices with value 1 in the diagonal entries and 0.8 off-diagonal. ΣCG and ΣGC are of dimension p × s and s× p with all entries being 0.7. We generate the true p association statisticsZ MVN (0 , ΣG∨C ), where ΣG∨C is the conditional correlation matrix. Let

17

the pth marker be the untyped marker. We used the first ( p−1) entries of Z vector to impute the last entry of Z vector using either the partial correlation (partial) or marginal correlation (marginal). In this example, the imputation performance of both methods improved with p, which is expected because a marker can be better imputed with more typed markers correlated with it (i.e., more information available). While partial always outperforms marginal, the advantage is more obvious in more challenging the number of LD tags is smaller (i.e., less information). We have shown the results from 1,000,000 simulations in Figure S4 where the X-axis is the number of LD tags (p – 1) and Y-axis the absolute relative percentage deviation from the true association summary statistics.

Figure S4. Impact of the number of LD tags. X-axis is the number of LD tags and Y axis is the absolute relative percentage deviation from the truth.

18

S9. Computational complexity

Following ImpG-Summary/LD, our DISSCO implementation first divides each chromosome into non-overlapping blocks with a predetermined length (1Mb by default). For each block (with additional 100kb flanking both up-stream and down-stream), DISSCO imputes untyped markers based on typed markers within the block. Assume there are p=p t+ put markers, including pt typed and put untyped markers in the block, N study individuals in study sample and N refer individuals in reference sample, and S covariates.

After we obtain Zt , Z statistics at typed markers, the computing costs for DIST, ImpG-Summary/LD and DISSCO can be broken down into the following steps: (1) calculation of reference correlation matrix Σunadj, (2) calculation of sample correlation matrix among typed markers, (3) generation of reference pseudo-covariates, (4) calculation of reference partial correlation, and (5) actual imputation via the following formula:

Zi.=Σi ,t

., refer ( Σt ,t. ,refer )−1 Z t ,

and (6) the normalization of imputed values. All methods need step (5); on top of (5), DIST needs (1); ImpG-Summary (1) and (6); ImpG-SummaryLD (1), (2), and (6); and DISSCO (3) and (4).

Now we will quantify the computational complexity for each step. Step (1) is the calculation of correlation matrix Σunadj in reference. This matrix has p rows and p columns with each entry being the correlation of a pair of markers calculated from N refer individuals. Thus, in total, we need 4N refer p (p+1) calculations for sample covariance matrix and an additional 2 p( p−1) calculations for sample correlation matrix. Thus, the computational complexity is O(N refer p2). Similarly, step (2), calculation of the correlation matrix among pt typed markers from based on the N study individuals in study sample, has computational complexity O(N study p t

2).

Step (3) is for the generation of reference pseudo-covariates. We use the projection method introduced in the main text for this purpose. Specifically, our projection method involves first regressing S covariates on genotypes at typed markers among N study individuals in study sample, and then multiplying the estimated coefficients by genotypes of typed markers for the N Refer individuals in reference sample, i.e.,

C refer=[1 Gtrefer ]([1Gt

study] ' [1 Gtstudy ])−1[1Gt

study] ' C ,

Where C is N study × S matrix, Gtstudyis N study *( pt+ 1) matrix, Gt

ℜ fer is N refer∗( pt+1 )¿ Computing costs therefore are

( pt+1 )¿.

Since normally S << pt, the leading terms are N study p t2and pt

3, leading to the computational complexity of O(N study p t

2+ pt3).

Step (4) is to generate partial correlation in reference. After we obtain the reference pseudo-covariates in step (3), we can calculate each element in the partial covariance matrix via:

Cov (Gi ,G j|C )=Cov (Gi , G j )−Cov (Gi ,C ) Var−1 (C ) Cov (C ,G j),

19

where Cov (Gi , G j ) is the row i column j entry of the reference covariance matrix [which has been calculated in step (1)], Var−1 (C ) is the inverse of covariance matrix of S pseudo-covariates, and Cov (Gi ,C ) is the covariance between the genotypes at marker i and S pseudo-covariates. Since Cov (Gi ,G j ) have already been calculated in step (1), we only need to calculate the variance matrix of S pseudo-covariates, and the covariance matrix between S pseudo-covariates and genotype data at p markers, which requires 4N refer Sp calculations totally.

Step (5) is to impute using the following formula,

Zi.=Σi ,t

., refer ( Σt ,t. ,refer )−1 Z t ,

where the subscript “.” differs across methods as detailed in main text. Computing cost to impute the summary statistics for each untyped marker is pt

2+O( pt3)+ p t, where pt

2 is the computational cost for the

multiplication between a vector of length pt and a pt×pt matrix; O ( pt3 )for the inversion of a pt×pt

matrix; and pt for the multiplication between a vector of length pt and a column vector of length pt.Thus

for put untyped markers, it is put pt+ put pt2+O( put p t

3) Note we specify the amount required for the

inverse of the matrix with m× m entries is O(m3) based on standard matrix inverse algorithm based on LDU decomposition (Cormen, et al., 2009).

Step (6) is the normalization of imputed values. Specifically, the normalized test statistics are:

ZiImpSummary=

Z iDIST

√ Σi ,tcor r ,refer ( Σt , t

corr , refer)−1( Σi ,t

corr ,refer )'

ZiImpSummaryLD=

Z iDIST

√ Σi ,tcorr , refer ( Σt ,t

corr ,refer )−1Σt , t

corr , study ( Σt ,tcorr , refer )−1

( Σi ,tcorr , refer)'

Similar to step (5), for ImpG-Summary, the computational cost required for put untyped markers is

put ( pt2+ pt ) where pt

2+ pt is the cost for each untyped marker. Specifically, pt

2 is the computational cost

for multiplication between a vector of length pt and a pt×pt matrix; and pt for the multiplication between a vector of length pt and a column vector of length pt. For ImpG-SummaryLD, the computational cost is put (3 pt

2+ pt ), where 3 pt2+p t is the cost for each untyped marker. Cost decomposition is similar to the

above for ImpG-Summary except that we need to perform three times the multiplication between a vector of length pt and a pt×pt matrix.

20

S10. Control of type 1 error in simulations under the null for our four scenarios

To evaluate the control of Type 1 error, we simulated phenotypic outcome under the null hypothesis through permutations in all four real data scenarios: (1) CLNHNS with PCs, (2) CLNHNS with general covariates, (3) WHI with PCs and (4) WHI with general covariates. We quantified to evaluate whether DISSCO correctly maintains the type 1 error rate at a range of given nominal rates (10-1 – 10-5).

We found that type I error rate is under control in all scenarios attempted with proper regulation (specifically when lambda is set over 0.02-0.03) according to Tables S5A-S5D.

We further conducted simulation studies based on data from the CLHNS study, using a mismatched reference panel (the EUR haplotypes from the 1000 Genomes Project). We observed inflated type I error rate with the recommended level of regularization (lambda = 0.03) according to Table S5E.

Table S5A: CLHNS-PC Study. Ratio of observed versus expected number of associations at different thresholds in null data simulations.

lambda 10-1 10-2 10-3 10-4 10-5

r2 predCO no constraint, 1.98 M Simulations0.001 0.95 1.03 1.37 2.50 6.770.002 0.93 0.96 1.19 1.99 4.340.003 0.91 0.92 1.09 1.78 3.940.004 0.89 0.89 1.02 1.61 3.230.005 0.88 0.86 0.98 1.44 2.880.006 0.87 0.84 0.94 1.33 2.580.007 0.86 0.83 0.91 1.24 2.370.008 0.86 0.81 0.88 1.19 2.220.009 0.85 0.80 0.86 1.15 2.070.01 0.84 0.79 0.85 1.10 1.920.02 0.80 0.70 0.70 0.81 0.960.03 0.77 0.65 0.62 0.71 0.810.04 0.74 0.61 0.57 0.61 0.660.05 0.72 0.58 0.53 0.57 0.500.06 0.70 0.55 0.50 0.53 0.500.07 0.69 0.53 0.47 0.48 0.500.08 0.67 0.51 0.44 0.42 0.500.09 0.66 0.49 0.42 0.38 0.400.1 0.64 0.47 0.40 0.35 0.40

r2 predCO>0.6, 1.757 M Simulations0.001 0.99 1.05 1.31 2.17 4.900.002 0.97 0.99 1.18 1.81 3.470.003 0.95 0.96 1.11 1.66 3.250.004 0.94 0.94 1.05 1.53 2.68

21

0.005 0.93 0.92 1.01 1.38 2.390.006 0.92 0.90 0.98 1.29 2.220.007 0.92 0.89 0.96 1.21 2.110.008 0.91 0.87 0.93 1.17 1.990.009 0.91 0.86 0.92 1.14 1.880.01 0.90 0.85 0.90 1.11 1.760.02 0.86 0.77 0.76 0.84 0.910.03 0.83 0.72 0.69 0.76 0.800.04 0.81 0.68 0.63 0.66 0.630.05 0.79 0.64 0.59 0.63 0.510.06 0.77 0.62 0.55 0.58 0.510.07 0.75 0.59 0.53 0.53 0.510.08 0.74 0.57 0.49 0.47 0.510.09 0.72 0.55 0.47 0.43 0.400.1 0.71 0.53 0.44 0.39 0.40

Table S5B: CLHNS-GC Study. Ratio of observed versus expected number of associations at different thresholds in null data simulations.

lambda 10-1 10-2 10-3 10-4 10-5


r2 predCO>0.6, 1.756 M Simulations0.001 0.98 1.05 1.35 2.22 5.810.002 0.96 0.99 1.21 1.76 4.040.003 0.95 0.96 1.14 1.54 3.59

22

0.004 0.94 0.93 1.08 1.41 3.190.005 0.93 0.91 1.04 1.29 2.960.006 0.92 0.89 1.01 1.24 2.790.007 0.91 0.88 0.98 1.17 2.620.008 0.91 0.87 0.94 1.12 2.450.009 0.90 0.85 0.91 1.08 2.340.01 0.90 0.84 0.90 1.03 1.990.02 0.86 0.76 0.76 0.84 1.250.03 0.83 0.71 0.67 0.67 0.910.04 0.80 0.67 0.61 0.59 0.850.05 0.78 0.63 0.57 0.52 0.740.06 0.77 0.60 0.52 0.48 0.570.07 0.75 0.58 0.48 0.44 0.570.08 0.73 0.55 0.46 0.39 0.570.09 0.72 0.53 0.42 0.35 0.510.1 0.71 0.51 0.40 0.34 0.51

Table S5C: WHI-PC Study. Ratio of observed versus expected number of associations at different thresholds in null data simulations.

lambda 10-1 10-2 10-3 10-4 10-5


r2 predCO>0.6, 3.028 M Simulations0.001 1.17 1.66 2.98 7.07 21.070.002 1.10 1.46 2.35 4.98 12.78

23

0.003 1.07 1.34 2.03 4.04 9.780.004 1.04 1.25 1.82 3.46 7.460.005 1.02 1.19 1.66 3.02 6.180.006 1.00 1.14 1.55 2.67 5.250.007 0.98 1.09 1.45 2.40 4.860.008 0.97 1.06 1.37 2.18 4.430.009 0.95 1.02 1.31 2.04 3.700.01 0.94 0.99 1.25 1.89 3.440.02 0.86 0.81 0.90 1.12 1.490.03 0.81 0.71 0.72 0.82 1.060.04 0.77 0.63 0.61 0.67 0.830.05 0.73 0.58 0.53 0.54 0.630.06 0.71 0.54 0.46 0.45 0.460.07 0.69 0.50 0.42 0.39 0.460.08 0.66 0.47 0.38 0.32 0.360.09 0.65 0.45 0.35 0.29 0.330.10 0.63 0.42 0.32 0.26 0.26

Table S5D: WHI-GC Study. Ratio of observed versus expected number of associations at different thresholds in null data simulations.

lambda 10-1 10-2 10-3 10-4 10-5


r2 predCO>0.6, 2.821 M Simulations0.001 1.18 1.70 3.15 7.20 20.21

24

0.002 1.12 1.48 2.48 4.96 11.420.003 1.08 1.35 2.11 3.82 8.790.004 1.05 1.27 1.87 3.18 6.910.005 1.02 1.20 1.71 2.72 5.740.006 1.00 1.14 1.59 2.38 4.610.007 0.99 1.10 1.48 2.18 3.720.008 0.97 1.06 1.39 1.99 3.230.009 0.96 1.03 1.32 1.85 2.980.01 0.95 1.00 1.26 1.73 2.770.02 0.86 0.81 0.88 0.99 1.670.03 0.81 0.70 0.68 0.70 1.130.04 0.77 0.63 0.57 0.56 0.890.05 0.74 0.58 0.49 0.48 0.640.06 0.71 0.54 0.44 0.40 0.530.07 0.68 0.50 0.39 0.34 0.390.08 0.66 0.47 0.35 0.32 0.280.09 0.64 0.44 0.32 0.29 0.210.10 0.63 0.42 0.29 0.24 0.21

Table S5E: CLHNS-GC Study using the mismatched reference, i.e. EUR reference. Ratio of observed versus expected number of associations at different thresholds in null data simulations.

lambda 10-1 10-2 10-3 10-4 10-5


25

0.11 0.58 0.44 0.46 0.86 3.210.12 0.57 0.41 0.43 0.74 2.400.13 0.55 0.40 0.40 0.65 1.780.14 0.54 0.38 0.36 0.59 1.380.15 0.53 0.36 0.34 0.53 1.020.16 0.52 0.35 0.32 0.46 0.920.17 0.51 0.34 0.29 0.38 0.660.18 0.50 0.32 0.28 0.33 0.660.19 0.49 0.31 0.26 0.31 0.560.20 0.48 0.30 0.24 0.29 0.41

r2 predCO>0.6, 1.693 M Simulations0.001 1.02 1.34 2.74 9.49 47.670.002 0.99 1.24 2.36 8.00 39.170.003 0.97 1.17 2.16 7.10 34.030.004 0.95 1.13 2.02 6.45 30.960.005 0.94 1.09 1.90 5.98 28.710.006 0.93 1.06 1.81 5.62 26.760.007 0.92 1.03 1.73 5.23 24.990.008 0.91 1.01 1.68 5.00 22.800.009 0.90 0.99 1.62 4.81 22.090.01 0.89 0.97 1.58 4.58 20.680.02 0.84 0.84 1.23 3.30 13.470.03 0.80 0.76 1.04 2.43 9.330.04 0.77 0.69 0.89 1.89 7.210.05 0.75 0.65 0.79 1.59 5.260.06 0.72 0.61 0.69 1.33 3.780.07 0.70 0.57 0.63 1.18 3.070.08 0.69 0.54 0.57 0.99 2.660.09 0.67 0.52 0.53 0.83 2.420.10 0.65 0.50 0.50 0.72 2.010.11 0.64 0.47 0.45 0.61 1.650.12 0.63 0.46 0.42 0.56 1.300.13 0.62 0.44 0.40 0.51 1.000.14 0.60 0.42 0.37 0.49 0.830.15 0.59 0.40 0.35 0.45 0.530.16 0.58 0.39 0.33 0.41 0.470.17 0.57 0.38 0.31 0.35 0.350.18 0.56 0.36 0.30 0.30 0.350.19 0.55 0.35 0.29 0.28 0.300.20 0.54 0.34 0.27 0.27 0.30

References

Cormen, T.H., et al. (2009) Introduction to Algorithms. The MIT Press.

26

27

Documents

yunliweb.its.unc.edu · Web viewFor each block (with additional 100kb flanking both up-stream and down-stream), DISSCO imputes untyped markers based on typed markers within the block…