1
Associating Genomic Variations with Phenotypes
Model comparison, rare variants, and analysis pipeline
Qunyuan Zhang
Division of Statistical Genomics & Genome Institute
Washington University School of Medicine
2
Data & Question
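Relationship between X and Y?

$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\qquad
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix},
\qquad i = 1, 2, \dots, n
$$

Genotypes (X): SNP, insertion, deletion, duplication, inversion, translocation, …
Phenotypes (Y): quantitative, categorical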
3
Linkage & Association
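Association: (Y, X)
Linkage: (Y, Q), where Q is unobservable

$$
\begin{array}{c|c|ccc}
i & Y & X_1 & Q & X_2 \\
\hline
1 & y_1 & x_{11} & q_1 & x_{12} \\
2 & y_2 & x_{21} & q_2 & x_{22} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
n & y_n & x_{n1} & q_n & x_{n2}
\end{array}
$$

Y: phenotype; X: genotypes; Q: putative QTL
Map: X1 -- r1 -- Q -- r2 -- X2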
4
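A Fixed-effect Mixture Model For Linkage
Commonly used in plant genetics

Cross: P1 x P2 -> F1 -> F2
Map: SNP A -- r1 -- Q -- r2 -- SNP B

$$
f(y_i) = \sum_{j=1}^{3} P(Q_j \mid X_i, r)\,
\frac{1}{\sqrt{2\pi}\,\sigma_j}
\exp\left\{ -\frac{1}{2} \left( \frac{y_i - \mu_j}{\sigma_j} \right)^2 \right\}
$$

$$
L(Y) = \prod_{i=1}^{n} f(y_i)
$$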
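The mixture likelihood on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the QTL genotype probabilities P(Q_j | X_i, r) are assumed to be supplied already (they are not derived from r1 and r2 here), and the toy numbers are invented.

```python
import math

def normal_pdf(y, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_density(y_i, qtl_probs, mus, sigmas):
    """f(y_i): sum over the three QTL genotypes of P(Q_j | X_i, r) * N(y_i; mu_j, sigma_j)."""
    return sum(p * normal_pdf(y_i, m, s) for p, m, s in zip(qtl_probs, mus, sigmas))

def log_likelihood(ys, qtl_prob_rows, mus, sigmas):
    """log L(Y): sum of log f(y_i) over subjects."""
    return sum(
        math.log(mixture_density(y, probs, mus, sigmas))
        for y, probs in zip(ys, qtl_prob_rows)
    )

# Hypothetical toy data: three F2 individuals.
ys = [9.8, 12.1, 14.3]
# Assumed P(Q_j | X_i, r) for each individual (each row sums to 1).
qtl_prob_rows = [
    [0.90, 0.09, 0.01],
    [0.05, 0.90, 0.05],
    [0.01, 0.09, 0.90],
]
mus = [10.0, 12.0, 14.0]     # genotype-specific means (invented)
sigmas = [1.0, 1.0, 1.0]     # genotype-specific standard deviations (invented)

ll = log_likelihood(ys, qtl_prob_rows, mus, sigmas)
print(ll)
```

In practice the means, standard deviations and QTL position would be estimated by maximizing this log-likelihood, which is outside the scope of the sketch.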
5
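A Variance-component Model For Linkage
Commonly used in human genetics

Map: SNP A -- r1 -- Q -- r2 -- SNP B

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}}
\exp\left\{ -\frac{1}{2} (Y - \mu)^{T} V^{-1} (Y - \mu) \right\}
$$

$$
V = \mathrm{Cov}(Y) = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$\Delta_g$: background IBD matrix
$\Delta_Q$: QTL IBD matrix
$I$: diagonal unit matrix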
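To make the covariance structure concrete, the sketch below builds V for a single sib pair and evaluates the multivariate normal log-likelihood using the closed-form 2x2 inverse. It is an illustration under stated assumptions: the function names, the IBD sharing values and the variance components are all hypothetical, and a real analysis would handle pedigrees of any size with a linear algebra library.

```python
import math

def build_v(delta_q, delta_g, var_q, var_g, var_e):
    """V = Delta_Q * var_q + Delta_g * var_g + I * var_e, as nested lists."""
    n = len(delta_q)
    return [
        [
            delta_q[i][j] * var_q
            + delta_g[i][j] * var_g
            + (var_e if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]

def loglik_pair(y, mu, v):
    """Log of the multivariate normal likelihood for n = 2 subjects."""
    (a, b), (c, d) = v
    det = a * d - b * c
    r0, r1 = y[0] - mu, y[1] - mu
    # Quadratic form r' V^{-1} r using the closed-form inverse of a 2x2 matrix.
    quad = (d * r0 * r0 - (b + c) * r0 * r1 + a * r1 * r1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

# Hypothetical sib pair sharing all alleles IBD at the QTL (off-diagonal 1.0)
# and half of the genome on average in the background (off-diagonal 0.5).
delta_q = [[1.0, 1.0], [1.0, 1.0]]
delta_g = [[1.0, 0.5], [0.5, 1.0]]

v = build_v(delta_q, delta_g, var_q=0.3, var_g=0.5, var_e=0.2)
print(v)
print(loglik_pair([1.2, 0.7], mu=1.0, v=v))
```

Setting var_q and var_g to zero reduces V to a diagonal matrix, and the log-likelihood then equals the sum of two independent normal log-densities, which is a convenient sanity check.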
6
Variance-component Model = Random-effect Linear Model
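$$
V = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$$
Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e
$$

Random effects:

$$
\gamma_Q \sim MVN(0, \Delta_Q \sigma_Q^2), \qquad
\gamma_g \sim MVN(0, \Delta_g \sigma_g^2), \qquad
e \sim N(0, \sigma_e^2)
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}}
\exp\left\{ -\frac{1}{2} (Y - \mu)^{T} V^{-1} (Y - \mu) \right\}
$$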
7
From Linkage to Association
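Linkage model, with QTL effect(s):

$$
Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e
$$

Family-based association model, with marker effect(s) as fixed effect(s):

$$
Y = \mu + X\beta + Z_g \gamma_g + e
$$

$$
V = \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}}
\exp\left\{ -\frac{1}{2} (Y - \mu - X\beta)^{T} V^{-1} (Y - \mu - X\beta) \right\}
$$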
8
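A Simple Association Model
For Unrelated Subjects

$$
Y = \mu + X\beta + e
$$

$$
V = I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}}
\exp\left\{ -\frac{1}{2} (Y - \mu - X\beta)^{T} V^{-1} (Y - \mu - X\beta) \right\}
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_e}
\exp\left\{ -\frac{1}{2} \left( \frac{y_i - \mu - X_i\beta}{\sigma_e} \right)^2 \right\}
$$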
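For unrelated subjects and a single marker, maximizing this likelihood over μ and β gives the ordinary least-squares estimates. The sketch below computes them directly, assuming an additive 0/1/2 genotype coding; the function name and the toy data are hypothetical.

```python
import math

def single_marker_fit(x, y):
    """Least-squares fit of y = mu + beta * x + e for one marker.

    Returns (mu, beta, se_beta, t_stat).
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    mu = y_bar - beta * x_bar
    # Residual variance with n - 2 degrees of freedom (two estimated parameters).
    rss = sum((yi - mu - beta * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_e = rss / (n - 2)
    se_beta = math.sqrt(sigma2_e / sxx)
    return mu, beta, se_beta, beta / se_beta

# Hypothetical genotypes (minor-allele counts) and quantitative phenotypes.
x = [0, 1, 2, 0, 1, 2]
y = [1.0, 3.1, 4.9, 0.9, 2.9, 5.1]

mu, beta, se_beta, t_stat = single_marker_fit(x, y)
print(mu, beta, se_beta, t_stat)
```

The t statistic would be compared with a t distribution on n - 2 degrees of freedom to test H0: β = 0; the p-value calculation is omitted to keep the sketch free of external libraries.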
9
Covariate(s): Adjusting For Confounder(s)
$$
Y = \mu + X\beta + X_C \beta_C + e
$$

Observed confounders: age, sex, etc.
Hidden confounders: population structure
Population structure can be estimated by:
- PCA
- Clustering
- Admixture/ancestry
10
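Modeling Hidden Genetic Correlation
Between Subjects

$$
Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e
$$

$$
V = \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$X\beta$: marker fixed effect(s)
$X_C \beta_C$: covariate fixed effect(s)
$Z_g \gamma_g$: genetic background random effects

Family data: pedigree => IBD matrix
Population data: hidden relatedness, marker data => IBS matrix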
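For population data, one simple way to obtain a marker-based IBS matrix is to average the proportion of alleles shared identical by state across markers. The sketch below assumes genotypes coded as minor-allele counts (0, 1, 2) with no missing values; the function name and data are hypothetical, and programs such as EMMA or EMMAX use their own kinship estimators.

```python
def ibs_matrix(genotypes):
    """Pairwise IBS sharing: mean over markers of (2 - |g_i - g_j|) / 2."""
    n = len(genotypes)
    m = len(genotypes[0])
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Two genotypes coded 0/1/2 share 2 - |difference| alleles by state.
            shared = sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[j]))
            mat[i][j] = shared / (2.0 * m)
    return mat

# Hypothetical genotypes: 3 subjects x 4 markers.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
]

ibs = ibs_matrix(genotypes)
for row in ibs:
    print(row)
```

The resulting matrix can take the place of the IBD matrix in V when pedigree information is unavailable.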
11
Modeling Rare Variants
$$
Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e
$$

Common variants: tested individually. H0: β1 = 0. One p-value per variant.

$$
Y = \mu + X_1 \beta_1 + \dots
$$

Rare variants: tested as an entire group (burden test), usually by gene. H0: β1 = β2 = … = βk = 0. One p-value per group of variants.

$$
Y = \mu + X_1 \beta_1 + X_2 \beta_2 + \dots + X_k \beta_k + \dots
$$

- Can be combined with variable selection, using loose criteria
- β can be treated as random effects (variance-component test) and can be weighted by prior information
12
Collapsing Model
$$
Y = \mu + X_1 \beta_1 + X_2 \beta_2 + \dots + X_k \beta_k + \dots
$$

Collapsing multiple variables into one:

$$
Y = \mu + X\beta + \dots
$$

subject   X1   X2   X3   X
1         0    0    0    0
2         0    1    1    1
3         1    0    0    1
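The collapsing step in the table can be written as a short function: a subject gets X = 1 if it carries at least one rare variant in the group, and X = 0 otherwise. This is a minimal sketch with a hypothetical function name, using the three subjects from the slide.

```python
def collapse(genotype_rows):
    """Collapse k variant columns into one indicator per subject."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotype_rows]

# Rows are subjects 1..3, columns are variants X1, X2, X3.
genotype_rows = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]

collapsed = collapse(genotype_rows)
print(collapsed)
```

The single collapsed variable then replaces X1 … Xk in the regression, so the group is tested with one degree of freedom.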
13
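Weighted Sum Model

$$
Y = \mu + X_1 \beta_1 + X_2 \beta_2 + \dots + X_k \beta_k + \dots
$$

$$
Y = \mu + \left( \sum_{j=1}^{k} w_j X_j \right) \beta + \dots
$$

Weighted sum score: $S = \sum_{j=1}^{k} w_j X_j$

$$
Y = \mu + S\beta + \dots
$$

subject   X1    X2    X3    S
weight    0.2   0.5   0.3
1         0     0     0     0.0
2         0     1     1     0.8
3         1     0     0     0.2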
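The weighted sum score is a dot product between each subject's genotype row and the weight vector. The sketch below reproduces the slide's example with weights 0.2, 0.5 and 0.3; the function name is hypothetical.

```python
def weighted_sum_scores(genotype_rows, weights):
    """S_i = sum over variants j of w_j * x_ij, for each subject i."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotype_rows]

# Rows are subjects 1..3, columns are variants X1, X2, X3.
genotype_rows = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
weights = [0.2, 0.5, 0.3]

scores = weighted_sum_scores(genotype_rows, weights)
print(scores)  # approximately [0.0, 0.8, 0.2], up to floating-point rounding
```

With all weights equal to 1 the score is simply the count of rare alleles carried, and thresholding that count at 1 recovers the collapsing model.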
14
Weighting Variants
- Based on allele frequency: continuous or binary (0, 1) weight, variable threshold;
- Based on functional annotation/prediction;
- Based on sequencing quality (coverage, mapping quality, genotyping quality, validated or not, etc.);
- Data-driven: using both genotype and phenotype data to learn weights (including effect directions) from the data, which requires a permutation test;
- Any combination …
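As an illustration of frequency-based weighting, the sketch below shows one continuous weight and one binary weight. Both are assumptions made for the example rather than the scheme used in this pipeline: the continuous form 1/sqrt(p(1-p)) is one common choice that up-weights rarer variants, and the 0.01 cutoff is an arbitrary threshold.

```python
import math

def frequency_weight(maf):
    """Continuous weight that grows as the minor allele frequency shrinks."""
    return 1.0 / math.sqrt(maf * (1.0 - maf))

def threshold_weight(maf, threshold=0.01):
    """Binary weight: 1 if the variant is rarer than the threshold, else 0."""
    return 1 if maf < threshold else 0

# Hypothetical minor allele frequencies for three variants.
mafs = [0.001, 0.02, 0.30]

print([round(frequency_weight(p), 2) for p in mafs])
print([threshold_weight(p) for p in mafs])
```

A variable-threshold analysis would repeat the binary weighting over a range of cutoffs and correct for the multiple thresholds tried, typically by permutation.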
Grouping Variants
- By gene
- By transcript
- By exon
- By gene set / pathway
- By protein domain
- …
15
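Modeling More Data Types
Generalized Linear (Mixed) Model

$$
g(Y) = \mu + X\beta + \dots + e
$$

g: link function

For binary Y, logistic model:

$$
g(Y) = \operatorname{logit}(Y) = \log \frac{P(Y=1)}{P(Y=0)}
$$

$$
P(Y=1) = \frac{\exp(\mu + X\beta + \dots + e)}{1 + \exp(\mu + X\beta + \dots + e)}
$$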
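The logit link and its inverse are short enough to write out directly. The sketch below shows how a linear predictor is mapped to P(Y = 1); the function names and the parameter values are hypothetical, and only a single marker term is included.

```python
import math

def logit(p):
    """Link function: log odds of p."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))

def prob_case(mu, beta, x):
    """P(Y = 1) for genotype x under a single-marker logistic model."""
    return inv_logit(mu + beta * x)

# Hypothetical intercept and log odds ratio per allele.
mu, beta = -1.0, 0.4
for x in (0, 1, 2):
    print(x, prob_case(mu, beta, x))
```

Each extra copy of the allele adds beta to the log odds, so exp(beta) is the per-allele odds ratio.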
16
Longitudinal Data (quantitative)
Fixed effect, time as covariate
Repeated measures, random effect, correlation within subjects
[Figure; x-axis: Time]
17
Longitudinal Data (binary)
Linear model, time as covariate
Survival analysis, CoxPH model etc.
[Figure; x-axis: Time]
18
Tools
SAS Procedures
REG, LOGISTIC, GENMOD, MIXED, HPMIXED, GLIMMIX, PHREG/LIFETEST
R Functions/Packages
lm(), glm(), gee, nlme, kinship2/coxme, lme4, survival
Other Programs
SOLAR, MMAP, EMMA, EMMAX, SKAT
19
Pipeline
Input (data + options)
-> Job generating/submitting module (LSF bsub): job 1, job 2, …, job N
   Options.jobi => self-programmed modules (SAS, R, …)
   Options.jobi => external program modules (MMAP, SKAT, …)
-> Job number controlling module
-> Result 1, Result 2, …, Result N
-> Job status monitoring module (all done?)
   No: wait …
   Yes: result summarizing module
20
gwas.sh options.gwa
#!/bin/sh
OPFILE=$1
...
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
genotype_file=
phenotype_file=fhs100
markerinfo_file=mapall
marker_selection=MAF>0.01
pedigree_file=pediall
subjectID=subject
pedgreeID=famid
markername=snp
…
[ANALYSIS]
phenolist_file=
pheno_list=bmi/qt
covariates=
program=SASGLM
analysis=mixed
[OUTPUT]
output_dir=/dsguser/qunyuan/fhs/bmi
output_file=
output_replace=no
[RUN]
clusterjobname=bmimixed
memsize=1000M
maxjobn=300
…
Pheno   type   covar     program   analysis   run
Bmi     qt     age,sex   SASGLM    mixed      YES
Obes    ql     NA        SASGLM    gee        YES
HD      ql     age       SASGLM    gee        NO
Age     …
Sex     …
…
Program   language   location                  Maintainer
SASGLM    SAS        /dsg1/code/sas/glm.sas    Q.Zhang
GSTAT     R          /dsg1/code/R/gstat.R      Q.Zhang
MMAP      C          /dsg1/code/sas/mmap.sh    J. Czajkowski
…
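An options file with `[SECTION]` headers and `key=value` lines, like options.gwa, can be read with a standard INI parser. The sketch below uses Python's `configparser` on a subset of the lines from this slide; it is only an illustration of the file layout, since how gwas.sh actually parses the file is not shown, and `load_options` is a hypothetical helper.

```python
import configparser

# A subset of the key=value lines from the options file on this slide.
OPTIONS_TEXT = """
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
genotype_file=
phenotype_file=fhs100
marker_selection=MAF>0.01
[ANALYSIS]
pheno_list=bmi/qt
program=SASGLM
analysis=mixed
[RUN]
memsize=1000M
maxjobn=300
"""

def load_options(text):
    """Parse INI-style text into a dict of {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as written instead of lowercasing
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

options = load_options(OPTIONS_TEXT)
print(options["ANALYSIS"]["program"])
print(int(options["RUN"]["maxjobn"]))
```

Keys left empty in the file, such as genotype_file, come back as empty strings, so a job-generating module can tell "not set" apart from a missing key.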
21
Thanks !