Novel Statistical Methods
Gary K. Chen
University of Southern California
May 17, 2011
An outline
Association testing in admixed populations
Gene-gene interactions
Copy number inferences
Local ancestry inference
- Assumption: 2 or more homogeneous populations gave rise to today's admixed population, e.g. Hispanics, African Americans
- Software:
  - LAMP
  - HAPAA
  - HAPMIX
- Relevance:
  - Not taking ancestry into account can cause serious confounding
  - However, understanding local ancestry can enhance inference in gene mapping
Hidden Markov Model of the HAPMIX program
Combining evidence from both local ancestry and association
Novel MIX score statistic
- A $\chi^2_1$ test combining association and admixture association
- Likelihood:
  - $L_{\mathrm{combined}}(p_A, p_E, R) = L_{AA,AE,AA}(p_A, p_E, R)\, L_{\mathrm{admix}}(\Omega(R))$
- Assumption: the SNP odds ratio $R$ is reused in the ancestry odds ratio $\Omega(R)$
- $\mathrm{MIX} = 2\left[\max_{p_{A,0},\, p_{E,0},\, R} \log L_{\mathrm{combined}}(p_{A,0}, p_{E,0}, R) - \max_{p_{A,0},\, p_{E,0}} \log L_{\mathrm{combined}}(p_{A,0}, p_{E,0}, 1)\right]$
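The MIX score has the form of a likelihood-ratio statistic: twice the difference between the maximized log-likelihood with R free and with R fixed at 1. A minimal sketch of turning such a statistic into a p-value, assuming (as the slide states) a chi-square reference distribution with 1 degree of freedom. The function names and the log-likelihood values in the usage note are hypothetical, and the maximization itself is not shown.

```python
import math

def lrt_statistic(loglik_alt, loglik_null):
    """Twice the gap between the maximized log-likelihoods (alternative vs. null)."""
    return 2.0 * (loglik_alt - loglik_null)

def chi2_1df_pvalue(stat):
    """Upper-tail probability that a chi-square(1 df) variable exceeds stat."""
    if stat < 0:
        raise ValueError("a likelihood-ratio statistic cannot be negative")
    # If Z ~ N(0, 1) then Z^2 ~ chi-square(1), so P(Z^2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))
```

For example, `chi2_1df_pvalue(lrt_statistic(-100.0, -105.0))` evaluates the tail probability for a statistic of 10 built from made-up log-likelihoods.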
Afr-Am Prostate Cancer GWAS
Afr-Am Prostate Cancer Admixture scan
Top results from scan of MIX statistic
chr  position   rs          adm      mix      snp      beta       se         pvalue
8    128187997  rs7844219   49.9614  83.6454  54.5324   0.279106  0.0366375  1.96509e-14
8    128193308  rs1551512   50.4199  81.3104  51.5792   0.266105  0.0364108  2.15938e-13
8    128198554  rs6989838   52.7453  80.7646  51.0694   0.266931  0.036335   1.61315e-13
8    128199669  rs7013255   50.4199  80.7332  51.7533   0.266914  0.0363362  1.62204e-13
8    128194098  rs16901979  49.9614  79.6505  51.0524   0.265175  0.0363043  2.22267e-13
8    128176062  rs6983561   49.9614  79.386   49.888    0.276348  0.037292   9.90319e-14
8    128194377  rs10505483  51.8086  78.1544  49.3351   0.260651  0.0363121  5.71765e-13
8    128174913  rs7012442   49.5051  77.371   48.4881   0.278969  0.0376443  9.78106e-14
8    128219343  rs6987409   49.5051  62.8232  46.5145   0.36881   0.0502564  1.41553e-13
8    128202258  rs7000307   49.9614  57.1498  37.9689   0.254356  0.0408703  4.13043e-10
8    128225845  rs7822987   54.6449  56.9699  40.3666   0.349385  0.0501898  2.4144e-12
8    128204516  rs7840773   52.2758  56.5461  36.8886   0.2491    0.0405111  6.68617e-10
8    128223073  rs7018243   49.051   56.4211  40.683    0.345422  0.0498431  3.02292e-12
8    128225870  rs7822995   49.9614  56.1409  40.2347   0.349646  0.0502108  2.37455e-12
8    128173525  rs13254738  52.7453  55.9177  37.4838  -0.255658  0.0386944  3.37064e-11
8    128204547  rs7824364   47.2561  55.7612  37.2464   0.2476    0.0404363  7.88135e-10
8    128173119  rs1456315   49.9614  54.9189  41.9607  -0.231627  0.0357647  8.22321e-11
8    128482487  rs6983267   55.6079  49.2272  19.0306  -0.280904  0.0593348  2.00619e-06
8    128168637  rs1840709   50.8806  44.0089  22.5331  -0.204043  0.0410319  6.27666e-07
8    128257237  rs16902003  50.4199  38.9456  29.1061   0.340009  0.0612992  2.29044e-08
Gene-gene interactions
Detecting higher order interactions
- Statistical epistasis may account for some hidden heritability
- Statistical and computational challenges are obvious
- Possible approaches for variable selection
  - Constrain search to only variables with strong marginal effects
  - Place priors on the effect sizes, informed through biology (e.g. Chen and Thomas, Genetic Epi 2010)
- Search space can still be huge
  - Implement massively parallel optimization algorithms
  - Provide a good fit for hardware architecture of Graphics Processing Units
Organization of gridblock of threadblocks on GPU
Overview of algorithm
- Newton-Raphson kernel
  - Each threadblock maps to a block of 512 subjects (threads) for 1 variable
  - Each thread calculates its subject's contribution to the gradient and Hessian
  - Sum (reduction) across the 512 subjects
  - Sum (reduction) across subject blocks in a new kernel
- Compute the log-likelihood change for each variable (like above).
- Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood.
- Iterate until the likelihood increase is less than epsilon
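The steps above can be sketched serially on a CPU. This is a minimal illustration in plain Python, not the GPU kernels from the talk: it assumes an unpenalized logistic model with no intercept (the LASSO penalty is omitted), and the function names, `eps`, and `max_iter` are hypothetical. Each term in the gradient and Hessian sums corresponds to what one thread would compute for one subject, and `tree_argmax` mimics the pairwise (log2) max reduction.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def log_lik(y, eta):
    """Logistic log-likelihood: sum of y*eta - log(1 + exp(eta))."""
    return sum(yi * ei - (max(ei, 0.0) + math.log1p(math.exp(-abs(ei))))
               for yi, ei in zip(y, eta))

def newton_step(xj, y, eta):
    """One Newton-Raphson update for a single variable.

    Each term of the two sums is one subject's contribution, i.e. the
    quantity a GPU thread would compute before the reduction.
    """
    grad = hess = 0.0
    for xij, yi, ei in zip(xj, y, eta):
        p = sigmoid(ei)
        grad += xij * (yi - p)
        hess += xij * xij * p * (1.0 - p)
    return grad / hess if hess > 0.0 else 0.0

def tree_argmax(values):
    """Pairwise max reduction in about log2(len(values)) rounds."""
    pairs = [(v, i) for i, v in enumerate(values)]
    while len(pairs) > 1:
        pairs = [max(pairs[k:k + 2]) for k in range(0, len(pairs), 2)]
    return pairs[0]  # (largest value, its index)

def greedy_fit(columns, y, eps=1e-6, max_iter=100):
    """Repeatedly update the variable that most increases the likelihood."""
    beta = [0.0] * len(columns)
    eta = [0.0] * len(y)  # current linear predictor per subject
    ll = log_lik(y, eta)
    for _ in range(max_iter):
        steps = [newton_step(xj, y, eta) for xj in columns]
        gains = [log_lik(y, [e + d * x for e, x in zip(eta, xj)]) - ll
                 for d, xj in zip(steps, columns)]
        gain, j = tree_argmax(gains)
        if gain < eps:  # likelihood increase below epsilon: stop
            break
        beta[j] += steps[j]
        eta = [e + steps[j] * x for e, x in zip(eta, columns[j])]
        ll += gain
    return beta, ll
```

Calling `greedy_fit(columns, y)` with `columns` as a list of per-variable genotype lists and `y` as 0/1 case status returns the coefficient vector and the final log-likelihood. Because only updates with a positive gain are accepted, the log-likelihood never decreases in this sketch.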
Evaluation on large dataset
- GWAS data
  - 6,806 African American subjects in a case-control study of prostate cancer
  - 1,047,986 SNPs typed
- Elapsed wall time for 1 LASSO iteration (sweep across all variables)
  - 15 minutes on optimized serial implementation across 2 slave CPUs
  - 5.8 seconds on parallel implementation across 2 NVIDIA Tesla C2050 GPU devices
- 155x speedup
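The reported speedup follows from the two wall times, converting the serial time to seconds:

```latex
\frac{15 \times 60\ \mathrm{s}}{5.8\ \mathrm{s}} = \frac{900}{5.8} \approx 155
```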
Application
- Defined 28 risk regions (Haiman et al., PLoS Genet, in press)
- 6,256 SNPs typed
- Fit a model with 19,571,896 variables using LASSO penalized multivariate logistic regression
- Avg run time per variable: 1 min 40 seconds
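The variable count is consistent with one main effect per SNP plus one term for every unordered SNP pair. The slide does not state this decomposition, so treat it as an assumption; the function name is hypothetical.

```python
def n_model_variables(n_snps):
    """Count main effects plus all unordered pairwise interactions (assumed layout)."""
    n_pairs = n_snps * (n_snps - 1) // 2  # "n choose 2" interaction terms
    return n_snps + n_pairs

print(n_model_variables(6256))  # 19571896
```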
Results
Table: First 10 variables to enter the model

                        Interaction β             1-df χ²
SNP 1       SNP 2       Multivariate  Univariate  Interaction  SNP 1    SNP 2
rs10050937  rs17794619  -0.472152     -0.512223   35.0549      6.6248   15.6842
rs12484747  rs5759052   -0.382707     -0.34638    27.221       3.89604  18.1621
rs12943477  rs7130881    0.243003      0.267117   42.1636      5.71494  31.5322
rs13417654  rs5759256    0.216708      0.240687   32.3129      16.0361  0.0104941
rs2625403   rs4872172   -0.12534      -0.148221   30.0041      11.1439  14.8556
rs266880    rs7949453    0.136513      0.152762   28.983       12.0237  10.7806
rs2963275   rs360802    -0.53471      -0.583309   29.8975      1.63451  7.26303
rs339319    rs7075009   -0.225684     -0.263573   31.4443      22.2988  6.06348
rs4129455   rs9333335   -1.33385      -1.78312    33.2588      6.37851  1.19051
rs6798749   rs8079894    0.179629      0.201867   29.0323      12.0371  9.97029
Copy number inferences
Application to cancer tumor data
- Copy number inference in tumors is more challenging
  - Tissues can be contaminated with normal cells
  - Furthermore, intra-tumor heterogeneity can lead to sub-clones with distinct CN profiles
- A large state space HMM
  - Consider differing normal-tumor copy number and genotype combinations
  - For each combination, a possible contamination proportion
  - Copy number: $z = (1-\alpha)\, z_{\mathrm{normal}} + \alpha\, z_{\mathrm{tumor}}$
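The mixture formula can be evaluated directly. A minimal sketch, reading α as the tumor fraction of the sample, which is how the formula weights the tumor term; the function name and the example inputs are hypothetical.

```python
def mixed_copy_number(alpha, z_normal, z_tumor):
    """Expected copy number of a sample that is a fraction alpha tumor cells."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * z_normal + alpha * z_tumor

# Assuming a diploid normal genome (z_normal = 2) and a 50% tumor sample:
print(mixed_copy_number(0.5, 2, 3))  # 2.5
```

Fractional values such as 0.5, 1.5, 2.5, and 3.5 arise this way when normal and tumor copy numbers differ and the sample is a mixture.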
Simplified Example of a State Space

state  CNfrac  BACnormal  CNtumor  BACtumor
0      2       0          2        0
1      2       1          2        1
2      2       2          2        2
3      0       0          0        0
4      0       1          0        0
5      0       2          0        0
6      0.5     0          0        0
7      0.5     1          0        0
8      0.5     2          0        0
9      1       0          1        0
10     1       1          1        0
11     1       1          1        1
12     1       2          1        1
13     1.5     0          1        0
14     1.5     1          1        0
15     1.5     1          1        1
16     1.5     2          1        1
17     2.5     0          3        0
18     2.5     1          3        1
19     2.5     1          3        2
20     2.5     2          3        3
21     3       0          4        0
22     3       1          4        1
23     3       1          4        2
24     3       1          4        3
25     3       2          4        4
26     3.5     0          4        0
27     3.5     1          4        1
28     3.5     1          4        2
29     3.5     1          4        3
30     3.5     2          4        4
Comparison of algorithms
- We implement 8 kernels. Examples:
  - Re-scaling transition matrix (for SNP spacing)
    - Serial: O(2nm^2); Parallel: O(n)
  - Forward backward
    - Serial: O(2nm^2); Parallel: O(n log2(m))
  - Normalizing constant (Baum-Welch)
    - Serial: O(nm); Parallel: O(log2(n))
  - MLE of transition matrix (Baum-Welch)
    - Serial: O(nm^2); Parallel: O(n)
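To make the serial costs concrete, here is a minimal sketch of the scaled forward pass of an HMM in plain Python. It reads n as the number of SNPs and m as the number of hidden states, which the slide does not define, and it is a textbook serial version rather than the GPU kernel from the talk. The argument names are hypothetical. The loop over positions contains an m-by-m sum, giving O(nm^2) work; a backward pass of the same shape would double it.

```python
import math

def forward_loglik(init, trans, emit):
    """Log-likelihood of an observation sequence under an HMM.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[t][j]  : probability of the observation at position t given state j
    """
    m = len(init)
    alpha = [init[j] * emit[0][j] for j in range(m)]
    loglik = 0.0
    for t in range(len(emit)):
        if t > 0:
            # m*m multiply-adds per position: the O(m^2) inner cost.
            alpha = [sum(alpha[i] * trans[i][j] for i in range(m)) * emit[t][j]
                     for j in range(m)]
        scale = sum(alpha)  # per-position normalizing constant
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik
```

The sum over j for a fixed position is independent across states, which is the kind of work that can be spread over threads, and each inner sum over i is a reduction.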
Speedups
Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)
states  CPU     GPU     fold-speedup
128     9.5m    37s     15x
512     2h 35m  1m 44s  108x
Chr 21 0 percent tumor
Chr 21 100 percent tumor
Chr 21 50 percent tumor
Thanks to
- Admixture scoring: Bogdan Pasaniuc
- CNV work: Kai Wang, Christina Curtis
- Access to GPU server: Tim Triche, Zach Ramjan
- (Chris's Acknowledgement slide)