A brief introduction to epistasis detection in GWAS
2014. 01. 27.
Hyun-hwan Jeong
Agenda
• Introduction
• Problem definition
• Computational detection methods
• Challenges
Introduction
Single Nucleotide Polymorphism
• A single-letter change in the DNA sequence
  • DNA sequences of any two people are about 99.9% identical
• Common type of genetic variation
• Occurs in ≥ 1% of the general population
…ATTCGCCGGCTGCAACGTTAGA…
…ATTCGCCGGCTGCAGCGTTAGA…
…ATTCGCCGGCTGCATCGTTAGA…
Genotype, phenotype and Allele
[Figure: illustration labelled with genotype, phenotype and allele. Source: http://en.wikipedia.org/wiki/Phenotype]
Genome-Wide Association Study for the relation between a single SNP and disease
Manhattan plot of the GWAS of the discovery cohort comprising 2,346 SSc cases and 5,193 healthy controls. - Nature Genetics 42, 426–429 (2010)
Why is detecting epistasis needed in GWAS?
An illustration of the interaction pattern between two SNPs with no marginal effect. - Bioinformatics 26, 30-37 (2010)
Problem Definition
Problem definition – Epistasis detection problem
• Objective
  • Detect the causative SNPs for a disease
  • Find the SNP combination with the maximum value of a defined measure
• Dataset
  • 0.5M ~ 1M SNPs
  • 4,000 ~ 5,000 subjects
  • Binary disease status (case/control)
  • 100MB ~ 1GB genotype data file
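To get a feel for what these dataset sizes mean for an exhaustive search, the number of SNP combinations can be counted directly. The sketch below is only illustrative arithmetic; the helper name `n_combinations` is an assumption introduced here and does not come from the slides.

```python
import math


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of a given interaction order."""
    return math.comb(n_snps, order)


# SNP counts taken from the dataset description (0.5M ~ 1M SNPs).
for n_snps in (500_000, 1_000_000):
    for order in (2, 3):
        count = n_combinations(n_snps, order)
        print(f"{n_snps:>9,} SNPs, order {order}: {count:.3e} combinations")
```

Each combination needs its own contingency table and score, so the count grows much faster than the number of SNPs itself.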
Problem definition – Data format
SNP0 SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 CLASS
1    1    0    0    0    0    1    0    1    1    1
0    0    1    0    0    0    1    1    0    2    1
0    0    0    0    0    0    1    0    0    0    1
1    1    0    0    0    0    0    1    0    2    1
0    0    0    0    0    0    0    1    0    0    1
0    0    0    0    0    0    0    0    0    1    0
1    1    0    1    0    0    0    1    1    1    0
0    0    0    1    1    1    0    1    1    1    0
0    1    0    2    0    0    0    1    0    1    0
0    0    0    1    0    0    1    2    1    0    0
• 3 values for SNP columns: 0 (AA), 1 (Aa/aA), 2 (aa)
• Binary values for CLASS: 1 (case/affected subjects), 0 (control/normal subjects)
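As a rough illustration of how this data format feeds the measures that follow, the sketch below counts the nine genotype combinations of one SNP pair separately for cases and controls. The function name `contingency_table` and the column-index convention are assumptions made for this example, not part of the original slides.

```python
# The ten example subjects from the slide: ten SNP columns followed by CLASS.
ROWS = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def contingency_table(rows, snp_a, snp_b):
    """Return a 2 x 9 table [case_counts, control_counts] for one SNP pair.

    Column index is 3 * genotype(snp_a) + genotype(snp_b), so the nine
    columns enumerate the genotype pairs (0,0), (0,1), ..., (2,2).
    """
    table = [[0] * 9, [0] * 9]
    for row in rows:
        cell = 3 * row[snp_a] + row[snp_b]
        group = 0 if row[-1] == 1 else 1  # row 0 = case, row 1 = control
        table[group][cell] += 1
    return table


case_counts, control_counts = contingency_table(ROWS, 0, 1)
print("case:   ", case_counts)
print("control:", control_counts)
```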
Problem definition – measure (1/3)
• On contingency table
• Popular measures in epistasis detection
  • $\chi^2$ test
  • Mutual information

Genotype   AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case         39    91    95    92    14    31    63     4    71   500
Control     100    15    55     5    22   150    50    93    10   500
sum         139   106   150    97    36   181   113    97    81  1000
Problem definition – measure (2/3)
• $\chi^2$ test based on the $\chi^2$ distribution
  • $H_0$: no association between SNPs and disease status

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

Observed contingency table
Genotype   AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case         39    91    95    92    14    31    63     4    71   500
Control     100    15    55     5    22   150    50    93    10   500
sum         139   106   150    97    36   181   113    97    81  1000

Expected contingency table
Genotype   AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case       69.5    53    75  48.5    18  90.5  56.5  48.5  40.5   500
Control    69.5    53    75  48.5    18  90.5  56.5  48.5  40.5   500
sum         139   106   150    97    36   181   113    97    81  1000

$\chi^2$ value: 379.07, p-value = $2.76 \times 10^{-77}$
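The expected table and the $\chi^2$ value above can be reproduced from the observed counts alone. The following is a minimal sketch in plain Python; the function names `expected_counts` and `chi_square` are assumptions made for this example.

```python
# Observed counts from the slide (columns AABB ... aabb).
OBSERVED = [
    [39, 91, 95, 92, 14, 31, 63, 4, 71],    # case
    [100, 15, 55, 5, 22, 150, 50, 93, 10],  # control
]


def expected_counts(observed):
    """Expected counts under H0: row total * column total / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    return [[rs * cs / total for cs in col_sums] for rs in row_sums]


def chi_square(observed):
    """Return the chi-square statistic and its degrees of freedom.

    Cells with an expected count of zero are skipped; the degrees of
    freedom assume that no row or column of the table is empty.
    """
    expected = expected_counts(observed)
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
        if e > 0
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


stat, dof = chi_square(OBSERVED)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```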
Problem definition – measure (3/3)
• Mutual information (1/3)
  • Non-parametric measure
Problem definition – measure (3/3)
• Mutual information (2/3)

Contingency table
Genotype    AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case          39     91     95     92     14     31     63      4     71    500
Control      100     15     55      5     22    150     50     93     10    500
sum          139    106    150     97     36    181    113     97     81   1000

Frequency table
Genotype    AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case       0.039  0.091  0.095  0.092  0.014  0.031  0.063  0.004  0.071  0.500
Control    0.100  0.015  0.055  0.005  0.022  0.150  0.050  0.093  0.010  0.500
sum        0.139  0.106  0.150  0.097  0.036  0.181  0.113  0.097  0.081  1.000
Problem definition – measure (3/3)
• Mutual information (3/3)

Entropy table (each cell is $-p \log_2 p$ of the corresponding frequency)
Genotype    AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case       0.183  0.315  0.323  0.317  0.086  0.155  0.251  0.032  0.271  0.500
Control    0.332  0.091  0.230  0.038  0.121  0.411  0.216  0.319  0.066  0.500
sum        0.396  0.343  0.411  0.326  0.173  0.446  0.355  0.326  0.294

$I(\mathrm{genotype}; \mathrm{disease}) = H(\mathrm{genotype}) + H(\mathrm{disease}) - H(\mathrm{genotype}, \mathrm{disease}) = 3.07 + 1.00 - 3.76 = 0.31$
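The same three entropies can be computed directly from the observed counts, which is a useful check on the hand-built tables. This is a small sketch under the assumption that entropies are measured in bits (base-2 logarithm), which matches the table values; the function names are chosen for this example only.

```python
import math

# Observed counts from the slide (columns AABB ... aabb).
OBSERVED = [
    [39, 91, 95, 92, 14, 31, 63, 4, 71],    # case
    [100, 15, 55, 5, 22, 150, 50, 93, 10],  # control
]


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def mutual_information(observed):
    """I(genotype; disease) = H(genotype) + H(disease) - H(genotype, disease)."""
    total = sum(sum(row) for row in observed)
    joint = [count / total for row in observed for count in row]
    genotype = [sum(col) / total for col in zip(*observed)]
    disease = [sum(row) / total for row in observed]
    return entropy(genotype) + entropy(disease) - entropy(joint)


mi = mutual_information(OBSERVED)
print(f"I(genotype; disease) = {mi:.2f} bits")
```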
Methods to detect epistasis
Methods – Computational Approaches
• Multifactor Dimensionality Reduction (Ritchie et al. 2002)
• SNPHarvester (Yang et al. 2009)
• SNPRuler (Wan et al. 2010)
• Mutual Information With Clustering (Leem et al. 2014)
Methods – Multifactor dimensionality reduction (1/2)
Methods – Multifactor dimensionality reduction (2/2)
• Model-free, non-parametric method
• Pattern-based method
  • Association rule for each combination of SNPs and phenotypes
  • e.g. if SNP10 = 0 and SNP13 = 4 then class = 1
• Exhaustive search
  • Computational burden
• Cross-validation consistency
  • To select the best model
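The pooling idea behind multifactor dimensionality reduction can be sketched on the ten example subjects from the data-format slide. This is a simplified illustration, not the published method: it scores each SNP pair by training accuracy and leaves out the cross-validation step named above, and the names `mdr_accuracy` and `best_pair` as well as the threshold rule are assumptions made for this example.

```python
from itertools import combinations

# The ten example subjects: ten SNP columns followed by CLASS.
ROWS = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def mdr_accuracy(rows, snps, threshold=1.0):
    """Pool genotype cells into high/low risk and return training accuracy."""
    cases, controls = {}, {}
    for row in rows:
        cell = tuple(row[s] for s in snps)
        target = cases if row[-1] == 1 else controls
        target[cell] = target.get(cell, 0) + 1
    # A cell is high risk when its case:control ratio exceeds the threshold.
    high_risk = {
        cell for cell in cases
        if cases[cell] > threshold * controls.get(cell, 0)
    }
    # Predict "case" for subjects in a high-risk cell, "control" otherwise.
    correct = sum(
        (tuple(row[s] for s in snps) in high_risk) == (row[-1] == 1)
        for row in rows
    )
    return correct / len(rows)


def best_pair(rows, n_snps):
    """Exhaustively score every SNP pair and return the most accurate one."""
    return max(
        combinations(range(n_snps), 2),
        key=lambda pair: mdr_accuracy(rows, pair),
    )


pair = best_pair(ROWS, 10)
print(pair, mdr_accuracy(ROWS, pair))
```

The call to `combinations` is where the exhaustive-search cost shows up: every pair is scored, so the work grows with the number of SNP combinations rather than the number of SNPs.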
Methods – SNPHarvester (1/2)
Methods – SNPHarvester (2/2)
• Local search
  • Local optima problem
• PathSeeker algorithm
  • Successive runs
• Score function: $\chi^2$ value
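To make the local-search idea concrete, here is a generic hill-climbing sketch with random restarts that uses the $\chi^2$ value as its score. It is not the published PathSeeker algorithm, whose details are not given on the slide; the swap rule, the function names and the use of the ten example subjects are all assumptions made for illustration.

```python
import random

# The ten example subjects: ten SNP columns followed by CLASS.
ROWS = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def chi_square(rows, snps):
    """Chi-square statistic of the genotype-combination x class table."""
    counts = {}
    for row in rows:
        cell = tuple(row[s] for s in snps)
        pair = counts.setdefault(cell, [0, 0])
        pair[0 if row[-1] == 1 else 1] += 1
    n_case = sum(c[0] for c in counts.values())
    n_control = sum(c[1] for c in counts.values())
    total = n_case + n_control
    stat = 0.0
    for case, control in counts.values():
        cell_total = case + control
        for observed, group_total in ((case, n_case), (control, n_control)):
            expected = cell_total * group_total / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def local_search(rows, n_snps, k, rng):
    """Hill-climb from a random k-SNP group by swapping one SNP at a time."""
    group = rng.sample(range(n_snps), k)
    score = chi_square(rows, group)
    improved = True
    while improved:  # stops at a local optimum: no single swap helps
        improved = False
        for i in range(k):
            for candidate in range(n_snps):
                if candidate in group:
                    continue
                trial = group[:i] + [candidate] + group[i + 1:]
                trial_score = chi_square(rows, trial)
                if trial_score > score:
                    group, score, improved = trial, trial_score, True
    return sorted(group), score


def successive_runs(rows, n_snps, k, n_runs, seed=0):
    """Restart the local search several times and keep the best group."""
    rng = random.Random(seed)
    return max(
        (local_search(rows, n_snps, k, rng) for _ in range(n_runs)),
        key=lambda result: result[1],
    )


group, score = successive_runs(ROWS, n_snps=10, k=2, n_runs=5)
print(group, round(score, 2))
```

Restarting from several random groups is one simple way to reduce the chance of reporting a poor local optimum, at the cost of more score evaluations.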
Methods – SNPRuler
• Pattern-based method
  • Predictive rule
• Branch-and-bound algorithm
  • Upper bound on the $\chi^2$ value (d.f. = 1)
Methods – Mutual Information With Clustering (1/2)
[Figure: SNPs and causative SNPs grouped around Centroid 1, Centroid 2 and Centroid 3. Labels: distanceScore = d1 + d2; m candidates per cluster; 3 SNPs with the highest mutual information value.]
Methods – Mutual Information With Clustering (2/2)
• Mutual information
  • As distance measure for clustering
• K-means clustering algorithm
  • Candidate selection
  • Reduces the search space dramatically
• Can detect high-order epistatic interactions
  • Also shows better performance (power, execution time) than previous methods
Challenges in epistasis detection
Challenges
• Reducing computational burden
  • Filtering
  • Parallel processing
• Higher-order epistatic interaction detection
  • Interaction order larger than 2
• Novel measure of association between SNPs and disease