A brief introduction to epistasis detection in GWAS 2014. 01. 27. Hyun-hwan Jeong


A brief introduction to epistasis detection in GWAS

2014. 01. 27.

Hyun-hwan Jeong


Agenda

• Introduction

• Problem definition

• Computational detection methods

• Challenges


Introduction


Single Nucleotide Polymorphism

• A single-letter change in the DNA sequence
  • Human DNA sequences are 99.9% identical
• Common type of genetic variation
• Variant present in ≥ 1% of the general population

…ATTCGCCGGCTGCAACGTTAGA…
…ATTCGCCGGCTGCAGCGTTAGA…
…ATTCGCCGGCTGCATCGTTAGA…


Genotype, phenotype and allele

[Figure: diagram labelling genotype, phenotype and allele. Source: http://en.wikipedia.org/wiki/Phenotype]


Genome-Wide Association Study for the relation between a single SNP and disease

Manhattan plot of the GWAS of the discovery cohort comprising 2,346 SSc cases and 5,193 healthy controls. - Nature Genetics 42, 426–429 (2010)


Why is detecting epistasis needed in GWAS?

An illustration of interaction pattern between two SNPs with no marginal effect. -Bioinformatics 26, 30-37 (2010)


Problem Definition


Problem definition – Epistasis detection problem

• Objective
  • Detect the causative SNPs for a disease
  • Find the SNP combination with the maximum value of a defined measure
• Dataset
  • 0.5M ~ 1M SNPs
  • 4,000 ~ 5,000 subjects
  • Binary disease status (case/control)
  • 100MB ~ 1GB genotype data file
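To give a sense of scale, the number of SNP combinations an exhaustive search would have to score can be counted directly. This is a minimal sketch that assumes the upper end of the dataset size listed above (1M SNPs); the variable names are illustrative only.

```python
import math

n_snps = 1_000_000  # assumption: upper end of the SNP range listed above

# Number of distinct SNP pairs and triples an exhaustive search would score.
n_pairs = math.comb(n_snps, 2)
n_triples = math.comb(n_snps, 3)

print(f"pairs:   {n_pairs:.3e}")
print(f"triples: {n_triples:.3e}")
```

With 1M SNPs there are roughly 5 × 10^11 pairs, and each one needs its own contingency table over all subjects.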


Problem definition – Data format

SNP0  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  CLASS
   1     1     0     0     0     0     1     0     1     1      1
   0     0     1     0     0     0     1     1     0     2      1
   0     0     0     0     0     0     1     0     0     0      1
   1     1     0     0     0     0     0     1     0     2      1
   0     0     0     0     0     0     0     1     0     0      1
   0     0     0     0     0     0     0     0     0     1      0
   1     1     0     1     0     0     0     1     1     1      0
   0     0     0     1     1     1     0     1     1     1      0
   0     1     0     2     0     0     0     1     0     1      0
   0     0     0     1     0     0     1     2     1     0      0

• 3 values for SNP columns: 0 (AA), 1 (Aa/aA), 2 (aa)
• Binary values for CLASS: 1 (case/affected subjects), 0 (control/normal)
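As an illustration of how this format is used, the sketch below counts subjects per genotype combination of two SNPs, split by class. It uses the ten example subjects from this slide; the function name `contingency` and the choice of SNP3 and SNP6 are illustrative assumptions, not part of the original material.

```python
from collections import Counter

# The ten example subjects: ten SNP columns followed by CLASS.
rows = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def contingency(rows, snp_a, snp_b):
    """Count subjects per (genotype of snp_a, genotype of snp_b), split by class."""
    counts = {1: Counter(), 0: Counter()}
    for *genotypes, label in rows:
        counts[label][(genotypes[snp_a], genotypes[snp_b])] += 1
    return counts


table = contingency(rows, 3, 6)  # hypothetical choice: SNP3 and SNP6
print("case:   ", dict(table[1]))
print("control:", dict(table[0]))
```

Only genotype combinations that actually occur are stored, so the table stays small even though up to nine combinations are possible for a SNP pair.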


Problem definition – measure(1/3)

• Measures are computed on a contingency table
• Popular measures in epistasis detection
  • $\chi^2$ test
  • Mutual information

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case        39    91    95    92    14    31    63     4    71   500
Control    100    15    55     5    22   150    50    93    10   500
sum        139   106   150    97    36   181   113    97    81  1000


Problem definition – measure(2/3)

• $\chi^2$ test, based on the $\chi^2$ distribution
  • $H_0$: no association between SNPs and disease status

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Observed contingency table

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case        39    91    95    92    14    31    63     4    71   500
Control    100    15    55     5    22   150    50    93    10   500
sum        139   106   150    97    36   181   113    97    81  1000

Expected contingency table

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case      69.5    53    75  48.5    18  90.5  56.5  48.5  40.5   500
Control   69.5    53    75  48.5    18  90.5  56.5  48.5  40.5   500
sum        139   106   150    97    36   181   113    97    81  1000

$\chi^2$ value: 379.07, p-value = $2.76 \times 10^{-77}$
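The statistic above can be reproduced from the observed table with a few lines of standard-library Python. This is a sketch rather than a reference implementation: the helper names are illustrative, and the p-value helper assumes (rows − 1)(columns − 1) = 8 degrees of freedom, which may not be the convention behind the p-value shown on the slide.

```python
import math

case = [39, 91, 95, 92, 14, 31, 63, 4, 71]
control = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column sum is positive, so no expected count is zero.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total  # expected count E_i
            stat += (o - e) ** 2 / e
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, df


def chi2_sf_even_df(x, df):
    """Upper-tail probability of the chi-square distribution (even df only)."""
    half = x / 2
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


stat, df = chi_square([case, control])
p_value = chi2_sf_even_df(stat, df)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.2e}")
```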


Problem definition – measure(3/3)

• Mutual information (1/3)
  • Non-parametric measure
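For reference, the standard definitions of entropy and mutual information, written with base-2 logarithms to match the numbers on the following slides, are given below. Here X stands for the genotype combination and Y for the disease status; these symbol names are chosen for illustration.

```latex
\begin{align}
H(X)    &= -\sum_{x} p(x) \log_2 p(x) \\
H(X, Y) &= -\sum_{x} \sum_{y} p(x, y) \log_2 p(x, y) \\
I(X; Y) &= H(X) + H(Y) - H(X, Y)
\end{align}
```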


Problem definition – measure(3/3)

• Mutual information (2/3)

Contingency table

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case        39    91    95    92    14    31    63     4    71   500
Control    100    15    55     5    22   150    50    93    10   500
sum        139   106   150    97    36   181   113    97    81  1000

Frequency table

Genotype   AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case      0.039  0.091  0.095  0.092  0.014  0.031  0.063  0.004  0.071  0.500
Control   0.100  0.015  0.055  0.005  0.022  0.150  0.050  0.093  0.010  0.500
sum       0.139  0.106  0.150  0.097  0.036  0.181  0.113  0.097  0.081  1.000


Problem definition – measure(3/3)

• Mutual information (3/3)

Entropy table (each cell is $-p \log_2 p$ of the corresponding frequency)

Genotype   AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case      0.183  0.315  0.323  0.317  0.086  0.155  0.251  0.032  0.271  0.500
Control   0.332  0.091  0.230  0.038  0.121  0.411  0.216  0.319  0.066  0.500
sum       0.396  0.343  0.411  0.326  0.173  0.446  0.355  0.326  0.294

$$
\begin{aligned}
I(\text{genotype}; \text{disease}) &= H(\text{genotype}) + H(\text{disease}) - H(\text{genotype}, \text{disease}) \\
&= 3.07 + 1.00 - 3.76 \\
&= 0.31
\end{aligned}
$$
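The same calculation can be sketched in Python directly from the contingency counts. The helper name `entropy` and the variable names are illustrative; the numbers are the ones from the tables above.

```python
import math

case = [39, 91, 95, 92, 14, 31, 63, 4, 71]
control = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


total = sum(case) + sum(control)

# Joint distribution over (disease status, genotype) cells.
joint = [count / total for count in case + control]
# Marginal distribution over genotype combinations (column sums).
genotype = [(a + b) / total for a, b in zip(case, control)]
# Marginal distribution over disease status (row sums).
disease = [sum(case) / total, sum(control) / total]

h_genotype = entropy(genotype)
h_disease = entropy(disease)
h_joint = entropy(joint)
mi = h_genotype + h_disease - h_joint

print(f"H(genotype) = {h_genotype:.2f}")
print(f"H(disease)  = {h_disease:.2f}")
print(f"H(joint)    = {h_joint:.2f}")
print(f"I           = {mi:.2f}")
```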


Methods to detect epistasis


Methods – Computational Approaches

• Multifactor Dimensionality Reduction (Ritchie et al. 2002)
• SNPHarvester (Yang et al. 2009)
• SNPRuler (Wan et al. 2010)
• Mutual Information With Clustering (Leem et al. 2014)


Methods – Multifactor dimensionality reduction (1/2)


Methods – Multifactor dimensionality reduction (2/2)

• Model-free, non-parametric method
• Pattern-based method
  • Association rule for each combination of SNPs and phenotypes
  • e.g. if SNP10 = 0 and SNP13 = 4 then class = 1
• Exhaustive search
  • Computational burden
• Cross-validation consistency
  • Used to select the best model
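The core idea can be sketched as follows: for one SNP combination, each genotype cell is labelled high-risk or low-risk from its case/control counts, which reduces the multi-locus genotype to a single binary attribute, and every combination is then scored exhaustively. This is a simplified sketch and not the published implementation: it reports training accuracy only and omits cross-validation, the threshold of 1.0 assumes a balanced dataset, and the function names are illustrative. The data are the ten example subjects from the data format slide.

```python
from collections import Counter
from itertools import combinations

# Ten example subjects: ten SNP columns followed by CLASS.
rows = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def mdr_accuracy(rows, snps, threshold=1.0):
    """Training accuracy of the high-risk/low-risk labelling for one SNP set."""
    cases, controls = Counter(), Counter()
    for *genotypes, label in rows:
        cell = tuple(genotypes[s] for s in snps)
        (cases if label == 1 else controls)[cell] += 1
    correct = 0
    for cell in set(cases) | set(controls):
        # A cell is high-risk when its case/control ratio exceeds the threshold.
        high_risk = cases[cell] > threshold * controls[cell]
        # Subjects in high-risk cells are predicted as cases, others as controls.
        correct += cases[cell] if high_risk else controls[cell]
    return correct / len(rows)


def exhaustive_search(rows, order=2):
    """Score every SNP combination of the given order and return the best one."""
    n_snps = len(rows[0]) - 1
    return max(combinations(range(n_snps), order),
               key=lambda snps: mdr_accuracy(rows, snps))


best = exhaustive_search(rows)
print(best, mdr_accuracy(rows, best))
```

The exhaustive loop over `combinations` is the source of the computational burden noted above: its length grows combinatorially with the number of SNPs and the interaction order.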


Methods – SNPHarvester (1/2)


Methods – SNPHarvester (2/2)

• Local search
  • Local optima problem
• PathSeeker algorithm
  • Successive runs
• Score function: $\chi^2$ value
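The general pattern named on this slide, a local search scored by the $\chi^2$ value and repeated over successive runs to reduce the risk of stopping at a poor local optimum, can be sketched as a simple hill climber with random restarts. This is a generic illustration and not the PathSeeker algorithm as published: the swap neighbourhood, the number of runs and all function names are assumptions. The data are the ten example subjects from the data format slide.

```python
import random
from collections import Counter

rows = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def chi_square_score(rows, snps):
    """Pearson chi-square of the genotype-combination vs. class table."""
    cells = {0: Counter(), 1: Counter()}
    for *genotypes, label in rows:
        cells[label][tuple(genotypes[s] for s in snps)] += 1
    n = len(rows)
    class_totals = {label: sum(c.values()) for label, c in cells.items()}
    score = 0.0
    for cell in set(cells[0]) | set(cells[1]):
        cell_total = cells[0][cell] + cells[1][cell]
        for label in (0, 1):
            expected = class_totals[label] * cell_total / n
            score += (cells[label][cell] - expected) ** 2 / expected
    return score


def local_search(rows, k, rng):
    """Hill climbing: swap one SNP at a time while the score improves."""
    n_snps = len(rows[0]) - 1
    group = rng.sample(range(n_snps), k)  # random starting group
    best = chi_square_score(rows, group)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(n_snps):
                if candidate in group:
                    continue
                trial = group[:pos] + [candidate] + group[pos + 1:]
                score = chi_square_score(rows, trial)
                if score > best:
                    group, best, improved = trial, score, True
    return sorted(group), best


def successive_runs(rows, k=2, runs=5, seed=0):
    """Repeat the local search from several random starts; keep the best."""
    rng = random.Random(seed)
    return max((local_search(rows, k, rng) for _ in range(runs)),
               key=lambda result: result[1])


group, score = successive_runs(rows)
print(group, round(score, 3))
```

Each run scores only the neighbours of the current group instead of every combination, which is what makes local search cheaper than exhaustive search, at the cost of possibly missing the global optimum.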


Methods – SNPRuler

• Pattern-based method
  • Predictive rule
• Branch-and-bound algorithm
  • Uses an upper bound of the $\chi^2$ value (d.f. = 1)


Methods – Mutual Information With Clustering (1/2)

[Figure: clustering diagram. Legend: SNPs, causative SNPs. Labels: Centroid 1, Centroid 2, Centroid 3; 3 SNPs with the highest mutual information value; m candidates (×3); distanceScore = d1 + d2]


Methods – Mutual Information With Clustering (2/2)

• Mutual information
  • Used as the distance measure for clustering
• K-means clustering algorithm
  • Candidate selection
  • Reduces the search space dramatically
• Can detect high-order epistatic interactions
  • Also shows better performance (power, execution time) than previous methods


Challenges in epistasis detection


Challenges

• Reducing computational burden
  • Filtering
  • Parallel processing
• Higher-order epistatic interaction detection
  • Interaction order larger than 2
• Novel measures of association between SNPs and disease
