a brief introduction to epistasis detection


A brief introduction to epistasis detection in GWAS

2014. 01. 27.

Hyun-hwan Jeong

Agenda

• Introduction

• Problem definition

• Computational detection methods

• Challenges


Introduction

Single Nucleotide Polymorphism

• A single-letter change in the DNA sequence
  • Human DNA sequences are about 99.9% identical
• A common type of genetic variation
• Occurs in ≥ 1% of the general population

…ATTCGCCGGCTGCAACGTTAGA…
…ATTCGCCGGCTGCAGCGTTAGA…
…ATTCGCCGGCTGCATCGTTAGA…


Genotype, phenotype and allele

[Figure: illustration of genotype, phenotype and allele. Source: http://en.wikipedia.org/wiki/Phenotype]


Genome-Wide Association Study (GWAS) for the relation between a single SNP and disease

Manhattan plot of the GWAS of the discovery cohort comprising 2,346 SSc cases and 5,193 healthy controls. - Nature Genetics 42, 426–429 (2010)


Why is detecting epistasis needed in GWAS?

An illustration of an interaction pattern between two SNPs with no marginal effect. - Bioinformatics 26, 30–37 (2010)


Problem Definition


Problem definition – Epistasis detection problem

• Objective
  • Detect the SNPs that are causative for a disease
  • Maximize the value of a defined measure
• Dataset
  • 0.5M ~ 1M SNPs
  • 4,000 ~ 5,000 subjects
    • Binary disease status (case/control)
  • 100MB ~ 1GB genotype data file


Problem definition – Data format

SNP0  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  CLASS
1     1     0     0     0     0     1     0     1     1     1
0     0     1     0     0     0     1     1     0     2     1
0     0     0     0     0     0     1     0     0     0     1
1     1     0     0     0     0     0     1     0     2     1
0     0     0     0     0     0     0     1     0     0     1
0     0     0     0     0     0     0     0     0     1     0
1     1     0     1     0     0     0     1     1     1     0
0     0     0     1     1     1     0     1     1     1     0
0     1     0     2     0     0     0     1     0     1     0
0     0     0     1     0     0     1     2     1     0     0

• 3 values for SNP columns: 0 (AA), 1 (Aa/aA), 2 (aa)
• Binary values for CLASS: 1 (case/affected subjects), 0 (control/normal)
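
As a rough illustration of how this format is consumed, the sketch below tallies a case/control contingency table for one SNP pair from the ten toy subjects listed above. The names ROWS and contingency_table are hypothetical, and the cell index 3 * genotype_a + genotype_b is only one convenient layout, not something the slides prescribe.

```python
from collections import Counter

# The ten toy subjects from the slide: SNP0..SNP9 followed by CLASS.
ROWS = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def contingency_table(rows, snp_a, snp_b):
    """Return table[cls][cell]: subject counts per class and genotype pair."""
    counts = Counter((row[-1], row[snp_a], row[snp_b]) for row in rows)
    table = [[0] * 9 for _ in range(2)]  # 2 classes x 9 genotype combinations
    for (cls, ga, gb), n in counts.items():
        table[cls][3 * ga + gb] = n  # assumed cell layout: 3 * a + b
    return table


table = contingency_table(ROWS, 7, 9)  # SNP7 paired with SNP9
print("control:", table[0])
print("case:   ", table[1])
```

With real GWAS data the same tally would be repeated for every SNP combination under test, which is where the cost of the search comes from.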


Problem definition – measure(1/3)

• Computed on a contingency table
• Popular measures in epistasis detection
  • χ² test
  • Mutual information

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case        39    91    95    92    14    31    63     4    71   500
Control    100    15    55     5    22   150    50    93    10   500
sum        139   106   150    97    36   181   113    97    81  1000


Problem definition – measure(2/3)

• χ² test, based on the χ² distribution
  • H0: no association between SNPs and disease status

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

Observed contingency table (O)

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case        39    91    95    92    14    31    63     4    71   500
Control    100    15    55     5    22   150    50    93    10   500
sum        139   106   150    97    36   181   113    97    81  1000

Expected contingency table (E)

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case      69.5    53    75  48.5    18  90.5  56.5  48.5  40.5   500
Control   69.5    53    75  48.5    18  90.5  56.5  48.5  40.5   500
sum        139   106   150    97    36   181   113    97    81  1000

χ² value: 379.07, p-value = 2.76 × 10⁻⁷⁷
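
A minimal sketch of the Pearson χ² statistic for the observed table on this slide, assuming the expected counts are derived from the row and column totals (which gives the expected table shown here). The function name chi_square is illustrative. The p-value is left out because it needs the tail function of the χ² distribution, which the Python standard library does not provide.

```python
# Observed counts from the slide, columns AABB ... aabb.
CASE = [39, 91, 95, 92, 14, 31, 63, 4, 71]
CONTROL = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def chi_square(observed):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count
            if e > 0:  # skip genotype cells that no subject falls into
                stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_square([CASE, CONTROL])
print(f"chi2 = {stat:.2f}, d.f. = {dof}")
```

On these counts the statistic comes out at 379.07 with 8 degrees of freedom, matching the value quoted on the slide.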


Problem definition – measure(3/3)

• Mutual information (1/3)
  • Non-parametric measure


Problem definition – measure(3/3)

• Mutual information(2/3)

Contingency table

Genotype  AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb   sum
Case        39    91    95    92    14    31    63     4    71   500
Control    100    15    55     5    22   150    50    93    10   500
sum        139   106   150    97    36   181   113    97    81  1000

Frequency table

Genotype   AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case      0.039  0.091  0.095  0.092  0.014  0.031  0.063  0.004  0.071  0.500
Control   0.100  0.015  0.055  0.005  0.022  0.150  0.050  0.093  0.010  0.500
sum       0.139  0.106  0.150  0.097  0.036  0.181  0.113  0.097  0.081  1.000


Problem definition – measure(3/3)

• Mutual information(3/3)

Entropy table (each entry is −p log₂ p of the corresponding frequency; the sum row and column use the marginal frequencies)

Genotype   AABB   AABb   AAbb   AaBB   AaBb   Aabb   aaBB   aaBb   aabb    sum
Case      0.183  0.315  0.323  0.317  0.086  0.155  0.251  0.032  0.271  0.500
Control   0.332  0.091  0.230  0.038  0.121  0.411  0.216  0.319  0.066  0.500
sum       0.396  0.343  0.411  0.326  0.173  0.446  0.355  0.326  0.294

$I(\text{genotype}; \text{disease}) = H(\text{genotype}) + H(\text{disease}) - H(\text{genotype}, \text{disease})$

$= 3.07 + 1.00 - 3.76$

$= 0.31$
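
The same calculation can be sketched in a few lines, assuming base-2 logarithms (which is what the entropy values on this slide correspond to) and plug-in frequency estimates taken directly from the counts. The function names entropy and mutual_information are illustrative.

```python
from math import log2

# Observed counts from the slide, columns AABB ... aabb.
CASE = [39, 91, 95, 92, 14, 31, 63, 4, 71]
CONTROL = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def mutual_information(table):
    """I(genotype; disease) = H(genotype) + H(disease) - H(genotype, disease)."""
    total = sum(sum(row) for row in table)
    p_joint = [count / total for row in table for count in row]
    p_class = [sum(row) / total for row in table]
    p_geno = [sum(col) / total for col in zip(*table)]
    return entropy(p_geno) + entropy(p_class) - entropy(p_joint)


mi = mutual_information([CASE, CONTROL])
print(f"I(genotype; disease) = {mi:.2f} bits")
```

Run on the slide's table, this gives 0.31 bits, in agreement with the worked value.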


Methods to detect epistasis


Methods – Computational approaches

• Multifactor Dimensionality Reduction (Ritchie et al. 2002)
• SNPHarvester (Yang et al. 2009)
• SNPRuler (Wan et al. 2010)
• Mutual Information With Clustering (Leem et al. 2014)


Methods – Multifactor dimensionality reduction (1/2)


Methods – Multifactor dimensionality reduction (2/2)

• Model-free, non-parametric method
• Pattern-based method
  • Association rule for each combination of SNPs and phenotypes
    • e.g. if SNP10 = 0 and SNP13 = 4 then class = 1
• Exhaustive search
  • Computational burden
• Cross-validation consistency
  • Used to select the best model
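
The slides do not spell out how MDR turns a multi-locus table into a classifier. Assuming the commonly described formulation, in which a genotype cell is labelled high-risk when its case:control ratio exceeds the overall ratio and the resulting two-group labelling is scored by accuracy, a sketch on the two-SNP table used earlier in this deck might look like this. The function names and the threshold choice are assumptions, and the cross-validation step is omitted.

```python
# Observed counts from the earlier slides, columns AABB ... aabb.
CASE = [39, 91, 95, 92, 14, 31, 63, 4, 71]
CONTROL = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def mdr_label_cells(case, control):
    """Mark a cell high-risk when its case:control ratio exceeds the overall ratio."""
    threshold = sum(case) / sum(control)  # assumed threshold: overall ratio
    # Compare c > threshold * k to avoid dividing by an empty control cell.
    return [c > threshold * k for c, k in zip(case, control)]


def mdr_accuracy(case, control, high_risk):
    """Accuracy of predicting 'case' in high-risk cells and 'control' elsewhere."""
    correct = sum(c if hr else k for c, k, hr in zip(case, control, high_risk))
    return correct / (sum(case) + sum(control))


high_risk = mdr_label_cells(CASE, CONTROL)
accuracy = mdr_accuracy(CASE, CONTROL, high_risk)
print(high_risk)
print(f"training accuracy = {accuracy:.3f}")
```

An exhaustive search would repeat this labelling for every SNP combination and keep the model that is selected most consistently across cross-validation folds.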


Methods – SNPHarvester (1/2)


Methods – SNPHarvester (2/2)

• Local search
  • Local optima problem
• PathSeeker algorithm
  • Successive runs
• Score function: χ² value


Methods – SNPRuler

• Pattern-based method
  • Predictive rule
• Branch-and-bound algorithm
  • Upper bound of the χ² value (d.f. = 1)


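Methods – Mutual Information With Clustering (1/2)

[Figure: SNPs clustered around three centroids (Centroid 1, Centroid 2, Centroid 3). Legend: SNPs, causative SNPs. Labels: distances d1 and d2, with distanceScore = d1 + d2; "3 SNPs with the highest mutual information value"; "m candidates" for each of the three clusters.]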


Methods – Mutual Information With Clustering (2/2)

• Mutual information
  • Used as the distance measure for clustering
• K-means clustering algorithm
  • Candidate selection
  • Reduces the search space dramatically
• Can detect high-order epistatic interactions
  • Also shows better performance (power, execution time) than previous methods


Challenges in epistasis detection


Challenges

• Reducing computational burden
  • Filtering
  • Parallel processing
• Higher-order epistatic interaction detection
  • Order larger than 2
• Novel measures of association between SNPs and disease
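
To give a feel for the computational burden, the sketch below counts how many SNP combinations an exhaustive search would have to score for the dataset sizes quoted in the problem definition (0.5M ~ 1M SNPs). The helper name n_combinations is hypothetical; the count is simply the binomial coefficient.

```python
from math import comb  # available in Python 3.8+


def n_combinations(n_snps, order):
    """Number of distinct SNP sets of the given interaction order."""
    return comb(n_snps, order)


for n_snps in (500_000, 1_000_000):
    for order in (2, 3):
        total = n_combinations(n_snps, order)
        print(f"{n_snps:>9,} SNPs, order {order}: {total:.3e} combinations")
```

Even at the low end, a pairwise scan covers about 1.25 × 10^11 SNP pairs, and each extra order multiplies the count by several orders of magnitude, which is why filtering and parallel processing appear on this list.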
