Allelic Pattern Sampler: Genetic Combinations Underlying Complex Diseases


Polygenic diseases (traits)

Susceptibility to a polygenic disease arises from the contribution of a set of genes.

Heterogeneity: different genetic backgrounds give rise to the same disease.

The disease outcome is correlated with the genetic background rather than determined by it.

Environmental effect or heterogeneity: gang-specific eyebrows. A common signature is improbable.

Polygenic contribution

The genes interact (epistasis)

The genes may behave as interacting only with respect to the disease.

• Complementary alleles. For an allele to manifest in the trait, an allele of another gene is required.

• Alternative pathways.

The genes can contribute independently in an additive way.

The pattern concept. An example: image recognition

(1,0) (1,1/2) (1/2,1/2)

(1,1) (1/2,1) (0,1)

Allelic (genetic) pattern

We know the levels of a trait (i.e. disease) for a group of persons, and we know the alleles of candidate genes that these persons carry.

A pattern is a set of alleles of the genes whose presence in a genome as a whole is associated with the trait.

• Any subset of the pattern is associated less reliably than the whole pattern is. So is any superset. Thus, a pattern is a locally minimal subset satisfying the statements above.

• A pattern may contain only one allele.

[Figure: three charts with a vertical axis from 0% to 100%, showing carriers and non-carriers among controls and patients.]

Allele(s)         Group      Carriers   Non-carriers
DR4               controls   48         183
DR4               patients   49         163
CCR5 Del          controls   40         191
CCR5 Del          patients   52         160
DR4 + CCR5 Del    controls   1          230
DR4 + CCR5 Del    patients   17         195

DR4 + CCR5 Del: OR 20.1, p<0.0001

The solid line points to an independent combination ratio.

Cross-sectional comparison of MS patients and controls among carriers and non-carriers of alleles of DRB1 HLA gene, CCR5 chemokine receptor gene deletion and their combination.

Favorova OO, Andreewski TV, Boiko AN, Sudomoina MA, Alekseenkov AD, Kulakova OG, Slanova AV, Gusev EI. 2002. The chemokine receptor CCR5 deletion mutation is associated with MS in HLA-DR4-positive Russians. Neurology 59(10):1652-5.

Example of a genetic pattern for a complex polygenic disease.
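The odds ratio quoted for the DR4 + CCR5 Del combination can be rechecked from the counts on the slide. The sketch below assumes that the first number of each pair is the carrier count and the second the non-carrier count; the function name `odds_ratio` and the dictionary layout are illustrative only.

```python
def odds_ratio(patient_carriers, patient_noncarriers,
               control_carriers, control_noncarriers):
    """Odds ratio of carriership in patients versus controls for a 2x2 table."""
    return (patient_carriers * control_noncarriers) / (
        patient_noncarriers * control_carriers)

# Counts as read from the slide, assumed to be (carriers, non-carriers).
tables = {
    "DR4":            {"patients": (49, 163), "controls": (48, 183)},
    "CCR5 Del":       {"patients": (52, 160), "controls": (40, 191)},
    "DR4 + CCR5 Del": {"patients": (17, 195), "controls": (1, 230)},
}

odds_ratios = {
    name: odds_ratio(*t["patients"], *t["controls"])
    for name, t in tables.items()
}

for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.1f}")
```

Under this reading, the combination gives an odds ratio that rounds to the 20.1 shown on the slide, while each allele alone stays close to 1.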

Patterns hide each other

A union of the combinations may require more than two alleles in a locus:

....| 0 0 | a b | 0 0 |....
....| 0 0 | c 0 | 0 0 |....

The strongest association (not necessarily the most reliable one) statistically shadows all the others.

[Figure axis label: disease level]

Independence question

We cannot define a proper space of patterns, because the operation of addition (as a union of allelic sets) is not defined for every pair of patterns; thus we cannot apply a component analysis technique.

Mutual isolation of patterns

We say that a pattern is considered isolated from a set of other patterns if we remove the influence of all the other patterns before we consider our pattern’s association with the trait.

• It is analogous to an adjustment procedure.

Set of patterns

Since we cannot consider one pattern apart from the others, we consider a set of patterns simultaneously.

Data

• We have genotypic data and phenotypic trait level data for some individuals.

• The trait levels are comparative characteristics. They cannot be measured, they can only be compared.

• We want to obtain the allelic patterns that best characterize the relation between the genotypic and phenotypic data.

The set of patterns is a variable to be optimized

We look for a whole set of patterns that maximises the probability that all the patterns are associated with the disease in a mutually isolated manner.

• A good pattern set forms a kind of “gradient basis” in the genome-trait association.

Data structures

Trait    Incidence   Gene data
level    matrix
0.1      1 0 0       a c | d d | f s |....
0.4      0 1 1       c f | a b | b a |....
0.7      0 0 0       a a | c b | a c |....
0.9      0 0 1       c f | f b | b s |....
0.2      1 1 1       a f | a d | b c |....
…        .......     ........................

The correspondence between the two matrices (trait levels and incidence) shows the quality of the set of patterns.

Set of patterns

0 0 | d 0 | 0 0 |....

0 0 | a 0 | 0 0 |....

0 f | 0 0 | b 0 |....
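The incidence matrix can be derived from the gene data and the set of patterns. The following is a minimal sketch, assuming that `0` in a pattern means "any allele" and that a genome carries a pattern when every non-zero allele of the pattern is present at its locus; the helper names `parse`, `carries` and `incidence_matrix` are illustrative and not part of APSampler.

```python
def parse(row):
    """Parse 'a c | d d | f s' into [('a', 'c'), ('d', 'd'), ('f', 's')]."""
    return [tuple(locus.split()) for locus in row.split("|")]

def carries(genotype, pattern):
    """True if every allele required by the pattern is present at its locus."""
    for locus_alleles, required in zip(genotype, pattern):
        available = list(locus_alleles)
        for allele in required:
            if allele == "0":          # "0" places no requirement
                continue
            if allele not in available:
                return False
            available.remove(allele)   # each required allele uses one copy
    return True

def incidence_matrix(genotypes, patterns):
    """One row per genome, one 0/1 column per pattern."""
    return [[int(carries(g, p)) for p in patterns] for g in genotypes]

genotypes = [parse(r) for r in [
    "a c | d d | f s",
    "c f | a b | b a",
    "a a | c b | a c",
    "c f | f b | b s",
    "a f | a d | b c",
]]
patterns = [parse(r) for r in [
    "0 0 | d 0 | 0 0",
    "0 0 | a 0 | 0 0",
    "0 f | 0 0 | b 0",
]]

incidence = incidence_matrix(genotypes, patterns)
for row in incidence:
    print(*row)
```

With the five genomes and three patterns from the slides, this reproduces the incidence rows shown next to the trait levels.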

The incidence classification

Incidence matrix

1 0 0
0 1 1
0 0 0
0 0 1
1 1 1
1 0 1
0 1 0
1 0 1
0 0 1
.......

All the cases are classified into 2^n possible classes, where n is the number of patterns, based on their row in the incidence matrix.

[Figure: a cube whose vertices are the classes 000, 100, 010, 110, 001, 101, 011, 111; the marked edge direction is the direction of the second pattern.]

The classes could be represented by the vertices of a hypercube.

A set of parallel edges of the cube corresponds to a pattern.
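The hypercube picture can be made concrete with a short sketch: cases are grouped into classes by their incidence row, and the edges that belong to a pattern are the pairs of classes that differ only in that pattern's bit. The function names `classify` and `edges_for_pattern` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def classify(trait_levels, incidence_rows):
    """Group trait levels by incidence row (a vertex of the hypercube)."""
    classes = defaultdict(list)
    for level, row in zip(trait_levels, incidence_rows):
        classes[tuple(row)].append(level)
    return dict(classes)

def edges_for_pattern(n_patterns, k):
    """All edges (x, y) parallel to pattern k: x lacks the pattern, y carries it,
    and the two vertices agree on every other pattern."""
    edges = []
    for vertex in product((0, 1), repeat=n_patterns):
        if vertex[k] == 0:
            neighbour = vertex[:k] + (1,) + vertex[k + 1:]
            edges.append((vertex, neighbour))
    return edges

levels = [0.1, 0.4, 0.7, 0.9, 0.2]
rows = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1], [1, 1, 1]]
classes = classify(levels, rows)
second_pattern_edges = edges_for_pattern(3, 1)
print(second_pattern_edges)
```

For three patterns each pattern owns four parallel edges, for example (0, 0, 0) to (0, 1, 0) for the second pattern.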

A pair of classes comparison

We can only compare the disease (trait) levels, so the appropriate statistic for the comparison is the number of inversions.

[Figure: the cube of classes 000, 100, 010, 110, 001, 101, 011, 111 with two adjacent classes marked x and y.]

Two classes of trait levels that lie on the same edge differ due to the “isolated” influence of the edge’s pattern. So, we base the evaluation of a pattern set on such pairwise comparisons.
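Counting inversions needs only comparisons between trait levels, never their numeric differences. A minimal sketch follows, assuming an inversion is a pair (x, y) with x from class X, y from class Y and x > y, and that a tie counts as one half (a common convention, not stated in the slides).

```python
def count_inversions(x_levels, y_levels):
    """Number of pairs (x, y) with x > y; a tie contributes 0.5 (assumption)."""
    inversions = 0.0
    for x in x_levels:
        for y in y_levels:
            if x > y:
                inversions += 1.0
            elif x == y:
                inversions += 0.5
    return inversions

# Y lies entirely above X: no inversions.
print(count_inversions([0.1, 0.2], [0.4, 0.9]))
# Y lies entirely below X: every one of the 2 x 2 pairs is an inversion.
print(count_inversions([0.4, 0.9], [0.1, 0.2]))
```

The count ranges from 0 to |X|·|Y|, so a value near either end suggests that one class tends to have higher trait levels than the other.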

A pair of classes. Alternative hypotheses.

To test a pair of adjacent classes, we formulate three hypotheses about the corresponding pattern:

• null hypothesis: X and Y have the same median, e.g. X≡Y;

• “positive” hypothesis: median (Y) > median (X) (predisposing pattern);

• “negative” hypothesis: median (Y) < median (X) (protecting pattern).

We compare the hypotheses in a Bayesian paradigm.

The likelihoods for a pair: example

[Figure: likelihood lines labelled null, -, + and const; surviving axis labels: inv# (80) and p (0.25).]

The larger the minor class is, the sharper all the likelihoods are. If its size is 1 or 0, all four lines coincide.

The null-hypothesis posterior for a pattern

• A pattern’s likelihood for a hypothesis is a product of the likelihoods of all corresponding class pairs.

• If a pattern is carried by all the genomes in the data or by none of them (it is uninformative), the null-hypothesis prior for the pattern is 1. For informative patterns, we use a uniform prior.

For a pattern:

$$P(H_0 \mid \text{data}) = \frac{P(\text{data} \mid H_0)\,P(H_0)}{P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_+)\,P(H_+) + P(\text{data} \mid H_-)\,P(H_-)}$$
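The posterior for a pattern can be sketched as follows. The per-pair likelihoods are taken as given inputs here (the slides do not spell out how they are computed from the inversion counts), and the uniform prior is assumed to mean 1/3 for each of the three hypotheses; `pattern_posteriors` is a hypothetical name.

```python
import math

HYPOTHESES = ("null", "+", "-")

def pattern_posteriors(pair_likelihoods, informative=True):
    """Posterior of each hypothesis for one pattern.

    pair_likelihoods: one dict per class pair on the pattern's edges, mapping
    "null", "+" and "-" to the likelihood of that pair under the hypothesis.
    """
    if not informative:
        # Uninformative pattern: the null-hypothesis prior is 1.
        return {"null": 1.0, "+": 0.0, "-": 0.0}
    prior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}  # assumed uniform
    # A pattern's likelihood is the product over all its class pairs.
    likelihood = {
        h: math.prod(pair[h] for pair in pair_likelihoods) for h in HYPOTHESES
    }
    evidence = sum(likelihood[h] * prior[h] for h in HYPOTHESES)
    return {h: likelihood[h] * prior[h] / evidence for h in HYPOTHESES}

# Two class pairs with made-up likelihood values, for illustration only.
posteriors = pattern_posteriors([
    {"null": 0.2, "+": 0.6, "-": 0.2},
    {"null": 0.5, "+": 0.5, "-": 0.5},
])
print(posteriors)
```

A pair whose likelihoods are equal under all hypotheses, like the second one above, leaves the posterior unchanged, which matches the remark that very small classes carry no information.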

The quality of a set of patterns

• The pairwise comparisons for all the class pairs that correspond to parallel edges together qualify a pattern.

• All patterns together qualify a set of patterns.

• A good pattern set is one without bad patterns.

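[Figure: the cube of classes 000, 100, 001, 101, 010, 110, 011, 111.]

$$P(\overline{H_0}) = \prod_{i=1}^{p} \left(1 - P(H_0^i \mid \text{data})\right)$$

is the quality $P(\overline{H_0})$ of a set of $p$ patterns, where $H_0^i$ is the null hypothesis for the $i$-th pattern.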

Optimization of the pattern set quality

• Direct enumeration is inefficient.

• A kind of gradient maximisation is prone to getting stuck in local maxima.

Thus, we use the Markov chain Monte Carlo (MCMC) method.

Specifically, it is a hybrid Metropolis-Hastings-Gibbs sampler with a random choice of updates.

Possible updating steps

A mutation:

0 0 | d 0 | 0 0
0 0 | a 0 | 0 0
0 f | 0 0 | b 0

becomes

0 0 | d 0 | 0 0
0 0 | a 0 | 0 0
0 f | c 0 | b 0

A recombination:

0 0 | d 0 | 0 0
0 0 | a 0 | 0 0
0 f | 0 0 | b 0

becomes

0 0 | d 0 | 0 0
0 0 | a 0 | b 0
0 f | 0 0 | 0 0
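The two update steps can be sketched as pure functions on a pattern set. This is an illustration only: in a sampler the pattern, locus and new allele would be drawn at random and the proposal accepted or rejected, whereas here the indices are passed explicitly so that the slide's examples can be reproduced. The names `mutate` and `recombine` are hypothetical.

```python
import copy

def mutate(patternset, pattern_index, locus_index, new_locus):
    """Return a copy of the pattern set with one locus of one pattern replaced."""
    result = copy.deepcopy(patternset)
    result[pattern_index][locus_index] = new_locus
    return result

def recombine(patternset, i, j, locus_index):
    """Return a copy in which patterns i and j have exchanged one locus."""
    result = copy.deepcopy(patternset)
    result[i][locus_index], result[j][locus_index] = (
        result[j][locus_index], result[i][locus_index])
    return result

start = [
    [("0", "0"), ("d", "0"), ("0", "0")],
    [("0", "0"), ("a", "0"), ("0", "0")],
    [("0", "f"), ("0", "0"), ("b", "0")],
]

# Mutation from the slide: the third pattern gains allele c at the second locus.
mutated = mutate(start, 2, 1, ("c", "0"))
# Recombination from the slide: the second and third patterns swap the third locus.
recombined = recombine(start, 1, 2, 2)
print(mutated[2])
print(recombined[1], recombined[2])
```

Returning copies rather than editing in place keeps the current state intact, which is convenient when a proposed update is rejected.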

Output statistics

*** Patternsets statistics: ***

| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |

+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+

| 0 0 | 0 0 | 0 0 | T 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |

| 0 0 | 0 0 | 0 0 | 0 0 | C T | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |

Registered 64 times.

Pattern posteriors to be positive:

3.709e-10 7.143e-11

Pattern posteriors to be negative:

0.001556 0.03835

Point reliability = 5.9658e-05

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Patterns statistics:


+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+

| 0 0 | 0 0 | 0 0 | 0 0 | C 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |

Occured 5927 times. +/- : 0/5927

(Mentioned 41 times. +/- : 0/41 )

maximal reliabilities as + and - are 4.81058e-10 and 0.0172151 .


+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+

| 0 0 | 0 0 | 0 0 | T 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |

Occured 3022 times. +/- : 0/3022

(Mentioned 19 times. +/- : 0/19 )

maximal reliabilities as + and - are 4.74783e-06 and 0.00205254 .

A(llelic) P(attern) Sampler

APSampler software was developed …

Favorov AV, Andreewski TV, Sudomoina MA, Favorova OO, Parmigiani G, Ochs MF. A Markov chain Monte Carlo technique for identification of combinations of allelic variants underlying complex diseases in humans. Genetics 2005, 171(4):2113-2121.

… and applied to real data

Favorova OO, Favorov AV, Boiko AN, Andreewski TV, Sudomoina MA, Alekseenkov AD, Kulakova OG, Gusev EI, Parmigiani G, Ochs MF. Three allele combinations associated with multiple sclerosis. BMC Med Genet 2006, 7:63.

Sudomoina MA, Nikolaeva TY, Parfenov MG, Alekseenkov AD, Favorov AV, Gekht AB, Gusev EI, Favorova OO. Genetic risk factors of arterial hypertension: analysis of ischemic stroke patients from the Yakut ethnic group. Dokl Biochem Biophys 2006 Sep-Oct, 410:324-6 (Rus).

Chikhladze NM, Samedova KhF, Sudomoina MA, Thant M, Htut ZM, Litonova GN, Favorov AV, Chazova IE, Favorova OO. Contribution of CYP11B2, REN and AGT genes in genetic predisposition to arterial hypertension associated with hyperaldosteronism. Kardiologiia 2008, 48(1):37-42 (Rus).

Validation I: Exact Fisher

               Patients   Controls
Carriers       PC         CC
Non-carriers   PNC        CNC

pattern → p (pattern)
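For a single pattern, the 2x2 carrier table can be tested with Fisher's exact test. The sketch below is a plain standard-library implementation of the usual two-sided test (summing the hypergeometric probabilities of all tables no more likely than the observed one); it is not taken from APSampler, and the argument names follow the cell labels PC, CC, PNC and CNC of the table.

```python
from math import comb

def fisher_exact_two_sided(pc, cc, pnc, cnc):
    """Two-sided Fisher exact p-value for the table [[pc, cc], [pnc, cnc]]."""
    carriers = pc + cc
    patients = pc + pnc
    total = pc + cc + pnc + cnc
    denominator = comb(total, carriers)

    def probability(k):
        # Hypergeometric probability of exactly k patients among the carriers.
        return comb(patients, k) * comb(total - patients, carriers - k) / denominator

    lowest = max(0, carriers - (total - patients))
    highest = min(carriers, patients)
    observed = probability(pc)
    # Sum over all tables that are at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)

# DR4 + CCR5 Del counts from the MS example, read as carriers / non-carriers.
p_combination = fisher_exact_two_sided(17, 1, 195, 230)
print(p_combination)
```

Under the carrier/non-carrier reading of the counts, the p-value for the DR4 + CCR5 Del combination falls below 0.0001, in line with the value quoted on the slide.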

Validation II: permutation

[Figure: the disease data are permuted N times; each permuted disease data set, combined with the unchanged genetic data, gives one null distribution (1st, 2nd, 3rd, ..., N-th), and together they form the null distribution of p.]

Pfail [pattern] = Pfail [p (pattern)]
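The permutation scheme can be sketched generically: shuffle the disease data against the unchanged genetic data, recompute an association score, and report how often the permuted score is at least as extreme as the observed one. The score used here (absolute difference in mean trait level between carriers and non-carriers) is a stand-in chosen for brevity; in the slides the quantity compared is p (pattern), where smaller is more extreme. All names are hypothetical.

```python
import random

def mean_difference(levels, carrier_flags):
    """Absolute difference in mean trait level, carriers versus non-carriers."""
    carriers = [l for l, f in zip(levels, carrier_flags) if f]
    others = [l for l, f in zip(levels, carrier_flags) if not f]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))

def permutation_p(levels, carrier_flags, score, n_permutations=1000, seed=0):
    """Share of permutations whose score is at least the observed score."""
    rng = random.Random(seed)
    observed = score(levels, carrier_flags)
    shuffled = list(levels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)              # permute disease data only
        if score(shuffled, carrier_flags) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

levels = [1] * 10 + [0] * 10     # toy data: patients first, then controls
flags = [1] * 10 + [0] * 10      # every patient carries the pattern
p_perm = permutation_p(levels, flags, mean_difference, n_permutations=200, seed=1)
print(p_perm)
```

Because the genetic data stay fixed, any structure among the markers is preserved and only the genotype-trait link is broken.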

Validation III: FDR

         Test passed   Test failed
True     TP            FN
False    FP            TN

p ≈ FP/(FP+TN)     FDR ≈ FP/(FP+TP)

Validation III: FDR: evaluation

[Figure: null distributions 1 to N (1st, 2nd, 3rd, ..., N-th).]

Validation III: FDR: calculation

[Figure: the genetic data combined with N permuted disease data sets give the null distribution of p; the genetic data combined with the original disease data give the original distribution of p.]

Validation III: FDR: evaluation II

[Figure: FDR as a function of the threshold T, evaluated directly and approximated; FDR(T1) > FDR(T2).]

Validation: FDR: example

• 61 markers and gender

• 120 controls and 255 MS patients

• Among the 255, 155 respond to a medication

Pattern contains 3 informative alleles: Gender:1; 27:T; 42:C.

The pattern is mentioned in statistics as occurred 1 times at line: 3011.

Occured in 1 patternsets 1 times.

Mentioned in patternsets at lines: 731.

Fisher 4-pole table:

  1    2   levels
 51   51   carriers
 60  171   noncarriers

p-value = 1.98632243779503e-05

FDR=0.00179340028694405 (2.5e-06/0.001394)

Pattern contains 3 informative alleles: 21:G; 37:T; 53:C.

The pattern is mentioned in statistics as occurred 1 times at line: 3227.

Occurred in 1 patternsets 1 times.

Mentioned in patternsets at lines: 427.

Fisher 4-pole table:

  0    1   levels
  1   19   carriers
 89  118   noncarriers

p-value = 0.000368247913041713

FDR <=1 (0.0067765/1e-06)
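The FDR values in the output are consistent with a simple ratio of the two numbers in parentheses, capped at 1. The sketch below assumes that the first number is the share of null (permuted-data) patterns passing the p-value threshold and the second is the share of patterns from the original data passing it; this reading, and the name `estimate_fdr`, are assumptions.

```python
def estimate_fdr(null_fraction, observed_fraction):
    """FDR estimate as null share over observed share, capped at 1."""
    if observed_fraction <= 0:
        return 1.0
    return min(1.0, null_fraction / observed_fraction)

first = estimate_fdr(2.5e-06, 0.001394)    # first pattern in the output
second = estimate_fdr(0.0067765, 1e-06)    # second pattern in the output
print(first, second)
```

The first call reproduces the reported 0.0017934..., and the second hits the cap, matching the "FDR <=1" line.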

Authors

Alexander Favorov 1,3

Olga Favorova 2

Marina Sudomoina 2

Giovanni Parmigiani 3

Michael Ochs 3

1. State Scientific Centre “GosNIIGenetica”, Moscow, Russia.

2. Russian State Medical University, Moscow, Russia.

3. The Sidney Kimmel Cancer Center at Johns Hopkins, Baltimore, MD, USA.

4. Faculty of Bioinformatics and Biotechnology, MSU, Moscow.

5. Yakut Research Center, Russian Academy of Medical Sciences and Government of the Sakha Republic (Yakutia), Yakutsk.

6. Institute of Information Transmission Problems RAS, Moscow, Russia.

7. School of Pharmacy - CCRCB - QUB, Belfast, UK.

Acknowledgements

Alexey Alexeenkov 2

Alexey Boiko 2

Evgeniy Gusev 2

Mikhail Parfenov 2

Tatiana Nikolaeva 5

Mikhail Gelfand 6

Vsevolod Makeev 1

Andrew Mironov 4

Koen Vanderbroek 7

Thank you for your attention.

MS case-control study

• The method was applied to a database that contains the results of genotyping DNA from 237 unrelated patients with clinically defined MS and from 358 healthy unrelated controls (all of them Russians).

• 15 polymorphic sites of candidate loci for MS development were analyzed.

• The phenotypic trait (i.e. the MS susceptibility) levels were 1 for patients and 0 for controls.

• There were two runs: one for two patterns, one for three.

APSampler identified the following patterns as MS-associated:

• DRB1*15(2)

• TNFa9

• CCR5Δ32 + DRB1*04

• TGFβ1 -509*C + DRB1*18 + +49CTLA4*G (trio 1)

• -238 TNF*B1 + -308 TNF*A2 + +49CTLA4*G (trio 2)

Results of Fisher’s 4-pole association test for the trios and their 2-element subsets

| Combination | Patients, N (%) | Controls, N (%) | p value |
| –509TGFβ1*C, DRB1*18(3), CTLA4*G (trio 1) | 5 (5) | 0 (0) | 0.009 |
| –509TGFβ1*C, DRB1*18(3) | 5 (5) | 2 (1) | 0.114 |
| –509TGFβ1*C, CTLA4*G | 60 (61) | 88 (57) | 0.603 |
| DRB1*18(3), CTLA4*G | 5 (5) | 1 (1) | 0.035 |
| –238TNF*B1, –308TNF*A2, CTLA4*G (trio 2) | 11 (9) | 0 (0) | 0.003 |
| –238TNF*B1, –308TNF*A2 | 13 (10) | 4 (5) | 0.198 |
| –238TNF*B1, CTLA4*G | 38 (30) | 15 (17) | 0.037 |
| –308TNF*A2, CTLA4*G | 23 (18) | 13 (15) | 0.580 |

The permutation test gave values below 0.3% for both trios.

Analysis of genetic background of ischemic stroke (IS) patients of Yakut descent

| Total, n (mean age ± SD) | Men, n (mean age ± SD) | Women, n (mean age ± SD) |
| 115 (58.1 ± 11.5) | 75 (55.9 ± 12.3) | 40 (62.2 ± 8.4) |
| 108 (57.7 ± 11.3) | 64 (55.9 ± 12.1) | 44 (60.3 ± 9.6) |

Examined polymorphic loci

| Gene | Chromosome | Coding region | Regulatory regions |
| FGA | 4q28 | A4266G (Thr312Ala) | |
| FGB | | | C-249T; C-148T |
| APOE | 19q13.2 | T3937C + C4075T (Cys112Arg + Arg158Cys) | A-491T; T-427C |
| LPL | 8p22 | C1595G (Ser447Ter) | T495G |
| ACE | 17q23 | | I/D |
| CMA | 14q11.2 | | G-1903A |

IS genetic background analysis

Associations identified

| Allele or allelic combination | p (pcorr) | OR | CI (95%) |
| APOE | | | |
| -427C | 0.001 | 0.3 | 0.1-0.6 |
| -427T/C | 0.0003 | 0.2 | 0.08-0.5 |
| -427T/T | 0.001 | 3.8 | 1.6-8.9 |
| ε2 | 0.01* | 0.3 | 0.1-0.8 |
| ε2/ε3 | 0.03* | 0.3 | 0.09-0.7 |
| APOE -491T + FGB -249T | 0.02 | 0.3 | 0.1-0.9 |
| APOE -491T + LPL 495T/T | 0.01 | 0.3 | 0.08-0.8 |

* p-value is calculated by Fisher’s criterion in an 8-pole table

[Figure: carriership of allele 495T LPL, % (axis 0 to 100); p<0.0001*.]

[Figure: carriership of the 3-allelic pattern -249C FGB, ε4 APOE and -1903A CMA, % (axis 0 to 50); p=0.0003*. Two-allele subsets: -249C FGB + -1903A CMA, p=0.017; ε4 APOE + -1903A CMA, p=0.023.]
