Computational Issues on Statistical Genetics

Page 1: Computational Issues on Statistical Genetics

Computational Issues on Statistical Genetics

• Research Questions; Review the Literature
• Develop Methods
  – Test the power and robustness by computer simulation
• Data Collection
  – Database construction (Excel, Access)
  – Translate data to analyzable form
• Analyze Data
  – Preliminary results (figures, tables)
• Write Reports/Papers

• Program languages
  – Efficient, feasible
• Graphics
  – Excel graphics
  – Programmable graphics

Page 2: Computational Issues on Statistical Genetics

Program Languages

• Fortran, C, C++
• Matrix language: MATLAB, S-Plus, R, SAS IML
• Symbolic calculation: Mathematica, Maple, MATLAB
• Interface programming: .NET, C#, Visual Basic
• SAS, SPSS, BMDP
• Database: Access, Excel, SQL, SAS, Oracle
• MACRO
  – Excel, Access, PowerPoint, Word
  – Editor: WinEdt
  – SAS Macro

Page 3: Computational Issues on Statistical Genetics

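Two Point Analysis in F2: Fully Informative Markers (codominant)

Obs = observed count, Freq = expected frequency, Recom. = number of recombinant gametes in the genotype.

             BB                     Bb                                  bb
AA  Obs      n22                    n21                                 n20
    Freq     $\tfrac14(1-r)^2$      $\tfrac12 r(1-r)$                   $\tfrac14 r^2$
    Recom.   0                      1                                   2
Aa  Obs      n12                    n11                                 n10
    Freq     $\tfrac12 r(1-r)$      $\tfrac12(1-r)^2+\tfrac12 r^2$      $\tfrac12 r(1-r)$
    Recom.   1                      $2r^2/[(1-r)^2+r^2]$                1
aa  Obs      n02                    n01                                 n00
    Freq     $\tfrac14 r^2$         $\tfrac12 r(1-r)$                   $\tfrac14(1-r)^2$
    Recom.   2                      1                                   0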

Page 4: Computational Issues on Statistical Genetics

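EM algorithm to estimate the recombination fraction r:

1. Choose an initial value $r^{(0)}$.
2. For $t = 0, 1, 2, \ldots$, repeat the E-step and M-step while $|r^{(t+1)} - r^{(t)}| > 10^{-8}$.

E-step: calculate $\phi^{(t)} = \dfrac{(r^{(t)})^2}{(1-r^{(t)})^2 + (r^{(t)})^2}$, the probability that a double heterozygote AaBb carries two recombinant gametes; its expected number of recombination events is therefore $2\phi^{(t)}$.

M-step: $r^{(t+1)} = \dfrac{1}{2n}\left[2(n_{20}+n_{02}) + (n_{21}+n_{12}+n_{10}+n_{01}) + 2\phi^{(t)} n_{11}\right]$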
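One way to read the M-step, offered as an explanatory sketch rather than the author's own derivation: every F2 individual is formed from two gametes, so r can be updated as the expected number of recombinant gametes divided by the 2n gametes in the sample. The symbol $m_{ij}$ (number of recombinant gametes in an individual of cell ij) is introduced here for illustration only.

```latex
\begin{align*}
r^{(t+1)} &= \frac{1}{2n}\sum_{i,j} n_{ij}\,\mathrm{E}\!\left[m_{ij}\mid r^{(t)}\right],\\[4pt]
\mathrm{E}\!\left[m_{11}\mid r^{(t)}\right]
  &= 0\cdot\bigl(1-\phi^{(t)}\bigr) + 2\cdot\phi^{(t)} = 2\phi^{(t)},
\qquad
\phi^{(t)} = \frac{\tfrac12\,(r^{(t)})^2}{\tfrac12\,(1-r^{(t)})^2+\tfrac12\,(r^{(t)})^2}.
\end{align*}
```

For the other eight cells $m_{ij}$ is the fixed value 0, 1, or 2 listed in the Recom. rows of the table, which gives the three terms inside the brackets of the M-step.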

Page 5: Computational Issues on Statistical Genetics

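Two Point Analysis in F2: Fully Informative Markers (codominant)

[Figure: screenshot of an input form with a 3 × 3 grid of genotype counts (rows AA, Aa, aa; columns BB, Bb, bb) and the labels n, Start, Input, Result, Reset, and r0.]

$\phi^{(t)} = \dfrac{(r^{(t)})^2}{(1-r^{(t)})^2 + (r^{(t)})^2}$

$r^{(t+1)} = \dfrac{1}{2n}\left[2(n_{20}+n_{02}) + (n_{21}+n_{12}+n_{10}+n_{01}) + 2\phi^{(t)} n_{11}\right]$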

Page 6: Computational Issues on Statistical Genetics

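Two Point Analysis in F2: Fully Informative Markers (codominant)

function r = rEstF2(n22,n21,n20,n12,n11,n10,n02,n01,n00)
% EM estimate of the recombination fraction r from the nine F2 genotype counts.
n = n22+n21+n20+n12+n11+n10+n02+n01+n00;
r = 0.2; r1 = -1;            % starting value; r1 holds the previous iterate
while (abs(r1-r) > 1.e-8)
    r1 = r;
    % E-step: probability that AaBb carries two recombinant gametes
    phi = r^2/((1-r)^2+r^2);
    % M-step: expected recombinant gametes divided by the 2n gametes
    r = 1/(2*n)*(2*(n20+n02)+(n21+n12+n10+n01)+2*phi*n11);
end

MATLAB program to estimate the recombination fraction r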
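The EM iteration above can be sanity-checked without real data: if the nine counts are set to their expected values at a known r, the iteration should converge back to that r. The Python sketch below mirrors the MATLAB update; the function names `expected_counts` and `em_recombination`, the value r = 0.1, and n = 200 are illustrative assumptions, not part of the original slides.

```python
def expected_counts(r, n):
    """Expected F2 counts, rows AA/Aa/aa by columns BB/Bb/bb, at recombination fraction r."""
    q = 1.0 - r
    return [
        [n * q * q / 4, n * r * q / 2, n * r * r / 4],
        [n * r * q / 2, n * (q * q + r * r) / 2, n * r * q / 2],
        [n * r * r / 4, n * r * q / 2, n * q * q / 4],
    ]


def em_recombination(table, r0=0.2, tol=1e-8, max_iter=1000):
    """EM estimate of r from a 3 x 3 table [[n22,n21,n20],[n12,n11,n10],[n02,n01,n00]]."""
    (n22, n21, n20), (n12, n11, n10), (n02, n01, n00) = table
    n = sum(sum(row) for row in table)
    r = r0
    for _ in range(max_iter):
        # E-step: probability that AaBb carries two recombinant gametes
        phi = r * r / ((1 - r) ** 2 + r * r)
        # M-step: expected recombinant gametes over 2n gametes
        r_new = (2 * (n20 + n02) + (n21 + n12 + n10 + n01) + 2 * phi * n11) / (2 * n)
        if abs(r_new - r) <= tol:
            return r_new
        r = r_new
    return r


# Hypothetical check: counts generated at r = 0.1 should give back about 0.1.
r_hat = em_recombination(expected_counts(0.1, 200))
print(round(r_hat, 4))
```

The same check with simulated multinomial counts instead of expected counts is one way to study power and robustness, as listed on the first slide.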

Page 7: Computational Issues on Statistical Genetics

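Log-likelihood ratio test statistic

Two alternative hypotheses: $H_0: r = 0.5$ vs. $H_1: r \neq 0.5$

Likelihood value under $H_1$:

$L_1(r \mid n_{ij}) = \dfrac{n!}{n_{22}!\cdots n_{00}!}\left[\tfrac14(1-r)^2\right]^{n_{22}+n_{00}}\left[\tfrac14 r^2\right]^{n_{20}+n_{02}}\left[\tfrac12 r(1-r)\right]^{n_{21}+n_{12}+n_{10}+n_{01}}\left[\tfrac12(1-r)^2+\tfrac12 r^2\right]^{n_{11}}$

Likelihood value under $H_0$:

$L_0(r=0.5 \mid n_{ij}) = \dfrac{n!}{n_{22}!\cdots n_{00}!}\left[\tfrac14(1-0.5)^2\right]^{n_{22}+n_{00}}\left[\tfrac14\, 0.5^2\right]^{n_{20}+n_{02}}\left[\tfrac12\, 0.5(1-0.5)\right]^{n_{21}+n_{12}+n_{10}+n_{01}}\left[\tfrac12(1-0.5)^2+\tfrac12\, 0.5^2\right]^{n_{11}}$

$\mathrm{LOD} = \log_{10}\left[L_1(r \mid n_{ij}) / L_0(r=0.5 \mid n_{ij})\right]$

$= 2(n_{22}+n_{00})\left[\log_{10}(1-r)-\log_{10}(1-0.5)\right] + \ldots = 6.08 >$ critical LOD $= 3$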

Page 8: Computational Issues on Statistical Genetics

Two Point Analysis in F2: Fully Informative Markers (codominant)

function LOD = calcLOD_F2(r,n22,n21,n20,n12,n11,n10,n02,n01,n00)
% LOD score for H1 (r as given) against H0 (r = 0.5).
% The multinomial coefficient cancels in the ratio and is omitted.
%% log likelihood under H1
LOD = (n22+n00)*log10((1-r)^2/4)...
    +(n20+n02)*log10(r^2/4)...
    +(n21+n12+n10+n01)*log10(r*(1-r)/2)...
    +n11*log10((1-r)^2/2+r^2/2);
%% log likelihood under H0
r = 0.5;
LOD0 = (n22+n00)*log10((1-r)^2/4)...
    +(n20+n02)*log10(r^2/4)...
    +(n21+n12+n10+n01)*log10(r*(1-r)/2)...
    +n11*log10((1-r)^2/2+r^2/2);
LOD = LOD-LOD0;

MATLAB program to calculate the log-likelihood ratio test score (LOD)

Page 9: Computational Issues on Statistical Genetics

Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

The starting point is the 3 × 3 table for fully informative markers (Page 3). When marker B is dominant, genotypes BB and Bb cannot be told apart, so the BB and Bb columns are pooled into a single class B_.

Page 10: Computational Issues on Statistical Genetics

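Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

             B_                                                    bb
AA  Obs      n2_ = n22 + n21                                       n20
    Freq     $\tfrac14(1-r)^2+\tfrac12 r(1-r)$                     $\tfrac14 r^2$
    Recom.   $c_1$                                                 2
Aa  Obs      n1_ = n12 + n11                                       n10
    Freq     $\tfrac12 r(1-r)+\tfrac12(1-r)^2+\tfrac12 r^2$        $\tfrac12 r(1-r)$
    Recom.   $c_2$                                                 1
aa  Obs      n0_ = n02 + n01                                       n00
    Freq     $\tfrac14 r^2+\tfrac12 r(1-r)$                        $\tfrac14(1-r)^2$
    Recom.   $c_3$                                                 0

where

$c_1 = \dfrac{\tfrac12 r(1-r)}{\tfrac14(1-r)^2+\tfrac12 r(1-r)}$

$c_2 = \dfrac{\tfrac12 r(1-r)+r^2}{\tfrac12 r(1-r)+\tfrac12(1-r)^2+\tfrac12 r^2}$

$c_3 = \dfrac{2\cdot\tfrac14 r^2+\tfrac12 r(1-r)}{\tfrac14 r^2+\tfrac12 r(1-r)}$

Estimate of r: $r = \left(c_1 n_{2\_} + c_2 n_{1\_} + c_3 n_{0\_} + 2 n_{20} + n_{10}\right)/(2n)$

The last two terms follow the bb column of the table: AAbb (count n20) carries two recombinant gametes, Aabb (n10) carries one, and aabb (n00) carries none.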

Page 11: Computational Issues on Statistical Genetics

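Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

E-step

$c_1 = \dfrac{\tfrac12 r(1-r)}{\tfrac14(1-r)^2+\tfrac12 r(1-r)}$

$c_2 = \dfrac{\tfrac12 r(1-r)+r^2}{\tfrac12 r(1-r)+\tfrac12(1-r)^2+\tfrac12 r^2}$

$c_3 = \dfrac{2\cdot\tfrac14 r^2+\tfrac12 r(1-r)}{\tfrac14 r^2+\tfrac12 r(1-r)}$

M-step

$r = \left(c_1 n_{2\_} + c_2 n_{1\_} + c_3 n_{0\_} + 2 n_{20} + n_{10}\right)/(2n)$, where Aabb (count n10) contributes one recombinant gamete and aabb (n00) contributes none.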

Page 12: Computational Issues on Statistical Genetics

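Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

[Figure: screenshot of an input form with a 3 × 2 grid of genotype counts (rows AA, Aa, aa; columns B_, bb) and the labels n, Start, Input, Result, Reset, and r0.]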

Page 13: Computational Issues on Statistical Genetics

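Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

function r = rEstF2CoXdomin(n2_,n1_,n0_,n20,n10,n00)
% EM estimate of r when marker A is codominant and marker B is dominant.
n = n2_+n1_+n0_+n20+n10+n00;
r = 0.2; r1 = -1;            % starting value; r1 holds the previous iterate
while (abs(r1-r) > 1.e-8)
    r1 = r;
    % E-step: expected recombinant gametes in the pooled B_ classes
    c1 = (1/2*r*(1-r))/(1/4*(1-r)^2+1/2*r*(1-r));
    c2 = (1/2*r*(1-r)+r^2)/(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2);
    c3 = (2*1/4*r^2+1/2*r*(1-r))/(1/4*r^2+1/2*r*(1-r));
    % M-step: AAbb has 2 recombinant gametes, Aabb has 1, aabb has 0
    r = (c1*n2_+c2*n1_+c3*n0_+2*n20+n10)/(2*n);
end

MATLAB program to estimate the recombination fraction r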
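A quick consistency check on the E- and M-steps for this case: if the six cell counts equal their expected values at some r, one EM update should return the same r. The Python sketch below assumes the recombinant counts shown in the bb column of the table (2 for AAbb, 1 for Aabb, 0 for aabb); the function names, r = 0.15, and n = 100 are hypothetical choices for illustration.

```python
def e_step(r):
    """Expected recombinant gametes per individual in the pooled classes AAB_, AaB_, aaB_."""
    q = 1.0 - r
    c1 = (r * q / 2) / (q * q / 4 + r * q / 2)
    c2 = (r * q / 2 + r * r) / (r * q / 2 + q * q / 2 + r * r / 2)
    c3 = (r * r / 2 + r * q / 2) / (r * r / 4 + r * q / 2)
    return c1, c2, c3


def m_step(r, n2_, n1_, n0_, n20, n10, n00):
    """One EM update; the bb cells contribute 2, 1 and 0 recombinant gametes."""
    c1, c2, c3 = e_step(r)
    n = n2_ + n1_ + n0_ + n20 + n10 + n00
    return (c1 * n2_ + c2 * n1_ + c3 * n0_ + 2 * n20 + 1 * n10 + 0 * n00) / (2 * n)


def expected_cells(r, n):
    """Expected counts of the six observable classes at recombination fraction r."""
    q = 1.0 - r
    return dict(
        n2_=n * (q * q / 4 + r * q / 2),
        n1_=n * (r * q / 2 + q * q / 2 + r * r / 2),
        n0_=n * (r * r / 4 + r * q / 2),
        n20=n * r * r / 4,
        n10=n * r * q / 2,
        n00=n * q * q / 4,
    )


r_true = 0.15
r_next = m_step(r_true, **expected_cells(r_true, 100))
print(abs(r_next - r_true) < 1e-12)  # the update leaves r unchanged
```

Giving the weight 1 to n00 instead of n10 breaks this fixed-point property, which is a simple way to catch an index slip in the M-step.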

Page 14: Computational Issues on Statistical Genetics

Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

MATLAB program to calculate the log-likelihood ratio test score (LOD)

function LOD = calcLOD_F2CoXdomin(r,n2_,n1_,n0_,n20,n10,n00)
%% log likelihood under H1
LOD = log(1/4*(1-r)^2+1/2*r*(1-r))*n2_ ...
    +log(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2)*n1_ ...
    +log(1/4*r^2+1/2*r*(1-r))*n0_ ...
    +log(r^2/4)*n20+log(r*(1-r)/2)*n10+log((1-r)^2/4)*n00;
%% log likelihood under H0
r = 0.5;
LOD0 = log(1/4*(1-r)^2+1/2*r*(1-r))*n2_ ...
    +log(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2)*n1_ ...
    +log(1/4*r^2+1/2*r*(1-r))*n0_ ...
    +log(r^2/4)*n20+log(r*(1-r)/2)*n10+log((1-r)^2/4)*n00;
LOD = LOD-LOD0;
LOD = LOD/log(10);           % convert from natural log to log10

Page 15: Computational Issues on Statistical Genetics

Two Point Analysis in F2: Partial Informative Markers (dominant)

The starting point is again the 3 × 3 table for fully informative markers (Page 3). When both markers are dominant, AA and Aa are pooled into A_ and BB and Bb are pooled into B_, leaving four observable classes.

Page 16: Computational Issues on Statistical Genetics

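Two Point Analysis in F2: Partial Informative Markers (dominant)

             B_                                                            bb
A_  Obs      n1 = n22 + n21 + n12 + n11                                    n2 = n20 + n10
    Freq     $\tfrac14(1-r)^2+r(1-r)+\tfrac12(1-r)^2+\tfrac12 r^2$         $\tfrac14 r^2+\tfrac12 r(1-r)$
    Recom.   $c_1$                                                         $c_2$
aa  Obs      n3 = n02 + n01                                                n4 = n00
    Freq     $\tfrac14 r^2+\tfrac12 r(1-r)$                                $\tfrac14(1-r)^2$
    Recom.   $c_2$                                                         0

where $c_1$ and $c_2$ are the expected numbers of recombinant gametes:

$c_1 = \dfrac{r^2+r(1-r)}{\tfrac14(1-r)^2+r(1-r)+\tfrac12(1-r)^2+\tfrac12 r^2}$

$c_2 = \dfrac{2\cdot\tfrac14 r^2+\tfrac12 r(1-r)}{\tfrac14 r^2+\tfrac12 r(1-r)}$

Estimate of r: $r = (c_1 n_1 + c_2 n_2 + c_2 n_3)/(2n)$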

Page 17: Computational Issues on Statistical Genetics

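Two Point Analysis in F2: Partial Informative Markers (dominant)

[Figure: screenshot of an input form with a 2 × 2 grid of genotype counts (rows A_, aa; columns B_, bb) and the labels n, Start, Input, Result, Reset, and r0.]

$c_1 = \dfrac{r^2+r(1-r)}{\tfrac14(1-r)^2+r(1-r)+\tfrac12(1-r)^2+\tfrac12 r^2}$

$c_2 = \dfrac{2\cdot\tfrac14 r^2+\tfrac12 r(1-r)}{\tfrac14 r^2+\tfrac12 r(1-r)}$

Estimate of r: $r = (c_1 n_1 + c_2 n_2 + c_2 n_3)/(2n)$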

Page 18: Computational Issues on Statistical Genetics

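Two Point Analysis in F2: Partial Informative Markers (dominant)

function r = rEstF2Partial(n1,n2,n3,n4)
% EM estimate of r when both markers are dominant.
% n1 = A_B_, n2 = A_bb, n3 = aaB_, n4 = aabb
n = n1+n2+n3+n4;
r = 0.2; r1 = -1;            % starting value; r1 holds the previous iterate
while (abs(r1-r) > 1.e-8)
    r1 = r;
    % E-step: expected recombinant gametes in A_B_ (c1) and in A_bb, aaB_ (c2)
    c1 = (r^2+r*(1-r))/((1-r)^2/4+r*(1-r)+(1-r)^2/2+r^2/2);
    c2 = (r^2/2+r*(1-r)/2)/(r^2/4+r*(1-r)/2);
    % M-step: aabb contributes no recombinant gametes
    r = 1/(2*n)*(c1*n1+c2*n2+c2*n3);
end

MATLAB program to estimate the recombination fraction r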

Page 19: Computational Issues on Statistical Genetics

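Log-likelihood ratio test statistic: Partial Informative Markers (dominant)

Two alternative hypotheses: $H_0: r = 0.5$ vs. $H_1: r \neq 0.5$

Likelihood value under $H_1$:

$L_1(r \mid n_1,\ldots,n_4) = \dfrac{n!}{n_1!\cdots n_4!}\left[\tfrac34(1-r)^2+r(1-r)+\tfrac12 r^2\right]^{n_1}\left[\tfrac14 r^2+\tfrac12 r(1-r)\right]^{n_2+n_3}\left[\tfrac14(1-r)^2\right]^{n_4}$

Likelihood value under $H_0$:

$L_0(r=0.5 \mid n_1,\ldots,n_4) = \dfrac{n!}{n_1!\cdots n_4!}\left[\tfrac34(1-0.5)^2+0.5(1-0.5)+\tfrac12\, 0.5^2\right]^{n_1}\left[\tfrac14\, 0.5^2+\tfrac12\, 0.5(1-0.5)\right]^{n_2+n_3}\left[\tfrac14(1-0.5)^2\right]^{n_4}$

$\mathrm{LOD} = \log_{10}\left[L_1(r \mid n_1,\ldots,n_4) / L_0(r=0.5 \mid n_1,\ldots,n_4)\right] = 3.17 >$ critical LOD $= 3$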

Page 20: Computational Issues on Statistical Genetics

Two Point Analysis in F2: Partial Informative Markers (dominant)

function LOD = calcLOD_F2Partial(r,n1,n2,n3,n4)
% LOD score for H1 (r as given) against H0 (r = 0.5), both markers dominant.
%% log likelihood under H1
LOD = (n1)*log10((1-r)^2*3/4+r^2/2+r*(1-r))...
    +(n2+n3)*log10(r^2/4+r*(1-r)/2)...
    +(n4)*log10((1-r)^2/4);
%% log likelihood under H0
r = 0.5;
LOD0 = (n1)*log10((1-r)^2*3/4+r^2/2+r*(1-r))...
    +(n2+n3)*log10(r^2/4+r*(1-r)/2)...
    +(n4)*log10((1-r)^2/4);
LOD = LOD-LOD0;

MATLAB program to calculate the log-likelihood ratio test score (LOD)

Page 21: Computational Issues on Statistical Genetics

Three Point Analysis in Backcross: a rice data set

Page 22: Computational Issues on Statistical Genetics

[Figure: linkage map of four rice chromosomes (labeled chrom1, chrom2, chrom3, chrom4). Each chromosome is drawn with its marker names (for example RG472, RG246, K5, U10, RG532) and the map distances between adjacent markers.]

Page 23: Computational Issues on Statistical Genetics

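Three Point Analysis in Backcross

Summarize the data as follows. The last two columns mark whether the genotype is recombinant between A and B and between B and C (1 = yes, 0 = no).

Code (A,B,C)   Genotype   Obs.    A & B   B & C
111            abc        nabc    0       0
112            abC        nabC    0       1
121            aBc        naBc    1       1
122            aBC        naBC    1       0
211            Abc        nAbc    1       0
212            AbC        nAbC    1       1
221            ABc        nABc    0       1
222            ABC        nABC    0       0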

Page 24: Computational Issues on Statistical Genetics

Rice Data

Code (A,B,C)   Genotype   Obs.         A & B   B & C
111            abc        nabc = 31    0       0
112            abC        nabC = 10    0       1
121            aBc        naBc = 1     1       1
122            aBC        naBC = 11    1       0
211            Abc        nAbc = 5     1       0
212            AbC        nAbC = 2     1       1
221            ABc        nABc = 2     0       1
222            ABC        nABC = 38    0       0

Marker RG472 is denoted by A, RG246 by B, and K5 by C.

Page 25: Computational Issues on Statistical Genetics

Multilocus likelihood: determination of a most likely gene order

• Consider three markers A, B, C, with no particular order assumed.
• A triply heterozygous F1 ABC/abc is backcrossed to a pure parent abc/abc.

Genotype        ABC or abc                  ABc or abC               Abc or aBC               AbC or aBc
Obs.            n00 = 69                    n01 = 12                 n10 = 16                 n11 = 3
Frequency under
  Order A-B-C   $(1-r_{AB})(1-r_{BC})$      $(1-r_{AB})\,r_{BC}$     $r_{AB}(1-r_{BC})$       $r_{AB}\,r_{BC}$
  Order A-C-B   $(1-r_{AC})(1-r_{BC})$      $r_{AC}\,r_{BC}$         $r_{AC}(1-r_{BC})$       $(1-r_{AC})\,r_{BC}$
  Order B-A-C   $(1-r_{AB})(1-r_{AC})$      $(1-r_{AB})\,r_{AC}$     $r_{AB}\,r_{AC}$         $r_{AB}(1-r_{AC})$

$r_{AB}$ = the recombination fraction between A and B $= (n_{10}+n_{11})/n = 0.19$
$r_{BC}$ = the recombination fraction between B and C $= (n_{01}+n_{11})/n = 0.15$
$r_{AC}$ = the recombination fraction between A and C $= (n_{01}+n_{10})/n = 0.28$
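The three pairwise estimates can be reproduced directly from the eight genotype counts in the Rice Data table. The Python sketch below is illustrative only; the dictionary `obs` and the helpers `recombinant` and `rf` are names chosen here, and a genotype is treated as recombinant between two loci when one locus carries an upper-case allele and the other a lower-case allele.

```python
# Rice backcross counts, keyed by the gamete genotype at loci (A, B, C).
obs = {"abc": 31, "abC": 10, "aBc": 1, "aBC": 11,
       "Abc": 5, "AbC": 2, "ABc": 2, "ABC": 38}


def recombinant(genotype, i, j):
    """True when loci i and j carry alleles from different parental haplotypes."""
    return genotype[i].isupper() != genotype[j].isupper()


n = sum(obs.values())


def rf(i, j):
    """Fraction of individuals that are recombinant between loci i and j."""
    return sum(count for g, count in obs.items() if recombinant(g, i, j)) / n


r_ab, r_bc, r_ac = rf(0, 1), rf(1, 2), rf(0, 2)
print(n, r_ab, r_bc, r_ac)  # 100 0.19 0.15 0.28
```

Pooling the complementary genotypes gives the four classes used on this slide: n00 = 31 + 38 = 69, n01 = 10 + 2 = 12, n10 = 11 + 5 = 16, and n11 = 1 + 2 = 3.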

Page 26: Computational Issues on Statistical Genetics

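Which order is the most likely?

$L_{ABC} \propto (1-r_{AB})^{n_{00}+n_{01}}\,(1-r_{BC})^{n_{00}+n_{10}}\,(r_{AB})^{n_{10}+n_{11}}\,(r_{BC})^{n_{01}+n_{11}}$

$L_{ACB} \propto (1-r_{AC})^{n_{00}+n_{11}}\,(1-r_{BC})^{n_{00}+n_{10}}\,(r_{AC})^{n_{01}+n_{10}}\,(r_{BC})^{n_{01}+n_{11}}$

$L_{BAC} \propto (1-r_{AB})^{n_{00}+n_{01}}\,(1-r_{AC})^{n_{00}+n_{11}}\,(r_{AB})^{n_{10}+n_{11}}\,(r_{AC})^{n_{01}+n_{10}}$

$\log L_{ABC} = -90.8932$
$\log L_{ACB} = -101.5662$
$\log L_{BAC} = -107.9176$

According to the maximum likelihood principle, the linkage order that gives the maximum likelihood for a data set is the order best supported by the data.

The best linkage order is A-B-C, with map distances of 20 cM between A and B and 15 cM between B and C.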
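The comparison of the three orders can be checked numerically. The Python sketch below evaluates the natural logarithm of each expression above at the estimated recombination fractions; the helper name `loglik` and the dictionary `orders` are illustrative choices, not part of the original slides.

```python
from math import log

# Class counts from the slide: no recombination, B-C only, A-B only, both.
n00, n01, n10, n11 = 69, 12, 16, 3
n = n00 + n01 + n10 + n11

r_ab = (n10 + n11) / n
r_bc = (n01 + n11) / n
r_ac = (n01 + n10) / n


def loglik(r1, rec1, r2, rec2):
    """Log-likelihood kernel for two adjacent intervals with rec1 and rec2 recombinants."""
    return (rec1 * log(r1) + (n - rec1) * log(1 - r1)
            + rec2 * log(r2) + (n - rec2) * log(1 - r2))


orders = {
    "A-B-C": loglik(r_ab, n10 + n11, r_bc, n01 + n11),
    "A-C-B": loglik(r_ac, n01 + n10, r_bc, n01 + n11),
    "B-A-C": loglik(r_ab, n10 + n11, r_ac, n01 + n10),
}

for name, value in orders.items():
    print(name, round(value, 4))
best = max(orders, key=orders.get)
print("best order:", best)
```

Each order uses the two recombination fractions of its adjacent intervals, so the order with the largest value places the two smallest fractions next to each other.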

Page 27: Computational Issues on Statistical Genetics

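DATA

Genotype   ABC or abc   ABc or abC   Abc or aBC   AbC or aBc
Obs.       n00 = 69     n01 = 12     n10 = 16     n11 = 3

Result:

$r_{AB} = (n_{10}+n_{11})/n = 0.19$

$r_{BC} = (n_{01}+n_{11})/n = 0.15$

$r_{AC} = (n_{01}+n_{10})/n = 0.28$

$d_{AB} = 100 \times \tfrac14 \ln\left[(1+2r_{AB})/(1-2r_{AB})\right] \approx 20$ cM

$d_{BC} = 100 \times \tfrac14 \ln\left[(1+2r_{BC})/(1-2r_{BC})\right] \approx 15$ cM

$\log L_{ABC} = -90.8932$

$\log L_{ACB} = -101.5662$

$\log L_{BAC} = -107.9176$

The best linkage order is A-B-C, with map distances of 20 cM between A and B and 15 cM between B and C.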
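The map distances in the result come from the formula d = ¼ ln[(1 + 2r)/(1 − 2r)], which is the Kosambi mapping function and gives the distance in Morgans; multiplying by 100 converts it to centimorgans. A minimal Python sketch, with the function name `kosambi_cm` chosen here for illustration:

```python
from math import log


def kosambi_cm(r):
    """Kosambi map distance in centimorgans for a recombination fraction 0 <= r < 0.5."""
    return 100.0 * 0.25 * log((1 + 2 * r) / (1 - 2 * r))


d_ab = kosambi_cm(0.19)
d_bc = kosambi_cm(0.15)
print(round(d_ab, 1), round(d_bc, 1))  # 20.0 15.5
```

The second value is about 15.5 cM before rounding, which the slide reports as 15 cM.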