
recomb05 poster ent phasing


Poster Session A, Bay 30

Haplotype Inference by Entropy Minimization

Ion Mandoiu and Bogdan Pasaniuc, CSE Department, University of Connecticut

Introduction

• A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the possible four nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals, and mapping SNPs in the human population has become the next high priority in genomics after the completion of the Human Genome Project.

• In diploid organisms such as humans, there are two non-identical copies of each chromosome. A description of the SNPs in each chromosome is called a haplotype, which can be viewed as a 0/1 vector, e.g., by representing the most frequent (dominant) SNP allele as a 0 and the alternate (minor) allele as a 1.

• At present, it is prohibitively expensive to directly determine the haplotypes of an individual, but it is possible to obtain rather easily the conflated SNP information in the so-called genotype. A genotype can be conveniently represented as a 0/1/2 vector, where 0 (1) means that both chromosomes contain the dominant (respectively minor) allele, and 2 means that the two chromosomes contain different alleles.

Example: the genotype gcc{AT}ac{TG} can be explained by the haplotype pair gccTacG / gccAacT or by the pair gccTacT / gccAacG; the genotype alone does not say which.
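The 0/1/2 encoding described above can be illustrated with a minimal Python sketch. The function name `genotype_of` and the choice of which allele is coded as 0 at each SNP are assumptions made for this illustration; the poster does not fix them.

```python
def genotype_of(h1, h2):
    """Conflate two 0/1 haplotypes into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (0 or 1); different alleles become 2.
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


# The two SNPs of gcc{AT}ac{TG}, restricted to the variable positions.
# Assumed coding for this example only: T=0, A=1 at the first SNP and
# G=0, T=1 at the second SNP.
pair_a = ((0, 0), (1, 1))  # gccTacG, gccAacT
pair_b = ((0, 1), (1, 0))  # gccTacT, gccAacG

# Both pairs conflate to the same genotype, which is why phasing is ambiguous.
print(genotype_of(*pair_a))
print(genotype_of(*pair_b))
```

Both calls yield the genotype with a 2 at each SNP, so the genotype cannot distinguish between the two haplotype pairs.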

Problem Definition

A pair of haplotypes $(h, h')$ explains $g$ if $h(i) = h'(i) = g(i)$ whenever $g(i)$ is 0 or 1, and $h(i) \neq h'(i)$ whenever $g(i) = 2$.

A phasing of a set of genotypes $G \subseteq \{0,1,2\}^k$ is a function $f : G \to \{0,1\}^k \times \{0,1\}^k$ such that, for every $g \in G$, $f(g)$ is a pair of haplotypes that explains $g$.

Entropy of a phasing:

$$\mathrm{Entropy}(f) = -\sum_{h \,:\, \mathrm{cov}(h,f) \neq 0} \frac{\mathrm{cov}(h,f)}{2|G|} \log\left(\frac{\mathrm{cov}(h,f)}{2|G|}\right)$$

where $\mathrm{cov}(h,f)$ is the number of genotypes $g$ from $G$ such that $f(g) = (h, h')$ or $f(g) = (h', h)$, plus twice the number of genotypes $g$ such that $f(g) = (h, h)$.

Minimum Entropy Population Phasing: given a set of genotypes, find a phasing with minimum entropy.
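As a rough illustration of these definitions, the sketch below checks whether a haplotype pair explains a genotype and computes the entropy of a phasing. It assumes a phasing is stored as a list of haplotype pairs, one per genotype, and uses the natural logarithm; the poster does not state the log base, and the function names are hypothetical.

```python
import math
from collections import Counter


def explains(h1, h2, g):
    """True if the haplotype pair (h1, h2) explains the 0/1/2 genotype g."""
    return all(
        (a == b == x) if x in (0, 1) else (a != b)
        for a, b, x in zip(h1, h2, g)
    )


def entropy(pairs):
    """Entropy of a phasing given as a list of (h, h') pairs.

    Counting every haplotype slot once gives cov(h, f): a pair (h, h)
    contributes 2 to h, a pair (h, h') contributes 1 to each haplotype.
    """
    cov = Counter(h for pair in pairs for h in pair)
    total = 2 * len(pairs)  # this is 2|G|
    return -sum(c / total * math.log(c / total) for c in cov.values())


# Three genotypes: (0,0), (1,1) and the ambiguous (2,2).
phasing_a = [((0, 0), (0, 0)), ((1, 1), (1, 1)), ((0, 0), (1, 1))]
phasing_b = [((0, 0), (0, 0)), ((1, 1), (1, 1)), ((0, 1), (1, 0))]

# Reusing haplotypes that already occur in the population lowers the entropy.
print(entropy(phasing_a) < entropy(phasing_b))
```

In this toy population, phasing the ambiguous genotype with the two haplotypes already present uses only two distinct haplotypes, so it has lower entropy than the alternative, which introduces two new ones.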

Previous Approaches

Approaches to phasing:

• Maximum Likelihood
  • PHASE [Stephens et al. 01] - repeatedly chooses a genotype at random, and estimates that individual’s haplotypes under the assumption that all other haplotypes are correctly reconstructed
  • GERBIL [Kimmel&Shamir 05] - expectation maximization for genotype resolution and block partitioning
• Perfect Phylogeny
  • Set of haplotypes used in the phasing must be consistent with a perfect phylogeny [Gusfield 02]
• Pure Parsimony
  • Minimizing the number of distinct haplotypes
  • Integer Linear Program formulations: exponential size [Gusfield 04], polynomial size [Brown&Harrower 05]
• Entropy
  • Minimize the entropy of the phasing
  • [Halperin&Karp 04] - simple greedy approximation algorithm

Phasing Short Genotypes

Local optimization algorithm for entropy minimization

1. Create a random phasing f
2. Repeat forever
     Find the pair (g, (h, h’)) that minimizes entropy(f’), where f’ is obtained from f by re-explaining g with (h, h’)
     If entropy(f’) < entropy(f), update f (change the current explanation for g to (h, h’))
     Else exit loop
3. Output cover
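The local optimization steps above can be sketched in Python as follows. This is an illustrative sketch and not the authors' implementation: the helper names are hypothetical, candidate pairs are enumerated by brute force (exponential in the number of heterozygous sites, so only suitable for short genotypes), and the entropy is naively recomputed from scratch for every candidate move.

```python
import itertools
import math
import random
from collections import Counter


def explaining_pairs(g):
    """All unordered haplotype pairs that explain the 0/1/2 genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]
    pairs = []
    # Fixing the first heterozygous site to 0 in h1 lists each pair once.
    for bits in itertools.product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def entropy(pairs):
    """Entropy of a phasing stored as one haplotype pair per genotype."""
    cov = Counter(h for pair in pairs for h in pair)
    total = 2 * len(pairs)
    return -sum(c / total * math.log(c / total) for c in cov.values())


def local_search(genotypes, seed=0):
    """Greedy local improvement; returns (phasing, entropy)."""
    rng = random.Random(seed)
    options = [explaining_pairs(g) for g in genotypes]
    phasing = [rng.choice(opts) for opts in options]  # step 1: random phasing
    current = entropy(phasing)
    while True:  # step 2
        best, best_move = current, None
        for i, opts in enumerate(options):
            old = phasing[i]
            for pair in opts:  # try re-explaining genotype i with this pair
                phasing[i] = pair
                e = entropy(phasing)
                if e < best - 1e-12:  # require a strict improvement
                    best, best_move = e, (i, pair)
            phasing[i] = old
        if best_move is None:
            return phasing, current  # step 3: no improving move is left
        i, pair = best_move
        phasing[i] = pair
        current = best


result, value = local_search([(0, 0), (1, 1), (2, 2)])
print(result, value)
```

Because every accepted move strictly lowers the entropy and the number of phasings is finite, the loop terminates. A practical implementation would update the haplotype counts incrementally instead of recomputing the entropy for each candidate.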

Switching Error (%)

                                           ENTROPY_PHASE
Dataset              #Gen   RAND    k=1    k=3    k=5    k=7    k=9  GERBIL  PHASE
Daly                  129   43.5   15.9    5.1    3.7    4.2    4.6     2.7    2.6
Gabriel                29   35.4   25.1   17.3   17.9   17.6   18.4    11.1    9.8
5q31-wafr (89 snp)     50   47.5   25.8   15.7   16.1   15.6   17.2      12    2.8
                      100   48.2   24.9   12.9     12     11    9.1     9.9    0.6
                      200   48.1   24.7   12.1    9.2    6.6    5.3    11.5    0.1
                      400   48.5   24.7   11.7    8.1    5.2    3.6    11.5      0
                      600   48.5   24.6   11.2    7.2    4.3    2.7    11.4      0
                      800   48.5   24.8   10.8    7.5    4.0    3.0    11.3     --
5q31-euro (99 snp)     50   47.7   16.3    6.8    5.9      6    6.5     4.3    1.7
                      100   47.8   15.8    6.5    4.3    4.3    4.7     4.4    0.6
                      200   48.2   15.8    6.1      4    3.1    3.1     4.5    0.2
                      400   48.9   15.8    5.6    3.3    2.6    2.2     4.1      0
                      600     49   15.9    5.6    3.2    2.3      2     4.1      0
                      800   48.8   16.1    5.5    3.0    2.3    1.7     4.1     --

Experimental Setup

Datasets
• [Daly 2001] 129 family trios over a region of 103 SNPs
• [Gabriel 2002] 60 blocks with an average of 50 SNPs genotyped for 29 individuals
• [Forton et al 2004] Simulated populations generated as follows
  - 32 European and 32 West African family trios were genotyped at the IL8 and 5q31 regions [Hull et al. 2000]
  - Population haplotypes and their frequencies were inferred using Phamily and PHASE
  - Based on these haplotype frequencies, 100,000 random genotypes were generated, from which we selected populations of size between 50 and 800

[Figure: 5q31-wafr switch error. Error rate (0 to 30) plotted against #gen (50, 100, 200, 400, 600, 800) for win=1, win=3, win=5, win=7, win=9, GERBIL, and PHASE.]

Switch error rate

Given the true haplotypes (t, t’) and the inferred ones (h, h’), the switch error rate is the number of times we have to switch from reading h to h’ to obtain t, divided by the number of ambiguous positions.
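The switch error rate defined above can be computed with a short Python sketch. It assumes that the inferred pair explains the same genotype as the true pair, so both pairs are heterozygous at exactly the same (ambiguous) positions; the function name is hypothetical. Following the poster's wording, the count is divided by the number of ambiguous positions.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Switches needed to read t out of (h, h'), per ambiguous position."""
    t, t2 = true_pair
    h, h2 = inferred_pair
    # Ambiguous positions are those where the two true haplotypes differ.
    het = [i for i in range(len(t)) if t[i] != t2[i]]
    if not het:
        return 0.0
    if any(h[i] == h2[i] for i in het):
        raise ValueError("inferred pair does not explain the same genotype")
    # source[j] is 0 if t matches h at the j-th ambiguous site, 1 if it matches h'.
    source = [0 if h[i] == t[i] else 1 for i in het]
    switches = sum(a != b for a, b in zip(source, source[1:]))
    return switches / len(het)


true_pair = ((0, 0, 0, 0), (1, 1, 1, 1))
inferred = ((0, 0, 1, 1), (1, 1, 0, 0))
# One switch (between the 2nd and 3rd site) over four ambiguous positions.
print(switch_error_rate(true_pair, inferred))
```

Homozygous positions never contribute, and swapping the order of the inferred haplotypes does not change the result, since only changes of source between consecutive ambiguous positions are counted.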

IL8 datasets: Switching Error (%)

                                           ENTROPY_PHASE
Dataset              #Gen   RAND    k=1    k=3    k=5    k=7    k=9  GERBIL  PHASE
IL8-euro (55 snp)      50   44.9    4.9    3.7    3.1    2.8      3     2.9    1.7
                      100   46.2      5    3.8    3.2    2.7      3     2.8    1.2
                      200     47    4.7    3.1    2.8    2.2    2.3     2.6    0.5
                      400   47.6    4.8    3.2    2.7    2.3    2.2     2.7    0.4
                      600   47.6    4.7    3.1    2.7    2.1      2     2.6    0.4
                      800   47.8    4.8      3    2.7    2.1    2.1     2.6     --
IL8-wafr (52 snp)      50   45.1   17.2   12.5   11.6   11.6   10.9     9.6    4.9
                      100   45.9   16.5   12.2     11    9.7   10.1     9.7    2.5
                      200   46.5   16.6   11.4    9.5    8.2    7.6     8.8    0.7
                      400   46.9   16.5   10.9      9    7.5    6.4     8.7    0.2
                      600     47   16.2   10.9    8.8    6.9    5.6     8.5    0.1
                      800   47.1   16.3   10.7    8.3    6.6    5.2     8.4     --

Conclusions

• Entropy minimization gives a unified framework for various phasing problem variants, including phasing genotypes with missing data and pedigree-constrained phasing

• Preliminary results show that entropy minimization is competitive with existing methods in haplotype reconstruction accuracy, particularly for large populations

• Currently, we are implementing trio-based entropy phasing and are exploring other strategies for phasing long genotypes

References

• V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: a direct approach. Technical Report, UC Davis, 2002.
• D. Brown and I. Harrower. A new integer programming formulation for the pure parsimony problem in haplotype analysis. Proceedings of WABI 2004, 254-265.
• Daly et al. High resolution haplotype structure in the human genome. Nature Genetics 29:229-232, 2001.
• Gabriel et al. The structure of haplotype blocks in the human genome. Science 296:2225-2229, 2002.
• E. Halperin and R. Karp. The Minimum-Entropy Set Cover Problem. International Colloquium on Automata, Languages and Programming, 2004.
• J. Hull et al. Association of respiratory syncytial virus bronchiolitis with the interleukin 8 gene region in UK families. Thorax 55:1023-1027, 2000.
• M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68:978-989, 2001.

Extensions

Long Genotypes
• Divide the genotypes into windows of size k
• Run the previous algorithm for windows of size 2*k by fixing the first k SNPs

Handling Missing Data
• Any value is correct for a SNP with missing data; the result is more pairs of haplotypes that can explain a genotype
• The local improvement algorithm remains the same

Trios
• A family trio: two parents and a child
• One haplotype from mother, one from father
• At each step we re-explain a whole family
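For the missing-data extension, the only change is in which haplotype pairs count as explaining a genotype. The sketch below enumerates those pairs by brute force; encoding a missing SNP as `None` and the names `MISSING`, `SITE_OPTIONS`, and `explaining_pairs` are assumptions made for this illustration, not part of the poster.

```python
import itertools

MISSING = None  # assumed encoding for a SNP with missing data

# Allele combinations (allele in h, allele in h') allowed at one site.
SITE_OPTIONS = {
    0: [(0, 0)],
    1: [(1, 1)],
    2: [(0, 1), (1, 0)],
    MISSING: [(0, 0), (0, 1), (1, 0), (1, 1)],  # any value is correct
}


def explaining_pairs(g):
    """All unordered haplotype pairs explaining g, allowing missing sites."""
    pairs = set()
    for combo in itertools.product(*(SITE_OPTIONS[x] for x in g)):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))  # (h, h') and (h', h) are the same pair
    return sorted(pairs)


# A missing site enlarges the candidate set compared with a typed site.
print(len(explaining_pairs((2, 2))))
print(len(explaining_pairs((2, MISSING))))
```

Any local improvement routine that picks among the candidate pairs of each genotype can use this enumeration unchanged, which matches the statement that the algorithm itself remains the same.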