Three Weeks of Experience at the formatics Institute

Christian Arnold, Bioinformatics Group, University of Leipzig

Bioinformatics Herbstseminar, October 23rd, 2009



Christian Arnold

Bioinformatics Group, University of Leipzig

Bioinformatics Herbstseminar

October 23rd, 2009

Three Weeks of Experience at the formatics Institute


Content

1. The 10kTrees Project

2. Phylogenetic Targeting

3. Acknowledgements


1. The 10kTrees Project


Goals

• Updated primate phylogeny that includes phylogenetic uncertainty

  – Use the newest available sequence data, include as many primate species as possible, and update regularly

  – Use Bayesian methods to produce a set of >=10,000 primate-wide trees (with branch lengths) that are appropriate for taxonomically broad comparative research on primate behavior, ecology and morphology

• Make it accessible to other researchers


Methodology

1. Download sequences from GenBank

2. Select the longest available sequence for each gene in each species

3. Create individual fasta file with all available sequences for each gene

4. Create availability matrix

5. Identify species with non-overlapping genes

6. Create MSA using Muscle

7. Improve alignment quality using GBLOCKS

8. Identify best substitution model for each gene

9. Concatenate sequences and create partitioned dataset in MrBayes format

10. Run MrBayes

11. Evaluate MrBayes analysis and calculate consensus tree

12. Update website
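Steps 2 to 4 of the pipeline can be illustrated with a minimal Python sketch. It is not the project's actual code: the record layout, the species and gene names, and the function names `longest_per_gene`, `fasta_for_gene`, `availability_matrix` and `missing_fraction` are all assumptions made for illustration.

```python
# Hypothetical records (species, gene, sequence); real input would be
# sequences downloaded from GenBank.
records = [
    ("Pan_troglodytes", "CYTB", "ATGACCC"),
    ("Pan_troglodytes", "CYTB", "ATGACCCCGA"),
    ("Pan_troglodytes", "COI", "ATGTTC"),
    ("Homo_sapiens", "CYTB", "ATGACCCCAATA"),
]

def longest_per_gene(records):
    """Step 2: keep the longest sequence for each (species, gene) pair."""
    best = {}
    for species, gene, seq in records:
        key = (species, gene)
        if key not in best or len(seq) > len(best[key]):
            best[key] = seq
    return best

def fasta_for_gene(best, gene):
    """Step 3: FASTA text with all selected sequences of one gene."""
    lines = []
    for (species, g), seq in sorted(best.items()):
        if g == gene:
            lines.append(f">{species}\n{seq}")
    return "\n".join(lines)

def availability_matrix(best, species, genes):
    """Step 4: 1 if a sequence exists for (species, gene), else 0."""
    return {sp: [int((sp, g) in best) for g in genes] for sp in species}

def missing_fraction(matrix):
    """Fraction of empty cells in the availability matrix."""
    cells = [v for row in matrix.values() for v in row]
    return 1 - sum(cells) / len(cells)

best = longest_per_gene(records)
matrix = availability_matrix(
    best, ["Homo_sapiens", "Pan_troglodytes"], ["COI", "CYTB"]
)
```

In this toy input one of four cells is empty, so `missing_fraction(matrix)` is 0.25. The same ratio, computed over all species and genes, gives the "missing data" percentage of a sparse data matrix.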


Version 1 vs. Version 2

Property | Version 1 | Version 2
Species | 187 | 231
Genes | 4 mitochondrial (COI, COII, CYTB and ND1) and 1 nuclear gene (SRY) | 6 mitochondrial (12S rRNA, 16S rRNA, COI, COII, CYTB, cluster of other mitochondrial genes) and 3 nuclear genes (SRY, CCR5, MC1R)
Genetic loci | 2 | 4
Total no. of sites | 5134 | ~9000
Collected sequences | 413 out of 935 total (55.8% missing data) | 1007 out of 2079 total (51.6% missing data)
No. of constraints | 29 | 1
Generations | 8 million | 60 million
Computing time | ~48 days (16 processors in parallel, ~3 days each) | ~2 years (32 processors in parallel, ~3 weeks each)
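The totals in the comparison can be checked with a few lines of Python. This is only an arithmetic check on the figures above. It assumes that the total number of possible sequences is species times genes and that the computing time is the summed time over all processors. The function names are made up for this sketch.

```python
def missing_pct(collected, species, genes):
    """Return (total possible sequences, percent missing)."""
    total = species * genes
    return total, round(100 * (1 - collected / total), 1)

def summed_days(processors, days_each):
    """Total computing time in days, summed over all processors."""
    return processors * days_each

v1 = missing_pct(413, 187, 5)    # Version 1: 187 species, 5 genes
v2 = missing_pct(1007, 231, 9)   # Version 2: 231 species, 9 genes
v1_days = summed_days(16, 3)     # 16 processors, ~3 days each
v2_days = summed_days(32, 21)    # 32 processors, ~3 weeks each
```

Under these assumptions `v1` is `(935, 55.8)` and `v2` is `(2079, 51.6)`, matching the figures above. `v1_days` is 48, and `v2_days` is 672 days (about 1.8 years), which is consistent with the "~2 years" given for Version 2.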


Preliminary consensus tree

Green: Cercopithecines

Blue: Hominoids

Red: Platyrrhines

Yellow: Tarsiers

Brown: Strepsirrhines

Rooted with Galeopterus variegatus


The 10kTrees Website

http://10ktrees.fas.harvard.edu/


Current Progress

• Submitted to Evolutionary Anthropology, in press.

• Will be presented at the AAPA conference (April 2010) in Albuquerque, New Mexico

• Version 2 is almost finished

• Available at http://10kTrees.fas.harvard.edu


Summary

• Bayesian approach is time-consuming, but works well, even though data matrix is very sparse

• The increased number of sequences in Version 2 dramatically reduces the need for constraints and improves the quality of the tree and branch length estimates

• Ongoing project

• Total number of downloaded trees since June 2009: 95800


2. Phylogenetic Targeting


Which species should we study?


Goals

For which species should we collect data in order to increase the size of comparative data sets?


Example 1/2

• Hypothesis: Two characters (x and y) show correlated evolution

• Goal: Test this hypothesis comparatively (e.g. by using phylogenetically independent contrasts and correlation tests)

• Problem 1: Data has only been collected for x, but not for y

• Solution 1: Collect data for y and test hypothesis

• Problem 2: From which species should we collect data for y?

• Solution 2: Phylogenetic targeting!?
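As a rough illustration of the comparative test named in the goal above, the sketch below correlates pairwise differences in x and y across phylogenetically independent species pairs. It is a simplification, not the method used in the talk: full independent contrasts also standardize by branch lengths, which is omitted here. The species names, the trait values and the function name `contrast_correlation` are hypothetical.

```python
import math

def contrast_correlation(pairs, x, y):
    """Correlation through the origin of pairwise differences in x and y.

    Assumes the pairs are phylogenetically independent and that the
    differences in x and in y are not all zero.
    """
    dx = [x[a] - x[b] for a, b in pairs]
    dy = [y[a] - y[b] for a, b in pairs]
    num = sum(p * q for p, q in zip(dx, dy))
    den = math.sqrt(sum(p * p for p in dx) * sum(q * q for q in dy))
    return num / den

# Hypothetical trait values for three independent species pairs.
pairs = [("s1", "s2"), ("s4", "s5"), ("s6", "s7")]
x = {"s1": 9, "s2": 2, "s4": 10, "s5": 3, "s6": 5, "s7": 4}
y = {"s1": 7, "s2": 3, "s4": 8, "s5": 2, "s6": 4, "s7": 5}

r = contrast_correlation(pairs, x, y)
```

A value of `r` near 1 or -1 would suggest correlated evolution of x and y. The test can only be run for species in which both characters have been measured, which is why the choice of species for new data collection matters.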


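Example 2/2

[Figure: phylogeny of five species next to a table of brain size and cognitive data. Tip labels as extracted: s3, s1, s4, s5, s2. Table rows as extracted:]

Brain size | Cognitive data
4 | ?
9 | 7
10 | ?
3 | ?
2 | ?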

Collecting new data is time-consuming and expensive…


Methods

• Systematically generate all possible pairwise comparisons

• For every pairwise comparison, calculate character differences for the two species that form the pair and assign a score

• Determine set of phylogenetically independent pairs that maximizes the sum of all selected pair scores (maximal pairing)

[Figure: example tree with seven species, s1 to s7]
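The three steps above can be sketched as a brute-force search on a small tree. This is not the algorithm from the talk, which uses dynamic programming. The sketch makes three assumptions: two pairs count as phylogenetically independent when the tree paths connecting them share no branch, the pair score is the absolute difference in one character, and the tree and character values are invented for illustration.

```python
from itertools import combinations

# Hypothetical rooted tree as child -> parent, with made-up character values.
PARENT = {"s1": "a", "s2": "a", "a": "b", "s3": "b",
          "s4": "c", "s5": "c", "b": "root", "c": "root"}
X = {"s1": 9, "s2": 2, "s3": 4, "s4": 10, "s5": 3}

def path_to_root(node):
    """Branches (child, parent) from a node up to the root."""
    edges = []
    while node in PARENT:
        edges.append((node, PARENT[node]))
        node = PARENT[node]
    return edges

def path_edges(a, b):
    """Branches on the path between a and b (shared ancestry cancels out)."""
    return frozenset(path_to_root(a)) ^ frozenset(path_to_root(b))

def maximal_pairing(species, score):
    """Exhaustively find branch-disjoint pairs with the maximal score sum."""
    pairs = [(score(a, b), a, b, path_edges(a, b))
             for a, b in combinations(sorted(species), 2)]
    best = (0, [])

    def search(start, used, total, chosen):
        nonlocal best
        if total > best[0]:
            best = (total, list(chosen))
        for j in range(start, len(pairs)):
            s, a, b, edges = pairs[j]
            if not (edges & used):  # pair is independent of those chosen
                chosen.append((a, b))
                search(j + 1, used | edges, total + s, chosen)
                chosen.pop()

    search(0, frozenset(), 0, [])
    return best

total, chosen = maximal_pairing(X, lambda a, b: abs(X[a] - X[b]))
```

For this toy tree the search selects the pairs (s1, s2) and (s4, s5) with a score sum of 14. The single highest-scoring pair, (s2, s4) with a score of 8, is not selected because its path blocks every other pair. Exhaustive search is exponential in the number of pairs, so it is only practical for very small trees.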


Maximal pairing: Example


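Decomposition of the maximal pairing

Time complexity: $O(n^3)$; for balanced trees: $O(n^2 \log^2 n)$

[Equation garbled in extraction: a recursive decomposition of the maximal pairing score $S_T$ of a tree $T$, with maxima taken over $\mathrm{desc}(T)$ and over $\mathrm{subtrees}(R)$ of scores $S_R$.]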


Simulation results 1/2

• Random (Rnd) selection of species

  – Type 1 errors close to nominal level

  – Power: ~40%, independent of number of taxa

  – Uses 67% of available variation

• Phylogenetic targeting (PT) induced selection of species

  – Type 1 errors close to nominal level

  – Power: 67-81%, increases with number of taxa

  – Uses 89% of available variation

Detecting correlated character evolution, based on selection of 12 species


Simulation results 2/2

[Figure: fraction of available variation after sampling (y-axis) against number of selected species (12, 18, 24; x-axis), comparing phylogenetic targeting (PT) with random selection (Rnd)]


Current Progress

• A revised version will be resubmitted to American Naturalist in the not too distant future

• TODO: Extend simulations and clarify some issues

• Available at http://phylotargeting.fas.harvard.edu


Summary

• A focused selection of species can save valuable time and money

• Phylogenetic targeting provides a very flexible approach and can address different questions in the context of limited resources

• Dynamic programming algorithms are everywhere


3. Acknowledgements


• Harvard University

• Max-Planck Institute for Evolutionary Anthropology

• University of Leipzig

• Charlie Nunn

• Luke Matthews

• Peter F. Stadler

Thanks!


Thank you for your attention!

Questions?

If not: Cheers (it’s early, but not too early…)

Any Questions?