Tutorial: Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data

Jeff R. Powell

Tutorial accompanying the paper of the same name, published in Methods in Ecology and Evolution. Full paper: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00122.x/abstract

DNA-based taxonomic approaches and biodiversity estimation

• The current biodiversity crisis has led some to advocate a primary role for high-throughput DNA sequencing technologies in taxonomic research

• Therefore, a major imperative in bioinformatics is the development of theoretical and practical approaches for generating biodiversity estimates from DNA sequences

• This is especially true for taxa that are relatively understudied from a taxonomic perspective (particularly microorganisms and cryptic taxa)

• A promising approach utilizes a mixed model that differentiates speciation events from population coalescent events based on timing of divergences within a taxon

Estimating species boundaries from environmental DNA sequences using the General Mixed-Yule Coalescent (GMYC) model

Current GMYC approach:

Fit models predicting inter- and intra-specific divergence rates and threshold times differentiating these processes to multispecies coalescent trees; models contain single (Pons et al. 2006 Syst Biol) or multiple (Monaghan et al. 2009 Syst Biol) thresholds

Step 1: compare maximum likelihood (ML) single-threshold model to the null hypothesis of a single coalescent population (Fontaneto et al. 2007 PLoS Biol)

Step 2: compare ML multiple-threshold model to ML single-threshold model to determine if increased number of parameters significantly enhances model fit (Monaghan et al. 2009 Syst Biol)

This ignores models using thresholds that may fit the data slightly less well than the maximum likelihood models

[Figure after Pons et al. 2006 Syst Biol; label: Speciation]

Probabilistic diversity estimation with uncertain species boundaries (using GMYC and model averaging)

Extension to current approach (Powell 2011 Methods Ecol Evol):

Step 1: Estimate AIC of each model (all single- and multiple-threshold models) and rank based on fit to the data

Step 2a: Estimate probabilities that two taxa belong to the same ‘species’ based on the weights associated with each model

Step 2b: Estimate sample richness (and variance associated with this estimate) using model averaging

Added benefit is that uncertainty in species boundaries can be directly incorporated into the variance associated with diversity estimates
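Step 1 and the model weights used in Step 2 can be sketched in a few lines of base R. This is only an illustration of the standard Akaike-weight arithmetic, not the code in the supplemental script; the model names and AIC scores are made up.

```r
# Hypothetical AIC scores for four candidate threshold models (made-up numbers).
aic <- c(model_a = 100.0, model_b = 101.2, model_c = 104.5, model_d = 112.0)

# Delta AIC: difference from the best (lowest) score.
delta <- aic - min(aic)

# Akaike weights: relative likelihoods exp(-delta / 2), normalised to sum to 1.
rel_lik <- exp(-0.5 * delta)
weights <- rel_lik / sum(rel_lik)

# Rank models by fit to the data (smallest delta first).
ranked <- data.frame(delta = delta, weight = weights)[order(delta), ]
print(round(ranked, 3))
```

Models with similar scores receive similar weights, which is how several well-fitting thresholds can all contribute to the final estimates.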

Several models fit well

R is available at http://cran.r-project.org/.

The commands to enter are preceded by ‘> ’; modify these as appropriate for your data. Notes are entered after the ‘#’ symbol.

‘Powell_supplemental_script.R’ contains functions for GMYC model averaging and most of the code used here; it is available at: http://dx.doi.org/10.1111/j.2041-210X.2011.00122.x

The source file can be opened with a text editor (e.g. Notepad, TextEdit).

These R packages (and their dependencies) are required to run the following functions; instructions for installing are on the following slides (install ‘splits’ after the other packages).

Downloads and installs ‘geiger’ and its dependencies

Downloads and installs ‘igraph’ package

Downloads and installs ‘vegan’ package

Downloads and installs ‘gtools’ package

‘ape’ and ‘paran’ are also required by the ‘splits’ package

‘splits’ needs to be installed from source; use the following:
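The original commands were shown as screenshots, so the following is a hedged reconstruction of the installation steps rather than the author's exact input. It assumes the named packages are available from CRAN and that ‘splits’ is hosted on R-Forge; check both before relying on it.

```r
# CRAN packages named in the tutorial (dependencies are installed automatically).
install.packages(c("geiger", "igraph", "vegan", "gtools", "ape", "paran"))

# Install 'splits' last, from source.
# Assumption: the package is hosted on the R-Forge repository.
install.packages("splits", repos = "http://R-Forge.R-project.org", type = "source")
```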

1) Read functions into R from source file in the working directory; calls to load required packages are also in source file

2) Show the workspace to check that functions were read correctly into the R workspace
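A minimal sketch of these two steps, assuming the script has been downloaded under its original file name into the working directory (the `script` variable is introduced here only for illustration):

```r
# 1) Read the functions (and the package-loading calls) from the source file.
script <- "Powell_supplemental_script.R"
if (file.exists(script)) {
  source(script)
} else {
  message("Not found in ", getwd(), "; use setwd() to change directory")
}

# 2) List the objects in the workspace to confirm the functions were read.
ls()
```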

1) Read tree into R; normally would read tree from file in working directory:
   A. Newick format: read.tree('treefile.phylo')
   B. Nexus format: read.nexus('treefile.nex')

2) Tree summary to check that tree was read correctly (proper number of tips); the tree needs to be fully dichotomous (number of nodes is one fewer than number of tips)
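As a small self-contained illustration, the sketch below reads a made-up five-tip Newick tree from a text string instead of a file and applies the dichotomy check described above. The tree is hypothetical; the object name ‘test.tr’ follows the tutorial.

```r
library(ape)

# A small hypothetical Newick tree supplied as text instead of a file.
test.tr <- read.tree(text = "((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")

# Printing the object gives the summary (number of tips and internal nodes).
print(test.tr)

# Fully dichotomous: number of internal nodes is one fewer than number of tips.
Nnode(test.tr) == Ntip(test.tr) - 1
```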

Plot tree; it needs to be ultrametric, meaning the distance from root to each tip is the same. Can check with ‘is.ultrametric(test.tr)’

Plot the accumulation of branches (N) through time; the GMYC model is used to detect abrupt changes in this accumulation rate
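A hedged sketch of how such a plot can be produced with ‘ape’, using a made-up five-tip tree (the tutorial's own tree is not reproduced here). The variables `ages` and `n_lineages` are introduced only to show what the plot encodes.

```r
library(ape)

# Hypothetical ultrametric tree; the check fails loudly if it is not ultrametric.
test.tr <- read.tree(text = "((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
stopifnot(is.ultrametric(test.tr))

# Node ages (time before present), oldest first.
ages <- sort(branching.times(test.tr), decreasing = TRUE)

# Number of lineages present after each branching event: 2, 3, ..., number of tips.
n_lineages <- seq_along(ages) + 1

# Lineages-through-time plot with a log-scaled y-axis.
ltt.plot(test.tr, log = "y")
```

On a log scale, a sudden steepening of this curve toward the present is the kind of change in accumulation rate that the threshold is meant to capture.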

Fit the GMYC model to the tree using the single-threshold method (“method=‘s’”). Results are stored in object ‘test.sing’.

The model is fit using each node (first column) as the threshold, from the second to the last branching event (age in second column), and the model likelihood is estimated for each (third column).

The last three columns are for diagnostic purposes (convergence warnings, number of iterations, and number of clusters) and are not important here.

Maximum likelihood (ML) model

Less than one minute to run single-threshold procedure

Time required increases with tree size

Summary of results:

1) Comparison of the ML model (five parameters) to the null model (single coalescent population, two parameters)

2) Number of clusters, entities (clusters + singletons) predicted by the ML model; CI: models within 2 log-likelihood units of ML model

3) Node age for the threshold in the ML model
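The comparison in item 1 is a likelihood ratio test, and the CI in item 2 is a set of near-best models. Both can be sketched in base R as below; all log-likelihood values are invented for illustration and are not tutorial output.

```r
# Hypothetical log-likelihoods (made-up numbers).
loglik_null <- 540.2   # single coalescent population, 2 parameters
loglik_gmyc <- 551.9   # ML single-threshold GMYC model, 5 parameters

# Likelihood ratio statistic and its chi-square p-value with 5 - 2 = 3 df.
lr <- 2 * (loglik_gmyc - loglik_null)
p_value <- pchisq(lr, df = 5 - 2, lower.tail = FALSE)

# CI set: candidate thresholds whose models fall within 2 log-likelihood
# units of the ML model.
loglik_all <- c(548.1, 550.3, 551.9, 551.0, 549.5)
in_ci <- loglik_all >= max(loglik_all) - 2
```

The reported interval for the number of clusters or entities would then be the range of those counts across the models flagged in `in_ci`.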

Fit the GMYC model to the tree using the multiple-thresholds method (“method=‘m’”). Results are stored in object ‘test.mult’.

The procedure starts by placing a single threshold at a fixed point in the tree, then introduces additional thresholds closer to or further from the node for particular lineages.

Model likelihood is printed to screen when improvement is observed.

The procedure finishes when improvements over the null model or earlier GMYC models are no longer found.

Approximately five minutes to run multiple-threshold procedure here

Run time increases (approximately exponentially) with tree size and decreases if multiple thresholds are not detected

Summary of results:

1) Comparison of the ML model (≥ six parameters) to the null model (single coalescent population, two parameters)

2) Number of clusters, entities (clusters + singletons) predicted by the ML model; CI: models within 2 log-likelihood units of ML model

3) Node ages for the thresholds in the ML model

Calculate AICc scores for GMYC models using different thresholds:

1) Specify object(s) containing GMYC model output fit using ‘gmyc.edit()’

Output:
i) Model-averaged parameter estimates
ii) Other information (e.g., only single/multiple-threshold output objects specified)
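For reference, the small-sample corrected AIC can be written as a short base R function. This is the textbook formula, not the supplemental script's code; in particular, the sample size `n` used for GMYC models is an assumption here, and the function name `aicc` and the numbers are hypothetical.

```r
# Small-sample corrected AIC. 'n' (the effective sample size) is an assumption;
# check the supplemental script for the quantity it actually uses.
aicc <- function(loglik, k, n) {
  stopifnot(n - k - 1 > 0)
  -2 * loglik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)
}

# Hypothetical example: log-likelihood 551.9, 5 parameters, n = 149.
aicc(551.9, k = 5, n = 149)
```

The correction term shrinks toward zero as `n` grows, so AICc and AIC agree for large trees.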

Generate some summary output: specify object containing model score calculations; specify cutoff for maximum delta AICc to print model summary to screen

Output:
1) Models ranked by increasing delta AICc; ‘step’ used to identify model output in ‘gmyc.edit()’ results

Generate some summary output, continued:

Output:
1) Models ranked by increasing delta AICc; last column (spilled over in screen output here) indicates Akaike weight given to model in the model-averaged parameter estimates

Generate some summary output, continued:

Output:
2) Model-averaged parameter estimates (this output does not account for the delta AICc argument)

Estimate number of clusters, entities (clusters + singletons), Shannon diversity; also estimate variance associated with these parameters:

1) Specify object containing model score calculations; specify cutoff for maximum delta AICc of included models

2) Enter ‘y’ to continue, ‘n’ to stop (e.g., if too many models – time limitation)
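The arithmetic behind a model-averaged estimate and its variance can be sketched as follows. This uses the unconditional variance estimator from Burnham and Anderson (2002) and assumes, for simplicity, that each model's own sampling variance is zero; the weights and entity counts are invented, and the supplemental script may differ in detail.

```r
# Hypothetical inputs: Akaike weights (summing to 1) and the number of
# entities predicted by each retained model.
w <- c(0.5, 0.3, 0.2)
n_entities <- c(10, 12, 9)

# Model-averaged estimate: weighted mean across models.
est <- sum(w * n_entities)

# Unconditional variance: each model's own sampling variance is assumed to be
# zero here, so only the spread between models contributes.
within_var <- c(0, 0, 0)
se <- sum(w * sqrt(within_var + (n_entities - est)^2))
var_est <- se^2
```

With these made-up inputs the estimate is 10.4 entities with a variance of about 0.92, reflecting disagreement among the three thresholds.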

Calculate pairwise probabilities that tips co-occur within GMYC clusters:

1) Specify object containing model score calculations; specify cutoff for maximum delta AICc of included models

2) Enter ‘y’ to continue, ‘n’ to stop (e.g., if too many models – time limitation)

delta AIC    Level of empirical support
0-2          substantial
4-7          considerably less
>10          essentially none

(Burnham and Anderson, 2002, Model Selection and Multimodel Inference, page 170)
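One way to picture the pairwise probabilities is as a weighted sum of same-cluster indicator matrices, one per retained model. The base R sketch below uses invented cluster assignments and weights, and treats a tip as co-occurring with itself; it is an illustration of the idea, not the supplemental script's implementation.

```r
# Hypothetical cluster assignments of five tips under three retained models
# (same integer = same GMYC cluster in that model).
assignments <- list(
  c(1, 1, 2, 2, 3),
  c(1, 1, 1, 2, 3),
  c(1, 1, 2, 3, 4)
)
w <- c(0.5, 0.3, 0.2)  # hypothetical Akaike weights, summing to 1

# For each model, build a 0/1 matrix marking tip pairs in the same cluster,
# then sum those matrices weighted by the model weights.
same_cluster <- function(a) outer(a, a, "==") * 1
prob <- Reduce(`+`, Map(function(a, wi) wi * same_cluster(a), assignments, w))
dimnames(prob) <- list(paste0("tip", 1:5), paste0("tip", 1:5))
print(prob)
```

Pairs grouped together by every model get probability 1, pairs never grouped get 0, and everything in between reflects disagreement among thresholds.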

Visual representation of cluster sizes, uncertainty; probabilities range from white (1) to red (0); x- and y-axis labels are arbitrary

Plot tree, numbers above branches represent probabilities that all tips nested within node exist in a single GMYC cluster (hard to see in the default plot window)

Plot to file, specify dimensions (in inches) to plot over larger area:
1) Open connection
2) Plot to file
3) Close connection
4) Show files in working directory
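These four steps map onto standard base R graphics calls. The sketch below writes to a temporary directory rather than the working directory, uses a placeholder plot in place of the tree, and picks an arbitrary file name and page size; all of these are assumptions for illustration.

```r
# 1) Open a PDF connection; width and height are in inches.
out_file <- file.path(tempdir(), "tree_plot.pdf")  # hypothetical file name
pdf(out_file, width = 10, height = 20)

# 2) Plot to the file (a placeholder plot stands in for the tree here).
plot(1:10, main = "placeholder for the tree plot")

# 3) Close the connection so the file is written to disk.
dev.off()

# 4) Show the files in the output directory.
list.files(tempdir())
```

Increasing `height` is usually what makes tip labels and branch annotations legible on large trees.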

File found in working directory; numbers above branches represent probabilities that all tips nested within node exist in a GMYC cluster

Finish session:
1) Show all objects in the workspace
2) Quit R; specifying ‘y’ to save image will result in this workspace being restored upon next start, as long as the user first navigates to the current directory before starting R
   - alternatively: “save.image(‘tutorial.rdata’)” results in an image that can be loaded from any directory

Reload session to demonstrate sample-specific diversity estimates:
1) Show working directory (started here)
2) Show files in working directory; contains a species-sample matrix (‘test.samples.txt’)
3) Reload source file to load necessary packages

These data were randomly generated and written to file using the code below, cells representing species presence/abundance in samples – species in rows, samples in columns
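The original generating code was shown as a screenshot and is not recoverable, so the following is only a plausible sketch. It assumes Poisson-distributed abundances, made-up species and sample labels, and a temporary output path; in a real analysis the row names would need to match the tip labels of the tree.

```r
set.seed(1)  # arbitrary seed so the example is reproducible

# 150 species (rows) by two samples (columns); each cell is a hypothetical
# abundance drawn from a Poisson distribution, with zero meaning absence.
samples <- matrix(rpois(150 * 2, lambda = 1), nrow = 150, ncol = 2,
                  dimnames = list(paste0("sp", 1:150), c("sample1", "sample2")))

# Write as tab-delimited text. The tutorial's file is 'test.samples.txt' in
# the working directory; a temporary path is used here to avoid overwriting it.
out_file <- file.path(tempdir(), "test.samples.txt")
write.table(samples, file = out_file, sep = "\t", quote = FALSE)
```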

1) Read species-sample information from file (tab-delimited: “sep=‘\t’”)
2) Show structure of samples object: data in data.frame object (default of ‘read.table()’), 150 species in rows, two samples in columns
3) Show summary of samples object
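A self-contained sketch of these three steps, using a tiny made-up table written to a temporary file so that it runs without the tutorial data. The object name ‘test.samples’ is hypothetical.

```r
# Write a tiny hypothetical species-sample table so the example is self-contained.
in_file <- tempfile(fileext = ".txt")
write.table(data.frame(sample1 = c(1, 0, 2), sample2 = c(0, 3, 1),
                       row.names = c("sp1", "sp2", "sp3")),
            file = in_file, sep = "\t", quote = FALSE)

# 1) Read the tab-delimited file; read.table() returns a data.frame and uses
#    the first column as row names because the header has one fewer field.
test.samples <- read.table(in_file, header = TRUE, sep = "\t")

# 2) Structure and 3) summary of the object.
str(test.samples)
summary(test.samples)
```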

Model-averaged diversity estimates for whole tree (as previously calculated, for comparison)

Model-averaged diversity estimates in each sample

‘est’: Species richness in each sample
‘var’: Variance of richness estimate; can propagate through further analyses

For example,

Average richness

Variance around the mean (including species boundary uncertainty)

Variance (underestimated, neglects species boundary uncertainty)
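One simple way to see the difference between the two variances is sketched below in base R. It assumes the estimation errors are independent across samples and adds their contribution to the usual variance of the mean; the numbers are invented, and this is not necessarily the calculation used on the original slide.

```r
# Hypothetical per-sample output: richness estimates ('est') and the variance
# of each estimate arising from species-boundary uncertainty ('var').
est <- c(12.4, 9.8, 11.1)
var_est <- c(0.9, 1.4, 0.6)
n <- length(est)

# Average richness across samples.
mean_richness <- mean(est)

# Variance of the mean from between-sample spread alone
# (neglects species-boundary uncertainty).
var_naive <- var(est) / n

# Adding the delineation variance carried by each estimate, assuming the
# estimation errors are independent across samples.
var_full <- var_naive + sum(var_est) / n^2
```

Under these assumptions the second variance can never be smaller than the first, which is the sense in which ignoring boundary uncertainty underestimates it.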
