The Barcode of Life Data Portal (http://bol.uvm.edu) Dr. David E Schindel, Executive Secretary Michael Trizna, Database Specialist Consortium for the Barcode of Life (CBOL) Smithsonian Institution Washington, DC www.barcodeoflife.org; [email protected] and [email protected]

Dr David Schindel and Mike Trizna - BOL Data Portal


DESCRIPTION

Using the BOL Data Portal in conjunction with BOLD: its capabilities and a case study, the Smithsonian frozen bird tissue project


Page 1: Dr David Schindel and Mike Trizna - BOL Data Portal

The Barcode of Life Data Portal

(http://bol.uvm.edu)

Dr. David E Schindel, Executive Secretary

Michael Trizna, Database Specialist

Consortium for the Barcode of Life (CBOL)

Smithsonian Institution

Washington, DC

www.barcodeoflife.org;

[email protected] and [email protected]

Page 2

Contents of Presentation

Crowd-sourced open source software

How does Data Portal complement BOLD and GenBank?

Data Portal capabilities

Case Study: Smithsonian frozen bird tissue project

Page 3

An Experiment in Museum Tissue Mining and Fast Data Release

Tissue sampling winter/spring

Sequencing completed in September

Sequence quality control in October

Taxonomic checking in early November
– Obvious errors removed
– Minor discrepancies remain

Data released for Adelaide Conference
– Crowd-sourced annotation by community
– Will data be misused?

Page 4

Unique Data Portal Capabilities

Creating customized datasets from public and/or your private data

Online library of standard datasets

Support sharing within project teams using Connect IDs, easy link to Working Groups

Running different identification analyses based on different methodologies:
– Standard sequence input using FASTA format
– Use standard or customized datasets
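
As an illustration of the FASTA input mentioned above, here is a minimal Python sketch that reads FASTA text into a dictionary of query sequences. It is not the Data Portal's own code; the function name `parse_fasta` and the sample IDs and sequences are hypothetical.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may wrap over several lines; join and upper-case it.
            records[current_id] += line.upper()
    return records


# Hypothetical input: two made-up records, not real barcode sequences.
example = """>sample_1 hypothetical COI fragment
ACGTACGTAC
GGTTAACC
>sample_2
ttgacc
"""
queries = parse_fasta(example.splitlines())
print(queries)
```

A dictionary keyed by record ID is used here only because it makes it easy to pair each query with whatever identification result comes back for it.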

Page 5

Barcode Aggregator

727,170 public records

Page 6

Summary Statistics per Family

Page 7

Creating Customized Datasets

Page 8

Existing Data Analysis Packages

List of packages:
– BLOG
– BRONX
– Kernel
– CAOS
– USEARCH
– BLAST

Output of identification routines as probabilities of assignment

Page 9

Data Analysis Methods Session

New packages presented Friday afternoon:
– Damon Little: Automatic Plants Barcode pipeline (from raw traces to trimmed/edited sequences)
– Ka Hou Chu: Composite Vector Method (profile trees for faster alignment and tree-based analysis)
– Alain Franc: Matching Next Generation results to Sanger-based reference records

Page 10
Page 11

Sample output

Page 12

CONNECT for Data Portal Collaboration

Page 13
Page 14

The USNM Bird Project

USNM Division of Birds frozen tissue collection:
– 21,104 specimens, 2,512 species

Which new ones to sample/barcode?

Public records for birds
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388

Page 15

Moving Data Among BOLD, GenBank, Data Portal

USNM Excel Spreadsheet (KE-Emu Source)

Local database that holds all fields from the original spreadsheet

Data Portal Aggregator database

BOLD

Split into projects that consist of 2-4 plates
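
The last step of this flow, splitting samples into projects of 2-4 plates, could be sketched in Python as below. This is a rough illustration only. The figure of 95 samples per plate, the cap of 4 plates per project, the helper name `chunk`, and the sample IDs are all assumptions, not details from the slides.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Assumptions for illustration only (not stated in the slides).
SAMPLES_PER_PLATE = 95
MAX_PLATES_PER_PROJECT = 4


def chunk(items: Sequence[T], size: int) -> List[List[T]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


# Hypothetical sample IDs standing in for rows of the local database.
sample_ids = [f"tissue_{n:04d}" for n in range(1, 3151)]

plates = chunk(sample_ids, SAMPLES_PER_PLATE)
projects = chunk(plates, MAX_PLATES_PER_PROJECT)

print(len(plates), "plates in", len(projects), "projects")
```

Simple chunking like this can leave a final project with a single plate for other sample counts, so a real workflow would need to rebalance the last group to stay within the 2-4 plate range.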

Page 16

Creating a ‘Pick List’

Spreadsheet of tissue samples compared with:
– ITIS taxonomy
– Clements species list in BOLD
– Counts of GenBank and/or public BOLD records
– Geographic information
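
One way to picture the comparison above is a small Python sketch that ranks tissue samples by how few public records their species already has. It is a sketch under stated assumptions, not the project's actual procedure. The `TissueSample` fields, the `build_pick_list` function, the cap of two samples per species, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TissueSample:
    catalog_id: str  # hypothetical identifier from the museum spreadsheet
    species: str
    country: str


def build_pick_list(
    samples: List[TissueSample],
    public_counts: Dict[str, int],
    max_per_species: int = 2,
) -> List[TissueSample]:
    """Pick samples, favouring species with the fewest public records."""
    ordered = sorted(
        samples,
        key=lambda s: (public_counts.get(s.species, 0), s.species, s.catalog_id),
    )
    picked: List[TissueSample] = []
    taken: Dict[str, int] = {}
    for sample in ordered:
        # Cap the number of samples per species to spread sequencing effort.
        if taken.get(sample.species, 0) < max_per_species:
            picked.append(sample)
            taken[sample.species] = taken.get(sample.species, 0) + 1
    return picked


# Made-up example data with placeholder species names.
samples = [
    TissueSample("T1", "Genus alpha", "Panama"),
    TissueSample("T2", "Genus alpha", "Guyana"),
    TissueSample("T3", "Genus alpha", "Peru"),
    TissueSample("T4", "Genus beta", "Panama"),
]
public_counts = {"Genus alpha": 0, "Genus beta": 12}

pick_list = build_pick_list(samples, public_counts)
print([s.catalog_id for s in pick_list])
```

Geographic information could be folded into the sort key as well, for example to prefer samples from countries not yet represented in public records.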

Screenshot of USNM list side-by-side with BOLD records

Page 17

Identifying Samples to be Subsampled

Page 18

Side-by-Side Lists

Page 19

USNM Bird Dataset

3,150 tissues sampled

168 failed sequences

94 problematic sequences

166 clustered badly

2,761 ‘BARCODE-ready’ samples

1,147 ‘first-BARCODE’ species

91% increase over 1,259 barcoded species

(3,892 listed in BOLD includes BINs, others)
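
The 91% figure can be checked from the counts on this slide. The short Python calculation below uses only those numbers; the variable names are illustrative. The three exclusion counts (failed, problematic, clustered badly) are left out, since the slide does not say whether those categories overlap.

```python
# Counts taken from the slide.
tissues_sampled = 3150
barcode_ready = 2761
first_barcode_species = 1147
previously_barcoded_species = 1259

# Share of sampled tissues that ended up BARCODE-ready.
ready_fraction = barcode_ready / tissues_sampled

# Relative increase in barcoded bird species contributed by this project.
species_increase = first_barcode_species / previously_barcoded_species

# 'First-BARCODE' species are new by definition, so the totals simply add.
species_after_release = previously_barcoded_species + first_barcode_species

print(f"BARCODE-ready share: {ready_fraction:.1%}")
print(f"Species increase: {species_increase:.0%}")
print(f"Barcoded species after release: {species_after_release}")
```

The ratio 1,147 / 1,259 is about 0.911, which matches the 91% increase quoted on the slide.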

Page 20

Two problematic clades, USNM data

Flycatchers: Family Tyrannidae
– Sublegatus arenarum, S. modestus, S. obscurior, S. sp.
– Conopias parvus, C. albovittatus
– Myiarchus ferox, M. swainsoni, M. sp.

Hummingbirds: Family Trochilidae
– Phaethornis longuemareus

Inconsistencies within USNM dataset

Incompatibilities with public, other data

Page 21
Page 22

Resolving Mis-identified Specimens

Page 23

What testing dataset to use?

ID trees and analytical routines could use:
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388
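
The four candidate datasets above appear to be successively stricter subsets of one another. A minimal Python sketch of that tiered filtering follows, assuming each record carries a BARCODE flag, an optional species name, and a trace count. The `CoiRecord` fields, the `filter_tiers` function, and the example records are hypothetical and do not reflect the GenBank or Data Portal schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CoiRecord:
    accession: str               # hypothetical accession label
    has_barcode_keyword: bool    # record is flagged as BARCODE
    species_name: Optional[str]  # None when no taxonomic name is attached
    trace_count: int             # number of trace files linked to the record


def filter_tiers(records: List[CoiRecord]) -> Dict[str, List[CoiRecord]]:
    """Return the four nested candidate datasets, loosest to strictest."""
    all_coi = list(records)
    barcode = [r for r in all_coi if r.has_barcode_keyword]
    named = [r for r in barcode if r.species_name]
    traced = [r for r in named if r.trace_count >= 2]
    return {
        "all_public_coi": all_coi,
        "barcode": barcode,
        "barcode_named": named,
        "barcode_named_2_traces": traced,
    }


# Made-up records, one falling out at each tier.
records = [
    CoiRecord("REC1", False, "Genus alpha", 0),
    CoiRecord("REC2", True, None, 2),
    CoiRecord("REC3", True, "Genus beta", 1),
    CoiRecord("REC4", True, "Genus gamma", 2),
]
tiers = filter_tiers(records)
print({name: len(subset) for name, subset in tiers.items()})
```

None of these filters addresses the question of which records have reliable taxonomic IDs; that would need an external check against the voucher specimens.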

Which ones have reliable taxonomic IDs?

Page 24

Preparing a Data Release Paper

Summary statistics from Data Portal

Figures from BOLD