Aug2014 working group report characterization bioinformatics

Characterization/Bioinformatics Working Group

Chunlin Xiao, Mike Eberle

Characterization

• What are the barriers to submitting data via SRA?
– Experienced submitters are okay with the process
– First-time submitters need simpler instructions for submission
– Plan for accepting BioNano long-read sequence data

• What raw sequence data is currently available?
– NA12878 (available): Illumina, 454, SOLiD
– Pedigree (available): Illumina Platinum Genomes, CG
– PGP trios (in progress): Illumina, CG, Ion AmpliSeq exome
– Long reads (in progress): CG LFR, Illumina Moleculo, PacBio (older), BioNano

Integration of SNPs/indels

• Merged NA12878 calls available
– GIAB + RTG + PG
– Readme file explains rules

• Next step is integrating Illumina, CG, and Ion Torrent for PGP trios
– Proposal slide will be available for comments
– Test first on NA12878, then apply to trios

Merging/integrating calls

• Current data release directory includes multiple versions of files, which is a bit confusing to users
– Create a subdirectory under the ftp/release directory containing just one VCF file, one BED file for regions, and one README file

• Need to develop a merging tool
– Illumina is working on one for PG
– RTG has one for normalizing data
– GA4GH will need to develop a tool for VCF comparison
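To make the VCF comparison idea concrete, here is a minimal sketch of one way two call sets could be compared. It is not the GA4GH, Illumina, or RTG tool. It assumes both inputs are already normalized (left-aligned and trimmed), and the names `parse_vcf_lines` and `compare_calls` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical sketch: compare two call sets by (chrom, pos, ref, alt) keys.
# Assumption: variants in both inputs are already normalized, so identical
# variants have identical representations.

def parse_vcf_lines(lines):
    """Return a set of (chrom, pos, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per alternate allele
            if alt != ".":
                keys.add((chrom, pos, ref.upper(), alt.upper()))
    return keys


def compare_calls(lines_a, lines_b):
    """Partition calls into shared, only-in-A, and only-in-B sets."""
    a = parse_vcf_lines(lines_a)
    b = parse_vcf_lines(lines_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up example records, for illustration only.
example_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "1\t2000\t.\tCT\tC\t50\tPASS\t.",
]
example_b = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG,T\t60\tPASS\t.",
]
result = compare_calls(example_a, example_b)
```

Exact-key matching like this will miss equivalent variants that are written differently, which is why a normalization step is assumed to run first.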

Long Read Technologies

• Incorporating long read technologies?
– PacBio data/calls combined with BioNano creates long contigs/scaffolds
– Sequence & calls will be available soon (submitted)

• How should we call structural variants?
– Spiral Genetics is developing an assembly-based approach
– NIST is incorporating a set of rules to score SVs (~180 annotations per SV)
– Bcbio: caller combining multiple callers
– All call sets have a bias towards deletions; still a lot of work to do

• Adding hg38 reference-based call set
– Start by aligning with the alternate haplotypes and then mask these regions
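The slide does not specify how the masking would be done. As a rough sketch only, assuming the regions are given as BED-style 0-based half-open intervals and that masking means replacing bases with N, the step might look like the following. The function name `mask_regions` and the example sequence are hypothetical.

```python
# Hypothetical sketch: mask intervals of a sequence with 'N'.
# Assumption: regions are BED-style, 0-based, half-open [start, end).

def mask_regions(sequence, regions):
    """Return a copy of sequence with each [start, end) interval set to 'N'."""
    bases = list(sequence)
    for start, end in regions:
        start = max(start, 0)          # clip to the sequence bounds
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


# Made-up toy sequence and intervals, for illustration only.
masked = mask_regions("ACGTACGTAC", [(2, 5), (8, 20)])
```

Intervals that run past the end of the sequence are clipped rather than treated as errors, which is a design choice of this sketch and not something stated in the slides.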
