genomeinabottle
Aug 2014 working group report: characterization/bioinformatics
Characterization/Bioinformatics Working Group
Chunlin Xiao, Mike Eberle
Characterization
• What are the barriers to submitting data via SRA?
  – Experienced submitters are okay with the process
  – First-time submitters need simpler instructions for submission
  – Plan for accepting BioNano long read sequence
• What raw sequence data is currently available?
  – NA12878 (available): Illumina, 454, SOLiD
  – Pedigree (available): Illumina Platinum Genomes, CG
  – PGP trios (in progress): Illumina, CG, Ion AmpliSeq exome
  – Long reads (in progress): CG LFR, Illumina Moleculo, PacBio (older), BioNano
Integration of SNPs/indels
• Merged NA12878 calls available
  – GIAB + RTG + PG
  – Readme file explains rules
• Next step is integrating Illumina, CG & Ion Torrent for PGP trios
  – Proposal slide will be available for comments
  – Test first on NA12878 and apply to trios
Merging/integrating calls
• Current data release directory includes multiple versions of files, which is a bit confusing to users
  – Create a subdir under the ftp/release directory containing just one vcf file, one bed file for regions, and one README file
• Need to develop a merging tool
  – Illumina is working on one for PG
  – RTG has one for normalizing data
  – GA4GH will need to develop a tool for vcf comparison
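To make the vcf comparison problem concrete, here is a minimal Python sketch that compares two call sets by exact (chrom, pos, ref, alt) match. It is only an illustration, not any of the tools mentioned above: the function names `read_variant_keys` and `compare_call_sets` and the toy records are hypothetical, and no normalization is done, so the same indel written two different ways would be counted as discordant.

```python
def read_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Header lines (starting with '#') and blank lines are skipped.
    Multi-allelic records are split into one key per ALT allele.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def compare_call_sets(lines_a, lines_b):
    """Exact-match comparison of two call sets (no normalization)."""
    a = read_variant_keys(lines_a)
    b = read_variant_keys(lines_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy records for illustration only (not real GIAB calls).
calls_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tCT\tC\t50\tPASS\t.",
]
calls_b = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t60\tPASS\t.",
    "1\t300\t.\tG\tT,A\t60\tPASS\t.",
]

result = compare_call_sets(calls_a, calls_b)
for name in ("shared", "only_a", "only_b"):
    print(name, sorted(result[name]))
```

Exact matching is the simplest possible rule. A real comparison tool would need to normalize variant representation first, which is why the normalization step listed above matters.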
Long Read Technologies
• Incorporating long read technologies?
  – PacBio data/calls combined with BioNano creates long contigs/scaffolds
  – Sequence & calls will be available soon (submitted)
• How should we call structural variants?
  – Spiral Genetics is developing an assembly-based approach
  – NIST is incorporating a set of rules used to score SVs (~180 annotations per SV)
  – Bcbio: caller combining multiple callers
  – All call sets have a bias towards deletions; still a lot of work to do
• Adding hg38 reference based call set
  – Start by aligning with the alternate haplotypes and then mask these regions
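The slides do not say how the masking would be done. One possible reading, sketched below in Python, is hard-masking: replacing the bases in the listed regions with N. The function name `mask_regions`, the 0-based half-open interval convention (as in bed files), and the toy sequence are all assumptions made for illustration.

```python
def mask_regions(sequence, regions):
    """Return a copy of `sequence` with each region replaced by 'N'.

    `regions` holds (start, end) pairs, 0-based and half-open.
    Intervals are clipped to the sequence bounds.
    """
    masked = list(sequence)
    for start, end in regions:
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy example: mask bases 2-4 and everything from base 8 onward.
toy = "ACGTACGTAC"
masked = mask_regions(toy, [(2, 5), (8, 12)])
print(masked)
```

Another reading would be to leave the sequence untouched and subtract the regions from the bed file of callable regions instead; the slides leave that choice open.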