ASHG_2014_AP


www.bina.com

A highly efficient and scalable compute platform for massive variant annotation and rapid genome interpretation

James Warren1, Emre Colak1, Amirhossein Kiani1, Jian Li2, Aparna Chhibber2, Sanchita Bhattacharya3, Narges Bani Asadi1,2, Sharon Barr1, Atul Butte3, Garry Nolan4, Rong Chen5, Wing H. Wong6,7, and Hugo Y.K. Lam2,†

MOTIVATION

After obtaining variants from next-generation sequencing data, researchers and clinicians still face the task of interpreting the results. Despite the availability of many public databases, using this collective information is arduous because of inconsistent and heterogeneous data, multiple versions, and nonstandard formats. Moreover, even after the data are aggregated and the variants annotated, identifying the causative variants associated with the disease in question remains laborious.

APPROACH

We have developed a highly efficient data processing pipeline that leverages big data technologies to integrate annotations from a wide range of biological databases. The pipeline takes variant call sets, annotates all samples, and indexes the variants for analysis. Users can perform real-time searches and analytical queries against the annotated results to rapidly identify variants for further study.

CHALLENGES

•  Heterogeneous data. An annotation platform must integrate diverse datasets of variants, genes, diseases, transcripts, and functional predictions.
•  No standardization. The platform must account for differences between datasets, such as different reference genomes or schemas that change between versions.
•  Real-time interaction. A user must be able to interact with the annotation results in real time. Such interaction allows rapid identification of relevant variants while also supporting undirected investigation.
•  Contextual interface. The system cannot assume the user is familiar with the underlying data sources. Instead, it must support contextual queries, such as "find all predicted damaging variants of high quality associated with a given disease."
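One way to picture the contextual query quoted above is as a combination of small, named filters, so the user states intent rather than source-specific fields. The following is only a minimal sketch under stated assumptions: the record layout, field names, quality threshold, and helper functions are all hypothetical and are not taken from the Bina Annotation Platform.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AnnotatedVariant:
    # All field names are hypothetical, chosen only for illustration.
    variant_id: str
    quality: float            # call quality score
    predicted_damaging: bool  # summary of functional predictions
    diseases: frozenset       # disease names linked by annotation sources


Predicate = Callable[[AnnotatedVariant], bool]


def all_of(*predicates: Predicate) -> Predicate:
    """Combine predicates so a variant must satisfy every one of them."""
    return lambda v: all(p(v) for p in predicates)


def high_quality(min_quality: float) -> Predicate:
    return lambda v: v.quality >= min_quality


def predicted_damaging() -> Predicate:
    return lambda v: v.predicted_damaging


def associated_with(disease: str) -> Predicate:
    return lambda v: disease in v.diseases


def run_query(variants: Iterable[AnnotatedVariant], predicate: Predicate) -> List[str]:
    """Return the IDs of variants that satisfy the composed predicate."""
    return [v.variant_id for v in variants if predicate(v)]


# Toy records; the values are made up.
variants = [
    AnnotatedVariant("var-1", 55.0, True, frozenset({"disease A"})),
    AnnotatedVariant("var-2", 12.0, True, frozenset({"disease A"})),
    AnnotatedVariant("var-3", 60.0, False, frozenset({"disease A"})),
    AnnotatedVariant("var-4", 70.0, True, frozenset({"disease B"})),
]

# "Find all predicted damaging variants of high quality associated with disease A."
query = all_of(predicted_damaging(), high_quality(30.0), associated_with("disease A"))
hits = run_query(variants, query)
print(hits)
```

Each filter hides which underlying source supplies the answer, which is the point of a contextual interface; a production system would presumably evaluate such filters against precomputed indices rather than scanning records.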

Affiliations

1.  Department of Engineering, Bina Technologies, Redwood City, CA 94065.
2.  Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065.
3.  Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94305.
4.  Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305.
5.  Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029.
6.  Department of Statistics, Stanford University, Stanford, CA 94305.
7.  Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA 94305.

† To whom correspondence should be addressed.

Contact Us

[email protected]

METHOD

During the annotation process, the pipeline:
•  constructs indices that can be efficiently composed to support an effectively unlimited number of queries
•  uses Hadoop MapReduce to associate variants with relevant annotations
•  stores the annotated output and indices in HBase, a NoSQL database

A 5-node Hadoop cluster can annotate and index a whole exome sequencing sample in 30 minutes and a whole genome sequencing sample in under an hour.

As a variant set passes through the data pipeline:
•  it is linked with over 140 annotation classes from more than 20 databases/datasets
•  annotating a sample and indexing its variants are computationally demanding steps, but these are one-time costs

After the process is complete:
•  users can interact with the results via an intuitive web interface.
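The two pre-computation ideas above, joining variants to annotations in a map/reduce style and then composing per-attribute indices, can be illustrated in plain Python. This is a sketch under stated assumptions and not the platform's implementation: it uses no Hadoop, Cascalog, or HBase API, and the key layout, annotation names, and toy values are all hypothetical.

```python
from collections import defaultdict

# Toy inputs; keys are (chromosome, position, ref, alt). All values are made up.
variants = [
    ("chr1", 100, "A", "G"),
    ("chr2", 200, "C", "T"),
    ("chrX", 300, "G", "A"),
]
annotations = [
    (("chr1", 100, "A", "G"), ("effect", "non_synonymous")),
    (("chr1", 100, "A", "G"), ("in_known_db", True)),
    (("chrX", 300, "G", "A"), ("effect", "non_synonymous")),
    (("chr9", 900, "T", "C"), ("effect", "synonymous")),  # no matching variant
]


def map_phase(variants, annotations):
    """Emit (key, tagged value) pairs so records for one variant share a key."""
    for key in variants:
        yield key, ("variant", None)
    for key, (name, value) in annotations:
        yield key, ("annotation", (name, value))


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Keep keys that contain a called variant and merge their annotations."""
    annotated = {}
    for key, values in groups.items():
        if any(tag == "variant" for tag, _ in values):
            annotated[key] = dict(v for tag, v in values if tag == "annotation")
    return annotated


def build_indices(annotated):
    """One index per (attribute, value): the set of matching variant keys."""
    indices = defaultdict(set)
    for key, fields in annotated.items():
        indices[("chrom", key[0])].add(key)
        for name, value in fields.items():
            indices[(name, value)].add(key)
    return indices


annotated = reduce_phase(shuffle(map_phase(variants, annotations)))
indices = build_indices(annotated)

# Composing indices: intersect for AND, subtract for NOT.
nonsyn = indices[("effect", "non_synonymous")]
novel_nonsyn = nonsyn - indices[("in_known_db", True)]
x_linked_nonsyn = nonsyn & indices[("chrom", "chrX")]
print(sorted(novel_nonsyn), sorted(x_linked_nonsyn))
```

Because each index is built once and queries only combine existing sets, the expensive work stays in the one-time annotation step while the number of expressible queries grows with every combination of indices.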

[Figure: pipeline architecture. Pre-Computation (Hadoop / Cascalog): Genomic Variants; SnpEff; Variants with Predicted Effects; External Data Sources*; Fully Annotated Variants; Indices / Functional Filters. NoSQL Datastore (HBase). Real-Time Interaction: REST / API.]

* Data sources include: 1000 Genomes, Cancer Gene Census, ClinVar, dbNSFP, dbSNP, DGV, ENSEMBL, ESP, GWAS, HGMD, PGMD, PROTEOME, RefSeq, RepeatMasker, Segmental Duplications, TRANSFAC

EXAMPLE APPLICATIONS

Ohtahara syndrome (early infantile epileptic encephalopathy with suppression bursts) is a rare form of epilepsy that presents in early infancy and occurs slightly more often in males.

We analyzed a whole-genome-sequenced family trio with two unaffected parents and an affected son. Using the Bina Annotation Platform, we were able to filter from over 6.5 million variants in this family down to one X-linked non-synonymous variant in the gene AGTR2 potentially associated with the syndrome in the proband.

For another application of the Bina Annotation Platform, we analyzed the WGS data from the DNA of Ata [1], the skeletal remains of a 6-inch human found in the Atacama Desert, Chile.

The annotation platform discovered 4,000+ exonic non-synonymous SNVs, 400+ frame-shift or codon-change indels in genes previously associated with disease, and 1,000+ structural variations. Fourteen of these variants were located in genes known to be associated with dwarfism and skeletal dysplasia, of which one was not in dbSNP. The results were of scientific interest and were taken forward for further investigation [2].

CITATIONS

[1] http://news.sciencemag.org/health/2013/05/bizarre-6-inch-skeleton-shown-be-human

[2] S. Bhattacharya, J. Li, H. Lam, R. Lachman, N. Asadi, A. Butte, G. Nolan. Whole genome sequencing of mummy DNA shows significant association with human disease phenotype. Poster 2914S at ASHG 2014.

CONCLUSION AND FUTURE WORK

The Bina Annotation Platform has proven to be a powerful tool for variant interpretation for both single and multi-sample analyses. In future releases the platform will support additional workflows such as case-control and cohort studies, and will allow users to upload custom databases.
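Returning to the trio analysis under EXAMPLE APPLICATIONS, the kind of filter that narrows a family's variants to X-linked candidates might look like the sketch below. The poster does not state the actual filter criteria, so the inheritance rules, record layout, and field names here are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrioVariant:
    # Hypothetical record layout; genotypes count alternate alleles (0, 1, or 2).
    variant_id: str
    chrom: str
    effect: str
    proband_alt: int
    father_alt: int
    mother_alt: int


def x_linked_candidate(v: TrioVariant) -> bool:
    """Assumed criteria for an affected son with two unaffected parents."""
    return (
        v.chrom == "X"
        and v.effect == "non_synonymous"
        and v.proband_alt >= 1   # affected son carries the alternate allele
        and v.father_alt == 0    # unaffected father does not carry it
        and v.mother_alt <= 1    # unaffected mother is at most a carrier
    )


# Toy records; the values are made up and do not come from the study.
trio_variants = [
    TrioVariant("v1", "1", "non_synonymous", 1, 0, 1),  # autosomal
    TrioVariant("v2", "X", "non_synonymous", 1, 0, 1),  # passes all criteria
    TrioVariant("v3", "X", "synonymous", 1, 0, 1),      # wrong effect
    TrioVariant("v4", "X", "non_synonymous", 1, 1, 0),  # father carries it
    TrioVariant("v5", "X", "non_synonymous", 0, 0, 1),  # son does not carry it
]

candidates = [v.variant_id for v in trio_variants if x_linked_candidate(v)]
print(candidates)
```

In practice such a filter would typically be combined with quality and population-frequency criteria, which are omitted here because the poster gives no details on them.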