Best practices at BGI for the Challenges in the Era of Big Genomics Data

Xing Xu, Ph.D., Director of Cloud Computing Product

Challenges in the Era of Big Genomic Data and Our Practices in BGI

Topics for Today

About BGI

Challenges and Solutions
- Data transfer
- Cloud Computing
- Computational Algorithms and Infrastructure
- Data Storage

The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers.
- Now more than 150 sequencers, with 6 TB/day sequencing throughput.

MODEL          ABI 3730XL   Roche 454   ABI SOLiD 4   Solexa GA IIx   Illumina HiSeq 2000
INSTALLATION   16           1           27            6               135

The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability has increased 10,000-fold
- Still increasing …

One of the world's leading research institutes in genomics

Since 2007:
- 253 papers in high-impact journals
- Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors
- 369 patent applications
- 254 software authorships


BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.

Challenges for Handling Big Data

- Exponential growth of data volume
- Complicated data analysis process
- Widely distributed data

Images from omicsmaps.com

Challenges and Solutions

Data transfer

Cloud Computing

Computational Algorithms and Infrastructure

Data Management

Solutions for data transfer

Data transfer
- Solution I: Hard drive shipment (via FedEx)
- Solution II: High-speed data transfer

Solutions for data transfer: High-speed data transfer

Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June 2012.


A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
- Right software: Aspera FASP data transfer protocol
- Right infrastructure: 10 Gb link between the US and China
- Right technology: RAM disk, IPv6
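As a quick check on the figures above, the average rate implied by a file size and a transfer time can be computed directly. This is a minimal sketch assuming decimal units (1 GB = 10^9 bytes); the function name is illustrative only. With these assumptions, 24 GB in 30 seconds averages about 6.4 Gbit/s, so the ~8 Gbit/s on the slide may refer to a peak or differently measured rate (an assumption, not something the slide states).

```python
def average_gbits_per_second(size_gigabytes: float, seconds: float) -> float:
    """Average transfer rate in Gbit/s, assuming 1 GB = 10**9 bytes = 8 Gbit."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_gigabytes * 8 / seconds


# Figures from the slide: a 24 GB file moved in 30 seconds.
rate = average_gbits_per_second(24, 30)
print(f"{rate:.1f} Gbit/s")
```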

[Diagram: Aspera server and Aspera clients, annotated with: software license; expensive physical bandwidth; bottleneck on the client site; not a good solution for sharing]

Data transfer
- Solution I: Hard drive shipment (via FedEx)
- Solution II: High-speed data transfer
- Solution III: Don’t move the data (Cloud Computing)

Solutions for cloud

Cloud Computing
- EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis

EasyGenomics™

EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.

- Algorithms, workflows, reports
- Computational resources
- Database, data management
- Web portal, simple UI
- High-speed connection

A typical use case

Bioinformatics Workflow

Four steps: Upload, Create a Sample, Perform Analyses, Download Results

Algorithms: Carefully chosen, tested and optimized

Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly

Homepage

Four task portals

Status of recent works

Warning and Logging

Navigation Tabs

Sequencing Quality Report

Mapping Report

Create an Analysis

Selected sample(s)

• One selected sample => Single Analysis

• Multiple selected samples => Batch Analyses

Create an Analysis

Selectable modules

Predefined Settings

Shortcut

What’s new?

An internal version of EG is running automatically as a production system.

It integrates the new data delivery portal of the sequencing service.
- Aspera FASP download
- Accessible to all workflows on EasyGenomics

You can choose to deliver data to the EasyGenomics platform

Configuration file

Import Data from Sequencing Service

Imported Samples

Solutions for cloud

Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution

Two paths for the future cloud solution

Software as a Service (SaaS) to Platform as a Service (PaaS)
To give flexibility to research users:
- Add their own tools (any tools)
- Integrate their own workflows (different combinations of modules)

One-Click SaaS solution
To give an automated solution to clinical users:
- Automated solution for repetitive work
- Fulfill very specific functions

Solutions for Algorithm and Infrastructure

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)

• Fast Parallel Framework: Hadoop Streaming

• Reliable Storage System: HDFS

• Scalable Map/Reduce framework

Raw Data

Mapping

Remove PCR duplicates

Realignment

Identify Variations

Selection & Annotation
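To illustrate why a step such as PCR duplicate removal fits the map/reduce model, here is a minimal in-memory sketch. It is not SOAP-Gaea's implementation: the `Read` record, the key choice (chromosome, position, strand) and the "keep the highest-quality read" rule are simplifying assumptions made for illustration, and the `shuffle` function only imitates what a framework such as Hadoop does between the two phases.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int      # leftmost aligned position
    strand: str   # "+" or "-"
    quality: int  # illustrative score, e.g. a sum of base qualities


def map_phase(reads: Iterable[Read]):
    """Emit (key, read) pairs; reads sharing a key are duplicate candidates."""
    for read in reads:
        yield (read.chrom, read.pos, read.strand), read


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Keep the highest-quality read for each key and drop the rest."""
    return [max(group, key=lambda r: r.quality)
            for _, group in sorted(groups.items())]


reads = [
    Read("r1", "chr1", 100, "+", 30),
    Read("r2", "chr1", 100, "+", 35),  # same key as r1, better quality
    Read("r3", "chr1", 100, "-", 20),  # other strand, so a different key
    Read("r4", "chr2", 50, "+", 25),
]
kept = reduce_phase(shuffle(map_phase(reads)))
print([r.name for r in kept])
```

Because each key is reduced independently, the groups can be spread over as many workers as are available, which is the property the pipeline relies on to scale.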

Raw Data
QC: SOAP-GaeaQC
Mapping: SOAPaligner, BWA, Bowtie, SOAP-GaeaAlignment
Remove PCR duplicates: SOAP-GaeaMarkDuplicate
Realignment: SOAP-GaeaRealignment
Identify Variations: SNP: SOAPsnp, SOAP-GaeaSNP, SAMtools; InDel: Dindel, SOAP-GaeaIndel
Selection & Annotation

SOAP-Gaea: Hadoop based resequencing pipeline

[Diagram labels: Reference; Key / Value; Position; Map; Aligning; Reduce]

Distributed Indexing for load balancing

Flexible splitting tolerates more mismatches

Dynamic Programming for robust gap alignment

SOAP-Gaea: Hadoop based resequencing pipeline

Old pipeline: two weeks
Cloud-based pipeline: within 15 hrs (120 cores)

Data: human 60X whole-genome resequencing

Fast and Scalable

• The Hadoop implementation provides great scalability.
• Simply by providing more resources, the analysis can finish much faster.
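For a sense of scale, the two run times quoted for the human 60X data set can be turned into a speedup factor. This is only worked arithmetic under stated assumptions: "two weeks" is taken as exactly 14 days of wall-clock time and "within 15 hrs" as exactly 15 hours, so the result is a rough lower bound rather than a measured figure.

```python
def speedup(old_hours: float, new_hours: float) -> float:
    """Ratio of old to new wall-clock time."""
    if new_hours <= 0:
        raise ValueError("new_hours must be positive")
    return old_hours / new_hours


old_pipeline_hours = 14 * 24   # "two weeks", assumed to mean 14 full days
cloud_pipeline_hours = 15      # "within 15 hrs" on 120 cores
factor = speedup(old_pipeline_hours, cloud_pipeline_hours)
print(f"about {factor:.1f}x faster")
```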

SOAP-GaeaAlignment (1 human sample from the 1000 Genomes Project)

Software             Mapping Rate   Confident Mapping Rate (MAPQ>=10)
Stampy               85.93%         70.00%
SOAP2                79.14%         79.14%
Novoalign            82.53%         79.74%
BWA                  91.54%         84.78%
Bowtie               81.15%         81.15%
SOAP-GaeaAlignment   91.75%         85.20%

It’s not only FAST, but also ACCURATE
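The two columns in the comparison above can be computed from per-read mapping qualities. The sketch below assumes that both rates are fractions of all input reads and that "confident" means MAPQ >= 10, as the column header suggests; the input representation (one MAPQ per read, `None` for unmapped reads) is a simplification chosen for illustration.

```python
from typing import Iterable, Optional, Tuple


def mapping_rates(mapqs: Iterable[Optional[int]],
                  min_mapq: int = 10) -> Tuple[float, float]:
    """Return (mapping rate, confident mapping rate) as percentages.

    Each item is the MAPQ of one read, or None if the read is unmapped.
    """
    total = mapped = confident = 0
    for mapq in mapqs:
        total += 1
        if mapq is not None:
            mapped += 1
            if mapq >= min_mapq:
                confident += 1
    if total == 0:
        raise ValueError("no reads given")
    return 100 * mapped / total, 100 * confident / total


# Toy input: 10 reads, 2 unmapped, 3 mapped with MAPQ below 10.
toy_mapqs = [60, 37, 9, 0, None, 25, None, 12, 3, 60]
rate, confident_rate = mapping_rates(toy_mapqs)
print(f"mapping rate {rate:.2f}%, confident mapping rate {confident_rate:.2f}%")
```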

Assembly

- Constructing de Bruijn graph
- Solving tiny repeats
- Merging bubbles
- Merging contigs
- Scaffolding
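The first assembly step, constructing a de Bruijn graph, can be sketched in a few lines. This is a textbook single-machine version, not SOAP-Hecate's distributed implementation: nodes are (k-1)-mers and every k-mer found in a read adds one edge from its prefix to its suffix. Reverse complements, sequencing errors and k-mer counts are ignored here for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Two overlapping toy reads taken from the sequence ACGTCA.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy case every node has a single successor, so the graph is one unbranched path that spells out ACGTCA; repeats and sequencing errors are what create the branches and bubbles that the later steps resolve.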

SOAP-Hecate: Distributed de novo Genome Assembly

Contig Extension, Scaffolding, Gap closing

                 SOAPdenovo v2   SOAP-Hecate v2.5 (84 cores)   SOAP-Hecate v2.5 (180 cores)
Data Size        670 GB          670 GB                        670 GB
No. of Servers   1               7                             15
Time             59 hours        59 hours                      38 hours
Memory Size      400 GB x 1      24 GB x 7                     24 GB x 15
Mode             Centralized     Distributed                   Distributed

* 80X human whole genome

SOAP-Hecate is scalable and uses much less memory per server

Scalability

Performance

Assembler         Scaffold N50
SOAP-Hecate       26,570,829
SOAPdenovo        117,000
ALLPATHS          211,000
Phusion2, phrap   495,000
Meraculous        486,000
ABySS             144,300

Tested on simulated data from Assemblathon 1 (Earl, Bradnam et al. 2011)
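For readers unfamiliar with the metric, N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of the standard definition follows; the input lengths are toy values, not data from the comparison.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Largest length L such that sequences >= L cover half the total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]  # total length 300
print(n50(toy_scaffolds))
```

Here the two longest scaffolds (80 + 70 = 150) already cover half of the 300 bases, so the N50 is 70.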

Solutions for Algorithms

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
- GPU-based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)

SOAP3: ~20X speed up from SOAP2

SOAP2 (2008): 20-30x
SOAP3 (2011), GPU version: 10-30x

[Charts: SOAP2 vs. SOAP3 on human and zebrafish data. Panels: total time (seconds), speedup, alignment ratio (%). Recoverable total-time values: SOAP2 1893.45 and 10671.39; SOAP3 211.53 and 819.81. Alignment-ratio values 84.2, 88.29 and 76.55 survive without series labels.]

In collaboration with the University of Hong Kong

GSNP: 50X faster than the CPU-based SOAPsnp

[Charts: GSNP vs. SOAPsnp on Ch. 21]

The elapsed time of all steps is included. GSNP is around 50x faster than single-threaded CPU-based SOAPsnp.

Solutions for Data Management

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU-based acceleration

Data Management
- Data management in BGI

Paradigm Shift

Traditional Model
- Business: determines what question to ask
- IT: structures the data to answer that question

Big Data Model
- IT: delivers a platform to enable creative discovery
- Business: explores what questions could be asked

Information Pyramid

[Pyramid, top to bottom: Decision, Knowledge, Information, Data. Side labels: Element, Meaning, Context, Application, Achievement. Steps: Organizing, Refining, Summarizing, Utilizing]

BGI Data Pyramid

iRODS (Data)
• Data Preservation
• Data Retrieval
• Data Sharing

Database (Information)
• BGI-SNP
• BGI-SV
• BGI-GaP
• Disease: HGVD/PMRD

Data Mining (Knowledge)
• Systems Biology
• Drug Discovery

Health/Clinical APP (Decision)
• Diagnosis of Genetic Diseases
• Drug of Choice

Data Flow

[Diagram: Sequencer, Raw Data, Data Analysis, Analyzed Data, Data Warehousing, Personalized Analysis, Clinical Diagnosis. Supporting components: Knowledge Base, Metadata, Public Resources, BGI-DB, Variant (Gene), Disease]

iRODS - Integrated Rule-Oriented Data System

* Access data with a web-based browser, the iRODS GUI, or command-line clients.

renci.org

Data Flow - iRODS

iRODS-based Data Management
• Contents: raw data, analyzed data and related metadata
• Data backup
• Fully integrated with LIMS
• Able to search and access any data according to the metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
• Federation: integrate separate iRODS zones
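The idea of finding data by its metadata rather than by its path can be sketched with a small in-memory catalog. This is not the iRODS API and does not reflect BGI's schema: the `DataObject` class, the `search` function, the paths and the attribute values are all hypothetical, with attribute names borrowed from the examples listed above (project, sample, cohort).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)


def search(catalog: List[DataObject], **criteria: str) -> List[str]:
    """Return paths of objects whose metadata matches every criterion."""
    return [
        obj.path
        for obj in catalog
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalog entries, for illustration only.
catalog = [
    DataObject("/zoneA/projX/s1.fq.gz",
               {"project": "projX", "sample": "s1", "cohort": "c1"}),
    DataObject("/zoneA/projX/s2.fq.gz",
               {"project": "projX", "sample": "s2", "cohort": "c2"}),
    DataObject("/zoneB/projY/s3.bam",
               {"project": "projY", "sample": "s3", "cohort": "c1"}),
]
print(search(catalog, cohort="c1"))
print(search(catalog, project="projX", cohort="c2"))
```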

Data Flow - BGI-DB

BGI-DB
• A locus-specific database (LSDB) for all variants identified by BGI
• Manages all basic information generated from data analysis pipelines
• Links all detailed information about individual samples to each variant
• Easy to query information from samples with certain commonality (such as same phenotype, same cohort, etc.)
• Provides the raw information for further data mining steps

Data Flow - BGI-DW & BGI-KB

BGI Data Warehousing & Knowledge Base
• BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs
• BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources
• Periodically and automatically updated
• Provides APIs for bioinformaticians to query the information and generate individualized reports

Data Flow - Success Story

Diagnosis for Monogenic Disease
- Group samples into cohorts based on their phenotypes
- Calculate variant frequencies from certain cohorts and save them into the allele frequency database
- Query the allele frequency database to filter out common variants and identify disease-causal variants
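The frequency-based filtering described for monogenic disease diagnosis can be sketched as follows. This is an illustration under stated assumptions, not BGI's pipeline: samples are assumed diploid, genotypes are given as alternate-allele counts (0, 1 or 2), variant identifiers are made-up strings, and the 0.3 cutoff is chosen only so that the toy data shows an effect; real cutoffs for rare disease are far lower.

```python
from collections import Counter
from typing import Dict, Iterable, List


def allele_frequencies(cohort: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Alternate-allele frequency per variant across a diploid cohort.

    cohort maps sample name -> {variant id: alternate-allele count}.
    """
    if not cohort:
        raise ValueError("cohort must not be empty")
    counts: Counter = Counter()
    for genotypes in cohort.values():
        counts.update(genotypes)  # adds the per-variant allele counts
    total_alleles = 2 * len(cohort)
    return {variant: n / total_alleles for variant, n in counts.items()}


def rare_variants(patient_variants: Iterable[str],
                  frequency_db: Dict[str, float],
                  max_frequency: float) -> List[str]:
    """Keep variants that are absent from or uncommon in the database."""
    return sorted(v for v in patient_variants
                  if frequency_db.get(v, 0.0) <= max_frequency)


# Hypothetical cohort of four samples.
cohort = {
    "s1": {"chr1:100:A:G": 1, "chr2:200:C:T": 2},
    "s2": {"chr1:100:A:G": 1},
    "s3": {"chr1:100:A:G": 2},
    "s4": {},
}
frequency_db = allele_frequencies(cohort)
patient = ["chr1:100:A:G", "chr2:200:C:T", "chr3:300:G:A"]
candidates = rare_variants(patient, frequency_db, max_frequency=0.3)
print(frequency_db)
print(candidates)
```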

Summary of Our Practice in IT infrastructure

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU-based acceleration

Data Management
- Using the iRODS file system to manage big data

Acknowledgement

Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
- GPU Lab: Bingqiang Wang, etc.

Test & QA Team
- Xin Guan, Jingjuan Liu, etc.

PMO & IT Operation
- Wenjun Zeng, Litong Lai, Jing Tian, etc.

Product Team
- Xing Xu, Jing Guo, Fang Fang, etc.

Other BGI Teams

Collaborators:
- University of Hong Kong (HKU)
- Hong Kong University of Science and Technology (HKUST)
- Nvidia
- Aspera
- RENCI
- Tianjin Supercomputing Center
