My presentation for the workshop on the Bio-IT Best Practice Award at TriCon 2013
Xing Xu, Ph.D., Director of Cloud Computing Product
Challenges in the Era of Big Genomic Data and Our Practices in BGI
Topics for Today
About BGI
Challenges and Solutions
- Data transfer
- Cloud Computing
- Computational Algorithms and Infrastructure
- Data Storage
BGI
The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers.
- Now more than 150 sequencers, with 6 TB/day sequencing throughput.
MODEL                   INSTALLATION
ABI 3730XL              16
Roche 454               1
ABI SOLiD 4             27
Solexa GA IIx           6
Illumina HiSeq 2000     135
The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability have increased 10,000-fold
- Still increasing …
One of the world's leading research institutes in genomics
Since 2007:
- 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authorships
- 369 patent applications
- 254 software authorships
BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
Challenges for Handling Big Data
- Exponential growth of data volume
- Complicated data analysis process
- Widely distributed data
Images from omicsmaps.com
Challenges and Solutions
- Data transfer
- Cloud Computing
- Computational Algorithms and Infrastructure
- Data Management
Solutions for data transfer
Data transfer
- Solution I: Hard drive shipment (with FedEx)
- Solution II: High-speed data transfer
Solutions for data transfer: High-speed data transfer
Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June 2012.
A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
- Right software: Aspera fasp data transfer protocol
- Right infrastructure: 10 Gb link between the US and China
- Right technology: RAM disk, IPv6
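As a quick sanity check on the transfer figures above, the average rate implied by a file size and a duration can be computed directly. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and treats the 30 seconds as the whole transfer. Under those assumptions the average works out to about 6.4 Gbit/s, so the quoted ~8 Gbit/s may refer to a peak rate or to a different size convention.

```python
def average_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput in Gbit/s for a transfer of size_gb gigabytes."""
    bits = size_gb * 1e9 * 8  # assumption: 1 GB = 10**9 bytes
    return bits / seconds / 1e9


def seconds_at_rate(size_gb: float, gbit_per_s: float) -> float:
    """Time in seconds to move size_gb gigabytes at a sustained rate."""
    return size_gb * 8 / gbit_per_s


rate = average_gbit_per_s(24, 30)
print(f"Average rate: {rate:.1f} Gbit/s")
print(f"Time at full 10 Gbit/s line rate: {seconds_at_rate(24, 10):.1f} s")
```

The second figure gives a lower bound on the transfer time over a 10 Gb link, which is a useful reference when judging how close a transfer came to saturating it.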
[Diagram: one Aspera Server at BGI serving several Aspera Clients]
- BGI: software license, expensive physical bandwidth
- Clients: free client software, bottleneck on the client site
- Not a good solution for sharing
Solutions for data transfer
- Solution III: Don't move the data (Cloud Computing)
Solutions for cloud
Cloud Computing
- EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis
EasyGenomics™
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
- Algorithms, workflows, reports
- Computational resources
- Database, data management
- Web portal, simple UI
- High-speed connection
Bioinformatics Workflow
Four steps: Upload, Create a Sample, Perform Analyses, Download Results
Algorithms: Carefully chosen, tested and optimized
Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
Create an Analysis
Selected sample(s)
• One selected sample => Single Analysis
• Multiple selected samples => Batch Analyses
What’s new?
An internal version of EasyGenomics (EG) is running automatically as a production system.
It integrates the new data delivery portal of the sequencing service.
- Aspera fasp download
- Accessible to all workflows on EasyGenomics
Two paths for the future cloud solution
Software as a Service (SaaS) to Platform as a Service (PaaS)
To give flexibility to research users:
- Add their own tools (any tools)
- Integrate their own workflows (different combinations of modules)
One-Click SaaS solution
To give an automated solution to clinical users:
- Automated solution for repetitive work
- Fulfill very specific functions
Solutions for Algorithm and Infrastructure
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
SOAP-Gaea: Hadoop-based resequencing pipeline
• Fast Parallel Framework: Hadoop Streaming
• Reliable Storage System: HDFS
• Scalable Map/Reduce framework
Pipeline steps and the tools used at each step:
- Raw Data
- QC: SOAP-GaeaQC
- Mapping: SOAPaligner, BWA, Bowtie, SOAP-GaeaAlignment
- Remove PCR duplicates: SOAP-GaeaMarkDuplicate
- Realignment: SOAP-GaeaRealignment
- Identify Variations: SNP (SOAPsnp, SOAP-GaeaSNP, SAMtools); InDel (Dindel, SOAP-GaeaIndel)
- Selection & Annotation
SOAP-Gaea: Hadoop-based resequencing pipeline
[Diagram labels: Reads, Reference, Key (Position), Value, Map, Aligning, Reduce]
- Distributed indexing for load balancing
- Flexible splitting tolerates more mismatches
- Dynamic programming for robust gap alignment
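The key/value idea in the alignment diagram can be illustrated with a toy, single-process simulation of map, shuffle and reduce. This is not SOAP-Gaea's actual algorithm: the seed length, the exact-match seeding, and the Hamming-distance scoring are all assumptions chosen to keep the sketch short, and every function name here is hypothetical.

```python
from collections import defaultdict

K = 4  # seed length, an illustrative choice


def build_index(reference, k=K):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, index, k=K):
    """Map step: emit (candidate position, read) using the read's first k-mer as seed."""
    for pos in index.get(read[:k], []):
        yield pos, read


def reduce_position(pos, reads, reference):
    """Reduce step: count mismatches of each read against the reference at pos."""
    results = []
    for read in reads:
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            results.append((read, pos, mismatches))
    return results


def align(reads, reference):
    index = build_index(reference)
    grouped = defaultdict(list)  # shuffle: group emitted values by key (position)
    for read in reads:
        for pos, emitted in map_read(read, index):
            grouped[pos].append(emitted)
    hits = []
    for pos in sorted(grouped):
        hits.extend(reduce_position(pos, grouped[pos], reference))
    return hits


reference = "ACGTACGTTAGCCGATAGGC"
reads = ["TAGCCGAT", "ACGTTAGG"]
for read, pos, mismatches in align(reads, reference):
    print(read, pos, mismatches)
```

Keying by reference position means all reads that land in the same region reach the same reducer, which is what makes the work divisible across nodes in a real Hadoop job.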
Fast and Scalable
[Chart: run time, old pipeline vs cloud-based pipeline. Old pipeline: two weeks. Cloud-based pipeline: within 15 hrs (120 cores). Data: human 60X whole-genome resequencing]
• The Hadoop implementation provides great scalability.
• Simply by providing more resources, the analysis can finish much faster.
SOAP-GaeaAlignment (1 human sample from 1000 Genomes)
Software              Mapping Rate   Confident Mapping Rate (MAPQ>=10)
Stampy                85.93%         70.00%
SOAP2                 79.14%         79.14%
Novoalign             82.53%         79.74%
BWA                   91.54%         84.78%
Bowtie                81.15%         81.15%
SOAP-GaeaAlignment    91.75%         85.20%
It’s not only FAST, but also ACCURATE
SOAP-Hecate: Distributed de novo Genome Assembly
Assembly
- Constructing de Bruijn graph
- Solving tiny repeats
- Merging bubbles
- Merging contigs
- Scaffolding
Stages: contig extension, scaffolding, gap closing
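The first assembly step, constructing a de Bruijn graph, can be sketched in a few lines. This is a generic textbook illustration, not SOAP-Hecate's distributed implementation: the function names are hypothetical, and the contig walk is simplified to follow a node only while it has exactly one successor.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers seen in reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles, which repeats can create
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTC", "CGTCA", "GTCAG"]
graph = de_bruijn(reads, k=3)
print(walk_unambiguous(graph, "AC"))
```

Branching nodes, where the walk stops, are where steps such as solving tiny repeats and merging bubbles come into play.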
Scalability
                  SOAPdenovo v2   SOAP-Hecate v2.5 (84 cores)   SOAP-Hecate v2.5 (180 cores)
Data Size         670 GB          670 GB                        670 GB
No. of Servers    1               7                             15
Time              59 hours        59 hours                      38 hours
Memory Size       400 GB × 1      24 GB × 7                     24 GB × 15
Mode              Centralized     Distributed                   Distributed
*80X human whole genome
SOAP-Hecate is scalable and uses much less memory per server
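The scalability figures can be turned into two derived numbers: total memory across servers, and how efficiently the run time shrank when going from 84 to 180 cores. The sketch below assumes the memory column is in GB for every entry (only one cell carries a unit) and uses a simple speedup-over-resource-ratio definition of scaling efficiency.

```python
runs = {
    "SOAPdenovo v2": {"servers": 1, "cores": None, "hours": 59, "gb_per_server": 400},
    "SOAP-Hecate v2.5 (84 cores)": {"servers": 7, "cores": 84, "hours": 59, "gb_per_server": 24},
    "SOAP-Hecate v2.5 (180 cores)": {"servers": 15, "cores": 180, "hours": 38, "gb_per_server": 24},
}


def total_memory_gb(run):
    """Aggregate memory across all servers (assumes the slide's unit is GB)."""
    return run["servers"] * run["gb_per_server"]


def scaling_efficiency(base, scaled):
    """Speedup divided by the increase in cores; 1.0 would be perfectly linear."""
    speedup = base["hours"] / scaled["hours"]
    resource_ratio = scaled["cores"] / base["cores"]
    return speedup / resource_ratio


for name, run in runs.items():
    print(f"{name}: {total_memory_gb(run)} GB total, {run['gb_per_server']} GB per server")

small = runs["SOAP-Hecate v2.5 (84 cores)"]
large = runs["SOAP-Hecate v2.5 (180 cores)"]
print(f"Scaling efficiency 84 -> 180 cores: {scaling_efficiency(small, large):.2f}")
```

Read this way, the memory advantage is per server (24 GB nodes instead of one 400 GB machine), while the added cores buy a shorter run at somewhat less than linear efficiency.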
Performance
Assembler          Scaffold N50
SOAP-Hecate        26,570,829
SOAPdenovo         117,000
ALLPATHS           211,000
Phusion2, phrap    495,000
Meraculous         486,000
ABySS              144,300
Tested on simulated data from Assemblathon 1 (Earl, Bradnam et al. 2011)
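For readers unfamiliar with the metric in the table, N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch of the standard definition follows; the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("N50 of an empty collection is undefined")


# Illustrative scaffold lengths, not taken from the Assemblathon data
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher scaffold N50 means the assembly is less fragmented, which is why it is used as the headline comparison here.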
Solutions for Algorithms
Algorithm and Infrastructure
- GPU-based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)
SOAP3: ~20X speed up from SOAP2
SOAP; SOAP2 (2008): 20-30x; SOAP3 (2011, GPU version): 10-30x
[Figure: Total time (seconds), SOAP2 vs SOAP3, for human and zebrafish. Data labels: 1893.45, 10671.39, 211.53, 819.81]
[Figure: Speedup, for human and zebrafish. Data labels: 14.12, 14.6]
[Figure: Alignment ratio (%), SOAP2 vs SOAP3, for human and zebrafish. Data labels: 84.2, 64.49, 88.29, 76.55]
Collaboration with the University of Hong Kong
GSNP: 50X faster than the CPU-based SOAPsnp
[Figure: Elapsed time (sec., log scale)]
- Chr. 1: GSNP 527, SOAPsnp 21879
- Chr. 21: GSNP 73, SOAPsnp 3675
The elapsed time of all steps is included. GSNP is around 50x faster than single-threaded CPU-based SOAPsnp.
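The speedup claim can be checked against the elapsed times read off the two charts. The sketch below simply divides the SOAPsnp time by the GSNP time for each chromosome; it gives roughly 42x for Chr. 1 and roughly 50x for Chr. 21, consistent with "around 50x".

```python
# Elapsed times in seconds, as labelled on the charts
elapsed_sec = {
    "Chr. 1": {"GSNP": 527, "SOAPsnp": 21879},
    "Chr. 21": {"GSNP": 73, "SOAPsnp": 3675},
}


def speedup(times):
    """Ratio of CPU (SOAPsnp) time to GPU (GSNP) time."""
    return times["SOAPsnp"] / times["GSNP"]


for chrom, times in elapsed_sec.items():
    print(f"{chrom}: {speedup(times):.1f}x")
```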
Solutions for Data Management
Data Management
- Data management in BGI
Paradigm Shift
Traditional Model
- Business: determines what question to ask
- IT: structures the data to answer that question
Big Data Model
- IT: delivers a platform to enable creative discovery
- Business: explores what questions could be asked
Information Pyramid
[Diagram: pyramid, from base to top]
- Data (Element)
- Information (Meaning)
- Knowledge (Context)
- Decision (Application)
- Value (Achievement)
Steps between levels: Organizing, Refining, Summarizing, Utilizing
BGI Data Pyramid
- iRODS (Data): Data Preservation, Data Retrieval, Data Sharing
- Database (Information): BGI-SNP, BGI-SV, BGI-GaP, Disease: HGVD/PMRD
- Data Mining (Knowledge): Systems Biology, Drug Discovery
- Health/Clinical APP (Decision): Diagnosis of Genetic Diseases, Drug of Choice
Data Flow
[Diagram: data flow among Sequencer, Raw Data, Data Analysis, Analyzed Data, Data Warehousing, Personalized Analysis and Clinical Diagnosis, supported by iRODS, Metadata, LIMS, BGI-DB, Public Resources and the Knowledge Base (Variant (Gene), Disease, Drug)]
iRODS - integrated Rule-Oriented Data System
*Access data with a web-based browser, the iRODS GUI, or command-line clients.
renci.org
Data Flow - iRODS
iRODS-based Data Management
• Contents: raw data, analyzed data and related metadata
• Data backup
• Fully integrated with LIMS
• Able to search and access any data according to the metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
• Federation: integrate separate iRODS zones
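The idea of finding data by metadata rather than by path can be sketched with a small in-memory catalog. This is only an illustration of attribute-based lookup and does not use the iRODS API: the `DataObject` class, the `search` function, the paths and the attribute values are all hypothetical, with attribute names borrowed from the examples listed above.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    path: str
    metadata: dict = field(default_factory=dict)


def search(catalog, **criteria):
    """Return objects whose metadata matches every attribute=value pair given."""
    return [
        obj for obj in catalog
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalog entries; paths and values are made up
catalog = [
    DataObject("/zone/projA/s1.fq.gz", {"project": "projA", "sample": "s1", "cohort": "c1", "qc": "pass"}),
    DataObject("/zone/projA/s2.fq.gz", {"project": "projA", "sample": "s2", "cohort": "c2", "qc": "fail"}),
    DataObject("/zone/projB/s3.bam", {"project": "projB", "sample": "s3", "cohort": "c1", "qc": "pass"}),
]

for obj in search(catalog, cohort="c1", qc="pass"):
    print(obj.path)
```

In a real deployment the catalog would live in the data management system itself, so the same query works regardless of where the files are physically stored.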
Data Flow - BGI-DB
BGI-DB
• A locus-specific database (LSDB) for all variants identified by BGI
• Manages all basic information generated from data analysis pipelines
• Links all detailed information about individual samples to each variant
• Easy to query information from samples with certain commonality (such as same phenotype, same cohort, etc.)
• Provides the raw information for further data mining steps
Data Flow - BGI-DW & BGI-KB
BGI Data Warehousing & Knowledge Base
• BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs
• BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources
• Periodically and automatically updated
• Provides APIs for bioinformaticians to query the information and generate individualized reports
Data Flow - Success Story
Diagnosis for Monogenic Disease
- Group samples into cohorts based on their phenotypes
- Calculate variant frequencies from certain cohorts and save them into the allele frequency database
- Query the allele frequency database to filter out common variants and identify disease-causal variants
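The monogenic-disease workflow (group samples into cohorts, compute variant frequencies, filter out common variants) can be sketched as follows. This is a simplified illustration under stated assumptions, not BGI's pipeline: samples are treated as diploid, genotypes are stored as alternate-allele counts, and the function names, variant identifiers and frequency threshold are all hypothetical.

```python
from collections import defaultdict


def group_by_phenotype(samples):
    """samples: {sample_id: phenotype}. Returns {phenotype: [sample_ids]}."""
    cohorts = defaultdict(list)
    for sample_id, phenotype in samples.items():
        cohorts[phenotype].append(sample_id)
    return dict(cohorts)


def allele_frequencies(genotypes, cohort):
    """genotypes: {sample_id: {variant: alt allele count}}; assumes diploid samples."""
    counts = defaultdict(int)
    for sample_id in cohort:
        for variant, alt_count in genotypes.get(sample_id, {}).items():
            counts[variant] += alt_count
    total_alleles = 2 * len(cohort)
    return {variant: n / total_alleles for variant, n in counts.items()}


def rare_variants(patient_variants, frequency_db, max_af):
    """Keep variants that are absent from the database or rarer than max_af."""
    return [v for v in patient_variants if frequency_db.get(v, 0.0) < max_af]


# Made-up example data
samples = {"s1": "healthy", "s2": "healthy", "s3": "healthy", "s4": "healthy", "s5": "healthy"}
genotypes = {
    "s1": {"chr1:100A>G": 2, "chr2:200C>T": 0},
    "s2": {"chr1:100A>G": 1},
    "s3": {"chr1:100A>G": 1, "chr2:200C>T": 1},
    "s4": {},
    "s5": {"chr1:100A>G": 0},
}
cohorts = group_by_phenotype(samples)
frequency_db = allele_frequencies(genotypes, cohorts["healthy"])
patient = ["chr1:100A>G", "chr2:200C>T", "chr7:300G>A"]
print(rare_variants(patient, frequency_db, max_af=0.05))
```

Variants that survive the filter are only candidates; in practice they would still need annotation and interpretation before any causal claim.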
Summary of Our Practice in IT infrastructure
Data transfer
- Solution I: Hard drive shipment (with FedEx)
- Solution II: High-speed data transfer
- Solution III: Don't move the data (Cloud Computing)
Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution
Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU-based acceleration
Data Management
- Using iRODS to manage big data
Acknowledgement
Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
- GPU Lab: Bingqiang Wang, etc.
Test & QA Team
- Xin Guan, Jingjuan Liu, etc.
PMO & IT Operation
- Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team
- Xing Xu, Jing Guo, Fang Fang, etc.
Other BGI Teams
Collaborators:
- University of Hong Kong (HKU)
- Hong Kong University of Science and Technology (HKUST)
- Nvidia
- Aspera
- RENCI
- Tianjin Supercomputing Center