TM
High Performance Computing in Bioinformatics
http://www.sgi.com/solutions/sciences/chembio/
TM
Abstract
The current presentation outlines the efforts undertaken by SGI Application Engineering to implement scalable, high-throughput bioinformatics algorithms that improve the cycle time for life sciences computing. This is typified by the following examples:
(i) BLAST is one of the most widely used similarity search tools in computational biology. It rapidly identifies statistically significant matches between newly sequenced segments of nucleic acids or proteins and databases of known nucleotide or amino acid sequences. Such searches make it possible to draw inferences about the structure and function of biomolecules, or to screen new sequences for further investigation using more sensitive and computationally expensive methods. SGI undertook a project that speeds up the application by making more efficient use of larger numbers of processors, allowing high volumes of genetic sequences to be compared with other genetic sequences.
(ii) Multiple sequence alignments (MA) represent a class of powerful bioinformatics tools with many uses in biology. Knowledge of MA helps to predict secondary and tertiary structures and to detect homologies between newly sequenced genes and existing gene and protein families. With high-throughput automation techniques offering scientists significantly more data with which to make better decisions, it is increasingly important to run MA calculations as quickly as possible to allow real-time decision-making in the lab. The popular MA application Clustal W provides a very good example of how the resources of multiprocessor shared memory computers can be utilized more efficiently. SGI bioinformatics engineers parallelized and optimized Clustal W; this version shows very good scalability and significantly reduces the time required for data analysis by minimizing turnaround time and increasing overall throughput.
Finally, the presentation will include examples of work involving leading software developers and research labs and SGI engineers to demonstrate the feasibility of solving difficult number crunching scientific problems.
TM
Agenda
Trends in Life Sciences Research
SGI in Bioinformatics: High Throughput Algorithms
• HT BLAST
• HT CLUSTAL W
Scientific Crunch Projects
TM
The Data Explosion
Gene Sequences (NCBI Web Site)
June 2000: 4,500 Million Base Pairs
February 2001: 11,000 Million Base Pairs
Proteins (PDB Web Site)
1995: 4,056 Structures
2000: 12,777 Structures
TM
Bioinformatics…the future
Full annotation of genome
• Functional genomics for an entire genome
– what gene does what, when, how, why
– genetic interactions - interdependencies - complexes
– target and lead identification for disease states of the genome
• High Throughput methods
– microarray/gene chips
– in-Silico methods and analysis
Proteomics
• Functional proteomics for the full proteome
– structure/function prediction
– target/lead identification for disease states of the proteome
– Computer methods and analysis
TM
Pharmaceutical Drug Discovery
High R&D Costs, High yields
• Top drugs yield big annual revenues
– Prilosec (Astra) $6B
– Zocor (Merck) $4B
– Prozac (Lilly) $3B
• Average cost to bring to market >$500M
• Average time to market: 10-12 years
• 60% fail to recover R&D costs
Need to halve discovery time and triple the number of drug candidates
Speed discovery time by using new technologies and alliances
• Bioinformatics
• Computational Chemistry
• Combinatorial Chemistry & HTS
• Data mining
• University partnerships
Figure: R&D costs per approved drug ($ million, axis 0 to 500) for 1976, 1986, 1987, 1990, 1995
TM
Research Requirements for Information Technologies
Figure: data volume vs. time for genetic, chemical & biological data
• Store genetic, chemical, biological information in databases
• Develop tools for data analysis, knowledge management, collaboration
• Integrate multiple data sources and computational methods
• Build systems that scale to address rapid growth in data
TM
Research Computing Environment
Data Serving
• Data management, retrieval & analysis of chemical, biological & genomic information
Computational Serving
• Computational Chemistry
• Bioinformatics
Visualization
• Reality Centers & workstations for visualization and group decision making
Diagram labels: Research IT, Research Computing, SAN w/CXFS
TM
Large Scale Single System Images
• 64-bit architecture: processors, memory, file systems
• Large memory
– Allows large data to be memory resident for fast access
• Large bandwidth/low latency
– Time to access data in memory is a function of how long you wait for the data to become available (latency), and how fast you can push the data to or from the CPU (bandwidth)
• Large I/O
– Fast networks (internal processor net is at least an order of magnitude faster than COTS networks)
– Fast disks (multiple GB databases require multiple GB of I/O)
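The latency and bandwidth point above can be illustrated with a common first-order model, time = latency + bytes / bandwidth. This model is an assumption added here for illustration, not something stated on the slide, and the function name and the numbers in the example are hypothetical.

```python
def access_time(n_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """First-order estimate (assumed model) of the time to move n_bytes.

    latency_s: how long you wait before the first byte is available.
    bandwidth_bytes_per_s: how fast data can be pushed to or from the CPU.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + n_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    latency = 1e-6    # hypothetical: 1 microsecond
    bandwidth = 1e9   # hypothetical: 1 GB/s
    for size in (8, 1_000_000, 1_000_000_000):
        print(size, access_time(size, latency, bandwidth))
```

In this model small transfers are dominated by latency and large transfers by bandwidth, which is why the slide lists both as requirements.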
TM
Large Scale Single System Images
Use of these systems is indicated when the analysis algorithm requires:
– High performance single jobs (fine-grain parallel) with large data requirements, i.e., when turn-around time on single jobs is more important
– High performance multiple jobs (coarse-grain parallel) with large data requirements
– Origin as leading platform for capability computing (IRIX features: checkpoint/restart, weightless threads... and scalability)
– Large memory, large number of simultaneous parallelized jobs, large databases
– Long running jobs, production systems also used by interactive users
TM
Agenda
Trends in Life Sciences Research
SGI in Bioinformatics: High Throughput Algorithms
• HT BLAST
• HT CLUSTAL W
Scientific Crunch Projects
TM
Applications Engineering Team
Application vendor technical relations
• Maintenance, support and optimization
• Parallelization: Throughput and turnaround
Chemistry
• OpenMP and MPI parallelization of quantum and mechanics apps
Bioinformatics
• High-throughput calculations
• Multiple Sequence Alignments
Chemical, Biological and Genomics Databases
• Configuration guides, Oracle 8i
Reality Center Solutions
• Display lists
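TM
Basic parallel concepts
Amdahl’s law (no load balance problems):
$\mathrm{Speedup}_{\max} = \dfrac{1}{(1 - P) + P/N}$
where P is the parallel portion of the code and N is the number of processors.
Figure: execution timelines for CPU1 to CPU4 showing parallel code, serial code, and a load balance problem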
TM
Basic parallel concepts
Theoretical maximum speedups
Figure: theoretical maximum speedup (axis 0 to 60) vs. number of CPUs (1 to 128) for P=95% and P=99%, with an inset of serial and parallel time vs. CPUs
Performance is limited by the remaining serial portion of the code
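A minimal sketch of Amdahl's law as used on these slides, evaluated for the two parallel fractions shown in the chart (P=95% and P=99%). The function names are chosen here for illustration.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical maximum speedup for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p: float) -> float:
    """Upper bound on speedup as n grows without limit: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        print(n, round(amdahl_speedup(0.95, n), 1), round(amdahl_speedup(0.99, n), 1))
```

With P=95% the speedup can never exceed 20 regardless of processor count, which is the sense in which the serial portion limits performance.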
TM
SGI Added value: High Throughput BLAST
Figure: parallel speedup (axis 0 to 250) vs. number of processors on SGI 2000 series, 300 MHz, 8MB; chart annotation: 99.94%
Number of processors: 4, 8, 16, 32, 64, 128, 256
Time in sec: 59,373; 29,730; 14,875; 7,592; 3,857; 1,982; 1,042
The test consists of using 2,500 EST sequences (ranging in size from 253 to 509 bp) to search two databases (Genpept: 130,988,238 total letters in 423,994 sequences; GenBank 111: 3,916,984,546 total letters in 4,207,758 sequences).
3 days on 1 processor
17 minutes on 256 processors
Customer Problem
• Production screening of large numbers of genetic sequences
Solution
• SGI Engineering re-designed program executable to support large volume queries
Results
• Greatly improved throughput
• SGI Origin scales to 256 processors
• More efficient use of resources
• HT BLAST availability (free): www.sgi.com/sciences/solutions/chembio/tech_resources.html
• >200 unique downloads
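As a rough check of the scaling on this slide, the sketch below computes speedup and efficiency relative to the smallest processor count. The elapsed times are the chart labels from the slide; pairing them with processor counts in descending order of time (59,373 sec at 4 processors down to 1,042 sec at 256) is an assumption, and the function names are chosen here for illustration.

```python
# Elapsed times in seconds taken from the chart labels. The mapping of each
# time to a processor count is an assumption (longest time = fewest CPUs).
timings = {4: 59373, 8: 29730, 16: 14875, 32: 7592, 64: 3857, 128: 1982, 256: 1042}


def relative_speedup(times):
    """Speedup of each run relative to the run with the fewest processors."""
    base = min(times)
    return {n: times[base] / t for n, t in sorted(times.items())}


def relative_efficiency(times):
    """Relative speedup divided by the ideal speedup (n / base)."""
    base = min(times)
    return {n: s * base / n for n, s in relative_speedup(times).items()}


if __name__ == "__main__":
    eff = relative_efficiency(timings)
    for n, s in relative_speedup(timings).items():
        print(f"{n:4d} CPUs  speedup vs 4 CPUs: {s:6.2f}  efficiency: {eff[n]:.2f}")
```

Under that mapping, going from 4 to 256 processors (64 times the hardware) cuts the run time by a factor of about 57.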
TM
SGI Added Value: HT-BLAST vs. NCBI-BLAST
Relative performance increases with number of CPUs: 3x with 16 CPUs, 7x with 64 CPUs
Figure: parallel speedup (axis 0 to 55) vs. number of processors for NCBI and HT-BLAST; chart annotations: 85.0% (NCBI), 99.5% (HT-BLAST)
Number of processors: 1, 4, 8, 16, 32, 64
NCBI, time in sec: 34,900; 11,530; 8,643; 6,673; 6,947; 5,111
HT-BLAST, time in sec: 33,834; 8,519; 4,420; 2,077; 1,362; 711
The benchmark consists of using 1,000 EST query sequences (233,708 total letters in 1000 sequences) to perform a BLASTN search against a Non-Redundant Nucleotide database (1,961,177,913 total letters in 614,801 sequences), as well as a BLASTX search against a Non-Redundant Protein database (157,988,254 total letters in 503,479 sequences).
TM
HT BLAST Scalability
HT BLAST on SGI Origin 2000, 256 R12000 processors
2500 ESTs vs. gb111 via BLASTN and vs. GENPEPT via BLASTX
Figure: speedup (axis 0 to 256) vs. number of processors (0 to 256); data labels give time in hours: 6.51, 3.26, 1.64, 0.83, 0.43
TM
Clustal W overview (Thompson J. et al., Nucl. Acids Res., 22, 4673, 1994)
• Pairwise (PW) alignment matrix
– An average alignment calculation spends most of its time here
– Easy to parallelize, as all elements are independent
• Guide tree calculation
– Calculation of closest sequences (branch) can be parallelized
– Together with the PW matrix calculation, Clustal W is ~85-92% parallel
• Progressive alignment
– The remaining ~5-10% of the code can be parallelized at this stage by calculating profile scores in parallel
– As a result, the whole application is ~93-98% parallel, depending on the size of the problem
Example pairwise distance matrix:
--
17 --
59 60 --
59 59 13 --
77 77 75 75 --
TM
Parallel Clustal W
• Serial calculation of pairwise distance matrix elements
– N(N-1)/2 individual pairwise score calculations
• Parallel calculation of distance matrix elements
– All calculations are independent and of different sizes
– Dynamic scheduling to minimize load balance problems
Figure: pairwise distance matrix elements distributed across CPU1 to CPU4
--
17 --
59 60 --
59 59 13 --
77 77 75 75 --
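The idea of dynamically scheduling the N(N-1)/2 independent pairwise calculations can be sketched as follows. This is an illustration only, not the SGI implementation: `toy_distance` is a hypothetical placeholder for a real pairwise alignment score, and a thread pool is used for brevity where a CPU-bound production code would use processes or OpenMP-style threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Placeholder score: fraction of mismatches over the shorter length.

    Stands in for the real pairwise alignment that Clustal W performs.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / n


def pairwise_matrix(seqs, workers=4):
    """Compute all N(N-1)/2 pairwise distances with a pool of workers."""
    pairs = list(combinations(range(len(seqs)), 2))
    # Tasks differ in size, so queue the largest first; each worker then
    # picks up the next pending pair as soon as it is free (dynamic scheduling).
    pairs.sort(key=lambda ij: len(seqs[ij[0]]) * len(seqs[ij[1]]), reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(lambda ij: toy_distance(seqs[ij[0]], seqs[ij[1]]), pairs)
        return dict(zip(pairs, scores))


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TTTT", "AC"]
    for (i, j), d in sorted(pairwise_matrix(seqs).items()):
        print(i, j, d)
```

Because no matrix element depends on another, the only coordination needed is handing out the next pair, which is why this stage parallelizes so easily.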
TM
Parallel Clustal W scalability
Figure: speedup (axis 0 to 12) vs. CPUs (1, 2, 4, 8, 16) for 100 sequences and 600 sequences, with theoretical curves for P=92.0% and P=96.3%
Speedups for G-protein coupled receptor (GPCR) sequence alignments, with lines representing theoretical (Amdahl’s) speedups
• Alignments for larger numbers of sequences scale better because:
– more time is spent in the parallel pairwise matrix part (~N^2)
– load balance problems are smaller because of an averaging effect
TM
High Throughput Clustal W
• 100 input files were generated by randomly choosing a number of sequences out of a pool of 1000 GPCR sequences (average length 390 a.a.), thus generating a heterogeneous mix of Clustal W jobs with the following profile
Figure: histogram of number of input files (axis 0 to 25) vs. number of sequences (axis 0 to 120)
• This profile tries to reproduce a “production” HT environment with a heterogeneous mix of multiple alignments
TM
Managing load balance for heterogeneous Clustal W jobs
Figure: two panels, Unsorted and Sorted, each showing CPU (1 to 29) vs. Time, sec (axis 0 to 300)
• Presorting input files by size reduces the load balance problems
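The effect of presorting can be illustrated with a small scheduling simulation: each job goes to whichever CPU becomes free first, and the finish time of the last CPU is compared for unsorted and sorted job lists. Sorting largest-first is an assumption (the slide only says the files were sorted by size), and the job times below are hypothetical.

```python
import heapq


def makespan(job_times, n_cpus):
    """Finish time when each job is given to the CPU that frees up first."""
    free_at = [0.0] * n_cpus
    heapq.heapify(free_at)
    for t in job_times:
        earliest = heapq.heappop(free_at)      # next CPU to become idle
        heapq.heappush(free_at, earliest + t)  # it runs this job next
    return max(free_at)


def presorted(job_times):
    """Largest jobs first (assumed sort order)."""
    return sorted(job_times, reverse=True)


if __name__ == "__main__":
    jobs = [1, 1, 1, 1, 4]  # hypothetical run times; the big job arrives last
    print("unsorted:", makespan(jobs, 2))
    print("sorted:  ", makespan(presorted(jobs), 2))
```

In the unsorted case the large job starts late and leaves the other CPU idle at the end; starting it first lets the small jobs fill the gap.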
TM
Scalability of HT-Clustal W
Figure: speedup (axis 0 to 36) and sequences/hour (axis 0 to 1.E+05) vs. number of CPUs (1, 4, 8, 16, 32) for unsorted and sorted input
• HT Clustal W has close to linear scalability on Origin2000
• Presorting helps when the number of input files is comparable to the number of CPUs (the same profile of 100 input files was used for this plot)
TM
System oversubscription
• If the amount of parallelism in the individual calculations of an HT production run is not very large, it can be advantageous to oversubscribe a system (starting more processes than the number of CPUs available), thus reducing idle CPU time and increasing effective parallelism.
Figure: speedup (axis 0 to 9) vs. oversubscription (0.25 to 2.5)
Results are shown for running parallel Clustal W jobs (each using 2 CPUs) on an 8-CPU Origin2000.
P ~ 90%
Oversubscription = Nprocesses / NCPUs
• In this case, oversubscribing the system by more than 50% makes it possible to reach close to the maximum (8x) speedup on an 8-CPU Origin2000
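Why oversubscription helps can be sketched with a deliberately idealized model, which is an assumption added here and not the method behind the measured curve. Each 2-CPU job with P = 90% keeps only one CPU busy during its serial part, so on average it demands fewer than two CPUs; starting more processes than CPUs fills that idle time. The model ignores scheduling overhead and imperfect interleaving, so it presumably saturates earlier than a real system would.

```python
def modeled_speedup(n_processes, n_cpus=8, threads_per_job=2, p=0.9):
    """Idealized throughput speedup relative to one serial job on one CPU.

    Assumed model: a job normalized to 1.0 unit of CPU work runs its serial
    part (1 - p) on one CPU and its parallel part on threads_per_job CPUs.
    """
    elapsed = (1.0 - p) + p / threads_per_job   # wall time of one job
    cpu_work = 1.0                              # CPU time one job consumes
    jobs_in_flight = n_processes / threads_per_job
    cpu_demand = jobs_in_flight * cpu_work / elapsed  # average busy CPUs wanted
    return min(cpu_demand, float(n_cpus))       # cannot exceed the hardware


if __name__ == "__main__":
    for n_proc in (2, 4, 8, 12, 16, 20):
        ratio = n_proc / 8                      # Oversubscription = Nprocesses / NCPUs
        print(f"oversubscription {ratio:4.2f}  modeled speedup {modeled_speedup(n_proc):.2f}")
```

At an oversubscription of 1.0 the model leaves part of the machine idle (about 7.3x rather than 8x), and only with extra processes does it reach the 8-CPU ceiling, which matches the direction of the result reported on the slide.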
TM
Parallel Clustal W example (Yuan J. et al., Bioinformatics, 15, 862, 1999)
Optimization of input parameters (scoring matrices, gap penalties) requires many repetitive Clustal W calculations with various input parameters.
Figure: Clustal W stages PW, Tree, MA; vary PW parameters and MA parameters to obtain a better alignment
Figure: SH3 family multiple alignment (parallel Clustal W 1.8), default parameters vs. optimized parameters
TM
MULTICLUSTAL scalability
Figure: speedup (axis 0 to 7) vs. CPUs (1, 2, 4, 8, 16)
• Parallel Clustal W significantly increases MULTICLUSTAL performance
• MULTICLUSTAL input parameter optimization for 100 GPCR sequences (average length 339 a.a.) using parallel Clustal W 1.8
• Parallel performance is limited by the MULTICLUSTAL score calculation step
TM
Modified MULTICLUSTAL
• Original MULTICLUSTAL - optimizing PW & MA parameters independently
MA parameters
PW
• Modified MULTICLUSTAL - reuses tree from PW optimization
Better performance
TM
Modified MULTICLUSTAL scalability
Figure: relative speed (axis 0 to 10) vs. CPUs (1, 2, 4, 8, 16) for the original and modified programs
• Modified MULTICLUSTAL is ~1.5 - 3 times faster than the original program
• Enhancing modifications were reported to the authors at Merck and incorporated in the current code
• MULTICLUSTAL input parameter optimization for 100 GPCR sequences (average length 339 a.a.)
TM
High Throughput Proteomics
Problem: Customers need efficient “dataflow” for proteomics facilities
• Multiple instruments
• Continuous operations
Solution: Couple processing operations
• “Fuse” the data acquisition with the Proteomics data analysis
• Use coarse-grain parallelism
Results
• Efficient data processing
• Data usable throughout organization
Figure: parallel speedup (axis 0 to 35) vs. number of processors; 99.76% parallelism
Number of processors: 1, 2, 4, 8, 16, 24, 32
Elapsed time in seconds: 26,344; 13,198; 6,613; 3,346; 1,721; 1,170; 904
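A parallel fraction like the one quoted on this slide can be estimated from elapsed times by inverting Amdahl's law: with S = T(1)/T(N), P = (1 - 1/S) / (1 - 1/N). The sketch below applies this to the chart labels; pairing each time with a processor count (longest time on 1 processor, shortest on 32) is an assumption, and how the slide's own figure was derived is not stated.

```python
# Elapsed times in seconds from the chart labels; the mapping to processor
# counts is an assumption (longest time = 1 processor).
timings = {1: 26344, 2: 13198, 4: 6613, 8: 3346, 16: 1721, 24: 1170, 32: 904}


def parallel_fraction(t1: float, tn: float, n: int) -> float:
    """Invert Amdahl's law for one measurement on n > 1 processors."""
    speedup = t1 / tn
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


def estimates(times):
    """Parallel-fraction estimate for every multi-processor run."""
    t1 = times[1]
    return {n: parallel_fraction(t1, t, n) for n, t in sorted(times.items()) if n > 1}


if __name__ == "__main__":
    est = estimates(timings)
    for n, p in est.items():
        print(f"{n:3d} processors: P = {100 * p:.2f}%")
    print(f"mean: {100 * sum(est.values()) / len(est):.2f}%")
```

Under that mapping, every per-run estimate falls between 99.6% and 99.9%, in the same range as the parallelism figure shown on the chart.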
TM
Agenda
Trends in Life Sciences Research
SGI in Bioinformatics: High Throughput Algorithms
• HT BLAST
• HT CLUSTAL W
Scientific Crunch Projects
TM
“Number Crunching” Scientific Projects with partners/customers
GeneCrunch: first full genome annotation (yeast)
• EMBL/EBI (Chris Sanders)
3-D Crunch: prediction of 3-D protein structures (SWISS-PROT)
• GlaxoWellcome/MSI
Microsecond protein simulation
• UCSF (Science)
DockCrunch: docking simulation of 1,000,000 compounds
• Protherics plc.
Virtual Screening of 850,000 compounds
• Tripos
Virtual Docking project with 800,000 compounds
• Metaphorics
Genome Annotation: New 3D annotations for approximately 27,000 sequences on 256-procs
• MSI
3D Functional Annotation of 3855 sequences in V. Cholerae genome on 256-procs
• MSI (Nature)
Another 3 in the pipeline presently
TM
Capability Demonstration
SGI and MSI Cholera Crunch
Problem:
• Only 60-80% of cholera vaccines are effective. Need new drugs & therefore need to understand 3D structures of potential protein targets
Solution:
• 3D Protein annotation of 3855 sequences in V. cholerae genome (bacterium)
– Data from: Nature 406, 477 - 483 (2000) © Macmillan Publishers Ltd.
• MSI GeneAtlasTM pipeline software
• SGI Origin server with 256 Processors and 256GB memory
Results
• 7 days of computation
• 76% functional annotation vs. 54% in literature
• Results available in MSI AtlasBaseTM for MSI Functional Genomics Consortium members. Will be published in early 2002 in “Protein Structure Determination, Analysis and Modeling for Drug Discovery”
Post Genomics
(see Press release on Jan 23, 2001)
TM
Evolution of AIDS Virus
Problem: Trace the evolution of the AIDS Virus
Solution:
• 16x128 Processor Cluster of SGI Origin servers at Los Alamos
• Calculate rate of virus mutation as a function of time - predict the origin of the disease
Results:
• The worldwide AIDS epidemic has been traced back to a single viral ancestor -- the HIV “Eve” -- that emerged perhaps around 1930.
Citation: Dr. Bette Korber, LANL
http://w10.lanl.gov:80/physics/abst01_06.html
Throughput & Turnaround
TM
Functional Genomics - MSI
Need
• Production screening of large numbers of genetic sequences
• Commercial and non-commercial genome projects
Solution
• SGI Origin2000 w/256 processors
• MSI Gene Atlas -- High-throughput automated pipeline for functional annotation of protein sequences
• MSI AtlasBase 22 complete Genome database
Results
• New 3D annotations for approximately 27,000 sequences
Normally a project like this would have been expected to take 2.5 CPU years. By working jointly with SGI, it was possible to complete it in a week
Dr. Michael Pear, Director, Protein Bioinformatics at MSI.
Throughput & Turnaround
TM
MineSet - Data Mining and Visualization
Visualization for Bioinformatics
Genome Comparison M. genitalium x H. influenzae
TM
Visualization for Bioinformatics
Reality Centers Enable (Local or Global) Group Decisions on
– Proteins, Genes
– How two molecules ‘fit’ or ‘interact’
– Shape & size of new chemical entities
Customers are Research teams comprised of
– Scientists
– Management
Using “Standard” Scientific Applications:
– Tripos SYBYL 6.61 & 6.62 w/MOLCAD
– MSI InsightII-2000
– VMD (w/CaveLib)
– UCSF Chimera
– SGI MineSet
– AVS
TM
Visualization without the restrictions of a desktop monitor
• An immersive DISPLAY ENVIRONMENT for real-time collaborative viewing and manipulation of data
• An innovative WORK ENVIRONMENT to truly interact with virtual prototypes and large data sets
Reality Centers for Life Sciences
TM
Conclusion: SGI for Discovery Research
• Need for better, faster research decision making
• Tools for Bioinformatics and Chemistry
– SGI Scalable Servers
– Bioinformatics and Chemistry application availability
– SGI enabled scalable Bioinformatics and Chemistry algorithms
– Advanced Visualization Environments
• Expert team committed to discovery research customers - Global Scientific Support
• Activity in other areas: KM/DM, ASP, LIMS, IA-64