High Performance Computing in Bioinformatics


http://www.sgi.com/solutions/sciences/chembio/


Abstract

This presentation outlines the efforts undertaken by SGI Application Engineering to implement scalable, high-throughput bioinformatics algorithms that improve cycle time for life sciences computing. Two examples typify this work.

(i) BLAST is one of the most widely used similarity search tools in computational biology. It rapidly identifies statistically significant matches between newly sequenced segments of nucleic acids or proteins and databases of known nucleotide or amino acid sequences. Such searches make it possible to infer the structure and function of biomolecules, or to screen new sequences for further investigation with more sensitive and computationally expensive methods. SGI undertook a project that speeds up the application by making more efficient use of larger numbers of processors, allowing high volumes of genetic sequences to be compared to other genetic sequences.

(ii) Multiple sequence alignments (MA) are a class of powerful bioinformatics tools with many uses in biology. Knowledge of MA helps to predict secondary and tertiary structures and to detect homologies between newly sequenced genes and existing gene and protein families. As high-throughput automation techniques give scientists significantly more data on which to base decisions, it is increasingly important to run MA calculations as quickly as possible to allow real-time decision-making in the lab. The popular MA application Clustal W is a very good example of how the resources of multiprocessor shared-memory computers can be used more efficiently. SGI bioinformatics engineers parallelized and optimized Clustal W; the resulting version scales very well and significantly reduces the time required for data analysis by minimizing turnaround time and increasing overall throughput.

Finally, the presentation includes examples of joint work by leading software developers, research labs, and SGI engineers that demonstrate the feasibility of solving difficult number-crunching scientific problems.


Agenda

Trends in Life Sciences Research

SGI in Bioinformatics: High Throughput Algorithms

• HT BLAST

• HT CLUSTAL W

Scientific Crunch Projects


The Data Explosion

Gene Sequences (NCBI Web Site)
• June 2000: 4,500 Million Base Pairs
• February 2001: 11,000 Million Base Pairs

Proteins (PDB Web Site)
• 1995: 4,056 Structures
• 2000: 12,777 Structures


Bioinformatics…the future

Full annotation of genome
• Functional genomics for an entire genome
  – what gene does what, when, how, why
  – genetic interactions - interdependencies - complexes
  – target and lead identification for disease states of the genome
• High Throughput methods
  – microarray/gene chips
  – in-silico methods and analysis

Proteomics
• Functional proteomics for the full proteome
  – structure/function prediction
  – target/lead identification for disease states of the proteome
  – Computer methods and analysis


Pharmaceutical Drug Discovery

High R&D Costs, High yields
• Top drugs yield big annual revenues
  – Prilosec (Astra) $6B
  – Zocor (Merck) $4B
  – Prozac (Lilly) $3B
• Average cost to bring to market >$500M
• Average time to market: 10-12 years
• 60% fail to recover R&D costs

Need to halve discovery time and achieve a 3x increase in drug candidates

Speed discovery time by using new technologies and alliances
• Bioinformatics
• Computational Chemistry
• Combinatorial Chemistry & HTS
• Data mining
• University partnerships

[Chart: R&D costs per approved drug ($ million, 0-500) for 1976, 1986, 1987, 1990, 1995]

Research Requirements for Information Technologies

[Chart: volume of genetic, chemical & biological data vs. time]

• Store genetic, chemical, biological information in databases

• Develop tools for data analysis, knowledge management, collaboration

• Integrate multiple data sources and computational methods

• Build systems that scale to address rapid growth in data

Research Computing Environment

Data Serving
• Data management, retrieval & analysis of chemical, biological & genomic information

Computational Serving
• Computational Chemistry
• Bioinformatics

Visualization
• Reality Centers & workstations for visualization and group decision making

[Diagram labels: Research IT, Research Computing, SAN w/CXFS]


Large Scale Single System Images

• 64-bit architecture: processors, memory, file systems

• Large memory
  – Allows large data to be memory resident for fast access

• Large bandwidth/low latency
  – Time to access data in memory is a function of how long you wait for the data to become available (latency), and how fast you can push the data to or from the CPU (bandwidth)

• Large I/O
  – Fast networks (internal processor net is at least an order of magnitude faster than COTS networks)
  – Fast disks (multiple GB databases require multiple GB of I/O)


Large Scale Single System Images

Use of these systems is indicated when the analysis algorithm requires:

– High performance single jobs (fine-grain parallel) with large data requirements, i.e., when turn-around time on single jobs is more important

– High performance multiple jobs (coarse-grain parallel) with large data requirements

– Origin as leading platform for capability computing (IRIX features: checkpoint/restart, weightless threads... and scalability)

– Large memory, large number of simultaneous parallelized jobs, large databases

– Long running jobs, production systems also used by interactive users


Applications Engineering Team

Application vendor technical relations
• Maintenance, support and optimization
• Parallelization: Throughput and turnaround

Chemistry
• OpenMP and MPI parallelization of quantum and mechanics apps

Bioinformatics
• High-throughput calculations
• Multiple Sequence Alignments

Chemical, Biological and Genomics Databases
• Configuration guides, Oracle 8i

Reality Center Solutions
• Display lists

Basic parallel concepts

Amdahl's law (no load balance problems)

$$\mathrm{Speedup}_{\max} = \frac{1}{(1 - P) + P/N}$$

P - parallel portion of code
N - number of processors

[Diagram: serial code and parallel code on CPU1-CPU4, illustrating a load balance problem]
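Amdahl's law as stated on this slide can be evaluated directly. The following is a minimal sketch in Python; the function names `amdahl_speedup` and `speedup_limit` are illustrative and do not come from the presentation. It tabulates the theoretical maximum speedup for the two parallel fractions used in the deck (P=95% and P=99%).

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical maximum speedup for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p: float) -> float:
    """Upper bound on speedup as n grows without limit: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    cpu_counts = (1, 2, 4, 8, 16, 32, 64, 128)
    for p in (0.95, 0.99):
        row = ", ".join(f"{n}: {amdahl_speedup(p, n):.1f}" for n in cpu_counts)
        print(f"P={p:.0%} -> {row} (limit {speedup_limit(p):.0f}x)")
```

The limit 1/(1 - P) makes the point of the slide concrete: with P=95% no number of processors can deliver more than a 20x speedup.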

Basic parallel concepts: Theoretical maximum speedups

[Chart: speedup (0-60) vs. CPUs (1-128) for P=95% and P=99%]

Performance is limited by the remaining serial portion of the code

[Diagram: serial and parallel time vs. CPUs]

SGI Added Value: High Throughput BLAST

Customer Problem
• Production screening of large numbers of genetic sequences

Solution
• SGI Engineering re-designed program executable to support large volume queries

Results
• Greatly improved throughput
• SGI Origin scales to 256 processors
• More efficient use of resources
• HT BLAST Availability (free): www.sgi.com/sciences/solutions/chembio/tech_resources.html
• >200 unique downloads

[Chart: parallel speedup (0-250) vs. number of processors, SGI 2000 series 300 MHz 8MB; labels give time in sec; chart label: 99.94%]

Number of Processors    Time (sec)
4                       59,373
8                       29,730
16                      14,875
32                      7,592
64                      3,857
128                     1,982
256                     1,042

The test consists of using 2,500 EST sequences (ranging in size from 253 to 509 bp) to search two databases (Genpept: 130,988,238 total letters in 423,994 sequences; GenBank 111: 3,916,984,546 total letters in 4,207,758 sequences).

3 days on 1 processor, 17 minutes on 256 processors

SGI Added Value: HT-BLAST vs. NCBI-BLAST

Relative performance increases with number of CPUs: 3x with 16 CPUs, 7x with 64 CPUs

[Chart: parallel speedup (0-55) vs. number of processors for NCBI BLAST and HT-BLAST; labels give time in sec; chart labels: 85.0%, 99.5%]

Number of Processors    NCBI (sec)    HT-BLAST (sec)
1                       34,900        33,834
4                       11,530        8,519
8                       8,643         4,420
16                      6,673         2,077
32                      6,947         1,362
64                      5,111         711

The benchmark consists of using 1,000 EST query sequences (233,708 total letters in 1000 sequences) to perform a BLASTN search against a Non-Redundant Nucleotide database (1,961,177,913 total letters in 614,801 sequences), as well as a BLASTX search against a Non-Redundant Protein database (157,988,254 total letters in 503,479 sequences).
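The percentage labels on this chart (85.0% and 99.5%) can be related to the timings by inverting Amdahl's law: P = (1 - 1/S) / (1 - 1/N), where S is the measured speedup on N processors. The sketch below is a rough illustration only. It assumes the chart's time labels read 34,900 s and 5,111 s on 1 and 64 processors for NCBI BLAST, and 33,834 s and 711 s for HT-BLAST; that assignment is an interpretation of the extracted labels. The function name `parallel_fraction` is hypothetical. A two-point estimate like this need not match a value fitted over all data points.

```python
def parallel_fraction(t1: float, tn: float, n: int) -> float:
    """Invert Amdahl's law: estimate P from run times on 1 and n processors."""
    if n < 2:
        raise ValueError("n must be at least 2")
    speedup = t1 / tn
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Times in seconds, as read from the chart labels (assumed assignment).
timings = {"NCBI BLAST": (34_900, 5_111), "HT-BLAST": (33_834, 711)}

for name, (t1, t64) in timings.items():
    p = parallel_fraction(t1, t64, 64)
    print(f"{name}: speedup {t1 / t64:.1f}x on 64 CPUs, P ~ {p:.1%}")
```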

HT BLAST Scalability

HT BLAST on SGI Origin 2000, 256 R12000 processors: 2,500 ESTs vs. gb111 via BLASTN and vs. GENPEPT via BLASTX

[Chart: speedup (0-256) vs. number of processors (0-256); point labels give time in hours: 6.51, 3.26, 1.64, 0.83, 0.43]

Clustal W overview (Thompson J. et al., Nucl. Acids Res., 22, 4673, 1994)

• Pairwise (PW) alignment matrix
  – an average alignment calculation spends most of its time here
  – easy to parallelize, as all elements are independent

• Guide tree calculation
  – calculation of closest sequences (branch) can be parallelized
  – together with the PW matrix calculation, Clustal W is ~85-92% parallel

• Progressive alignment
  – the remaining ~5-10% of the code can be parallelized at this stage by calculating profile scores in parallel
  – as a result, the whole application is ~93-98% parallel, depending on the size of the problem

[Example pairwise distance matrix:]

--
17 --
59 60 --
59 59 13 --
77 77 75 75 --

Parallel Clustal W

• Serial calculation of pairwise distance matrix elements
  – N(N-1)/2 individual pairwise score calculations

• Parallel calculation of distance matrix elements
  – all independent calculations of different sizes
  – dynamic scheduling to minimize load balance problem

[Diagram: elements of the pairwise distance matrix distributed among CPU1-CPU4]
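The idea of handing out the N(N-1)/2 independent pairwise calculations dynamically can be sketched as follows. This is not SGI's implementation: the scoring function `toy_distance` is a hypothetical stand-in for a real pairwise alignment score, and `pairwise_matrix` is an illustrative name. A Python thread pool is used only to show the scheduling pattern, in which each idle worker takes the next unprocessed pair from a shared queue. A CPU-bound production code would use processes or native threads instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for a pairwise alignment score (not Clustal W's)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / max(len(a), len(b))


def pairwise_matrix(seqs: list[str], workers: int = 4) -> dict[tuple[int, int], float]:
    """Compute all N(N-1)/2 independent pairwise distances with a worker pool."""
    pairs = list(combinations(range(len(seqs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tasks are queued individually; a worker that finishes early simply
        # picks up the next pair, which is the dynamic scheduling idea.
        scores = pool.map(lambda ij: toy_distance(seqs[ij[0]], seqs[ij[1]]), pairs)
        return dict(zip(pairs, scores))


if __name__ == "__main__":
    sequences = ["ACGT", "ACGA", "TTTT", "ACG", "GG"]
    matrix = pairwise_matrix(sequences)
    print(len(matrix), "pairwise scores for", len(sequences), "sequences")
```

Because the pairs differ in cost (sequence lengths vary), a static split of the matrix into equal-sized blocks could leave some workers idle while others finish; pulling tasks one at a time avoids that at the price of some queueing overhead.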

Parallel Clustal W scalability

[Chart: speedup (0-12) vs. CPUs (1-16) for 100 sequences (P=92.0%) and 600 sequences (P=96.3%)]

Speedups for G-protein coupled receptor (GPCR) sequence alignments, with lines representing theoretical (Amdahl's) speedups

• Alignments of larger numbers of sequences scale better because:
  – more time is spent in the parallel pairwise matrix part (~N^2)
  – load balance problems are smaller because of an averaging effect

High Throughput Clustal W

• 100 input files were generated by randomly choosing a number of sequences out of a pool of 1,000 GPCR sequences (average length 390 a.a.), thus generating a heterogeneous mix of Clustal W jobs with the following profile

[Chart: number of input files (0-25) vs. number of sequences (0-120)]

• This profile tries to reproduce a "production" HT environment with a heterogeneous mix of multiple alignments

Managing load balance for heterogeneous Clustal W jobs

[Charts: time in sec (0-300) per CPU (1-32), for unsorted and sorted input files]

• Presorting input files by size reduces the load balance problems
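The effect of presorting can be explored with a small scheduling model. The sketch below makes several assumptions not stated on the slide: jobs are dispatched in list order to whichever CPU becomes free first, "sorted" means largest first, job cost grows as the square of the number of sequences, and the job sizes are random placeholders rather than the actual benchmark profile. The function name `makespan` is illustrative.

```python
import heapq
import random


def makespan(costs: list[float], cpus: int) -> float:
    """Finish time when each job goes to the CPU that becomes free first."""
    free_at = [0.0] * cpus            # min-heap of times at which each CPU is idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


random.seed(0)                        # fixed seed so the sketch is repeatable
sizes = [random.randint(5, 120) for _ in range(100)]   # sequences per input file
costs = [n * n for n in sizes]        # assumed cost model: time ~ N^2

unsorted_time = makespan(costs, cpus=32)
sorted_time = makespan(sorted(costs, reverse=True), cpus=32)
print(f"unsorted: {unsorted_time:.0f}  largest-first: {sorted_time:.0f}")
```

A tiny case shows why order matters: with costs 1, 1, 1, 3 on two CPUs the run finishes at time 4, while the largest-first order 3, 1, 1, 1 finishes at time 3, because the long job no longer starts last.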

Scalability of HT-Clustal W

[Chart: speedup (0-36) and sequences/hour (0 to 1.E+05) vs. number of CPUs (1-32), for unsorted and sorted input files]

• HT Clustal W has close to linear scalability on Origin2000
• Presorting helps when the number of input files is comparable to the number of CPUs (the same profile of 100 input files was used for this plot)

System oversubscription

• If the amount of parallelism in the individual calculations of an HT production run is not very large, it can be advantageous to oversubscribe the system (starting more processes than the number of CPUs available), thus reducing idle CPU time and increasing effective parallelism.

[Chart: speedup (0-9) vs. oversubscription (0.25-2.5)]

Results are shown for running parallel Clustal W jobs (each using 2 CPUs) on an 8-CPU Origin2000. P ~ 90%. Oversubscription = Nprocesses / NCPUs

• In this case, oversubscribing the system by more than 50% makes it possible to reach close to the maximum (8x) speedup on an 8-CPU Origin2000

Parallel Clustal W example (Yuan J. et al., Bioinformatics, 15, 862, 1999)

Optimization of input parameters (scoring matrices, gap penalties) requires many repetitive Clustal W calculations with various input parameters.

[Diagram: PW, Tree, MA stages; vary PW parameters and MA parameters to obtain a better alignment]

[Figure: SH3 family multiple alignment (parallel Clustal W 1.8), default parameters vs. optimized parameters]

MULTICLUSTAL scalability

[Chart: speedup (0-7) vs. CPUs (1-16)]

• Parallel Clustal W significantly increases MULTICLUSTAL performance

• MULTICLUSTAL input parameter optimization for 100 GPCR sequences (average length 339 a.a.) using parallel Clustal W 1.8

• Parallel performance is limited by the MULTICLUSTAL score calculation step

Modified MULTICLUSTAL

• Original MULTICLUSTAL - optimizing PW & MA parameters independently

• Modified MULTICLUSTAL - reuses tree from PW optimization

[Diagram labels: PW, MA parameters, Better performance]

Modified MULTICLUSTAL scalability

[Chart: relative speed (0-10) vs. CPUs (1-16) for original and modified MULTICLUSTAL]

• Modified MULTICLUSTAL is ~1.5-3 times faster than the original program
• Enhancing modifications were reported to the authors at Merck and incorporated in the current code
• MULTICLUSTAL input parameter optimization for 100 GPCR sequences (average length 339 a.a.)

High Throughput Proteomics

Problem: Customers need efficient "dataflow" for proteomics facilities
• Multiple instruments
• Continuous operations

Solution: Couple processing operations
• "Fuse" the data acquisition with the Proteomics data analysis
• Use coarse-grain parallelism

Results
• Efficient data processing
• Data usable throughout organization

[Chart: parallel speedup (0-35) vs. number of processors; labels give elapsed time in seconds; 99.76% Parallelism]

Number of Processors    Elapsed time (sec)
1                       26,344
2                       13,198
4                       6,613
8                       3,346
16                      1,721
24                      1,170
32                      904


"Number Crunching" Scientific Projects with partners/customers

GeneCrunch: first full genome annotation (yeast)
• EMBL/EBI (Chris Sanders)

3-D Crunch: prediction of 3-D protein structures (SWISS-PROT)
• GlaxoWellcome/MSI

Microsecond protein simulation
• UCSF (Science)

DockCrunch: docking simulation of 1,000,000 compounds
• Protherics plc.

Virtual Screening of 850,000 compounds
• Tripos

Virtual Docking project with 800,000 compounds
• Metaphorics

Genome Annotation: New 3D annotations for approximately 27,000 sequences on 256-procs
• MSI

3D Functional Annotation of 3855 sequences in V. Cholerae genome on 256-procs
• MSI (Nature)

Another 3 in the pipeline presently

Capability Demonstration: SGI and MSI Cholera Crunch

Problem:
• Only 60-80% of cholera vaccines are effective. Need new drugs & therefore need to understand 3D structures of potential protein targets

Solution:
• 3D Protein annotation of 3855 sequences in V. Cholerae genome (bacterium)
  – Data from: Nature 406, 477-483 (2000) © Macmillan Publishers Ltd.
• MSI GeneAtlas™ pipeline software
• SGI Origin server with 256 Processors and 256GB memory

Results
• 7 days of computation
• 76% functional annotation vs. 54% in literature
• Results available in MSI AtlasBase™ for MSI Functional Genomics Consortium members. Will be published in early 2002 in "Protein Structure Determination, Analysis and Modeling for Drug Discovery"

Post Genomics

(see Press release on Jan 23, 2001)


Evolution of AIDS Virus

Problem: Trace the evolution of the AIDS Virus

Solution:
• 16x128 Processor Cluster of SGI Origin servers at Los Alamos
• Calculate rate of virus mutation as a function of time - predict the origin of the disease

Results:
• The worldwide AIDS epidemic has been traced back to a single viral ancestor -- the HIV "Eve" -- that emerged perhaps around 1930.

Citation: Dr. Bette Korber, LANL
http://w10.lanl.gov:80/physics/abst01_06.html

Throughput & Turnaround


Functional Genomics - MSI

Need
• Production screening of large numbers of genetic sequences
• Commercial and non-commercial genome projects

Solution
• SGI Origin2000 w/256 processors
• MSI Gene Atlas -- High-throughput automated pipeline for functional annotation of protein sequences
• MSI AtlasBase 22 complete Genome database

Results
• New 3D annotations for approximately 27,000 sequences

"Normally a project like this would have been expected to take 2.5 CPU years. By working jointly with SGI, it was possible to complete it in a week."
Dr. Michael Pear, Director, Protein Bioinformatics at MSI.

Throughput & Turnaround


MineSet - Data Mining and Visualization

Visualization for Bioinformatics

Genome Comparison M. genitalium x H. influenzae

Visualization for Bioinformatics

Reality Centers Enable (Local or Global) Group Decisions on
– Proteins, Genes
– How two molecules 'fit' or 'interact'
– Shape & size of new chemical entities

Customers are Research teams comprised of
– Scientists
– Management

Using "Standard" Scientific Applications:
– Tripos SYBYL 6.61 & 6.62 w/MOLCAD
– MSI InsightII-2000
– VMD (w/CaveLib)
– UCSF Chimera
– SGI MineSet
– AVS


Visualization without the restrictions of a desktop monitor

• An immersive DISPLAY ENVIRONMENT for real-time collaborative viewing and manipulation of data

• An innovative WORK ENVIRONMENT to truly interact with virtual prototypes and large data sets

Reality Centers for Life Sciences

Conclusion: SGI for Discovery Research

• Need for better, faster research decision making

• Tools for Bioinformatics and Chemistry
  – SGI Scalable Servers
  – Bioinformatics and Chemistry application availability
  – SGI enabled scalable Bioinformatics and Chemistry algorithms
  – Advanced Visualization Environments

• Expert team committed to discovery research customers - Global Scientific Support

• Activity in other areas: KM/DM, ASP, LIMS, IA-64
