BioPerf: An Open Benchmark Suite for Evaluating Computer Architecture on Bioinformatics and Life Science Applications David A. Bader


Page 1: David A. Bader

BioPerf: An Open Benchmark Suite for Evaluating Computer Architecture on Bioinformatics and Life Science Applications

David A. Bader

Page 2: David A. Bader

BioPerf: an open bioinformatics and life sciences workload, David A. Bader

Collaborators

• Vipin Sachdeva (U New Mexico, Georgia Tech, IBM Austin)
• Tao Li (U Florida)
• Yue Li (U Florida)
• Virat Agrawal (IIT Delhi)
• Gaurav Goel (IIT Delhi)
• Abhishek Narain Singh (IIT Delhi)
• Ram Rajamony (IBM Austin)

Page 3: David A. Bader

Acknowledgment of Support

• National Science Foundation
– CAREER: High-Performance Algorithms for Scientific Applications (06-11589; 00-93039)
– ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO 03-31654)
– DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles (99-10123)
– ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
– DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
– ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
– DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering (04-20513)
• IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
– DARPA Contract NBCH30390004

Page 4: David A. Bader

Contributions of this Work

• An open source, freely-available, freely-redistributable suite of applications and inputs, BioPerf, which spans a wide variety of bioinformatics applications
– www.bioperf.org
• Performance study on PowerPC G5, IBM Mambo simulator, and Alpha

Page 5: David A. Bader

Outline

• Motivation
• Bioinformatics Workload
• BioPerf Suite
• Performance Analysis on PowerPC G5 and Mambo
• Conclusions and Future Work

Page 6: David A. Bader

Motivation

• Improve performance on a wide range of bioinformatics applications
– Heterogeneous in problems, algorithms, applications
• BioPerf workload assembled as a representative set of bioinformatics applications important now and expected to increase in usage over the next 5-10 years
• Decide if this is YAW (“yet another workload”) or rather unique in its characteristics

Page 7: David A. Bader

Related Work

• General benchmark suites: SPEC
• Domain-specific benchmarks
– TPC, EEMBC, SPLASH, SPLASH-2
• Few specialized benchmarks for bioinformatics
• Previous attempts have been incomplete: analysis on old architectures (BioBench) [Albayraktaroglu et al., ISPASS 2005]
• Included proprietary codes in benchmark suite (BioInfoMark) [Li et al., MASCOTS 2005]
• Previous suites not available for download
• Included several non-redistributable packages
• Inputs not articulated and not included with benchmark suite for similar comparisons

Page 8: David A. Bader

Guiding Principles for BioPerf

• Coverage: The packages must span the heterogeneity of algorithms and biological and life science problems important today as well as (in our view) increasing in importance over the next 5-10 years.
• Popularity: Codes with larger numbers of users are preferred because these packages represent a greater percentage of the aggregate workloads used in this domain.
• Open Source: Open source code allows the scientific study of the application performance, the ability to place hooks into the code, and eases porting to new architectures.
• Licensing: Only packages whose licensing allows free redistribution as open source are included. This requirement eliminated several popular packages, but was kept as a strict requirement to encourage the broadest use of this suite.
• Portability: Preference was given to packages that used standard programming languages and could easily be ported to new systems (both in sequential and parallel languages).
• Performance: We gave slight preference to packages whose performance is well-characterized in other studies. In addition, we strove for computationally-demanding packages and included parallel versions where available.

Page 9: David A. Bader

BioPerf Suite

• Pre-compiled binaries (PowerPC, x86, Alpha)
• Scalable input datasets with each code for fair comparisons
• Scripts for installation, running and collecting outputs
• Documentation for compiling and using the suite
• Parallel codes where available
• Available for download from www.bioperf.org

Page 10: David A. Bader

BioPerf workload

Area                          Package    Executables
Sequence homology
  Word-based                  BLAST      blastp, blastn
  Profile-based               HMMER      hmmpfam, hmmsearch
Sequence alignment
  Pairwise                    FASTA      ssearch, fasta
  Multiple                    CLUSTALW   clustalw, clustalw_smp
  Multiple                    TCOFFEE    tcoffee
Phylogeny
  Parsimony/Likelihood        PHYLIP     dnapenny, promlk
  Gene rearrangement          GRAPPA     grappa
Protein structure prediction  PREDATOR   predator
Gene finding                  GLIMMER    glimmer, glimmer-package
Molecular dynamics            CE         ce

Page 11: David A. Bader

Sequence Alignment

• Sequence alignment is one of the most useful techniques in computational biology
– Sequence alignment: stacking the sequences against each other, with gaps if necessary, to expose similarity.

ALIGNMENT
S1 : ACGCTGATATTA     ACGCTGATAT---TA
S2 : AGTGTTATCCCTA    AG--TGTTATCCCTA

MATCH

Page 12: David A. Bader

Sequence Alignment

• Sequence alignment is one of the most useful techniques in computational biology
– Sequence alignment: stacking the sequences against each other, with gaps if necessary, to expose similarity.

ALIGNMENT
S1 : ACGCTGATATTA     ACGCTGATAT---TA
S2 : AGTGTTATCCCTA    AG--TGTTATCCCTA

MISMATCH

Page 13: David A. Bader

Sequence Alignment

• Sequence alignment is one of the most useful techniques in computational biology
– Sequence alignment: stacking the sequences against each other, with gaps if necessary, to expose similarity.

ALIGNMENT
S1 : ACGCTGATATTA     ACGCTGATAT---TA
S2 : AGTGTTATCCCTA    AG--TGTTATCCCTA

“GAPS”
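The three callouts on the preceding slides (match, mismatch, gap) can be made concrete by classifying each column of the aligned S1/S2 pair. The following is a minimal sketch; the function name `classify_columns` is hypothetical and is not part of any BioPerf package.

```python
from collections import Counter

def classify_columns(row1: str, row2: str, gap: str = "-") -> Counter:
    """Count match, mismatch and gap columns of two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    counts = Counter(match=0, mismatch=0, gap=0)
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            counts["gap"] += 1        # one sequence has no character here
        elif a == b:
            counts["match"] += 1      # identical characters stacked
        else:
            counts["mismatch"] += 1   # different characters stacked
    return counts

# Aligned rows copied from the slide.
S1_ALIGNED = "ACGCTGATAT---TA"
S2_ALIGNED = "AG--TGTTATCCCTA"

print(dict(classify_columns(S1_ALIGNED, S2_ALIGNED)))
# {'match': 8, 'mismatch': 2, 'gap': 5}
```

The 15 columns of the slide's alignment split into 8 matches, 2 mismatches and 5 gap columns.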

Page 14: David A. Bader

Multiple Sequence Alignment

• Bring the greatest number of similar characters into the same column.
• Provides much more information than pairwise alignment

[Figure: example multiple alignment of short sequences over the characters V, S, N, A, with gaps inserted]

• Run-time of dynamic programming solution = O(2^k n^k)
• 6 sequences of length 100: 6.4 × 10^13 calculations
• Hence heuristics employed
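The cost figure on this slide can be checked directly. The sketch below reads the slide's bound as 2^k · n^k for k sequences of length n; the function name `msa_dp_cost` is a hypothetical helper, and the value is a cell-update estimate rather than a measured run time.

```python
def msa_dp_cost(k: int, n: int) -> int:
    """Estimated work for exact k-way alignment of length-n sequences: 2^k * n^k."""
    return (2 ** k) * (n ** k)

cost = msa_dp_cost(6, 100)
print(f"{cost:.1e}")  # 6.4e+13
```

Adding a seventh sequence multiplies the estimate by another factor of 200, which is why progressive heuristics such as ClustalW are used instead of the exact method.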

Page 15: David A. Bader

Sequence Homology

• Find similar sequences (DNA/protein) to an unknown sequence (DNA/protein).
• Computationally expensive
• Size of data is huge and grows exponentially every year
• Public databases available: Genbank, SwissProt, PDB

Database        Contents            Size
NCBI Genbank    DNA sequences       5 million sequences
Swissprot       Protein sequences   160,000 sequences
PDB             Protein structure   32,000 structures

Problems with computational approach
• Exact alignment is an O(l^2) dynamic programming solution
• Quicker but less accurate heuristics employed
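The O(l^2) dynamic programming solution mentioned above can be sketched as a global alignment score computation in the Needleman-Wunsch style. This is an illustrative sketch only: the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity, and it is not the code used by BLAST, FASTA or ssearch.

```python
def global_alignment_score(s: str, t: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of s and t using O(len(s) * len(t)) table updates."""
    cols = len(t) + 1
    # Row 0: aligning the empty prefix of s against prefixes of t costs only gaps.
    prev = [j * gap for j in range(cols)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in t
            left = curr[j - 1] + gap  # gap inserted in s
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[-1]

# Unaligned sequences from the Sequence Alignment slides.
print(global_alignment_score("ACGCTGATATTA", "AGTGTTATCCCTA"))
```

Every cell of the (l+1) by (l+1) table is filled once, which is the quadratic cost that makes exact search against millions of database sequences expensive. Only two rows are kept in memory because the score alone is needed here.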

Page 16: David A. Bader

Blast

• Basic Local Alignment Search Tool
• Developed by NCBI
• The most important bioinformatics application, given its popularity

Package: Blast
Executables: blastp, blastn
Input: The Homo sapiens hereditary haemochromatosis protein
Database: Non-redundant protein sequence database (nr) developed by NCBI

Page 17: David A. Bader

FASTA

• Also performs pairwise sequence alignment

Package: FASTA
Executables: fasta34, ssearch
Input: The human LDL receptor precursor
Database: nr

Page 18: David A. Bader

ClustalW

• Multiple sequence alignment (MSA) program

Package: ClustalW
Executables: clustalw, clustalw_smp
Input: 317 Ureaplasma gene sequences from the NCBI Bacteria genomes database

Page 19: David A. Bader

T-Coffee

• A sequential MSA similar to ClustalW with higher accuracy and complexity

Package: T-Coffee
Executable: tcoffee
Input: 50 sequences of average length 850 extracted from the Prefab database

Page 20: David A. Bader

Hmmer

• Align multiple sequences by using hidden Markov models

Package: Hmmer
Executables: hmmsearch, hmmpfam
Input: Brine shrimp globin; HMM of 50 aligned globin sequences

Page 21: David A. Bader


Phylogenetic Reconstruction

• Study the evolution of all sequences and all species

• Find the best among all possible trees.

• Given n taxa, the number of possible unrooted trees is (2n-5)!! (rooted: (2n-3)!!)

• 10 taxa: about 2 million unrooted trees

• Approaches like maximum parsimony, maximum likelihood, among others

The Tree of Life (10-100M organisms)
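The growth in the number of candidate trees can be computed with a double factorial. For binary trees on n labeled taxa, the standard counts are (2n-5)!! unrooted and (2n-3)!! rooted topologies; the "2 million trees" figure for 10 taxa corresponds to the unrooted count. The helper names below are hypothetical.

```python
def double_factorial(m: int) -> int:
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2; equals 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def unrooted_trees(n: int) -> int:
    """Number of unrooted binary tree topologies on n >= 3 labeled taxa."""
    return double_factorial(2 * n - 5)

def rooted_trees(n: int) -> int:
    """Number of rooted binary tree topologies on n >= 2 labeled taxa."""
    return double_factorial(2 * n - 3)

print(unrooted_trees(10))  # 2027025
print(rooted_trees(10))    # 34459425
```

Each added taxon multiplies the count by an odd number that keeps growing, so exhaustive search is out of reach well before the scale of the Tree of Life.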

Page 22: David A. Bader

Phylogeny Reconstruction: Phylip

• Collection of programs for inferring phylogenies
• Methods include
– Maximum parsimony
– Maximum likelihood
– Distance based methods.
• Input: Aligned dataset of 92 cyclophilin proteins of eukaryotes, each of length 220

Page 23: David A. Bader

Phylogeny Reconstruction: GRAPPA

• Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
• Freely-available, open-source, GNU GPL
• Already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institute, Aventis, GlaxoSmithKline, PharmCos.
• Gene-order Phylogeny Reconstruction
• Breakpoint Median
• Inversion Median
• Over one-billion fold speedup from previous codes
• Parallelism scales linearly with the number of processors
• [Bader, Moret, Warnow]

Campanulaceae
• Bob Jansen, UT-Austin
• Linda Raubeson, Central Washington U

Input: 12 bluebell flower species of 105 genes

[Figure: gene-order based phylogeny (gene orders over genes A-F; nodes labeled X, Y, Z, W; Tobacco)]

Page 24: David A. Bader

Protein Structure Prediction

• Find the sequences, three dimensional structures and functions of all proteins and vice-versa
– Why computationally?
• Experimental techniques slow and expensive
– Problems with computational approach
• Little understanding of how structure develops
• Does function really follow structure?

Page 25: David A. Bader

Protein Structure: Predator

• Tool for finding protein structures.
• Relies on local alignments from BLAST, FASTA
• Input: 20 sequences from Swissprot, each of length about 7000 residues.

Page 26: David A. Bader

CE (Combinatorial Extension)

• Find structural similarities between the primary structures of pairs of proteins.

Package: CE
Executable: ce
Input: Two different types of hemoglobin, which is used to transport oxygen

Page 27: David A. Bader


Gene-Finding: Glimmer

• Gene-Finding: Find regions of genome which code for proteins.

• Widely used gene finding tool for microbial DNA.

• Input: Bacteria genome consisting of 9.2 million base pairs

Page 28: David A. Bader

Pre-compiled binaries

• PowerPC
• x86
• Alpha

Page 29: David A. Bader


BioPerf Performance Studies

• Analysis at the instruction and memory level on PowerPC

• Livegraph data helps to visualize performance as it varies during phases of a run

• Identify bottlenecks of current processors and provide input for better performance on future processors

• Ongoing work using Mambo simulator (IBM PERCS)

• Pre-compiled Alpha binaries for the majority of benchmarks for simulation

• In order to reduce the simulation time, we collect the simulation points for those benchmarks by using SimPoint

Page 30: David A. Bader


Conclusions

• Bioinformatics is a rapidly evolving field of increasing importance to computing

• BioPerf is a first step to characterize bioinformatics workload: infrastructure to evaluate performance

• Performance data collected so far provides insight into the limitations of current architectures

Page 31: David A. Bader

Related Publications

• D.A. Bader, V. Sachdeva, A. Trehan, V. Agarwal, G. Gupta, and A.N. Singh, “BioSPLASH: A sample workload from bioinformatics and computational biology for optimizing next-generation high-performance computer systems,” (Poster Session), 13th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2005), Detroit, MI, June 25-29, 2005.
• D.A. Bader, V. Sachdeva, “BioSPLASH: Incorporating life sciences applications in the architectural optimizations of next-generation petaflop-system,” (Poster Session), The 4th IEEE Computational Systems Bioinformatics Conference (CSB 2005), Stanford University, CA, August 8-11, 2005.
• D.A. Bader, Y. Li, T. Li, V. Sachdeva, “BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture on Bioinformatics Applications,” The IEEE International Symposium on Workload Characterization (IISWC 2005), Austin, TX, October 6-8, 2005.

Page 32: David A. Bader


Backup Slides

Page 33: David A. Bader

BioPerf on PowerPC

• PowerPC G5 dual-processor machine
– Uniprocessor performance (nvram boot-args=1)
– CPU frequency of 1.8 GHz
– 1 GB of physical memory available.
• Codes compiled using gcc-3.3 with no additional optimizations.
• MONster tool of the C.H.U.D. package used for collecting hardware performance counters
– Instruction and memory level analysis
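The livegraph slides that follow plot ratios (instructions per cycle, L1D hit rate, branch mispredicts per instruction, loads per instruction) rather than raw counts. A minimal sketch of how such ratios can be derived from one sample of raw counters is given below. The `CounterSample` field names are hypothetical and do not correspond to actual MONster event names, the hit-rate formula assumes misses are counted per load, and the numbers in the usage example are invented for illustration, not measured.

```python
from dataclasses import dataclass

@dataclass
class CounterSample:
    """One time sample of raw hardware counters (hypothetical field names)."""
    cycles: int
    instructions: int
    loads: int
    l1d_load_misses: int
    branch_mispredicts: int

def ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty sample yields 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0

def derive_metrics(s: CounterSample) -> dict:
    """Turn raw counts into the per-cycle and per-instruction ratios that get plotted."""
    return {
        "ipc": ratio(s.instructions, s.cycles),
        "l1d_hit_rate": 1.0 - ratio(s.l1d_load_misses, s.loads),
        "mispredicts_per_instr": ratio(s.branch_mispredicts, s.instructions),
        "loads_per_instr": ratio(s.loads, s.instructions),
    }

# Invented sample values, for illustration only.
sample = CounterSample(cycles=1000, instructions=500, loads=250,
                       l1d_load_misses=25, branch_mispredicts=10)
print(derive_metrics(sample))
```

Computing the ratios per time sample, rather than once over the whole run, is what lets the graphs show phase changes within a single execution.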

Page 34: David A. Bader

Clustalw Algorithm Summary

• Pairwise alignment of all sequences against one another.
– Dynamic programming step
• Generate guide tree for aligning sequences
– Sequences with highest similarity get aligned first
• Sequence-group and group-group alignments (progressive)
– All possible pairwise alignments between sequence and group are tried. The highest scoring pair determines how the sequence gets aligned to the group.
– All possible pairwise alignments of sequences between groups are tried; the highest scoring pair determines how the groups get aligned
– Clustalw uses calculations from step 1 for this step
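The first two steps summarized above (score every pair, then align the most similar sequences first) can be sketched as follows. This is not ClustalW's implementation: the similarity measure here is a simple position-wise identity fraction on equal-length toy strings, standing in for the dynamic programming scores, and all names are hypothetical.

```python
from itertools import combinations

def percent_identity(a: str, b: str) -> float:
    """Fraction of positions with identical characters (equal-length inputs only)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pairs_by_similarity(seqs: list) -> list:
    """Score all k*(k-1)/2 pairs and return (score, i, j) tuples, most similar first."""
    scored = [(percent_identity(seqs[i], seqs[j]), i, j)
              for i, j in combinations(range(len(seqs)), 2)]
    return sorted(scored, key=lambda item: -item[0])

def pair_count(k: int) -> int:
    """Number of pairwise alignments needed for k sequences."""
    return k * (k - 1) // 2

toy = ["ACGT", "ACGA", "TTTT"]
print(pairs_by_similarity(toy))  # [(0.75, 0, 1), (0.25, 0, 2), (0.0, 1, 2)]
print(pair_count(318))           # 50403
```

For the 318-sequence input used in the livegraphs, the all-pairs step needs 50,403 pairwise alignments, which is consistent with it dominating the run time.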

Page 35: David A. Bader

Clustalw Livegraphs

[Figure: Instr. Completed (ppc, io, ld/st)/Cycle and Instr. (ppc)/Cycle (y axis 0.0-1.4) versus Time Samples (0-3000)]

• Pairwise alignment step (70.1% of total time): ppc instructions lag the total instructions
• Guide tree formation (<0.1% of total time)
• Progressive alignment step (29.8% of total time): almost all instructions are ppc
• Input: 318 sequences, each of length almost 1050

Page 36: David A. Bader

Clustalw Livegraphs

[Figure: Instr. Completed/Cycle (0.0-1.4) and L1d Hit Rate (0.75-1.05) versus Time Samples (0-3000)]

• L1D hit rate almost 100%
• Instructions executed low
• L1D hit rate falls
• Instructions executed increase remarkably

Page 37: David A. Bader

Clustalw Livegraphs

[Figure: Instr. Completed/Cycle (0.0-1.4) and Branch Mispredicts/Instr. (0.000-0.030) versus time samples (0-3000)]

• Branch mispredicts are high in dynamic programming
• Instruction count is low
• Branch mispredicts fall in progressive alignment
• Instruction count increases in progressive alignment
• Is performance directly related to branch mispredicts?

Page 38: David A. Bader

Clustalw livegraphs

[Figure: Branch mispredict due to TA (0.000-0.007) and Branch mispredict due to CR (0.000-0.030) versus time samples (0-3000)]

• Almost all branch mispredicts are caused by condition register mispredicts

Page 39: David A. Bader

But what about loads per instruction?

[Figure: Instr. Completed/Cycle (0.0-1.4) and Loads/instr (0.2-1.0) versus time samples (0-3000)]

• Loads per instruction is high in dynamic programming
• Instruction count is low
• Loads per instruction falls in progressive alignment
• Instruction count increases in progressive alignment

Page 40: David A. Bader

Clustalw livegraphs - smaller inputs

• Smaller input - 44 sequences of length 583

[Figure: Instructions per cycle (0.0-1.8) and L1d hit rate (0.84-1.02) versus time samples (0-3000)]

• Same performance characteristics but with longer progressive alignment step

Page 41: David A. Bader

Clustalw livegraphs – smaller inputs

[Figure: Instructions per cycle (0.0-1.6) and Branch mispredicts/instr (0.000-0.030) versus time samples (0-3000)]

• Same performance characteristics but with longer progressive alignment step

Page 42: David A. Bader

Clustalw livegraphs – smaller inputs

[Figure: Branch mispredicts due to TA (0.000-0.012) and Branch mispredicts due to CR (0.000-0.030) versus time samples (0-3000)]

• Almost all branch mispredicts are caused by condition register mispredicts

Page 43: David A. Bader

Clustalw livegraphs – smaller input

[Figure: Instructions Per Cycle (0.0-1.6) and Loads/instr. (0.2-0.7) versus time samples (0-3000)]

• Same performance characteristics but with longer progressive alignment step
• Can we use Mambo with smaller input sizes for more performance analysis?

Page 44: David A. Bader

Using Mambo with Clustalw and other applications

• Collect separate outputs for each phase of the run
• Inserted “callthru exit” into the source code separating each part
• Dump the system statistics at the end of each phase
– mysim stats dump
– mysim caches stats dump
– MamboClearSystemStats (clean the previous statistics)
• Multiple “mysim go” in the .tcl file.

Page 45: David A. Bader

Clustalw on Mambo

[Figure: INST_TYPE_ARITH, INST_TYPE_BRANCH and INST_TYPE_LOAD counts (y axis 0 to 6e+5) over the run (x axis 0 to 5e+9)]

• Pairwise alignment: high loads and arithmetic instructions
• Progressive alignment uses results from first step: high branch and loads
• Mambo offers far more detailed instruction profiling than G5?

Page 46: David A. Bader

Comparing large datasets with small datasets

Is it feasible to use smaller input datasets for accurate simulation results?

• Branch mispredicts lower due to smaller dynamic programming step
• Branch mispredicts much higher
• High increase in L1d hit rate

Page 47: David A. Bader

Summary of BioPerf performance

[Figure: summary chart; callouts follow]

• Highest instructions executed per cycle
• High loads per instruction
• Low branch mispredicts
• Low TLB misses
• High L1d hit rate
• Highest branch mispredicts and TLB misses
• High % of ld/st/io instructions
• Very low % of ld/st/io

Page 48: David A. Bader

Summary of BioPerf performance

[Figure: summary chart; callouts follow]

• High loads per instruction
• High branch mispredicts
• Mid-range instructions per cycle
• Low TLB misses
• Low % of ld/st/io instructions

Page 49: David A. Bader

Summary of BioPerf Performance

[Figure: summary chart; callouts follow]

• Lowest instruction rate
• Lowest loads per instruction
• Low branch mispredicts and TLB misses
• Lowest L1D and L2D hit rate
• Low % of ld/st/io instructions