Parallel Algorithm for Multiple Genome Alignment Using Multiple Clusters
Nova Ahmed, Yi Pan, Art Vandenberg Georgia State University
SURA Cyberinfrastructure Workshop:
Grid Application Planning & Implementation, January 5-7, 2005
Southeastern Universities Research Association
Discussion Topics…
• Sequence alignment problem
• Memory efficient algorithm
• Convergence toward collaboration
• System configurations
• Results (part 1, part 2)
• Conclusions
• Future work
Sequence alignment problem
• Sequences used to find biologically meaningful relationships among organisms
  • Evolutionary information
  • Determining diseases, causes, and cures
  • Learning about proteins
• Problem especially compute intensive for long sequences
  • Needleman and Wunsch (1970) - optimal global alignment
  • Smith and Waterman (1981) - optimal local alignment
  • Taylor (1987) - multiple sequence alignment by pairwise alignment
  • BLAST trades off optimal results for faster computation
• Challenge - achieve optimal results without sacrificing speed
Memory efficient algorithm
• Based on pairwise algorithm
  • Similarity Matrix generated to compare all sequence positions
  • Observation that many "alignment scores" are zero-valued
• Similarity Matrix reduced by storing only non-zero elements
  • Row-column information stored along with value
  • Block of memory dynamically allocated as each non-zero element is found
  • Data structure used to access allocated blocks
• Parallelism introduced to reduce computation
Similarity Matrix Generation
• Alignment of DNA sequences:
  Sequence X: TGATGGAGGT
  Sequence Y: GATAGG
• 1 = matching; 0 = non-matching
• ss = substitution score; gp = gap score
• Generate Similarity Matrix: each cell takes the max score with respect to its neighbors [recurrence shown on slide]
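The recurrence image on this slide did not survive extraction. As a sketch of the fill step described (max score with respect to neighbors, with 1 = matching and 0 = non-matching), a Smith-Waterman-style recurrence with a zero floor is consistent with the observation that many scores are zero; the particular gap score used here is an assumption:

```python
def similarity_matrix(x, y, gp=-1):
    """Fill the similarity matrix H for sequences x and y.

    Each cell is the max over its neighbors: diagonal plus the substitution
    score ss (1 = matching, 0 = non-matching), or up/left plus the gap score
    gp, floored at zero (Smith-Waterman style; exact scores are assumed).
    """
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ss = 1 if x[i - 1] == y[j - 1] else 0
            H[i][j] = max(0,
                          H[i - 1][j - 1] + ss,  # match / substitution
                          H[i - 1][j] + gp,      # gap in y
                          H[i][j - 1] + gp)      # gap in x
    return H

# Slide example: X = TGATGGAGGT, Y = GATAGG
H = similarity_matrix("TGATGGAGGT", "GATAGG")
```

With mismatches scoring 0 and gaps penalized, most cells away from matching regions stay at zero, which is what the sparse storage on the next slides exploits.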
Trace Sequences
• Back trace matrix to find sequence matches
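The back trace can be sketched as: start at the highest-scoring cell and repeatedly follow whichever neighbor could have produced the current score, stopping at zero. The fill and scoring are the same assumptions as in the previous sketch (match 1, mismatch 0, gap -1); the original implementation's tie-breaking rules are not given on the slide:

```python
def similarity_matrix(x, y, gp=-1):
    # Smith-Waterman-style fill (assumed scoring: match 1, mismatch 0, gap -1)
    H = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            ss = 1 if x[i - 1] == y[j - 1] else 0
            H[i][j] = max(0, H[i - 1][j - 1] + ss,
                          H[i - 1][j] + gp, H[i][j - 1] + gp)
    return H

def traceback(x, y, H, gp=-1):
    """Walk back from the best-scoring cell, emitting the aligned pieces."""
    # Locate the highest-scoring cell.
    i, j = max(((r, c) for r in range(len(x) + 1) for c in range(len(y) + 1)),
               key=lambda rc: H[rc[0]][rc[1]])
    ax, ay = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        ss = 1 if x[i - 1] == y[j - 1] else 0
        if H[i][j] == H[i - 1][j - 1] + ss:    # came from the diagonal
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gp:      # gap in y
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:                                  # gap in x
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))
```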
Data Structure
• Algorithm calculates only non-zero values
• Memory dynamically allocated as needed
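A minimal sketch of the sparse storage idea, assuming a hash map keyed by (row, column); the original implementation dynamically allocates blocks of memory (in C) as non-zero elements are found, but the interface is the same: store a value only when it is non-zero, and report zero for any absent cell:

```python
class SparseSimilarityMatrix:
    """Store only non-zero alignment scores, keyed by (row, column)."""

    def __init__(self):
        self._cells = {}  # (row, col) -> non-zero score

    def set(self, row, col, value):
        if value != 0:
            self._cells[(row, col)] = value    # allocate only when non-zero
        else:
            self._cells.pop((row, col), None)  # zeros are never stored

    def get(self, row, col):
        return self._cells.get((row, col), 0)  # absent cells read as zero

    def nonzero_count(self):
        return len(self._cells)
```

Memory then grows with the number of non-zero scores rather than with the full n x m matrix, which is what makes longer sequences feasible.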
Parallel distribution of multiple sequences
[Diagram: two-level distribution - Sequences 1-6 to one cluster, Sequences 7-12 to another; within a cluster, Seq 1-2, Seq 3-4, Seq 5-6 assigned to individual nodes]
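The diagram's two-level split (sequences to clusters, then a cluster's share to nodes) can be sketched as a block distribution; the exact partitioning scheme used by the authors is not stated, so this even-split helper is an assumption:

```python
def block_distribute(items, n_workers):
    """Split items into n_workers contiguous, near-equal blocks."""
    q, r = divmod(len(items), n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        size = q + (1 if w < r else 0)  # earlier workers absorb the remainder
        blocks.append(items[start:start + size])
        start += size
    return blocks

# Two-level split as in the diagram: 12 sequences over 2 clusters,
# then one cluster's share over 3 nodes.
seqs = list(range(1, 13))
clusters = block_distribute(seqs, 2)      # [1..6], [7..12]
nodes = block_distribute(clusters[0], 3)  # [1,2], [3,4], [5,6]
```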
Convergence toward collaboration
• Algorithm implementation
  • Nova Ahmed, Masters CS student
  • Dr. Yi Pan, CS, graduate advisor
• Shared memory system – Georgia State
  • Algorithm implementation and initial validation results
• NMI Integration Testbed program
  • Georgia State
    – Art Vandenberg, Victor Bolet, et al.
  • University of Alabama at Birmingham
    – Jill Gemmill, John-Paul Robinson, Pravin Joshi
• SURA NMI Testbed Grid
  • Looking for applications to demonstrate value
System configurations
• Shared memory – Georgia State
  • SGI Origin 2000
    – 24 250MHz MIPS R10000 processors; 4 GB total RAM
• Clusters – University of Alabama at Birmingham
  • Single Cluster
    – 8-node Beowulf cluster (each node: 4 550MHz Pentium III CPUs; 512 MB RAM)
  • Single Cluster Grid
    – Same 8-node Beowulf cluster with Globus Toolkit 3.0
  • Multi-Cluster
    – 2 additional grid-enabled clusters (small SMP systems)
  • Multi-Cluster interconnect speed essentially 100 Mb/sec
Results, part 1
• Initial validation of algorithm on shared memory
• UAB Cluster
  • As "relative comparison" to shared memory performance
• UAB grid-enabled cluster
  • To evaluate impact of grid middleware layer
Initial Validation: Shared Memory Machine
Performance validates the algorithm: computation time decreases with an increased number of processors.

[Chart: Computation Time (Shared Memory) vs. Number of Processors (2-12)]

Limitations
• Memory – max sequence is 2000 x 2000
• Processors – policy limits student to 12 processors
• Not scalable
Results: UAB Clusters; Shared Memory*
• Increase genome lengths to 3000 (student limit removed for shared memory)

* NB: results comparing clusters with shared memory are relative; the systems are distinctly different.

[Chart: Computation Time (seconds) vs. Number of Processors (2-26) for genome length 3000 on Grid, Cluster, and Shared Memory]
Results: Grid-enabled cluster (Globus, MPICH)
Advantages of grid-enabled cluster:
• Longer sequences – up to 10,000 length tested
• Scalable – can add new cluster nodes to the grid
• Easier job submission – don't need an account on every node
• Scheduling is easier – can submit multiple jobs at one time

[Chart: Computation Time (seconds) vs. Number of Processors (2-26) for genome length 10,000 on Grid and Cluster]
Results, part 2
• Focus on clusters
  • UAB Cluster
  • UAB grid-enabled cluster
  • Multi-clusters at UAB
• Multiple genome alignment – not just pairwise
  • Sequence set from sequence library
  • Approx. 150 sequences ranging from 80,000 to 1,000,000 in length
• Globus Toolkit 3.0, MPICH-G2
Computation Time vs. Number of Elements per Processor

Using 9 processors in each config (cluster, grid cluster, multi-grid cluster)

[Chart: Computation time (sec) vs. number of elements per processor (0-50) for Single Cluster, Single Clustered Grid, and Multi Clustered Grid]
Computation Time
9 processors available in multi-cluster; 32 processors for other configs.

[Chart: Computation time (sec) vs. number of processors (0-30) for Single Cluster, Single Clustered Grid, and Multi Clustered Grid]
Speed up (time 1 cpu / time n cpus)
9 processors available in multi-cluster; 32 processors for other configs.

[Chart: Speed up vs. Number of Processors (0-30) for Single Cluster, Single Clustered Grid, and Multi Clustered Grid]
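The speed-up metric on this slide is simply the one-processor time divided by the n-processor time. As a small illustration, with efficiency (speed up divided by processor count) added as a common companion metric not shown on the slide, and made-up example timings:

```python
def speedup(t1, tn):
    """Speed up = time on 1 cpu / time on n cpus (as defined on the slide)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Fraction of ideal linear speed up achieved with n processors."""
    return speedup(t1, tn) / n

# Illustrative (made-up) timings: 400 s on 1 processor, 50 s on 16.
s = speedup(400.0, 50.0)          # 8.0
e = efficiency(400.0, 50.0, 16)   # 0.5
```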
Some Conclusions
• Having cluster nodes available via Testbed is beneficial
  • Enables access where the resource is not available locally
  • Empowers student investigation
• Grid capability demonstrated
  • Provides awareness and outreach vector
  • Nova Ahmed's thesis defense engages other graduate students
  • Concrete "take away" that engages faculty/IT/student discussion
• Some interesting results
  • Hypothesis: multi-cluster may provide better results than one cluster
  • Research leads to understanding and learning, whatever the hypothesis result
• Ahmed et al., “Memory Efficient Pair-Wise Genome Alignment Algorithm - A Small-Scale Application with Grid Potential,” Proceedings Grid and Cooperative Computing - GCC 2004, Lecture Notes in Computer Science
Future Work
• Running across clusters at different sites
• Intelligent agent: submit to mixed environment
– shared memory and/or clusters and/or …
• Using BridgeCA for transparent access
• Optically connected clusters?
• Analysis of network factors
  • cf. Warren Matthews, GaTech, et al., end-to-end performance
Questions / Contacts
Georgia State University
Nova Ahmed [email protected]
Yi Pan [email protected]
Art Vandenberg [email protected]
Acknowledgement
• This work is supported in part by the NSF Middleware Initiative Cooperative Agreement No. ANI-0123937. Any opinions, findings, conclusions or recommendations expressed herein are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.