
Parallel Algorithm for Multiple Genome Alignment Using Multiple Clusters

Nova Ahmed, Yi Pan, Art Vandenberg
Georgia State University

SURA Cyberinfrastructure Workshop: Grid Application Planning & Implementation
January 5-7, 2005

Southeastern Universities Research Association


Discussion Topics…

• Sequence alignment problem

• Memory efficient algorithm

• Convergence toward collaboration

• System configurations

• Results (part 1, part 2)

• Conclusions

• Future work


Sequence alignment problem

• Sequences used to find biologically meaningful relationships among organisms
  • Evolutionary information
  • Determining diseases, causes, cures
  • Finding out information about proteins

• Problem especially compute intensive for long sequences
  • Needleman and Wunsch (1970) - optimal global alignment
  • Smith and Waterman (1981) - optimal local alignment
  • Taylor (1987) - multiple sequence alignment by pairwise alignment
  • BLAST trades off optimal results for faster computation

• Challenge - achieve optimal results without sacrificing speed


Memory efficient algorithm

• Based on pairwise algorithm
  • Similarity Matrix generated to compare all sequence positions
  • Observation that many “alignment scores” are zero value

• Similarity Matrix reduced by storing only non-zero elements
  • Row-column information stored along with value
  • Block of memory dynamically allocated as non-zero element found
  • Data structure used to access allocated blocks

• Parallelism introduced to reduce computation
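
The idea of storing only non-zero elements together with their row-column position can be sketched in a few lines of Python. This is a minimal illustration, not the authors' data structure: the function name `sparse_match_matrix` is hypothetical, a dictionary stands in for the dynamically allocated blocks, and the stored value is simply the 1/0 match indicator rather than a full alignment score.

```python
from typing import Dict, Tuple


def sparse_match_matrix(x: str, y: str) -> Dict[Tuple[int, int], int]:
    """Keep only the non-zero cells, keyed by (row, column).

    Assumption: a cell is 1 when the two characters match and 0 otherwise.
    Zero cells are never stored, so memory grows with the number of matches
    instead of with len(x) * len(y).
    """
    cells: Dict[Tuple[int, int], int] = {}
    for i, a in enumerate(y):          # rows follow sequence Y
        for j, b in enumerate(x):      # columns follow sequence X
            if a == b:
                cells[(i, j)] = 1      # entry created only when needed
    return cells


# Short example sequences chosen for illustration.
x = "TGATGGAGGT"
y = "GATAGG"
cells = sparse_match_matrix(x, y)
dense_size = len(x) * len(y)
print(f"{len(cells)} of {dense_size} cells stored")

# A missing key is read back as zero.
print(cells.get((0, 0), 0))
```

Reading a cell with `cells.get((i, j), 0)` keeps the calling code unaware of whether a value was stored or implied.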


Similarity Matrix Generation

• Alignment of DNA sequences:
  Sequence X: TGATGGAGGT
  Sequence Y: GATAGG

• 1 = matching; 0 = non-matching
• ss = substitution score; gp = gap score
• Generate Similarity Matrix: each cell takes the max score with respect to its neighbors [formula and matrix figure not recovered from the slide]
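
The slide's own scoring formula is not available in this transcript, so the following is only a sketch of one common way to fill a similarity matrix from neighboring cells. It assumes a Smith-Waterman style recurrence, `S[i][j] = max(0, S[i-1][j-1] + ss, S[i-1][j] + gp, S[i][j-1] + gp)`, with `ss` equal to 1 for a match and 0 otherwise (as on the slide) and a gap score `gp` of -1, which is a hypothetical value. The function name `similarity_matrix` is also an assumption.

```python
from typing import List


def similarity_matrix(x: str, y: str, gap: int = -1) -> List[List[int]]:
    """Fill a (len(y)+1) x (len(x)+1) matrix from its upper-left neighbors.

    Assumed recurrence (not taken from the slides):
        S[i][j] = max(0, diagonal + ss, up + gap, left + gap)
    where ss is 1 for matching characters and 0 otherwise.
    """
    rows, cols = len(y) + 1, len(x) + 1
    s = [[0] * cols for _ in range(rows)]   # row 0 and column 0 stay zero
    for i in range(1, rows):
        for j in range(1, cols):
            ss = 1 if y[i - 1] == x[j - 1] else 0
            s[i][j] = max(
                0,
                s[i - 1][j - 1] + ss,   # align y[i-1] with x[j-1]
                s[i - 1][j] + gap,      # gap in x
                s[i][j - 1] + gap,      # gap in y
            )
    return s


x, y = "TGATGGAGGT", "GATAGG"
matrix = similarity_matrix(x, y)
for row in matrix[1:]:
    print(" ".join(f"{v:2d}" for v in row[1:]))
```

Because the recurrence never drops below zero, many cells remain zero, which is the property the memory efficient storage relies on.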


Trace sequences

• Back trace matrix to find sequence matches
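
Back tracing can be illustrated with a small self-contained sketch. It is not the authors' procedure: it assumes the same Smith-Waterman style recurrence as a typical local alignment (match 1, mismatch 0, hypothetical gap score -1), starts from the highest-scoring cell, and walks back through whichever neighbor produced each value until a zero is reached. The name `align_local` is hypothetical.

```python
from typing import List, Tuple


def align_local(x: str, y: str, gap: int = -1) -> Tuple[str, str]:
    """Fill the matrix, then trace back from the best cell (assumed scheme)."""
    rows, cols = len(y) + 1, len(x) + 1
    s = [[0] * cols for _ in range(rows)]
    best, bi, bj = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            ss = 1 if y[i - 1] == x[j - 1] else 0
            s[i][j] = max(0, s[i - 1][j - 1] + ss,
                          s[i - 1][j] + gap, s[i][j - 1] + gap)
            if s[i][j] > best:              # remember the first maximum
                best, bi, bj = s[i][j], i, j

    out_x: List[str] = []
    out_y: List[str] = []
    i, j = bi, bj
    while i > 0 and j > 0 and s[i][j] > 0:
        ss = 1 if y[i - 1] == x[j - 1] else 0
        if s[i][j] == s[i - 1][j - 1] + ss:     # came from the diagonal
            out_x.append(x[j - 1])
            out_y.append(y[i - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j] + gap:      # came from above: gap in x
            out_x.append("-")
            out_y.append(y[i - 1])
            i -= 1
        else:                                   # came from the left: gap in y
            out_x.append(x[j - 1])
            out_y.append("-")
            j -= 1
    return "".join(reversed(out_x)), "".join(reversed(out_y))


aligned_x, aligned_y = align_local("TGATGGAGGT", "GATAGG")
print(aligned_x)
print(aligned_y)
```

The two returned strings always have equal length, with `-` marking a gap in one sequence opposite a character in the other.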


Data structure

• Algorithm calculates only non-zero values
• Memory dynamically allocated as needed


Parallel distribution of multiple sequences

[Figure labels: Sequences 1-6; Sequences 7-12; Seq 1-2, Seq 3-4, Seq 5-6]
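
The figure labels suggest a two-level split: the sequence set is divided between groups, and each group is divided again into pairs. The sketch below shows one way such a split could be computed. The counts (12 sequences, 2 groups, 3 workers per group) are read off the labels and otherwise assumed, and `split_evenly` is a hypothetical helper, not part of the authors' code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], parts: int) -> List[List[T]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(items), parts)
    chunks: List[List[T]] = []
    start = 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)   # first `extra` chunks get one more
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


sequence_ids = list(range(1, 13))                 # sequences 1..12 (assumed)
per_group = split_evenly(sequence_ids, 2)         # first level of the split
per_worker = [split_evenly(group, 3) for group in per_group]

for g, workers in enumerate(per_worker, start=1):
    print(f"group {g}: {workers}")
```

Contiguous chunks keep neighboring sequences together, which matches the "Seq 1-2, Seq 3-4, Seq 5-6" labels; other assignment schemes would work equally well for independent pairwise jobs.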


Convergence toward collaboration

• Algorithm implementation
  • Nova Ahmed, Masters CS student
  • Dr. Yi Pan, CS, graduate advisor

• Shared memory system – Georgia State
  • Algorithm implementation and initial validation results

• NMI Integration Testbed program
  • Georgia State
    – Art Vandenberg, Victor Bolet, et al.
  • University of Alabama at Birmingham
    – Jill Gemmill, John-Paul Robinson, Pravin Joshi

• SURA NMI Testbed Grid
  • Looking for applications to demonstrate value


System configurations

• Shared memory – Georgia State
  • SGI Origin 2000
    – 24 x 250MHz MIPS R10000; 4 gigabytes total RAM

• Clusters – University of Alabama at Birmingham
  • Single Cluster
    – 8 node Beowulf cluster (each node 4 x 550MHz Pentium III; 512 MB RAM)
  • Single Cluster Grid
    – Same 8 node Beowulf cluster with Globus Toolkit 3.0
  • Multi-Cluster
    – 2 additional grid-enabled clusters (small SMP systems)

• Multi-Cluster interconnect speed essentially 100mb/sec


Results, part 1

• Initial validation of algorithm on Shared memory

• UAB Cluster
  • As “relative comparison” to shared memory performance

• UAB grid-enabled cluster
  • To evaluate impact of grid middleware layer


Initial Validation: Shared Memory Machine

Performance Validates Algorithm
Computation time decreases with increased number of processors

[Figure: Computation Time (Shared Memory). x-axis: Number of Processors (2 to 12); y-axis: Computation Time (0 to 500)]

Limitations
• Memory: max sequence is 2000 x 2000
• Processors: policy limits student to 12 processors
• Not scalable


Results: UAB Clusters; Shared Memory*

• Increase genome lengths to 3000 (student limit on shared memory removed)

* NB: results comparing clusters with shared memory are relative; systems distinctly different.

[Figure: Computation Time (seconds, 0 to 500) vs. Number of Processors (2 to 26). Series: Genome length 3000 (Grid); Genome length 3000 (Cluster); Genome length 3000 (Shared Memory)]


Results: Grid-enabled cluster (Globus, MPICH)

Advantages of grid-enabled cluster:
• Longer Sequences – up to 10,000 length tested
• Scalable – Can add new cluster nodes to the grid
• Easier job submission – Don’t need account on every node
• Scheduling is easier – Can submit multiple jobs at one time

[Figure: Computation Time (seconds, 0 to 600) vs. Number of Processors (2 to 26). Series: Genome length 10000 (Grid); Genome length 10000 (Cluster)]


Results, part 2

• Focus on clusters
  • UAB Cluster
  • UAB grid-enabled cluster
  • Multi-clusters at UAB

• Multiple Genome alignment – not just pairwise
  • Sequence set from sequence library
  • Approx 150 sequences ranging from 80,000 to 1,000,000 length

• Globus Toolkit 3.0, MPICH-G2


Computation Time: Number of elements per processor

Using 9 processors in each config (cluster, grid cluster, multi-grid cluster)

[Figure: Computation time (sec, 0 to 160) vs. Number of elements per processor (0 to 50). Series: Single Cluster; Single Clustered Grid; Multi Clustered Grid]


Computation Time

9 processors available in multi-cluster; 32 processors for other configs.

[Figure: Computation time (sec, 0 to 500) vs. Number of processors (0 to 30). Series: Single Cluster; Single Clustered Grid; Multi Clustered Grid]


Speed up (time 1 cpu / time n cpus)

9 processors available in multi-cluster; 32 processors for other configs.

[Figure: Speed up (0 to 9) vs. Number of Processors (0 to 30). Series: Single Cluster; Single Clustered Grid; Multi Clustered Grid]
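
The speed-up ratio named in the slide title, time on 1 CPU divided by time on n CPUs, is simple to compute from measured run times. The sketch below also derives parallel efficiency (speed-up divided by processor count), which the slides do not mention and is added here only as a common companion metric. The timings used are hypothetical placeholders, not the measured results.

```python
def speedup(time_one_cpu: float, time_n_cpus: float) -> float:
    """Speed-up as defined on the slide: time on 1 CPU / time on n CPUs."""
    if time_n_cpus <= 0:
        raise ValueError("time_n_cpus must be positive")
    return time_one_cpu / time_n_cpus


def efficiency(time_one_cpu: float, time_n_cpus: float, n_cpus: int) -> float:
    """Speed-up per processor (assumed companion metric, not from the slides)."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return speedup(time_one_cpu, time_n_cpus) / n_cpus


# Hypothetical timings in seconds, for illustration only.
t1, t8 = 400.0, 100.0
print(speedup(t1, t8))        # 4.0
print(efficiency(t1, t8, 8))  # 0.5
```

An efficiency well below 1 indicates that communication or idle time is absorbing part of the added processors' capacity.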


Some Conclusions

• Having cluster nodes available via Testbed beneficial
  • Enables access where resource not available locally
  • Empowers student investigation

• Grid capability demonstrated
  • Provides awareness and outreach vector
  • Nova Ahmed’s thesis defense - engages other graduate students
  • Concrete “take away” that engages faculty/IT/student discussion

• Some interesting results
  • Hypothesis: multi-cluster may provide better results than one cluster
  • Research leads to understanding, learning - whatever the result of the hypothesis

• Ahmed et al., “Memory Efficient Pair-Wise Genome Alignment Algorithm - A Small-Scale Application with Grid Potential,” Proceedings Grid and Cooperative Computing - GCC 2004, Lecture Notes in Computer Science


Future Work

• Running across clusters at different sites

• Intelligent agent: submit to mixed environment

– shared memory and/or clusters and/or …

• Using BridgeCA for transparent access

• Optically connected clusters?

• Analysis of network factors
  • cf. Warren Matthews, GaTech, et al., end-to-end performance


Questions / Contacts

Georgia State University

Nova Ahmed [email protected]

Yi Pan [email protected]

Art Vandenberg [email protected]


Acknowledgement

• This work is supported in part by the NSF Middleware Initiative Cooperative Agreement No. ANI-0123937. Any opinions, findings, conclusions or recommendations expressed herein are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.