Parallel Longest Common Subsequence using Graphics Hardware

John Kloetzli, Brian Strege, Jonathan Decker, Dr. Marc Olano

Presented by: Brian Strege



Overview

• Introduction
  – Problem Statement
• Background and Related Work
  – The NVIDIA G80 Architecture
• Algorithm Description
• Results and Analysis
• Conclusion

Introduction

• Worked on GPU acceleration of Dynamic Programming
  – Specifically, problems in the Gaussian Elimination Paradigm (GEP)
  – More specifically, Longest Common Subsequence as a representative problem belonging to the GEP

Problem Statement

• Design and implement an algorithm for finding the LCS of two arbitrary length strings on a CPU + GPU machine
  – Must make efficient use of both CPU and GPU architectures
  – Must have theoretical justification of design


Related Work

• General-Purpose Computation on Graphics Hardware
  – NVIDIA CUDA
  – Owens et al. (2005)
• Linear Dynamic Programming
  – Hirschberg (1975)
  – Chowdhury et al. (2006)
• GPU Sequence Alignment
  – Liu et al. (2007)
  – Schatz et al. (2007)

The NVIDIA G80 Architecture

• 16 multiprocessors, 8 cores each: 128 logical processors
• 1.35 GHz
• 768 MB of RAM
• 86.4 GB/sec transfer rate (8.5 GB/sec Core 2 Duo)
• 520 GFLOPS (22 GFLOPS Core 2 Duo)

Figure source: NVIDIA CUDA Programming Guide, 1.0

The NVIDIA G80 Architecture

Program workflow:
• CPU (host) creates kernel program
• GPU maps kernel “blocks” to processors
• Processors map kernel “threads” to processor cores
• Cores execute in parallel

Figure source: NVIDIA CUDA Programming Guide, 1.0


Algorithm Description

• The SIMPLE-LCS recurrence
  – Requires quadratic space, which limits scalability
  – Faster than Chowdhury et al. linear space method

SIMPLE-LCS Example

        A  B  A  B
     0  0  0  0  0
  A  0  1  1  1  1
  A  0  1  1  2  2
  B  0  1  2  2  3
  B  0  1  2  2  3

(Completed table; the slides fill it one cell at a time, row by row.)
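
The SIMPLE-LCS table for the row string AABB and the column string ABAB can be reproduced with a short Python sketch of the classic quadratic-space recurrence. The function names `simple_lcs_table` and `traceback` are illustrative and do not come from the slides; the traceback is one common way to recover an LCS from the table and is shown here as an assumption about what the final slides of the example depict.

```python
def simple_lcs_table(rows, cols):
    """Fill the (len(rows)+1) x (len(cols)+1) LCS table, row by row."""
    table = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i, a in enumerate(rows, start=1):
        for j, b in enumerate(cols, start=1):
            if a == b:
                # Matching characters extend the LCS of both prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise keep the better of "drop a" and "drop b".
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def traceback(table, rows, cols):
    """Walk back from the bottom-right cell to recover one LCS."""
    i, j = len(rows), len(cols)
    out = []
    while i > 0 and j > 0:
        if rows[i - 1] == cols[j - 1]:
            out.append(rows[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


table = simple_lcs_table("AABB", "ABAB")
for row in table:
    print(*row)
print(traceback(table, "AABB", "ABAB"))
```

The table needs one entry per pair of prefixes, which is the quadratic space cost that the slides identify as the limit on scalability.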

Algorithm Description

• Chowdhury et al. run a quadratic space CPU algorithm on small subproblems
  – CH-LCS is their linear space algorithm
  – CUTOFF ranges from 2^8 to 2^10

Algorithm Description

• Our approach is to add another base case solved quickly on the GPU
  – GPU-LCS is our new algorithm (not recursive)
  – GPU-CUTOFF is 2^16
  – CUTOFF is 2^11
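
One way to picture the two cutoffs is as a size-based dispatch. The sketch below is a hedged illustration, not the authors' code: the names `choose_solver` and `count_leaves` are hypothetical, the inclusive comparisons are an assumption, and so is the idea that CH-LCS splits a square problem into four half-size quadrants.

```python
CUTOFF = 2 ** 11       # CPU cutoff from the slide
GPU_CUTOFF = 2 ** 16   # GPU cutoff from the slide


def choose_solver(n):
    """Pick a solver name for an n x n subproblem (boundary handling assumed)."""
    if n <= CUTOFF:
        return "SIMPLE-LCS"   # CPU quadratic-space base case
    if n <= GPU_CUTOFF:
        return "GPU-LCS"      # GPU base case, not recursive
    return "CH-LCS"           # keep recursing on the CPU


def count_leaves(n):
    """Count base-case calls if CH-LCS splits a problem into four half-size
    quadrants (an assumption made for this sketch)."""
    solver = choose_solver(n)
    if solver != "CH-LCS":
        return {solver: 1}
    counts = {}
    for _ in range(4):
        for name, k in count_leaves(n // 2).items():
            counts[name] = counts.get(name, 0) + k
    return counts


print(choose_solver(2 ** 10))
print(choose_solver(2 ** 16))
print(count_leaves(2 ** 18))
```

Under these assumptions a problem of size 2^18 is split twice on the CPU and then handed to the GPU as 16 subproblems of size 2^16.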

Algorithm Description

• CH: CPU Linear Space DP
• GPU: GPU DP
  – GPU level 1: GPU Quadratic Space DP (block level)
  – GPU level 2: GPU Linear Space DP (thread level)
• Simple: CPU Quadratic Space DP

CH: CPU Linear Space DP

Two recursive functions used:
• Output boundary
• LCS reconstruction

CH: CPU Linear Space DP

Output boundary:
• Given input boundary, computes output boundary
• Expects subproblem size to be square, with power-of-two lengths

Pushing Example

Input boundary (top row and left column) and the completed matrix:

        A   B   A   B
    19  20  21  22  22
  A 20  20  21  22  22
  A 20  21  21  22  22
  B 20  21  22  22  23
  B 20  21  22  22  23

Linear space buffer before pushing: 20 20 20 20 19 20 21 22 22
Linear space buffer after pushing:  20 21 22 22 23 23 22 22 22

(The slides fill the matrix one cell at a time, quadrant by quadrant, updating the buffer in place.)
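
The buffer values in the pushing example are consistent with a single array indexed by diagonal, where each new cell overwrites its diagonal predecessor. The Python sketch below makes that reading concrete. The buffer layout (left column from bottom to top, then the corner, then the top row) and the quadrant order are inferred from the numbers in the example rather than stated on the slides, and the names `push_cell` and `push` are hypothetical.

```python
def push_cell(buf, n, i, j, rows, cols):
    """Compute cell (i, j) in place; diagonal j - i lives at buf[n - i + j]."""
    d = n - i + j
    if rows[i - 1] == cols[j - 1]:
        buf[d] = buf[d] + 1                    # buf[d] still holds cell (i-1, j-1)
    else:
        buf[d] = max(buf[d + 1], buf[d - 1])   # cells (i-1, j) and (i, j-1)


def push(buf, n, i0, j0, size, rows, cols):
    """Push the boundary through the size x size square whose top-left cell
    is (i0, j0), visiting quadrants in a dependency-respecting order."""
    if size == 1:
        push_cell(buf, n, i0, j0, rows, cols)
        return
    h = size // 2
    push(buf, n, i0, j0, h, rows, cols)          # top-left
    push(buf, n, i0, j0 + h, h, rows, cols)      # top-right
    push(buf, n, i0 + h, j0, h, rows, cols)      # bottom-left
    push(buf, n, i0 + h, j0 + h, h, rows, cols)  # bottom-right


rows, cols = "AABB", "ABAB"
n = len(rows)
# Left column (bottom to top), corner, then top row (left to right).
buf = [20, 20, 20, 20, 19, 20, 21, 22, 22]
push(buf, n, 1, 1, n, rows, cols)
print(buf)
```

After the call, the buffer holds the bottom row (left to right) followed by the right column (bottom to top), which is the output boundary. Only 2n + 1 values are stored for an n x n subproblem, and the recursion needs the power-of-two square shape that the output boundary function expects.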


GPU Processing Overview

• Two levels of parallelism
  – Blocks are executed on a processor
  – Threads are executed on a processor core
  – Each thread is computed by exactly one processor core

GPU Level 1: Quadratic Space

• Computes the length of the LCS for input sequences of up to 2^16 characters
• Divide DP matrix into “blocks”; each block is solved by one of the GPU processors
• We must enforce the correct order of block execution
  – The blocks on each diagonal can be computed in parallel
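
The diagonal ordering of blocks can be sketched in Python as follows. This is a host-side illustration of the scheduling idea only, not the authors' GPU code, and the name `block_wavefronts` is hypothetical. A block at (row, col) needs the output of the blocks above, to the left, and diagonally above-left, all of which lie on earlier anti-diagonals.

```python
def block_wavefronts(blocks_per_side):
    """Group block coordinates (row, col) by anti-diagonal row + col.

    Blocks in one group depend only on blocks in earlier groups, so a group
    can be processed in parallel once the previous groups are done.
    """
    waves = []
    for d in range(2 * blocks_per_side - 1):
        lo = max(0, d - blocks_per_side + 1)
        hi = min(d, blocks_per_side - 1)
        waves.append([(r, d - r) for r in range(lo, hi + 1)])
    return waves


for wave in block_wavefronts(3):
    print(wave)
```

For a matrix of width 2^16 and the block width of 512 reported elsewhere in the slides, this gives 128 blocks per side, 255 wavefronts, and at most 128 independent blocks in one wavefront.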

GPU Level 1: Quadratic Space

• The basic quadratic space DP algorithm would require 16 GB of memory
  – We “fold” the memory to store only the input/output boundary for each block
  – Reduces the storage required to 64 MB
  – From n^2 to 2(n^2/m) where m = 512
  – Duplicate some values to avoid memory contention
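
The 16 GB and 64 MB figures can be checked with a few lines of Python arithmetic. The entry size of 4 bytes is an assumption chosen because it reproduces the numbers on the slide; n = 2^16 is the maximum GPU problem size and m = 512 is the block width.

```python
BYTES_PER_ENTRY = 4   # assumption: 32-bit table entries
n = 2 ** 16           # maximum sequence length handled on the GPU
m = 512               # block width

full_entries = n * n                # one entry per cell of the DP matrix
folded_entries = 2 * (n * n // m)   # two boundary edges of m entries per m x m block

full_gib = full_entries * BYTES_PER_ENTRY / 2 ** 30
folded_mib = folded_entries * BYTES_PER_ENTRY / 2 ** 20
print(full_gib, "GiB")
print(folded_mib, "MiB")
```

The factor of 2 comes from keeping a row edge and a column edge of length m for each of the (n/m)^2 blocks, which totals 2m(n/m)^2 = 2(n^2/m) entries.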


GPU Level 2: Linear Space

• Within each block we also have more parallelism
  – Divide each block into “threads”
  – Each processor core computes one thread at a time
  – Hardware-level synchronization ensures the correct diagonal ordering
  – Each core reuses the same space (white) and computes the entire logical matrix (grey)

GPU Level 2: Linear Space

• Each thread is a 4x4 subproblem
  – The size was determined by experimentation
  – This memory is on chip, so we do not have to worry about memory conflicts
  – The linear space algorithm allows us to make each block as large as possible, which allows for very fast execution



Simple: CPU Quadratic Space DP

• Only gets called when a subproblem is too small for the GPU

• Implements SIMPLE-LCS, the “classic” matrix-based LCS algorithm


Results and Analysis

GPU thread width of 4 proves optimal


Results and Analysis

GPU block width of 512 is slightly faster


Results and Analysis

CPU/GPU cutoff sizes determined experimentally

Results and Analysis

• Test DNA sequence data obtained from Mike Brudno
• Over five-fold performance improvement over the results in Chowdhury et al. on all sequence comparisons

Species   Length (millions)
Human     1.80
Chimp     1.32
Baboon    1.51
Chicken   0.42
Fugu      0.27
Cow       1.46
Mouse     1.49
Rat       1.50
Cat       1.16
Dog       1.05

Conclusion

• We present a GPU-based Dynamic Programming algorithm to compute the LCS of very large sequences
• The GPU implementation gives an over five-fold performance boost over the single-CPU implementation

Future Work

• We believe our algorithm can be accelerated further with careful optimization
  – Memory management on the GPU
  – Memory transfer between CPU and GPU
• Investigation of other computation models
  – Implementations using 8xCPU + 2xGPU?


Questions?

Special thanks to Rezaul Chowdhury for his support and Mike Brudno for the DNA sequence data

NVIDIA CUDA

• Compute Unified Device Architecture
• Available on G80 Series
• Architecture for utilizing the GPU as a data-parallel computing device
• Eliminates the need to map computation through graphics API
• User writes a C style function which is then run in parallel on the GPU

CH: CPU Linear Space DP

LCS reconstruction:
• Computes output boundaries in specific order
• Traces back through boundaries to generate LCS
• Linear space


CH: CPU Linear Space DP

LCS reconstruction omissions:

• Non-power-of-two sequence lengths

• Non-equal sequence lengths

Integration with Parallel CPUs

• Chowdhury et al. implemented a parallel version of their algorithm
  – No data available for LCS, but results from other algorithms show we should expect ~6 times speedup for LCS using 8 server processors
  – Disadvantages:
    • Number of processors which can be effectively used scales poorly with input size
    • Server CPUs cost between $500 and $1600 each, while the GPU we used cost $550