
High Performance Computing on GPU Clusters

Timothy McGuiness a, Ali Khajeh-Saeed b, Stephen Poole c and J. Blair Perot b

a Lincoln Laboratories, Lexington, MA, 02420, USA
b Department of Mechanical and Industrial Engineering, University of Massachusetts, Amherst, MA, 01003, USA
c Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

Abstract

Commodity graphics cards have recently proven to be an inexpensive and effective way to accelerate some

scientific computations. Acceleration using multiple GPUs is more challenging, because the type of

algorithm/parallelism needed to couple many GPUs is very different from the algorithm/parallelism used to

efficiently utilize a single GPU. In this work, four very different applications are evaluated in both the

single and multiple GPU contexts. The algorithms are memory bound, like almost all scientific computing

algorithms, but the memory use differs considerably, allowing us to isolate which types of scientific

computing algorithms are likely to benefit from GPU acceleration, and which problems are more easily

solved on traditional CPU clusters.

Keywords: GPGPU, STREAM Benchmark, Smith-Waterman, Graph Analysis, Unbalanced Tree Search.

1. Introduction

The early 2000s saw the development of several high-level shading languages such as Cg [1], HLSL, and the OpenGL Shading Language [2], designed to help exploit the computational power of graphics hardware. These languages

dealt heavily in graphics-specific concepts such as textures and fragments, making it necessary for

scientific programmers to navigate an additional level of abstraction [3]. In late 2006, NVIDIA released

their Compute Unified Device Architecture (CUDA), providing a much more user-friendly general purpose

GPU (GPGPU) programming environment. Using only a few extensions to the C language, CUDA allows

programmers to easily create code for execution on graphics hardware. More recently, OpenCL has been introduced, allowing code to be written for GPU hardware from any vendor.

When used as a math accelerator for scientific algorithms, single GPUs often produce roughly an order of

magnitude performance increase over a similar generation dual or quad core CPU. The tradeoff involved

with GPU computing is that programming efficient code is more difficult, and the hardware is more

specific about when it performs well. This work is intended to delineate the region of applicability of GPU

hardware more definitively. We will show some cases with over 100x speedup over a CPU, and some

which show (despite our best efforts) a performance decrease compared to the CPU. The benchmarks

chosen for this study are interesting because they span a range of applications from very linear memory

access patterns, to regular but non-linear access patterns, to entirely random memory access dominated

algorithms.

This work also explores the algorithmic changes necessary to efficiently program software to use GPU

clusters with hundreds of GPUs. Because of its popularity, this work uses MPI (Message Passing Interface)

as the communication protocol between the GPUs. The paper also explores the performance of GPU

clusters based on dedicated scientific computing GPUs (Tesla), and a cluster built from the lower cost

commodity GPUs (GTX 295 and 480).

The four benchmarks were chosen to span a wide range of different scientific computing applications. The

STREAM benchmark accesses double precision memory linearly. It scales almost directly with the linear

memory access speed of the hardware, and provides a practical upper bound on the power consumption and operating temperature of the GPUs. Our next benchmark, DNA sequencing via the Smith-Waterman

algorithm, deals with sequence matching for very large sequences. This task is represented by the Scalable

Synthetic Compact Application (SSCA) #1 benchmark which performs different variants of the Smith-

Waterman algorithm. The classic Smith-Waterman algorithm fills in a large table one anti-diagonal row at


a time. When reformulated for the GPU architecture this benchmark shows 100x and 70x speedup over a

single core of a CPU for a single GTX 480 and GTX 295 GPU respectively, and a 5335x speedup when using 120 GPUs

(44x faster than a single GPU). The third application is the HPCS SSCA #2 benchmark which analyzes

very large graphs consisting of a set of nodes connected by a set of edges. Graphs are directed and

weighted, meaning edges have specific start and end nodes, as well as a given cost or weight value. This

application depends almost entirely on random memory accesses. Finally, the unbalanced tree search

displays both random memory accesses and the need for dynamic load balancing, since the tree is so large

it must be built 'on the fly'. We describe load balancing techniques both on a single GPU and when using

many GPUs.

Two machines were used for computations, Orion and Lincoln. Orion contains an AMD quad-core Phenom

II X4 CPU, operating at 3.2 GHz, with 4 x 512 KB of L2 cache, 6 MB of L3 cache and 8 GB of RAM. In

terms of GPUs, Orion contains four NVIDIA GTX 295 cards (occupying four PCIe 16x slots), each of which carries two GPUs sharing a single PCIe slot. When we refer to a single GTX 295 GPU we mean one of these two GPUs, which has 240 cores and a memory bandwidth of 111.9 GB/s. Orion therefore

typically has 8 GPUs. Also we replaced the first and second GPUs with GTX 480 and Tesla C2070 cards

in order to run some cases with these new GPUs. Orion is shown in figure 1. Results on Orion were

compiled using Microsoft Visual Studio 2005 (VS 8) under Windows XP Professional x64. The bulk of

NVIDIA SDK examples use this configuration.

Figure 1. Orion configuration and fan locations (Fan 6 is located on the side of the case, blowing air into the GPUs)

Lincoln is a TeraGrid/XSEDE GPU cluster located at NCSA. Lincoln has 96 Tesla S1070 units (384 GPUs). Lincoln's 192 servers each hold two quad-core Intel 64 (Harpertown) 2.33 GHz processors (dual socket), with 2 x 6 MB of L2 cache and 2 GB of RAM per core. Each server is connected to 2 Tesla processors via PCI-e Gen2 x8 slots. All code was written in C++ with NVIDIA's CUDA language extensions for the

GPU. The Lincoln results were compiled using Red Hat Enterprise Linux 4 (Linux 2.6.19) and the gcc

compiler [4].

2. STREAM Benchmark

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory

bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. The STREAM

benchmark is composed of four kernels. In the first kernel one vector is copied to another within the same

device (a = b, one read and one write). For the second kernel, vector entries are multiplied by a constant and the result is written to another vector (a = q*b, one read and one write). Kernel 3 adds two


vectors (a = b + c, two reads and one write). Finally, Kernel 4 is a combination of Kernels two and three,

sometimes referred to as a DAXPY or triad operation (a = b + q*c, two reads and one write).
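The four kernels can be sketched in CUDA as shown below; this is a minimal illustration rather than the benchmark's actual code, and the array names a, b, c and the scalar q are ours.

__global__ void stream_copy(double *a, const double *b, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i];                     // Kernel 1, Copy:  a = b
}
__global__ void stream_scale(double *a, const double *b, double q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = q * b[i];                 // Kernel 2, Scale: a = q*b
}
__global__ void stream_add(double *a, const double *b, const double *c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + c[i];              // Kernel 3, Add:   a = b + c
}
__global__ void stream_triad(double *a, const double *b, const double *c, double q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + q * c[i];          // Kernel 4, Triad: a = b + q*c
}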

This benchmark was computed using a single GPU operating on different vector sizes. Table 1 shows the

specifications for four types of NVIDIA GPUs. The GTX 295 and Tesla S1070 actually house two and four

GPUs respectively. However, the hardware specifications below are for one of these GPUs (1/2 of the 295

GTX and 1/4 of the Tesla S1070) [5, 6]. The GTX 480 and Tesla C2070 are the latest generation of GPUs

(Fermi architecture). The tests shown below used 64k blocks with 128 threads each, operating on double

precision vectors. Each GPU kernel was called 100 times, and each kernel performs the STREAM

operation 100 times in that kernel. The timings below are reported for a single STREAM operation (total

time / 10,000).
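The timing and bandwidth bookkeeping can be sketched as follows. This is illustrative only and is simplified in that the 100 inner repetitions are issued as repeated launches from the host rather than looped inside the kernel as in the runs above; it assumes the stream_triad kernel from the previous sketch is in the same file.

#include <cstdio>
#include <cuda_runtime.h>
// stream_triad as defined in the previous sketch.

int main() {
    const size_t n = 1 << 21;                      // 2M double precision elements
    double *a, *b, *c;
    cudaMalloc(&a, n * sizeof(double));
    cudaMalloc(&b, n * sizeof(double));
    cudaMalloc(&c, n * sizeof(double));

    const int threads = 128;                       // 128 threads per block
    const int blocks  = (int)((n + threads - 1) / threads);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    for (int call = 0; call < 100; ++call)         // 100 calls x 100 repetitions
        for (int rep = 0; rep < 100; ++rep)
            stream_triad<<<blocks, threads>>>(a, b, c, 3.0, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, t0, t1);
    double ms_per_op = total_ms / 10000.0;         // time for one STREAM operation
    double bytes     = 3.0 * n * sizeof(double);   // Triad moves 2 reads + 1 write per element
    printf("Triad: %.4f ms, %.1f GB/s\n", ms_per_op, bytes / (ms_per_op * 1.0e6));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}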

Table 1. NVIDIA hardware specifications for four different GPUs

Model          CUDA Cores   Memory (MB)   Theoretical Bandwidth (GB/sec)   Memory Interface Width (bits)   Max Power (W)
GTX 295        240          896           119.9                            448                             145
GTX 480        480          1536          177.4                            384                             250
Tesla S1070    240          4000          102                              512                             200
Tesla C2070    448          6000          144                              384                             238

2.1 Single GPU

Figures 2 and 3 show the single-GPU execution time and bandwidth for the GTX 295, GTX 480, Tesla S1070 and Tesla C2070. For vectors with lengths less than 10^5 elements, the time is nearly constant and the bandwidth is below the maximum value for the 10-series and 20-series GPUs respectively. However, for vector sizes larger than 10^5 the bandwidth is close to the maximum value and execution time increases linearly

with vector length. This shows that the startup cost of simply initiating a GPU kernel is high, and large

vector lengths are required for good GPU performance.


Figure 2. (a) Time and (b) Bandwidth for single NVIDIA GTX 295 and GTX 480 for the different STREAM kernels.

The Tesla S1070 has more memory (4 GB) than the 295 GTX (896 MB). However, Figure 3 shows that for

large vector lengths (greater than 5×10^7) the bandwidth begins to decrease. For the largest possible lengths on the Tesla S1070, bandwidth is approximately 50% of the maximum value. However, there is no efficiency loss for the Tesla C2070 when using large vector sizes. Figure 3 also shows that for small problem sizes the Tesla S1070 has better performance than the C2070. The kernel startup time for the C2070 is an order of magnitude


larger than its predecessor. The bandwidth achieved by the STREAM benchmark is consistently about 20%

below the theoretical peak predicted by NVIDIA for all GPUs.


Figure 3. Time and bandwidth for single NVIDIA Tesla S1070 and C2070 for four different STREAM kernels

2.2 Weak Scaling on many GPUs

In the weak scaling case, the vector length is constant per GPU (at 2M double precision elements). This

makes the number of operations constant per GPU as the number of GPUs increases. Figure 4 shows the results for weak scaling: Figures 4a and 4b show total bandwidth as well as bandwidth per GPU for various

numbers of GPUs. Because 2M is large enough for a single GPU to be efficient, the bandwidth is close to

the maximum value and the speedup is linear.


Figure 4. Results for weak scaling of the four STREAM benchmark kernels on Lincoln with 2M elements per GPU,

(a) Actual and ideal bandwidth for various numbers of GPUs, (b) bandwidth per GPU for various numbers of GPUs.

2.3 Strong Scaling

In the strong scaling case, the total vector length remains constant and the number of processors varies. This means an increased number of GPUs decreases the vector length per GPU. The total vector length used for the strong

scaling case was 32M. This results in 0.5M elements per GPU when 64 GPUs are used for the computation.

Figures 5a and 5b show MCUPS (millions of cell updates per second) and bandwidth per GPU for various

numbers of GPUs, respectively. Small numbers of GPUs are less efficient for this case because the vector length per GPU is so large (greater than 10^7) that memory accesses are slower (see figure 3b).



Figure 5. Results of strong scaling of the four STREAM benchmark kernels on Lincoln with 32M total elements,

(a) MCUPS for different numbers of GPUs, (b) bandwidth per GPU for various numbers of GPUs.

2.4 Power Consumption

Figures 6a and 6b show power consumption for the AMD quad-core Phenom II X4, operating at 3.2 GHz,

and for the GTX 295 GPUs, respectively. This test involved the STREAM Benchmark operating on double

precision vectors of length 2M per GPU. Power was measured at the source to Orion using a watt meter.

Note that 60 W (for figure 6a) is not what the whole machine draws when running idle. At idle it draws

close to 450 W (there are many fans).

Figure 6. Power consumption for the weak scaling STREAM benchmark on Orion with (a) the AMD CPU and (b) the GTX 295 GPUs. The measured linear fits are approximately Watt = 12.3 x (Cores) + 60 for the CPU and Watt = 115 x (GPUs) + 31 for the GPUs.

Each GPU uses approximately 115 W over idle consumption when running the STREAM Benchmark.

Also, it was found that the idle power for one GTX 295 (2 GPUs) is 71 W (or about 30 W per GPU). This

was ascertained by physically removing GPUs from the machine, and re-running the code. The NVIDIA

hardware specifications (table 1) imply that each Tesla S1070 GPU uses more power than the GTX 295

GPUs (200 W vs. 145 W maximum). One reason for this difference could be the amount of memory

supported. The Tesla S1070 has 4000 MB per GPU but the GTX 295 has only 896 MB per GPU. The Tesla

also contains its own power supply and cooling system, which may be playing a role in the difference in

consumption levels. We did not have access to the Tesla hardware (at NCSA) to measure its power

consumption directly.


2.5 GPU Temperature

The STREAM benchmark was computed on Orion (GTX 295 GPUs) to obtain a measurement of each GPU's temperature. The code was run with 8 GPUs for 331 seconds. Table 2 shows the initial and maximum temperatures for each GPU while the STREAM benchmark was running. The maximum allowable temperature for the GTX 295 listed on NVIDIA's website is 105 °C [5]. The maximum GPU temperatures range between 88 and 96 °C when running for this prolonged period of time.

Since most GPGPU applications involve many short kernel calls rather than this type of extended exertion,

GPU temperatures are typically lower than these maximum measured values.

Table 2. Temperatures for GTX 295 cards when running the STREAM benchmark for 331 seconds

Temperature         GPU 1   GPU 2   GPU 3   GPU 4   GPU 5   GPU 6   GPU 7   GPU 8
Initial Temp (°C)    66      64      72      68      72      67      63      61
Max Temp (°C)        90      88      96      92      96      93      91      88
ΔT (°C)              24      24      24      24      24      26      28      27

3. SSCA #1 – Sequence Matching

For two candidate sequences, A and B, sequence alignment attempts to find the best matching

subsequences for the pair. The best match is defined in equation 1,

$\max\left\{ \sum_{i=1}^{L} S\big(A(i), B(i)\big) - W(G_s, G_e) \right\} \qquad (1)$

where W is a gap function involving the gap start- and gap extension-penalties (Gs and Ge, respectively)

and the similarity score S, which is user-defined for this implementation. This research uses a simple

scoring system where matching items get a score of 5 and non-matching items have a score of -3. The gap

start-penalty is 8 and gap extension-penalty is 1.

The Smith-Waterman algorithm locates the optimal alignment by building a solution table [8]. The first

data sequence (the database) is typically seeded along the top row, with the second sequence (the test

sequence) in the first column. Table values are denoted as $H_{i,j}$, with i and j being the row and column

indices, respectively. Other table entries are set using the equation,

$$H_{i,j} = \max\left\{\begin{array}{l} H_{i-1,j-1} + S_{i,j} \\ \max_{1 \le k \le i}\big(H_{i-k,j} - G_s - kG_e\big) \\ \max_{1 \le k \le j}\big(H_{i,j-k} - G_s - kG_e\big) \\ 0 \end{array}\right. \qquad (2)$$

That is, an entry in the table is the maximum of: the entry diagonally to the upper left plus the similarity

score for the item in that row and column, the maximum of all the entries above the entry and in that same

column minus a gap function, and the maximum of all the entries to the left of the entry and in that same

row minus a gap function, and zero. It is not possible for table values to be negative. Additionally, the

dependencies of table values are such that they cannot be computed in parallel. Finding a particular value in

the table requires knowledge of all the values above and to the left of the value being computed. Figure 7

shows table value dependencies.


If equation (2) is naïvely implemented, then the column and row maximums (the second and third items in equation (2)) are repetitively calculated for each table entry, creating O(L_1 L_2 (L_1 + L_2)) computational work.

This can be reduced by retaining running row and column maxima [9], which reduces the work significantly but triples the algorithm's required storage space. The classic parallelization technique for this

algorithm is to work along anti-diagonals [10]. It should be clear from the dependency region that each

diagonal item can be calculated independently of the others. However, the available parallelism is limited by the length of the anti-diagonal, so the diagonal algorithm only performs efficiently on a single GPU when the shorter of the two sequence lengths is 30k or larger.

Sequence lengths of this size are uncommon in biological applications, but this is not the true problem with

the diagonal algorithm. The primary issue with the diagonal approach is its memory access patterns. On the

GPU it is very efficient to access up to 32 consecutive memory locations, and relatively (5-10x slower)

inefficient to access random memory locations such as those dispersed along the anti-diagonal (and its

neighbors). To get around this problem, it was shown that the Smith-Waterman algorithm can be

reformulated so calculations can be performed simultaneously one row (or column) at a time [11, 12]. Row

(or column) calculations allow GPU memory accesses to be consecutive, and therefore fast.
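To make the recurrence concrete, the following is a compact serial (CPU) reference sketch of equation (2) using the running row and column maxima of Gotoh [9] and the scores listed above (match +5, mismatch -3, gap start 8, gap extension 1). It is illustrative only and is not the row-parallel GPU reformulation of [11, 12]; the example sequences in main are arbitrary.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int smith_waterman(const std::string &a, const std::string &b)
{
    const int Gs = 8, Ge = 1;                          // gap start and extension penalties
    const size_t rows = a.size() + 1, cols = b.size() + 1;
    std::vector<int> H(rows * cols, 0);                // table values H(i,j)
    std::vector<int> E(rows * cols, 0);                // running "gap along the row" maximum
    std::vector<int> F(rows * cols, 0);                // running "gap along the column" maximum
    int best = 0;
    for (size_t i = 1; i < rows; ++i)
        for (size_t j = 1; j < cols; ++j) {
            int s = (a[i-1] == b[j-1]) ? 5 : -3;       // similarity score S(i,j)
            size_t id = i * cols + j;
            E[id] = std::max(E[id - 1]    - Ge, H[id - 1]    - Gs - Ge);
            F[id] = std::max(F[id - cols] - Ge, H[id - cols] - Gs - Ge);
            H[id] = std::max({0, H[id - cols - 1] + s, E[id], F[id]});
            best  = std::max(best, H[id]);
        }
    return best;                                       // largest table value
}

int main() { printf("%d\n", smith_waterman("CAGCCUCGCUG", "AAUGCCAUGCG")); return 0; }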

3.1 Results

The first kernel computes the Smith-Waterman table and saves the largest table values (200 of them) and

their locations. These serve as the end points of well-aligned sequences, but the sequences themselves are

not constructed or saved until Kernel 2. Because the traceback step is not performed in Kernel 1, the

amount of data needed in GPU memory from the table is minimal. For example, in the row-parallel version,

only the data from the previous row needs to be retained, making Kernel 1 both memory-efficient and

highly parallel on a fine scale. The results for four different GPUs are shown in Fig. 8. The weak scaling

timings are shown in Fig. 9a, for the case using a 2M-elements database per GPU, and a 128-element test

sequence. Obviously, the amount of work being performed increases proportionally with the number of

GPUs used. The total times are from 380–480 ms, and they increase slowly with the number of GPUs used.

Figure 9b shows speedups for Kernel 1 compared to a single-CPU version. Compared to a single core of

the CPU, the speedup is 100x for a single GPU, and 5335x for 120 GPUs. Figures 10a and 10b show

speedups for Kernel 2. Kernel 2 is not very sensitive to sequence length and does not show nearly as large

of a speedup. The complexity of kernel 2 means that the GPU performs at roughly the same speed as all the

cores on the CPU.


Figure 7. Dependency of the values in the Smith-Waterman table.

Figure 8. Kernel 1 (a) timings and (b) speedups versus problem size for the 9800 GT, 9800 GTX, GTX 295, Tesla S1070 and GTX 480 GPUs, relative to a single core of a 3.2 GHz CPU.

Kernel 1 performed particularly well on the GPU for the SSCA #1 benchmark because it was possible to

reformulate the Smith-Waterman algorithm to use sequential stride 1 memory accesses. Since each

sequential memory access can be guaranteed to come from a different memory bank these accesses will

occur simultaneously. The parallel scan was a critical component of being able to perform this

reformulation. The parallel scan operation may currently be under-utilized and might be effectively used in

many other scientific algorithms as well.

On the other hand, the speedups demonstrated by the already very fast Kernel 2 were not as impressive.

The naïve algorithms for these tasks are parallel but also MIMD. Because this kernel executes so quickly

already there was little motivation to find SIMT analogs and pursue better GPU performance. Fortunately,

every GPU is hosted by a CPU, so users are not forced to use a GPU if the task is not well suited to the

hardware. It is important to keep in mind that the GPU is meant to be a supplement to the CPU. This paper

is intended to give the reader insight into which types of tasks will do well when ported to the GPU and

which will not.


Figure 9. Weak scaling timings (a) and speedups (b) for Kernel 1 using various numbers of GPUs on Lincoln.


Figure 10. Weak scaling timings (a) and speedups (b) for Kernel 2 using various numbers of GPUs on Lincoln.

The issue of programming multiple GPUs is an interesting one, because it requires the use of a totally

different type of parallelism. A single GPU functions well with massive fine grained (at least 30k threads)

nearly SIMD parallelism. With multiple GPUs, located on different computers, communicating via MPI

and Ethernet cards, coarse grained MIMD parallelism is needed. Due to this, all multi-GPU implementations partition the problem coarsely into subsets for the GPUs, and then use fine grained

parallelism within each GPU.

4. SSCA #2 – Graph Analysis

The graphs for this benchmark consist of a set of nodes, connected by a set of directed edges. The size of

the graph is given by the user-defined variable SCALE. The number of nodes in the graph is 2^SCALE, and the number of edges is eight times the number of nodes [13]. A node's degree value is the number of edges

pointing out from it. Because there are eight times as many edges as nodes, the average nodal degree is

eight. The histogram in figure 11 shows the number of nodes for any given degree value. Graphs are

constructed so that there are a few nodes with very high degrees. These graphs are examined using four

timed kernels, described in the following sections.
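For example, a SCALE 21 graph (the size used for the results below) has 2^21 = 2,097,152 nodes and 8 x 2^21 = 16,777,216 directed edges.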

Figure 11. Statistical distribution of edges for a typical SCALE 21 graph.

4.1 Kernel 1 – Graph Construction

The original edge list is presented as three lists; start node, end node and edge weight. The purpose of the

first kernel is to convert this data structure into the format that the remaining kernels will use. The graph

may be represented in any format, but this new data structure cannot be altered after Kernel 1.


The new structure used in the GPU code is a node-to-node (N2N) system. This structure uses two arrays – a

shorter list of pointers, and a longer list of children. In the pointer list, the entry at position p is a location q

in the child array; the location where parent p's children are saved. The pointer list is 'number of nodes' long, since each node in the graph can be treated as a parent, and the child list is 'number of edges' long, because it stores all the end nodes from the original edge list. A simple way to think of building N2N is by sorting the original list according to start node. In this case, the list of start nodes would then consist of large groups of 1's, 2's, 3's, etc. This array is condensed into a list of where the 1's, 2's and 3's – and hence, their children

– begin. An illustration of this conversion is shown in figure 12.

Figure 12. Conversion from (a) original edge list, to (b) sorted by start node, and finally to (c) N2N format.

The code performs this conversion in three steps. First, nodal degrees are counted by looping over each

edge, and incrementing an array of counters corresponding to start nodes. Next, this list of degrees is

converted into the pointer list, using an operation known as a parallel scan [14]. The scanned pointer value

for a node is the sum of the degree values for all preceding nodes. The final step builds the child list using

the newly created pointer list. A loop finds each edge's start node and the corresponding pointer, and inserts the edge's end node into the child array at that location. Care must be taken to offset individual end

nodes from the pointer to ensure unique locations in the child list.
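A sketch of these three steps is given below, using illustrative (not benchmark) names; the parallel scan step is done here with Thrust, which is one of several ways to perform the scan of [14].

#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void count_degrees(const int *startNode, int *degree, int nEdges)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < nEdges) atomicAdd(&degree[startNode[e]], 1);   // one counter per parent node
}

__global__ void fill_children(const int *startNode, const int *endNode,
                              const int *pointer, int *cursor, int *child, int nEdges)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < nEdges) {
        int p   = startNode[e];
        int ofs = atomicAdd(&cursor[p], 1);                // unique offset among p's children
        child[pointer[p] + ofs] = endNode[e];
    }
}

void build_n2n(const thrust::device_vector<int> &startNode,
               const thrust::device_vector<int> &endNode,
               thrust::device_vector<int> &pointer,        // sized 'number of nodes' by the caller
               thrust::device_vector<int> &child,          // sized 'number of edges' by the caller
               int nNodes)
{
    int nEdges = (int)startNode.size();
    thrust::device_vector<int> degree(nNodes, 0), cursor(nNodes, 0);
    int threads = 128, blocks = (nEdges + threads - 1) / threads;

    count_degrees<<<blocks, threads>>>(thrust::raw_pointer_cast(startNode.data()),
                                       thrust::raw_pointer_cast(degree.data()), nEdges);
    // pointer[p] = sum of the degrees of all preceding nodes (exclusive parallel scan)
    thrust::exclusive_scan(degree.begin(), degree.end(), pointer.begin());
    fill_children<<<blocks, threads>>>(thrust::raw_pointer_cast(startNode.data()),
                                       thrust::raw_pointer_cast(endNode.data()),
                                       thrust::raw_pointer_cast(pointer.data()),
                                       thrust::raw_pointer_cast(cursor.data()),
                                       thrust::raw_pointer_cast(child.data()), nEdges);
}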

It is worth noting that Kernel 1 not only builds the standard N2N structure, which follows the true

direction of the edges (N2Nout), but it also builds a second N2N structure that goes “against the grain”

(N2Nin). This second version has proven useful for the parallel implementations of Kernels 2 and 4.

4.2 Kernel 2 – Find Max-Weight Edges

Kernel 2 searches through edge weights and picks out those with the largest possible value. The use of the

N2N data structure actually makes this task more challenging. In the original edge list, weights are paired

with data for both nodes. In the new structure, weights are only associated with one node, and finding the

second is non-trivial.

For this algorithm, threads search the weight list in the N2Nin structure for max-weight values. When a

max-weight is found, the corresponding node (the edge's start node) is saved. Using this node and the N2Nout pointer list, the GPU finds the location of that node's children, and searches this limited region for a max-weight edge. When the weight is found, the corresponding node (the edge's end node) is paired with

its start node.

Due to the benchmark's specifications regarding number of edges and weight distribution, there are on

average only eight max-weight edges in any graph, regardless of SCALE. As a result, this kernel has

relatively little work to do, and is easily the fastest of the four timed tasks.

4.3 Kernel 3 – Subgraph Construction

The third kernel of SSCA 2 is designed to construct subsets (or subgraphs) of the original graph, using the

edges found in Kernel 2 as starting points. Kernel 3 starts at a max-weight edge, and moves out a user-

specified number of levels from it.


The final output of Kernel 3 is a list of nodes representing the members of the subgraph. This list,

called the queue, is built in sections, which are filled as the code steps out to each new level of the

subgraph. The code reads parent nodes from the current level, and fills in their children in the next level.

The most challenging part of parallelizing this code is determining where to insert children into the queue.

Each thread needs its own dedicated space in the queue, and must know where that space begins. This is

done by storing count and point arrays which correspond to the queue. When a node is added to the queue,

its number of children is recorded in the count array. After each level is filled, this array is scanned into the

point array, which then tells the queue location for that node's children. This process is illustrated in figure

13. Before a child is inserted into the queue, the code tests to ensure it has not already been added. This

eliminates extraneous queue entries, limits the amount of required memory, and provides a natural stopping

point for the following kernel.

Figure 13. Construction of the Kernel 3 queue. (a) the first node in the queue, with its number of children, and the

location they will be written to, (b) the second generation, and the scanned point array (c) part of the third generation.

4.4 Kernel 4 – Betweenness Centrality

The goal of Kernel 4 is to determine which nodes have the highest connectivity, or betweenness centrality

(BC). For a particular starting node, partial BC scores are calculated for all other nodes. This process is

repeated using a subset of nodes as starting points. A node's final BC value is the sum of all its partial

scores. This is easily the most computationally-taxing portion of the SSCA 2 code.

As proposed by Brandes [15], and Bader and Madduri [16], the BC algorithm consists of two main steps.

The outsweep assigns to each node its distance from the start node and its number of shortest paths back to the start node. This process is conducted in the same manner as Kernel 3, moving out and visiting new generations, level by level.

The only differences are that nodal values now need to be assigned in addition to filling the queue, and that the

algorithm must continue for as long as it takes to visit all nodes. Assigning depth and shortest path values is

trivial, and since each node can only appear once in the queue, the final queue level will be empty, and the

algorithm will stop on its own.

The insweep works through the queue backwards, level by level. As the code moves in (from child to

parent) along shortest paths, the child's BC and shortest path values are used to update the BC score for

the parent. The N2Nin data structure – carefully marked during the outsweep – is used to determine if

parents lie along these shortest paths. In equation 3, υ and ω represent parent and child nodes, respectively.

A node's temporary BC score (reset after each outsweep / insweep pair) is represented by δ, while σ is a node's shortest path value.

$\delta(\upsilon) = \sum_{\omega} \frac{\sigma_{\upsilon}}{\sigma_{\omega}} \big(1 + \delta(\omega)\big) \qquad (3)$

where the sum runs over the children ω for which the parent υ lies on a shortest path back to the start node.

One of the problems frequently encountered during SSCA 2 is when multiple GPU threads attempt

simultaneous reads or writes to the same memory location. This never occurs in serial versions since only a

single core is active, working on a single data element. The simplest solution to this problem is atomic

functions, which are built into the CUDA API. These perform a simple locking procedure to serialize all


threads in the device attempting concurrent reads/writes. These are essential to several parts of the SSCA 2

code – particularly for finding nodal degrees in Kernel 1. However, these operations only work with integer

values, and are therefore limited.

4.5 Results

The parallel SSCA #2 code uses MPI to run on up to 128 GPUs simultaneously. Comparisons to a single

CPU refer to the HPCS optimized code provided with the benchmark. The GPU timings reported in figure

14 are from trials conducted on the Lincoln cluster. Results on up to four cards on the Orion machine can

be found in [17].


Figure 14. Strong scaling speedups for Kernels 1–4 (a) relative to a single CPU core,

and (b) relative to a single GPU. All tests were run for a SCALE 21 graph.

The results are also compared with the HPC CPU code that is optimized for multi-core CPUs. Figure 15 shows that with an equal number of GPUs and CPU cores, we achieved speedups of 30x, 0.4x and 4x for Kernels 1, 3 and 4 respectively. Kernel 4 is by far the most computationally demanding of the four kernels and only shows a speedup of 4x. This means the GPU is performing roughly like a quad-core CPU for this benchmark.


Figure 15. Speedup of the GPUs compared with an equal number of CPU cores running the optimized HPC CPU code (Kernels 1, 3 and 4).

Several important conclusions can be drawn from this data. First, GPU performance is better for larger

problem sizes. This is a trend that is clearly evident when speedups are compared for multiple SCALEs,

and when GPU and CPU results are directly compared (not pictured). For small graphs, the GPU was not

able to produce the expected speedups; however, as SCALE increased, timings began moving closer to

the theoretical performance. In Kernel 2, each GPU operates on a portion of the N2N list. As the number of


GPUs increases, the workload decreases. The results in figure 14 show that for fewer GPUs, Kernel 2's

performance is as expected. When more GPUs are used, and individual workload decreases, performance

declines considerably.

A second, related conclusion is that given enough work, performance scales with number of processors

used. While this may seem obvious, it is not always an easy relationship to attain. Inefficient algorithms,

overhead for memory copies and kernel invocations, and restrictions on problem size can all erode the

expected efficiency of the GPU. The results for Kernels 3 and 4 show that for large SCALEs, doubling

processing power does indeed halve the required time almost exactly. It is not surprising that this trend is

apparent on the two most computationally-intense kernels. It is worth noting that the plateau in

performance for Kernel 3 is due to the number of max-weight edges – and hence subgraphs to be built – in

the SCALE 21 graph. There were exactly eight of these edges in this graph, and since each card builds one

subgraph at a time, using any more than eight GPUs meant they were idle in this case.

A final conclusion is that MPI carries a high cost. While MPI allows large scale cluster parallelization, it

can also slow down program execution considerably, which was particularly evident in the results for

Kernel 1. Compared to the other kernels, this portion of the code requires the most MPI communication in

terms of number of calls and amount of data transferred. Here, more GPUs means a reduced workload per

card, as well as a higher volume of MPI interaction, making this kernel particularly inefficient for multi-

GPU runs. A way around this is to try and hide the MPI and CUDA communication using non-blocking

MPI sends and receives and asynchronous CUDA memory copies. These operations are performed in the

background while computational tasks can be worked on at the same time.
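A sketch of this overlap is shown below, with illustrative buffer and kernel names; it assumes the host buffers are pinned (allocated with cudaHostAlloc) so that cudaMemcpyAsync can proceed in the background, and it is not the benchmark's actual communication code.

#include <mpi.h>
#include <cuda_runtime.h>

void exchange_and_compute(double *d_sendBuf, double *h_sendBuf,
                          double *h_recvBuf, int count, int peer,
                          cudaStream_t copyStream)
{
    MPI_Request reqs[2];

    // Stage the outgoing data and post the non-blocking exchange.
    cudaMemcpyAsync(h_sendBuf, d_sendBuf, count * sizeof(double),
                    cudaMemcpyDeviceToHost, copyStream);
    cudaStreamSynchronize(copyStream);             // h_sendBuf must be ready before MPI_Isend
    MPI_Isend(h_sendBuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recvBuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    // Work on data that does not depend on the exchange while MPI proceeds.
    // independent_kernel<<<grid, block>>>(...);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);     // communication finishes in the background
}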

5. Unbalanced Tree Search

The Unbalanced Tree Search (UTS) benchmark performs an exhaustive search on an unbalanced tree. The

tree is generated on the fly using a splittable random number generator (RNG) that allows the random

stream to be split and processed in parallel while still producing a deterministic tree. There are two kinds of

trees evaluated in this work, binary trees and geometric trees (Fig. 16). The binary tree is based on a

probability that each node can have children (or not). For the binary tree, each node either has 8 children or

none at all. For the geometric tree, each node has 1 to 4 children with the number of children being

assigned randomly. These geometric trees are terminated at a predefined level: nodes deeper than the terminating level have no children.

Figure 16. Representations of (a) a binary tree, with nodes having 0 or 8 children,

and (b) a geometric tree, with 1-4 randomly assigned children.

There are two well-known schemes for dynamic load balancing, work sharing and work stealing. In a work

sharing approach there is a global shared queue, and each processor has its own chunk of data. If the

number of nodes on a processor increases beyond a fixed number, then it will start to write the extra data to

the shared queue. Similarly, if there are not enough nodes on a processor, it will start to read nodes from the

queue. Conversely, in work stealing there is no central queue. When there are not enough nodes for a

processor to work on, it borrows nodes directly from processors that have too many. The advantage of work

stealing is that there is no communication when all processors are working on their own data set, making

this a stable scheme. In contrast, work sharing can be unstable because it requires load balancing messages

to be sent even when all processors have work to do [18]. Unfortunately, neither approach is particularly

well-suited for the GPU because it is not possible to communicate directly between two GPUs. All


communication must go through the CPU in order to use MPI. Copying data from the GPU to the CPU or

vice versa is expensive and should be avoided. To circumvent these problems, load balancing is divided

into two parts, load balancing between the CPUs and load balancing on the GPU.

Each computational team is composed of one CPU and one GPU – each with its own queue. Each GPU

works on computation while the CPUs perform the load balancing via MPI. After launching the GPU

kernel, the CPUs begin load balancing amongst themselves. For this CPU balancing, they rank themselves

by nodes per process, and then begin sending work to one another based on this ranked list. The protocol

for load redistribution is that the CPU with the most work shares with the CPU with the least amount of

work, the CPU with the second most work shares with the CPU with the second least work, and so on. When a GPU finishes its work, it begins reading new data from the CPU. If a GPU has too much work, it

will periodically stop and deliver that data to the CPU to be redistributed elsewhere. The GPU memory

copies to and from the CPU are easily overlapped with GPU computation using the cudaMemcpyAsync

command [19].

Figure 17. Load balancing algorithm for the Unbalanced Tree Search. Each pass of the loop launches UTS_Kernel on Tree1, the CPUs load balance via MPI, and an asynchronous copy brings new work from the CPU to the GPU into Tree2; two checks follow: if (Nodes < C1) then Tree1 = Tree2, and if (Nodes > C2) an asynchronous copy returns surplus work from the GPU to the CPU.

The last box in figure 17 represents two check conditions. The whole UTS algorithm is inside the while

loop. If the number of nodes on every GPU is zero, the program terminates. Figure 18 shows

results for two different tree searches using various numbers of GPUs. The binary tree is started with 5000

nodes and the maximum depth of the tree is 653. The total number of nodes is 1,098,896. The geometric

tree is started with four nodes and has a maximum depth of 12.
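A host-side sketch of the driver loop in figure 17 is given below. The kernel launch, the CPU balancing helper, and the thresholds C1 and C2 are placeholders, not the benchmark's actual routines; they only illustrate how the asynchronous copies and the MPI balancing overlap the GPU work.

#include <mpi.h>
#include <cuda_runtime.h>
#include <utility>                                   // std::swap

const long C1 = 1000, C2 = 100000;                   // hypothetical refill and spill thresholds

void uts_driver(long &myNodes, int *d_tree1, int *d_tree2,
                int *h_in, int *h_out, size_t bufBytes,
                cudaStream_t compute, cudaStream_t copy)
{
    long totalNodes = 1;
    while (totalNodes > 0) {
        // uts_kernel<<<grid, block, 0, compute>>>(d_tree1, ...);   // GPU expands its part of the tree
        // load_balance_cpus(myNodes, h_in, h_out);                 // CPUs rank and pair up via MPI meanwhile
        cudaMemcpyAsync(d_tree2, h_in, bufBytes, cudaMemcpyHostToDevice, copy);
        cudaStreamSynchronize(compute);                             // wait for the kernel, not the copies
        if (myNodes < C1) std::swap(d_tree1, d_tree2);              // starved: adopt the staged work (Tree1 = Tree2)
        if (myNodes > C2)                                           // overloaded: export surplus nodes to the CPU
            cudaMemcpyAsync(h_out, d_tree1, bufBytes, cudaMemcpyDeviceToHost, copy);
        // Terminate only when every GPU's queue is empty.
        MPI_Allreduce(&myNodes, &totalNodes, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    }
}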


Figure 18. Strong scaling speedups for the unbalanced tree search (a) relative to a single CPU core

and (b) relative to a single GPU using the GTX 295 GPUs (Orion).


Because the binary tree starts with only 5,000 nodes, there is not enough work initially for even a single

GPU. Also, the random number generator is so fast that it is not possible to successfully hide load

balancing. For the geometric tree, again, there is not enough work to require multiple cards until the tree

grows beyond level 7. Additionally, once this level is reached, the geometric tree begins to grow very fast,

so running tests for large trees requires huge amounts of memory. Finally, like the binary tree, the random

number generator is very fast compared to data transfers via MPI. For all binary tree searches and for the

first 7 levels of the geometric tree, the CPU is capable of putting all the data in its cache. For load

balancing, the GPU has to exit the kernel and write data to global memory that is 100 times slower than the

CPU‟s cache memories. So for this application, with very random memory accesses, the GPU is

performing similarly to a single CPU core. In addition, trying to parallelize the GPU algorithm does not

help the performance.

6. Discussion

This work shows that the performance of GPUs for high performance computing problems is highly

dependent on the type of memory access an algorithm is using. The STREAM benchmark, which requires

only long linear memory accesses, showed a 25x increase over one core of a CPU. The SSCA#1 benchmark

achieved a 100x increase once the standard algorithm was reformulated to better utilize linear memory

accesses. This required a new form of the Smith-Waterman algorithm and the use of a parallel scan

operation. The graph benchmark (SSCA #2) could not be reformulated. It achieved a speedup over a single

CPU core (for the important Kernel 4) of only about 4x. Similarly, the unbalanced tree search was only

marginally faster than (or sometimes slower than) the CPU, because of the large number of random

memory accesses. The performance of the unbalanced tree search could be improved by using a much

larger tree, and by having a tree search that does more work per node than just traversing the tree.

The evolution of algorithms from a single GPU to many GPUs is nontrivial. The type of large scale and

coarse parallelism required to use many GPUs connected via a relatively slow interconnect (and MPI), is

completely different from the very fine grained parallelism which ports easily to the many threads on

each GPU. Algorithms that run on GPU clusters therefore require parallelism to be present, and efficiently

captured by the algorithm, on two very different levels. This will make the problem of automatic compiler

parallelization for GPU clusters an even more difficult problem than it is on existing CPU clusters.

Our tests of temperature and power consumption on the GPUs show that GPUs are not as attractive on a

speed per Watt basis as they are on a speed per dollar basis. For the STREAM benchmark each GPU is

using just under 5 times more power than a single CPU core. So on a speed per Watt basis the GPU is only

about 5x better than a CPU for this benchmark. It is not presented in this paper, but we have noted that

power consumption is closely related to memory bandwidth. The other benchmarks do not draw as much

power as the STREAM benchmark, perhaps because they cannot achieve the same bandwidth for random

memory access transactions.

Acknowledgements

This work was supported by the Department of Defense and used resources from the Extreme Scale

Systems Center at Oak Ridge National Laboratory. Some of the computations were performed on the NSF

TeraGrid/XSEDE supercomputer, Lincoln, located at NCSA.

References

[1] Mark, W.R., et al. Cg: a system for programming graphics hardware in a C-like language. in

International Conference on Computer Graphics and Interactive Techniques. 2003. San Diego,

California: Association for Computing Machinery.

[2] Kessenich, J., The OpenGL Shading Language. 1.50.09 ed, ed. D. Baldwin and R. Rost. 2009.

[3] Owens, J.D., et al., A Survey of General-Purpose Computation on Graphics Hardware. Computer

Graphics Forum, 2007. 26(1): p. 80-113.

[4] http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64TeslaCluster/

[5] http://www.nvidia.com/object/product_geforce_gtx_295_us.html


[6] http://www.nvidia.com/object/product_tesla_s1070_us.html

[7] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.

[8] T. F. Smith, M. S. Waterman, Identification of common molecular subsequences, J Mol Biol 147

(1981) 195-197.

[9] O. Gotoh, An Improved Algorithm for Matching Biological Sequences, J Mol Biol 162 (1982)

705-708.

[10] Storaasli, Olaf, Accelerating Science Applications up to 100X with FPGAs, Proc. of 9th Int'l

Workshop on State-of-the-Art in Scientific and Parallel Computing, Trondheim, Norway, May

13-16, 2008.

[11] A. Khajeh-Saeed, S. Poole and J. B. Perot, Acceleration of the Smith–Waterman algorithm using

single and multiple graphics processors, J. Comput. Phys. 229 (2010) 4247–4258

[12] A. Khajeh-Saeed, J. B. Perot, GPU-Supercomputer Acceleration of Pattern Matching, GPU

Computing Gems Emerald Edition, Morgan Kaufmann Publishers, Chapter 13, January 2011,

185-198.

[13] HPCS, HPC Scalable Graph Analysis Benchmark, Version 1.0, D.A. Bader, et al., Editors. 2009.

[14] Harris, M., Parallel Prefix Sum (Scan) with CUDA, in NVIDIA CUDA SDK, Version 2.3, 2009:

NVIDIA Corporation

[15] Brandes, U., A faster algorithm for betweenness centrality. Journal of Mathematical Sociology,

2001. 25(2): p. 163-177.

[16] Bader, D.A. and K. Madduri. Parallel Algorithms for Evaluating Centrality Indices in Real-

World Networks. in Parallel Processing. 2006. Columbus, Ohio.

[17] T. McGuiness and J.B. Perot, Parallel Graph Analysis and Adaptive Meshing using Graphics

Processing Units, 2010 Meeting of the Canadian CFD Society, London, Ontario, 2010.

[18] J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C. Tseng, Dynamic Load Balancing

of Unbalanced Computations Using Message Passing, Parallel and Distributed Processing

Symposium, 2007.

[19] NVIDIA CUDA Programming Guide, Version 4.0. 2011: NVIDIA Corporation.