Tutorial: High Performance SBSE Using Commodity Graphics Cards
Simon Poulding, University of York, UK
SSBSE, September 2012
© Simon Poulding & The University of York, 2012
SBSE and High Performance Computing
Two routes to parallelisation:
- the entire search algorithm is parallelisable
- operations within the algorithm (VARIATION, EVALUATION, SELECTION) are parallelisable

[Diagram: the VARIATION-EVALUATION-SELECTION loop, shown both replicated across parallel instances and with individual operations within one loop parallelised.]
Distributed Computing
Multicore Computing
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
- Parallelising Search Algorithm
- Parallelising Fitness Evaluation
- Parallelising Software Execution
GPU Cards
[Chart: GFLOP/s (single precision) against release date, 2008-2013, showing technical innovation across the GeForce GTX 280, GTX 480, GTX 580, and GTX 680.]
Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012
General Purpose Computing on GPUs (GPGPU)
CUDA: NVIDIA GPUs (most since 2009)
OpenCL: NVIDIA GPUs, AMD GPUs, Intel HD GPUs, Intel Core CPUs, other vendors ...
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
- Parallelising Search Algorithm
- Parallelising Fitness Evaluation
- Parallelising Software Execution
Physical Architecture
[Diagram: the GPU comprises streaming multiprocessors, each with registers, shared memory, and ‘cores’, plus global memory in DRAM; the host CPU has its own system memory in DRAM.]
Adapted from “CUDA C Best Practices Guide”, NVIDIA, May 2012
Logical Architecture
[Diagram: threads grouped into blocks; each thread has local memory, each block has shared memory, and all blocks access global memory.]
Mapping Logical to Physical
[Diagram: each block is executed on one streaming multiprocessor, using that multiprocessor's registers, shared memory, and ‘cores’.]
CUDA Performance Features
single-instruction multiple-thread
hardware multithreading
coalesced memory access
Single-Instruction Multiple-Thread
[Diagram: threads executing the same instruction in lockstep, grouped so that 1 warp = 32 threads.]
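The warp grouping can be made concrete with a little index arithmetic. A sketch in plain C: the warp size of 32 matches the slide, but `warp_of` and `lane_of` are illustrative names, not CUDA API functions.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp, as on the slide */

/* which warp within its block a thread belongs to */
static unsigned warp_of(unsigned thread) { return thread / WARP_SIZE; }

/* the thread's position (lane) within its warp */
static unsigned lane_of(unsigned thread) { return thread % WARP_SIZE; }
```

Because all 32 threads of a warp execute the same instruction, divergent branches within a warp serialise, which is why warp-aligned control flow matters for performance.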
Hardware Multithreading & Occupancy
[Diagram: many warps resident on one streaming multiprocessor; the hardware switches between them to hide latency, with occupancy limited by the registers and shared memory each warp consumes.]
Coalesced Memory Access
[Diagram: adjacent threads in a warp reading adjacent global memory locations, so that the accesses combine into a single memory transaction.]
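To see why coalescing matters, count how many memory segments one warp's accesses touch. A C sketch assuming 128-byte transaction segments and 4-byte elements; the actual segment size varies by compute capability, so treat the numbers as illustrative.

```c
#include <assert.h>

#define WARP_SIZE 32
#define SEGMENT_BYTES 128   /* illustrative transaction size */
#define ELEM_BYTES 4        /* sizeof(int) */

/* number of distinct segments touched when thread t reads element t*stride */
static int segments_touched(int stride) {
    int count = 0;
    long last = -1;
    for (int t = 0; t < WARP_SIZE; ++t) {
        long seg = (long)t * stride * ELEM_BYTES / SEGMENT_BYTES;
        if (seg != last) { ++count; last = seg; }
    }
    return count;
}
```

With stride 1 the whole warp fits in one 128-byte segment (one transaction); with stride 32 every thread lands in its own segment (32 transactions for the same amount of useful data).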
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
- Parallelising Search Algorithm
- Parallelising Fitness Evaluation
- Parallelising Software Execution
Typical CUDA Application Pattern
1. memory copy: host copies input data from system memory to device global memory
2. kernel launch: threads running kernel code on the device
3. kernel completion
4. memory copy: host copies results from device global memory back to system memory
Example Problem
a       b     c = a * b
382     17    ?
1124    17    ?
30      17    ?
2781    98    ?
824     98    ?
4510    98    ?
4088    31    ?
...     ...   ...

(256 x 64 rows in total; each block of 256 rows shares one value of b)
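As a reference for the device kernel that follows, here is the same computation on the CPU in plain C, with the 256 x 64 layout from the slide: one value of b shared by each group of 256 values of a. The function name is illustrative.

```c
#include <assert.h>

#define NUM_THREADS 256   /* threads per block */
#define NUM_BLOCKS   64   /* number of blocks */

/* c[g] = a[g] * b[block], where block = g / NUM_THREADS */
static void example_cpu(const int *a, const int *b, int *c) {
    for (int g = 0; g < NUM_THREADS * NUM_BLOCKS; ++g)
        c[g] = a[g] * b[g / NUM_THREADS];
}
```

The kernel version maps this loop onto 64 blocks of 256 threads, with one thread computing one element of c.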
Kernel Code (device-side)
__global__ void exampleKernel(int * a, int * b, int * c) {
    __shared__ int sb;

    const unsigned int thread  = threadIdx.x;
    const unsigned int block   = blockIdx.x;
    const unsigned int gThread = block * blockDim.x + thread;

    // the first thread in each block loads the block's value of b
    // into shared memory
    if (thread == 0) {
        sb = b[block];
    }

    // wait until sb is set before any thread in the block uses it
    __syncthreads();

    c[gThread] = a[gThread] * sb;
}
Launching a Kernel (host-side)
const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;

dim3 gridD(numBlocks, 1, 1);
dim3 blockD(numThreads, 1, 1);

exampleKernel<<<gridD,blockD>>>(a, b, c);
Allocating and Copying Memory (host-side)
const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;

int *a, *b, *c;   // device pointers

cudaMalloc((void **)&a, numThreads * numBlocks * sizeof(int));
cudaMalloc((void **)&b, numBlocks * sizeof(int));
cudaMalloc((void **)&c, numThreads * numBlocks * sizeof(int));

cudaMemcpy(a, inputA, numThreads * numBlocks * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(b, inputB, numBlocks * sizeof(int), cudaMemcpyHostToDevice);
Putting It All Together
__global__ void exampleKernel(int * a, int * b, int * c) {
    ...
}

int main(...) {
    ...
    cudaMalloc(...);
    cudaMemcpy(...);
    ...
    exampleKernel<<<gridD,blockD>>>(a, b, c);
    ...
    cudaMemcpy(...);
    ...
}
Build Process
nvcc splits each CUDA source file into host source and device source. The device source is compiled to device intermediate code (PTX) and device executable code (cubin), which are embedded in the host source. The host source, together with any non-CUDA source files, is then built into the host executable by the standard compiler and linker.
Adapted from “CUDA Compiler Driver NVCC”, NVIDIA, May 2012
Compute Capability
compute capability                        1.0        1.1        1.2        1.3        2.x          3.0          3.5
atomic functions (global memory)          No         Yes        Yes        Yes        Yes          Yes          Yes
atomic functions (shared memory)          No         No         Yes        Yes        Yes          Yes          Yes
warp vote functions                       No         No         Yes        Yes        Yes          Yes          Yes
double precision floating point           No         No         No         Yes        Yes          Yes          Yes
additional fence and sync functions       No         No         No         No         Yes          Yes          Yes
max number of threads per block           512        512        512        512        1024         1024         1024
number of registers per multiprocessor    8K         8K         16K        16K        32K          64K          64K
max shared memory per multiprocessor      16KB       16KB       16KB       16KB       48KB         48KB         48KB
local memory per thread                   16KB       16KB       16KB       16KB       512KB        512KB        512KB
max number of instructions per kernel     2 million  2 million  2 million  2 million  512 million  512 million  512 million
Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012
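The per-multiprocessor limits in the table translate directly into an upper bound on resident threads, and hence on occupancy. A C sketch using the compute capability 2.x register figure (32K registers per multiprocessor); the registers-per-thread value and the hardware thread cap passed in are hypothetical inputs you would take from the compiler's output and the device's documentation.

```c
#include <assert.h>

/* upper bound on resident threads per multiprocessor:
   limited either by the register file or by a hardware cap */
static int max_resident_threads(int regs_per_sm, int regs_per_thread,
                                int hw_max_threads) {
    int by_regs = regs_per_sm / regs_per_thread;
    return by_regs < hw_max_threads ? by_regs : hw_max_threads;
}
```

For example, a kernel using 32 registers per thread on a 2.x device (32K = 32768 registers per multiprocessor) is register-limited to 1024 resident threads, however many the hardware could otherwise schedule.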
Additional Tools and Libraries
Development Tools:
- debugger
- memory checker
- profiler

CUDA Libraries:
- linear algebra (CUBLAS)
- sparse matrices (CUSPARSE)
- random number generation (CURAND)
- fast Fourier transform (CUFFT)
- Thrust
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
- Parallelising Search Algorithm
- Parallelising Fitness Evaluation
- Parallelising Software Execution
Bayesian Optimisation Algorithm
Ising Spin Glass
[Diagram: a lattice of spins, each either +1 or -1.]
Implementation
VARIATION: build Bayesian network model (CUDA kernel)
EVALUATION: calculate Ising spin glass energy (CUDA kernel)
SELECTION: restricted tournament replacement (CUDA kernel)
Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011
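For reference, the evaluation step in plain C: the energy of an Ising spin glass is E = -Σ J_ij s_i s_j over lattice neighbour bonds. The case study uses 3D lattices with couplings taken from the problem instance; this 2D sketch with caller-supplied couplings just shows the shape of the computation, and all names are illustrative.

```c
#include <assert.h>

#define SIDE 4   /* lattice side; the case study uses 3D lattices up to 24x24x24 */

/* E = -sum over right/down neighbour bonds of J * s_i * s_j,
   with periodic boundaries */
static int ising_energy(int spin[SIDE][SIDE],
                        int jr[SIDE][SIDE],   /* coupling to right neighbour */
                        int jd[SIDE][SIDE]) { /* coupling to down neighbour */
    int e = 0;
    for (int y = 0; y < SIDE; ++y)
        for (int x = 0; x < SIDE; ++x) {
            e -= jr[y][x] * spin[y][x] * spin[y][(x + 1) % SIDE];
            e -= jd[y][x] * spin[y][x] * spin[(y + 1) % SIDE][x];
        }
    return e;
}
```

Each bond term is independent of the others, which is what makes this evaluation a natural fit for a one-thread-per-bond (or per-site) CUDA kernel followed by a reduction.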
Results
[Chart: GPU speed-up (scale 0-100x) against problem size: 8x8x8, 12x12x12, 16x16x16, 24x24x24.]
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
- Parallelising Search Algorithm
- Parallelising Fitness Evaluation
- Parallelising Software Execution
Multi-Objective Test Suite Minimisation
        t1   t2   t3   ...  tl
r1       1    0    1   ...   0
r2       1    0    0   ...   1
r3       0    1    1   ...   1
...     ...  ...  ...  ...  ...
rm       1    1    0   ...   0
cost     9    7    4   ...   6

(rows: requirements; columns: test cases)
Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011
Implementation
MO algorithm: NSGA-II (Java, jMetal MOEA library)
EVALUATION: calculation of coverage and cost by matrix multiplication (OpenCL, using the JavaCL wrapper)
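The evaluation being offloaded is, in essence, a binary matrix-vector product: multiplying the requirements x test-cases matrix by a 0/1 selection vector gives per-requirement coverage counts, and a dot product with the cost row gives total cost. A plain C sketch; the sizes and names are illustrative, not from the paper.

```c
#include <assert.h>

#define M 4   /* requirements (illustrative size) */
#define T 4   /* test cases (illustrative size) */

/* coverage[r] = number of selected tests covering requirement r;
   returns the total cost of the selected tests */
static int evaluate(int matrix[M][T], const int cost[T],
                    const int selected[T], int coverage[M]) {
    int total = 0;
    for (int r = 0; r < M; ++r) {
        coverage[r] = 0;
        for (int t = 0; t < T; ++t)
            coverage[r] += matrix[r][t] * selected[t];
    }
    for (int t = 0; t < T; ++t)
        total += cost[t] * selected[t];
    return total;
}
```

Each candidate test suite in the population needs this product, and each row of each product is independent, which is why the evaluation maps well onto GPU threads.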
Results
[Chart: GPU speed-up (scale 0-30x) against problem size: 5.92E+4, 6.62E+5, 1.12E+7.]
General Purpose Computing on GPUs (GPGPU)
CUDA Architecture
Developing CUDA Applications
Case Studies
- Parallelising Search Algorithm
- Parallelising Fitness Evaluation
- Parallelising Software Execution
Implementation
EVALUATION: execute instrumented software with test inputs, run as a CUDA kernel
research funded by the MOD Centre for Defence Enterprise (CDE)
Language Compatibility
large subset of C++
- including: OO features, templates, math library, IEEE 754 floating point compliance
- only in compute capability 2.0+: dynamic memory allocation, function pointers, function recursion, multiple source code files
- missing: Standard Template Library, runtime type information, network and file IO, rand()
Results
[Chart: GPU speed-up (scale 0-80x) against problem size: ~20 LOC, ~100 LOC, ~1,500 LOC.]
Resources
NVIDIA CUDA Zone
CUDA SDK samples - ‘template’ application
C Programming Guide
C Best Practices Guide