PARRAY: The Array-Based GPU Programming Technology
Yifeng Chen
School of EECS, Peking University, China.
Two Conflicting Approaches for Programmability in HPC
Top-down Approach
• Core programming model is high-level (e.g. functional parallel languages)
• Must rely on heavy heuristic runtime optimization
• Add low-level program constructs to improve low-level control
• Risks:
  • Programmers tend to avoid using “extra” constructs.
  • Low-level controls do not fit well into the core model.
Bottom-up Approach (PARRAY)
• Core programming model exposes the memory hierarchy
• Same algorithm, same performance, same intellectual challenge, but shorter code
• Runtime optimization possible, but not part of the core model.
Basic Notation
• Dimensions in a tree
• A dimension may refer to another array type.
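To make the dimension tree concrete, here is a minimal C sketch of the offset arithmetic for an array shaped like the H type used on the later slides, [2][[2048][4096]]. It assumes a row-major layout, which the slides do not state, and the names h_offset and h_offset_flat are hypothetical helpers, not part of PARRAY. The second function shows how the nested dimension (H_1 in the slides' notation) can be addressed with a single index that is split by div and mod.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the H type on the slides: [2][[2048][4096]].
   The row-major layout below is an assumption made for illustration. */
enum { H0 = 2, H10 = 2048, H11 = 4096, H1 = H10 * H11 };

/* Offset of element (i, j, k) when all three leaf dimensions are indexed. */
static size_t h_offset(size_t i, size_t j, size_t k) {
    assert(i < H0 && j < H10 && k < H11);
    return (i * H10 + j) * H11 + k;
}

/* Same element addressed as (i, m), where m indexes the nested
   dimension H_1 as a whole; div/mod recover the two sub-indices. */
static size_t h_offset_flat(size_t i, size_t m) {
    assert(i < H0 && m < H1);
    size_t j = m / H11;
    size_t k = m % H11;
    return h_offset(i, j, k);
}
```

Under these assumptions, h_offset_flat(1, 4097) and h_offset(1, 1, 1) name the same element, since 4097 = 1 * 4096 + 1.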
Motivating Examples for PARRAY
Thread Arrays
#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G
float* host;
_pa_pthd* p;
#mainhost{
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}
[Figure labels: pthread_create, sem_post, sem_wait, pthread_join]
Generating CUDA+Pthread
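The slide names four calls: pthread_create, sem_post, sem_wait and pthread_join. The following is a hand-written C sketch of how those four calls could fit together for a two-thread transfer; it is not the code PARRAY actually generates. The struct worker_t and the function transfer_blocks are hypothetical names, and memcpy stands in for the host-to-device copy so that the sketch runs without a GPU.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <string.h>

#define NTHREADS 2  /* matches "pthd [2]" on the slide */

typedef struct {
    int tid;
    const float *host;  /* whole host array: NTHREADS * block floats */
    float *dev;         /* stand-in for this thread's GPU buffer */
    size_t block;       /* floats handled per thread */
    sem_t go, done;
} worker_t;

static void *worker(void *arg) {
    worker_t *w = arg;
    sem_wait(&w->go);                       /* wait until the host says go */
    /* A real version would issue a host-to-device copy here (assumption). */
    memcpy(w->dev, w->host + (size_t)w->tid * w->block,
           w->block * sizeof(float));
    sem_post(&w->done);                     /* report completion */
    return NULL;
}

/* Copies block t of host into block t of dev_all, one thread per block.
   Returns 0 on success, -1 if a thread could not be created.
   Error checks on sem_init are omitted for brevity. */
static int transfer_blocks(const float *host, float *dev_all, size_t block) {
    pthread_t th[NTHREADS];
    worker_t w[NTHREADS];
    int started = 0, rc = 0;

    for (int t = 0; t < NTHREADS; t++) {
        w[t].tid = t;
        w[t].host = host;
        w[t].dev = dev_all + (size_t)t * block;
        w[t].block = block;
        sem_init(&w[t].go, 0, 0);
        sem_init(&w[t].done, 0, 0);
    }
    for (int t = 0; t < NTHREADS; t++) {
        if (pthread_create(&th[t], NULL, worker, &w[t]) != 0) { rc = -1; break; }
        started++;
    }
    for (int t = 0; t < started; t++) sem_post(&w[t].go);
    for (int t = 0; t < started; t++) sem_wait(&w[t].done);
    for (int t = 0; t < started; t++) pthread_join(th[t], NULL);
    for (int t = 0; t < NTHREADS; t++) {
        sem_destroy(&w[t].go);
        sem_destroy(&w[t].done);
    }
    return rc;
}
```

On systems where the thread library is separate, this needs to be linked with -pthread. The point of the comparison is length: the PARRAY version expresses the same structure with #create, #detour and #insert DataTransfer.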
#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G
float* host;
_pa_mpi* m;
#mainhosts{
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}
Generating MPI or IB/verbs
MPI_Scatter
ALLTOALL
BCAST
Other Communication Patterns
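One way to read the patterns named on these slides (scatter, all-to-all, broadcast) is as different index mappings between a source and a destination array. The C sketch below simulates the three patterns inside a single process with flat arrays; it makes no MPI calls, and the function names and the [P][n] and [P][P][n] layouts are assumptions chosen for illustration.

```c
#include <stddef.h>

/* P = number of simulated ranks, n = block length in floats.
   All buffers are flat arrays living in one address space. */

/* Scatter: rank r receives block r of the root buffer.
   recv is laid out as [P][n]. */
static void sim_scatter(const float *root, float *recv, size_t P, size_t n) {
    for (size_t r = 0; r < P; r++)
        for (size_t i = 0; i < n; i++)
            recv[r * n + i] = root[r * n + i];
}

/* Broadcast: every rank receives the same n elements from the root. */
static void sim_bcast(const float *root, float *recv, size_t P, size_t n) {
    for (size_t r = 0; r < P; r++)
        for (size_t i = 0; i < n; i++)
            recv[r * n + i] = root[i];
}

/* All-to-all: send and recv are laid out as [P][P][n].
   Block s on rank r comes from block r on rank s,
   i.e. the two rank dimensions are swapped. */
static void sim_alltoall(const float *send, float *recv, size_t P, size_t n) {
    for (size_t r = 0; r < P; r++)
        for (size_t s = 0; s < P; s++)
            for (size_t i = 0; i < n; i++)
                recv[(r * P + s) * n + i] = send[(s * P + r) * n + i];
}
```

The three loops differ only in the index expression on the right-hand side, which is the kind of difference an array type description can capture.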
One-Line CUDA Code
Large-Scale FFT in 20 lines
• Deeply optimized algorithm (ICS 2010)
• Zero-copy for hmem
(Before Nov 2011)
Direct Simulation of Turbulent Flows
Scale
• Up to 14336 3D single-precision
• 12 distributed arrays, each with 11 TB data (128 TB total)
• Entire Tianhe-1A with 7168 nodes
Progress
• 4096 3D completed
• 8192 3D half-way and 14336 3D tested for performance.
Software Technologies
• PARRAY (ACM PPoPP’12): code is only 300 lines.
• Programming-level resilience technology for stable computation
Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Discussions
Can other programming models benefit from PARRAY ideas?
• MPI (more expressive datatype)
• OpenACC (optimization for coalescing accesses)
• PGAS (generating PGAS library calls)
• IB/verbs (directly generating Zero-Copy IB calls)
PARRAY helps, if you can write it down!
• Any index expressions using add/mul/mod/div
• Irregular structures must be encoded into arrays and then benefit from PARRAY.
• Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros
• Macros are compiled out: no performance loss.
• Typical training = 3 days; friendly to engineers, geophysicists…