PARRAY: The Array-Based GPU Programming Technology Yifeng Chen School of EECS Peking University, China.


PARRAY: The Array-Based GPU Programming Technology

Yifeng Chen

School of EECS, Peking University, China


Two Conflicting Approaches for Programmability in HPC

Top-down Approach
• Core programming model is high-level (e.g. functional parallel languages)
• Must rely on heavy heuristic runtime optimization
• Add low-level program constructs to improve low-level control
• Risks:
    • Programmers tend to avoid using “extra” constructs.
    • Low-level controls do not fit well into the core model.

Bottom-up Approach (PARRAY)
• Core programming model exposes the memory hierarchy
• Same algorithm, same performance, same intellectual challenge, but shorter code
• Runtime optimization possible, but not part of the core model.


Basic Notation

• Dimensions in a tree
• A dimension may refer to another array type.
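The idea of dimensions forming a tree can be sketched in plain Python. This is not PARRAY syntax or its implementation: the nested-list encoding and the helper names `size`, `offset` and `unflatten` are assumptions made for illustration, and a row-major layout is assumed. A leaf dimension is an integer, and a grouped dimension is a list of sub-dimensions, so a shape written as [2][[2048][4096]] becomes `[2, [2048, 4096]]`.

```python
def size(dim):
    """Number of elements under a dimension (int leaf or list of sub-dimensions)."""
    if isinstance(dim, int):
        return dim
    total = 1
    for sub in dim:
        total *= size(sub)
    return total


def offset(dim, index):
    """Row-major offset of `index` within `dim`.

    A grouped dimension accepts either one flat integer or a sequence
    holding one index per sub-dimension.
    """
    if isinstance(index, int):
        if not 0 <= index < size(dim):
            raise IndexError("index out of range for this dimension")
        return index
    if isinstance(dim, int):
        raise TypeError("a leaf dimension takes a single integer index")
    if len(index) != len(dim):
        raise ValueError("need one index per sub-dimension")
    result = 0
    for sub, i in zip(dim, index):
        result = result * size(sub) + offset(sub, i)
    return result


def unflatten(dim, flat):
    """Inverse of offset: split a flat offset into nested indices using div/mod."""
    if isinstance(dim, int):
        return flat
    parts = []
    for sub in reversed(dim):
        n = size(sub)
        parts.append(unflatten(sub, flat % n))
        flat //= n
    return tuple(reversed(parts))


H = [2, [2048, 4096]]
print(size(H))                    # 16777216
print(offset(H, (1, (3, 5))))     # 8400901
print(unflatten(H, 8400901))      # (1, (3, 5))
```

The grouped second dimension can be addressed either as one index of extent 2048*4096 or as a (row, column) pair, which is the property the tree structure is meant to capture.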


Motivating Examples for PARRAY


Thread Arrays


#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G
float* host;
_pa_pthd* p;
#mainhost{
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}

Generating CUDA+Pthread
• pthread_create
• sem_post
• sem_wait
• pthread_join
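The four calls named on this slide suggest a create / signal / wait / join pattern around the detour. The sketch below mimics that pattern with Python's `threading` module instead of Pthreads and CUDA. The mapping of each call to a step, the function name `run_detour`, and the small `CHUNK` size (standing in for 2048*4096 floats) are assumptions for illustration, not the code PARRAY actually generates.

```python
import threading

NUM_THREADS = 2   # mirrors the thread array "pthd [2]"
CHUNK = 8         # small stand-in for one 2048*4096 slice of the host array


def run_detour(host):
    """Give each worker thread its own slice of `host`, as a host-to-device copy would."""
    dev = [None] * NUM_THREADS
    start = [threading.Semaphore(0) for _ in range(NUM_THREADS)]
    done = threading.Semaphore(0)

    def worker(tid):
        start[tid].acquire()                              # like sem_wait: idle until the detour starts
        dev[tid] = host[tid * CHUNK:(tid + 1) * CHUNK]    # stands in for the transfer to this thread's GPU
        done.release()                                    # like sem_post: report completion

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
    for th in threads:
        th.start()            # like pthread_create
    for sem in start:
        sem.release()         # like sem_post: wake every worker
    for _ in threads:
        done.acquire()        # like sem_wait: wait for all transfers
    for th in threads:
        th.join()             # like pthread_join
    return dev


host = list(range(NUM_THREADS * CHUNK))
dev = run_detour(host)
print(dev[1])   # [8, 9, 10, 11, 12, 13, 14, 15]
```

Each worker ends up with the slice selected by its thread id, which is the effect the combined type [#P][#D] describes for the transfer.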


#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhosts{
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}

Generating MPI or IB/verbs

MPI_Scatter
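The MPI_Scatter label indicates that a transfer of this shape is realized as a scatter: the root's array is cut into equal consecutive chunks, and rank r receives chunk r. The sketch below reproduces only that index mapping in plain Python, without MPI. The names `scatter` and `host_offset` are hypothetical helpers, and the tiny buffer stands in for the 2 x (2048*4096) floats of the slide.

```python
def host_offset(rank, j, count):
    """Offset in the root's buffer of element j of the chunk delivered to `rank`."""
    return rank * count + j


def scatter(sendbuf, num_ranks):
    """Per-rank receive buffers for a scatter of `sendbuf` over `num_ranks` ranks."""
    if len(sendbuf) % num_ranks != 0:
        raise ValueError("buffer length must be a multiple of the number of ranks")
    count = len(sendbuf) // num_ranks
    return [
        [sendbuf[host_offset(rank, j, count)] for j in range(count)]
        for rank in range(num_ranks)
    ]


recv = scatter(list(range(8)), 2)
print(recv)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```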


Other Communication Patterns
• ALLTOALL
• BCAST
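Both patterns can be read as index mappings between a process dimension and a data dimension. The sketch below models them on Python lists, with no MPI involved. The function names and the list-of-lists encoding (`send[i][j]` is the block that rank i sends to rank j) are assumptions for illustration only.

```python
def alltoall(send):
    """All-to-all exchange: result[j][i] is the block rank j receives from rank i."""
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("each rank must provide one block per rank")
    # Swapping the sender and receiver indices is a transpose of the two dimensions.
    return [[send[i][j] for i in range(n)] for j in range(n)]


def bcast(value, num_ranks):
    """Broadcast: every rank ends up with its own copy of the root's data."""
    return [list(value) for _ in range(num_ranks)]


send = [["a0", "a1"], ["b0", "b1"]]
print(alltoall(send))      # [['a0', 'b0'], ['a1', 'b1']]
print(bcast([1, 2], 3))    # [[1, 2], [1, 2], [1, 2]]
```

Seen this way, an all-to-all is a transposition of the (sender, receiver) dimensions, and a broadcast replicates one array along the process dimension.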


One-Line CUDA Code


Large-Scale FFT in 20 lines
• Deeply optimized algorithm (ICS 2010)
• Zero-copy for hmem


(Before Nov 2011)


Direct Simulation of Turbulent Flows

Scale
• Up to 14336^3 (3D), single precision
• 12 distributed arrays, each with 11 TB data (128 TB total)
• Entire Tianhe-1A with 7168 nodes

Progress
• 4096^3 (3D) completed
• 8192^3 (3D) half-way, and 14336^3 (3D) tested for performance.

Software Technologies
• PARRAY (ACM PPoPP’12): code only 300 lines.
• Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on entire Tianhe-1A is feasible.
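The data sizes on this slide can be checked with a few lines of arithmetic. The sketch assumes that "14336 3D" denotes a cubic grid with 14336 points per side, that single precision means 4 bytes per value, and that the TB figures are binary units (2^40 bytes). The per-node figure is derived here and is not stated in the slides.

```python
BYTES_PER_FLOAT = 4      # single precision (assumption: IEEE 754 binary32)
GRID = 14336             # points per side of the assumed cubic grid
ARRAYS = 12              # distributed arrays
NODES = 7168             # Tianhe-1A nodes
GIB = 2 ** 30
TIB = 2 ** 40


def array_bytes(n):
    """Bytes needed for one n x n x n single-precision array."""
    return n ** 3 * BYTES_PER_FLOAT


per_array_tib = array_bytes(GRID) / TIB
total_tib = ARRAYS * per_array_tib
per_node_gib = ARRAYS * array_bytes(GRID) / NODES / GIB

print(per_array_tib)   # 10.71875  (about 11 TB per array)
print(total_tib)       # 128.625   (about 128 TB in total)
print(per_node_gib)    # 18.375    (derived share per node)
```

Under these assumptions the numbers agree with the 11 TB per array and 128 TB total quoted on the slide.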


Discussions

Can other programming models benefit from PARRAY ideas?
• MPI (more expressive datatype)
• OpenACC (optimization for coalescing accesses)
• PGAS (generating PGAS library calls)
• IB/verbs (directly generating Zero-Copy IB calls)

PARRAY helps, if you can write it down!
• Any index expressions using add/mul/mod/div
• Irregular structures must be encoded into arrays and then benefit from PARRAY.

Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros
• Macros are compiled out: no performance loss.
• Typical training = 3 days. Friendly to engineers, geophysicists…
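One common way to encode an irregular structure into arrays is the compressed sparse row (CSR) layout, sketched below for adjacency lists of varying length. The slides do not name a specific encoding, so CSR is only an illustrative choice, and `to_csr` and `neighbours` are hypothetical helper names.

```python
def to_csr(adjacency):
    """Flatten variable-length adjacency lists into two regular arrays.

    offsets[v] .. offsets[v + 1] delimits the neighbours of vertex v in targets.
    """
    offsets = [0]
    targets = []
    for neighbour_list in adjacency:
        targets.extend(neighbour_list)
        offsets.append(len(targets))
    return offsets, targets


def neighbours(offsets, targets, v):
    """Recover the neighbour list of vertex v from the flat arrays."""
    return targets[offsets[v]:offsets[v + 1]]


adjacency = [[1, 2], [], [0]]          # three vertices with 2, 0 and 1 neighbours
offsets, targets = to_csr(adjacency)
print(offsets)                          # [0, 2, 2, 3]
print(targets)                          # [1, 2, 0]
print(neighbours(offsets, targets, 2))  # [0]
```

Once the structure lives in flat arrays like these, the arrays themselves can be described and moved with ordinary array types.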