
HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng


Presentation HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng at the AMD Developer Summit (APU13) November 11-13, 2013.


Page 1: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Towards an Ecosystem for Heterogeneous Parallel Computing

Wu FENG

Dept. of Computer Science, Dept. of Electrical & Computer Engineering, Virginia Bioinformatics Institute

© W. Feng, November 2013 [email protected], 540.231.1192

Page 2: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

© W. Feng, November 2013 [email protected], 540.231.1192

Japanese ‘Computnik’ Earth Simulator Shatters U.S. Supercomputer Hegemony

Page 3: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Importance of High-Performance Computing (HPC) in the Large and in the Small

Competitive Risk From Not Having Access to HPC

[Pie chart, shares of 3%, 16%, 34%, and 47%: could exist and compete (3%); could not exist as a business; could not compete on quality & testing issues; could not compete on time to market & cost.]

Data from the Council on Competitiveness; sponsored survey conducted by IDC

Only 3% of companies could exist and compete without HPC.
• 200+ participating companies, including many Fortune 500 firms (Procter & Gamble and biological and chemical companies)

© W. Feng, November 2013 [email protected], 540.231.1192

Page 4: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Computnik 2.0?

• The Second Coming of Computnik? Computnik 2.0?
  – No … "only" 43% faster than the previous #1 supercomputer, but
    → $20M cheaper than the previous #1 supercomputer
    → 42% less power consumption

• The Second Coming of the "Beowulf Cluster" for HPC
  – The further commoditization of HPC

© W. Feng, November 2013 [email protected], 540.231.1192

Page 5: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

The First Coming of the “Beowulf Cluster”

•  Utilize commodity PCs (with commodity CPUs) to build a supercomputer

The Second Coming of the “Beowulf Cluster”

•  Utilize commodity PCs (with commodity CPUs) to build a supercomputer +

•  Utilize commodity graphics processing units (GPUs) to build a supercomputer

© W. Feng, November 2013 [email protected], 540.231.1192

Issue: Extracting performance with programming ease and portability → productivity

Page 6: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

“Holy Grail” Vision •  An ecosystem for heterogeneous parallel computing

–  Enabling software that tunes parameters of hardware devices … with respect to performance, programmability, and portability … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps

Highest-ranked commodity supercomputer in U.S. on the Green500 (11/11)

© W. Feng, November 2013 [email protected], 540.231.1192

Page 7: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  © W. Feng, November 2013

[email protected], 540.231.1192

Design of Composite Structures

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

Page 8: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

An Ecosystem for Heterogeneous Parallel Computing

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

Sequence Alignment

Molecular Dynamics

Earthquake Modeling

Neuro- informatics

CFD for Mini-Drones

Applications

^

… inter-node (in brief) with application to BIG DATA (StreamMR)

Page 9: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

Page 10: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Programming GPUs

CUDA
• NVIDIA's proprietary framework

OpenCL
• Open standard for heterogeneous parallel computing (Khronos Group)
• Vendor-neutral environment for CPUs, GPUs, APUs, and even FPGAs

© W. Feng, November 2013 [email protected], 540.231.1192

Page 11: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Prevalence

© W. Feng, November 2013 [email protected], 540.231.1192

Search Date: 2013-03-25

First public release: December 2008 (OpenCL)

First public release: February 2007 (CUDA)

Page 12: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

CU2CL: CUDA-to-OpenCL Source-to-Source Translator†

•  Works as a Clang plug-in to leverage its production-quality compiler framework.

•  Covers primary CUDA constructs found in CUDA C and CUDA run-time API.

•  Performs as well as codes manually ported from CUDA to OpenCL.

See APU’13 Talk (PT-4057) from Tuesday, November 12 •  Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next?
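For a sense of the mapping CU2CL automates, the sketch below pairs a small CUDA host fragment with roughly equivalent OpenCL host calls. This is illustrative only, not actual CU2CL output; it assumes an existing OpenCL context ctx, command queue queue, a built kernel object kern for a hypothetical vecAdd kernel, and host buffers h_a/h_c of n floats.

/* CUDA (before) -- illustrative fragment */
float *d_a;
cudaMalloc((void **)&d_a, n * sizeof(float));
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a);

/* OpenCL (after) -- corresponding host calls */
cl_int err;
cl_mem d_a_cl = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n * sizeof(float), NULL, &err);
clEnqueueWriteBuffer(queue, d_a_cl, CL_TRUE, 0, n * sizeof(float), h_a, 0, NULL, NULL);
clSetKernelArg(kern, 0, sizeof(cl_mem), &d_a_cl);
/* ... set the remaining kernel arguments ... */
size_t global = (size_t)blocks * threads, local = threads;
clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global, &local, 0, NULL, NULL);
clEnqueueReadBuffer(queue, d_c_cl, CL_TRUE, 0, n * sizeof(float), h_c, 0, NULL, NULL);
clReleaseMemObject(d_a_cl);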

© W. Feng, November 2013 [email protected], 540.231.1192

† "CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures," 17th IEEE Int'l Conf. on Parallel & Distributed Systems (ICPADS), Dec. 2011.

Page 13: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

OpenCL: Write Once, Run Anywhere

© W. Feng, November 2013 [email protected], 540.231.1192

CUDA Program

CU2CL (“cuticle”)

OpenCL-supported CPUs, GPUs, FPGAs NVIDIA GPUs

OpenCL Program

Page 14: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Application                     CUDA Lines   Lines Manually Changed   % Auto-Translated
bandwidthTest                          891                        5                98.9
BlackScholes                           347                       14                96.0
matrixMul                              351                        9                97.4
vectorAdd                              147                        0               100.0
Back Propagation                       313                       24                92.3
Hotspot                                328                        2                99.4
Needleman-Wunsch                       430                        3                99.3
SRAD                                   541                        0               100.0
Fen Zi: Molecular Dynamics          17,768                    1,796                89.9
GEM: Molecular Modeling                524                       15                97.1
IZ PS: Neural Network                8,402                      166                98.0

© W. Feng, November 2013 [email protected], 540.231.1192

Page 15: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Why CU2CL?
• Much larger # of apps implemented in CUDA than in OpenCL
  – Idea
    § Leverage scientists' investment in CUDA to drive OpenCL adoption
  – Issues (from the perspective of domain scientists)
    § Writing from scratch: steep learning curve (OpenCL is an even lower-level API than CUDA, and CUDA itself is already low level)
    § Porting from CUDA: tedious and error-prone

•  Significant demand from major stakeholders (and Apple …)

© W. Feng, November 2013 [email protected], 540.231.1192

Why Not CU2CL?
• Just start with OpenCL?!
• CU2CL only does source-to-source translation at present
  – Functional (code) portability? Yes.
  – Performance portability? Mostly no.

Page 16: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

IEEE ICPADS ’11 P2S2 ’12

Parallel Computing ‘13

Performance, Programmability, Portability

What about performance portability?

Page 17: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

Page 18: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

C. del Mundo, W. Feng. "Towards a Performance-Portable FFT Library for Heterogeneous Computing," IEEE IPDPS '14, Phoenix, AZ, USA, May 2014. (Under review.)
C. del Mundo, W. Feng. "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC|13, Denver, CO, Nov. 2013.
C. del Mundo, V. Adhinarayanan, W. Feng. "Accelerating FFT for Wideband Channelization," IEEE ICC '13, Budapest, Hungary, June 2013.

Traditional approach to optimizing with a compiler: architecture-independent optimization vs. architecture-aware optimization

Page 19: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Computational Units Not Created Equal

•  “AMD CPU ≠ Intel CPU” and “AMD GPU ≠ NVIDIA GPU” •  Initial performance of a CUDA-optimized N-body dwarf

© W. Feng, November 2013 [email protected], 540.231.1192

[Chart annotation: the "architecture-unaware" ("architecture-independent") port initially performs 32% worse (-32%).]

Page 20: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Basic & Architecture-Unaware Execution

© W. Feng, November 2013 [email protected], 540.231.1192

[Chart: speedup over hand-tuned SSE]

                   Basic   Architecture-unaware   Architecture-aware
NVIDIA GTX280        163                    192                  328
AMD 5870              88                    224                  371

(The architecture-aware AMD 5870 result is 12% better than the GTX280.)

Page 21: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Combined

Optimization for N-body Molecular Modeling

• Optimization techniques on AMD GPUs
  – Removing conditions → kernel splitting
  – Local staging
  – Using vector types
  – Using image memory

• Speedup over basic OpenCL GPU implementation
  – Isolated optimizations
  – Combined optimizations

© W. Feng, November 2013 [email protected], 540.231.1192

Isolated

MT: Max Threads; KS: Kernel Splitting; RA: Register Accumulator; RP: Register Preloading; LM: Local Memory; IM: Image Memory; LU{2,4}: Loop Unrolling{2x,4x}; VASM{2,4}: Vectorized Access & Scalar Math{float2, float4}; VAVM{2,4}: Vectorized Access & Vector Math{float2, float4}

Page 22: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

AMD-Optimized N-body Dwarf

© W. Feng, November 2013 [email protected], 540.231.1192

[Chart: from -32% initially, the AMD-optimized N-body dwarf reaches a 371x speedup over hand-tuned SSE, 12% better than the NVIDIA GTX 280.]

Page 23: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

FFT: Fast Fourier Transform

•  A spectral method that is a critical building block across many disciplines


Page 24: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Peak SP Performance & Peak Memory Bandwidth

© W. Feng, November 2013 [email protected], 540.231.1192

Page 25: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Summary of Optimizations

© W. Feng, November 2013 [email protected], 540.231.1192

Page 26: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Optimizations

• RP: Register Preloading
  – All data elements are first preloaded into the register file before use.
  – Computation is performed solely on registers.
• LM-CM: Local Memory (Communication Only)
  – Data elements are loaded into local memory only for communication.
  – Threads swap data elements solely in local memory.
• CGAP: Coalesced Global Access Pattern
  – Threads access memory contiguously.
• VASM{2|4}: Vector Access, Scalar Math, float{2|4}
  – Data elements are loaded as the listed vector type.
  – Arithmetic operations are scalar (float × float).
• CM-K: Constant Memory for Kernel Arguments
  – The twiddle factors for the FFT's twiddle-multiplication stage are precomputed on the CPU and stored in GPU constant memory for fast look-up.
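As a rough illustration of two of these techniques (not the authors' FFT kernels), the OpenCL-style sketch below combines a coalesced global access pattern (CGAP) with register preloading (RP); the kernel name, buffer names, and the assumption that data holds 2N float2 elements are all illustrative.

__kernel void preload_example(__global const float2 *data,
                              __global float2 *out,
                              const int N)
{
    int gid = get_global_id(0);
    if (gid >= N) return;

    /* CGAP: consecutive work-items read consecutive elements,
       so the loads coalesce into wide memory transactions. */
    float2 a = data[gid];          /* RP: stage operands in registers */
    float2 b = data[gid + N];      /* assumes 2N elements are available */

    /* Compute solely on registers (a trivial butterfly-like add/sub). */
    out[gid]     = (float2)(a.x + b.x, a.y + b.y);
    out[gid + N] = (float2)(a.x - b.x, a.y - b.y);
}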

© W. Feng, November 2013 [email protected], 540.231.1192

Page 27: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Architecture-Optimized FFT

© W. Feng, November 2013 [email protected], 540.231.1192

Page 28: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Architecture-Optimized FFT

© W. Feng, November 2013 [email protected], 540.231.1192

Page 29: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

GPU Computing Gems J. Molecular Graphics & Modeling

IEEE HPCC ’12 IEEE HPDC ‘13

IEEE ICPADS ’11 IEEE ICC ‘13

Performance, Programmability, Portability

Page 30: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

FUTURE  WORK  

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

Page 31: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

Page 32: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

The Berkeley View †

• Traditional Approach
  – Applications that target existing hardware and programming models
• Berkeley Approach
  – Hardware design that keeps future applications in mind
  – Basis for future applications? 13 computational dwarfs

A computational dwarf is a pattern of communication & computation that is common across a set of applications.

© W. Feng, November 2013 [email protected], 540.231.1192

† Asanovic, K., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. Rep. UCB/EECS-2006-183, University of California, Berkeley, Dec. 2006.

Dense Linear Algebra

Sparse Linear Algebra

Spectral Methods

N-Body Methods

Structured Grids

Unstructured Grids

Monte Carlo → MapReduce

Combinational Logic Graph Traversal Dynamic Programming Backtrack & Branch+Bound Graphical Models Finite State Machine

and

Page 33: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Example of a Computational Dwarf: N-Body

•  Computational Dwarf: Pattern of computation & communication … that is common across a set of applications

•  N-Body problems are studied in –  Cosmology, particle physics, biology, and engineering

•  All have similar structures •  An N-Body benchmark can provide meaningful insight

to people in all these fields •  Optimizations may be generally applicable as well
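As a generic illustration of this dwarf's structure (not any of the specific codes named here), the core of a direct N-body evaluation is an all-pairs accumulation; function and variable names, and the softening term eps2, are illustrative.

#include <math.h>

/* Generic O(N^2) N-body force accumulation -- illustrative only. */
void nbody_step(int n, const float *x, const float *y, const float *z,
                float *fx, float *fy, float *fz, float eps2)
{
    for (int i = 0; i < n; i++) {
        float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + eps2;  /* softened distance^2 */
            float inv_r = 1.0f / sqrtf(r2);
            float s = inv_r * inv_r * inv_r;                /* 1/r^3 */
            axi += dx * s;  ayi += dy * s;  azi += dz * s;
        }
        fx[i] = axi;  fy[i] = ayi;  fz[i] = azi;
    }
}

The same all-pairs pattern underlies GEM's electrostatic-potential computation and the astrophysics codes named above; only the pairwise interaction kernel changes.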

© W. Feng, November 2013 [email protected], 540.231.1192

GEM: Molecular Modeling

RoadRunner Universe: Astrophysics

Page 34: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

First Instantiation: OpenDwarfs (formerly “OpenCL and the 13 Dwarfs”)  

• Goal
  – Provide common algorithmic methods, i.e., dwarfs, in a language that is "write once, run anywhere" (CPU, GPU, or even FPGA), i.e., OpenCL

•  Part of a larger umbrella project (2008-2018), funded by the NSF Center for High-Performance Reconfigurable Computing (CHREC)

© W. Feng, November 2013 [email protected], 540.231.1192

Page 35: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

ACM ICPE ’12: OpenCL & the 13 Dwarfs

IEEE Cluster ’11, FPGA ’11, IEEE ICPADS ’11, SAAHPC ’11

Performance, Programmability, Portability

Page 36: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

Page 37: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Performance & Power Modeling

• Goals
  – Robust framework
  – Very high accuracy (target: < 5% prediction error)
  – Identification of portable predictors for performance and power
  – Multi-dimensional characterization
    § Performance → sequential, intra-node parallel, inter-node parallel
    § Power → component level, node level, cluster level

© W. Feng, November 2013 [email protected], 540.231.1192

Page 38: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Problem Formulation: LP-Based Energy-Optimal DVFS Schedule

• Definitions
  – A DVFS system exports n settings { (f_i, P_i) }.
  – T_i : total execution time of a program running at setting i
• Given a program with deadline D, find a DVFS schedule (t_1*, …, t_n*) such that
  – if the program is executed for t_i seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed.
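Written out, a sketch of this linear program consistent with the definitions above (treating t_i / T_i as the fraction of the program's work done at setting i is an assumption of this sketch):

$$\min_{t_1,\dots,t_n} \; E = \sum_{i=1}^{n} P_i \, t_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} t_i \le D, \qquad
\sum_{i=1}^{n} \frac{t_i}{T_i} = 1, \qquad
t_i \ge 0 .$$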

© W. Feng, November 2013 [email protected], 540.231.1192

Page 39: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Single-Coefficient β Performance Model

• Our Formulation
  – Define the relative performance slowdown δ as T(f) / T(f_MAX) − 1
  – Re-formulate the two-coefficient model as a single-coefficient model:
  – The coefficient β is computed at run-time using a regression method on the past MIPS rates reported from the built-in PMU.
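The single-coefficient form referred to above, written out (a sketch consistent with the slide's definition of δ; the regression details follow the cited SC|05 paper):

$$\delta(f) \;=\; \frac{T(f)}{T(f_{\max})} - 1 \;=\; \beta \left( \frac{f_{\max}}{f} - 1 \right),
\qquad\text{so}\qquad
T(f) \;=\; T(f_{\max}) \left[ 1 + \beta \left( \frac{f_{\max}}{f} - 1 \right) \right].$$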

© W. Feng, November 2013 [email protected], 540.231.1192

C. Hsu and W. Feng. "A Power-Aware Run-Time System for High-Performance Computing," SC|05, Nov. 2005.

Page 40: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

β – Adaptation on NAS Parallel Benchmarks

C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.

© W. Feng, November 2013 [email protected], 540.231.1192

Page 41: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Need for Better Performance & Power Modeling


© W. Feng, November 2013 [email protected], 540.231.1192

Source: Virginia Tech

Page 42: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Search for Reliable Predictors

• Re-visit performance counters used for prediction
  – Applicability of performance counters across generations of architecture
• Performance counters monitored
  – NPB and PARSEC benchmarks
  – Platform: Intel Xeon E5645 CPU (Westmere-EP)

© W. Feng, November 2013 [email protected], 540.231.1192

Counters monitored: context switches (CS), instructions issued (IIS), L3 data cache misses (L3 DCM), L2 data cache misses (L2 DCM), L2 instruction cache misses (L2 ICM), instructions completed (TOTINS), outstanding bus request cycles (OutReq), and instruction queue write cycles (InsQW).
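For illustration, counters like these can be read with a library such as PAPI; the sketch below assumes PAPI's classic high-level counter API and preset event names (illustrative only, not necessarily the instrumentation used in this work).

#include <papi.h>
#include <stdio.h>

int main(void)
{
    /* Two of the counters listed above, via PAPI preset events. */
    int events[2] = { PAPI_L2_DCM, PAPI_TOT_INS };
    long long values[2];

    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;

    /* ... region of interest: run the benchmark kernel here ... */

    if (PAPI_stop_counters(values, 2) != PAPI_OK)
        return 1;

    printf("L2 data cache misses: %lld, instructions completed: %lld\n",
           values[0], values[1]);
    return 0;
}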

Page 43: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Page 44: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

ACM/IEEE SC ’05 ICPP ’07

IEEE/ACM CCGrid ‘09 IEEE GreenCom ’10

ACM CF ’11 IEEE Cluster ‘11

ISC ’12 IEEE ICPE ‘13

Page 45: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

Performance, Programmability, Portability

Performance Portability

Performance Portability

Page 46: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

What is Heterogeneous Task Scheduling?

• Automatically spreading tasks across heterogeneous compute resources
  – CPUs, GPUs, APUs, FPGAs, DSPs, and so on
• Specify tasks at a higher level (currently OpenMP extensions)
• Run them across available resources automatically

• A run-time system that intelligently uses whatever resources are available and optimizes for performance portability
  – Each user should not have to implement this for themselves!

© W. Feng, November 2013 [email protected], 540.231.1192

Goal

Page 47: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Heterogeneous Task Scheduling

© W. Feng, November 2013 [email protected], 540.231.1192

Page 48: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

How to Heterogeneous Task Schedule (HTS)

• Accelerated OpenMP offers heterogeneous task scheduling with
  – Programmability
  – Functional portability, given underlying compilers
  – Performance portability

• How?
  – A simple extension to Accelerated OpenMP syntax for programmability
  – Automatically dividing parallel tasks across arbitrary heterogeneous compute resources for functional portability
    § CPUs
    § GPUs
    § APUs

–  Intelligent runtime task scheduling for performance portability

© W. Feng, November 2013 [email protected], 540.231.1192

Page 49: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Programmability: Why Accelerated OpenMP?

Traditional OpenMP:

# pragma omp parallel for \
  shared(in1,in2,out,pow)
for (i=0; i<end; i++){
  out[i] = in1[i]*in2[i];
  pow[i] = pow[i]*pow[i];
}

OpenMP Accelerator Directives:

# pragma acc_region_loop \
  acc_copyin(in1[0:end],in2[0:end]) \
  acc_copyout(out[0:end]) \
  acc_copy(pow[0:end])
for (i=0; i<end; i++){
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i]*pow[i];
}

Our Proposed Extension:

# pragma acc_region_loop \
  acc_copyin(in1[0:end],in2[0:end]) \
  acc_copyout(out[0:end]) \
  acc_copy(pow[0:end]) \
  hetero(<cond>[,<scheduler>[,<ratio>[,<div>]]])
for (i=0; i<end; i++){
  out[i] = in1[i] * in2[i];
  pow[i] = pow[i]*pow[i];
}

© W. Feng, November 2013 [email protected], 540.231.1192

Page 50: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

OpenMP Accelerator Behavior

Original/Master thread Worker threads Parallel region Accelerated region

#pragma omp parallel …

#pragma omp acc_region …

Kernels

© W. Feng, November 2013 [email protected], 540.231.1192

Implicit barrier

Page 51: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

DESIRED OpenMP Accelerator Behavior

Original/Master thread Worker threads Parallel region Accelerated region

© W. Feng, November 2013 [email protected], 540.231.1192

#pragma omp parallel num_threads(2)

#pragma omp acc_region …

#pragma omp parallel …

Page 52: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Work-share a Region Across the Whole System

Original/Master thread Worker threads Parallel region Accelerated region

#pragma omp acc_region …

#pragma omp parallel …

OR

Page 53: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Scheduling and Load-Balancing by Adaptation
• Measure computational suitability at runtime
• Compute new distribution of work through a linear optimization approach
• Re-distribute work before each pass

OpenMP implementation with coscheduling. This goal imposes design constraints. Most importantly, it must not require any changes, even for memory movement, to the loop body beyond those for Accelerated OpenMP. For example, no pragmas or API calls may be inserted into the loop, nor memory access patterns be changed, as task scheduling systems often require. All information necessary for CoreTSAR to provide the correct data for any range of iterations to any device's memory space must be provided in the directive outside the loop. Further, we must preserve data consistency outside the region: main memory must hold the same values when the loop exits as it would have with Accelerated OpenMP.

Our design has two main components: the scheduling and task inference portion; and the memory specification and management portion. We now detail both and provide an example of their use.

3.1 Assigning Tasks

With homogeneous iterations and resources, the OpenMP static schedule yields high performance. Since iterations can vary in runtime, OpenMP supports additional schedule types (dynamic and guided) to improve load balance. These schedules target heterogeneous iterations on homogeneous resources that have low concurrency control costs. However, they are less appropriate for heterogeneous resources due to varying costs and synchronization requirements. Since CoreTSAR targets heterogeneous resources with distributed memories, we provide different schedules.

Our adaptive scheduler assigns iterations at the beginning of parallel regions, or sub-regions. This approach reduces locking overhead but does not balance load dynamically. To provide balanced schedules, CoreTSAR predicts the time to compute an iteration on each resource based on previous passes.

CoreTSAR tracks the average time to complete an iteration on each device, which it uses to predict the amount of work each device can complete in the next pass. For example, consider a system with two CPU cores and one GPU (one CPU core must control the GPU). If the CPU core completes 10 iterations in the same time that the GPU takes to copy in data, complete 40 iterations, and copy back the results, then the CPU should be assigned 20% of the iterations in the next pass. We thus determine the relationship between compute units and can compute the amount of work to provide each device to balance their loads. However, we must extend this simple approach to more than two devices and choose an initial split.

3.2 Applying Ratios

We use a linear program to extend our approach to arbitrary device counts, a version of which was discussed briefly in our previous work [22]. The linear program computes the iterations to assign to each device based on their time per iteration. Figure 2 lists its variables (Equation 1), objective function (Equation 2) and accompanying constraints (Equations 3-6). The program minimizes the total deviation between the predicted runtimes for all devices. We assume that performance of an average iteration does not change across region instances. Thus, the time for a device to finish its work in the next pass equals the time per iteration from the previous pass multiplied by its assigned iteration count. In practice this assumption holds well: although the cost of iterations varies, the same iteration in different passes often has similar performance, rendering accuracy within a few percent for our tests.

3.3 Static Scheduling

On the first entry into a region, our static schedule uses the linear program to assign iterations. To increase portability, we compute default relative times per iteration at runtime rather than using a precomputed static value (the user can also specify the ratio). Our default assumes that one instruction cycle on a GPU core takes the same time as one cycle on a single SIMD lane of a CPU.

$I$ = total iterations available
$i_j$ = iterations for compute unit $j$
$f_j$ = fraction of iterations for compute unit $j$
$p_j$ = recent time/iteration for compute unit $j$   (1)
$n$ = number of compute devices
$t_j^+$ (or $t_j^-$) = time over (or under) equal

$$\min\left( \sum_{j=1}^{n-1} \left( t_j^+ + t_j^- \right) \right) \quad (2)$$

$$\sum_{j=1}^{n} i_j = I \quad (3)$$

$$i_2 p_2 - i_1 p_1 = t_1^+ - t_1^- \quad (4)$$
$$i_3 p_3 - i_1 p_1 = t_2^+ - t_2^- \quad (5)$$
$$\vdots$$
$$i_n p_n - i_1 p_1 = t_{n-1}^+ - t_{n-1}^- \quad (6)$$

Figure 2: Linear program variables, objective and constraints

While this assumption does not hold in general, we can portably compute an initial time per iteration for each device. We compute the time per iteration for a GPU as $p_g = \frac{1}{m/s}$ and for CPU cores as $1 - p_g$ (where $m$ is the number of multiprocessors on a GPU and $s$ the SIMD width of a CPU core; in the case of multiple GPUs, we use the largest value). For applications that are not dominated by floating-point computation, we have considered models that include several other factors, including memory bandwidth and integer performance, none of which have significantly changed our results.

3.4 Adaptive Scheduling

Our adaptive schedules (Adaptive, Split and Quick) use the static schedule for the first pass. We then use the time that each device takes to complete its iterations in the preceding pass as input to our linear program for the next pass. We include all recurring data transfer and similar overheads required to execute an iteration on a particular device (but not one-time overheads such as the copying of persistent data). Thus, we incorporate those overheads into the cost of the iteration and naturally account for them. The Adaptive schedule trains on the first instance of the region and then each subsequent instance. The Split schedule accommodates regions that may only run once or that may benefit from scheduling more often. It breaks each region instance into several evenly split sub-regions, based on the div input. Each time a sub-region completes, we use the linear program to split the next. This schedule can provide better load balance at the cost of increased scheduling and kernel launch overhead. Thus, it is impractical for short regions and overhead-sensitive applications. The Quick schedule balances between the Split and Adaptive schedules by executing a small sub-region for its first training phase, similarly to Split. It then immediately schedules all remaining iterations of the first region instance and uses the Adaptive schedule for any subsequent instances. This schedule suits applications that cannot tolerate a full instance using the static schedule or the overhead of extra scheduling steps in every pass.

3.5 Memory Management

Moving exactly the data required is essential to efficient and correct region execution across multiple memory spaces. Thus, we allow the user to specify the association between a loop iteration and in-



[Fig. 5(a): percentage of time spent in the Compute vs. Scheduling phases for the Adaptive and Split schedulers, original and with GPU back-off, for 1 to 4 GPUs.]

$$\min\left( \sum_{j=1}^{n-1} \left( t_j^+ + t_j^- \right) \right) \quad (7)$$

$$\sum_{j=1}^{n} f_j = 1 \quad (8)$$

$$f_2 p_2 - f_1 p_1 = t_1^+ - t_1^- \quad (9)$$
$$f_3 p_3 - f_1 p_1 = t_2^+ - t_2^- \quad (10)$$
$$\vdots$$
$$f_n p_n - f_1 p_1 = t_{n-1}^+ - t_{n-1}^- \quad (11)$$

(b) Modified objective and constraints

Fig. 5: Linear program optimization and performance

exactly that linear program. In order to ensure the solve itself is efficient, CoreTSAR employs the lp_solve library [Berkelaar et al. 2003], an optimized linear program solver that can refine an existing solved tableau for a new set of inputs. This incremental approach reduces overhead since each pass tends to have similar inputs.

//items in {} are optional
#pragma acc region \
    hetero(<cond>{,<devices>{,<sched.>{,<ratio>{,<div>}}}}) \
    pcopy{in/out}(<var>[<cond>:<num>{:<boundary>}]) \
    persist(<var>)
#pragma acc depersist(<var>)

hetero() inputs:
    <cond>       Boolean, true=coschedule, false=ignore
    <devices>    Allowable devices (cpu/gpu/all)
    <scheduler>  Scheduler to use for this region
    <ratio>      Initial split between CPU and GPU
    <div>        How many times to divide the iteration space

pcopy() and {de}persist() inputs:
    <var>        Variable to copy
    <size>       Size of each "item" in the array/matrix
    <cond>       Whether this dimension should be copied
    <num>        Number of items in this dimension
    <boundary>   Number of boundary elements required

Fig. 3: Our proposed extension

Figure ?? represents the time spent in CoreTSAR scheduling 1,900 passes through a region, or 19,000 scheduling iterations with the split scheduler. The original linear model has exponential time complexity as the number of devices increases. In the worst case, the split schedule with four GPUs, the scheduling takes nearly 3× longer than the 40-second compute phase.

Two issues reduce solver performance. The input has widely distributed values, which leads to numerical instability and slows convergence due to frequent floating-point error corrections. Also, all outputs require integer values, which requires the solver to refine an optimal solution into an optimal integer solution across those values, which significantly increases computational complexity.

To alleviate these issues we remove integer output requirements by computing the percentage of iterations to assign to each device. This choice also keeps nearly all values between zero and one, improving numerical stability. Figure ?? shows the new objective function (Equation 7) and constraints (Equations 8-11). These changes produce the optimized results in Figure ??. With this version, the time in CoreTSAR can actually decrease as the number of GPUs increases due to the consistency of GPU performance across passes. Despite the larger matrix, the solution converges faster since it deviates less from the previous solution.


Page 54: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Heterogeneous Scheduling: Issues and Solutions

• Issue: Launching overhead is high on GPUs
  – Using a work-queue, many GPU kernels may need to be run
• Solution: Schedule only at the beginning of a region
  – The overhead is only paid once or a small number of times
• Issue: Without a work-queue, how do we balance load?
  – The performance of each device must be predicted
• Solution: Allocate different amounts of work
  – For the first pass, predict a reasonable initial division of work. We use a ratio between the number of CPU and GPU cores for this.
  – For subsequent passes, use the performance of previous passes to predict the following passes.
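For the common one-CPU/one-GPU case, the per-pass update has a closed form (it is derived as Equation 6 in the paper excerpt embedded a few slides later); a minimal C sketch with illustrative names:

/* Given the measured time-per-iteration on each device from the last pass,
   choose the CPU's share of the next pass so both devices are predicted
   to finish together: i_c * p_c == (total - i_c) * p_g. */
static long split_cpu_iterations(long total_iters,
                                 double p_cpu,   /* time per iteration, CPU */
                                 double p_gpu)   /* time per iteration, GPU */
{
    double i_c = (double)total_iters * p_gpu / (p_gpu + p_cpu);
    return (long)(i_c + 0.5);   /* round to the nearest iteration */
}

The GPU then receives the remaining total_iters - split_cpu_iterations(...) iterations.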

© W. Feng, November 2013 [email protected], 540.231.1192

Page 55: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Dynamic Ratios

•  All dynamic schedulers share a common basis

•  At the end of a pass, a version of the linear program at right is solved

• For the simplified two-device case, we actually solve
  – i'_c = (i_t × p_g) / (p_g + p_c)
where
  – i'_c is the number of iterations to assign to the CPU
  – i_t is the total iterations of the next pass
  – p_* is the time per iteration for the (c)pu/(g)pu on the last pass

$I$ = total iterations in next pass (int)
$i_j$ = iterations on compute unit $j$ in next pass (int)
$p_j$ = time/iteration for compute unit $j$ from last pass
$n$ = number of compute devices
$t_j^+$ = time over equal
$t_j^-$ = time under equal

Table I: Variable definitions for the linear program

compute-bound, floating-point-intensive applications, but not for memory-bound ones or highly conditional applications, which we discuss further in Section V.

C. Dynamic Splitting

Similarly to our static scheduler, our dynamic scheduler deviates from the original OpenMP dynamic schedule policy. We make scheduling decisions only at the boundaries of accelerated regions. Thus, the dynamic scheduler assumes that the code will execute the parallel region several times. The first time, our approach executes the region as it would have been executed by the static scheduler. We measure the time taken to complete the assigned iterations on each compute unit. On all subsequent instances of the parallel region, we update the ratio based on these measurements.

Since we split at region boundaries rather than using a queue, we are subject to blocking time, during which one compute unit is waiting on the other to finish before they can pass the barrier at the end of the region. In order to minimize blocking time, we attempt to compute the ratio that causes the compute units to finish in as close to the same amount of time as possible. In order to predict the time for the next iteration, we assume that iterations take the same amount of time on average from one pass to the next. For the general case with an arbitrary number of compute units, we use a linear program for which Table I lists the necessary variables. Equation 1 represents the objective function, with the accompanying constraints in Equations 2 through 5.

$$\min\left( \sum_{j=1}^{n-1} \left( t_j^+ + t_j^- \right) \right) \quad (1)$$

$$\sum_{j=1}^{n} i_j = I \quad (2)$$

$$i_2 p_2 - i_1 p_1 = t_1^+ - t_1^- \quad (3)$$
$$i_3 p_3 - i_1 p_1 = t_2^+ - t_2^- \quad (4)$$
$$\vdots$$
$$i_n p_n - i_1 p_1 = t_{n-1}^+ - t_{n-1}^- \quad (5)$$

Expressed in words, the linear program calculates the iteration counts with predicted times that are as close to identical between all devices as possible. The constraints specify that the sum of all assigned iterations must equal the total number of iterations and that all iteration counts must be integers. Since we often have exactly two compute devices, we also use a reduced form that is only accurate for two devices but can be solved more efficiently. The new ratio is computed such that $t'_c = t'_g$, where $t'_c$ is the predicted new time for the CPU portion to finish and $t'_g$ is the predicted time to finish the GPU portion. When expanded we eventually get Equation 6, which can be solved in only a few instructions and produces a result within one iteration of the linear program for the common case.

$$i'_c \, p_c = i'_g \, p_g$$
$$i'_c \, p_c = (i'_t - i'_c) \, p_g$$
$$i'_c = \frac{(i'_t - i'_c) \, p_g}{p_c}$$
$$i'_c + \frac{i'_c \, p_g}{p_c} = \frac{i'_t \, p_g}{p_c}$$
$$\frac{i'_c \, p_c}{p_c} + \frac{i'_c \, p_g}{p_c} = \frac{i'_t \, p_g}{p_c}$$
$$i'_c \, (p_g + p_c) = i_t \, p_g$$
$$i'_c = \frac{i_t \, p_g}{p_g + p_c} \quad (6)$$

D. Special Purpose

Our design and testing indicated that neither of the schedulers above is entirely appropriate in certain cases. Thus, we created two other schedulers: split and quick.

1) Split: Our dynamic scheduling requires at least multiple executions of the parallel region to compute an improved ratio, which works well for applications which make multiple passes. However, some parallel regions are executed only a few times, or even just once. Split scheduling addresses these regions. In this case, each pass through the region begins a loop that iterates div times, with each iteration executing total_tasks/div tasks. Thus, the runtime can adjust the ratio more frequently, and earlier, than with dynamic scheduling. More importantly, it can adjust the ratio in cases that dynamic scheduling cannot. The split schedule is analogous to the original OpenMP dynamic schedule since it specifies total_iterations/chunk_size instead of chunk_size directly. It remains distinct, however, in that while it runs a number of chunks, they can be independently subdivided to avoid overloading, or underloading, a compute device. Increasing the number of passes, however, and thus the number of synchronization steps, increases overhead, which is especially problematic with short regions and those unsuitable for GPU computation, so it is unsuitable as a definitive replacement for dynamic.

2) Quick: Quick is a hybrid of the split and dynamic schedules. It executes a small section of the first pass of size iterations/div just as split does, but the rest of that pass in one step of size iterations − iterations/div. It then switches to using the dynamic schedule for the rest of the run. It targets efficient scheduling of applications with long


© W. Feng, November 2013 [email protected], 540.231.1192

Page 56: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Dynamic Schedulers

• Schedulers
  – Dynamic
  – Split
  – Quick

• Arguments
  – Ratio
    § Initial split to use, i.e., the ratio between CPU and GPU performance
  – Div
    § How far to subdivide the region

© W. Feng, November 2013 [email protected], 540.231.1192

Page 57: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Static Scheduler

Original/Master thread Worker threads Parallel region Accelerated region

Ratio: 0.5

[Diagram: Pass 1 splits 500 iterations as 250/250; Pass 2 splits 1,000 iterations as 500/500, since the static ratio never changes.]

© W. Feng, November 2013 [email protected], 540.231.1192

Page 58: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Dynamic Scheduler

Original/Master thread Worker threads Parallel region Accelerated region

Ratio: 0.5

[Diagram: Pass 1 splits 500 iterations as 250/250; after measuring device performance, Pass 2 splits 1,000 iterations as 900/100.]

© W. Feng, November 2013 [email protected], 540.231.1192

Page 59: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Split Scheduler

Original/Master thread Worker threads Parallel region Accelerated region

Ratio: 0.5, Div: 4

[Diagram: Pass 1's 500 iterations are run as four 125-iteration sub-regions, with a reschedule after each.]

© W. Feng, November 2013 [email protected], 540.231.1192

Page 60: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Dynamic vs. Split Scheduler

Original/Master thread, Worker threads, Parallel region, Accelerated region, Reschedule

Arguments: Ratio=0.5, Div=4

[Diagram: Dynamic schedules each pass once (Pass 1: 500 iterations as 250/250; Pass 2: 1,000 iterations as 900/100), while Split divides every pass into four rescheduled sub-regions (for example, 63/62-iteration and 113/12-iteration chunks).]

Page 61: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Quick Scheduler

Original/Master thread Worker threads Parallel region Accelerated region

Ratio: 0.5, Div: 4

[Diagram: Pass 1 runs a 125-iteration sub-region first, reschedules, and then runs the remaining 375 iterations in one step; subsequent passes use the dynamic schedule.]

© W. Feng, November 2013 [email protected], 540.231.1192

Page 62: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Dynamic vs. Quick Scheduler

Original/Master thread Worker threads Parallel region Accelerated region Reschedule

Arguments: Ratio=0.5, Div=4

[Diagram: Quick runs a small 125-iteration first sub-region (split 63/62), then the remaining 375 iterations of Pass 1 in one step (split 338/37), and behaves like Dynamic in Pass 2 (1,000 iterations split 900/100); Dynamic splits Pass 1 as 250/250 and Pass 2 as 900/100.]

Page 63: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Benchmarks

• NAS CG – many passes (1,800 for the C class)
  – The Conjugate Gradient benchmark from the NAS Parallel Benchmarks
• GEM – one pass
  – Molecular modeling: computes the electrostatic potential along the surface of a macromolecule
• K-Means – few passes
  – Iterative clustering of points
• Helmholtz – few passes, GPU-unsuitable
  – Jacobi iterative method implementing the Helmholtz equation

© W. Feng, November 2013 [email protected], 540.231.1192

Page 64: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Experimental Setup

•  System –  12-core AMD Opteron 6174 CPU –  NVIDIA Tesla C2050 GPU –  Linux 2.6.27.19 –  CCE compiler with OpenMP accelerator extensions

•  Procedures –  All parameters default, unless otherwise specified –  Results represent 5 or more runs

© W. Feng, November 2013 [email protected], 540.231.1192

Page 65: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Results across Schedulers for all Benchmarks

[Chart: speedup over the 12-core CPU for each benchmark (CG, GEM, Helmholtz, K-Means) under the CPU, GPU, Static, Dynamic, Split, and Quick schedulers; the y-axes range to roughly 1.4x for CG, 8x for GEM, 1.0x for Helmholtz, and 1.5x for K-Means.]

© W. Feng, November 2013 [email protected], 540.231.1192

Page 66: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

© W. Feng, November 2013 [email protected], 540.231.1192

IEEE IPDPS ’12

Performance, Programmability, Portability

Page 67: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Roadmap

Much work still to be done.
• OpenCL and the 13 Dwarfs: beta release pending
• Source-to-Source Translation: CU2CL only & no optimization
• Architecture-Aware Optimization: only manual optimizations
• Performance & Power Modeling: preliminary & pre-multicore
• Affinity-Based Cost Modeling: empirical results; modeling in progress
• Heterogeneous Task Scheduling: preliminary, with OpenMP

© W. Feng, November 2013 [email protected], 540.231.1192

An Ecosystem for Heterogeneous Parallel Computing

^

Page 68: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

An Ecosystem for Heterogeneous Parallel Computing

Software Ecosystem

Heterogeneous Parallel Computing (HPC) Platform

Sequence Alignment

Molecular Dynamics

Earthquake Modeling

Neuro- informatics

Avionic Composites

Applications

Intra-Node

^

… from Intra-Node to Inter-Node

Page 69: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Programming CPU-GPU Clusters (e.g., MPI+CUDA)

GPU  device  memory  

GPU  device  memory  

CPU  main  

memory  

CPU  main  

memory  

Network

Rank = 0 Rank = 1

if (rank == 0) {
    cudaMemcpy(host_buf, dev_buf, D2H);
    MPI_Send(host_buf, .. ..);
}

if (rank == 1) {
    MPI_Recv(host_buf, .. ..);
    cudaMemcpy(dev_buf, host_buf, H2D);
}

Node 1 Node 2
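For concreteness, a fuller sketch of the staging pattern the slide abbreviates (buffer names and sizes are illustrative, and error checking is omitted):

#include <mpi.h>
#include <cuda_runtime.h>

/* Stage a device buffer through host memory for an MPI transfer
   from rank 0's GPU to rank 1's GPU. */
void exchange(float *dev_buf, float *host_buf, int count, int rank)
{
    size_t bytes = count * sizeof(float);
    if (rank == 0) {
        cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(host_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    }
}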

© W. Feng, November 2013 [email protected], 540.231.1192

Page 70: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Goal of Programming CPU-GPU Clusters (MPI + Any Acc)

GPU  device  memory  

GPU  device  memory  

CPU  main  

memory  

CPU  main  

memory  

Network

Rank = 0 Rank = 1

if (rank == 0) {
    MPI_Send(any_buf, .. ..);
}

if (rank == 1) {
    MPI_Recv(any_buf, .. ..);
}

© W. Feng, November 2013 [email protected], 540.231.1192

IEEE HPCC ’12 ACM HPDC ‘13

Page 71: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

“Virtualizing” GPUs …

Compute Node

Physical GPU

Application

Native OpenCL Library

OpenCL API

Traditional Model

Compute Node

Physical GPU

VOCL Proxy

OpenCL API

Our Model

Native OpenCL Library Compute Node

Virtual GPU

Application

VOCL Library

OpenCL API

MPI

Compute Node

Physical GPU

VOCL Proxy

OpenCL API Native OpenCL Library

Virtual GPU MPI

© W. Feng, November 2013 [email protected], 540.231.1192

Page 72: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Conclusion •  An ecosystem for heterogeneous parallel computing

–  Enabling software that tunes parameters of hardware devices … with respect to performance, programmability, and portability … via a benchmark suite of dwarfs (i.e., motifs) and mini-apps

Highest-ranked commodity supercomputer in U.S. on the Green500 (11/11)

© W. Feng, November 2013 [email protected], 540.231.1192

Page 73: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

People

© W. Feng, November 2013 [email protected], 540.231.1192

Page 74: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

Funding Acknowledgements

© W. Feng, November 2013 [email protected], 540.231.1192

Microsoft featured our “Big Data in the Cloud” grant in U.S. Congressional Testimony

Page 75: HC-4022, Towards an Ecosystem for Heterogeneous Parallel Computing, by Wu Feng

synergy.cs.vt.edu  

http://synergy.cs.vt.edu/

http://www.mpiblast.org/

http://sss.cs.vt.edu/

http://www.green500.org/

http://myvice.cs.vt.edu/

http://www.chrec.org/

http://accel.cs.vt.edu/

“Accelerators ‘R Us”!

Wu Feng, [email protected], 540-231-1192

© W. Feng, November 2013 [email protected], 540.231.1192