Garuda-PM July 2011
(C)RG@SERC,IISc
Challenges and Opportunities in Heterogeneous Multi-core Era

R. Govindarajan
HPC Lab, SERC, IISc
Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Moore’s Law: Transistors

Moore’s Law: Performance

Processor performance doubles every 1.5 years.
Overview
• Introduction
• Trends in Processor Architecture
  – Instruction Level Parallelism
  – Multi-Cores
  – Accelerators
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Progress in Processor Architecture

• More transistors → new architecture innovations
  – Multiple-instruction-issue processors: VLIW, superscalar, EPIC
  – More on-chip caches, multiple levels of cache hierarchy, speculative execution, …

⇒ Era of Instruction Level Parallelism
Moore’s Law: Processor Architecture Roadmap (Pre-2000)

[Figure: roadmap from the first microprocessors through RISC, VLIW, superscalar, and EPIC ILP processors.]
Multicores: The Right Turn

[Figure: rather than continuing to scale a single core from 1 GHz through 3 GHz to 6 GHz, the performance roadmap turned toward multicores at a fixed 3 GHz: 2, 4, then 16 cores.]
Era of Multicores (Post 2000)

• Multiple cores on a single die
• Early efforts used the multiple cores to run multiple programs
• Throughput-oriented rather than speedup-oriented!
Moore’s Law: Processor Architecture Roadmap (Post-2000)

[Figure: the roadmap extends past the ILP processors (RISC, VLIW, superscalar, EPIC) to multi-cores.]
Progress in Processor Architecture

• More transistors → new architecture innovations
  – Multiple-instruction-issue processors
  – More on-chip caches
  – Multi-cores
  – Heterogeneous cores and accelerators: Graphics Processing Units (GPUs); Cell BE, ClearSpeed; Larrabee or SCC; reconfigurable accelerators, …

⇒ Era of Heterogeneous Accelerators
Moore’s Law: Processor Architecture Roadmap (Post-2000)

[Figure: the roadmap extends further, from the ILP processors and multi-cores to heterogeneous accelerators.]
Accelerators
Accelerators: Hype or Reality?
Some Top500 Systems (June 2011 List)
Rank  System           Description                        # Procs.          R_max (TFLOPS)
2     Tianhe           Xeon + NVIDIA Tesla C2050 GPUs     186,368           2,566
4     Nebulae-Dawning  Intel X5650 + NVIDIA Tesla C2050   55,680 + 64,960   1,271
7     Tsubame          Xeon + NVIDIA GPU                  73,278            1,192
10    Roadrunner       Opteron + Cell BE                  6,480 + 12,960    1,105
Accelerators – Cell BE
Accelerator – Fermi S2050
How is the GPU connected?

[Figure: the GPU sits outside the CPU chip, attached to a multi-core CPU with cores C0–C3, an integrated memory controller (IMC), a shared last-level cache (SLC), and memory.]
Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Handling the Multi-Core Challenge

• Shared- and distributed-memory programming: OpenMP, MPI
• Other parallel languages (partitioned global address space languages): X10, UPC, Chapel, …
• Emergence of programming languages for GPUs: CUDA, OpenCL
GPU Programming: Good News

• Emergence of programming languages for GPUs: CUDA; OpenCL, an open standard
• Growing code base: CUDA Zone; packages supporting GPUs from ISVs
• Impressive performance? Yes!
• But what about programmer productivity?
GPU Programming: Boon or Bane?

• Challenges in GPU programming:
  – Managing parallelism across SMs and SPMD cores
  – Transferring data between CPU and GPU
  – Managing CPU–GPU memory bandwidth efficiently
  – Using the different levels of memory efficiently (GPU memory, shared memory, constant and texture memory, …)
  – Choosing an efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
  – Identifying an appropriate execution configuration for efficient execution
  – Synchronizing across multiple SMs
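The coalescing challenge above comes down to buffer layout. A minimal C sketch (using a hypothetical particle record, not an example from the talk) contrasts the two common layouts by the byte offsets that consecutive threads would touch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3-field record; the names are illustrative only. */
struct ParticleAoS { float x, y, z; };     /* array of structures  */
struct ParticleSoA { float *x, *y, *z; };  /* structure of arrays  */

/* Byte offset that thread `tid` touches when every thread reads field .x */
static size_t aos_offset(size_t tid) { return tid * sizeof(struct ParticleAoS); }
static size_t soa_offset(size_t tid) { return tid * sizeof(float); }
```

With the array-of-structures layout, consecutive threads load addresses 12 bytes apart, so one warp's accesses span several memory segments; the structure-of-arrays layout makes them consecutive 4-byte words, which the coalescing hardware can merge into far fewer transactions.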
The Challenge: Data Parallelism

[Figure: competing GPU programming interfaces for data parallelism — Cg, CUDA, OpenCL, AMD CAL.]
Challenge – Level 2

[Figure: a CPU+GPU node (cores C0–C3, memory controller, shared last-level cache, memory); the challenge is synergistic execution on the CPU and GPU cores.]
Challenge – Level 3

[Figure: multiple nodes, each with its own memory and NIC, connected through a network switch; the challenge is synergistic execution across the nodes.]
Data, Thread, & Task Level Parallelism

[Figure: the programming models spanning these levels — MPI across nodes; OpenMP and SSE on multi-core CPUs; CUDA, OpenCL, and CAL, PTX, … on GPUs.]
Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
  – CUDA for Clusters
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Synergistic Execution on Multiple Heterogeneous Cores

Our Approach

[Figure: parallel languages (MPI, OpenMP, CUDA/OpenCL), streaming languages, and array languages (MATLAB) feed a common compiler/runtime system, which targets multicores (SSE), GPUs, the Cell BE, and other accelerators.]
CUDA on Other Platforms?

• CUDA programs are data parallel
• Clusters of multi-cores are typically used for task-parallel applications
• CUDA for multi-core SMPs: M-CUDA
• Can CUDA target a cluster of multi-core nodes?
CUDA on Clusters: Motivation

Motivation:
• CUDA is a natural way to write data-parallel programs
• Large code base (CUDA Zone)
• Emergence of OpenCL
• How can we leverage these on (existing) clusters without GPUs?

Challenges:
• Mismatch in parallelism granularity
• Synchronization and context switches can be more expensive!
CUDA on Multi-Cores (M-CUDA)

• Fuses the threads of a thread block into a single thread
• Uses an OpenMP parallel for loop to execute thread blocks in parallel
• Inserts proper synchronization across kernels
• Applies loop-transformation techniques to generate efficient code
• Shared memory and synchronization are inherent on a shared-memory machine!
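As a sketch of the thread-fusion idea (not M-CUDA's actual generated code), a hypothetical one-dimensional CUDA kernel becomes a loop nest over threadIdx and blockIdx, with the block loop parallelized via OpenMP:

```c
#include <assert.h>

/* Hypothetical CUDA kernel being transformed:
 *   __global__ void scale(float *a, float k)
 *   { a[blockIdx.x * blockDim.x + threadIdx.x] *= k; }
 * Fused form: the per-thread body is wrapped in loops over the thread and
 * block indices, and the block loop is shared among CPU threads. */
static void scale_fused(float *a, float k, int gridDim_x, int blockDim_x) {
    #pragma omp parallel for   /* one CPU thread per thread block */
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x) {
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            a[blockIdx_x * blockDim_x + threadIdx_x] *= k;  /* fused thread body */
        }
    }
}
```

Kernels containing `__syncthreads()` need more care (loop fission around the barrier); this sketch shows only the barrier-free case.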
CUDA-For-Clusters

• CUDA kernels naturally express parallelism at two levels: block level and thread level
• The hardware offers multiple levels of parallelism too: across nodes, and across cores within a node
• A runtime system performs this work partitioning at multiple granularities
• There is little communication between blocks (in data-parallel programs)
• Mechanisms are still needed for a “global shared memory” across nodes and for cross-node synchronization
• A ‘lightweight’ software distributed shared memory (DSM) layer provides the necessary abstraction
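A minimal sketch of the first level of this partitioning, assuming contiguous chunks of thread blocks per node (the names and policy are illustrative, not the runtime's actual scheme):

```c
#include <assert.h>

/* Split `num_blocks` thread blocks of a kernel's grid into contiguous
 * chunks across `num_nodes` cluster nodes; within a node the chunk would
 * again be divided among cores (e.g. by OpenMP). */
static void node_block_range(int num_blocks, int num_nodes, int node,
                             int *begin, int *end) {
    int base  = num_blocks / num_nodes;   /* blocks every node receives   */
    int extra = num_blocks % num_nodes;   /* first `extra` nodes get one more */
    *begin = node * base + (node < extra ? node : extra);
    *end   = *begin + base + (node < extra ? 1 : 0);
}
```

For an 8-block grid on 4 nodes this reproduces the even two-blocks-per-node split (B0,B1 on node 0, …, B6,B7 on node 3); uneven grids are handled by giving the leading nodes one extra block.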
CUDA-For-Clusters

• Relaxed consistency model for global data: consistency is enforced only at kernel boundaries
• Lazy update of shared data
• CUDA memory-management functions are linked to the DSM functions
• The runtime system partitions tasks across cores using OpenMP and MPI
[Figure: compilation flow — the CUDA program passes through source-to-source transforms into a C program with block-level code specification, which runs on an MPI + OpenMP work-distribution runtime over a software distributed shared memory layer spanning the interconnect. Thread blocks B0–B7 of an 8-block grid are distributed two per node across nodes N1–N4, each node holding two processors (P1, P2) and RAM.]
Experimental Setup

• Benchmarks from the Parboil and NVIDIA CUDA SDK suites
• 8-node Xeon cluster, each node with two quad-core sockets (2 × 4 cores)
• Baseline: M-CUDA on a single node (8 cores)
Experimental Results

[Figure: speedups of CUDA-for-Clusters with lazy update, relative to the single-node M-CUDA baseline.]
Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
  – Stream Programs on GPU
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Stream Programming Model
• A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them
• Exposes pipeline parallelism and task-level parallelism
• Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compiling techniques exist for rate-optimal, buffer-optimal, software-pipelined schedules
• We map such applications to accelerators such as GPUs and the Cell BE
StreamIt Example Program

[Figure: a 2-band equalizer — a signal source feeds a duplicate splitter; each branch applies a bandpass filter plus an amplifier; a combiner merges the two bands.]
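The split-join graph above can be rendered as ordinary C to make the per-sample firings explicit; the two band filters here are placeholder two-tap FIRs, not real bandpass designs:

```c
#include <assert.h>

/* Toy stand-ins for the two bandpass filters (crude low/high split). */
static float low_band(float prev, float cur)  { return 0.5f * (prev + cur); }
static float high_band(float prev, float cur) { return 0.5f * (cur - prev); }

/* Duplicate-split the stream, amplify each band, and recombine:
 * source -> dup splitter -> {low filter + gain, high filter + gain} -> combiner */
static void equalizer(const float *in, float *out, int n,
                      float low_gain, float high_gain) {
    float prev = 0.0f;
    for (int i = 0; i < n; ++i) {               /* one firing per sample */
        float lo = low_gain  * low_band(prev, in[i]);
        float hi = high_gain * high_band(prev, in[i]);
        out[i] = lo + hi;                       /* combiner              */
        prev = in[i];
    }
}
```

With unity gains the two toy bands sum back to the input signal, a handy sanity check; in StreamIt the same structure would be written with a duplicate splitter, two filter pipelines, and a roundrobin joiner.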
Challenges on GPUs

• Work distribution among the multiprocessors: GPUs have hundreds of processing elements (SMs and SIMD units)!
• Exploiting task-level and data-level parallelism: scheduling across the multiprocessors, with multiple concurrent threads per SM to exploit DLP
• Determining the execution configuration (number of threads for each filter) that minimizes execution time
• Lack of synchronization mechanisms between the multiprocessors of the GPU
• Managing CPU–GPU memory bandwidth efficiently
Stream Graph Execution

[Figure: a four-filter stream graph (A, B, C, D) and its software-pipelined execution on SM1–SM4 over time steps 0–7. Multiple instances of each filter (A1–A4, B1–B4, C1–C4, D1–D4) run concurrently, exhibiting data parallelism (instances of the same filter on different SMs), task parallelism (independent filters such as C and D running at the same time), and pipeline parallelism (dependent filters overlapped across time steps).]
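The skewed overlap in the schedule above can be captured by a one-line mapping from time step and pipeline stage to filter instance; this is a simplified linear-pipeline sketch, not the compiler's actual modulo schedule:

```c
#include <assert.h>

/* Software-pipelined execution of a linear S-stage pipeline: at time step
 * t, the processor running stage s works on instance (t - s).  The skew
 * means that after a fill phase of S-1 steps, every stage is busy on a
 * different instance simultaneously. */
static int instance_at(int t, int s, int n_instances) {
    int i = t - s;                                 /* skew creates the overlap */
    return (i >= 0 && i < n_instances) ? i : -1;   /* -1: stage idle (fill/drain) */
}
```

For a two-stage pipeline over four instances, at t = 2 both stages are busy (stage 0 on instance 2, stage 1 on instance 1), which is exactly the overlap the figure labels pipeline parallelism.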
Our Approach

• Multithreading: identify a good execution configuration to exploit the right amount of data parallelism
• Memory: an efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
• Task partitioning between GPU and CPU cores
• Work scheduling and processor (SM) assignment, taking communication-bandwidth restrictions into account
Compiler Framework

[Figure: compiler flow — the StreamIt program is compiled to generate code for profiling; profile runs are executed and a configuration is selected. Task partitioning (by an ILP partitioner or a heuristic partitioner), instance partitioning, and modulo scheduling follow, and code generation emits CUDA code + C code.]
Experimental Results on Tesla

[Figure: speedups for StreamIt benchmarks — Bitonic, Bitonic-Rec, Channelvoc…, DCT, DES, FFT-C, FFT-F, Filterbank, FM, MatrixMult, MPEG-Subset, TDE-PP, and Geometric… — under three schemes: Mostly GPU, Syn-Heur, and Syn-ILP. Most speedups lie between 0 and 20×, with outlier bars exceeding 32×, 52×, and 65×.]
Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
  – MATLAB on CPU/GPU
• Conclusions
Compiling MATLAB to Heterogeneous Machines
• MATLAB is an array language extensively used for scientific computation
• Expresses data parallelism – Well suited for acceleration on GPUs
• Current solutions (Jacket, GPUmat) require user annotation to identify “GPU friendly” regions
• Our compiler, MEGHA (MATLAB Execution on GPU-based Heterogeneous Architectures), is fully automatic
Compiler Overview

• The frontend constructs an SSA intermediate representation (IR) from the input MATLAB code
• Type inference is performed on the SSA IR; it is needed because MATLAB is dynamically typed
• The backend identifies “GPU-friendly” kernels, decides where to run them, and inserts the required data transfers

[Figure: compiler phases — frontend: code simplification, SSA construction, type inference; backend: kernel identification, mapping and scheduling, data transfer insertion, code generation.]
Backend: Kernel Identification

• Kernel identification finds sets of IR statements (kernels) on which mapping and scheduling decisions are made
• Modeled as a graph clustering problem
• Several costs and benefits are weighed while forming kernels:
  – Register utilization (cost)
  – Memory utilization (cost)
  – Intermediate array elimination (benefit)
  – Kernel invocation overhead reduction (benefit)
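Intermediate array elimination, one of the benefits above, is easy to see in C. Clustering the two hypothetical IR statements `t = a .* b; c = t + a;` (MATLAB notation, invented for illustration) into one kernel keeps `t` in a register instead of materializing it as a full array:

```c
#include <stddef.h>

/* Unfused, the two statements would each be a kernel and `t` would be a
 * full n-element intermediate array written and re-read from memory.
 * Fused into one kernel, `t` is a scalar in a register and both the array
 * and its memory traffic disappear. */
static void fused(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float t = a[i] * b[i];   /* intermediate kept in a register */
        c[i] = t + a[i];
    }
}
```

The clustering has to weigh this against the costs listed above: fusing more statements into one kernel raises per-thread register and memory pressure.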
Backend: Scheduling and Transfer Insertion

• Assignment and scheduling are performed using a heuristic algorithm
  – Assigns each kernel to the processor that minimizes its completion time
  – Uses a variant of list scheduling
• Data transfers for dependences within a basic block are inserted during scheduling
• Transfers for inter-basic-block dependences are inserted using a combination of data-flow analysis and edge splitting
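The completion-time rule can be sketched as follows; this is an illustrative reduction (kernels taken in list order, no dependences or transfer costs modeled), not the MEGHA scheduler itself:

```c
#include <assert.h>

#define CPU 0
#define GPU 1

/* Greedy list scheduling over two processors: each kernel, in priority
 * order, goes to whichever processor would finish it earliest given that
 * processor's current ready time and the kernel's per-processor cost. */
static void schedule(int cost[][2], int n, int assign[]) {
    int ready[2] = {0, 0};                /* when each processor frees up */
    for (int k = 0; k < n; ++k) {
        int finish_cpu = ready[CPU] + cost[k][CPU];
        int finish_gpu = ready[GPU] + cost[k][GPU];
        assign[k] = (finish_cpu <= finish_gpu) ? CPU : GPU;
        ready[assign[k]] = (assign[k] == CPU) ? finish_cpu : finish_gpu;
    }
}
```

Note how the rule load-balances automatically: once the GPU's queue grows long enough, a kernel that is nominally faster on the GPU still completes earlier on the idle CPU and is placed there.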
Results

[Figure: speedup over MATLAB for benchmarks bscholes, clos, fdtd, nb1d, and nb3d under GPUmat, CPU-only, and CPU+GPU versions. Most speedups lie between 0 and 14×; two bars reach 113× and 64×.]
Benchmarks were run on a machine with an Intel Xeon 2.83GHz quad-core processor and a Tesla S1070. Generated code was compiled using nvcc version 2.3. MATLAB version 7.9 was used.
Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Summary
• Multi-cores and heterogeneous accelerators present tremendous research opportunities in:
  – Architecture
  – High Performance Computing
  – Compilers
  – Programming Languages & Models
• The challenge is to exploit Performance while keeping programmer Productivity high and ensuring Portability of code across platforms
• PPP is important!
Thank You !!