Make HPC Easy with Domain-Specific Languages and High-Level Frameworks
Biagio Cosenza, Ph.D., DPS Group, Institut für Informatik
Universität Innsbruck, Austria
HPC Seminar at FSP Scientific Computing, Innsbruck, May 15th, 2013
Outline
• Complexity in HPC
  – Parallel hardware
  – Optimizations
  – Programming models
• Harnessing complexity
  – Automatic tuning
  – Automatic parallelization
  – DSLs
  – Abstractions for HPC
• Related work in Insieme
COMPLEXITY IN HPC
Complexity in Hardware
• The need for parallel computing
• Parallelism in hardware
• Three walls
  – Power wall
  – Memory wall
  – Instruction-level parallelism (ILP) wall
The Power Wall
Power is expensive, but transistors are free
• We can put more transistors on a chip than we have the power to turn on
• Power-efficiency challenge
  – Performance per watt is the new metric
  – Systems are often constrained by power & cooling
• This forces us to concede the battle for maximum performance of individual processing elements in order to win the war for application efficiency through optimizing total system performance
• Example
  – Intel Pentium 4 HT 670 (released May 2005): clock rate 3.8 GHz
  – Intel Core i7 3930K Sandy Bridge (released Nov. 2011): clock rate 3.2 GHz
The Memory Wall
"The growing disparity of speed between CPU and memory outside the CPU chip would become an overwhelming bottleneck"
• It changes the way we optimize programs
  – Optimize for memory vs. optimize for computation
  – E.g. a multiply is no longer considered a prohibitively slow operation compared to a load or a store
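To make the memory-wall point concrete, here is a minimal sketch (illustrative function and array names, not from the slides) of the same reduction traversed in two orders; the arithmetic is identical and only the memory access pattern differs, which is exactly the kind of optimization the slide refers to:

```c
#define DIM 512

/* Column-major traversal of a row-major array: large strides through
   memory, so each access tends to touch a fresh cache line. */
long sum_colmajor(int a[DIM][DIM]) {
    long s = 0;
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            s += a[i][j];
    return s;
}

/* Row-major traversal: walks memory contiguously. Same result,
   far fewer cache misses on typical hardware. */
long sum_rowmajor(int a[DIM][DIM]) {
    long s = 0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            s += a[i][j];
    return s;
}
```

On a cached machine the row-major version is usually several times faster for large DIM, even though both loops perform exactly the same additions.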
The ILP Wall
There are diminishing returns on finding more ILP
• Instruction-level parallelism
  – The potential overlap among instructions
  – Many ILP techniques: instruction pipelining, superscalar execution, out-of-order execution, register renaming, branch prediction
• The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible
• It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy
Parallelism in Hardware
• Intel Xeon Phi 5110P: 60 cores (240 threads), 1.053 GHz, 8 GB memory (320 GB/s bandwidth), 512-bit SIMD
• NVIDIA Tesla K20X: 2688 CUDA cores arranged in SMs, 1 GHz, 6 GB memory (250 GB/s bandwidth), 32-thread warps (SIMT)
• AMD FirePro S10000: 2×1792 stream processors, 825 MHz, 6 GB memory (480 GB/s bandwidth, dual), 64-thread wavefronts
• ARM Cortex-A50 series: up to 16 cores (4×4 cluster), 4 GB memory and banked L2
• Tilera TILE-Gx8072: 72 cores, 1.0 GHz, 23 MB on-chip cache (32 KB L1 per core, 256 KB L2 per core, 18 MB L3), 32-, 16-, and 8-bit ops
• IBM Power7+: 8-core SCM (64 cores with 4 drawers), 4 SMT threads per core, 4.14 GHz, 2 MB L2 cache (256 KB per core), 32 MB L3 cache (4 MB per core) for the 8-core SCM
The “Many-core” challenges
• Many-core vs multi-core
  – Multi-core architectures and programming models suitable for 2 to 32 processors will not easily incrementally evolve to serve many-core systems of 1000s of processors
  – Many-core is the future

[Image: Tilera TILE-Gx8072]
What does it mean?
• Hardware is evolving
  – The number of cores is the new Megahertz
• We need
  – New programming models
  – New system software
  – New supporting architectures that are naturally parallel
New Challenges
• Make it easy to write programs that execute efficiently on highly parallel computing systems
  – The target should be 1000s of cores per chip
  – Maximize productivity
• Programming models should
  – be independent of the number of processors
  – support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism
• "Autotuners" should play a larger role than conventional compilers in translating parallel programs
Parallel Programming Models

MPI, Pthreads, OpenMP, MapReduce (Google), StreamIt (MIT & Microsoft), CUDA (NVIDIA), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), Threading Building Blocks (Intel), Cilk (MIT), NESL (CMU), HPCS Chapel (Cray), HPCS X10 (IBM), HPCS Fortress (Sun), Sequoia (Stanford), Charm (Illinois), Erlang, Borealis (Brown), HMPP, OpenACC, Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg)
Reconsidering…
• Applications
  – What are common parallel kernel applications? Parallel patterns
  – Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns
  – A parallel pattern ("dwarf") is an algorithmic method that captures a pattern of computation and communication
    • E.g. dense linear algebra, sparse linear algebra, spectral methods, …
• Metrics
  – Scalability
    • An old belief was that less-than-linear scaling for a multi-processor application is failure
    • With the new hardware trend, this is no longer true: any speedup is OK!
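The scalability remark can be made precise with Amdahl's law; a small sketch (hypothetical helper, not from the slides) shows why sub-linear speedups are the normal case rather than a failure:

```c
/* Amdahl's law: if a fraction p of the work parallelizes perfectly
   over n processors, the overall speedup is
       S(n) = 1 / ((1 - p) + p / n).
   For p = 0.9 the speedup stays below 10 no matter how many
   processors are added, so "any speedup is OK" is a sound metric. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, a 90%-parallel code reaches only about 7.8× on 32 processors and about 9.9× on 1024.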
HARNESSING COMPLEXITY
Harnessing Complexity
• Compiler approaches
  – DSLs, automatic parallelization, …
• Library-based approaches
What can a compiler do for us?
• Optimize code
• Automatic tuning
• Automatic code generation
  – e.g. in order to support different hardware
• Automatically parallelize code
Automatic Parallelization
Critical opinions on parallel programming models:

Wen-mei Hwu, University of Illinois, Urbana-Champaign. "Why Sequential Programming Models Could Be the Best Way to Program Many-Core Systems".
http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf

The other way:
• Auto-parallelizing compilers
  – Sequential code => parallel code
Automatic Parallelization
• Nowadays compilers have new "tools" for analysis
  – Polyhedral model
• …but performance is still far from a manual parallelization approach

Polyhedral extraction:
• SCoP detection
• Translation to the polyhedral model

for (int i = 0; i < 100; i++) {
  A[i] = A[i+1];
}

D: { i in N : 0 <= i < 100 }
R: A[i+1] for each i in D
W: A[i] for each i in D

Analyses and transformations are applied on the model; code generation then emits IR code from the transformed model.
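The loop in the example carries an anti-dependence: iteration i reads A[i+1], which iteration i+1 overwrites, so a naive parallel-for could read already-updated values. A sketch (illustrative names; the OpenMP pragma is honored when compiled with OpenMP support and ignored otherwise) of how privatizing the read data removes the dependence:

```c
#include <string.h>

#define N 100

/* Sequential: safe because each A[i+1] is read before iteration
   i+1 overwrites it (a loop-carried anti-dependence). */
void shift_seq(int A[N + 1]) {
    for (int i = 0; i < N; i++)
        A[i] = A[i + 1];
}

/* Parallel-safe: reading from a private copy removes the
   anti-dependence, so iterations may run in any order. */
void shift_par(int A[N + 1]) {
    int old[N + 1];
    memcpy(old, A, sizeof old);
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = old[i + 1];
}
```

This copy-in ("privatization") is one of the transformations a polyhedral compiler can derive automatically once the read/write sets above are known.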
Autotuners vs Traditional Compilers
• Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler
• The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel
• The resulting space of optimizations is large
• Programming models may simplify the problem
  – but not solve it
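A toy illustration (hypothetical names, not the Insieme implementation) of what an autotuner does that a traditional compiler does not: it runs candidate configurations on the target machine and keeps the empirically fastest one:

```c
#include <time.h>
#include <stddef.h>

#define SZ 256

/* One candidate kernel: tiled matrix transpose with tile size t. */
static void transpose_tiled(const int *src, int *dst, int t) {
    for (int ii = 0; ii < SZ; ii += t)
        for (int jj = 0; jj < SZ; jj += t)
            for (int i = ii; i < ii + t && i < SZ; i++)
                for (int j = jj; j < jj + t && j < SZ; j++)
                    dst[j * SZ + i] = src[i * SZ + j];
}

/* A toy autotuner: time each candidate tile size on this machine
   and keep the fastest, as an offline search would. */
int tune_tile_size(const int *src, int *dst) {
    const int candidates[] = {8, 16, 32, 64};
    int best = candidates[0];
    double best_time = 1e30;
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        clock_t start = clock();
        transpose_tiled(src, dst, candidates[c]);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed < best_time) { best_time = elapsed; best = candidates[c]; }
    }
    return best;
}
```

A real autotuner searches a far larger space (and, as the next slides show, several objectives at once), but the principle is the same: measure, don't model.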
Optimizations' Complexity: An Example
Input:
• OpenMP code
• Simple parallel codes
  – matrix multiplication, jacobi, stencil3d, …
• Few optimizations and tuning parameters
  – Tiling 2D/3D
  – # of threads

Goal: optimize for performance and efficiency
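As an illustration of these two tuning knobs (a hypothetical sketch, not the code used in the paper), here is a matrix multiplication exposing a 2D tile size and a thread count; the pragma takes effect when compiled with OpenMP and is ignored otherwise, and `tile` is assumed to divide the matrix dimension:

```c
#define MDIM 128

/* Tiled matrix multiplication C += A * B with two tuning
   parameters: the tile edge and the number of OpenMP threads.
   Distinct (ii, jj) tiles write disjoint regions of C, so the
   collapsed outer loops are race-free. */
void matmul_tiled(const double *A, const double *B, double *C,
                  int tile, int threads) {
    #pragma omp parallel for collapse(2) num_threads(threads)
    for (int ii = 0; ii < MDIM; ii += tile)
        for (int jj = 0; jj < MDIM; jj += tile)
            for (int i = ii; i < ii + tile; i++)
                for (int k = 0; k < MDIM; k++)
                    for (int j = jj; j < jj + tile; j++)
                        C[i * MDIM + j] += A[i * MDIM + k] * B[k * MDIM + j];
}
```

Each (tile, threads) pair is one point in the search space the autotuner explores; neither parameter changes the result, only the performance and efficiency.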
Optimizations' Complexity: An Example
• Problem
  – Big search space
    • brute force takes years of computation
  – Analytical models fail to find the best configuration
• Solution
  – Multi-objective search
    • Offline search of Pareto-front solutions
    • Runtime selection according to the objective
  – Multi-versioning
H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. "A Multi-Objective Auto-Tuning Framework for Parallel Codes". ACM Supercomputing, 2012.
Optimizations’ Complexity
[Diagram: the auto-tuning workflow. Input code and the parallel target platform feed (1) the Analyzer, which identifies code regions; (2) the Optimizer explores configurations guided by (3) measurements; (4) the Backend turns the best solutions into (5) multi-versioned code at compile time; (6) the Runtime System performs dynamic selection at runtime.]
H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. "A Multi-Objective Auto-Tuning Framework for Parallel Codes". ACM Supercomputing, 2012.
Domain Specific Languages
• Ease of programming
  – Use of domain-specific concepts
    • E.g. "color", "pixel", "particle", "atom"
  – Simple interface
• Hide complexity
  – Data structures
  – Parallelization issues
  – Optimizations' tuning
  – Addressing specific parallelization patterns
Domain Specific Languages
• DSLs may help parallelization
  – Focus on domain concepts and abstractions
  – Language constraints may help automatic parallelization by compilers
• 3 major benefits
  – Productivity
  – Performance
  – Portability and forward scalability
Domain Specific Languages: GLSL Shaders (OpenGL)
[Diagram: the OpenGL 4.3 pipeline. Vertex data and pixel data flow through the vertex shader, tessellation control shader, tessellation evaluation shader, geometry shader, primitive setup and rasterization, fragment shader, and blending; a texture store feeds the shader stages.]
Vertex shader:

attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;
uniform mat4 _mvProj;
uniform mat3 _norm;
varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
  // compute position
  gl_Position = _mvProj * vec4(vertex, 1.0);
  vUv = uv1;
  // compute light info
  vNormal = _norm * normal;
}
Fragment shader:

varying vec2 vUv;
varying vec3 vNormal;
uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExp,
                         out vec3 diffuse, out float specular) {
  vec3 ecLightDir = dLight[0]; // light direction in eye coordinates
  vec3 colorIntensity = dLight[1];
  vec3 halfVector = dLight[2];
  float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
  float specularContribution = max(dot(normal, halfVector), 0.0);
  specular = pow(specularContribution, specularExp);
  diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
  vec3 diffuse;
  float spec;
  getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
  vec3 color = max(diffuse, _ambient.xyz) * mainColor;
  gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
               + vec4(spec * specularColor, 0.0);
}
DSL Examples
Matlab, DLA DSL (dense linear algebra), Python, shell script, SQL, XML, CSS, BPEL, …
• Interesting recent research work
A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron. "Quipper: A Scalable Quantum Programming Language". ACM PLDI 2013.

C. Chiw, G. Kindlmann, J. Reppy, L. Samuels, N. Seltzer. "Diderot: A Parallel DSL for Image Analysis and Visualization". ACM PLDI 2012.

L. A. Meyerovich, M. E. Torok, E. Atkinson, R. Bodík. "Superconductor: A Language for Big Data Visualization". LASH-C 2013.
Harnessing Complexity
• Compilers can do
  – Automatic parallelization
  – Optimization of (parallel) code
  – DSLs and code generation
• But well-written, hand-optimized parallel code still outperforms a compiler-based approach
Harnessing Complexity
• Compiler approaches
  – DSLs, automatic parallelization, …
• Library-based approaches
Some Examples
• Pattern-oriented
  – MapReduce (Google)
• Problem-specific
  – FLASH, adaptive-mesh refinement (AMR) code
  – GROMACS, molecular dynamics
• Hardware/programming-model specific (best performance)
  – Cactus
  – libWater*
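The MapReduce pattern mentioned above is simple to sketch (illustrative C, sequential here): apply a function to every element, then combine the results with an associative operator; that associativity is what lets a runtime distribute the map step and regroup the reduction freely:

```c
#include <stddef.h>

/* Hypothetical signatures for the two user-supplied functions. */
typedef long (*map_fn)(long);
typedef long (*reduce_fn)(long, long);

/* Sequential skeleton of the pattern: map each element, then fold
   the mapped values with the (associative) reduction operator. */
long map_reduce(const long *in, size_t n, map_fn m, reduce_fn r, long init) {
    long acc = init;
    for (size_t i = 0; i < n; i++)
        acc = r(acc, m(in[i]));
    return acc;
}

/* Example user functions: sum of squares. */
static long square(long x) { return x * x; }
static long add(long a, long b) { return a + b; }
```

A library such as Google's runs the same two functions over partitioned data on many machines; the user writes only `m` and `r`.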
Insieme Compiler and Research
• Compiler infrastructure
• Runtime support
Insieme Research: Automatic Task Partitioning for Heterogeneous HW
• Heterogeneous platforms
  – E.g. CPU + 2 GPUs
• Input: OpenCL for a single device
• Output: OpenCL code for multiple devices
• Automatic partitioning of work-items between multiple devices
  – Based on hardware, program and input size
• Machine-learning approach

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer. "An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning". ACM International Conference on Supercomputing, 2013.
Results – Architecture 1
[Bar chart: performance (%) per benchmark (DataTrans, VectorAdd, MatMul, BlackScholes, SineWave, Convolution, MolecularDynSP, MVLinReg, Kmeans, KNN, SYR2K, SobelFilter, MedianFilter, RayIntersect, FTLE, FC, FlowMap, Reduction, PerlinNoise, MersTwister, Compression, Pendulum, plus the geometric mean), comparing CPU-only, GPU-only, and the ANN-based partitioning.]
Results – Architecture 2
[Bar chart: performance (%) for the same benchmark suite and configurations (CPU-only, GPU-only, ANN-based partitioning), measured on a second architecture.]
Insieme Research: OpenCL on Cluster of Heterogeneous Nodes
• libWater
• OpenCL extensions for clusters
  – Event-based, an extension of OpenCL events
  – Supporting inter-device synchronization
• DQL
  – A DSL for device query, management and discovery
I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer. "libWater: Heterogeneous Distributed Computing Made Easy". ACM International Conference on Supercomputing, 2013.
libWater
• Runtime built on
  – OpenCL
  – pthreads, OpenMP
  – MPI
• DAG command-event representation
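A minimal sketch of such a DAG of commands linked by completion events (hypothetical types and names, not libWater's actual API): a command runs only after every command it depends on has completed, so the DAG encodes exactly the orderings the runtime must preserve and leaves everything else free for optimization.

```c
#include <stddef.h>

#define MAX_DEPS 4

typedef struct command {
    void (*run)(void *arg);         /* the work itself */
    void *arg;
    int done;                       /* completion "event" flag */
    struct command *deps[MAX_DEPS]; /* must complete before us */
    int ndeps;
} command;

/* Execute a command, recursively completing its dependencies first.
   A real runtime would instead schedule ready commands in parallel. */
void execute(command *c) {
    if (c->done) return;            /* event already signalled */
    for (int i = 0; i < c->ndeps; i++)
        execute(c->deps[i]);        /* "wait" on dependency events */
    c->run(c->arg);
    c->done = 1;                    /* signal our completion event */
}

/* Demo command: append a tag character to a shared trace. */
static char trace[16];
static size_t trace_len;
static void record(void *arg) { trace[trace_len++] = *(char *)arg; }
```

Optimizations such as the DCR and latency hiding described on the next slide are rewrites of this DAG that preserve the dependency edges.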
libWater: DAG Optimizations
• Dynamic Collective communication pattern Replacement (DCR)
• Latency hiding
• Intra-node copy optimizations
Insieme (Ongoing) Research: Support for DSLs

[Diagram: DSL input codes pass through a frontend into the intermediate representation; a transformation framework (polyhedral model, parallel optimizations, stencil computation, automatic tuning support) operates on the IR with library support (rendering algorithm implementations, geometry loaders, …); the backend emits pthreads, OpenCL, or MPI output codes for the runtime system, targeting GPUs, CPUs, heterogeneous platforms, and compute clusters.]
About Insieme
• Insieme compiler
  – Research framework
  – OpenMP, Cilk, MPI, OpenCL
  – Runtime, IR
  – Support for the polyhedral model
  – Multi-objective optimization
  – Machine learning
  – Extensible
• Insieme (GPL) and libWater (LGPL) soon available on GitHub