Make HPC Easy with Domain-Specific Languages and High-Level Frameworks
Biagio Cosenza, Ph.D., DPS Group, Institut für Informatik
Universität Innsbruck, Austria
HPC Seminar at FSP Scientific Computing, Innsbruck, May 15th, 2013
Outline
• Complexity in HPC
  – Parallel hardware
  – Optimizations
  – Programming models
• Harnessing complexity
  – Automatic tuning
  – Automatic parallelization
  – DSLs
  – Abstractions for HPC
• Related work in Insieme
COMPLEXITY IN HPC
Complexity in Hardware
• The need for parallel computing
• Parallelism in hardware
• Three walls
  – Power wall
  – Memory wall
  – Instruction-level parallelism (ILP) wall
The Power Wall
Power is expensive, but transistors are free
• We can put more transistors on a chip than we have the power to turn on
• Power-efficiency challenge
  – Performance per watt is the new metric
  – Systems are often constrained by power & cooling
• This forces us to concede the battle for maximum performance of individual processing elements in order to win the war for application efficiency through optimizing total system performance
• Example
  – Intel Pentium 4 HT 670 (released May 2005): clock rate 3.8 GHz
  – Intel Core i7 3930K Sandy Bridge (released Nov. 2011): clock rate 3.2 GHz
The Memory Wall
"The growing disparity of speed between CPU and memory outside the CPU chip would become an overwhelming bottleneck"
• It changes the way we optimize programs
  – Optimize for memory vs. optimize for computation
  – E.g. a multiply is no longer considered a prohibitively slow operation compared to a load or a store
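To make the memory-wall point concrete, here is a minimal sketch (illustrative function and array names, not from the slides) of the same reduction traversed in two orders; the arithmetic is identical and only the memory access pattern differs, which is exactly the kind of optimization the slide refers to:

```c
#define DIM 512

/* Column-major traversal of a row-major array: large strides through
   memory, so each access tends to touch a fresh cache line. */
long sum_colmajor(int a[DIM][DIM]) {
    long s = 0;
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            s += a[i][j];
    return s;
}

/* Row-major traversal: walks memory contiguously. Same result,
   far fewer cache misses on typical hardware. */
long sum_rowmajor(int a[DIM][DIM]) {
    long s = 0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            s += a[i][j];
    return s;
}
```

On a cached machine the row-major version is usually several times faster for large DIM, even though both loops perform exactly the same additions.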
The ILP Wall
There are diminishing returns on finding more ILP
• Instruction-level parallelism
  – The potential overlap among instructions
  – Many ILP techniques: instruction pipelining, superscalar execution, out-of-order execution, register renaming, branch prediction
• The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible
• It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy
Parallelism in Hardware
• Intel Xeon Phi 5110P: 60 cores (240 threads), 1.053 GHz, 8 GB memory (320 GB/s bandwidth), 512-bit SIMD
• NVIDIA Tesla K20X: 2688 CUDA cores arranged in SMs, 1 GHz, 6 GB memory (250 GB/s bandwidth), 32-thread warps (SIMT)
• AMD FirePro S10000: 2×1792 stream processors, 825 MHz, 6 GB memory (480 GB/s bandwidth, dual), 64-thread wavefronts
• ARM Cortex-A50 series: up to 16 cores (4×4 cluster), 4 GB memory and banked L2
• Tilera TILE-Gx8072: 72 cores, 1.0 GHz, 23 MB on-chip cache (32 KB L1 per core, 256 KB L2 per core, 18 MB L3), 32-, 16-, and 8-bit ops
• IBM Power7+: 8-core SCM (64 cores with 4 drawers), 4 SMT threads per core, 4.14 GHz, 2 MB L2 cache (256 KB per core), 32 MB L3 cache (4 MB per core) for the 8-core SCM
The “Many-core” challenges
• Many-core vs multi-core
  – Multi-core architectures and programming models suitable for 2 to 32 processors will not easily incrementally evolve to serve many-core systems of 1000s of processors
  – Many-core is the future

[Image: Tilera TILE-Gx8072]
What does it mean?
• Hardware is evolving
  – The number of cores is the new Megahertz
• We need
  – New programming models
  – New system software
  – New supporting architectures that are naturally parallel
New Challenges
• Make it easy to write programs that execute efficiently on highly parallel computing systems
  – The target should be 1000s of cores per chip
  – Maximize productivity
• Programming models should
  – be independent of the number of processors
  – support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism
• "Autotuners" should play a larger role than conventional compilers in translating parallel programs
Parallel Programming Models

MPI, Pthreads, OpenMP, MapReduce (Google), StreamIt (MIT & Microsoft), CUDA (NVIDIA), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), Threading Building Blocks (Intel), Cilk (MIT), NESL (CMU), HPCS Chapel (Cray), HPCS X10 (IBM), HPCS Fortress (Sun), Sequoia (Stanford), Charm (Illinois), Erlang, Borealis (Brown), HMPP, OpenACC, Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg)
Reconsidering…
• Applications
  – What are common parallel kernel applications? Parallel patterns
  – Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns
  – A parallel pattern ("dwarf") is an algorithmic method that captures a pattern of computation and communication
    • E.g. dense linear algebra, sparse linear algebra, spectral methods, …
• Metrics
  – Scalability
    • An old belief was that less-than-linear scaling for a multi-processor application is failure
    • With the new hardware trend, this is no longer true: any speedup is OK!
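The scalability remark can be made precise with Amdahl's law; a small sketch (hypothetical helper, not from the slides) shows why sub-linear speedups are the normal case rather than a failure:

```c
/* Amdahl's law: if a fraction p of the work parallelizes perfectly
   over n processors, the overall speedup is
       S(n) = 1 / ((1 - p) + p / n).
   For p = 0.9 the speedup stays below 10 no matter how many
   processors are added, so "any speedup is OK" is a sound metric. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, a 90%-parallel code reaches only about 7.8× on 32 processors and about 9.9× on 1024.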
HARNESSING COMPLEXITY
Harnessing Complexity
• Compiler approaches
  – DSLs, automatic parallelization, …
• Library-based approaches
What can a compiler do for us?
• Optimize code
• Automatic tuning
• Automatic code generation
  – e.g. in order to support different hardware
• Automatically parallelize code
Automatic Parallelization
Critical opinions on parallel programming models:

Wen-mei Hwu, University of Illinois, Urbana-Champaign. "Why Sequential Programming Models Could Be the Best Way to Program Many-Core Systems".
http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf

The other way:
• Auto-parallelizing compilers
  – Sequential code => parallel code
Automatic Parallelization
• Nowadays compilers have new "tools" for analysis
  – Polyhedral model
• …but performance is still far from a manual parallelization approach

Polyhedral extraction:
• SCoP detection
• Translation to the polyhedral model

for (int i = 0; i < 100; i++) {
  A[i] = A[i+1];
}

D: { i in N : 0 <= i < 100 }
R: A[i+1] for each i in D
W: A[i] for each i in D

Analyses and transformations are applied on the model; code generation then emits IR code from the transformed model.
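The loop in the example carries an anti-dependence: iteration i reads A[i+1], which iteration i+1 overwrites, so a naive parallel-for could read already-updated values. A sketch (illustrative names; the OpenMP pragma is honored when compiled with OpenMP support and ignored otherwise) of how privatizing the read data removes the dependence:

```c
#include <string.h>

#define N 100

/* Sequential: safe because each A[i+1] is read before iteration
   i+1 overwrites it (a loop-carried anti-dependence). */
void shift_seq(int A[N + 1]) {
    for (int i = 0; i < N; i++)
        A[i] = A[i + 1];
}

/* Parallel-safe: reading from a private copy removes the
   anti-dependence, so iterations may run in any order. */
void shift_par(int A[N + 1]) {
    int old[N + 1];
    memcpy(old, A, sizeof old);
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = old[i + 1];
}
```

This copy-in ("privatization") is one of the transformations a polyhedral compiler can derive automatically once the read/write sets above are known.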
Autotuners vs Traditional Compilers
• Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler
• The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel
• The resulting space of optimizations is large
• Programming models may simplify the problem
  – but not solve it
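A toy illustration (hypothetical names, not the Insieme implementation) of what an autotuner does that a traditional compiler does not: it runs candidate configurations on the target machine and keeps the empirically fastest one:

```c
#include <time.h>
#include <stddef.h>

#define SZ 256

/* One candidate kernel: tiled matrix transpose with tile size t. */
static void transpose_tiled(const int *src, int *dst, int t) {
    for (int ii = 0; ii < SZ; ii += t)
        for (int jj = 0; jj < SZ; jj += t)
            for (int i = ii; i < ii + t && i < SZ; i++)
                for (int j = jj; j < jj + t && j < SZ; j++)
                    dst[j * SZ + i] = src[i * SZ + j];
}

/* A toy autotuner: time each candidate tile size on this machine
   and keep the fastest, as an offline search would. */
int tune_tile_size(const int *src, int *dst) {
    const int candidates[] = {8, 16, 32, 64};
    int best = candidates[0];
    double best_time = 1e30;
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        clock_t start = clock();
        transpose_tiled(src, dst, candidates[c]);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed < best_time) { best_time = elapsed; best = candidates[c]; }
    }
    return best;
}
```

A real autotuner searches a far larger space (and, as the next slides show, several objectives at once), but the principle is the same: measure, don't model.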
Optimizations' Complexity: An Example
Input:
• OpenMP code
• Simple parallel codes
  – matrix multiplication, jacobi, stencil3d, …
• Few optimizations and tuning parameters
  – Tiling 2D/3D
  – # of threads

Goal: optimize for performance and efficiency
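As an illustration of these two tuning knobs (a hypothetical sketch, not the code used in the paper), here is a matrix multiplication exposing a 2D tile size and a thread count; the pragma takes effect when compiled with OpenMP and is ignored otherwise, and `tile` is assumed to divide the matrix dimension:

```c
#define MDIM 128

/* Tiled matrix multiplication C += A * B with two tuning
   parameters: the tile edge and the number of OpenMP threads.
   Distinct (ii, jj) tiles write disjoint regions of C, so the
   collapsed outer loops are race-free. */
void matmul_tiled(const double *A, const double *B, double *C,
                  int tile, int threads) {
    #pragma omp parallel for collapse(2) num_threads(threads)
    for (int ii = 0; ii < MDIM; ii += tile)
        for (int jj = 0; jj < MDIM; jj += tile)
            for (int i = ii; i < ii + tile; i++)
                for (int k = 0; k < MDIM; k++)
                    for (int j = jj; j < jj + tile; j++)
                        C[i * MDIM + j] += A[i * MDIM + k] * B[k * MDIM + j];
}
```

Each (tile, threads) pair is one point in the search space the autotuner explores; neither parameter changes the result, only the performance and efficiency.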
Optimizations' Complexity: An Example
• Problem
  – Big search space
    • brute force takes years of computation
  – Analytical models fail to find the best configuration
• Solution
  – Multi-objective search
    • Offline search of Pareto-front solutions
    • Runtime selection according to the objective
  – Multi-versioning
H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. "A Multi-Objective Auto-Tuning Framework for Parallel Codes". ACM Supercomputing, 2012.
Optimizations’ Complexity
[Diagram: the auto-tuning workflow. Input code and the parallel target platform feed (1) the Analyzer, which identifies code regions; (2) the Optimizer explores configurations guided by (3) measurements; (4) the Backend turns the best solutions into (5) multi-versioned code at compile time; (6) the Runtime System performs dynamic selection at runtime.]
H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. "A Multi-Objective Auto-Tuning Framework for Parallel Codes". ACM Supercomputing, 2012.
Domain Specific Languages
• Ease of programming
  – Use of domain-specific concepts
    • E.g. "color", "pixel", "particle", "atom"
  – Simple interface
• Hide complexity
  – Data structures
  – Parallelization issues
  – Optimizations' tuning
  – Addressing specific parallelization patterns
Domain Specific Languages
• DSLs may help parallelization
  – Focus on domain concepts and abstractions
  – Language constraints may help automatic parallelization by compilers
• 3 major benefits
  – Productivity
  – Performance
  – Portability and forward scalability
Domain Specific Languages: GLSL Shaders (OpenGL)
[Diagram: the OpenGL 4.3 pipeline. Vertex data and pixel data flow through the vertex shader, tessellation control shader, tessellation evaluation shader, geometry shader, primitive setup and rasterization, fragment shader, and blending; a texture store feeds the shader stages.]
Vertex shader:

attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;
uniform mat4 _mvProj;
uniform mat3 _norm;
varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
  // compute position
  gl_Position = _mvProj * vec4(vertex, 1.0);
  vUv = uv1;
  // compute light info
  vNormal = _norm * normal;
}
Fragment shader:

varying vec2 vUv;
varying vec3 vNormal;
uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExp,
                         out vec3 diffuse, out float specular) {
  vec3 ecLightDir = dLight[0]; // light direction in eye coordinates
  vec3 colorIntensity = dLight[1];
  vec3 halfVector = dLight[2];
  float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
  float specularContribution = max(dot(normal, halfVector), 0.0);
  specular = pow(specularContribution, specularExp);
  diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
  vec3 diffuse;
  float spec;
  getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
  vec3 color = max(diffuse, _ambient.xyz) * mainColor;
  gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
               + vec4(spec * specularColor, 0.0);
}
DSL Examples
Matlab, DLA DSL (dense linear algebra), Python, shell script, SQL, XML, CSS, BPEL, …
• Interesting recent research work
A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron. "Quipper: A Scalable Quantum Programming Language". ACM PLDI 2013.

C. Chiw, G. Kindlmann, J. Reppy, L. Samuels, N. Seltzer. "Diderot: A Parallel DSL for Image Analysis and Visualization". ACM PLDI 2012.

L. A. Meyerovich, M. E. Torok, E. Atkinson, R. Bodík. "Superconductor: A Language for Big Data Visualization". LASH-C 2013.
Harnessing Complexity
• Compilers can do
  – Automatic parallelization
  – Optimization of (parallel) code
  – DSLs and code generation
• But well-written, hand-optimized parallel code still outperforms a compiler-based approach
Harnessing Complexity
• Compiler approaches
  – DSLs, automatic parallelization, …
• Library-based approaches
Some Examples
• Pattern-oriented
  – MapReduce (Google)
• Problem-specific
  – FLASH, adaptive-mesh refinement (AMR) code
  – GROMACS, molecular dynamics
• Hardware/programming-model specific (best performance)
  – Cactus
  – libWater*
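The MapReduce pattern mentioned above is simple to sketch (illustrative C, sequential here): apply a function to every element, then combine the results with an associative operator; that associativity is what lets a runtime distribute the map step and regroup the reduction freely:

```c
#include <stddef.h>

/* Hypothetical signatures for the two user-supplied functions. */
typedef long (*map_fn)(long);
typedef long (*reduce_fn)(long, long);

/* Sequential skeleton of the pattern: map each element, then fold
   the mapped values with the (associative) reduction operator. */
long map_reduce(const long *in, size_t n, map_fn m, reduce_fn r, long init) {
    long acc = init;
    for (size_t i = 0; i < n; i++)
        acc = r(acc, m(in[i]));
    return acc;
}

/* Example user functions: sum of squares. */
static long square(long x) { return x * x; }
static long add(long a, long b) { return a + b; }
```

A library such as Google's runs the same two functions over partitioned data on many machines; the user writes only `m` and `r`.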
Insieme Compiler and Research
• Compiler infrastructure
• Runtime support
Insieme Research: Automatic Task Partitioning for Heterogeneous HW
• Heterogeneous platforms
  – E.g. CPU + 2 GPUs
• Input: OpenCL for a single device
• Output: OpenCL code for multiple devices
• Automatic partitioning of work-items between multiple devices
  – Based on hardware, program and input size
• Machine-learning approach

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer. "An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning". ACM International Conference on Supercomputing, 2013.
Results – Architecture 1
[Bar chart: performance (%) per benchmark (DataTrans, VectorAdd, MatMul, BlackScholes, SineWave, Convolution, MolecularDynSP, MVLinReg, Kmeans, KNN, SYR2K, SobelFilter, MedianFilter, RayIntersect, FTLE, FC, FlowMap, Reduction, PerlinNoise, MersTwister, Compression, Pendulum, plus the geometric mean), comparing CPU-only, GPU-only, and the ANN-based partitioning.]
Results – Architecture 2
[Bar chart: performance (%) for the same benchmark suite and configurations (CPU-only, GPU-only, ANN-based partitioning), measured on a second architecture.]
Insieme Research: OpenCL on Cluster of Heterogeneous Nodes
• libWater
• OpenCL extensions for clusters
  – Event-based, an extension of OpenCL events
  – Supporting inter-device synchronization
• DQL
  – A DSL for device query, management and discovery
I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer. "libWater: Heterogeneous Distributed Computing Made Easy". ACM International Conference on Supercomputing, 2013.
libWater
• Runtime built on
  – OpenCL
  – pthreads, OpenMP
  – MPI
• DAG command-event representation
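A minimal sketch of such a DAG of commands linked by completion events (hypothetical types and names, not libWater's actual API): a command runs only after every command it depends on has completed, so the DAG encodes exactly the orderings the runtime must preserve and leaves everything else free for optimization.

```c
#include <stddef.h>

#define MAX_DEPS 4

typedef struct command {
    void (*run)(void *arg);         /* the work itself */
    void *arg;
    int done;                       /* completion "event" flag */
    struct command *deps[MAX_DEPS]; /* must complete before us */
    int ndeps;
} command;

/* Execute a command, recursively completing its dependencies first.
   A real runtime would instead schedule ready commands in parallel. */
void execute(command *c) {
    if (c->done) return;            /* event already signalled */
    for (int i = 0; i < c->ndeps; i++)
        execute(c->deps[i]);        /* "wait" on dependency events */
    c->run(c->arg);
    c->done = 1;                    /* signal our completion event */
}

/* Demo command: append a tag character to a shared trace. */
static char trace[16];
static size_t trace_len;
static void record(void *arg) { trace[trace_len++] = *(char *)arg; }
```

Optimizations such as the DCR and latency hiding described on the next slide are rewrites of this DAG that preserve the dependency edges.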
libWater: DAG Optimizations
• Dynamic Collective communication pattern Replacement (DCR)
• Latency hiding
• Intra-node copy optimizations
Insieme (Ongoing) Research: Support for DSLs

[Diagram: DSL input codes pass through a frontend into the intermediate representation; a transformation framework (polyhedral model, parallel optimizations, stencil computation, automatic tuning support) operates on the IR with library support (rendering algorithm implementations, geometry loaders, …); the backend emits pthreads, OpenCL, or MPI output codes for the runtime system, targeting GPUs, CPUs, heterogeneous platforms, and compute clusters.]
About Insieme
• Insieme compiler
  – Research framework
  – OpenMP, Cilk, MPI, OpenCL
  – Runtime, IR
  – Support for the polyhedral model
  – Multi-objective optimization
  – Machine learning
  – Extensible
• Insieme (GPL) and libWater (LGPL) soon available on GitHub