Make HPC Easy with Domain-Specific Languages and High-Level Frameworks
Biagio Cosenza, Ph.D.
DPS Group, Institut für Informatik, Universität Innsbruck, Austria


Page 1: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Biagio Cosenza, Ph.D.
DPS Group, Institut für Informatik

Universität Innsbruck, Austria

Page 2: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

HPC Seminar at FSP Scientific Computing, Innsbruck, May 15th, 2013

Outline

• Complexity in HPC
  – Parallel hardware
  – Optimizations
  – Programming models

• Harnessing complexity
  – Automatic tuning
  – Automatic parallelization
  – DSLs
  – Abstractions for HPC

• Related work in Insieme

Page 3: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

COMPLEXITY IN HPC

Page 4: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Complexity in Hardware

• The need for parallel computing
• Parallelism in hardware
• Three walls
  – Power wall
  – Memory wall
  – Instruction Level Parallelism

Page 5: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

The Power Wall
Power is expensive, but transistors are free

• We can put more transistors on a chip than we have the power to turn on
• Power efficiency challenge
  – Performance per watt is the new metric
  – Systems are often constrained by power & cooling
• This forces us to concede the battle for maximum performance of individual processing elements, in order to win the war for application efficiency through optimizing total system performance

• Example
  – Intel Pentium 4 HT 670 (released May 2005): clock rate 3.8 GHz
  – Intel Core i7 3930K Sandy Bridge (released Nov. 2011): clock rate 3.2 GHz

Page 6: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

The Memory Wall

The growing disparity of speed between the CPU and memory outside the CPU chip would become an overwhelming bottleneck

• It changes the way we optimize programs
  – Optimize for memory access rather than for computation
  – E.g. a multiply is no longer considered a painfully slow operation when compared to a load or a store
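Optimizing for memory rather than computation often comes down to access order. A minimal C sketch (illustrative, not from the slides): both functions compute the same sum, but the row-wise version touches memory with stride 1, fully using each cache line, and is typically several times faster on large matrices.

```c
#include <stddef.h>

#define N 512

/* Row-major traversal: consecutive accesses are 8 bytes apart,
 * so each fetched cache line is fully used before eviction. */
double sum_rowwise(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];           /* stride-1 access */
    return s;
}

/* Column-wise traversal of the same row-major array: consecutive
 * accesses are N*8 bytes apart, so most loads miss the cache. */
double sum_colwise(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];           /* stride-N access */
    return s;
}
```

Both versions perform the same number of arithmetic operations; only the memory access pattern differs, which is exactly the point of the memory wall.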

Page 7: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

The ILP Wall
There are diminishing returns on finding more ILP

• Instruction Level Parallelism
  – The potential overlap among instructions
  – Many ILP techniques
    • Instruction pipelining
    • Superscalar execution
    • Out-of-order execution
    • Register renaming
    • Branch prediction
• The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible
• There is an increasing difficulty in finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy
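One way to see ILP at source level: a single accumulator forms a serial dependence chain, while splitting it exposes independent operations that a superscalar, out-of-order core can overlap. A hedged C sketch (illustrative, not from the slides):

```c
#define LEN 1024   /* must be a multiple of 4 */

/* One accumulator: every add depends on the previous one,
 * forming a chain the hardware must execute serially. */
long sum_serial(const long *x) {
    long s = 0;
    for (int i = 0; i < LEN; i++)
        s += x[i];
    return s;
}

/* Four accumulators: the four chains are independent,
 * letting several adders work in parallel. */
long sum_ilp(const long *x) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < LEN; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

The ILP wall is precisely that such rewrites, whether done by hand or by the hardware's out-of-order engine, run out of independent work to expose.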

Page 8: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Parallelism in Hardware

• Xeon Phi 5110P (Intel): 60 cores (240 threads) @ 1.053 GHz; 8 GB memory, 320 GB/s bandwidth; 512-bit SIMD
• Tesla K20X (NVidia): 2688 CUDA cores arranged in SMs @ 1 GHz; 6 GB memory, 250 GB/s bandwidth; 32-thread warps
• FirePro S10000 (AMD): 2×1792 stream processors @ 825 MHz; 6 GB memory, 480 GB/s bandwidth (dual); 64-thread wavefronts
• Cortex A50 (ARM): up to 16 cores (4×4 cluster); 4 GB memory and banked L2
• TILE-Gx8072 (Tilera): 72 cores @ 1.0 GHz; 23 MB on-chip cache (32 KB L1 per core, 256 KB L2 per core, 18 MB L3); 32-, 16-, and 8-bit ops
• Power7+ (IBM): 8-core SCM, 64 cores with 4 drawers (4 SMT threads per core) @ 4.14 GHz; 2 MB L2 cache (256 KB per core), 32 MB L3 cache (4 MB per core) for the 8-core SCM

Page 9: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

The “Many-core” challenges

• Many-core vs multi-core
  – Multi-core architectures and programming models suitable for 2 to 32 processors will not easily and incrementally evolve to serve many-core systems of 1000s of processors
  – Many-core is the future

Tilera TILE-Gx8072

Page 10: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

What does it mean?

• Hardware is evolving
  – The number of cores is the new megahertz
• We need
  – New programming models
  – New system software
  – New supporting architectures that are naturally parallel

Page 11: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

New Challenges

• Make it easy to write programs that execute efficiently on highly parallel computing systems
  – The target should be 1000s of cores per chip
  – Maximize productivity
• Programming models should
  – be independent of the number of processors
  – support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism
• “Autotuners” should play a larger role than conventional compilers in translating parallel programs

Page 12: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Parallel Programming Models

MPI, Pthreads, OpenMP, Erlang, HMPP, OpenACC,
Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg),
MapReduce (Google), StreamIt (MIT & Microsoft), CUDA (NVidia), OpenCL (Khronos Group),
Brook (Stanford), DataCutter (Maryland), Threading Building Blocks (Intel), Cilk (MIT),
NESL (CMU), HPCS Chapel (Cray), HPCS X10 (IBM), HPCS Fortress (Sun),
Sequoia (Stanford), Charm++ (Illinois), Borealis (Brown)

Page 13: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks


Page 14: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Reconsidering…

• Applications
  – What are common parallel kernel applications?
  – Parallel patterns
    • Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns
    • A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication
    • E.g. dense linear algebra, sparse linear algebra, spectral methods, …
• Metrics
  – Scalability
    • An old belief was that less-than-linear scaling for a multi-processor application is failure
    • With the new hardware trend, this is no longer true – any speedup is OK!

Page 15: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

HARNESSING COMPLEXITY

Page 16: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Harnessing Complexity

• Compiler approaches – DSL, automatic parallelization, …

• Library-based approaches

Page 17: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

What can a compiler do for us?

• Optimize code
• Automatic tuning
• Automatic code generation
  – e.g. in order to support different hardware
• Automatically parallelize code

Page 18: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Automatic Parallelization

Critical opinions on parallel programming models suggest the other way:
• Auto-parallelizing compilers
  – Sequential code => parallel code

Wen-mei Hwu, University of Illinois, Urbana-Champaign
Why sequential programming models could be the best way to program many-core systems
http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf

Page 19: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Automatic Parallelization

• Nowadays compilers have new “tools” for analysis
  – Polyhedral model
• …but performance is still far from that of a manual parallelization approach

for (int i = 0; i < 100; i++) {
    A[i] = A[i+1];
}

Polyhedral extraction:
• SCoP detection
• Translation to the polyhedral model

  D: { i in N : 0 <= i < 100 }
  R: A[i+1] for each i in D
  W: A[i] for each i in D

Code generation:
• Generate IR code from the model
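The read/write sets of this loop encode an anti-dependence: iteration i reads A[i+1] before iteration i+1 overwrites it, so the iteration order cannot be changed blindly, which is exactly what the polyhedral model lets a compiler prove. A small C demonstration (illustrative):

```c
#define M 8

/* Original order: every A[i] receives the ORIGINAL A[i+1],
 * i.e. the array is shifted left by one.  A expects M+1 elements. */
void shift_forward(int *A) {
    for (int i = 0; i < M; i++)
        A[i] = A[i + 1];
}

/* Reversed order violates the anti-dependence: A[i+1] is
 * overwritten before iteration i reads it, so the last
 * element propagates all the way down the array. */
void shift_backward(int *A) {
    for (int i = M - 1; i >= 0; i--)
        A[i] = A[i + 1];
}
```

A dependence-aware compiler would refuse to reorder (or naively parallelize) this loop without first breaking the dependence, e.g. by double buffering.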

Page 20: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Autotuners vs Traditional Compilers

• Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler

• The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel

• The resulting optimization space is large
• The programming model may simplify the problem
  – but it does not solve it

Page 21: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Optimizations’ Complexity
An example

Input
• OpenMP code
• Simple parallel codes
  – matrix multiplication, jacobi, stencil3d, …
• Few optimizations and tuning parameters
  – Tiling 2d/3d
  – # of threads

Goal: optimize for performance and efficiency
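A hedged sketch of the kind of tunable kernel involved: a blocked matrix multiply where the tile size is exactly the sort of parameter the autotuner searches over (names and sizes here are illustrative, not taken from the framework).

```c
/* Naive reference version. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

/* Blocked (tiled) version: `tile` is a tuning parameter whose best
 * value depends on cache sizes, so it must be searched for. */
void matmul_tiled(int n, int tile, const double *A, const double *B,
                  double *C) {
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int jj = 0; jj < n; jj += tile)
                /* work on one tile that fits in cache */
                for (int i = ii; i < ii + tile && i < n; i++)
                    for (int k = kk; k < kk + tile && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + tile && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

Add the number of threads as a second parameter and the search space already grows multiplicatively, which is the complexity the slide refers to.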

Page 22: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Optimizations’ Complexity
An example

• Problem
  – Big search space
    • brute force takes years of computation
  – Analytical models fail to find the best configuration
• Solution
  – Multi-objective search
    • Offline search of Pareto-front solutions
    • Runtime selection according to the objective
  – Multi-versioning

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch
A Multi-Objective Auto-Tuning Framework for Parallel Codes
ACM Supercomputing, 2012
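The runtime half of multi-versioning can be sketched as a table of Pareto-optimal versions with their offline measurements, plus a selection step driven by the current objective. This is a simplified weighted-sum selector, not the framework's actual policy; all numbers and names are made up for illustration.

```c
#include <stddef.h>

/* One pre-compiled version of a code region, with the
 * measurements gathered during the offline search. */
struct version {
    double time;     /* seconds (measured offline) */
    double energy;   /* joules  (measured offline) */
};

/* Pick the version minimizing  w*time + (1-w)*energy,
 * where w encodes the runtime objective (w = 1: pure speed). */
size_t select_version(const struct version *v, size_t n, double w) {
    size_t best = 0;
    double best_score = w * v[0].time + (1.0 - w) * v[0].energy;
    for (size_t i = 1; i < n; i++) {
        double score = w * v[i].time + (1.0 - w) * v[i].energy;
        if (score < best_score) { best_score = score; best = i; }
    }
    return best;
}
```

Because the table only stores Pareto-front points, any objective weighting picks a version that is not dominated by another one.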

Page 23: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Optimizations’ Complexity

[Framework overview]

Compile time:
1. Analyzer: extracts code regions from the input code
2. Optimizer: explores configurations for the parallel target platform
3. Measurements: evaluate configurations on the target
4. Best solutions: the Pareto front found by the search
5. Backend: emits multi-versioned code

Runtime:
6. Runtime system: dynamic selection among the versions

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch
A Multi-Objective Auto-Tuning Framework for Parallel Codes
ACM Supercomputing, 2012

Page 24: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Domain Specific Languages

• Ease of programming
  – Use of domain-specific concepts
    • E.g. “color”, “pixel”, “particle”, “atom”
  – Simple interface
• Hide complexity
  – Data structures
  – Parallelization issues
  – Optimizations’ tuning
  – Address specific parallelization patterns

Page 25: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Domain Specific Languages

• DSLs may help parallelization
  – Focus on domain concepts and abstractions
  – Language constraints may help automatic parallelization by compilers
• 3 major benefits
  – Productivity
  – Performance
  – Portability and forward scalability

Page 26: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Domain Specific Languages
GLSL Shader (OpenGL)

OpenGL 4.3 Pipeline (stages):
Vertex Data → Vertex Shader → Tessellation Control Shader → Tessellation Evaluation Shader → Geometry Shader → Primitive Setup and Rasterization → Fragment Shader → Blending
(Pixel Data → Texture Store, feeding the shader stages)

Page 27: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

// vertex shader
attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;
uniform mat4 _mvProj;
uniform mat3 _norm;
varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
    // compute position
    gl_Position = _mvProj * vec4(vertex, 1.0);
    vUv = uv1;
    // compute light info
    vNormal = _norm * normal;
}

// fragment shader
varying vec2 vUv;
varying vec3 vNormal;
uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExponent,
                         out vec3 diffuse, out float specular) {
    vec3 ecLightDir = dLight[0]; // light direction in eye coordinates
    vec3 colorIntensity = dLight[1];
    vec3 halfVector = dLight[2];
    float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
    float specularContribution = max(dot(normal, halfVector), 0.0);
    specular = pow(specularContribution, specularExponent);
    diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
    vec3 diffuse;
    float spec;
    getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
    vec3 color = max(diffuse, _ambient.xyz) * mainColor;
    gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
                 + vec4(spec * specularColor, 0.0);
}


Page 28: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

DSL Examples

Matlab, DLA DSL (dense linear algebra), Python, shell script, SQL, XML, CSS, BPEL, …

• Interesting recent research work

A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron
Quipper: A Scalable Quantum Programming Language
ACM PLDI 2013

Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, Nick Seltzer
Diderot: A Parallel DSL for Image Analysis and Visualization
ACM PLDI 2012

Leo A. Meyerovich, Matthew E. Torok, Eric Atkinson, Rastislav Bodík
Superconductor: A Language for Big Data Visualization
LASH-C 2013

Page 29: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Harnessing Complexity

• Compilers can do
  – Automatic parallelization
  – Optimization of (parallel) code
  – DSLs and code generation
• But well-written, hand-optimized parallel code still outperforms a compiler-based approach

Page 30: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Harnessing Complexity

• Compiler approaches – DSL, automatic parallelization, …

• Library-based approaches

Page 31: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Some Examples

• Pattern oriented
  – MapReduce (Google)
• Problem specific
  – FLASH, adaptive-mesh refinement (AMR) code
  – GROMACS, molecular dynamics
• Hardware/programming model specific (best performance)
  – Cactus
  – libWater*
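The pattern-oriented entry can be illustrated with map and reduce over an array; real MapReduce adds distribution, shuffling and fault tolerance on top of this shape, so this is only a sketch of the pattern.

```c
#include <stddef.h>

typedef long (*map_fn)(long);
typedef long (*reduce_fn)(long, long);

/* Apply `m` to every element, then fold the results with `r`.
 * Each map call is independent, which is what makes the
 * pattern so easy to parallelize and distribute. */
long map_reduce(const long *in, size_t n, map_fn m,
                reduce_fn r, long init) {
    long acc = init;
    for (size_t i = 0; i < n; i++)
        acc = r(acc, m(in[i]));
    return acc;
}

long square(long x) { return x * x; }
long add(long a, long b) { return a + b; }
```

A library committing to this pattern can parallelize the map phase and tree-reduce the fold without the user writing any parallel code, which is the appeal of the pattern-oriented approach.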

Page 32: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Insieme Compiler and Research

• Compiler infrastructure
• Runtime support

Page 33: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Insieme Research: Automatic Task Partitioning for Heterogeneous HW

• Heterogeneous platforms
  – E.g. CPU + 2 GPUs
• Input: OpenCL for a single device
• Output: OpenCL code for multiple devices
• Automatic partitioning of work-items between multiple devices
  – Based on hardware, program and input size
• Machine-learning approach

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer
An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning
ACM International Conference on Supercomputing, 2013

Page 34: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Results – Architecture 1

[Bar chart: per-benchmark performance (0–100) for CPU-only, GPU-only, and the ANN-based partitioning. Benchmarks: DataTrans, VectorAdd, MatMul, BlackScholes, SineWave, Convolution, MolecularDyn, SpMV, LinReg, Kmeans, KNN, SYR2K, SobelFilter, MedianFilter, RayIntersect, FTLE, FlowMap, Reduction, PerlinNoise, MersTwister, Compression, Pendulum, GeoMean]

Page 35: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Results – Architecture 2

[Bar chart: the same per-benchmark comparison (CPU-only, GPU-only, ANN-based partitioning) on the second architecture, over the same benchmark set]

Page 36: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Insieme Research: OpenCL on Cluster of Heterogeneous Nodes

• libWater
  – OpenCL extensions for clusters
    • Event based, an extension of OpenCL events
    • Supporting intra-device synchronization
  – DQL: a DSL for device query, management and discovery

I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer
libWater: Heterogeneous Distributed Computing Made Easy
ACM International Conference on Supercomputing, 2013

Page 37: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

libWater

• Runtime
  – OpenCL
  – pthreads, OpenMP
  – MPI
• DAG command-event representation
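The DAG command-event representation can be sketched as commands that wait on the completion events of their predecessors. This toy executor only shows the dependency-ordering idea; the names and structure are illustrative and are not the libWater API.

```c
#define MAX_DEPS 4

struct command {
    int deps[MAX_DEPS];   /* indices of commands this one waits on */
    int ndeps;
    int done;             /* the "event": set once the command ran */
};

/* Run command `i` after recursively completing its dependencies.
 * Returns the number of commands executed by this call. */
int run_command(struct command *cmds, int i) {
    int ran = 0;
    if (cmds[i].done)
        return 0;
    for (int d = 0; d < cmds[i].ndeps; d++)
        ran += run_command(cmds, cmds[i].deps[d]);
    cmds[i].done = 1;     /* fire the completion event */
    return ran + 1;
}
```

A runtime holding the whole DAG can also rewrite it before execution, which is what makes optimizations like collective-pattern replacement and latency hiding possible.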

Page 38: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

libWater: DAG Optimizations

• Dynamic Collective communication pattern Replacement (DCR)
• Latency hiding
• Intra-node copy optimizations

Page 39: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks
Page 40: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Insieme (Ongoing) Research:
Support for DSLs

[Architecture overview]
• DSL frontend: input codes
• Intermediate representation with a transformation framework: polyhedral model, parallel optimizations, stencil computation, automatic tuning support
• Backend: output codes (pthreads, OpenCL, MPI)
• Library support: rendering algorithm implementations, geometry loader, …
• Runtime system, targeting: GPU, CPU, heterogeneous platform, compute cluster

Page 41: Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

About Insieme

• Insieme compiler
  – Research framework
  – OpenMP, Cilk, MPI, OpenCL
  – Runtime, IR
  – Support for the polyhedral model
  – Multi-objective optimization
  – Machine learning
  – Extensible

• Insieme (GPL) and libWater (LGPL) soon available on GitHub