Make HPC Easy with Domain-Specific Languages and High-Level Frameworks

Biagio Cosenza, Ph.D.
DPS Group, Institut für Informatik

Universität Innsbruck, Austria

HPC Seminar at FSP Scientific Computing, Innsbruck, May 15th, 2013

Outline

• Complexity in HPC
– Parallel hardware
– Optimizations
– Programming models

• Harnessing complexity
– Automatic tuning
– Automatic parallelization
– DSLs
– Abstractions for HPC

• Related work in Insieme

COMPLEXITY IN HPC


Complexity in Hardware

• The need for parallel computing
• Parallelism in hardware
• Three walls
– Power wall
– Memory wall
– Instruction-level parallelism (ILP) wall


The Power Wall
Power is expensive, but transistors are free

• We can put more transistors on a chip than we have the power to turn on
• Power efficiency challenge
– Performance per watt is the new metric
– Systems are often constrained by power & cooling
• This forces us to concede the battle for maximum performance of individual processing elements in order to win the war for application efficiency through optimizing total system performance

• Example
– Intel Pentium 4 HT 670 (released May 2005): clock rate 3.8 GHz
– Intel Core i7-3930K Sandy Bridge (released Nov. 2011): clock rate 3.2 GHz
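The flattening of clock rates in the example above follows from a standard first-order model of CMOS dynamic power (not from the slides): switching power grows linearly with frequency but quadratically with supply voltage,

$$P_{\text{dyn}} \approx \alpha \, C \, V^{2} \, f$$

where $\alpha$ is the activity factor and $C$ the switched capacitance. Since higher frequencies typically require higher voltages, raising $f$ alone quickly exhausts the chip's power budget; adding more cores at moderate clocks is the more power-efficient use of the transistor budget.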


The Memory Wall

“The growing disparity of speed between CPU and memory outside the CPU chip would become an overwhelming bottleneck”

• It changes the way we optimize programs
– Optimize for memory vs. optimize for computation
• E.g. a multiply is no longer considered a harmfully slow operation when compared to a load or a store


The ILP Wall
There are diminishing returns on finding more ILP

• Instruction-level parallelism
– The potential overlap among instructions
– Many ILP techniques
• Instruction pipelining
• Superscalar execution
• Out-of-order execution
• Register renaming
• Branch prediction
• The goal of compiler and processor designers is to identify and exploit as much ILP as possible
• It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy

Parallelism in Hardware

• Xeon Phi 5110P (Intel): 60 cores (240 threads) at 1.053 GHz; 8 GB memory, 320 GB/s bandwidth; 512-bit SIMD
• Tesla K20X (NVidia): 2688 CUDA cores, arranged in SMs; 6 GB memory, 250 GB/s bandwidth; SIMT with 32-thread warps
• FirePro S10000 (AMD): 2×1792 stream processors at 825 MHz; 6 GB memory, 480 GB/s bandwidth (dual); SIMT with 64-thread wavefronts
• Cortex-A50 (ARM): up to 16 cores (4×4 cluster) at 1 GHz; 4 GB memory and banked L2
• TILE-Gx8072 (Tilera): 72 cores at 1.0 GHz; 23 MB on-chip cache (32 KB L1 per core, 256 KB L2 per core, 18 MB L3); 32-, 16-, and 8-bit SIMD ops
• Power7+ (IBM): 8-core SCM (64 cores with 4 drawers), 4 SMT threads per core, at 4.14 GHz; 2 MB L2 cache (256 KB per core), 32 MB L3 cache (4 MB per core) per 8-core SCM

The “Many-core” challenges

• Many-core vs. multi-core
– Multi-core architectures and programming models suitable for 2 to 32 processors will not easily, incrementally evolve to serve many-core systems of 1000s of processors
– Many-core is the future

(pictured: Tilera TILE-Gx8072)


What does it mean?

• Hardware is evolving
– The number of cores is the new megahertz
• We need
– New programming models
– New system software
– New supporting architectures that are naturally parallel


New Challenges

• Make it easy to write programs that execute efficiently on highly parallel computing systems
– The target should be 1000s of cores per chip
– Maximize productivity
• Programming models should
– be independent of the number of processors
– support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism
• “Autotuners” should play a larger role than conventional compilers in translating parallel programs

Parallel Programming Models

Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg), MPI, Pthreads, MapReduce (Google), StreamIt (MIT & Microsoft), CUDA (NVidia), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), OpenMP, Threading Building Blocks (Intel), Cilk (MIT), NESL (CMU), HPCS Chapel (Cray), HPCS X10 (IBM), HPCS Fortress (Sun), Sequoia (Stanford), Charm++ (Illinois), Erlang, Borealis (Brown), HMPP, OpenACC


Reconsidering…

• Applications
– What are common parallel kernel applications?
– Parallel patterns
• Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns
• A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication
• E.g. dense linear algebra, sparse algebra, spectral methods, …
• Metrics
– Scalability
• An old belief was that less-than-linear scaling for a multi-processor application is failure
• With the new hardware trend, this is no longer true: any speedup is OK!

HARNESSING COMPLEXITY


Harnessing Complexity

• Compiler approaches – DSL, automatic parallelization, …

• Library-based approaches


What can a compiler do for us?

• Optimize code
• Automatic tuning
• Automatic code generation
– e.g. to support different hardware
• Automatically parallelize code

Automatic Parallelization

Critical opinions on parallel programming models:

The other way:
• Auto-parallelizing compilers
– Sequential code => parallel code

Wen-mei Hwu, University of Illinois, Urbana-Champaign
Why Sequential Programming Models Could Be the Best Way to Program Many-Core Systems
http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf

Automatic Parallelization

• Nowadays compilers have new “tools” for analysis
– The polyhedral model
• …but performance is still far from that of a manual parallelization approach

Polyhedral extraction:
• SCoP detection
• Translation to the polyhedral model

for (int i = 0; i < 100; i++) { A[i] = A[i+1]; }

D: { i in N : 0 <= i < 100 }
R: A[i+1] for each i in D
W: A[i] for each i in D

Code generation:
• Generate IR code from the model


Autotuners vs Traditional Compilers

• Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler
• The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel
• The resulting space of optimizations is large
• A programming model may simplify the problem
– but not solve it


Optimizations’ Complexity
An example

Input
• OpenMP code
• Simple parallel codes
– matrix multiplication, Jacobi, stencil3d, …
• Few optimizations and tuning parameters
– Tiling 2D/3D
– # of threads

Goal: optimize for performance and efficiency


Optimizations’ Complexity
An example

• Problem
– Big search space
• brute force takes years of computation
– Analytical models fail to find the best configuration
• Solution
– Multi-objective search
• Offline search for Pareto-front solutions
• Runtime selection according to the objective
– Multi-versioning

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing (SC), 2012


Optimizations’ Complexity

(diagram: the auto-tuning pipeline. At compile time, (1) an Analyzer extracts code regions from the input code; (2) an Optimizer explores configurations, guided by (3) measurements on the parallel target platform; (4) a Backend turns the best solutions into (5) multi-versioned code. At runtime, (6) the Runtime System dynamically selects among the versions.)

Domain Specific Languages

• Ease of programming
– Use of domain-specific concepts
• E.g. “color”, “pixel”, “particle”, “atom”
– Simple interface
• Hide complexity
– Data structures
– Parallelization issues
– Optimizations’ tuning
– Address specific parallelization patterns

Domain Specific Languages

• DSLs may help parallelization
– Focus on domain concepts and abstractions
– Language constraints may help automatic parallelization by compilers
• 3 major benefits
– Productivity
– Performance
– Portability and forward scalability


Domain Specific Languages
GLSL Shader (OpenGL)

(diagram: the OpenGL 4.3 pipeline: vertex data and pixel data flow through the vertex shader, tessellation control shader, tessellation evaluation shader, geometry shader, primitive setup and rasterization, fragment shader, and blending, with a texture store feeding the shader stages)

Vertex shader:

attribute vec3 vertex;
attribute vec3 normal;
attribute vec2 uv1;
uniform mat4 _mvProj;
uniform mat3 _norm;
varying vec2 vUv;
varying vec3 vNormal;

void main(void) {
    // compute position
    gl_Position = _mvProj * vec4(vertex, 1.0);
    vUv = uv1;
    // compute light info
    vNormal = _norm * normal;
}

Fragment shader:

varying vec2 vUv;
varying vec3 vNormal;
uniform vec3 mainColor;
uniform float specularExp;
uniform vec3 specularColor;
uniform sampler2D mainTexture;
uniform mat3 _dLight;
uniform vec3 _ambient;

void getDirectionalLight(vec3 normal, mat3 dLight, float specularExponent,
                         out vec3 diffuse, out float specular) {
    vec3 ecLightDir = dLight[0]; // light direction in eye coordinates
    vec3 colorIntensity = dLight[1];
    vec3 halfVector = dLight[2];
    float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
    float specularContribution = max(dot(normal, halfVector), 0.0);
    specular = pow(specularContribution, specularExponent);
    diffuse = colorIntensity * diffuseContribution;
}

void main(void) {
    vec3 diffuse;
    float spec;
    getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
    vec3 color = max(diffuse, _ambient.xyz) * mainColor;
    gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
                 + vec4(spec * specularColor, 0.0);
}


DSL Examples

Matlab, DLA DSL (dense linear algebra), Python, shell script, SQL, XML, CSS, BPEL, …

• Interesting recent research work

A. S. Green, P. L. Lumsdaine, N. J. Ross, B. Valiron. Quipper: A Scalable Quantum Programming Language. ACM PLDI 2013

C. Chiw, G. Kindlmann, J. Reppy, L. Samuels, N. Seltzer. Diderot: A Parallel DSL for Image Analysis and Visualization. ACM PLDI 2012

L. A. Meyerovich, M. E. Torok, E. Atkinson, R. Bodík. Superconductor: A Language for Big Data Visualization. LASH-C 2013


Harnessing Complexity

• Compilers can do
– Automatic parallelization
– Optimization of (parallel) code
– DSLs and code generation
• But well-written, hand-optimized parallel code still outperforms compiler-based approaches


Harnessing Complexity

• Compiler approaches – DSL, automatic parallelization, …

• Library-based approaches


Some Examples

• Pattern oriented
– MapReduce (Google)
• Problem specific
– FLASH, adaptive-mesh refinement (AMR) code
– GROMACS, molecular dynamics
• Hardware/programming-model specific
– Cactus
– libWater*
(best performance)


Insieme Compiler and Research

• Compiler infrastructure
• Runtime support

Insieme Research: Automatic Task Partitioning for Heterogeneous HW

• Heterogeneous platforms
– E.g. CPU + 2 GPUs
• Input: OpenCL for a single device
• Output: OpenCL code for multiple devices
• Automatic partitioning of work-items between multiple devices
– Based on hardware, program, and input size
• Machine-learning approach

K. Kofler, I. Grasso, B. Cosenza, T. Fahringer. An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning. ACM International Conference on Supercomputing, 2013


Results – Architecture 1

(bar chart: performance, on a 0–100 scale, of CPU-only, GPU-only, and ANN-predicted partitionings across the benchmarks DataTrans, VectorAdd, MatMul, BlackScholes, SineWave, Convolution, MolecularDynSP, MVLinReg, Kmeans, KNN, SYR2K, SobelFilter, MedianFilter, RayIntersect, FTLE, FC, FlowMap, Reduction, PerlinNoise, MersTwister, Compression, Pendulum, and the GeoMean)

Results – Architecture 2

(bar chart: the same benchmarks and the same three configurations, CPU, GPU, and ANN, on the second architecture)

Insieme Research: OpenCL on Cluster of Heterogeneous Nodes

• libWater
• OpenCL extensions for clusters
– Event based, an extension of OpenCL events
– Supporting intra-device synchronization
• DQL
– A DSL for device query, management and discovery

I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer. libWater: Heterogeneous Distributed Computing Made Easy. ACM International Conference on Supercomputing, 2013

libWater

• Runtime
– OpenCL
– pthreads, OpenMP
– MPI
• DAG command-event representation

libWater: DAG Optimizations

• Dynamic Collective communication pattern Replacement (DCR)
• Latency hiding
• Intra-node copy optimizations

Insieme (Ongoing) Research:Support for DSLs

(diagram: input codes, including DSLs, enter a Frontend and are lowered to a common Intermediate Representation; a Transformation Framework provides the polyhedral model, parallel optimizations, stencil computation, and automatic tuning support, with Library Support for rendering algorithm implementations, geometry loaders, …; the Backend emits output codes for pthreads, OpenCL, and MPI, executed by a Runtime System on the target hardware: GPU, CPU, heterogeneous platforms, or compute clusters)


About Insieme

• Insieme compiler
– Research framework
– OpenMP, Cilk, MPI, OpenCL
– Runtime, IR
– Support for the polyhedral model
– Multi-objective optimization
– Machine learning
– Extensible
• Insieme (GPL) and libWater (LGPL) soon available on GitHub
