Universitat Politècnica de Catalunya, BARCELONA TECH
Madrid, March 30, 2017

Eduard Ayguadé
Dep. Arquitectura de Computadors, U. Politècnica de Catalunya
Computer Sciences Dep., Barcelona Supercomputing Center

OpenMP tasking model: from the standard to the classroom
ICT386 multiprocessor architecture, late 80s

4 x CPU module: i386/i387, no private cache memory
4 x Memory module: 4 MB DRAM (100 ns), 64 KB shared cache (25 ns, direct-mapped)
ICT386 multiprocessor architecture, late 80s

Crossbar 5x5: gate array design, 10 ns transfer
Intelligent memory: memory mapped, 1 MB DRAM
Functions: management, synchronization and scheduling
Programmable NIC
And the wall has become higher and higher …

[Figure: a modern heterogeneous node: two sockets (S1, S2) with big.LITTLE cores and integrated GPU, DDR memory per socket, discrete GPUs with GDDR, FPGA, NVM and HCA attached via PCIe/CAPI, and QPI linking the sockets]
Rank | Name | Site | Computer | Total cores | Accel./co-proc. cores | Rmax (Gflops) | Rpeak (Gflops) | Power (kW) | Mflops/W
1 | Sunway TaihuLight | National Supercomputing Center in Wuxi | Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway | 10,649,600 | 0 | 93,014,593.88 | 125,435,904 | 15,371 | 6,051.30
2 | Tianhe-2 (MilkyWay-2) | National Super Computer Center in Guangzhou | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 | 2,736,000 | 33,862,700 | 54,902,400 | 17,808 | 1,901.54
3 | Titan | DOE/SC/Oak Ridge National Laboratory | Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 261,632 | 17,590,000 | 27,112,550 | 8,209 | 2,142.77
4 | Sequoia | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 1,572,864 | 0 | 17,173,224 | 20,132,659.2 | 7,890 | 2,176.58
5 | Cori | DOE/SC/LBNL/NERSC | Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 622,336 | 0 | 14,014,700 | 27,880,653 | 3,939 | 3,557.93
6 | Oakforest-PACS | Joint Center for Advanced High Performance Computing | PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path | 556,104 | 0 | 13,554,600 | 24,913,459 | 2,718 | 4,985.69
7 | K computer | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 0 | 10,510,000 | 11,280,384 | 12,660 | 830.18
8 | Piz Daint | Swiss National Supercomputing Centre (CSCS) | Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100 | 206,720 | 170,240 | 9,779,000 | 15,987,968 | 1,312 | 7,453.51
9 | Mira | DOE/SC/Argonne National Laboratory | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 786,432 | 0 | 8,586,612 | 10,066,330 | 3,945 | 2,176.58
10 | Trinity | DOE/NNSA/LANL/SNL | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | 301,056 | 0 | 8,100,900 | 11,078,861 | 4,232 | 1,913.92

Top10, November 2016 list
1979: Pink Floyd's The Wall album; Cray-1 delivering 80 Mflops, single processor
– OpenMP evolution and OmpSs influence
– Runtime opportunities
– Architecture opportunities
– Teaching opportunities
The vision

Plethora of architectures: heterogeneity, hierarchies
ISA/PM leaks and variability/faults/errors everywhere: programmer exposed; divergence between mental models and actual behavior
New application domains
Programmability and porting/inter-operability: legacy codes, "legacy" mental models and habits
New usage practices: analytics and visualization, new data models, interactive supercomputing (response time), virtualized data centers, cloud

[Figure: layered stack from applications and programming models, through APIs and programming interfaces, down to the ISA and low-level interfaces]
The vision

Intelligent runtime: parallelization, distribution, interoperability
– General purpose, task based, single address space
– Program logic independent of computing platform
– Performance- and power-driven optimization/tuning
– Resource management

[Figure: layered stack: applications on top of a high-level, clean, abstract programming model; below it the runtime system (resource management; performance, power and energy monitoring/insight); at the bottom APIs, the ISA and low-level interfaces]
The OmpSs forerunner for OpenMP

Proposing ideas …
– IWOMP workshop is the ideal scenario
– Participation in the OpenMP language committee and sub-committees
Demonstrated with …
– Prototype implementation: Mercurium compiler and Nanos runtime
– Benchmark/application suite
Lots of discussions and final polishing

[Figure: timeline of OpenMP releases (3.0 in 2008, 3.1 in 2011, 4.0 in 2013, 4.5 in 2015, 5.0 open) annotated with features prototyped in OmpSs ahead of the standard: tasks, task dependences, task priorities, taskloop prototyping, task reductions, dependences on taskwait, multidependences, commutative, dependences on taskloop]
OmpSs influencing tasking model evolution (1)

The big change: from a loop-based model to a task-based model
– Dynamically allocated data structures (lists, trees, …)
– Recursion

task-based model, nesting of tasks (3.0):
#pragma omp task firstprivate(list)
{ code block }

task generation waits for child tasks to finish (3.0):
#pragma omp taskwait

task generation cut-off (3.1): final(expr)
task priorities (4.5): priority(value)
OmpSs influencing tasking model evolution (2)

Directionality of task arguments
– Dependences between tasks computed at runtime
– Tasks executed as soon as all their dependences are satisfied

directionality for task arguments (4.0):
#pragma omp task depend(in: list) depend(out: list) depend(inout: list)
{ code block }
OmpSs influencing tasking model evolution (3)

Tasks become chunks of iterations of one or more associated loops
Dependences between tasks generated in the same or different taskloop constructs: under discussion

tasks out of loops and granularity control (4.5):
#pragma omp taskloop [num_tasks(value) | grainsize(value)]
for-loops

#pragma omp taskloop block(2) grainsize(B, B) \
    depend(in : block[i-1][j], block[i][j-1]) \
    depend(in : block[i+1][j], block[i][j+1]) \
    depend(out: block[i][j])
for (i = 1; i < n; i++)
  for (j = 1; j < n; j++)
    foo(i, j);

[Figure: 4x4 grid of tasks with wavefront dependences between neighbors]
OmpSs influencing tasking model evolution (4)

Reduction operations on task and taskloop not yet supported, currently in discussion

#pragma omp taskgroup reduction(+: res)
{
  while (node) {
    #pragma omp task in_reduction(+: res)
    res += node->value;
    node = node->next;
  }
}

#pragma omp taskloop grainsize(BS) reduction(+: res)
for (i = 0; i < N; ++i) {
  res += foo(v[i]);
}
OmpSs influencing tasking model evolution (5)

The number of task dependences has to be constant at compile time → multidependences proposal under discussion

for (i = 0; i < l.size(); ++i) {
  #pragma omp task depend(inout: l[i]) // T1
  { ... }
}

#pragma omp task depend(in: {l[j], j=0:l.size()}) // T2
for (i = 0; i < l.size(); ++i) {
  ...
}

Defines an iterator with a range of values that only exists in the context of the directive. Equivalent to:
depend(in: l[0], l[1], ..., l[l.size()-1])

[Figure: task graph in which all T1 tasks precede T2]
OmpSs influencing tasking model evolution (6)

New dependence type that relaxes the execution order of commutative tasks while keeping mutual exclusion between them

for (i = 0; i < l.size(); ++i) {
  #pragma omp task depend(out: l[i]) // T1
  { ... }
}

for (i = 0; i < l.size(); ++i) {
  #pragma omp task shared(x) depend(in: l[i]) depend(inout: x) // T2
  { ... }
}

#pragma omp task shared(x) depend(in: x) // T3
for (i = 0; i < l.size(); ++i) {
  ...
}

With depend(inout: x) the T2 tasks are forced to run in creation order. Replacing it with the proposed depend(commutative: x) lets the T2 tasks execute in any order, still one at a time, before T3:

#pragma omp task shared(x) depend(in: l[i]) depend(commutative: x) // T2

[Figure: task graph where the T1 tasks feed the T2 tasks, all of which precede T3]
[Figure: Nanos++ runtime: the application binary creates tasks (newTask(call0, data0); newTask(call1, data1); … waitTasks()); new tasks enter the dynamic task graph; the scheduler moves ready tasks to the ready task queue; worker threads execute tasks; at end of task, dependences are updated]
OmpSs: runtime opportunities (1)

Runtime parallelization
– Compute data dependences among tasks → dynamic task graph
– Dataflow execution of tasks → ready task queue
– Look-ahead: task instantiation can proceed past non-ready tasks (with opportunities to find distant parallelism)
OmpSs: runtime opportunities (2)

Criticality-aware scheduler: managing heterogeneous SoCs
– Tasks on the critical path detected at runtime → executed on faster cores
– Non-critical tasks executed on slower, more power-efficient cores
– Work-stealing for load balancing

"Criticality-Aware Dynamic Task Scheduling on Heterogeneous Architectures". K. Chronaki et al. ICS 2015.

[Figure: speedup of the CATS scheduler over the original (0.7x to 1.3x scale) for Cholesky, Int. Hist, QR, Heat and their average]
OmpSs: runtime opportunities (3)

Runtime parallelization: directory with per-core argument utilization
– Locality-aware task scheduling strategies
– Driving argument prefetching
OmpSs: runtime opportunities (4)

Management of address spaces and coherence
– Directory@master: identifies for every region reference where the data is
– Per-device software cache: translates host to device address spaces
– Use of device-specific mechanisms to move data, if needed
– Locality-aware scheduling when multiple devices exist

[Figure: Nanos++ runtime with scheduler, dynamic task graph, ready task queue, and a directory/caches serving data requests; worker and helper threads execute tasks alongside MIC and CUDA threads]

"Self-Adaptive OmpSs Tasks in Heterogeneous Environments". J. Planas et al. IPDPS 2013.
Example: nbody

#pragma omp target device(smp) copy_deps
#pragma omp task depend(out: [size]out) depend(in: [npart]part)
void comp_force(int size, float time, int npart, Part *part, Particle *out, int gid);

#pragma omp target device(fpga) ndrange(1, size, 128) copy_deps implements(comp_force)
#pragma omp task depend(out: [size]out) depend(in: [npart]part)
__kernel void comp_force_opencl(int size, float time, int npart, __global Part *part,
                                __global Part *out, int gid);

#pragma omp target device(cuda) ndrange(1, size, 128) copy_deps implements(comp_force)
#pragma omp task depend(out: [size]out) depend(in: [npart]part)
__global__ void comp_force_cuda(int size, float time, int npart, Part *part, Particle *out, int gid);

void Particle_array_calculate_forces(Particle *input, Particle *output, int npart, float time) {
  for (int i = 0; i < npart; i += BS)
    comp_force(BS, time, npart, input, &output[i], i);
}
OmpSs: runtime opportunities (5)

Support for multiple implementations of the same task
– Versioning scheduler:
  • Monitoring (learning) phase and scheduling phase
  • The most suitable implementation is selected each time depending on monitoring, a simple model and the OmpSs workers' load

[Figure: a new task can run on SMP workers W1/W2 (slower but early executors) or GPU worker W3 (faster but busy)]

"Self-Adaptive OmpSs Tasks in Heterogeneous Environments". J. Planas et al. IPDPS 2013.
OmpSs: runtime opportunities (6)

At cluster level …
– Task creation starts at the master node; outermost tasks may be executed on remote nodes and spawn inner tasks
– Locality-aware scheduler, overlap of transfers and computation, work stealing

[Figure: Nanos++ runtime on the master node and on remote nodes, connected through NICs: scheduler and directory/cache on the master, worker and helper threads serving data requests on every node; the directory tracks data in the host, in devices attached to the node, and on remote nodes]

"Productive Cluster Programming with OmpSs". J. Bueno et al. EuroPar 2011.
"Productive Programming of GPU Clusters with OmpSs". J. Bueno et al. IPDPS 2012.
OmpSs: architecture opportunities (1)

Novel hybrid cache/local memory
– Cache: irregular accesses
– Local memory (energy efficient, less coherence traffic): strided accesses

Runtime-assisted prefetching
• Task inputs and outputs mapped to the LMs
• Overlap with runtime
• Double buffering between tasks

8.7% speedup, 14% reduction in power, 20% reduction in network-on-chip traffic

[Figure: multicore with per-core L1 cache and local memory (LM), cluster interconnect, shared L2/L3 and DRAM]

"Runtime-Guided Management of Scratchpad Memories in Multicores". Ll. Alvarez et al. PACT 2015.
OmpSs: architecture opportunities (3)

Task dependence manager to speed up task management and dependence tracking

Hardware support for reductions
– Privatization of variables to speed up reductions → significant overhead when performing reductions on a vector or matrix block
– HW support to perform simple reduction operations (+, -, *, max, min) in the last-level cache or in memory → in-memory computation

X. Tan et al. "Performance Analysis of a Hardware Accelerator of Dependency Management for Task-based Dataflow Programming Models". ISPASS 2016.
X. Tan et al. "General Purpose Task-Dependence Management Hardware for Task-based Dataflow Programming Models". IPDPS 2017.
J. Ciesko et al. "Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-parallel Programming Models". IWOMP 2016.
OmpSs: architecture opportunities (2)

CATA: criticality-aware task acceleration
– Runtime drives per-core DVFS reconfigurations meeting a global power budget
– Hardware Runtime Support Unit (RSU) to minimize overheads that grow with the number of cores

E. Castillo et al. "CATA: Criticality Aware Task Acceleration for Multicore Processors". IPDPS 2016.

[Figure: performance and EDP improvement (80% to 150%) of Original, CATS, CATA and CATA+RSU on a 32-core system with 16 fast cores]
Teaching opportunities (1)

Two courses in the Computer Science BSc @ FIB (UPC)
– PAR – Parallelism: 5th semester, mandatory course
– PAP – Parallel Architectures and Programming: optional, computer engineering specialization

OpenMP task-based programming @ node level
Teaching opportunities (2)

PAR

Syllabus
• Tasks, task graph, parallelism
• Overheads and model: task creation, synchronization, data sharing
• Task decomposition: linear, iterative, recursive
• Task ordering and data sharing constraints
• The magic of the architecture: cache coherency and synchronization support
• Data decomposition and influence of memory

OpenMP tasks at its heart, going towards task-only parallel programming
• Embarrassingly parallel, recursive task decomposition, task/data decomposition, and branch and bound
• Tareador tool for the exploration of task decomposition strategies

Innovation: (pseudo-)flipped-classroom model
Teaching opportunities (3)

PAP

Syllabus
• Compiler and runtime support for OpenMP
• Pthreads API and lib intrinsics
• Designing an HPC cluster: components and interconnect, roofline model for flops/bw
• Message-passing interface MPI

OpenMP for node programming, with MPI towards cluster programming
• OpenMP tasking review
• Implementing a runtime to support OpenMP
• Building a 4xOdroid cluster: hardware and software installation, the effect of heterogeneity
• Programming the 4xOdroid cluster with MPI: stencil with Jacobi

Innovation: puzzle work in groups when designing the cluster
Thanks!
https://www.bsc.es