Tools for Data Science
Introduction to Parallel Computing
Fall 2014 Tools for Data Science 2
Overview
❏ Parallel Computing
❏ The name of the game
❏ Programming models
❏ Parallel architectures
❏ Multi-core everywhere
❏ Hardware examples
Parallelism is everywhere
❏ In today's computer installations one has many levels of parallelism:
❏ Instruction level (ILP)
❏ Chip level (multi-core, multi-threading)
❏ System level (SMP)
❏ GP-GPUs
❏ Grid/Cluster
What is Parallelization?
The Name of the Game
What is Parallelization?
An attempt of a definition:
“Something” is parallel, if there is a certain level of independence in the order of operations
“Something” can be:
► A collection of program statements
► An algorithm
► A part of your program
► The problem you are trying to solve
(listed in order of increasing granularity)
Parallelism – when?
Anything that does not satisfy this independence requirement is not parallel!
Parallelism – Example 1
Parallelism – Example 2
Parallelism – Results example 2
❏ The parallel version of example 2 was run 4 times each on 1, 8, 32 and 64 threads/processors
❏ Output: sum over all elements of vector a
❏ Except for P = 1, the results are:
❏ Wrong
❏ Inconsistent
❏ NOT reproducible
❏ This is called a 'Data Race'
Parallelism
Fundamental problem:
for (i = 0; i < n; i++) a[i] = a[i+M] + b[i];
M = 0 : parallel; M >= 1 : not parallel
What is a Thread?
❏ Loosely said, a thread consists of a series of instructions with its own program counter (“PC”) and state
❏ A parallel program will execute threads in parallel
❏ These threads are then scheduled onto processors
Single- vs. multi-threaded
(Diagram:)
single-thread: the process has code, data and I/O handles, plus one set of registers and one stack
multi-thread: code, data and I/O handles are shared, while each thread has its own registers, its own stack and its own thread-private data
Parallelism vs Concurrency
Concurrent and parallel execution:
Concurrent, non-parallel execution:
e.g. multiple threads on a single-core CPU
Parallelism vs Concurrency
(Venn diagram: parallel programs are a subset of concurrent programs, which are a subset of all programs.)
Numerical Results
Basic concepts
❏ Consider the following code with two loops
❏ Running both loops in parallel at the same time might give the wrong answer: the second loop reads a[i] before the first loop has finished writing it.
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
for (i = 0; i < n; i++) d[i] = a[i] + e[i];
Basic concepts – the barrier
❏ The problem can be fixed:
❏ The barrier ensures that no thread starts working on the second loop before all work on loop one is finished.
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
/* barrier: wait! */
for (i = 0; i < n; i++) d[i] = a[i] + e[i];
Basic concepts – the barrier
When to use barriers?
❏ To assure data integrity, e.g.
❏ after one iteration in a solver
❏ between parts of the code that read and write the same variables
❏ Barriers are expensive and don't scale to a large number of threads
Basic concepts – reduction
❏ A typical code fragment (serial code):
for( i = 0; i < n; i++ ) { ... sum += a[i]; ...}
❏ This loop cannot run in parallel, unless the update of sum is protected.
❏ An operation like the above is called a “reduction” operation, and there are ways to handle this issue (more later...).
Parallel Overhead
❏ The total CPU time may exceed the serial CPU time:
✔ The newly introduced parallel portions in your program need to be executed
✔ Threads need time for sending data to each other and for synchronizing (“communication”)
❏ Typically, this overhead also grows as the number of threads increases
❏ Efficient parallelization is about minimizing the communication overhead
Communication
Load Balancing
Dilemma – Where to parallelize?
Scalability – speed-up & efficiency
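The plots for this slide did not survive extraction; the standard definitions behind it are:

```latex
S(P) = \frac{T(1)}{T(P)}
\qquad
E(P) = \frac{S(P)}{P}
```

where T(P) is the elapsed (wallclock) time on P processors; ideal (linear) speed-up is S(P) = P, i.e. efficiency E(P) = 1.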
Amdahl's Law
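The figures for these slides are missing; the law they illustrate is:

```latex
S(P) = \frac{1}{(1 - f) + \frac{f}{P}}
\qquad\Longrightarrow\qquad
\lim_{P \to \infty} S(P) = \frac{1}{1 - f}
```

where f is the fraction of the execution time that can be parallelized. The serial fraction caps the speed-up: even f = 0.95 limits S to 20, no matter how many processors are used.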
Amdahl's Law in Practice
Code scalability in practice – II
❏ Ideally, HPC codes would be able to scale to the theoretical limit, but ...
❏ this is never the case in reality
❏ All codes eventually reach a real upper limit on speedup
❏ At some point codes become “bound” to one or more limiting hardware factors (memory, network, I/O)
What is Parallelization? - Summary
❏ Parallelization is simply another optimization technique to get your results sooner
❏ To this end, more than one processor is used to solve the problem
❏ The “Elapsed Time” (also called wallclock time) will come down, but total CPU time will probably go up
❏ The latter differs from serial optimization, where one makes better use of existing resources, i.e. the cost comes down
Parallel Programming Models
Parallel Programming Models
Two “classic” parallel programming models:
❏ Distributed memory (runs on clusters and SMPs)
❏ PVM (standardized)
❏ MPI (de-facto standard, widely used)
❏ http://mpi-forum.org or http://open-mpi.org/
❏ Shared memory (SMP only)
❏ Pthreads (standardized)
❏ OpenMP (de-facto standard)
❏ http://openmp.org/
❏ Automatic parallelization (depends on compiler)
Parallel Programming Models
“Upcoming” programming models
❏ PGAS (Partitioned Global Address Space):
❏ UPC (Unified Parallel C)
❏ Co-Array Fortran
❏ GPUs: massively parallel & shared memory
❏ CUDA
❏ OpenCL
Parallel Programming Models
Distributed memory programming model, e.g. MPI:
❏ all data is private to the threads
❏ data is shared by exchanging buffers
❏ explicit synchronization
Parallel Programming Models
MPI:
❏ An MPI application is a set of independent processes (threads)
❏ on different machines
❏ on the same machine
❏ communication over the interconnect
❏ network (network of workstations, cluster, grid)
❏ memory (SMP)
❏ communication is under control of the programmer
Parallel Programming Models
Shared memory model, e.g. OpenMP:
❏ all threads have access to the same global memory
❏ data transfer is transparent to the programmer
❏ synchronization is (mostly) implicit
❏ there is private data as well
Parallel Programming Models
OpenMP:
❏ needs an SMP
❏ but ... with newer CPU designs, there is an SMP in (almost) every computer
❏ multi-core CPUs (CMP)
❏ chip multi-threading (CMT)
❏ or a combination of both, e.g. the Sun UltraSPARC-T series
❏ or ... (whatever we'll see in the future)
MPI vs OpenMP
OpenMP version of “Hello world”:
#include <stdio.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        printf("Hello world!\n");
    } /* end parallel */
    return(0);
}
MPI vs OpenMP
no. of threads: OMP_NUM_THREADS
% cc -o hello -xopenmp hello.c
% ./hello
Hello world!
% OMP_NUM_THREADS=2 ./hello
Hello world!
Hello world!
% OMP_NUM_THREADS=8 ./hello
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
MPI vs OpenMP
MPI version of “Hello world”:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int myrank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    printf("Hello world from %d!\n", myrank);
    MPI_Finalize();
    return 0;
}
MPI vs OpenMP
MPI version: compile and run
$ cc -I/.../MPI/include -R/.../MPI/lib \
     -L/.../MPI/lib -o hello_mpi hello_mpi.c -lmpi
$ ./hello_mpi
ERROR in MPI_Init: unclassified error:
RTE_Init_lib: Cannot initialize: the program must be started by mprun.: Invalid request
Fatal error, aborting.
$ mprun -np 4 ./hello_mpi
Hello world from 1!
Hello world from 3!
Hello world from 0!
Hello world from 2!
Multi-core – everywhere!
Welcome to a “threaded” world
Today’s Multicores
99% of Top500 Systems Are Based on Multicore
Sun Niagara2 (8 cores)
Intel Polaris [experimental] (80 cores)
AMD Istanbul (6 cores)
IBM Cell (9 cores)
Intel Nehalem (4 cores)
282 use Quad-Core
204 use Dual-Core
3 use Nona-core
Fujitsu Venus (8 cores)
IBM Power 7 (8 cores)
What is a multi-core chip?
❏ A “core” is not well-defined – let us assume it covers the processing units and the L1 caches (a very simplified CPU).
❏ Different implementations are possible – and available (examples follow), e.g. multi-threaded cores
❏ Cache hierarchy of private and shared caches
❏ For software developers it matters that there is parallelism in the hardware that they can take advantage of
A generic multi-core design
AMD Opteron – quad-core
❏ dedicated L2 caches
❏ shared L3 cache
UltraSPARC-T2
System on a chip:
❏ 8 cores with 8 threads = 64 threads
❏ integrated multi-threaded 10 Gb/s Ethernet
❏ integrated crypto-unit per core
❏ low power (< 95W)
❏ < 1.5W/thread
Why add threads to a core?
(Diagrams: execution of two threads with idle slots, vs. interleaving the work for better utilization)
Keyword: “Throughput Computing”
The future
“Prediction is very difficult, especially if it is about the future.”
-- Niels Bohr (1885-1962)
The future ... of multi-core
Possible chip organizations:
❏ All Large Core
❏ Mixed Large and Small Core
❏ All Small Core / Many Small Cores
❏ Many Floating-Point Cores
Different Classes of Chips: Home, Games / Graphics, Business, Scientific
Many-core challenges I
Many-core challenges II
Many-core challenges III
A 2014 multi-core chip: Intel Haswell
Summary
❏ You have heard about:
❏ Parallel programming models and basic concepts
❏ Parallel architectures (shared / distributed memory)
❏ Multi-core architectures
❏ GPUs/accelerators (more next week)
The future ... of software
❏ Challenges:
❏ how to handle millions of cores/threads ... reliably
❏ re-design of algorithms: from coarse-grained parallelism to fine-grained parallelism
❏ more development tools needed to achieve this
❏ we need standards, to assure portability!
❏ long-term perspectives