Tools for Data Science
Introduction to Parallel Computing
Fall 2014 Tools for Data Science 2
Overview
❏ Parallel Computing
❏ The name of the game
❏ Programming models
❏ Parallel architectures
❏ Multi-core everywhere
❏ Hardware examples
Parallelism is everywhere
❏ In today's computer installations one has many levels of parallelism:
❏ Instruction level (ILP)
❏ Chip level (multi-core, multi-threading)
❏ System level (SMP)
❏ GP-GPUs
❏ Grid/Cluster
What is Parallelization?
The Name of the Game
What is Parallelization?
An attempt of a definition:
“Something” is parallel, if there is a certain level of independence in the order of operations
“Something” can be:
► A collection of program statements
► An algorithm
► A part of your program
► The problem you are trying to solve
(listed in order of increasing granularity)
Parallelism – when?
Anything that does not satisfy this independence requirement is not parallel!
Parallelism – Example 1
Parallelism – Example 2
Parallelism – Results example 2
❏ The parallel version of example 2 was run 4 times each on 1, 8, 32 and 64 threads/processors
❏ Output: sum over all elements of vector a
❏ Except for P = 1, the results are:
❏ Wrong
❏ Inconsistent
❏ NOT reproducible
❏ This is called a 'Data Race'
Parallelism
Fundamental problem:
for (i = 0; i < n; i++) a[i] = a[i+M] + b[i];
M = 0 : parallel; M >= 1 : not parallel
What is a Thread?
❏ Loosely said, a thread consists of a series of instructions with its own program counter (“PC”) and state
❏ A parallel program will execute threads in parallel
❏ These threads are then scheduled onto processors
Single- vs. multi-threaded
(Diagram:)
single-thread: the process has code, data and I/O handles, plus one set of registers and one stack
multi-thread: code, data and I/O handles are shared, while each thread has its own registers, its own stack and its own thread-private data
Parallelism vs Concurrency
Concurrent and parallel execution:
Concurrent, non-parallel execution:
e.g. multiple threads on a single-core CPU
Parallelism vs Concurrency
(Venn diagram: parallel programs are a subset of concurrent programs, which are a subset of all programs.)
Numerical Results
Basic concepts
❏ Consider the following code with two loops
❏ Running both loops in parallel at the same time might give the wrong answer: the second loop reads a[i] before the first loop has finished writing it.
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
for (i = 0; i < n; i++) d[i] = a[i] + e[i];
Basic concepts – the barrier
❏ The problem can be fixed:
❏ The barrier ensures that no thread starts working on the second loop before all work on loop one is finished.
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
/* barrier: wait! */
for (i = 0; i < n; i++) d[i] = a[i] + e[i];
Basic concepts – the barrier
When to use barriers?
❏ To assure data integrity, e.g.
❏ after one iteration in a solver
❏ between parts of the code that read and write the same variables
❏ Barriers are expensive and don't scale to a large number of threads
Basic concepts – reduction
❏ A typical code fragment (serial code):
for( i = 0; i < n; i++ ) { ... sum += a[i]; ...}
❏ This loop cannot run in parallel, unless the update of sum is protected.
❏ An operation like the above is called a “reduction” operation, and there are ways to handle this issue (more later...).
Parallel Overhead
❏ The total CPU time may exceed the serial CPU time:
✔ The newly introduced parallel portions in your program need to be executed
✔ Threads need time for sending data to each other and for synchronizing (“communication”)
❏ Typically, this overhead also grows as the number of threads increases
❏ Efficient parallelization is about minimizing the communication overhead
Communication
Load Balancing
Dilemma – Where to parallelize?
Scalability – speed-up & efficiency
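The plots for this slide did not survive extraction; the standard definitions behind it are:

```latex
S(P) = \frac{T(1)}{T(P)}
\qquad
E(P) = \frac{S(P)}{P}
```

where T(P) is the elapsed (wallclock) time on P processors; ideal (linear) speed-up is S(P) = P, i.e. efficiency E(P) = 1.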
Amdahl's Law
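The figures for these slides are missing; the law they illustrate is:

```latex
S(P) = \frac{1}{(1 - f) + \frac{f}{P}}
\qquad\Longrightarrow\qquad
\lim_{P \to \infty} S(P) = \frac{1}{1 - f}
```

where f is the fraction of the execution time that can be parallelized. The serial fraction caps the speed-up: even f = 0.95 limits S to 20, no matter how many processors are used.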
Amdahl's Law in Practice
Code scalability in practice – II
❏ Ideally, HPC codes would be able to scale to the theoretical limit, but ...
❏ this is never the case in reality
❏ All codes eventually reach a real upper limit on speedup
❏ At some point codes become “bound” to one or more limiting hardware factors (memory, network, I/O)
What is Parallelization? - Summary
❏ Parallelization is simply another optimization technique to get your results sooner
❏ To this end, more than one processor is used to solve the problem
❏ The “Elapsed Time” (also called wallclock time) will come down, but total CPU time will probably go up
❏ The latter differs from serial optimization, where one makes better use of existing resources, i.e. the cost comes down
Parallel Programming Models
Parallel Programming Models
Two “classic” parallel programming models:
❏ Distributed memory (runs on clusters and SMPs)
❏ PVM (standardized)
❏ MPI (de-facto standard, widely used)
❏ http://mpi-forum.org or http://open-mpi.org/
❏ Shared memory (SMP only)
❏ Pthreads (standardized)
❏ OpenMP (de-facto standard)
❏ http://openmp.org/
❏ Automatic parallelization (depends on compiler)
Parallel Programming Models
“Upcoming” programming models
❏ PGAS (Partitioned Global Address Space):
❏ UPC (Unified Parallel C)
❏ Co-Array Fortran
❏ GPUs: massively parallel & shared memory
❏ CUDA
❏ OpenCL
Parallel Programming Models
Distributed memory programming model, e.g. MPI:
❏ all data is private to the threads
❏ data is shared by exchanging buffers
❏ explicit synchronization
Parallel Programming Models
MPI:
❏ An MPI application is a set of independent processes (threads)
❏ on different machines
❏ on the same machine
❏ communication over the interconnect
❏ network (network of workstations, cluster, grid)
❏ memory (SMP)
❏ communication is under control of the programmer
Parallel Programming Models
Shared memory model, e.g. OpenMP:
❏ all threads have access to the same global memory
❏ data transfer is transparent to the programmer
❏ synchronization is (mostly) implicit
❏ there is private data as well
Parallel Programming Models
OpenMP:
❏ needs an SMP
❏ but ... with newer CPU designs, there is an SMP in (almost) every computer
❏ multi-core CPUs (CMP)
❏ chip multi-threading (CMT)
❏ or a combination of both, e.g. the Sun UltraSPARC-T series
❏ or ... (whatever we'll see in the future)
MPI vs OpenMP
OpenMP version of “Hello world”:
#include <stdio.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        printf("Hello world!\n");
    } /* end parallel */
    return(0);
}
MPI vs OpenMP
no. of threads: OMP_NUM_THREADS
% cc -o hello -xopenmp hello.c
% ./hello
Hello world!
% OMP_NUM_THREADS=2 ./hello
Hello world!
Hello world!
% OMP_NUM_THREADS=8 ./hello
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
MPI vs OpenMP
MPI version of “Hello world”:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int myrank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    printf("Hello world from %d!\n", myrank);
    MPI_Finalize();
    return 0;
}
MPI vs OpenMP
MPI version: compile and run
$ cc -I/.../MPI/include -R/.../MPI/lib \
     -L/.../MPI/lib -o hello_mpi hello_mpi.c -lmpi
$ ./hello_mpi
ERROR in MPI_Init: unclassified error:
RTE_Init_lib: Cannot initialize: the program must be started by mprun.: Invalid request
Fatal error, aborting.
$ mprun -np 4 ./hello_mpi
Hello world from 1!
Hello world from 3!
Hello world from 0!
Hello world from 2!
Multi-core – everywhere!
Welcome to a “threaded” world
Today’s Multicores
99% of Top500 Systems Are Based on Multicore
Sun Niagara2 (8 cores)
Intel Polaris [experimental] (80 cores)
AMD Istanbul (6 cores)
IBM Cell (9 cores)
Intel Nehalem (4 cores)
282 use Quad-Core
204 use Dual-Core
3 use Nona-core
Fujitsu Venus (8 cores)
IBM Power 7 (8 cores)
What is a multi-core chip?
❏ A “core” is not well-defined – let us assume it covers the processing units and the L1 caches (a very simplified CPU).
❏ Different implementations are possible – and available (examples follow), e.g. multi-threaded cores
❏ Cache hierarchy of private and shared caches
❏ For software developers it matters that there is parallelism in the hardware that they can take advantage of
A generic multi-core design
AMD Opteron – quad-core
❏ dedicated L2 caches
❏ shared L3 cache
UltraSPARC-T2
System on a chip:
❏ 8 cores with 8 threads = 64 threads
❏ integrated multi-threaded 10 Gb/s Ethernet
❏ integrated crypto-unit per core
❏ low power (< 95W)
❏ < 1.5W/thread
Why add threads to a core?
(Diagrams: execution of two threads with idle slots, vs. interleaving the work for better utilization)
Keyword: “Throughput Computing”
The future
“Prediction is very difficult, especially if it is about the future.”
-- Niels Bohr (1885-1962)
The future ... of multi-core
Possible chip organizations:
❏ All Large Core
❏ Mixed Large and Small Core
❏ All Small Core / Many Small Cores
❏ Many Floating-Point Cores
Different Classes of Chips: Home, Games / Graphics, Business, Scientific
Many-core challenges I
Many-core challenges II
Many-core challenges III
A 2014 multi-core chip: Intel Haswell
Summary
❏ You have heard about:
❏ Parallel programming models and basic concepts
❏ Parallel architectures (shared / distributed memory)
❏ Multi-core architectures
❏ GPUs/accelerators (more next week)
The future ... of software
❏ Challenges:
❏ how to handle millions of cores/threads ... reliably
❏ re-design of algorithms: from coarse-grained parallelism to fine-grained parallelism
❏ more development tools needed to achieve this
❏ we need standards, to assure portability!
❏ long-term perspectives