An introduction to parallel programming Paolo Burgio [email protected]


Page 1: An introduction to parallel programming

An introduction to parallel programming

Paolo Burgio
[email protected]

Page 2: Definitions

Definitions

› Parallel computing

– Partition computation across different compute engines

› Distributed computing

– Partition computation across different machines

Same principle, more general

2

Page 3: Outline

Outline

› Introduction to "traditional" programming– Writing code

– Operating systems

– …

› Why do we need parallel programming?

– Focus on shared-memory programming

› Different ways of parallel programming

– PThreads

– OpenMP

– MPI?

– GPU/accelerator programming

3

Page 4: As an aside…

As an aside…

› A bit of computer architecture

– We will understand why…

– Focus on shared memory systems

› A bit of algorithms

– We will understand why…

› A bit of performance analysis

– Which is our ultimate goal!

– Being able to identify bottlenecks

4

Page 5: Programming basics

Programming basics

Page 6: Take-aways

Take-aways

› Programming basics

– Variables

– Functions

– Loops

› Programming stacks

– BSP

– Operating systems

– Runtimes

› Computer architectures

– Computing domains

– Single processor/multiple processors

– From single- to multi- to many- core

6

Page 7: Why do we need parallel computing?

Why do we need parallel computing?

Increase performance of our machines

› Scale-up

– Solve a "bigger" problem in the same time

› Scale-out

– Solve the same problem in less time

7

Page 8: Yes but..

Yes but..

› Why (highly) parallel machines…

› …and not faster single-core machines?

8

Page 9: The answer #1 - Money

The answer #1 - Money

9

Page 10: The answer #2 – the "hot" one

The answer #2 – the "hot" one

Moore's law

› "The number of transistors that we can pack in a given die area doubles every 18 months"

Dennard scaling

› "performance per watt of computing is growing exponentially at roughly the same rate"

10

Page 11: The answer #2 – the "hot" one

The answer #2 – the "hot" one

› SoC design paradigm

› Gordon Moore

– His law is still valid, but…

Performance comes from frequency

[Chart: decades of microprocessor trend data: transistors (thousands), clock (MHz), power (W), and performance per clock (ILP) plotted over time]

11

Page 12: The answer #2 – the "hot" one

The answer #2 – the "hot" one

› SoC design paradigm

› Gordon Moore

– His law is still valid, but…

› "The free lunch is over"

– Herb Sutter, 2005

Performance comes from parallelism

[Chart: same microprocessor trend data: transistors (thousands), clock (MHz), power (W), and performance per clock (ILP) plotted over time]

11

Page 13: In other words…

In other words…

[Chart: processor heat/power density from 1970 to 2020, compared against a hot plate, summer temperature, a nuclear reactor, a rocket nozzle, and the surface of the sun; milestones: first PCs, the explosion of the web, modern computers]

12

Page 14: Instead of going faster..

Instead of going faster..

› ..(go faster, but through) parallelism!

Problem #1

› New computer architectures

› At least, three architectural templates

Problem #2

› Need to efficiently program them

› HPC already has this problem!

The problem

› Programmers must know a bit of the architecture!

› To make parallelization effective

› "Let's run this on a GPU. It certainly goes faster" (cit.)

13

Page 15: The Big problem

The Big problem

› Effectively programming in parallel is difficult

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?”

14

Page 16: I am *really* sorry guys..

I am *really* sorry guys..

› I will give you code…

› ..but first I need to give you some maths…

› …and then, some architectural principles

15

Page 17: Amdahl's Law

Amdahl's Law

Page 18: Amdahl's law

Amdahl's law

› A sequential program that takes 100 sec to execute

› Only 95% can run in parallel (it's a lot)

› And.. you are an extremely good programmer, and you have a machine with 1 billion cores, so that part takes 0 sec

› So, T_par = 100 sec − 95 sec = 5 sec

Speedup = 100 sec / 5 sec = 20x

…20x, on one billion cores!!!

17
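More generally, Amdahl's law says that with a parallel fraction p and n cores the speedup is bounded by Speedup(n) = 1 / ((1 - p) + p/n). A minimal C sketch of the slide's numbers (the function name amdahl_speedup is purely illustrative):

#include <stdio.h>

/* Amdahl's law: speedup on n cores when a fraction p of the work is parallel */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("p = 0.95, n = 1e9 -> %.2fx\n", amdahl_speedup(0.95, 1e9)); /* ~20x   */
    printf("p = 0.95, n = 16  -> %.2fx\n", amdahl_speedup(0.95, 16));  /* ~9.14x */
    return 0;
}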

Page 19: Computer architecture

Computer architecture

Page 20: Step-by-step

Step-by-step

1. "Traditional" multi-cores– Typically, shared-memory

– Max 8-16 cores

– This laptop

2. Many-cores

– GPUs but not only

– Heterogeneous architectures

3. More advanced stuff

– Field-Programmable Gate Arrays (FPGAs)

– Neural Networks

19

Page 21: Symmetric multi-processing

Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O

› Typically, multi-core (sub)systems

– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Diagram: CPU 0, CPU 1, CPU 2, CPU 3, each with one or more cache levels, connected to main memory and the I/O system through an interconnect that can be 1 bus, N busses, or any network]

20
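As a quick check of how many cores an SMP like this exposes to the operating system, a small POSIX sketch (sysconf with _SC_NPROCESSORS_ONLN is a widespread extension on Linux/macOS, not guaranteed on every platform):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* number of processors currently online, as seen by the OS */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", n);
    return 0;
}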

Page 22: Asymmetric multi-processing

Asymmetric multi-processing

› Memory: centralized with uniform access time (UMA) and bus interconnect, I/O

› Typically, multi-core (sub)systems

– Examples: ARM Big.LITTLE, NVIDIA Tegra X2 (Drive PX)

[Diagram: CPU 0, CPU 1, CPU A, CPU B, each with one or more cache levels, connected to main memory and the I/O system through an interconnect that can be 1 bus, N busses, or any network]

21

Page 23: SMP – distributed shared memory

SMP – distributed shared memory

› Non-Uniform Access Time - NUMA

› Scalable interconnect

– Typically, many cores

– Examples: embedded accelerators, GPUs

[Diagram: CPU 0..3, each with a cache ($) and a ScratchPad Memory (SPM), connected through a scalable interconnection to main memory and the I/O system]

22

Page 24: Go complex: NVIDIA's Tegra

Go complex: NVIDIA's Tegra

› Complex heterogeneous system

– 3 ISAs

– 2 subdomains

– Shared memory between the Big.SUPER host and the GP-GPU

23

Page 25: UMA vs. NUMA


UMA vs. NUMA

› Shared mem: every thread can access every memory item

– (Not considering security issues…)

› Uniform Memory Access (UMA) vs Non-Uniform Memory Access (NUMA)

– Different access time for accessing different memory spaces

[Diagrams: a UMA system with CPU 0..3 sharing a single MAIN MEM, next to a NUMA system with CPU 0..15 arranged in four clusters, each cluster with its own scratchpad memory (SPM0..SPM3)]

24

Page 26: UMA vs. NUMA


UMA vs. NUMA

› Shared mem: every thread can access every memory item

– (Not considering security issues…)

› Uniform Memory Access (UMA) vs Non-Uniform Memory Access (NUMA)

– Different access time for accessing different memory spaces

[Diagrams: the same UMA and NUMA systems as on the previous slide]

Access cost (in clock cycles) from each CPU cluster to each memory:

             MEM0   MEM1   MEM2   MEM3
CPU0…3          0     10     20     10
CPU4…7         10      0     10     20
CPU8…11        20     10      0     10
CPU12…15       10     20     10      0

24

Page 27: Some definitions

Some definitions

Page 28: What is…

What is…

› ..a core?

– An electronic circuit that executes instructions (=> programs)

› …a program?

– The implementation of an algorithm

› …a process?

– A program that is executing

› …a thread?

– A unit of execution (of a process)

› ..a task?

– A unit of work (of a program)

27

Page 29: What is…

What is…

› ..a core?

– An electronic circuit that executes instructions (=> programs)

› …a program?

– The implementation of an algorithm

› …a process?

– A program that is executing

› …a thread?

– A unit of execution (of a process)

› ..a task?

– A unit of work (of a program)

28

[Diagram: a program (code.c plus its DATA) runs as a process P in SHARED MEM; the process contains threads T (and tasks t) that execute on CORE 0..3, which access MEM]

Page 30: What is a task?

What is a task?

› Operating System task

› Real-time task

› OpenMP task

29
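To make the last one concrete, a minimal OpenMP tasking sketch in C; it assumes an OpenMP-capable compiler (e.g. gcc -fopenmp) and is only meant to show what an "OpenMP task" is:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel   /* create a team of threads */
    #pragma omp single     /* one thread creates the tasks... */
    {
        for (int i = 0; i < 4; i++) {
            #pragma omp task firstprivate(i)   /* ...any thread in the team may run them */
            printf("task %d executed by thread %d\n", i, omp_get_thread_num());
        }
        #pragma omp taskwait   /* wait for all the tasks created above */
    }
    return 0;
}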

Page 31: Symmetric multi-processing


Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O

› Typically, multi-core (sub)systems

– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Diagram: the same SMP system (CPU 0..3 with cache levels, main memory, I/O, and an interconnect that can be 1 bus, N busses, or any network), now with threads T of processes P0 and P1 mapped onto the CPUs]

30

Page 32: ..start simple…

..start simple…

Page 33: Something you're used to..

Something you're used to..

› Multiple processes

› That communicate via shared data

[Diagram: Process P0 (thread T) and Process P1 (thread T) both read and write a shared DATUM through some mechanism marked "????"]

32

Page 34: Howto #1 - MPI

Howto #1 - MPI

› Multiple processes

› That communicate via shared data

[Diagram: Process P0 (thread T) transfers the DATUM to Process P1 (thread T) via MPI_Send / MPI_Recv]

33
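A minimal sketch of what the picture above looks like in code, assuming an MPI implementation (e.g. Open MPI or MPICH) is available; compile with mpicc and launch two processes with mpirun -np 2:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, datum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */

    if (rank == 0) {                         /* process P0: send the datum */
        datum = 42;
        MPI_Send(&datum, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                  /* process P1: receive it */
        MPI_Recv(&datum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P1 received %d\n", datum);
    }

    MPI_Finalize();
    return 0;
}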

Page 35: Howto #2 – UNIX pipes

Howto #2 – UNIX pipes

› Multiple processes

› That communicate via shared data

[Diagram: Process P0 (thread T) sends the DATUM to Process P1 (thread T) through a pipe]

/* Process P0: writer */
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    char string[] = "Hello, world!\n";

    pipe(fd);

    /* Send "string" through the output side of the pipe */
    write(fd[1], string, strlen(string) + 1);

    return 0;
}

/* Process P1: reader */
#include <unistd.h>

int main(void)
{
    int fd[2], nbytes;
    char readbuffer[80];

    pipe(fd);

    /* Receive "string" from the input side of the pipe */
    nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
    (void)nbytes;   /* silence the unused-variable warning */

    return 0;
}

34
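As shown on the slide, each program creates its own pipe, so nothing actually travels between them; in practice the pipe is created once and its two ends are shared, typically across a fork(). A self-contained POSIX sketch of that pattern (error handling omitted):

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    pipe(fd);                           /* fd[0] = read end, fd[1] = write end */

    if (fork() == 0) {                  /* child: plays the role of P1 (reader) */
        char readbuffer[80];
        close(fd[1]);
        int nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
        printf("child read %d bytes: %s", nbytes, readbuffer);
        return 0;
    }

    /* parent: plays the role of P0 (writer) */
    char string[] = "Hello, world!\n";
    close(fd[0]);
    write(fd[1], string, strlen(string) + 1);
    close(fd[1]);
    wait(NULL);                         /* wait for the child to finish */
    return 0;
}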

Page 36: Howto #3 – Files


Howto #3 – Files

› Multiple processes

› That communicate via shared data

[Diagram: Process P0 (thread T) writes the DATUM to File.txt, which Process P1 (thread T) then reads]

35
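A minimal sketch of the same exchange through a file, written as two separate programs like the pipe example; File.txt is just the name the two sides agree on, and synchronization (knowing when the writer has finished) is deliberately ignored here:

#include <stdio.h>

/* Process P0: write the datum into File.txt */
int main(void)
{
    FILE *f = fopen("File.txt", "w");
    fprintf(f, "%d\n", 42);
    fclose(f);
    return 0;
}

#include <stdio.h>

/* Process P1: read the datum back from File.txt */
int main(void)
{
    int datum;
    FILE *f = fopen("File.txt", "r");
    fscanf(f, "%d", &datum);
    fclose(f);
    printf("P1 read %d\n", datum);
    return 0;
}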

Page 37: Shared memory

Shared memory

› Coherence problem

– Memory consistency issue

– Data races

› Can share data pointers

– Ease of use

[Diagram: a Process P0 with several threads T that all read and write the DATUM directly in shared memory]

36
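A minimal POSIX threads sketch of the same idea: the threads of one process read and write a shared datum directly, and a mutex is what keeps that from becoming the data race mentioned above (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static int datum = 0;                               /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);                      /* protect the shared datum */
    datum++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("datum = %d\n", datum);                  /* prints 4 */
    return 0;
}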

Page 38: References

References

› "Calcolo parallelo" website– http://hipert.unimore.it/people/paolob/pub/PhD/index.html

› My contacts

– [email protected]

– http://hipert.mat.unimore.it/people/paolob/

› Useful links

› A "small blog"– http://www.google.com

37