
An introduction to parallel programming

Paolo Burgio – paolo.burgio@unimore.it

Definitions

› Parallel computing– Partition computation across different compute engines

› Distributed computing– Partition computation across different machines

Same principle, more general


Outline

› Introduction to "traditional" programming– Writing code

– Operating systems

– …

› Why do we need parallel programming?– Focus on programming shared memory

› Different ways of parallel programming– PThreads

– OpenMP

– MPI?

– GPU/accelerators programming


As a side…

› A bit of computer architecture– We will understand why…

– Focus on shared memory systems

› A bit of algorithms– We will understand why…

› A bit of performance analysis– Which is our ultimate goal!

– Being able to identify bottlenecks


Programming basics

Take-aways

› Programming basics– Variables

– Functions

– Loops

› Programming stacks– BSP

– Operating systems

– Runtimes

› Computer architectures– Computing domains

– Single processor/multiple processors

– From single- to multi- to many- core


Why do we need parallel computing?

Increase performance of our machines

› Scale-up– Solve a "bigger" problem in the same time

› Scale-out– Solve the same problem in less time


Yes but..

› Why (highly) parallel machines…

› …and not faster single-core machines?


The answer #1 - Money


The answer #2 – the "hot" one

Moore's law

› "The number of transistors that we can pack in a given die area doubles every 18 months"

Dennard's scaling

› "performance per watt of computing is growing exponentially at roughly the same rate"

[Figure: microprocessor trend data – Transistors (thousands), Clock (MHz), Power (W), Perf/Clock (ILP); performance used to track clock frequency, and now comes from parallelism]

› SoC design paradigm

› Gordon Moore

– His law is still valid, but…

› "The free lunch is over" – Herb Sutter, 2005

In other words…

[Figure: power density over the years 1970–2020 – from hot plate to nuclear reactor, rocket nozzle and the surface of the sun; spanning the first PCs, the explosion of the web, and modern computers]

Instead of going faster..

› ..(go faster but through) parallelism!

Problem #1

› New computer architectures

› At least, three architectural templates

Problem #2

› Need to efficiently program them

› HPC already has this problem!

The problem

› Programmers must know a bit of the architecture!

› To make parallelization effective

› "Let's run this on a GPU. It certainly goes faster" (cit.)


The Big problem

› Effectively programming in parallel is difficult

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"

– Brian Kernighan

I am *really* sorry guys..

› I will give you code…

› ..but first I need to give you some maths…

› …and then, some architectural principles

Amdahl's Law

Amdahl's law

› A sequential program that takes 100 sec to execute

› Only 95% can run in parallel (it's a lot)

› And.. you are an extremely good programmer, and you have a machine with 1 billion cores, so that part takes 0 sec

› So, T_par = 100 sec − 95 sec = 5 sec

Speedup = 100 sec / 5 sec = 20×

…20x, on one billion cores!!!

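To play with these numbers, here is a minimal C sketch of Amdahl's law (not from the slides; the 95% parallel fraction matches the example above, the core counts are arbitrary):

#include <stdio.h>

/* Amdahl's law: speedup on n cores when a fraction p of the
   sequential execution time is perfectly parallelizable. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.95;                       /* 95% of the program is parallel */
    const double cores[] = { 4, 16, 1024, 1e9 };
    const int n = sizeof cores / sizeof cores[0];

    for (int i = 0; i < n; i++)
        printf("%12.0f cores -> speedup %.2fx\n", cores[i], amdahl_speedup(p, cores[i]));

    /* With one billion cores the speedup saturates near 1/(1-p) = 20x, as on the slide. */
    return 0;
}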

Computer architecture

Step-by-step

1. "Traditional" multi-cores– Typically, shared-memory

– Max 8-16 cores

– This laptop

2. Many-cores– GPUs but not only

– Heterogeneous architectures

3. More advanced stuff– Field-programmable Gate Arrays

– Neural Networks


Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O

› Typically, multi-core (sub)systems– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Figure: CPU 0…CPU 3, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network]
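As a practical aside (not on the slide), the number of CPUs such a shared-memory machine exposes can be queried at run time; a minimal sketch using sysconf, assuming a Linux/glibc system:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of processors currently online, as seen by the OS scheduler. */
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("This machine exposes %ld cores\n", ncores);
    return 0;
}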

Asymmetric multi-processing

› Memory: centralized with uniform access time (UMA) and bus interconnect, I/O

› Typically, multi-core (sub)systems– Examples: ARM Big.LITTLE, NVIDIA Tegra X2 (Drive PX)

[Figure: CPU 0, CPU 1, CPU A and CPU B, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network]

SMP – distributed shared memory

› Non-Uniform Memory Access – NUMA

› Scalable interconnect– Typically, many cores

– Examples: embedded accelerators, GPUs

[Figure: CPU 0…CPU 3, each with a cache ($) and a ScratchPad Memory (SPM), connected to main memory and the I/O system through a scalable interconnection]

Go complex: NVIDIA's Tegra

› Complex heterogeneous system– 3 ISAs

– 2 subdomains

– Shmem between Big.SUPER host and GP-GPU



UMA vs. NUMA

› Shared mem: every thread can access every memory item– (Not considering security issues…)

› Uniform Memory Access (UMA) vs Non-Uniform Memory Access (NUMA)– Different access time for accessing different memory spaces

[Figure: UMA – CPU 0…CPU 3 all accessing one MAIN MEM; NUMA – CPU 0…CPU 15 grouped in four clusters, each cluster with its own local memory SPM0…SPM3]


Access time (in clock cycles) from each CPU cluster to each memory:

          MEM0       MEM1       MEM2       MEM3
CPU0…3    0 clock    10 clock   20 clock   10 clock
CPU4…7    10 clock   0 clock    10 clock   20 clock
CPU8…11   20 clock   10 clock   0 clock    10 clock
CPU12…15  10 clock   20 clock   10 clock   0 clock
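To see why this matters, here is a back-of-the-envelope sketch (the access count is hypothetical; only the latencies come from the table above): the same number of loads issued by CPU0…3 costs very different amounts of time depending on which memory holds the data.

#include <stdio.h>

int main(void)
{
    /* Latency (clock cycles) seen by CPU0…3 toward MEM0..MEM3, from the table. */
    const long latency[4] = { 0, 10, 20, 10 };
    const long nloads = 1000000;               /* hypothetical number of loads */

    for (int mem = 0; mem < 4; mem++)
        printf("1M loads from MEM%d: %ld clock cycles\n", mem, nloads * latency[mem]);
    return 0;
}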

Some definitions

What is…

› ..a core?– An electronic circuit to execute instructions (=> programs)

› …a program?– The implementation of an algorithm

› …a process?– A program that is executing

› …a thread?– A unit of execution (of a process)

› ..a task?– A unit of work (of a program)



[Figure: a program (code.c + DATA) becomes a process P in shared memory; its threads T execute on CORE0…CORE3, which share MEM]
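To make the process/thread distinction concrete, here is a minimal POSIX sketch (an illustrative example, not from the slides; compile with -pthread): fork() creates a new process with its own address space, while pthread_create() starts a new thread of execution inside the current process.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/wait.h>

static void *thread_body(void *arg)
{
    /* A thread: a unit of execution inside the same process (same memory). */
    printf("thread running inside process %d\n", (int)getpid());
    return NULL;
}

int main(void)
{
    if (fork() == 0) {
        /* A process: a program in execution, with its own address space. */
        printf("child process %d\n", (int)getpid());
        return 0;
    }
    wait(NULL);

    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    return 0;
}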

What is a task?

[Figure: "task" means different things in different contexts – an Operating System task, a real-time task, an OpenMP task]

[Figure: the symmetric multi-processing machine again – CPU 0…CPU 3 with their caches, main memory and I/O – now with threads T of processes P0 and P1 running on the CPUs]

..start simple…

Something you're used to..

› Multiple processes

› That communicate via shared data

[Figure: processes P0 and P1, each with a thread T, read and write a shared DATUM – through which mechanism?]

Howto #1 - MPI

› Multiple processes

› That communicate via shared data

[Figure: process P0 sends the DATUM to process P1 via MPI_Send / MPI_Recv]
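A minimal sketch of this pattern (illustrative, not from the slides; it assumes an MPI installation, compiled with mpicc and launched with two ranks, e.g. mpirun -np 2 ./a.out):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, datum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        datum = 42;
        /* P0 sends the datum to P1... */
        MPI_Send(&datum, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...and P1 receives it. */
        MPI_Recv(&datum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P1 received %d\n", datum);
    }

    MPI_Finalize();
    return 0;
}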

Howto #2 – UNIX pipes

› Multiple processes

› That communicate via shared data

[Figure: process P0 writes the DATUM into a pipe; process P1 reads it from the other end]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    char string[] = "Hello, world!\n";
    char readbuffer[80];

    /* The pipe must be created before fork(), so that both processes
       share its two ends: fd[1] is the write side, fd[0] the read side. */
    pipe(fd);

    if (fork() == 0) {
        /* Child: receive "string" from the input side of the pipe. */
        close(fd[1]);
        ssize_t nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
        printf("Received %zd bytes: %s", nbytes, readbuffer);
        return 0;
    }

    /* Parent: send "string" through the output side of the pipe. */
    close(fd[0]);
    write(fd[1], string, strlen(string) + 1);
    wait(NULL);
    return 0;
}


Howto #3 – Files

› Multiple processes

› That communicate via shared data

[Figure: process P0 writes the DATUM to File.txt; process P1 reads it back]
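A minimal sketch of the idea (illustrative only; writer and reader are collapsed into one program for brevity, while in practice they would be the two processes P0 and P1):

#include <stdio.h>

int main(void)
{
    int datum = 42;

    /* "P0": write the datum into the shared file. */
    FILE *f = fopen("File.txt", "w");
    fprintf(f, "%d\n", datum);
    fclose(f);

    /* "P1": read the datum back from the same file. */
    int received = 0;
    f = fopen("File.txt", "r");
    fscanf(f, "%d", &received);
    fclose(f);

    printf("received %d\n", received);
    return 0;
}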

Shared memory

› Coherence problem– Memory consistency issue

– Data races

› Can share data ptrs– Easy to use

[Figure: a single process P0 whose threads T read and write the DATUM directly in shared memory]
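A minimal Pthreads sketch of this model (illustrative, not from the slides; compile with -pthread): two threads of the same process increment a shared datum, and the mutex is exactly what prevents the data race mentioned above.

#include <stdio.h>
#include <pthread.h>

static long datum = 0;                         /* shared by all threads of the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* without the lock: data race on datum */
        datum++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, NULL);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("datum = %ld (expected 200000)\n", datum);
    return 0;
}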

References

› "Calcolo parallelo" website– http://hipert.unimore.it/people/paolob/pub/PhD/index.html

› My contacts– paolo.burgio@unimore.it

– http://hipert.mat.unimore.it/people/paolob/

› Useful links

› A "small blog"– http://www.google.com

