High Performance Computing
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI)
Università di Bologna
http://www.moreno.marzolla.name/
Pacheco, Chapter 1
High Performance Computing 2
High Performance Computing 3
Credits
● prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia
http://www.dsi.unive.it/~orlando/
● prof. Mary Hall, University of Utah
https://www.cs.utah.edu/~mhall/
● Tim Mattson, Intel
High Performance Computing 4
Who I am
● Moreno Marzolla
– Associate professor @ DISI
– http://www.moreno.marzolla.name/
● Current and past teaching activity
– High Performance Computing @ ISI
– Fondamenti di Informatica A @ Ing. Biomedica/Elettronica
– Past: Algoritmi e Strutture Dati; Sistemi Complessi; Ingegneria del Software
● Research activity
– Parallel programming
– Modeling and simulation
High Performance Computing 5
High Performance Computing
● Web page
– http://www.moreno.marzolla.name/teaching/HPC/
● Schedule
– Monday 9:00—12:00 Room 3.7
– Wednesday 14:00—17:00 Lab 2.2
– Please check the course Web page and the official course timetable for variations
● Office hours
– At any time; please send e-mail
High Performance Computing 6
References
● Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann 2011, ISBN 9780123742605
– Theory + OpenMP/MPI programming
● CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/
– CUDA/C programming
● See the Web page for slides and links to online material
https://www.moreno.marzolla.name/
On lecture slides
(Figure: “How you see lecture slides” vs. “How I see lecture slides”)
Image source: https://biblioklept.org/2010/06/11/the-british-library-acquires-j-g-ballard-archive/
High Performance Computing 8
Prerequisites
High Performance Computing builds on:
– Programming
– Algorithms and Data Structures
– Computer Architectures
– Operating Systems
– ...
High Performance Computing 9
Syllabus
● 6 CFU (~60 hours of lectures/lab)
– Lectures: ~40 hours
– Lab sessions: ~20 hours
● Theory (first ~3 weeks)
– Parallel architectures
– Parallel programming patterns
– Performance evaluation of parallel programs
● Parallel programming (rest of the course)
– Shared-memory programming with C/OpenMP
– Distributed-memory programming with C/MPI
– GPU programming with CUDA/C
– SIMD programming (if there is enough time)
High Performance Computing 10
Lab sessions
● Hands-on programming exercises
● We will work under Linux only
● Why? See the operating system share of the TOP500 list: essentially all top supercomputers run Linux
Updated on Sep 2019; Source: http://www.top500.org
High Performance Computing 11
Hardware resources
● isi-raptor03.csr.unibo.it
– Dual-socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04
– 3x NVidia GeForce GTX 1070
● A 16-core VM with 16 GB RAM running Debian/Jessie is available as a backup solution for OpenMP and MPI programming
High Performance Computing 12
Exam
● Written exam (weight: 40%)
– Questions/simple exercises on all topics addressed during the course (sample exams are available on the Web page)
– 6 dates: 2 in the winter term (Jan/Feb 2020); 3 in the summer term (Jun/Jul 2020); 1 in the fall term (Sep 2020)
– You can refuse the grade and retake the written exam
● Individual programming project + written report (weight: 60%)
– Project specification defined by the instructor
– There is no discussion of the project, unless I need explanations
– If you refuse the grade, you must hand in a NEW project based on NEW specifications
● Final grade rounded to the nearest integer
● The written exam and the programming project are independent, and can be completed in any order
● Grades remain valid until Sep 30, 2020
– After that, a new academic year starts
– There is no guarantee that the instructor and/or the type of exam will remain the same...
High Performance Computing 13
Grading the programming project
● Correctness
● Clarity
● Efficiency
● Quality of the written report
– Proper grammar, syntax, ...
– Technical correctness
– Performance evaluation
High Performance Computing 15
Questions?
High Performance Computing 16
Intro to parallel programming
High Performance Computing 17
High Performance Computing
● Many applications need considerable computing power
– Weather forecast, climate modeling, physics simulation, product engineering, 3D animation, finance, ...
● Why?
– To solve more complex problems
– To solve the same problem in less time
– To solve the same problem more accurately
– To make better use of available computing resources
High Performance Computing 18
Parallel programming
● “Traditional” scientific paradigm
– Make a theory, then experiment
● “Traditional” engineering paradigm
– Design, then build
● Enter numeric experimentation and prototyping
– Some phenomena are too complex to be modeled accurately (e.g., weather forecast)
– Some experiments are too complex, or too costly, or dangerous, or impossible to do in the lab (e.g., wind tunnels, seismic simulations, stellar dynamics, ...)
● Computational science
– Numerical simulations are becoming a new way to “do science”
Slide credits: S. Orlando
Intro to Parallel Programming 19
Applications: Numerical Wind Tunnel
Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html
Applications:Molecular dynamics
Intro to Parallel Programming 22
Applications:Cosmological Simulation
Bolshoi simulation https://vimeo.com/29769051
The Bolshoi Simulation recreates the large-scale structure of the universe; it required 6 million CPU hours on NASA's Pleiades Supercomputer
Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Intro to Parallel Programming 23
Moore's Law
"The number of transistors on an IC doubles every 24 months"
● That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., Cramming more components onto integrated circuits. Electronics, 38(8), April 1965
Gordon E. Moore (1929– )
Intro to Parallel Programming 24
(Figure: transistor count per microprocessor over time, log scale. By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542)
Intro to Parallel Programming 25
Physics lesson
● Smaller transistors → Faster processor
● Faster processor → Higher power consumption
● Higher power consumption → More heat produced
● More heat produced → Unreliable processor
Intro to Parallel Programming 26
Power
● The power required by an IC (e.g., a processor) can be expressed as
Power = C × V² × f
where:
– C is the capacitance (ability of a circuit to store energy)
– V is the voltage
– f is the frequency at which the processor operates
Intro to Parallel Programming 27
Power
● A single processor working at frequency f:
Capacitance C, Voltage V, Frequency f → Power = C V² f
● Two processors working in parallel, each at frequency f / 2, on the same input:
Capacitance 2.2 C, Voltage 0.6 V, Frequency 0.5 f → Power = 0.396 C V² f
Credits: Tim Mattson
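As a quick check, the 0.396 factor follows directly from the power formula with the values shown above:
Power_new = 2.2 C × (0.6 V)² × (0.5 f) = (2.2 × 0.36 × 0.5) × C V² f = 0.396 C V² f
i.e., roughly 40% of the original power while (ideally) keeping the overall throughput f.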
Intro to Parallel Programming 28
Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
High Performance Computing 29
Processor/Memory wall
Source: John L. Hennessy, David A. Patterson, Computer Architecture: a Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN: 978-0-12-383872-8
High Performance Computing 30
Limits
● There are limits to “automatic” improvement of scalar performance:
– The Power Wall: clock frequency cannot be increased without exceeding the limits of air cooling
– The Memory Wall: access to data is a limiting factor
– The ILP Wall: all the existing instruction-level parallelism (ILP) is already being used
● Conclusion:
– Explicit parallel mechanisms and explicit parallel programming are required for performance scaling
Slide credits: Hebenstreit, Reinders, Robison, McCool, SC13 tutorial
Intro to Parallel Programming 31
What happens today?
● HW designers create processors with more cores
● Result:
– parallel hardware is ubiquitous
– parallel software is rare
● The challenge:
– Make parallel software as common as parallel hardware
(Pictured: NVidia Tegra 4 SoC)
High Performance Computing 32
Parallel programming in brief
● Decompose the problem into sub-problems
● Distribute sub-problems to the available execution units
● Solve sub-problems independently
– Cooperate to solve sub-problems
● Goals
– Reduce the wall-clock time
– Balance the workload across execution units
– Reduce communication and synchronization overhead
Slide credits: S. Orlando
High Performance Computing 33
Concurrency vs Parallelism
Slide credits: Tim Mattson, Intel
(Figure: Tasks 1, 2 and 3. Concurrency without parallelism: the tasks are interleaved over time on a single execution unit. Parallelism: the tasks execute simultaneously.)
High Performance Computing 34
The “Holy Grail”
● Write serial code and have a “smart” compiler capable of parallelizing programs automatically
● It has been done, but only in some very specific cases
– In practice, no compiler has proved to be “smart” enough
● Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware
(“Here we come”)
High Performance Computing 35
Parallel programming is difficult
Serial version (~49 lines of C++ code) vs. parallel version (~1000 lines of C/C++ code)
http://www.moreno.marzolla.name/software/svmcell/
High Performance Computing 36
Issues of parallel programming
● Writing parallel programs is in general much harder than writing sequential code
● There is limited portability across different types of architectures
– E.g., a distributed-memory parallel program must be rewritten from scratch to run on a GPU
– However, there are standards (OpenMP, MPI, OpenCL) that allow portability across the same type of parallel architecture
● Tuning for best performance is time-consuming
Intro to Parallel Programming 37
Example: Sum-reduction of an array
("Hello, world!" of parallel programming)
Intro to Parallel Programming 38
Sum-Reduction
● We start assuming a shared-memory architecture
– All execution units share a common memory space
● We begin with a sequential solution and parallelize it
– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
– However, it is sometimes a reasonable starting point
Credits: Mary Hall
Intro to Parallel Programming 39
Sequential algorithm
● Compute the sum of the content of an array A of length n
float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Intro to Parallel Programming 40
Version 1 (Wrong!)
● Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}

(Figure: the array is partitioned into three contiguous blocks, one for each of Proc 0, Proc 1, Proc 2)

WRONG

Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared)
Race condition!
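The race arises because sum += my_x is a read-modify-write operation on the shared variable sum. A possible interleaving (values purely illustrative):
Proc 0 reads sum = 10
Proc 1 reads sum = 10
Proc 0 writes sum = 10 + 2 = 12
Proc 1 writes sum = 10 + 5 = 15
Proc 0's update is lost: the final value should have been 17.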
Intro to Parallel Programming 41
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
WRONG
Proc 0 Proc 1 Proc 2
Intro to Parallel Programming 42
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

(Figure: with n = 17 and P = 3, my_block_len = 17/3 = 5, so the last two elements of the array are not assigned to any processor)
WRONG
Intro to Parallel Programming 43
Version 1(correct, but not efficient)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
Proc 0 Proc 1 Proc 2
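With this index computation every element is covered even when P does not divide n; for example, with n = 17 and P = 3 (integer division):
Proc 0: my_start = 0,  my_end = 17*1/3 = 5   (5 elements)
Proc 1: my_start = 5,  my_end = 17*2/3 = 11  (6 elements)
Proc 2: my_start = 11, my_end = 17*3/3 = 17  (6 elements)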
Intro to Parallel Programming 44
Version 2
● Too much contention on mutex m
– Each processor acquires and releases the mutex for each element of the array!
● Solution: increase the mutex granularity
– Each processor accumulates the partial sum in a local (private) variable
– The mutex is used only at the end, to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
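Note that with this change the mutex is acquired only P times overall (once per processor, at the end) instead of n times (once per array element).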
Intro to Parallel Programming 45
Version 3: Remove the mutex(wrong, in a subtle way)
● We use a shared array psum[] where each processor can store its local sum
● At the end, one processor computes the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
WRONG
Intro to Parallel Programming 46
The problem with version 3
● Processor 0 could start the computation of the global sum before all other processors have computed the local sums!
(Figure: timeline of processors P0, P1, P2, P3: P0 may start adding up the partial sums while P1, P2 and P3 are still computing their local sums)
Intro to Parallel Programming 47
Version 4(correct)
● Use a barrier synchronization
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

(Figure: all processors compute their local sums, then wait at the barrier(); only after every processor has reached the barrier does P0 compute the global sum)
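For comparison, the same shared-memory sum can be written very compactly with OpenMP, which is covered later in the course. The sketch below is my own illustration (not taken from the slides); the reduction(+:sum) clause gives each thread a private partial sum and combines the partial sums at the end, playing the role of psum[] and the barrier:

#include <stdio.h>

/* Minimal OpenMP sketch: the reduction clause creates a private
   copy of sum for each thread and sums the copies at the end */
float omp_sum(const float *A, int n)
{
    float sum = 0.0f;
    int i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}

int main( void )
{
    float A[17];
    int i;
    for (i = 0; i < 17; i++) A[i] = 1.0f;  /* expected sum: 17 */
    printf("sum = %f\n", omp_sum(A, 17));
    return 0;
}

Compile with: gcc -fopenmp sum.c -o sum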
Intro to Parallel Programming 48
Version 5Distributed-memory version
● P << n processors
● Each processor computes a local sum
● Each processor sends its local sum to processor 0 (the master)

...
my_sum = 0.0;
my_start = ...; my_end = ...;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
Intro to Parallel Programming 49
Version 5
(Figure: P = 8 processors with local sums A[ ] = 1, 3, -2, 7, -6, 5, 3, 4. Proc 0 receives the other seven values one at a time and accumulates my_sum = 4, 2, 9, 3, 8, 11 and finally 15. Proc 0 is a bottleneck: it receives P - 1 messages and performs P - 1 additions sequentially.)
Intro to Parallel Programming 50
Parallel reduction
(Figure: tree-based reduction of the same eight local sums 1, 3, -2, 7, -6, 5, 3, 4: the first level produces the partial sums 4, 5, -1, 7; the second level produces 9, 6; the final result 15 ends up at Proc 0)
● (P – 1) sums are still performed; however, processor 0 now receives only ~ log2(P) messages and performs ~ log2(P) sums
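MPI, which is covered later in the course, offers this kind of reduction as a single library call, and implementations are free to use a logarithmic tree internally. The sketch below is my own illustration (not taken from the slides); each process contributes a placeholder local sum:

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    int my_id, P;
    float my_sum, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Each process computes its local sum; here a placeholder value */
    my_sum = (float)my_id;

    /* Combine all local sums into sum on process 0 */
    MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if ( 0 == my_id ) {
        printf("The sum is %f\n", sum);
    }
    MPI_Finalize();
    return 0;
}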
High Performance Computing 51
Task parallelism vs Data parallelism
● Task Parallelism
– Distribute (possibly different) tasks to processors
● Data Parallelism
– Distribute data to processors
– Each processor executes the same task on different data
High Performance Computing 52
Example
● We have a table containing hourly temperatures at some location
– 24 columns, 365 rows
● Compute the minimum, maximum and average temperature for each day
● Assume we have 3 independent processors
High Performance Computing 53
Example
(Figure: a table with one row per day (0—364) and one column per hour (0—23); three extra columns hold the per-day max, min and ave)
High Performance Computing 54
Data parallel approach
(Figure: the same table; the rows (days) are split into three contiguous blocks, and Proc 0, Proc 1 and Proc 2 each compute max, min and ave for the days in their own block)
High Performance Computing 55
Task parallel approach
(Figure: the same table; each processor scans all the days but computes a different statistic: Proc 0 fills the max column, Proc 1 the min column, Proc 2 the ave column)
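A minimal C sketch of the data-parallel approach (my own illustration; the names temp, day_min, day_max, day_ave and the block partitioning are assumptions, not from the slides): each processor is given a contiguous block of days and computes all three statistics for the days in its block.

#include <float.h>

#define DAYS  365
#define HOURS 24

/* Data-parallel decomposition: processor my_id (0 <= my_id < P)
   computes min, max and average for its own block of days */
void day_stats(const float temp[DAYS][HOURS],
               float day_min[DAYS], float day_max[DAYS], float day_ave[DAYS],
               int my_id, int P)
{
    int my_start = DAYS * my_id / P;
    int my_end   = DAYS * (my_id + 1) / P;
    int d, h;
    for (d = my_start; d < my_end; d++) {
        float mn = FLT_MAX, mx = -FLT_MAX, sum = 0.0f;
        for (h = 0; h < HOURS; h++) {
            float t = temp[d][h];
            if (t < mn) mn = t;
            if (t > mx) mx = t;
            sum += t;
        }
        day_min[d] = mn;
        day_max[d] = mx;
        day_ave[d] = sum / HOURS;
    }
}

In the task-parallel variant each processor would instead scan all the days but fill in only one of the three output arrays.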
High Performance Computing 56
Key concepts
● Parallel architectures “naturally” derive from the laws of physics
● Parallel architectures require parallel programming paradigms
● Writing parallel programs is much harder than writing sequential programs