High Performance Computing
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI)
Università di Bologna
http://www.moreno.marzolla.name/
Pacheco, Chapter 1
High Performance Computing 2
High Performance Computing 3
Credits
● prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia
http://www.dsi.unive.it/~orlando/
● prof. Mary Hall, University of Utah
https://www.cs.utah.edu/~mhall/
● Tim Mattson, Intel
High Performance Computing 4
Who I am
● Moreno Marzolla
– Associate professor @ DISI
– http://www.moreno.marzolla.name/
● Current and past teaching activity
– High Performance Computing @ ISI
– Fondamenti di Informatica A @ Ing. Biomedica/Elettronica
– Past: Algoritmi e Strutture Dati; Sistemi Complessi; Ingegneria del Software
● Research activity
– Parallel programming
– Modeling and simulation
High Performance Computing 5
High Performance Computing
● Web page
– http://www.moreno.marzolla.name/teaching/HPC/
● Schedule
– Monday 9:00—12:00 Room 3.7
– Wednesday 14:00—17:00 Lab 2.2
– Please check the course Web page and the official course timetable for variations
● Office hours
– At any time; please send e-mail
High Performance Computing 6
References
● Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann 2011, ISBN 9780123742605
– Theory + OpenMP/MPI programming
● CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/
– CUDA/C programming
● See the Web page for slides and links to online material
https://www.moreno.marzolla.name/
On lecture slides
(Figure: “How you see lecture slides” vs. “How I see lecture slides”)
Image source: https://biblioklept.org/2010/06/11/the-british-library-acquires-j-g-ballard-archive/
High Performance Computing 8
Prerequisites
High Performance Computing builds on:
– Programming
– Algorithms and Data Structures
– Computer Architectures
– Operating Systems
– ...
High Performance Computing 9
Syllabus
● 6 CFU (~60 hours of lectures/lab)
– Lectures: ~40 hours
– Lab sessions: ~20 hours
● Theory (first ~3 weeks)
– Parallel architectures
– Parallel programming patterns
– Performance evaluation of parallel programs
● Parallel programming (rest of the course)
– Shared-memory programming with C/OpenMP
– Distributed-memory programming with C/MPI
– GPU programming with CUDA/C
– SIMD programming (if there is enough time)
High Performance Computing 10
Lab sessions
● Hands-on programming exercises
● We will work under Linux only
● Why? See the operating system share of the TOP500 list: essentially all top supercomputers run Linux
Updated on Sep 2019; Source: http://www.top500.org
High Performance Computing 11
Hardware resources
● isi-raptor03.csr.unibo.it
– Dual-socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04
– 3x NVidia GeForce GTX 1070
● A 16-core VM with 16 GB RAM running Debian/Jessie is available as a backup solution for OpenMP and MPI programming
High Performance Computing 12
Exam
● Written exam (weight: 40%)
– Questions/simple exercises on all topics addressed during the course (sample exams are available on the Web page)
– 6 dates: 2 in the winter term (Jan/Feb 2020); 3 in the summer term (Jun/Jul 2020); 1 in the fall term (Sep 2020)
– You can refuse the grade and retake the written exam
● Individual programming project + written report (weight: 60%)
– Project specification defined by the instructor
– There is no discussion of the project, unless I need explanations
– If you refuse the grade, you must hand in a NEW project based on NEW specifications
● Final grade rounded to the nearest integer
● The written exam and the programming project are independent, and can be completed in any order
● Grades remain valid until Sep 30, 2020
– After that, a new academic year starts
– There is no guarantee that the instructor and/or the type of exam will remain the same...
High Performance Computing 13
Grading the programming project
● Correctness
● Clarity
● Efficiency
● Quality of the written report
– Proper grammar, syntax, ...
– Technical correctness
– Performance evaluation
High Performance Computing 15
Questions?
High Performance Computing 16
Intro to parallel programming
High Performance Computing 17
High Performance Computing
● Many applications need considerable computing power
– Weather forecast, climate modeling, physics simulation, product engineering, 3D animation, finance, ...
● Why?
– To solve more complex problems
– To solve the same problem in less time
– To solve the same problem more accurately
– To make better use of available computing resources
High Performance Computing 18
Parallel programming
● “Traditional” scientific paradigm
– Make a theory, then experiment
● “Traditional” engineering paradigm
– Design, then build
● Enter numeric experimentation and prototyping
– Some phenomena are too complex to be modeled accurately (e.g., weather forecast)
– Some experiments are too complex, or too costly, or dangerous, or impossible to do in the lab (e.g., wind tunnels, seismic simulations, stellar dynamics, ...)
● Computational science
– Numerical simulations are becoming a new way to “do science”
Slide credits: S. Orlando
Intro to Parallel Programming 19
Applications: Numerical Wind Tunnel
Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html
Applications:Molecular dynamics
Intro to Parallel Programming 22
Applications:Cosmological Simulation
Bolshoi simulation https://vimeo.com/29769051
The Bolshoi Simulation recreates the large-scale structure of the universe; it required 6 million CPU hours on NASA's Pleiades Supercomputer
Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Intro to Parallel Programming 23
Moore's Law
"The number of transistors on an IC doubles every 24 months"
● That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., Cramming more components onto integrated circuits. Electronics, 38(8), April 1965
Gordon E. Moore (1929– )
Intro to Parallel Programming 24
(Figure: transistor count per microprocessor over time, log scale. By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542)
Intro to Parallel Programming 25
Physics lesson
● Smaller transistors → Faster processor
● Faster processor → Higher power consumption
● Higher power consumption → More heat produced
● More heat produced → Unreliable processor
Intro to Parallel Programming 26
Power
● The power required by an IC (e.g., a processor) can be expressed as
Power = C × V² × f
where:
– C is the capacitance (ability of a circuit to store energy)
– V is the voltage
– f is the frequency at which the processor operates
Intro to Parallel Programming 27
Power
● A single processor working at frequency f:
Capacitance C, Voltage V, Frequency f → Power = C V² f
● Two processors working in parallel, each at frequency f / 2, on the same input:
Capacitance 2.2 C, Voltage 0.6 V, Frequency 0.5 f → Power = 0.396 C V² f
Credits: Tim Mattson
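As a quick check, the 0.396 factor follows directly from the power formula with the values shown above:
Power_new = 2.2 C × (0.6 V)² × (0.5 f) = (2.2 × 0.36 × 0.5) × C V² f = 0.396 C V² f
i.e., roughly 40% of the original power while (ideally) keeping the overall throughput f.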
Intro to Parallel Programming 28
Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
High Performance Computing 29
Processor/Memory wall
Source: John L. Hennessy, David A. Patterson, Computer Architecture: a Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN: 978-0-12-383872-8
High Performance Computing 30
Limits
● There are limits to “automatic” improvement of scalar performance:
– The Power Wall: clock frequency cannot be increased without exceeding the limits of air cooling
– The Memory Wall: access to data is a limiting factor
– The ILP Wall: all the existing instruction-level parallelism (ILP) is already being used
● Conclusion:
– Explicit parallel mechanisms and explicit parallel programming are required for performance scaling
Slide credits: Hebenstreit, Reinders, Robison, McCool, SC13 tutorial
Intro to Parallel Programming 31
What happens today?
● HW designers create processors with more cores
● Result:
– parallel hardware is ubiquitous
– parallel software is rare
● The challenge:
– Make parallel software as common as parallel hardware
(Pictured: NVidia Tegra 4 SoC)
High Performance Computing 32
Parallel programming in brief
● Decompose the problem into sub-problems
● Distribute sub-problems to the available execution units
● Solve sub-problems independently
– Cooperate to solve sub-problems
● Goals
– Reduce the wall-clock time
– Balance the workload across execution units
– Reduce communication and synchronization overhead
Slide credits: S. Orlando
High Performance Computing 33
Concurrency vs Parallelism
Slide credits: Tim Mattson, Intel
(Figure: Tasks 1, 2 and 3. Concurrency without parallelism: the tasks are interleaved over time on a single execution unit. Parallelism: the tasks execute simultaneously.)
High Performance Computing 34
The “Holy Grail”
● Write serial code and have a “smart” compiler capable of parallelizing programs automatically
● It has been done, but only in some very specific cases
– In practice, no compiler has proved to be “smart” enough
● Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware
(“Here we come”)
High Performance Computing 35
Parallel programming is difficult
Serial version (~49 lines of C++ code) vs. parallel version (~1000 lines of C/C++ code)
http://www.moreno.marzolla.name/software/svmcell/
High Performance Computing 36
Issues of parallel programming
● Writing parallel programs is in general much harder than writing sequential code
● There is limited portability across different types of architectures
– E.g., a distributed-memory parallel program must be rewritten from scratch to run on a GPU
– However, there are standards (OpenMP, MPI, OpenCL) that allow portability across the same type of parallel architecture
● Tuning for best performance is time-consuming
Intro to Parallel Programming 37
Example: Sum-reduction of an array
("Hello, world!" of parallel programming)
Intro to Parallel Programming 38
Sum-Reduction
● We start assuming a shared-memory architecture
– All execution units share a common memory space
● We begin with a sequential solution and parallelize it
– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
– However, it is sometimes a reasonable starting point
Credits: Mary Hall
Intro to Parallel Programming 39
Sequential algorithm
● Compute the sum of the content of an array A of length n
float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Intro to Parallel Programming 40
Version 1 (Wrong!)
● Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}

(Figure: the array is partitioned into three contiguous blocks, one for each of Proc 0, Proc 1, Proc 2)

WRONG

Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared)
Race condition!
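The race arises because sum += my_x is a read-modify-write operation on the shared variable sum. A possible interleaving (values purely illustrative):
Proc 0 reads sum = 10
Proc 1 reads sum = 10
Proc 0 writes sum = 10 + 2 = 12
Proc 1 writes sum = 10 + 5 = 15
Proc 0's update is lost: the final value should have been 17.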
Intro to Parallel Programming 41
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
WRONG
Proc 0 Proc 1 Proc 2
Intro to Parallel Programming 42
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

(Figure: with n = 17 and P = 3, my_block_len = 17/3 = 5, so the last two elements of the array are not assigned to any processor)
WRONG
Intro to Parallel Programming 43
Version 1(correct, but not efficient)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
Proc 0 Proc 1 Proc 2
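With this index computation every element is covered even when P does not divide n; for example, with n = 17 and P = 3 (integer division):
Proc 0: my_start = 0,  my_end = 17*1/3 = 5   (5 elements)
Proc 1: my_start = 5,  my_end = 17*2/3 = 11  (6 elements)
Proc 2: my_start = 11, my_end = 17*3/3 = 17  (6 elements)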
Intro to Parallel Programming 44
Version 2
● Too much contention on mutex m
– Each processor acquires and releases the mutex for each element of the array!
● Solution: increase the mutex granularity
– Each processor accumulates the partial sum in a local (private) variable
– The mutex is used only at the end, to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
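Note that with this change the mutex is acquired only P times overall (once per processor, at the end) instead of n times (once per array element).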
Intro to Parallel Programming 45
Version 3: Remove the mutex(wrong, in a subtle way)
● We use a shared array psum[] where each processor can store its local sum
● At the end, one processor computes the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
WRONG
Intro to Parallel Programming 46
The problem with version 3
● Processor 0 could start the computation of the global sum before all other processors have computed the local sums!
(Figure: timeline of processors P0, P1, P2, P3: P0 may start adding up the partial sums while P1, P2 and P3 are still computing their local sums)
Intro to Parallel Programming 47
Version 4(correct)
● Use a barrier synchronization
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

(Figure: all processors compute their local sums, then wait at the barrier(); only after every processor has reached the barrier does P0 compute the global sum)
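For comparison, the same shared-memory sum can be written very compactly with OpenMP, which is covered later in the course. The sketch below is my own illustration (not taken from the slides); the reduction(+:sum) clause gives each thread a private partial sum and combines the partial sums at the end, playing the role of psum[] and the barrier:

#include <stdio.h>

/* Minimal OpenMP sketch: the reduction clause creates a private
   copy of sum for each thread and sums the copies at the end */
float omp_sum(const float *A, int n)
{
    float sum = 0.0f;
    int i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}

int main( void )
{
    float A[17];
    int i;
    for (i = 0; i < 17; i++) A[i] = 1.0f;  /* expected sum: 17 */
    printf("sum = %f\n", omp_sum(A, 17));
    return 0;
}

Compile with: gcc -fopenmp sum.c -o sum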
Intro to Parallel Programming 48
Version 5Distributed-memory version
● P << n processors
● Each processor computes a local sum
● Each processor sends its local sum to processor 0 (the master)

...
my_sum = 0.0;
my_start = ...; my_end = ...;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
Intro to Parallel Programming 49
Version 5
(Figure: P = 8 processors with local sums A[ ] = 1, 3, -2, 7, -6, 5, 3, 4. Proc 0 receives the other seven values one at a time and accumulates my_sum = 4, 2, 9, 3, 8, 11 and finally 15. Proc 0 is a bottleneck: it receives P - 1 messages and performs P - 1 additions sequentially.)
Intro to Parallel Programming 50
Parallel reduction
(Figure: tree-based reduction of the same eight local sums 1, 3, -2, 7, -6, 5, 3, 4: the first level produces the partial sums 4, 5, -1, 7; the second level produces 9, 6; the final result 15 ends up at Proc 0)
● (P – 1) sums are still performed; however, processor 0 now receives only ~ log2(P) messages and performs ~ log2(P) sums
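MPI, which is covered later in the course, offers this kind of reduction as a single library call, and implementations are free to use a logarithmic tree internally. The sketch below is my own illustration (not taken from the slides); each process contributes a placeholder local sum:

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    int my_id, P;
    float my_sum, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Each process computes its local sum; here a placeholder value */
    my_sum = (float)my_id;

    /* Combine all local sums into sum on process 0 */
    MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if ( 0 == my_id ) {
        printf("The sum is %f\n", sum);
    }
    MPI_Finalize();
    return 0;
}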
High Performance Computing 51
Task parallelism vs Data parallelism
● Task Parallelism
– Distribute (possibly different) tasks to processors
● Data Parallelism
– Distribute data to processors
– Each processor executes the same task on different data
High Performance Computing 52
Example
● We have a table containing hourly temperatures at some location
– 24 columns, 365 rows
● Compute the minimum, maximum and average temperature for each day
● Assume we have 3 independent processors
High Performance Computing 53
Example
(Figure: a table with one row per day (0—364) and one column per hour (0—23); three extra columns hold the per-day max, min and ave)
High Performance Computing 54
Data parallel approach
(Figure: the same table; the rows (days) are split into three contiguous blocks, and Proc 0, Proc 1 and Proc 2 each compute max, min and ave for the days in their own block)
High Performance Computing 55
Task parallel approach
(Figure: the same table; each processor scans all the days but computes a different statistic: Proc 0 fills the max column, Proc 1 the min column, Proc 2 the ave column)
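A minimal C sketch of the data-parallel approach (my own illustration; the names temp, day_min, day_max, day_ave and the block partitioning are assumptions, not from the slides): each processor is given a contiguous block of days and computes all three statistics for the days in its block.

#include <float.h>

#define DAYS  365
#define HOURS 24

/* Data-parallel decomposition: processor my_id (0 <= my_id < P)
   computes min, max and average for its own block of days */
void day_stats(const float temp[DAYS][HOURS],
               float day_min[DAYS], float day_max[DAYS], float day_ave[DAYS],
               int my_id, int P)
{
    int my_start = DAYS * my_id / P;
    int my_end   = DAYS * (my_id + 1) / P;
    int d, h;
    for (d = my_start; d < my_end; d++) {
        float mn = FLT_MAX, mx = -FLT_MAX, sum = 0.0f;
        for (h = 0; h < HOURS; h++) {
            float t = temp[d][h];
            if (t < mn) mn = t;
            if (t > mx) mx = t;
            sum += t;
        }
        day_min[d] = mn;
        day_max[d] = mx;
        day_ave[d] = sum / HOURS;
    }
}

In the task-parallel variant each processor would instead scan all the days but fill in only one of the three output arrays.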
High Performance Computing 56
Key concepts
● Parallel architectures “naturally” derive from the laws of physics
● Parallel architectures require parallel programming paradigms
● Writing parallel programs is much harder than writing sequential programs