
OpenMP at a glance - UT Southwestern


Page 1: OpenMP at a glance - UT Southwestern

OpenMP at a glance


04/17/2019

[email protected]

Page 2: OpenMP at a glance - UT Southwestern

Books

– Peter S. Pacheco, An Introduction to Parallel Programming, 2011

– Victor Eijkhout, Parallel Programming in MPI and OpenMP (online book), http://pages.tacc.utexas.edu/~eijkhout/pcse/html/

Online Resources

– Blaise Barney, LLNL, https://computing.llnl.gov/tutorials/openMP/

– Joel Yliluoma, https://bisqwit.iki.fi/story/howto/openmp/

Useful references


Page 3: OpenMP at a glance - UT Southwestern

Only minimal modifications to the original code are needed (4 extra lines; a sketch follows below).

Computation wall time decreases as the number of threads increases.

Matrix Multiplication Example with OpenMP


C=A*B

A: 3000 x 3000,  B: 3000 x 3000

Using Intel Compiler 16.0.2 with -O3 on NucleusA040
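The modified source is not shown in the slides; the following is a minimal sketch of what the few extra lines could look like, assuming a plain triple-loop multiply of two 3000 x 3000 matrices stored as flat arrays (the function name matmul is illustrative, and the slide's remaining extra lines are probably timing calls, omitted here):

#include <omp.h>                           /* extra line: OpenMP header              */
#define N 3000                             /* assumed square matrix size             */

void matmul(const double *A, const double *B, double *C) {
    int i, j, k;
#pragma omp parallel for private(j, k)     /* extra line: split the rows over threads */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            double tmp = 0.0;              /* per-thread accumulator                 */
            for (k = 0; k < N; k++)
                tmp += A[i * N + k] * B[k * N + j];
            C[i * N + j] = tmp;
        }
}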

Page 4: OpenMP at a glance - UT Southwestern

Shared-memory programming: all threads run on one node and access the same memory.

BioHPC compute nodes have at least 32 cores available for parallelization.

What is OpenMP?


OpenMP works with shared-memory systems

MPI works with distributed-memory systems

Image credit: Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann, 2011

Page 5: OpenMP at a glance - UT Southwestern

Not a new language, but an Application Programming Interface.

– Library routines, compiler directives, environment variables

– Works with C, C++, Fortran

– Needs compiler support (GCC, Intel)

– Purpose: fully utilize computational resources to obtain results in a shorter time.

What is OpenMP?


#include <omp.h> /*include the library */

#pragma omp parallel /* Compiler directive openmp */

function_to_run() /* each thread runs the same code */

int nthreads = omp_get_num_threads(); /* number of threads in the current parallel team (set e.g. via OMP_NUM_THREADS) */

int my_rank = omp_get_thread_num(); /* each thread has its own ID, called its rank */

Page 6: OpenMP at a glance - UT Southwestern

How does OpenMP work?


• A “fork-join” work scheme with master and slaves.

: : :          /* Serial code */
: : :

#pragma omp parallel num_threads(nthreads) [options]
{
    : : :      /* Parallel region: executed by every thread */
    : : :
}

: : :          /* Serial code */
: : :

Page 7: OpenMP at a glance - UT Southwestern

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>                                /* OpenMP header library */

void Hello(void);

int main(int argc, char* argv[]) {
    int nthreads = strtol(argv[1], NULL, 10);   /* get # of threads from the command line */

#pragma omp parallel num_threads(nthreads)      /* the OpenMP directive */
    Hello();

    return 0;
}

void Hello(void) {
    int my_rank  = omp_get_thread_num();        /* each slave gets its rank id */
    int nthreads = omp_get_num_threads();       /* # of threads in the parallel team */

    printf("Hello from thread %d of %d\n", my_rank, nthreads);
}

hello_world.c


Page 8: OpenMP at a glance - UT Southwestern

Compile using GCC 4.8.5 shipped with the system (compute node/workstation),

$ gcc -o hello hello_world.c -fopenmp

Compile with Intel compiler,

$ module load intel/16.0.2

$ icc -o hello hello_world.c -qopenmp

Other optional arguments: -O3, -Wall or -w

To run:

$ ./hello 4

Hello from thread 0 of 4

Hello from thread 2 of 4

Hello from thread 1 of 4

Hello from thread 3 of 4

hello_world.c


Page 9: OpenMP at a glance - UT Southwestern

π = 4 ( 1 − 1/3 + 1/5 − 1/7 + ⋯ ) = 4 Σ_{k=0}^{∞} (−1)^k / (2k + 1)

Calculate π with n = 32 and n = 10,000,000 terms


Serial code version:

double sum = 0.0, factor = 1.0;

for ( k = 0; k < n; k++ ) {
    factor = ( k%2 == 0 ) ? 1.0 : -1.0;
    sum += factor / (2 * k + 1);
}
sum *= 4.;

Page 10: OpenMP at a glance - UT Southwestern

Inside a parallel region each variable has a scope: a private variable gets its own copy on each thread's stack, while a shared variable (the default for variables declared before the region) is seen by all threads.

double sum = 0., local_result = 0., factor = 0.;
int my_rank;
int nthreads = strtol(argv[1], NULL, 10);

#pragma omp parallel num_threads(nthreads)
{
    local_result = 0.;
    my_rank = omp_get_thread_num();

    factor = ( my_rank%2 == 0 ) ? 1.0 : -1.0;
    local_result = 4. * factor / (2 * my_rank + 1);

    sum += local_result;
}

Scope of variable – private vs shared

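In the snippet above, my_rank, factor and local_result are declared before the parallel region, so by default they are shared and the threads race on them. A minimal corrected sketch (an assumption, not the slide's code) makes them private and protects the shared sum, anticipating the next slide:

#pragma omp parallel num_threads(nthreads) private(my_rank, factor, local_result)
{
    my_rank      = omp_get_thread_num();
    factor       = ( my_rank%2 == 0 ) ? 1.0 : -1.0;
    local_result = 4. * factor / (2 * my_rank + 1);

#pragma omp critical        /* sum is shared: only one thread may update it at a time */
    sum += local_result;
}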

Page 11: OpenMP at a glance - UT Southwestern

Parallel threads often produce local values that need to be summed or otherwise combined.

A race condition can occur when all threads write to the same shared value.

The critical clause prevents the race, but it serializes the updates, so it is not efficient.

The reduction clause is an efficient way to combine local results.

critical vs reduction


#pragma omp parallel
{
    my_rank = omp_get_thread_num();
    local_result = f(x, my_rank);

#pragma omp critical
    global_result += local_result;
}

#pragma omp parallel reduction(+:global_result)
{
    my_rank = omp_get_thread_num();
    local_result = f(x, my_rank);

    global_result += local_result;
}

Image retrieved from http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2018/lec13-parallel.html

Page 12: OpenMP at a glance - UT Southwestern

double sum = 0.0, factor = 1.;

int k, n;

#pragma omp parallel for num_threads(nthreads) reduction(+:sum) private(factor)

for (k = 0; k < n; k++)

{

factor = ( k%2 == 0) ? 1.0 : -1.0;

sum += 4. * factor / (2 * k + 1);

}

parallel for clause


Page 13: OpenMP at a glance - UT Southwestern

The parallel for directive needs a loop in canonical form so that the iterations can be divided into chunks:

– The loop variable i must be an integer or pointer type

– The expressions start, end and incr must have compatible types

– start, end and incr must not change during execution of the loop

– The variable i may only be modified by the increment expression in the for statement

for ( i = start ; i < end ; i++ )
      /* test may also be: i <= end, i > end, i >= end               */
      /* increment may also be: ++i, i--, --i, i += incr, i -= incr  */

OpenMP's parallel for does not work with while loops; rewrite them in canonical for form first (see the sketch below).

parallel for clause

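As an illustration of the while-loop restriction above (the variable names data, target and count are assumptions, not from the slides):

/* Not directly parallelizable: the trip count of a while loop is unknown up front. */
i = 0;
while (i < n) {
    if (data[i] == target) count++;
    i++;
}

/* The same work in canonical for form; parallel for can now distribute the iterations. */
#pragma omp parallel for reduction(+:count)
for (i = 0; i < n; i++)
    if (data[i] == target)
        count++;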

Page 14: OpenMP at a glance - UT Southwestern

The schedule clause specifies how chunks of loop iterations are assigned to threads.

Benefit: align the distribution of work with your data structure.

schedule(<type> [, <chunk size>])

schedule


Image Source: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html
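A short sketch of the two most common choices (the loop bodies and the chunk size of 4 are illustrative assumptions): static splits the iterations into fixed chunks up front, while dynamic hands out chunks as threads finish, which helps when iteration costs vary.

/* Even cost per iteration: a static schedule has the least overhead.     */
#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    y[i] = f(i);

/* Uneven cost per iteration: dynamic chunks of 4 keep all threads busy.  */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    z[i] = irregular_work(i);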

Page 15: OpenMP at a glance - UT Southwestern

When a shared array is updated by all the threads in the parallel team, false sharing can happen: threads writing to different elements that sit on the same cache line force that line to bounce between cores.

#pragma omp parallel for private(i, j) schedule(static, 1)

for ( i = 0; i < size; i++)

for ( j = 0; j < size; j++)

y[i] += f(i, j);

Cache and False sharing

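In the loop above, schedule(static, 1) makes neighbouring threads update neighbouring y[i] entries, which often sit on the same cache line. One common remedy (a sketch, not the slide's code) is to accumulate into a thread-local scalar and write each y[i] once, with larger contiguous chunks per thread:

#pragma omp parallel for private(i, j) schedule(static)
for (i = 0; i < size; i++) {
    double tmp = 0.0;            /* lives on the thread's own stack, not shared    */
    for (j = 0; j < size; j++)
        tmp += f(i, j);
    y[i] = tmp;                  /* one write per row instead of size writes       */
}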

Page 16: OpenMP at a glance - UT Southwestern

Speedup: how many times faster the parallel run finishes compared with the serial run.

S = T_serial / T_parallel

Efficiency: speedup normalized by the number of threads p; because there is always some overhead, E stays below 1 in practice.

E = S / p = T_serial / ( p × T_parallel )

Parallelization performance

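A worked example with hypothetical numbers: if the serial run takes T_serial = 120 s and the run with p = 8 threads takes T_parallel = 20 s, then S = 120 / 20 = 6 and E = 120 / (8 × 20) = 0.75, i.e. on average each thread does useful work 75% of the time.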

Page 17: OpenMP at a glance - UT Southwestern

OpenMP is a shared memory programming API.

OpenMP works with C/C++/Fortran

OpenMP works in a fork-join mode with master-slave threads

Variables have a scope inside a parallel region; by default they are shared (except the for-loop index).

Use the reduction clause to combine per-thread local values.

False sharing needs to be avoided.

Recapping


Page 18: OpenMP at a glance - UT Southwestern

Thanks for your attention!