
OpenMP at a glance - UT Southwestern


Page 1: OpenMP at a glance - UT Southwestern

OpenMP at a glance


04/17/2019

[email protected]

Page 2: OpenMP at a glance - UT Southwestern

Books

– Peter S. Pacheco, An Introduction to Parallel Programming, 2011

– Victor Eijkhout, Parallel Programming in MPI and OpenMP (online book), http://pages.tacc.utexas.edu/~eijkhout/pcse/html/

Online Resources

– Blaise Barney, LLNL, https://computing.llnl.gov/tutorials/openMP/

– Joel Yliluoma, https://bisqwit.iki.fi/story/howto/openmp/

Useful references


Page 3: OpenMP at a glance - UT Southwestern

Only minimal modifications to the original code are needed (4 extra lines; a sketch follows below).

Computation wall time decreases as the number of threads increases.

Matrix Multiplication Example with OpenMP


C=A*B

A: 3000 x 3000,  B: 3000 x 3000

Using Intel Compiler 16.0.2 with -O3 on NucleusA040
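The modified source is not shown in the slides; the following is a minimal sketch of what the few extra lines could look like, assuming a plain triple-loop multiply of two 3000 x 3000 matrices stored as flat arrays (the function name matmul is illustrative, and the slide's remaining extra lines are probably timing calls, omitted here):

#include <omp.h>                           /* extra line: OpenMP header              */
#define N 3000                             /* assumed square matrix size             */

void matmul(const double *A, const double *B, double *C) {
    int i, j, k;
#pragma omp parallel for private(j, k)     /* extra line: split the rows over threads */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            double tmp = 0.0;              /* per-thread accumulator                 */
            for (k = 0; k < N; k++)
                tmp += A[i * N + k] * B[k * N + j];
            C[i * N + j] = tmp;
        }
}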

Page 4: OpenMP at a glance - UT Southwestern

Shared-memory programming: all threads run on one node and access the same memory.

BioHPC compute nodes have at least 32 cores available for parallelization.

What is OpenMP?


OpenMP works with shared-memory systems

MPI works with distributed-memory systems

Image credit: Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann, 2011

Page 5: OpenMP at a glance - UT Southwestern

Not a new language, but an Application Programming Interface.

– Library routines, compiler directives, environment variables

– Works with C, C++, Fortran

– Needs compiler support (GCC, Intel)

– Purpose: fully utilize computational resources to obtain results in a shorter time.

What is OpenMP?


#include <omp.h> /*include the library */

#pragma omp parallel /* Compiler directive openmp */

function_to_run() /* each thread runs the same code */

int nthreads = omp_get_num_threads(); /* number of threads in the current parallel team (set e.g. via OMP_NUM_THREADS) */

int my_rank = omp_get_thread_num(); /* each thread has its own ID, called its rank */

Page 6: OpenMP at a glance - UT Southwestern

How does OpenMP work?


• A “fork-join” work scheme with master and slaves.

: : :          /* Serial code */
: : :

#pragma omp parallel num_threads(nthreads) [options]
{
    : : :      /* Parallel region: executed by every thread */
    : : :
}

: : :          /* Serial code */
: : :

Page 7: OpenMP at a glance - UT Southwestern

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>                                /* OpenMP header library */

void Hello(void);

int main(int argc, char* argv[]) {
    int nthreads = strtol(argv[1], NULL, 10);   /* get # of threads from the command line */

#pragma omp parallel num_threads(nthreads)      /* the OpenMP directive */
    Hello();

    return 0;
}

void Hello(void) {
    int my_rank  = omp_get_thread_num();        /* each slave gets its rank id */
    int nthreads = omp_get_num_threads();       /* # of threads in the parallel team */

    printf("Hello from thread %d of %d\n", my_rank, nthreads);
}

hello_world.c


Page 8: OpenMP at a glance - UT Southwestern

Compile using GCC 4.8.5 shipped with the system (compute node/workstation),

$ gcc -o hello hello_world.c -fopenmp

Compile with Intel compiler,

$ module load intel/16.0.2

$ icc -o hello hello_world.c -qopenmp

Other optional arguments: -O3, -Wall or -w

To run:

$ ./hello 4

Hello from thread 0 of 4

Hello from thread 2 of 4

Hello from thread 1 of 4

Hello from thread 3 of 4

hello_world.c


Page 9: OpenMP at a glance - UT Southwestern

π = 4 ( 1 − 1/3 + 1/5 − 1/7 + ⋯ ) = 4 Σ_{k=0}^{∞} (−1)^k / (2k + 1)

Calculate π with n = 32 and n = 10,000,000 terms


Serial code version:

double sum = 0.0, factor = 1.0;

for ( k = 0; k < n; k++ ) {
    factor = ( k%2 == 0 ) ? 1.0 : -1.0;
    sum += factor / (2 * k + 1);
}
sum *= 4.;

Page 10: OpenMP at a glance - UT Southwestern

Inside a parallel region each variable has a scope: a private variable gets its own copy on each thread's stack, while a shared variable (the default for variables declared before the region) is seen by all threads.

double sum = 0., local_result = 0., factor = 0.;
int my_rank;
int nthreads = strtol(argv[1], NULL, 10);

#pragma omp parallel num_threads(nthreads)
{
    local_result = 0.;
    my_rank = omp_get_thread_num();

    factor = ( my_rank%2 == 0 ) ? 1.0 : -1.0;
    local_result = 4. * factor / (2 * my_rank + 1);

    sum += local_result;
}

Scope of variable – private vs shared

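In the snippet above, my_rank, factor and local_result are declared before the parallel region, so by default they are shared and the threads race on them. A minimal corrected sketch (an assumption, not the slide's code) makes them private and protects the shared sum, anticipating the next slide:

#pragma omp parallel num_threads(nthreads) private(my_rank, factor, local_result)
{
    my_rank      = omp_get_thread_num();
    factor       = ( my_rank%2 == 0 ) ? 1.0 : -1.0;
    local_result = 4. * factor / (2 * my_rank + 1);

#pragma omp critical        /* sum is shared: only one thread may update it at a time */
    sum += local_result;
}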

Page 11: OpenMP at a glance - UT Southwestern

Parallel threads often produce local values that need to be summed or otherwise combined.

A race condition can occur when all threads write to the same shared value.

The critical clause prevents the race, but it serializes the updates, so it is not efficient.

The reduction clause is an efficient way to combine local results.

critical vs reduction


#pragma omp parallel
{
    my_rank = omp_get_thread_num();
    local_result = f(x, my_rank);

#pragma omp critical
    global_result += local_result;
}

#pragma omp parallel reduction(+:global_result)
{
    my_rank = omp_get_thread_num();
    local_result = f(x, my_rank);

    global_result += local_result;
}

Image retrieved from http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2018/lec13-parallel.html

Page 12: OpenMP at a glance - UT Southwestern

double sum = 0.0, factor = 1.;

int k, n;

#pragma omp parallel for num_threads(nthreads) reduction(+:sum) private(factor)

for (k = 0; k < n; k++)

{

factor = ( k%2 == 0) ? 1.0 : -1.0;

sum += 4. * factor / (2 * k + 1);

}

parallel for clause


Page 13: OpenMP at a glance - UT Southwestern

The parallel for directive needs a loop in canonical form so that the iterations can be divided into chunks:

– The loop variable i must be an integer or pointer type

– The expressions start, end and incr must have compatible types

– start, end and incr must not change during execution of the loop

– The variable i may only be modified by the increment expression in the for statement

for ( i = start ; i < end ; i++ )
      /* test may also be: i <= end, i > end, i >= end               */
      /* increment may also be: ++i, i--, --i, i += incr, i -= incr  */

OpenMP's parallel for does not work with while loops; rewrite them in canonical for form first (see the sketch below).

parallel for clause

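As an illustration of the while-loop restriction above (the variable names data, target and count are assumptions, not from the slides):

/* Not directly parallelizable: the trip count of a while loop is unknown up front. */
i = 0;
while (i < n) {
    if (data[i] == target) count++;
    i++;
}

/* The same work in canonical for form; parallel for can now distribute the iterations. */
#pragma omp parallel for reduction(+:count)
for (i = 0; i < n; i++)
    if (data[i] == target)
        count++;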

Page 14: OpenMP at a glance - UT Southwestern

The schedule clause specifies how chunks of loop iterations are assigned to threads.

Benefit: align the distribution of work with your data structure.

schedule(<type> [, <chunk size>])

schedule


Image Source: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html
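A short sketch of the two most common choices (the loop bodies and the chunk size of 4 are illustrative assumptions): static splits the iterations into fixed chunks up front, while dynamic hands out chunks as threads finish, which helps when iteration costs vary.

/* Even cost per iteration: a static schedule has the least overhead.     */
#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    y[i] = f(i);

/* Uneven cost per iteration: dynamic chunks of 4 keep all threads busy.  */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    z[i] = irregular_work(i);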

Page 15: OpenMP at a glance - UT Southwestern

When a shared array is updated by all the threads in the parallel team, false sharing can happen: threads writing to different elements that sit on the same cache line force that line to bounce between cores.

#pragma omp parallel for private(i, j) schedule(static, 1)

for ( i = 0; i < size; i++)

for ( j = 0; j < size; j++)

y[i] += f(i, j);

Cache and False sharing

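In the loop above, schedule(static, 1) makes neighbouring threads update neighbouring y[i] entries, which often sit on the same cache line. One common remedy (a sketch, not the slide's code) is to accumulate into a thread-local scalar and write each y[i] once, with larger contiguous chunks per thread:

#pragma omp parallel for private(i, j) schedule(static)
for (i = 0; i < size; i++) {
    double tmp = 0.0;            /* lives on the thread's own stack, not shared    */
    for (j = 0; j < size; j++)
        tmp += f(i, j);
    y[i] = tmp;                  /* one write per row instead of size writes       */
}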

Page 16: OpenMP at a glance - UT Southwestern

Speedup: how many times faster the parallel run finishes compared with the serial run.

S = T_serial / T_parallel

Efficiency: speedup normalized by the number of threads p; because there is always some overhead, E stays below 1 in practice.

E = S / p = T_serial / ( p × T_parallel )

Parallelization performance

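A worked example with hypothetical numbers: if the serial run takes T_serial = 120 s and the run with p = 8 threads takes T_parallel = 20 s, then S = 120 / 20 = 6 and E = 120 / (8 × 20) = 0.75, i.e. on average each thread does useful work 75% of the time.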

Page 17: OpenMP at a glance - UT Southwestern

OpenMP is a shared memory programming API.

OpenMP works with C/C++/Fortran

OpenMP works in a fork-join mode with master-slave threads

Variables have a scope inside a parallel region; by default they are shared (except the for-loop index).

Use the reduction clause to combine per-thread local values.

False sharing needs to be avoided.

Recapping


Page 18: OpenMP at a glance - UT Southwestern

Thanks for your attention!