OpenMP Programming 2: Advanced OpenMP Programming
Berk ONAT, İTÜ Bilişim Enstitüsü
21 June 2012
Outline
OpenMP Prog. & App., 21.06.2012
• OpenMP Synchronization Constructs – Single, Critical, Atomic, Barrier
• OpenMP Data Scope Clauses – Firstprivate, Lastprivate
• Advanced OpenMP Directives – Flush, Threadprivate, Copyin
• Runtime library routines
• OpenMP “Danger Zones” – Race Conditions, Deadlock, Livelock
Synchronization
• Synchronization directives help to organize accesses to shared data by multiple threads
• Types: Barrier, Single, Master, Ordered, Critical, Atomic, Locks
Data Scope Clauses
• The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should be scoped.
• These constructs provide the ability to control the data environment during execution of parallel constructs.
  – They define how and which data variables in the serial section of the program are transferred to the parallel sections of the program (and back)
  – They define which variables will be visible to all threads in the parallel sections and which variables will be privately allocated to each thread
• Types: Firstprivate, Lastprivate, Reduction, Copyin
Data Scope Clauses (firstprivate)
• FIRSTPRIVATE Clause:
  – Combines the behavior of the PRIVATE clause with automatic initialization of the variables
  – Variables declared firstprivate are private variables
  – Listed variables are initialized to the value of their original objects prior to entry into the parallel or work-sharing construct

C/C++:   firstprivate (list)
Fortran: FIRSTPRIVATE (list)

Example: Check firstprivate.c example code.
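The initialization behavior can be sketched in a few lines of C. This is a minimal sketch, not taken from firstprivate.c: the function name and values are invented, and the pragmas are simply ignored by a non-OpenMP compiler, so the code also runs serially.

```c
/* Each thread's private copy of "offset" starts with the value the
 * original variable had before the construct (here 100); with plain
 * private(offset) the copies would be uninitialized. */
int firstprivate_sum(int n) {
    int offset = 100;
    int total = 0;
    #pragma omp parallel for firstprivate(offset) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += offset + i;   /* offset == 100 in every thread */
    return total;
}
```

firstprivate_sum(10) adds 100 ten times plus 0..9, giving 1045 regardless of the thread count.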
Data Scope Clauses (lastprivate)
• LASTPRIVATE Clause:
  – Combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object
  – A performance penalty is likely to be associated with the use of lastprivate, because the OpenMP library needs to keep track of which thread executes the last iteration. For a static workload distribution scheme this is relatively easy to do, but for a dynamic scheme it is more costly.

C/C++:   lastprivate (list)
Fortran: LASTPRIVATE (list)

Example: Check lastprivate.c example code.
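A minimal sketch of the copy-back behavior (invented function name, not from lastprivate.c):

```c
/* lastprivate(x) copies the value from the sequentially last
 * iteration (i == n-1) back to the original x after the loop. */
int lastprivate_demo(int n) {
    int x = -1;
    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < n; i++)
        x = i * i;
    return x;               /* (n-1)*(n-1), regardless of thread count */
}
```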
Data Scope Clauses (lastprivate)
• LASTPRIVATE Clause (continued):
  – If the lastprivate clause is used on a sections construct, the object gets assigned the value that it has at the end of the lexically last section.
Data Scope Clauses (reduction)
• REDUCTION Clause:
  – Supports certain forms of recurrence calculations (involving mathematically associative and commutative operators) so that they can be performed in parallel without code modification
  – The result is shared automatically; it is not necessary to declare the reduction variable as shared

C/C++:   reduction (operator : list)
Fortran: REDUCTION (operator : list)

Example: Check reduction.c or reduction.F example code.
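A common use of reduction is a dot product; the sketch below (invented name, not from reduction.c) shows how each thread accumulates a private partial sum that is combined at the end:

```c
/* Each thread accumulates a private copy of s; the partial sums are
 * combined with "+" when the threads finish the loop. */
double dot_product(const double *a, const double *b, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```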
Data Scope Clauses (reduction)
• REDUCTION operator values (the table of C/C++ and Fortran operators is not reproduced in this transcript)

Example: Check reduction.c example code.
Data Scope Clauses (reduction)

SUBROUTINE REDUCTION(A, B, C, D, X, Y, N)
  REAL    :: X(*), A, D
  INTEGER :: Y(*), N, B, C
  INTEGER :: I
  A = 0
  B = 0
  C = Y(1)
  D = X(1)
!$OMP PARALLEL DO PRIVATE(I) SHARED(X, Y, N) REDUCTION(+:A) &
!$OMP& REDUCTION(IEOR:B) REDUCTION(MIN:C) REDUCTION(MAX:D)
  DO I=1,N
    A = A + X(I)
    B = IEOR(B, Y(I))
    C = MIN(C, Y(I))
    IF (D < X(I)) D = X(I)
  END DO
END SUBROUTINE REDUCTION

Reductions over Fortran intrinsics (MIN, MAX, IEOR, ...) have been supported since OpenMP 2.5; min/max reductions in C/C++ were added in OpenMP 3.1 (July 2011).
Data Scope Clauses (copyin)
• COPYIN Clause:
  – Provides a means for assigning the same value to THREADPRIVATE variables for all threads in the team
  – The list contains the names of variables to copy. In Fortran, the list can contain both the names of common blocks and named variables.
  – The master thread's variable is used as the copy source. The team threads are initialized with its value upon entry into the parallel construct.

C/C++:   copyin (list)
Fortran: COPYIN (list)
Advanced OpenMP Constructs
• THREADPRIVATE Directive:
  – Controls whether global data (static in C/C++, common blocks in Fortran) is private or shared
  – By default global data is shared, but sometimes you need to make it private per thread
  – Supports pointers in C/C++ and Fortran, and ALLOCATABLE variables in Fortran
  – The COPYIN clause can be used to initialize the global data variables

C/C++:   #pragma omp threadprivate (list)
Fortran: !$OMP THREADPRIVATE (/cmn/, list...)

Example: Check threadprivate.c or threadprivate.F example
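The interplay of THREADPRIVATE and COPYIN can be sketched as below. This is a made-up example (not from threadprivate.c); the `#ifdef _OPENMP` guard only makes the sketch compile serially as well.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

static int counter = 0;              /* file-scope: shared by default */
#pragma omp threadprivate(counter)   /* now one copy per thread */

int threadprivate_demo(int start) {
    int master_final = 0;
    counter = start;                 /* sets the master thread's copy */
    #pragma omp parallel copyin(counter)  /* copy master's value to every thread */
    {
        counter += 1;                /* touches only this thread's copy */
        #pragma omp master
        master_final = counter;      /* read back the master's copy */
    }
    return master_final;             /* start + 1 */
}
```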
Advanced OpenMP Constructs
• THREADPRIVATE Directive:
  – Each thread gets its own copy of the variable/common block, so data written by one thread is not visible to other threads
  – Restrictions:
    • For THREADPRIVATE data to persist, the parallel regions must be executed by the same number of threads; each thread then continues to work on the set of data it produced previously
    • The value of the dyn-var internal control variable must be false at entry to the first parallel region and remain false until entry to the second parallel region
Advanced OpenMP Constructs
• PRIVATE vs. THREADPRIVATE

                  PRIVATE                            THREADPRIVATE
Data Item         C/C++: variable                    C/C++: variable
                  Fortran: variable, common block    Fortran: common block
Where Declared    Start of region or                 In declarations of each routine
                  work-sharing group                 using the block, or at file scope
Persistent        NO                                 YES
Initialized by    FIRSTPRIVATE                       COPYIN
Clause/Directives Summary
Runtime Library Routines
• OMP_SET_NUM_THREADS
  – Sets the number of threads that will be used in the next parallel region. Must be a positive integer.
  – Notes:
    • This routine can only be called from the serial portions of the code
    • This call has precedence over the OMP_NUM_THREADS environment variable

C/C++:   #include <omp.h>
         void omp_set_num_threads(int n)
Fortran: USE omp_lib
         SUBROUTINE OMP_SET_NUM_THREADS(N)
Runtime Library Routines
• OMP_GET_NUM_THREADS
  – Returns the number of threads that are currently in the team executing the parallel region from which it is called
  – Notes:
    • If this call is made from a serial portion of the program, or from a nested parallel region that is serialized, it returns 1

C/C++:   #include <omp.h>
         int omp_get_num_threads(void)
Fortran: USE omp_lib
         INTEGER OMP_GET_NUM_THREADS()
Runtime Library Routines
• OMP_GET_MAX_THREADS
  – Returns the maximum value that can be returned by a call to the OMP_GET_NUM_THREADS function
  – Notes:
    • Generally reflects the number of threads as set by the OMP_NUM_THREADS environment variable or the OMP_SET_NUM_THREADS() library routine

C/C++:   #include <omp.h>
         int omp_get_max_threads(void)
Fortran: USE omp_lib
         INTEGER OMP_GET_MAX_THREADS()

Example: Check omp_getEnvInfo.c example code.
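The set/get routines can be exercised with a sketch like the one below (not from omp_getEnvInfo.c; the serial fallbacks are an assumption so the code also builds without an OpenMP compiler):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial stand-ins so the sketch also builds without OpenMP */
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

int report_team_size(void) {
    int team = 0;
    omp_set_num_threads(4);           /* request 4 threads (serial code only) */
    #pragma omp parallel
    {
        #pragma omp single
        team = omp_get_num_threads(); /* actual team size inside the region */
    }
    return team;                      /* 4 if the request was honored; 1 serially */
}
```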
Runtime Library Routines
• OMP_GET_THREAD_NUM
  – Returns the thread number, within the team, of the calling thread. This number is between 0 and OMP_GET_NUM_THREADS()-1. The master thread of the team is thread 0.
  – Notes:
    • If called from a serial region, or from a serialized nested parallel region, this function returns 0

C/C++:   #include <omp.h>
         int omp_get_thread_num(void)
Fortran: USE omp_lib
         INTEGER OMP_GET_THREAD_NUM()
Runtime Library Routines
• OMP_GET_WTIME
  – Provides a portable wall-clock timing routine. Returns seconds in double precision.
  – Usually used in pairs, with the value of the first call subtracted from the value of the second call to obtain the elapsed time for a block of code.

C/C++:   #include <omp.h>
         double omp_get_wtime(void)
Fortran: USE omp_lib
         DOUBLE PRECISION OMP_GET_WTIME()

Example: Check omp_wtime.F example code.
The GNU Fortran compiler implements this routine correctly.
USE: gfortran -fopenmp omp_wtime.F -o omp_wtime.x
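The "pair of calls" pattern looks like this in C. A minimal sketch (invented names, not from omp_wtime.F); the clock()-based fallback is a crude serial stand-in, measuring CPU time rather than wall-clock time:

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* crude serial stand-in for omp_get_wtime() */
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Time a block of code: first call before, second call after,
 * elapsed time is the difference. */
double timed_sum(int n, double *elapsed) {
    double t0 = omp_get_wtime();
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 1; i <= n; i++)
        s += 1.0 / i;
    *elapsed = omp_get_wtime() - t0;  /* seconds spent in the loop */
    return s;
}
```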
Advanced OpenMP Constructs
• FLUSH Directive:
  – If a thread updates shared data,
    • the new values will first be saved in a register
    • then stored back to the local cache
  – The updates are thus not necessarily immediately visible to other threads
  – On a cache-coherent machine, the modification to the cache is broadcast to the other processors to make them aware of the change
  – It depends on the platform!
Advanced OpenMP Constructs
• FLUSH Directive:
  – The OpenMP standard specifies that all modifications are written back to main memory, and are thus available to all threads, at synchronization points in the program
  – Sometimes updated values of shared variables must become visible to other threads in between synchronization points
  – The FLUSH directive is used for this purpose
  – The purpose of the flush directive is to make a thread's temporary view of shared data consistent with the values in memory
Advanced OpenMP Constructs
• FLUSH Directive:
  – Implicit FLUSH operations occur at:
    • All explicit and implicit barriers (e.g., at the end of a parallel region or work-sharing construct)
    • Entry to and exit from critical regions
    • Entry to and exit from lock routines

C/C++:   #pragma omp flush (list)
Fortran: !$OMP FLUSH (list)

Example: Check omp_flush_prod_cons.c example code.
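The producer/consumer pattern that flush enables can be sketched as follows. This is a simplified sketch in the spirit of omp_flush_prod_cons.c, not that file itself; the write order (data before flag) is essential.

```c
/* Producer/consumer handshake between two sections: flush makes the
 * writes to "data" and "flag" visible between the implicit
 * synchronization points. */
int flush_handshake(void) {
    int data = 0, flag = 0, seen = -1;
    #pragma omp parallel sections shared(data, flag, seen)
    {
        #pragma omp section          /* producer */
        {
            data = 42;
            #pragma omp flush(data)  /* publish data before raising the flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section          /* consumer */
        {
            int f = 0;
            while (!f) {             /* spin until the producer signals */
                #pragma omp flush(flag)
                f = flag;
            }
            #pragma omp flush(data)  /* re-read data after seeing the flag */
            seen = data;
        }
    }
    return seen;
}
```

Compiled without OpenMP the sections simply run one after the other and the handshake trivially succeeds.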
Conditional Parallel Regions
• IF Clause:
  – Used to specify conditional execution
  – Supported on the parallel construct only
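A sketch of the if clause (invented function name): the region is executed by a team only when the condition holds, which is typically used to skip the threading overhead for small problem sizes.

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads(void) { return 1; }
#endif

/* For small n the if() condition is false and the region is
 * serialized: a team of exactly one thread executes it. */
int team_for_size(int n) {
    int team = 0;
    #pragma omp parallel if(n > 1000)
    {
        #pragma omp single
        team = omp_get_num_threads();
    }
    return team;
}
```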
OpenMP Danger Zones
• There are three major SMP programming errors:
  – Race Conditions
    • A race condition exists when two unsynchronized threads access the same shared variable, with at least one thread modifying it
  – Deadlock
    • Two or more threads are blocked (hang) forever, waiting for each other
  – Livelock
    • Multiple threads keep working on individual tasks, but the ensemble as a whole cannot finish
OpenMP Danger Zones
• Race Conditions
  – Another common mistake is the use of uninitialized variables. Remember that private variables do not have initial values upon entering a parallel construct. Use the firstprivate and lastprivate clauses to initialize them, but only when necessary, because doing so adds extra overhead.
  – Debug: use the Intel C++ Compiler-specific environment variable KMP_LIBRARY=serial, or simply compile without the -openmp flag.
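A race and its fix can be shown in a few lines (a made-up sketch, not from the lab code): several threads may increment the shared counter at once, so the update must be made indivisible, e.g. with atomic.

```c
/* Counting matches across threads: without the atomic directive,
 * concurrent "count++" updates could be lost (a race condition). */
int count_even(const int *v, int n) {
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            count++;             /* indivisible read-modify-write */
        }
    }
    return count;
}
```

A reduction(+:count) clause would be the cheaper idiom here; atomic is shown because it also works for updates that are not reductions.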
OpenMP Danger Zones
• Global Data:

common /work/ a(m,n), b(m)

...
include "global.h"
...
!$omp parallel do private(j)
do j=1, n
   call suba(j)
end do
!$omp end parallel do

subroutine suba(j)
...
include "global.h"
...
do i=1, m
   b(i) = j            ! race condition: b is in a shared common block
end do
return
end
OpenMP Danger Zones
• Global Data: RACE CONDITION!

Thread 1 executes suba(j=1):
  do i=1, m
     b(i) = 1
  end do

Thread 2 executes suba(j=2):
  do i=1, m
     b(i) = 2
  end do

Both threads modify the same shared array b.
OpenMP Danger Zones
• Global Data: SOLUTION 1

common /work/ a(m,n)
common /tprivate/ b(m,nthreads)

...
include "global.h"
...
!$omp parallel do private(j)
do j=1, n
   call suba(j)
end do
!$omp end parallel do

subroutine suba(j)
...
include "global.h"
TID = omp_get_thread_num() + 1
do i=1, m
   b(i, TID) = j
end do
return
end

Extend b so that each thread reaches its own unique storage area.
OpenMP Danger Zones
• Global Data: SOLUTION 2

common /work/ a(m,n)
common /tprivate/ b(m)
!$omp threadprivate (/tprivate/)

...
include "global.h"
...
!$omp parallel do private(j)
do j=1, n
   call suba(j)
end do
!$omp end parallel do

subroutine suba(j)
...
include "global.h"
...
do i=1, m
   b(i) = j
end do
return
end

The compiler creates a private copy of b for each thread.
OpenMP Danger Zones
• Race Conditions
  – Lab: see racecond.F
    • The result varies unpredictably based on the specific order of execution of each section
    • Wrong answers are produced without warning!
  – Fixed version: racecond-fixed.F
    • An "IC" counter is checked at every calculation; IC forces the order of the calculations, and FLUSH forces the update of the shared variable.
OpenMP Danger Zones
• Deadlock
  – Two or more threads are blocked forever, each waiting for a resource (such as a lock or a barrier) that the other blocked threads hold or will never reach
  – Types:
    • 'potential deadlock': a deadlock that did not occur in a given run, but can occur in different runs of the program depending on the timing of the threads' lock requests
    • 'actual deadlock': one that actually occurred in a given run of the program. An actual deadlock causes the threads involved to hang, but may or may not cause the whole process to hang.
OpenMP Danger Zones
• Deadlock

#pragma omp parallel
{
    int me = omp_get_thread_num();
    if (me == 0) goto MASTER;
#pragma omp barrier
MASTER:
#pragma omp single
    printf("done");
}

In this example deadlock occurs because the threads arrive at different barriers: thread 0 skips the explicit barrier that the other threads wait at. A thread skipping a barrier generally causes deadlock. Nested CRITICAL sections or LOCK routines can also cause deadlock.
OpenMP Danger Zones
• Livelock
  – A livelock is similar to a deadlock, except that the states of the threads involved constantly change with regard to one another, with none progressing

!$OMP PARALLEL PRIVATE(ID)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
 1000 CONTINUE
      PHASES(ID) = UPDATE(U, ID)
!$OMP SINGLE
      RES = MATCH(PHASES, N)
!$OMP END SINGLE
      IF (RES**2 .LT. TOL) GOTO 2000
      GOTO 1000
 2000 CONTINUE
!$OMP END PARALLEL

If the square of RES is never smaller than TOL, the program spins endlessly in livelock.
OpenMP Death Traps
• Are you using thread-safe libraries?
• I/O inside a parallel region can interleave unpredictably.
• Make sure you understand what your constructors are doing with private objects.
• Private variables can mask global ones.
• Understand when shared memory is coherent. When in doubt, use FLUSH.
• NOWAIT removes implied barriers.
OpenMP 3.0: task clause!
• TASK Directive:
  – Runs a defined subroutine, function, or code block marked with omp task as an independent unit of work
  – Before standardization, tasking was vendor specific (Intel taskq)
  – The task directive is part of the OpenMP 3.0 standard (the Intel 10.1 and GCC 4.4 compilers implemented task in late 2008)

C/C++:   #pragma omp task
Fortran: !$OMP TASK

Example: Check omp_linked_list.c and omp_task_omp3.c example code.
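The classic tasking sketch is a recursive Fibonacci: each recursive call is spawned as a task and taskwait joins the children. This is a generic illustration, not omp_task_omp3.c; with a pre-3.0 compiler the pragmas are ignored and the code runs as plain serial recursion.

```c
int fib(int n) {
    int a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait      /* wait for the two child tasks */
    return a + b;
}

int fib_parallel(int n) {
    int result = 0;
    #pragma omp parallel
    #pragma omp single        /* one thread seeds the task tree */
    result = fib(n);
    return result;
}
```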
OpenMP 3.0: Intel taskq

C/C++:   #pragma intel omp task
Fortran: !$INTEL OMP TASKQ

Example: Check omp_task_intel.c example code.

bash$ icc omp_task_intel.c -o omp_task_intel.x -openmp -openmp-task intel
TASK BARRIERS: taskwait
*Ref. 4
TASK BARRIERS: taskgroup
*Ref. 4
TASK SYNCHRONIZATION: taskyield
*Ref. 4
Task, Workshare or Nested ?
*Ref. 4
Alignment Evaluation Program
Task, Workshare or Nested ?
*Ref. 4
SparseLU Program
Lab: Exercise 1

Write correct OpenMP pragmas to parallelize the serial matrix multiplication code.

$ ../openmp-application/matrixmultp.c
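One possible shape of the answer (a sketch, not the contents of matrixmultp.c; row-major flat arrays are an assumption):

```c
/* Row-parallel matrix multiply of square n x n matrices: each thread
 * computes a disjoint set of rows of C, so no synchronization is needed. */
void matmul(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}
```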
Lab: Exercise 2

Calculate:

  ∫₀¹ 4/(1+x²) dx ≈ Σᵢ₌₁..ɴ h · 4/(1+xᵢ²) = Σᵢ₌₁..ɴ h · 4/(1+(h(i−1/2))²),  with h = 1/N

1. Write a simple serial code first
2. Implement OpenMP pragmas in your code
3. Are your threads synchronized?
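The midpoint-rule sum above translates directly to a loop; a possible parallel sketch (the integral evaluates to π, and the reduction keeps the partial sums race-free):

```c
/* Midpoint rule for the integral above: x_i = h*(i - 1/2), h = 1/n. */
double compute_pi(int n) {
    double h = 1.0 / n;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        double x = h * (i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;   /* approaches pi as n grows */
}
```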
Lab: Exercise 3

Write an A=LU decomposition code. Back substitution is not necessary for now; just parallelize the loops given below.
Initialize the matrix with any numbers (e.g., A[i][j] = 1.0 + (i*size) + j).
Print L and A after the LU decomposition (A will be the new U).
Can you implement partial pivoting and scaling?

LU-factorization algorithm:
for k=1 to n-1
   for i=k+1 to n
      L(i,k) = A(i,k) / A(k,k)
      for j=k+1 to n
         A(i,j) = A(i,j) - L(i,k)*A(k,j)
      end for
   end for
end for
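The pseudocode above maps to C as follows (a sketch with flat row-major arrays and no pivoting, as the exercise allows): for a fixed k the updates of different rows i are independent, so the i-loop can be a parallel for.

```c
/* kij-form LU factorization: A is overwritten with U in its upper
 * triangle; the multipliers go into L. No pivoting. */
void lu_factor(int n, double *A, double *L) {   /* n x n, row-major */
    for (int k = 0; k < n - 1; k++) {
        #pragma omp parallel for
        for (int i = k + 1; i < n; i++) {       /* rows are independent */
            L[i*n + k] = A[i*n + k] / A[k*n + k];
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= L[i*n + k] * A[k*n + j];
        }
    }
}
```

Note that the outer k-loop cannot be parallelized: step k+1 depends on the updates of step k.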
References:
1. Blaise Barney, OpenMP Tutorial, https://computing.llnl.gov/tutorials/openMP/
2. Rohit Chandra, Parallel Programming in OpenMP, 2000, Morgan Kaufmann
3. Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003, McGraw-Hill
4. Eduard Ayguadé, Alejandro Duran, Jay Hoeflinger, Federico Massaioli and Xavier Teruel, An Experimental Evaluation of the New OpenMP Tasking Model, Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, 2008, Volume 5234/2008, 63-77