Multi-core Programming

Programming with OpenMP


Page 1: Multi-core Programming

Multi-core Programming

Programming with OpenMP

Page 2: Multi-core Programming

Topics

• What is OpenMP?
• Parallel Regions
• Worksharing Construct
• Data Scoping to Protect Data
• Explicit Synchronization
• Scheduling Clauses
• Other Helpful Constructs and Clauses

Page 3: Multi-core Programming


What Is OpenMP*?

• Compiler directives for multithreaded programming

• Easy to create threaded Fortran and C/C++ programs

• Supports data parallelism model

• Incremental parallelism: combines serial and parallel code in a single source

Page 4: Multi-core Programming


What Is OpenMP*?

A sampler of the directives, runtime library calls, and environment variables that make up OpenMP:

omp_set_lock(lck)
#pragma omp parallel for private(A, B)
#pragma omp critical
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
setenv OMP_SCHEDULE "dynamic"
CALL OMP_SET_NUM_THREADS(10)
C$OMP DO lastprivate(XX)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
Nthrds = OMP_GET_NUM_PROCS()
!$OMP BARRIER

http://www.openmp.org
Current spec is OpenMP 2.5: 250 pages (combined C/C++ and Fortran)

Page 5: Multi-core Programming


What Is OpenMP*?

Example: NREL AeroDyn/FreeWake FORTRAN Wind Turbine Wake Modeling Program

• > 100,000 lines of FORTRAN Code

Page 6: Multi-core Programming


What Is OpenMP*?

Example: OMP Implementation

DO ibn = 1, NB                        ! each wake: blade
!$OMP PARALLEL DO PRIVATE(xa,ya,za,xb,yb,zb,wx,wy,wz,nnfirst), &
!$OMP&   REDUCTION(+:xindv), REDUCTION(+:yindv), REDUCTION(+:zindv)
   DO je = 1, NELM+1                  ! each wake: element boundary
      ...
   END DO
!$OMP END PARALLEL DO
END DO

[Chart: Run Time vs. Threads; run time in seconds (0 to 600) against thread count (1 to 8)]

Page 7: Multi-core Programming


OpenMP* Architecture

• Fork-join model
• Work-sharing constructs
• Data environment constructs
• Synchronization constructs
• Extensive Application Program Interface (API) for finer control

Page 8: Multi-core Programming


Programming Model

Fork-join parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally: the sequential program evolves into a parallel program

[Diagram: master thread forking into teams of threads at each parallel region, then joining back to the master thread]

Page 9: Multi-core Programming

OpenMP* Pragma Syntax

• Most constructs in OpenMP* are compiler directives or pragmas.
• For C and C++, the pragmas take the form:

#pragma omp construct [clause [clause]…]

Page 10: Multi-core Programming


Parallel Regions

• Defines parallel region over structured block of code
• Threads are created as ‘parallel’ pragma is crossed
• Threads block at end of region
• Data is shared among threads unless specified otherwise

[Diagram: a single #pragma omp parallel region executed concurrently by Thread 1, Thread 2, and Thread 3]

C/C++:
#pragma omp parallel
{
   block
}
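To make the semantics concrete, here is a minimal runnable C sketch of a bare parallel region (our example, not from the slides; no runtime library calls are needed, only the pragma):

#include <stdio.h>

int main(void)
{
    // The structured block below is executed once by every thread in the team.
    #pragma omp parallel
    {
        printf("hello from the parallel region\n");
    }   // threads block (join) here at the end of the region
    return 0;
}

Built with an OpenMP flag (e.g. -fopenmp for GCC) and run with OMP_NUM_THREADS=4, this prints the line four times; without the flag the pragma is ignored and the program runs serially.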

Page 11: Multi-core Programming

How Many Threads?

• Set environment variable for number of threads

set OMP_NUM_THREADS=4

• There is no standard default for this variable
  – Many systems: # of threads = # of processors
  – Intel® compilers use this default
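The count can also be set from code; a small hedged sketch (the value 4 is arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);   // overrides OMP_NUM_THREADS for subsequent regions
    #pragma omp parallel
    {
        #pragma omp master    // report once, from the master thread
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}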

Page 12: Multi-core Programming


Work-sharing Construct

• Splits loop iterations into threads
• Must be in the parallel region
• Must precede the loop

#pragma omp parallel
#pragma omp for
   for (i = 0; i < N; i++) {
      Do_Work(i);
   }

Page 13: Multi-core Programming


Work-sharing Construct

• Threads are assigned an independent set of iterations
• Threads must wait at the end of the work-sharing construct

#pragma omp parallel
#pragma omp for
   for (i = 0; i < 12; i++)
      c[i] = a[i] + b[i];

[Diagram: iterations i = 0..3, i = 4..7, and i = 8..11 assigned to three threads, with an implicit barrier at the end of the #pragma omp for]

Page 14: Multi-core Programming


Combining pragmas

• These two code segments are equivalent:

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < MAX; i++) {
      res[i] = huge();
   }
}

#pragma omp parallel for
for (i = 0; i < MAX; i++) {
   res[i] = huge();
}

Page 15: Multi-core Programming


Data Environment

• OpenMP uses a shared-memory programming model
• Most variables are shared by default
• Global variables are shared among threads
  – C/C++: file-scope variables, static variables

Page 16: Multi-core Programming


Data Environment

• But, not everything is shared...
  – Stack variables in functions called from parallel regions are PRIVATE
  – Automatic variables within a statement block are PRIVATE
  – Loop index variables are private (with exceptions)
    • C/C++: the first loop index variable in nested loops following a #pragma omp for

These rules are illustrated in the sketch below.
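A hedged sketch of these default scoping rules (variable and function names are ours):

int counter = 0;                   // file scope: SHARED among threads

void helper(void)
{
    int local = 0;                 // stack variable in a called function: PRIVATE
    local++;
}

void demo(float *out, int N)
{
    #pragma omp parallel
    {
        float tmp = 0.0f;          // automatic variable inside the block: PRIVATE
        #pragma omp for
        for (int i = 0; i < N; i++) {   // loop index of the omp for: PRIVATE
            tmp += 1.0f;
            out[i] = tmp;          // out points to shared data
            helper();
        }
    }
}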

Page 17: Multi-core Programming

Data Scope Attributes

• The default status can be modified with
  – default (shared | none)
• Scoping attribute clauses:
  – shared(varname,…)
  – private(varname,…)

Page 18: Multi-core Programming


The Private Clause

• Reproduces the variable for each thread
• Variables are un-initialized; C++ objects are default constructed
• Any value external to the parallel region is undefined

void work(float* c, int N)   // a and b are assumed to be file-scope arrays
{
    float x, y;
    int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }
}

Page 19: Multi-core Programming


Example: Dot Product

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

What is Wrong?

Page 20: Multi-core Programming


Protect Shared Data

• Must protect access to shared, modifiable data

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}

Page 21: Multi-core Programming

OpenMP* Critical Construct

• #pragma omp critical [(lock_name)]
• Defines a critical region on a structured block

float R1, R2;
#pragma omp parallel
{
    float A, B;
    #pragma omp for
    for (int i = 0; i < niters; i++) {
        B = big_job(i);
        #pragma omp critical (R1_lock)
        consum(B, &R1);
        A = bigger_job(i);
        #pragma omp critical (R2_lock)
        consum(A, &R2);
    }
}

Threads wait their turn: only one at a time calls consum(), protecting R1 and R2 from race conditions. Naming the critical constructs is optional, but may increase performance; here the names R1_lock and R2_lock let the two updates proceed independently.

Page 22: Multi-core Programming


OpenMP* Reduction Clause

reduction (op : list)
• The variables in "list" must be shared in the enclosing parallel region
• Inside parallel or work-sharing construct:
  – A PRIVATE copy of each list variable is created and initialized depending on the "op"
  – These copies are updated locally by threads
  – At end of construct, local copies are combined through "op" into a single value and combined with the value in the original SHARED variable

Page 23: Multi-core Programming


Reduction Example

• Local copy of sum for each thread
• All local copies of sum added together and stored in "global" variable

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}

Page 24: Multi-core Programming


C/C++ Reduction Operations

• A range of associative and commutative operators can be used with reduction
• Initial values are the ones that make sense

Operator   Initial Value
+          0
*          1
-          0
^          0
&          ~0
|          0
&&         1
||         0
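For a non-arithmetic case (our example, not from the slides): with &&, each private copy starts at the identity value 1, so the result is true only if every thread's partial result is true:

int all_positive(const float* a, int N)
{
    int ok = 1;                          // shared result variable
    #pragma omp parallel for reduction(&&:ok)
    for (int i = 0; i < N; i++) {
        ok = ok && (a[i] > 0.0f);        // each thread updates its private copy
    }
    return ok;                           // 1 only if every element is positive
}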

Page 25: Multi-core Programming


Numerical Integration Example

f(x) = 4.0 / (1 + x²)

∫₀¹ 4.0 / (1 + x²) dx = π

[Plot: f(x) over x = 0.0 to 1.0, falling from 4.0 to 2.0; the area under the curve equals π]

#include <stdio.h>

static long num_steps = 100000;
double step, pi;

int main(void)
{
    int i;
    double x, sum = 0.0;

    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
    return 0;
}
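The slides leave parallelizing this loop as the natural exercise; a hedged sketch combining the constructs introduced above (parallel for, private, reduction):

#include <stdio.h>

static long num_steps = 100000;

int main(void)
{
    double step = 1.0 / (double) num_steps;
    double x, sum = 0.0;

    // x is per-thread scratch; partial sums are combined by reduction(+)
    #pragma omp parallel for private(x) reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("Pi = %f\n", step * sum);
    return 0;
}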

Page 26: Multi-core Programming


Assigning Iterations

• The schedule clause affects how loop iterations are mapped onto threads

schedule(static [,chunk])
• Blocks of iterations of size "chunk" assigned to threads
• Round-robin distribution

schedule(dynamic [,chunk])
• Threads grab "chunk" iterations
• When done with its iterations, a thread requests the next set

schedule(guided [,chunk])
• Dynamic schedule starting with large blocks
• Size of the blocks shrinks, but no smaller than "chunk"

Page 27: Multi-core Programming

Which Schedule to Use

Schedule   When To Use
STATIC     Predictable and similar work per iteration
DYNAMIC    Unpredictable, highly variable work per iteration
GUIDED     Special case of dynamic to reduce scheduling overhead

Page 28: Multi-core Programming


Schedule Clause Example

#pragma omp parallel for schedule(static, 8)
for (int i = start; i <= end; i += 2) {
    if (TestForPrime(i))
        gPrimesFound++;
}

Iterations are divided into chunks of 8
• If start = 3, then first chunk is i = {3,5,7,9,11,13,15,17}
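Note that the increment of the shared counter gPrimesFound is itself a data race as written; a hedged fix reusing the reduction clause from the earlier slides:

#pragma omp parallel for schedule(static, 8) reduction(+:gPrimesFound)
for (int i = start; i <= end; i += 2) {
    if (TestForPrime(i))
        gPrimesFound++;   // each thread counts privately; counts are summed at the end
}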

Page 29: Multi-core Programming


Parallel Sections

• Independent sections of code can execute concurrently

[Diagram: serial execution of phase1, phase2, phase3 versus the three phases running in parallel]

#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}

Page 30: Multi-core Programming


Single Construct

• Denotes block of code to be executed by only one thread
• Thread chosen is implementation dependent
• Implicit barrier at end

#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    }   // threads wait here for single
    DoManyMoreThings();
}

Page 31: Multi-core Programming

Master Construct

• Denotes block of code to be executed only by the master thread
• No implicit barrier at end

#pragma omp parallel
{
    DoManyThings();
    #pragma omp master
    {   // if not master, skip to next stmt
        ExchangeBoundaries();
    }
    DoManyMoreThings();
}

Page 32: Multi-core Programming


Implicit Barriers

• Several OpenMP* constructs have implicit barriers:
  – parallel
  – for
  – single
• Unnecessary barriers hurt performance: waiting threads accomplish no work!
• Suppress implicit barriers, when safe, with the nowait clause

Page 33: Multi-core Programming


Nowait Clause

• Use when threads would wait between independent computations

#pragma omp single nowait
{ [...] }

#pragma omp for nowait
for (...) { ... }

#pragma omp for schedule(dynamic,1) nowait
for (int i = 0; i < n; i++)
    a[i] = bigFunc1(i);

#pragma omp for schedule(dynamic,1)
for (int j = 0; j < m; j++)
    b[j] = bigFunc2(j);

Page 34: Multi-core Programming


Barrier Construct

• Explicit barrier synchronization
• Each thread waits until all threads arrive

#pragma omp parallel shared (A, B, C)
{
    DoSomeWork(A, B);
    printf("Processed A into B\n");
    #pragma omp barrier
    DoSomeWork(B, C);
    printf("Processed B into C\n");
}

Page 35: Multi-core Programming


Atomic Construct

• Special case of a critical section
• Applies only to simple update of a memory location

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);   // protected: index[i] may repeat across iterations
    y[i] += work2(i);          // unprotected: each iteration touches a distinct y[i]
}

Page 36: Multi-core Programming


OpenMP* API

• Get the thread number within a team
  int omp_get_thread_num(void);
• Get the number of threads in a team
  int omp_get_num_threads(void);
• Usually not needed for OpenMP codes
  – Can lead to code not being serially consistent
  – Does have specific uses (debugging)
• Must include a header file
  #include <omp.h>
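As a sketch of the serial-consistency caveat (our example, not from the slides): code that branches on the thread number takes different paths in serial and parallel builds, since a serial run only ever sees thread 0:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        if (id == 0)
            printf("setup done by thread 0\n");
        else
            printf("worker %d of %d\n", id, omp_get_num_threads());
        // In a serial build only the id == 0 branch ever executes,
        // so the worker path is silently never exercised.
    }
    return 0;
}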

Page 37: Multi-core Programming


Monte Carlo Pi

Throw darts at random into a square of side 2r enclosing a circle of radius r:

   (# of darts hitting circle) / (# of darts in square) ≈ π r² / (2r)² = π / 4

   π ≈ 4 × (# of darts hitting circle) / (# of darts in square)

loop 1 to MAX
   x.coor = (random#)
   y.coor = (random#)
   dist = sqrt(x^2 + y^2)
   if (dist <= 1) hits = hits + 1
pi = 4 * hits/MAX

[Diagram: circle of radius r inscribed in a square, with random dart positions]

Page 38: Multi-core Programming


Making Monte Carlo Parallel

hits = 0
call SEED48(1)
DO i = 1, max
   x = DRAND48()
   y = DRAND48()
   IF (SQRT(x*x + y*y) .LT. 1) THEN
      hits = hits + 1
   ENDIF
END DO
pi = REAL(hits)/REAL(max) * 4.0

What is the challenge here?
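The challenge (our reading, consistent with the deck's earlier data-scoping points): DRAND48 carries hidden shared state, so concurrent calls race and corrupt the random stream, and hits needs a reduction. A hedged C sketch using a per-thread rand_r() seed (POSIX) instead:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long MAX = 1000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        // Give each thread its own seed so the generators are independent
        unsigned int seed = 1234u + omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < MAX; i++) {
            double x = (double) rand_r(&seed) / RAND_MAX;
            double y = (double) rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)   // inside the quarter circle (sqrt not needed)
                hits++;                 // private copy; combined by reduction(+)
        }
    }
    printf("pi ~= %f\n", 4.0 * (double) hits / (double) MAX);
    return 0;
}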

Page 39: Multi-core Programming


Programming with OpenMP: What's Been Covered

• OpenMP* is a simple approach to parallel programming for shared-memory machines
• We explored basic OpenMP coding: how to
  – Make code regions parallel (omp parallel)
  – Split up work (omp for)
  – Categorize variables (omp private…)
  – Synchronize (omp critical…)
• We reinforced fundamental OpenMP concepts through several labs

Page 40: Multi-core Programming

Advanced Concepts

Page 41: Multi-core Programming


More OpenMP*

• Data environment constructs:
  – FIRSTPRIVATE
  – LASTPRIVATE
  – THREADPRIVATE

Page 42: Multi-core Programming


Firstprivate Clause

• Variables initialized from shared variable
• C++ objects are copy-constructed

incr = 0;
#pragma omp parallel for firstprivate(incr)
for (i = 0; i <= MAX; i++) {
    if ((i % 2) == 0) incr++;
    A[i] = incr;
}

Page 43: Multi-core Programming


Lastprivate Clause

• Variables update shared variable using value from last iteration
• C++ objects are updated as if by assignment

void sq2(int n, double *lastterm)
{
    double x;
    int i;
    #pragma omp parallel
    #pragma omp for lastprivate(x)
    for (i = 0; i < n; i++) {
        x = a[i]*a[i] + b[i]*b[i];
        b[i] = sqrt(x);
    }
    *lastterm = x;   // x holds the value from the final iteration
}

Page 44: Multi-core Programming


Threadprivate Clause

• Preserves global scope for per-thread storage
• Legal for name-space-scope and file-scope variables
• Use copyin to initialize from master thread

struct Astruct A;
#pragma omp threadprivate(A)
…

#pragma omp parallel copyin(A)
   do_something_to(&A);
…

#pragma omp parallel
   do_something_else_to(&A);

Private copies of "A" persist between regions

Page 45: Multi-core Programming


Performance Issues

• Idle threads do no useful work
• Divide work among threads as evenly as possible
  – Threads should finish parallel tasks at the same time
• Synchronization may be necessary
  – Minimize time waiting for protected resources

Page 46: Multi-core Programming


Load Imbalance

• Unequal work loads lead to idle threads and wasted time.

#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) {
        ...
    }
}

[Timeline: threads shown busy for unequal durations, then idle until the slowest thread finishes]

Page 47: Multi-core Programming


Synchronization

• Lost time waiting for locks

#pragma omp parallel
{
    #pragma omp critical
    { ... }
    ...
}

[Timeline: threads alternate between busy time, idle time waiting for the lock, and time inside the critical section]

Page 48: Multi-core Programming


Performance Tuning

• Profilers use sampling to provide performance data
• Traditional profilers are of limited use for tuning OpenMP*; they:
  – Measure CPU time, not wall clock time
  – Do not report contention for synchronization objects
  – Cannot report load imbalance
  – Are unaware of OpenMP constructs

Programmers need profilers specifically designed for OpenMP.

Page 49: Multi-core Programming


Static Scheduling: Doing It By Hand

• Must know:
  – Number of threads (Nthrds)
  – Each thread's ID number (id)
• Compute start and end iterations:

#pragma omp parallel
{
    int i, istart, iend;
    int id = omp_get_thread_num();       // runtime calls from the API slide
    int Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) {
        c[i] = a[i] + b[i];
    }
}