PARALLEL PROCESSING/PROGRAMMING CATALIN BOJA [email protected] BUCHAREST UNIVERSITY OF ECONOMIC STUDIES



Page 1

PARALLEL PROCESSING/PROGRAMMING

CATALIN BOJA

[email protected]

BUCHAREST UNIVERSITY OF ECONOMIC STUDIES

Page 2

MULTI-PROCESS VS MULTI-THREAD

https://computing.llnl.gov/tutorials/pthreads/ http://www.javamex.com/tutorials/threads/how_threads_work.shtml

Page 3

MULTI-PROCESS VS MULTI-THREAD

• A process is an instance of a computer program that is being executed. It contains the program code and its current activity.

• Different processes do not share these resources.

• Each process has its own address space.

• A process contains all the information needed to execute the program (ID, program code, data, global data, heap data).

• A thread of execution is the smallest unit of processing that can be scheduled by an operating system.

• A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory. The threads of a process share the latter’s instructions (code) and its context (the values that its variables reference at any given moment).

http://en.wikipedia.org/wiki/Process_(computing)

Page 4

MULTI-PROCESS VS MULTI-THREAD

• The thread model is an extension of the process model.

• Each process consists of multiple independent instruction streams (or threads) that are assigned computer resources by some scheduling procedure.

• Threads of a process share the address space of that process. Global variables and all dynamically allocated data objects are accessible by all threads of a process.

• Each thread has its own run-time stack, registers and program counter.

• Threads can communicate by reading/writing variables in the common address space.

• Thread operations include thread creation, termination, synchronization (joins, blocking), scheduling, data management and process interaction.

• A thread does not maintain a list of created threads, nor does it know the thread that created it.

• Threads in the same process share: process instructions, most data, open files (descriptors), signals and signal handlers, current working directory, user and group ID.

• Each thread has a unique: thread ID, set of registers, stack pointer, stack for local variables, return addresses, signal mask, priority, return value (errno).

• pthread functions return 0 if OK.

Page 5

THREADS CONCEPTS

• Thread synchronization – establishing running conditions between threads:

• A thread will wait until another one has reached a predefined point or has released a required resource

• A thread will transmit data to another one when the latter is able to receive it

• The main thread will continue execution when all other threads have finished their tasks

• Thread communication – sending and receiving data between threads

Page 6

THREADS CONCEPTS

• Atomic operations – operations which are executed at once (in the same processor time) by a thread, and not partially; the thread will not execute only parts of the operation; it’s all or nothing

• Race condition – 2 or more threads have simultaneous read/write access to the same resource (object, variable, file) and the operation is not atomic

• Critical section – a code sequence, executed by multiple threads at the same time, which accesses a shared resource

• Deadlock – 2 or more threads blocking each other, as each of them holds a lock on a resource needed by the others
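The race condition above can be reproduced with a short std::thread sketch (names are illustrative, not from the slides): two threads increment a plain int; because counter++ is a read-modify-write, interleaved increments can be lost.

```cpp
#include <thread>

// Two threads increment a shared, unsynchronized counter.
// counter++ is not atomic (load, add, store), so increments
// from the two threads can interleave and be lost.
int race_demo(int iterations) {
    int counter = 0;                       // shared resource
    auto work = [&counter, iterations]() {
        for (int i = 0; i < iterations; i++)
            counter++;                     // unprotected critical section
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    // With a race, the result is often < 2 * iterations.
    return counter;
}
```

Running this repeatedly typically produces different totals, which is exactly why the operation needs a mutex or an atomic type.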

Page 7

THREADS CONCEPTS

• Mutual exclusion – preventing 2 or more threads from accessing a critical section at the same time (mutex); problem identified by Edsger W. Dijkstra in 1965

• Lock – a mechanism that allows a thread to block other threads’ access while it is executing the critical section

• Mutex – a Mutual Exclusion Object that implements a locking mechanism

• Semaphore – a signaling mechanism that allows threads to communicate (a mutex is a binary semaphore)

Page 8

MULTI-THREADS IN C++

• C++11 introduced multi-threading support based on the <thread> library and the std::thread class

• Different from the POSIX pthread library (which is a C library)

• Provides a simplified syntax that integrates support for mutexes, condition variables and locks

Page 9

MULTI-THREADS IN C++

#include <iostream>
#include <thread>

void thread_function() {
    std::cout << "Hello World!";
}

int main() {
    //start a thread
    std::thread t1(thread_function);
    //join the thread with main
    t1.join();
}

• Threads are created based on existing functions

• Constructing a thread object automatically starts a new thread

• Joining the thread with the main one is done by calling join()

Page 10

MULTI-THREADS IN C++ - SHARING DATA

static const int ITERATIONS = 1000; //example value, not given on the slide
static int SUM = 0;

static void increment(int iterations, int& s) {
    for (int j = 0; j < iterations; j++)
        for (int i = 0; i < iterations; i++)
            s += 1;
}

int main() {
    std::thread t1(increment, ITERATIONS, std::ref(SUM));
    std::thread t2(increment, ITERATIONS, std::ref(SUM));
    t1.join(); //both threads must be joined before main returns
    t2.join();
}

both threads will share SUM – an unsynchronized access, i.e. a race condition

Page 11

MULTI-THREADS IN C++ - PROTECTING SHARED DATA

• Semaphores – a mutex is a binary semaphore

• Atomic references

• Monitors – implemented in Java using the synchronization mechanism

• Condition variables

• Compare-and-swap – checks if a memory variable has a given expected value and, if true, modifies it
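In C++11, compare-and-swap is exposed through std::atomic's compare_exchange operations. A minimal single-threaded sketch of the semantics (the function name is illustrative):

```cpp
#include <atomic>

// compare_exchange_strong(expected, desired):
// if the atomic equals `expected`, store `desired` and return true;
// otherwise copy the current value into `expected` and return false.
bool cas_demo() {
    std::atomic<int> value{10};
    int expected = 10;
    bool swapped = value.compare_exchange_strong(expected, 20); // succeeds
    if (!swapped || value.load() != 20) return false;

    expected = 10;                                              // stale guess
    swapped = value.compare_exchange_strong(expected, 30);      // fails
    // on failure, `expected` is updated to the current value (20)
    return !swapped && expected == 20 && value.load() == 20;
}
```

The failure path (expected gets refreshed to the observed value) is what makes lock-free retry loops possible.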

Page 12

MULTI-THREADS IN C++ - MUTEX

#include <mutex>

class Counter {
private:
    int counter = 0;
    std::mutex mutex;
public:
    void increment() {
        mutex.lock();
        counter += 1;
        mutex.unlock();
    }
};

• A mutex is implemented by the std::mutex class defined in the <mutex> library

• A lock is acquired by calling the lock() method

• The lock is released by calling unlock()

Page 13

MULTI-THREADS IN C++ - MUTEX AND RAII

class RAIICounter {
private:
    Counter counter;
    std::mutex mutex;
public:
    void increment() {
        mutex.lock();
        this->counter.increment();
        mutex.unlock();
    }
};

• Use the RAII - "resource acquisition is initialization" technique

• http://www.stroustrup.com/bs_faq2.html#finally

• Lock the call to the function – you avoid locking and unlocking the mutex on all execution paths (even exceptions)

Page 14

MULTI-THREADS IN C++ - MUTEX AND RAII

class Counter {
private:
    int counter = 0;
    std::mutex mutex;
public:
    void safe_increment()
    {
        std::lock_guard<std::mutex> lock(mutex);
        counter += 1;
        // mutex is automatically released when lock
        // goes out of scope
    }
};

• lock_guard is a mutex wrapper that provides a convenient RAII-style mechanism for owning a mutex for the duration of a scoped block

• the mutex is automatically released when the lock goes out of scope

• you avoid locking and unlocking the mutex on all execution paths (even exceptions)

Page 15

MULTI-THREADS IN C++ - ADVANCED MUTEXES

• Recursive Locking: std::recursive_mutex

• enables the same thread to lock the same mutex twice without deadlocking

• Timed Locking: std::timed_mutex, std::recursive_timed_mutex

• enables a thread to give up waiting for a lock after a timeout and do something else

• Call once: std::call_once(std::once_flag flag, function);

• the function will be called only one time, no matter how many threads are launched. Each std::call_once is matched to a std::once_flag variable.
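A small sketch of std::call_once (names are illustrative): two threads race to run an initializer, but the shared once_flag guarantees it executes exactly once.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

std::once_flag init_flag;          // one flag per call_once site
std::atomic<int> init_count{0};

void init() { init_count++; }      // should run exactly once

int call_once_demo() {
    std::thread t1([] { std::call_once(init_flag, init); });
    std::thread t2([] { std::call_once(init_flag, init); });
    t1.join();
    t2.join();
    return init_count.load();      // 1, no matter which thread won the race
}
```

This is the idiomatic replacement for hand-rolled "initialized" boolean checks, which are themselves race conditions.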

Page 16

MULTI-THREADS IN C++ - ATOMIC

#include <atomic>

class AtomiCounter {
private:
    std::atomic<int> counter{0};
public:
    void increment() {
        counter += 1;
    }
    int getCounter() {
        return counter.load();
    }
};

• C++11 introduced atomic types in the <atomic> library

• It’s a template class, std::atomic<type>

• operations on that variable will be atomic and so thread-safe

• Different locking techniques are applied based on the data type and size

• lock-free technique: integral types like int, long, float. It is much faster than the mutex technique

• mutex technique: for big types (such as 2MB storage). There is no performance advantage of atomic types over mutexes

Page 17

MULTI-THREADS IN C++ - PROBLEMS

• Race conditions

• Shared variables

• Load balancing

• Cache coherence

• Cache false sharing

• Deadlocks

• Synchronization overhead

Page 18

OPENMP – PROGRAMMING MODEL

Shared memory, thread-based parallelism

OpenMP is based on the existence of multiple threads in the shared memory programming paradigm. A shared memory process consists of multiple threads.

Explicit Parallelism

The programmer has full control over parallelization. OpenMP is not an automatic parallel programming model.

Compiler directive based

Most OpenMP parallelism is specified through the use of compiler directives which are embedded in the source code.

Page 19

OPENMP ARCHITECTURE

Page 20

OPENMP – WHAT IT IS NOT

Necessarily implemented identically by all vendors

Meant for distributed-memory parallel systems (it is designed for shared-address-space machines)

Guaranteed to make the most efficient use of shared memory

Required to check for data dependencies, data conflicts, race conditions, or deadlocks

Required to check for code sequences

Meant to cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization

Designed to guarantee that input or output to the same file is synchronous when executed in parallel

Page 21

OPENMP

An OpenMP program begins as a single process: the master thread (shown in red/grey in the pictures). The master thread executes sequentially until the first parallel region construct is encountered.

When a parallel region is encountered, the master thread:

Creates a group of threads by FORK.

Becomes the master of this group of threads, and is assigned the thread id 0 within the group.

The statements in the program that are enclosed by the parallel region construct are then executed in parallel among these threads.

JOIN: When the threads complete executing the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

Page 22

OPENMP

Page 23

OPENMP

I/O

OpenMP does not specify parallel I/O. It is up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded program.

Memory Model

Threads can “cache” their data and are not required to maintain exact consistency with real memory all of the time. When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is updated by all threads as needed.

Page 24

OPENMP – HELLO WORLD

#include <stdlib.h>
#include <stdio.h>
#include "omp.h"

int main()
{
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf("Hello (%d)\n", ID);
        printf(" world (%d)\n", ID);
    }
}

Set # of threads for OpenMP:

- In csh: setenv OMP_NUM_THREADS 8

- In bash: export OMP_NUM_THREADS=8

Compile: g++ -fopenmp hello.c

Run: ./a.out

Page 25

OPENMP

#include "omp.h"

int main()
{
    int var1, var2, var3;

    // 1. Serial code
    . . .

    // 2. Beginning of parallel section.
    // Fork a team of threads. Specify variable scoping
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        // 3. Parallel section executed by all threads
        . . .
        // 4. All threads join master thread and disband
    }

    // 5. Resume serial code
    . . .
}

Page 26

OPENMP - C/C++ DIRECTIVE FORMAT

• C/C++ use compiler directives

• Prefix: #pragma omp …

• A directive consists of a directive name followed by clauses

#pragma omp parallel [clause list]

• General Rules:

• Case sensitive

• Only one directive-name may be specified per directive

• Each directive applies to at most one succeeding statement, which must be a structured block

• Long directive lines can be “continued” on succeeding lines by escaping the newline character with a backslash “\” at the end of a directive line

Page 27

OPENMP - C/C++ DIRECTIVE FORMAT

#pragma omp parallel [clause list]

Typical clauses in [clause list]:

• Conditional parallelization

• if (scalar expression): determines whether the parallel construct creates threads

• Degree of concurrency

• num_threads (integer expression): number of threads to create

• Data Scoping

• private (variable list): specifies variables local to each thread

• firstprivate (variable list): similar to private, but the private variables are initialized to the variable’s value before the parallel directive

• shared (variable list): specifies variables that are shared among all the threads

• default (data scoping specifier): the default data scoping specifier, may be shared or none

Page 28

OPENMP - C/C++ DIRECTIVE FORMAT

int is_parallel = 1;
int var_c = 10;
int var_b = 0;
int var_a; //declared here, initialized inside the parallel block
//….

#pragma omp parallel if (is_parallel == 1) num_threads(8) \
    shared (var_b) private (var_a) firstprivate (var_c) \
    default (none)
{
    var_a = 0; //must be initialized
    /* structured block */
}

if (is_parallel == 1) num_threads(8)

• If the value of the variable is_parallel is one, then 8 threads are created

shared (var_b)

• All threads share a single copy of variable var_b

private (var_a) firstprivate (var_c)

• Each thread gets private copies of variables var_a and var_c

• Each private copy of var_c is initialized with the value of var_c in the main thread when the parallel directive is encountered

• var_a must be initialized in the parallel block

default (none)

• The default state of a variable is specified as none (rather than shared)

• Signals an error if not all variables are specified as shared or private

Page 29

OPENMP – NUMBER OF THREADS

The number of threads in a parallel region is determined by the following factors, in order of precedence:

1. Evaluation of the if clause

2. Setting of the num_threads() clause

3. Use of the omp_set_num_threads() library function

4. Setting of the OMP_NUM_THREADS environment variable

5. Implementation default – usually the number of cores on a node

Threads are numbered from 0 (master thread) to N-1

Page 30

OPENMP – RUNTIME LIBRARY METHODS

• omp_get_max_threads() – returns the maximum number of threads that can be used if no num_threads clause is encountered; calling omp_set_num_threads() will change this value

• omp_get_num_threads() – returns the number of threads currently executing the parallel region from which it is called

• omp_get_thread_num() – returns the ID of the current thread executing the parallel region from which it is called (master is always 0)

• omp_get_wtime() – returns a value in seconds of the time elapsed from some arbitrary, but consistent, point

Page 31

OPENMP – EXAMPLE

• Requests 4 threads to run the parallel region

• Each thread runs a copy of the { } block

• tId – a private variable that gets the ID of each running thread

• nThreads – a variable shared by all threads that is initialized by the master thread (id = 0) to the number of available threads

Page 32

OPENMP – EXAMPLE

Page 33

OPENMP – PERFORMANCE PROBLEMS

• Race conditions – generated by shared variables

• Load balancing – generated by a non-equal distribution of the processing effort among threads

• Cache coherence – generated by shared variables

• Cache false sharing – generated by variables that share the same cache line

• Synchronization overhead – generated by locking operations

Page 34

OPENMP – SYNCHRONIZATION

[Diagrams: Barrier – threads 1–4 each wait at the barrier until all of them have reached it, then all continue. Locking critical sections – threads wait and execute the critical section one at a time.]

Barrier / Locking critical sections

Page 35

OPENMP – SYNCHRONIZATION

• #pragma omp barrier

• all threads need to reach it in order to continue

• #pragma omp critical

• only one thread at a time can execute the enclosed section

• implements a mutual exclusion lock - mutex

• #pragma omp atomic

• only one thread at a time can execute the atomic statement

• uses special constructs in hardware that speed up updating the memory for simple operations: +=, -=, *=, /=, ++ (postfix and prefix), -- (postfix and prefix)

• dependent on hardware – if not available it will behave like critical

Page 36

OPENMP - WORK SHARING CONSTRUCTS

A parallel construct by itself creates a “Single Program Multiple Data” (SPMD) program, i.e., each thread executes the same code.

Work-sharing is to split up pathways through the code between threads within a team:

• Loop construct (for/do): concurrent loop iterations

• Sections/section constructs: concurrent tasks

• Single construct

• Tasks

Page 37

OPENMP - WORK SHARING CONSTRUCTS

• Work-sharing directives allow concurrency between iterations or tasks

• A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to execute in parallel

• Work-sharing constructs do not create new threads

• Work-sharing constructs must be encountered by all members of a team or none at all

Page 38

OPENMP - WORK SHARING CONSTRUCTS

Work-Sharing do/for Directives:

• Share iterations of a loop across the group

• Represent “data parallelism”

• the for directive partitions parallel iterations across threads in C/C++

• do is the analogous directive in Fortran

• Implicit barrier at the end of the for loop

#pragma omp for [clause list]
/* for loop */

int main() {
    int nthreads, tid;
    omp_set_num_threads(3);
    #pragma omp parallel private(tid)
    {
        int i;
        tid = omp_get_thread_num();
        printf("Hello world from (%d)\n", tid);
        #pragma omp for
        for(i = 0; i <= 4; i++)
        {
            printf("Iteration %d by %d\n", i, tid);
        }
    } // all threads join master thread and terminate
}

Page 39

OPENMP - WORK SHARING CONSTRUCTS

Sequential code to add two vectors:

for(int i = 0 ; i < N ; i++) {
    c[i] = b[i] + a[i];
}

Parallel logic to add two vectors:

Page 40

OPENMP - WORK SHARING CONSTRUCTS

//OpenMP implementation 1 (not desired):
#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id*N/Nthrds;
    iend = (id+1)*N/Nthrds;
    if(id == Nthrds-1) iend = N;
    for(i = istart; i < iend; i++) {
        c[i] = b[i]+a[i];
    }
}

//A worksharing for construct to add vectors:
#pragma omp parallel
{
    #pragma omp for
    for(i=0; i<N; i++) { c[i]=b[i]+a[i]; }
}

//A worksharing for construct to add vectors:
#pragma omp parallel for
for(i=0; i<N; i++) { c[i]=b[i]+a[i]; }

Page 41

OPENMP - WORK SHARING CONSTRUCTS

for directive syntax:

#pragma omp for [clause list]
    schedule (type [,chunk])
    ordered
    private (variable list)
    firstprivate (variable list)
    shared (variable list)
    reduction (operator: variable list)
    collapse (n)
    nowait
/* for_loop */

Directive restrictions for the “for loop” that follows the omp for directive:

• It must NOT have a break statement

• The loop control variable must be an integer

• The initialization expression of the “for loop” must be an integer assignment

• The logical expression must be one of <, ≤, >, ≥

• The increment expression must have integer increments or decrements only

Page 42

OPENMP – DATA ENVIRONMENT

Changing Storage Attributes

• shared – the variable is shared by all threads (watch out for race conditions)

• private – the variable has a private copy in each thread

• firstprivate – the variable has a private copy in each thread, initialized with the global value from before the parallel region

• lastprivate – the variable has a private copy in each thread, and the value from the last iteration is copied back to the global variable after the parallel region

• default(private|shared|none) – for private and shared, interprets variables as that by default; for none, forces the programmer to define each used variable as private or shared

Page 43

OPENMP – DATA ENVIRONMENT

#include <iostream>
using namespace std;

void something(){
    int temp = 0;
    #pragma omp parallel for private(temp)
    for(int i=0;i<100;i++)
        temp+=i;
    cout << endl << "temp =" << temp;
}

• private does NOT initialize the value

• using a private variable before assigning it inside the region reads an undefined value

• to retain the global value it must be declared as firstprivate

• outside of the parallel loop, private variables lose their value

Page 44

OPENMP - WORK SHARING CONSTRUCTS

How to combine values into a single accumulation variable (avg)?

//Sequential code to compute the average value of an array-vector:
{
    double avg = 0.0, A[MAX];
    int i;
    for(i = 0; i < MAX; i++) {
        avg += A[i];
    }
    avg /= MAX;
}

Page 45

OPENMP - WORK SHARING CONSTRUCTS

reduction clause

• reduction (operator: variable list)

• combines local copies of a variable in different threads into a single copy in the master when threads exit

• variables in the variable list are implicitly private to threads

• operators used in the reduction clause: +, *, -, &, |, ^, &&, and ||

int VB = 0;
#pragma omp parallel reduction(+: VB) num_threads(4)
{
    /* compute local VBs in each thread */
}
/* VB here contains the sum of all local instances of VB */

Page 46

OPENMP - WORK SHARING CONSTRUCTS

reduction clause

• Used inside a parallel or a work-sharing construct:

• A local copy of each list variable is made and initialized depending on the operator (e.g. 0 for “+”)

• The compiler finds standard reduction expressions containing the operator and uses them to update the local copy

• Local copies are reduced into a single value and combined with the original global value when control returns to the master thread

Reduction Operators/Initial Values in C/C++ OpenMP:

Operator  Initial Value
   +           0
   *           1
   -           0
   &           ~0
   |           0
   ^           0
   &&          1
   ||          0

Page 47

OPENMP - WORK SHARING CONSTRUCTS

//A work-sharing for computing the average value of a vector:
{
    double avg = 0.0, A[MAX];
    int i;
    #pragma omp parallel for reduction (+: avg)
    for(i = 0; i < MAX; i++) { avg += A[i]; }
    avg /= MAX;
}

• avg – is a local variable in each thread of the parallel region

• After the for loop, the external avg variable becomes the sum of the local avg variables
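What the reduction clause does can be mimicked without OpenMP: each thread accumulates into its own local copy (initialized to the operator's identity, 0 for +), and the local copies are combined once at the end. A hedged std::thread sketch of that idea (names and the interleaved split are illustrative):

```cpp
#include <thread>
#include <vector>

// Manual "+" reduction: per-thread partial sums, combined after the join,
// so threads never write a shared accumulator concurrently.
double reduce_sum(const std::vector<double>& a, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);   // identity value for "+"
    std::vector<std::thread> threads;
    int n = (int)a.size();
    for (int t = 0; t < nthreads; t++) {
        threads.emplace_back([&, t]() {
            for (int i = t; i < n; i += nthreads) // interleaved iteration split
                partial[t] += a[i];
        });
    }
    for (auto& th : threads) th.join();
    double sum = 0.0;                             // final combine step
    for (double p : partial) sum += p;
    return sum;
}
```

The OpenMP runtime performs essentially these steps for you, which is why reduction variables are implicitly private inside the region.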

Page 48

OPENMP - MATRIX-VECTOR MULTIPLICATION

#pragma omp parallel for default (none) \
    shared (a, b, c, m, n) private (j, sum) \
    num_threads(4)
for(i = 0; i < m; i++)
{
    sum = 0.0;
    for(j = 0; j < n; j++)
        sum += b[i][j]*c[j];
    a[i] = sum;
}
………

Page 49

OPENMP - WORK SHARING CONSTRUCTS

for schedule clause

• Describes how iterations of the loop are divided among the threads in the group. The default schedule is implementation dependent.

schedule (scheduling_class[, parameter])

Schedule classes:

static
dynamic
guided
runtime
auto

Page 50

OPENMP - WORK SHARING CONSTRUCTS

• schedule (static [, chunk]) - Loop iterations are divided into pieces of size chunk and then statically (at compile time) assigned to threads. If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads.

• schedule (dynamic [, chunk]) - Loop iterations are divided into pieces of size chunk and then dynamically assigned to threads. When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.

• schedule (guided [, chunk]) - For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (k>1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). The default chunk size is 1.


OPENMP - WORK SHARING CONSTRUCTS

• schedule (runtime) - The scheduling decision is deferred until runtime by the

environment variable OMP_SCHEDULE. It is illegal to specify a chunk size for

this clause

• schedule (auto) - The scheduling decision is made by the compiler and/or

runtime system (not supported in OpenMP 2.0 – VS)


OPENMP - WORK SHARING CONSTRUCTS

Static

• Predictable

• Mapping pre-determined by the programmer

• Reduces run-time overhead

Dynamic

• Unpredictable

• Mapping determined at run-time

• More complex logic at run-time, which adds overhead


OPENMP - WORK SHARING CONSTRUCTS

Find loop-intensive routines

Remove loop-carried dependencies

Implement a work-sharing construct


OPENMP - MATRIX MULTIPLICATION

// Static schedule maps iterations to threads in a fixed, predictable pattern

// static scheduling of matrix multiplication loops

#pragma omp parallel default(none) \

shared(a, b, c, dim) private(i, j, k) \

num_threads(4)

#pragma omp for schedule(static)

for(i = 0; i < dim; i++)

{

for(j = 0; j < dim; j++)

{

c[i][j] = 0.0;

for(k = 0; k < dim; k++)

c[i][j] += a[i][k]*b[k][j];

}

}

Static scheduling - 16 iterations, 4 threads:


OPENMP - WORK SHARING CONSTRUCTS

All work-sharing constructs have an implicit barrier at their end, as does the parallel construct itself


OPENMP - WORK SHARING CONSTRUCTS

#pragma omp parallel

{

#pragma omp for

for (int i = 0; i < 10; i++) {

//...work sharing

}//implicit barrier

#pragma omp for nowait

for (int i = 0; i < 10; i++) {

//...work sharing

}//no implicit barrier

}

nowait for clause

• A for construct has an implicit barrier at the end

• The nowait clause instructs the threads to continue execution without waiting for the others


OPENMP - WORK SHARING CONSTRUCTS

#pragma omp parallel

{

//parallel section for all threads

#pragma omp master

{

//section only for master thread

}

}

master parallel clause

• defines a block that is executed only by the master thread (thread id = 0)


OPENMP - WORK SHARING CONSTRUCTS

#pragma omp parallel

{

//parallel section for all threads

#pragma omp single

{

//only one thread will execute it

}

}

single parallel clause

• identifies a section of code that is executed by a single thread

• which thread executes the single section depends on the runtime environment

• with copyprivate(variable), the value is broadcast to the other threads' private copies


OPENMP – SECTIONS CONSTRUCT

#pragma omp parallel sections

{

#pragma omp section

{

//only one thread

}

#pragma omp section

{

//only one thread

}

}

sections construct

• the sections construct contains one or more #pragma omp section blocks

• each section is executed by exactly one thread

• which thread gets a given section depends on the runtime environment

• if there are more threads than sections, the extra threads wait


OPENMP – LOCKS

• omp_init_lock() – initializes/creates a mutex lock through an omp_lock_t variable

• omp_set_lock() – acquires the lock (blocks until it is available)

• omp_unset_lock() – releases the lock

• omp_destroy_lock() – destroys the lock

• omp_test_lock() – tries to acquire the lock without blocking; if it is already set, the thread can do other processing


OPENMP – RUNTIME LIBRARY METHODS

• omp_in_parallel() – returns true if called from within a parallel region

• omp_set_dynamic() – enables/disables run-time adjustment of the number of threads available in the next parallel region

• omp_get_dynamic() – returns non-zero if the number of threads available in the next parallel region can be adjusted by the run time

• omp_get_num_procs() – returns the number of processors available


OPENMP – RUNTIME ROUTINES

int nThreaduri;

omp_set_dynamic(0);

omp_set_num_threads(omp_get_num_procs());

#pragma omp parallel

{

int id = omp_get_thread_num();

#pragma omp single

nThreaduri = omp_get_num_threads();

//parallel section

}

• disable dynamic thread allocation

• request a number of threads equal to the available processors

• get each thread's id

• get the total number of threads only once


OPENMP – ENVIRONMENT VARIABLES

• OMP_NUM_THREADS – a default number of threads to use

• OMP_STACKSIZE – defines the stack size for each thread

• OMP_WAIT_POLICY – defines the default wait policy (ACTIVE | PASSIVE)

• OMP_PROC_BIND – binds threads to cores (TRUE | FALSE); with FALSE (the default), depending on load the OS can move threads to a different core with a lower load


OPENMP - TASKS

• introduced in OMP 3.0 (not supported by VS)

• allows parallelization of recursive/dependent routines


PARALLEL PROGRAMMING PROBLEMS

• Min, Max and other statistical data

• Sorting

• Vector and matrix processing

• Searching (Map – Reduce)