Introducción a la Programación Paralela (Memoria Compartida)
Casiano Rodríguez León, casiano@ull.es
Departamento de Estadística, Investigación Operativa y Computación

Page 1:

Introducción a la Programación Paralela (Memoria Compartida)

Casiano Rodríguez León
casiano@ull.es
Departamento de Estadística, Investigación Operativa y Computación.

Page 2:

Introduction

Page 3:

What is parallel computing?

• Parallel computing: the use of multiple computers or processors working together on a common task.

  – Each processor works on its section of the problem

  – Processors are allowed to exchange information with other processors

[Figure: a grid of the problem to be solved (x-y plane) divided into four areas; CPU #1, #2, #3 and #4 each work on their own area and exchange data at the boundaries.]

Page 4:

Why do parallel computing?

• Limits of single CPU computing

– Available memory

– Performance

• Parallel computing allows us to:

  – Solve problems that don’t fit on a single CPU

  – Solve problems that can’t be solved in a reasonable time

• We can run…

– Larger problems

– Faster

– More cases

Page 5:

Performance Considerations

Page 6:

Speedup

SPEEDUP = TIME OF THE FASTEST SEQUENTIAL ALGORITHM / TIME OF THE PARALLEL ALGORITHM

SPEEDUP ≤ NUMBER OF PROCESSORS

• Consider a parallel algorithm that runs in T steps on P processors.

• It is a simple fact that the parallel algorithm can be simulated by a sequential machine in T×P steps.

• Hence the best sequential algorithm runs in Tbest_seq ≤ T×P, so SPEEDUP = Tbest_seq / T ≤ P.
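For instance (illustrative numbers, not from the original slides): if a parallel algorithm takes T = 100 steps on P = 4 processors, a sequential machine can simulate it in T×P = 400 steps, so Tbest_seq ≤ 400 and SPEEDUP = Tbest_seq / 100 ≤ 4.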

Page 7:

Amdahl’s Law

• Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors.

  – Effect of multiple processors on run time:

      tP = (fp / P + fs) · t1

  – Effect of multiple processors on speedup:

      S = 1 / (fs + fp / P)

  – Where

    • fs = serial fraction of code

    • fp = parallel fraction of code (fs + fp = 1)

    • P = number of processors
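For example (a worked instance of the formula): with fp = 0.99 and P = 250 processors, S = 1 / (0.01 + 0.99/250) ≈ 72. Since S tends to 1/fs as P grows, even 1% of serial code caps the achievable speedup at 100, no matter how many processors are used.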

Page 8:

Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Figure: speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990 and 0.900; the curves flatten rapidly as fp drops below 1.]

Page 9:

Amdahl’s Law vs. Reality

Amdahl’s Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.

[Figure: Amdahl’s Law vs. measured reality for fp = 0.99, up to 250 processors; communication costs make the real speedup curve fall well below the theoretical one.]

Page 10:

Memory/Cache

[Figure: the memory hierarchy: CPU, cache, main memory. Moving away from the CPU, speed decreases while size increases and cost ($/bit) decreases.]

Page 11:

Locality and Caches

a[i] = b[i] + c[i]

• On uniprocessor systems, memory is monolithic from a correctness point of view; from a performance point of view it is not.

• It might take more time to bring a[i] from memory than to bring b[i].

• Bringing in a[i] at one point in time might take longer than bringing it in at a later point in time.

[Figure: a processor with split L1 and L2 instruction/data caches in front of separate instruction and data memory.]

Page 12:

Spatial locality: when an element is referenced, its neighbors will be referenced too.

Temporal locality: when an element is referenced, it might be referenced again soon.

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[j][i] = 0;

a[j][i] and a[j+1][i] have stride n, n being the dimension of a. There is a stride-1 access to a[j][i+1], but it occurs n iterations after the reference to a[j][i].

Spatial locality is enhanced if the loops are exchanged:

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[i][j] = 0;
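A minimal sketch to observe the effect (assumptions not in the original slides: a 2048×2048 matrix of doubles, large enough to exceed the caches, and omp_get_wtime() for timing):

#include <stdio.h>
#include <omp.h>

#define N 2048
static double a[N][N];

int main() {
  int i, j;
  double t;

  t = omp_get_wtime();                  /* column-wise: stride-N accesses */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      a[j][i] = 0;
  printf("column order: %g s\n", omp_get_wtime() - t);

  t = omp_get_wtime();                  /* row-wise: stride-1 accesses */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      a[i][j] = 0;
  printf("row order:    %g s\n", omp_get_wtime() - t);
  return 0;
}

On most cache-based machines the row-wise version runs several times faster, even though both loops perform exactly the same assignments.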

Page 13:

Shared Memory Machines

Page 14:

Shared and Distributed Memory

Shared memory: single address space. All processors have access to a pool of shared memory. Methods of memory access: bus, crossbar.

[Figure: several processors P connected through a bus to a single shared memory.]

Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors.

[Figure: processor/memory (P/M) pairs connected by a network.]

Page 15:

Styles of Shared Memory: UMA and NUMA

Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs).

[Figure: processors P on a bus sharing a single memory.]

Non-uniform memory access (NUMA): time for memory access depends on the location of data. Local access is faster than non-local access. Easier to scale than SMPs.

[Figure: two bus-based processor/memory nodes connected by a network.]

Page 16:

UMA: Memory Access Problems

• Conventional wisdom is that systems do not scale well

– Bus based systems can become saturated

– Fast large crossbars are expensive

• Cache coherence problem

  – Copies of a variable can be present in multiple caches

  – A write by one processor may not become visible to others

  – They’ll keep accessing the stale value in their caches

  – Need to take actions to ensure visibility, i.e. cache coherence

Page 17:

Cache coherence problem

[Figure: processors P1, P2 and P3, each with a cache ($), on a bus with memory and I/O devices; u = 5 in memory. Events: (1) P1 reads u; (2) P3 reads u (both now cache u:5); (3) P3 writes u = 7; (4) P1 reads u again; (5) P2 reads u. What values of u are read?]

• Processors see different values for u after event 3

• With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when

• Processes accessing main memory may see the old value

Page 18:

Snooping-based coherence

• Basic idea:

  – Transactions on memory are visible to all processors

  – Processors or their representatives can snoop (monitor) the bus and take action on relevant events

• Implementation

  – When a processor writes a value, a signal is sent over the bus

  – The signal is either

    • Write invalidate: tell others the cached value is invalid

    • Write broadcast: tell others the new value

Page 19:

Memory Consistency Models

P0:

While (there are more tasks) {
  Task = GetFromFreeList();
  Task->data = ...;
  insert Task in task queue;
}
Head = head of task queue;

P1, P2, P3, ..., Pn-1:

While (MyTask == NULL) {
  Begin Critical Section;
  if (Head != NULL) {
    MyTask = Head;
    Head = Head->Next;
  }
  End Critical Section;
}
... = MyTask->data;   /* What value is read here? */

Page 20:

Memory Consistency Models

(Same producer/consumer task-queue example as on the previous slide.)

In some commercial shared memory systems it is possible to observe the old value of MyTask->data!!
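A minimal sketch of the usual OpenMP-level remedy (illustrative, not part of the original slides): commit the write to Task->data before the task is published, and refresh the consumer’s view of memory before using it:

/* producer (P0) */
Task->data = ...;
#pragma omp flush          /* commit data before publishing the task */
/* insert Task in the queue and update Head inside a critical section */

/* consumer (P1 ... Pn-1), after leaving the critical section */
#pragma omp flush          /* refresh this thread's view of memory */
... = MyTask->data;

In OpenMP, the entry and exit of a critical section already imply such flushes; the surprise arises on systems whose native consistency model is weaker than sequential consistency.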

Page 21:

Distributed shared memory (NUMA)

• Consists of N processors and a global address space

  – All processors can see all memory

  – Each processor has some amount of local memory

  – Access to the memory of other processors is slower

• Non-Uniform Memory Access

[Figure: two bus-based processor/memory nodes connected by a network.]

Page 22:

SGI Origin 2000

Page 23:

OpenMP Programming

Page 24:

Origin2000 memory hierarchy

Level Latency (cycles)

register 0

primary cache 2..3

secondary cache 8..10

local main memory & TLB hit 75

remote main memory & TLB hit 250

main memory & TLB miss 2000

page fault 10^6

Page 25:

OpenMP

OpenMP C and C++ Application Program Interface, Draft Version 2.0 (Draft 11.05), November 2001. OpenMP Architecture Review Board.

• http://www.openmp.org/
• http://www.compunity.org/
• http://www.openmp.org/specs/
• http://www.it.kth.se/labs/cs/odinmp/
• http://phase.etl.go.jp/Omni/

Page 26:

Hello World in OpenMP

#include <omp.h>
#include <stdio.h>

int main() {
  int iam = 0, np = 1;
  #pragma omp parallel private(iam, np)   /* parallel region directive with data-scoping clause */
  {
#if defined (_OPENMP)
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
#endif
    printf("Hello from thread %d out of %d\n", iam, np);
  }
}
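To try it with a modern toolchain (an assumption; the talk predates these particular flags), compile with OpenMP support enabled and pick the number of threads through the environment:

gcc -fopenmp hello.c -o hello
OMP_NUM_THREADS=4 ./hello

Each thread prints its own line; compiled without OpenMP support, the pragma is ignored, _OPENMP is undefined, and the program prints a single line.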

Page 27:

Defines a parallel region, to be executed by all the threads in parallel:

#pragma omp parallel private(i, id, p, load, begin, end)
{
  p = omp_get_num_threads();
  id = omp_get_thread_num();
  load = N / p;                 /* block size per thread */
  begin = id * load;
  end = begin + load;
  for (i = begin; ((i < end) && keepon); i++) {
    if (a[i] == x) {
      keepon = 0;
      position = i;
    }
#pragma omp flush(keepon)
  }
}

[Figure: the array a divided into blocks of size load = N/p; each thread scans its own [begin, end) block looking for x.]
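The flush(keepon) directive keeps each thread’s temporary view of keepon consistent with memory on every iteration, so as soon as one thread finds x and sets keepon to 0, the other threads observe the change and abandon their blocks early.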

Page 28:

(Same parallel search code as on the previous slide.)

Page 29:

(Parallel search, continued.)

Search for x = 100 in

A = (1000, ..., 901, 900, ..., 801, ..., 100, ..., 1)

The sequential algorithm traverses 900 elements before finding x.

With P = 10 processors, the array is split into blocks of 100:

A = (1000, ..., 901 | 900, ..., 801 | ... | 100, ..., 1)
         P0              P1          ...       P9

Processor 9 finds x = 100 in its first step.

Speedup = 900/1 > 10 = P: a superlinear speedup, possible because the parallel search does not perform the same work as the sequential one.

Page 30:

A work-sharing (WS) construct distributes the execution of the enclosed statement among the members of the team:

pi = ∫ from 0 to 1 of 4/(1+x²) dx ≈ Σ_{0 ≤ i < N} 4 / (N·(1+((i+0.5)/N)²))

#include <omp.h>
#include <stdio.h>

#define N 1000000   /* number of intervals (value assumed) */

int main() {
  double local, pi = 0.0, w;
  long i;

  w = 1.0 / N;
  #pragma omp parallel private(i, local)
  {
    #pragma omp single
    pi = 0.0;
    #pragma omp for reduction(+: pi)
    for (i = 0; i < N; i++) {
      local = (i + 0.5) * w;
      pi = pi + 4.0 / (1.0 + local * local);
    }
  }
  printf("pi = %f\n", pi * w);   /* scale the sum by the interval width */
  return 0;
}

Page 31:

Nested Parallelism

Page 32:

The expression of nested parallelism in OpenMP has to conform to these two rules:

• A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

• for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.
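A minimal sketch of enabling nested parallelism (illustrative; whether the inner construct actually gets a new team is implementation dependent):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_nested(1);                   /* request nested parallelism */
  #pragma omp parallel num_threads(2)
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(2)
    printf("outer thread %d, inner thread %d\n",
           outer, omp_get_thread_num());
  }
  return 0;
}

With nesting supported and enabled, four lines are printed; otherwise each inner region is executed by a team of one thread and only two lines appear.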

Page 33:

Q6: What about nested parallelism?

A6: Nested parallelism is permitted by the OpenMP specification. Supporting nested parallelism effectively can be difficult, and we expect most vendors will start out by executing nested parallel constructs on a single thread.

OpenMP encourages vendors to experiment with nested parallelism to help us and the users of OpenMP understand the best model and API to include in our specification. We will include the necessary functionality when we understand the issues better.

http://www.openmp.org/index.cgi?faq

Page 34:

NANOS

Ayguade E., Martorell X., Labarta J., Gonzalez M. and Navarro N. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. Proc. of the 1999 International Conference on Parallel Processing, Aizu (Japan), September 1999. http://www.ac.upc.es/nanos/

A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

Page 35:

KAI

Shah S., Haab G., Petersen P. and Throop J. Flexible control structures for parallelism in OpenMP. 1st European Workshop on OpenMP, Lund, Sweden, September 1999. http://developer.intel.com/software/products/trans/kai/

Nodeptr list;
...
#pragma omp taskq
for (nodeptr p = list; p != NULL; p = p->next) {
  #pragma omp task
  process(p->data);
}
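For reference (a later development, not available when this talk was written): OpenMP 3.0 eventually standardized this pattern with the task construct, e.g.:

#pragma omp parallel
#pragma omp single
for (nodeptr p = list; p != NULL; p = p->next) {
  #pragma omp task firstprivate(p)
  process(p->data);
}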

Page 36:

The Workqueuing Model

void traverse(Node &node) {
  process(node.data);
  if (node.has_left) traverse(node.left);
  if (node.has_right) traverse(node.right);
}

With the KAI workqueuing pragmas, each call to process becomes a task enqueued inside a taskq:

void traverse(Node &node) {
  #pragma omp taskq
  {
    #pragma omp task
    process(node.data);
    if (node.has_left) traverse(node.left);
    if (node.has_right) traverse(node.right);
  }
}

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing 37(1): 55-69 (1996). http://supertech.lcs.mit.edu/cilk/

Page 37:

OMNI

Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa Sato and Akinori Yonezawa. Performance Evaluation of OpenMP Applications with Nested Parallelism. Languages, Compilers, and Run-Time Systems for Scalable Computers, pp. 100-112, 2000. http://pdplab.trc.rwcp.or.jp/Omni/

A parallel directive dynamically inside another parallel establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled.

Page 38:

for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

What were the reasons that led the designers to the constraints implied by the second rule?

Simplicity!!

Page 39:

for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Chandra R., Menon R., Dagum L., Kohr D., Maydan D. and McDonald J. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, Academic Press, 2001.

Page 40:

for, sections and single directives that bind to the same parallel are not allowed to be nested inside each other.

Page 122:

“A work-sharing construct divides a piece of work among a team of parallel threads. However, once a thread is executing within a work-sharing construct, it is the only thread executing that code; there is no team of threads executing that specific piece of code anymore, so, it is nonsensical to attempt to further divide a portion of work using a work-sharing construct.”

Nesting of work-sharing constructs is therefore illegal in OpenMP.
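For illustration, a sketch of what the rule forbids (work(), i, j and n are placeholders):

#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < n; i++) {
    #pragma omp for       /* non-conforming: binds to the same parallel */
    for (j = 0; j < n; j++)
      work(i, j);
  }
}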

Page 41:

Divide and Conquer

void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    #pragma ll MASTER
    partition(v, &i, &j, first, last);
    #pragma ll sections firstprivate(i, j)
    {
      #pragma ll section
      qs(v, first, j);
      #pragma ll section
      qs(v, i, last);
    }
  }
}

[Figure: classic divide-and-conquer examples: quickhull partitioning a set of points among processors P1, P2 and P3, and the FFT recursion combining FFT(a[0], a[2], ..., a[N-2]) and FFT(a[1], a[3], ..., a[N-1]) into FFT(a[0], ..., a[N-1]).]

Page 42:

(qs code as on Page 41.)

[Figure: the recursion tree of qs: qs(v, first, last) partitions its block (e.g. 0 1 2 3 4) and spawns qs(v, first, j) and qs(v, i, last) on the two halves, down to single elements.]

Page 43:

(qs code and recursion tree as on the previous slide; the animation continues.)

Page 44:

(qs code and recursion tree as on the previous slide; the animation continues.)

Page 45:

page 14:

“The sections directive identifies a non-iterative work-sharing construct that specifies a set of constructs that are to be divided among threads in a team. Each section is executed once by a thread in the team.”

OpenMP Architecture Review Board: OpenMP C and C++ Application Program Interface v. 1.0, October 1998. http://www.openmp.org/specs/mp-documents/cspec10.ps

(qs code as on Page 41.)

Page 46:

NWS: The Run Time Library

The sections version of qs (Page 41) translates into calls to the run-time library:

void qs(int *v, int first, int last) {
  int i, j;
  if (first < last) {
    MASTER
      partition(v, &i, &j, first, last);
    ll_2_sections(
      ll_2_first_private(i, j),
      qs(v, first, j),
      qs(v, i, last)
    );
  }
}

Page 47:

NWS: The Run Time Library

#define ll_2_sections(decl, f1, f2) \
  if (ll_NUMPROCESSORS > 1) { \
    decl; \
    ll_barrier(); \
    { \
      int ll_oldname = ll_NAME, \
          ll_oldnp = ll_NUMPROCESSORS; \
      /* split the group in two: subgroup 1 runs f2, subgroup 0 runs f1 */ \
      if (ll_DIVIDE(2)) { f2; } \
      else { f1; } \
      ll_REJOIN(ll_oldname, ll_oldnp); \
    } \
  } \
  else { f1; f2; }   /* only one processor: run both sections sequentially */

Page 48:

GROUP BARRIER (ORIGIN STYLE)

void ll_barrier() {
  MASTER {
    int i, all_arrived;
    do {
      all_arrived = 1;
      for (i = 1; i < ll_NUMPROCESSORS; i++)
        if (!(ll_ARRIVED[i])) all_arrived = 0;
    } while (!all_arrived);
    for (i = 1; i < ll_NUMPROCESSORS; i++)
      ll_ARRIVED[i] = 0;
  }
  SLAVE {
    *ll_ARRIVED = 1;          /* signal arrival */
    while (*ll_ARRIVED)       /* wait until the master resets the flag */
      ;
  }
}

[Figure: the ll_ARRIVED flags of processors 0-3 going from 0 to 1 as they arrive and back to 0 when the master releases them.]

Page 49:

int ll_DIVIDE(int ngroups) {
  int ll_group;
  double ll_first, ll_last;

  /* which of the ngroups subgroups this processor falls into */
  ll_group = (int)floor(((double)(ngroups * ll_NAME)) / ((double)ll_NUMPROCESSORS));
  /* first and last processor of that subgroup */
  ll_first = (int)ceil(((double)(ll_NUMPROCESSORS * ll_group)) / ((double)ngroups));
  ll_last = (int)ceil((((double)(ll_NUMPROCESSORS * (ll_group + 1))) / ((double)ngroups)) - 1);
  /* renumber this processor inside its subgroup */
  ll_NUMPROCESSORS = ll_last - ll_first + 1;
  ll_NAME = ll_NAME - ll_first;
  return ll_group;
}

void ll_REJOIN(int old_name, int old_np) {
  ll_NAME = old_name;
  ll_NUMPROCESSORS = old_np;
}

[Figure: ll_DIVIDE(2) splits a group with ll_NUMPROCESSORS = 4 and ll_NAME = 0,1,2,3 into two subgroups {0,1} and {2,3}, each with ll_NUMPROCESSORS = 2 and ll_NAME = 0,1.]

(A bit) more overhead if weights are provided!
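A worked trace (numbers from the figure above): with ll_NUMPROCESSORS = 4 and ngroups = 2, processor ll_NAME = 2 computes ll_group = floor(2·2/4) = 1, ll_first = ceil(4·1/2) = 2 and ll_last = ceil(4·2/2 - 1) = 3, so it continues in subgroup 1 with ll_NUMPROCESSORS = 2, renamed as ll_NAME = 0.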

Page 50:

#define ll_2_first_private(ll_var1, ll_var2) { \
  void **q; \
  MASTER { \
    /* the master publishes the addresses of its copies */ \
    *ll_FIRST_PRIVATE = (void *) malloc(2 * sizeof(void *)); \
    q = *ll_FIRST_PRIVATE; \
    *q = (void *) &(ll_var1); \
    q++; \
    *q = (void *) &(ll_var2); \
  } \
  ll_barrier(); \
  SLAVE { \
    /* each slave copies the master's values into its own variables */ \
    q = *(ll_FIRST_PRIVATE - ll_NAME); \
    memcpy((void *) &(ll_var1), *q, sizeof(ll_var1)); \
    q++; \
    memcpy((void *) &(ll_var2), *q, sizeof(ll_var2)); \
  } \
  ll_barrier(); \
  MASTER free(*ll_FIRST_PRIVATE); \
}

[Figure: ll_FIRST_PRIVATE: the master's pointer array holding &ll_var1 and &ll_var2, read by processors 0-3.]

Page 51:

FAST FOURIER TRANSFORM

[Figure: FFT(a[0], ..., a[N-1]) on processor group {0,1,2,3} splits into FFT(a[0], a[2], ..., a[N-2]) on {0,1} and FFT(a[1], a[3], ..., a[N-1]) on {2,3}.]

Page 52:

void llcFFT(Complex *A, Complex *a, Complex *W, unsigned N, unsigned stride, Complex *D) {
  Complex *B, *C, Aux, *pW;
  unsigned i, n;

  if (N == 1)
#pragma ll MASTER
  {
    A[0].re = a[0].re; A[0].im = a[0].im;
  }
  else {
    n = (N >> 1);
#pragma ll sections
    {
#pragma ll section
      llcFFT(D, a, W, n, stride << 1, A);
#pragma ll section
      llcFFT(D + n, a + stride, W, n, stride << 1, A + n);
    }
    B = D; C = D + n; pW = W;
#pragma ll for private(i)
    for (i = 0; i < n; i++) {
      /* butterfly: combine the two half-size transforms */
      Aux.re = pW->re * C[i].re - pW->im * C[i].im;
      Aux.im = pW->re * C[i].im + pW->im * C[i].re;
      A[i].re = B[i].re + Aux.re;   A[i].im = B[i].im + Aux.im;
      A[i+n].re = B[i].re - Aux.re; A[i+n].im = B[i].im - Aux.im;
      pW += stride;
    }
  }
}

Page 53:

SunFire 6800

[Figure: speedup on a SunFire 6800 with 2, 4, 8 and 16 processors for qs on 50 M integers and FFT on 16 M complex numbers (speedup axis 0-15).]

Page 54:

void kaiFFTdp(Complex *A, ..., int level) {
  ...
  if (level >= CRITICAL_LEVEL)
    seqDandCFFT(A, a, W, N, stride, D);   /* cut-off: solve sequentially */
  else if (N == 1) {
    A[0].re = a[0].re; A[0].im = a[0].im;
  }
  else {
    n = (N >> 1); B = D; C = D + n;
#   pragma omp taskq
    {
#     pragma omp task
      kaiFFTdp(B, a, W, n, stride << 1, A, level + 1);
#     pragma omp task
      kaiFFTdp(C, a + stride, W, n, stride << 1, A + n, level + 1);
    }

Page 55:

#   pragma omp taskq private(i, Aux, pW, j, start, end)
    {
      Workers = omp_get_num_threads();
      CHUNK_SIZE = n / Workers + ((n % Workers) > 0);   /* ceil(n / Workers) */
      for (j = 0; j < n; j += CHUNK_SIZE) {
        start = j;   /* j already advances by CHUNK_SIZE, so it is the chunk start */
        end = min(start + CHUNK_SIZE, n);
        pW = W + start * stride;
#       pragma omp task
        {
          for (i = start; i < end; i++) { ... }
        } /* task */
      }
    } /* taskq */
  }
}

Page 56:

Page 57:

OpenMP: Distributed Memory

Page 58:

OpenMP: Distributed Memory

Barbara Chapman, Piyush Mehrotra and Hans Zima. Enhancing OpenMP With Features for Locality Control. Technical report TR99-02, Inst. for Software Technology and Parallel Systems, U. Vienna, Feb. 1999. http://www.par.univie.ac.at

C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2), pp. 18-28, February 1996. http://www.cs.rice.edu/~willy/TreadMarks/overview.html

Page 59:

#pragma ll parallel for
#pragma ll result(ri+i, si[i])
for (i = 1; i <= 3; i++) {
  ...
  #pragma ll for
  #pragma ll result(rj+j, sj[j])
  for (j = 0; j <= i; j++) {
    rj[j] = function_j(i, j, &sj[j], ...);
  }
  ri[i] = function_i(i, &si[i], ...);
}

[Figure: processors 0-23 partitioned among the outer iterations i = 1, 2, 3.]

Page 60:

(Same code as on the previous slide.)

[Figure: the group of each outer iteration i is subdivided again among the inner iterations j = 0, ..., i; e.g. the group for i = 1 splits between j = 0 and j = 1.]

Page 61:

[Figure: items 0-7 being distributed among worker processors step by step.]

Page 62:

#pragma ll parallel for
#pragma ll result(rx+x, sx[x])
for (x = 3; x <= 5; x++) {
  #pragma ll for
  #pragma ll result(ry+y, sy[y])
  for (y = x; y <= x+1; y++) {
    ry[y] = ...;
  }
  rx[x] = ...;
}

[Figure: processors 0-5 divided among the outer iterations x = 3, 4, 5; each subgroup handles the inner iterations y = x and y = x+1.]

Page 63:

One Thread is One Set of Processors Model (OTOSP)

• Each subset of processors corresponds to one and only one thread.
• A group is a family of subsets.
• A team is a family of groups.

[Figure: successive divisions of a processor set along several dimensions, with weighted splits (e.g. w1 = 20, w2 = 10).]

Page 64:

Page 65:

Page 66:

[Figure: mapping of processors 0-23 to the nested forall iterations (i = 1, 2, 3; j = 0, ..., i), as on the previous slides.]

forall(i=1; i<=3) result(ri+i, si[i]) {
  ...
  forall(j=0; j<=i) result(rj+j, sj[j]) {
    int a, b;
    ...
    if (i % 2 == 1)
      if (j % 2 == 0)
        send("j", j+1, a, sizeof(int));
      else
        receive("j", j-1, b, sizeof(int));
  }
}

Page 67:

Conclusions and Open Questions (Shared Memory)

• Is OpenMP here to stay?
• Performance Prediction Tools and Models?
• Scalability of Shared Memory Machines?
• OpenMP for Distributed Memory Machines?
• The Work Queuing Model?

Pointer: http://nereida.deioc.ull.es/html/openmp.html