Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms
Lecture 2: Mapping Applications to Multi-core Architectures

Page 1: Lecture 2 (Mapping Applications to Multi-core Arch)

Programming Multi-Core Processors based Embedded Systems
A Hands-On Experience on Cavium Octeon based Platforms
Lecture 2: Mapping Applications to Multi-core Architectures

Page 2: Lecture 2 (Mapping Applications to Multi-core Arch)

Course Outline
Introduction
Multi-threading on multi-core processors
Developing parallel applications
  Introduction to POSIX based multi-threading
  Multi-threaded application examples
Applications for multi-core processors
  Application layer computing on multi-core
  Performance measurement and tuning

Page 3: Lecture 2 (Mapping Applications to Multi-core Arch)

Agenda for Today
Mapping applications to multi-core architectures
Parallel programming using threads
POSIX multi-threading
Using multi-threading for parallel programming

Page 4: Lecture 2 (Mapping Applications to Multi-core Arch)

Mapping Applications to Multi-Core Architectures

Chapter 2, David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Page 5: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallelization
Assumption: a sequential algorithm is given
  Sometimes a very different algorithm is needed, but that is beyond our scope
Pieces of the job:
  Identify work that can be done in parallel
  Partition work and perhaps data among processes
  Manage data access, communication and synchronization
  Note: work includes computation, data access and I/O
Main goal: speedup (plus low programming effort and resource needs)
  Speedup(p) = Performance(p) / Performance(1)
  For a fixed problem: Speedup(p) = Time(1) / Time(p)

Page 6: Lecture 2 (Mapping Applications to Multi-core Arch)

Steps in Creating a Parallel Program

4 steps: Decomposition, Assignment, Orchestration, Mapping
Done by the programmer or by system software (compiler, runtime, ...)
Issues are the same, so assume the programmer does it all explicitly

[Figure: the sequential computation is decomposed into tasks, tasks are assigned to processes (partitioning), orchestration turns them into a parallel program, and processes are mapped onto processors p0..p3.]

Page 7: Lecture 2 (Mapping Applications to Multi-core Arch)

Some Important Concepts
Task:
  Arbitrary piece of undecomposed work in a parallel computation
  Executed sequentially; concurrency is only across tasks
  E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
  Fine-grained versus coarse-grained tasks
Process (thread):
  Abstract entity that performs the tasks assigned to it
  Processes communicate and synchronize to perform their tasks
Processor:
  Physical engine on which a process executes
  Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors

Page 8: Lecture 2 (Mapping Applications to Multi-core Arch)

Decomposition
Break up the computation into tasks to be divided among processes
  Tasks may become available dynamically
  The number of available tasks may vary with time
i.e., identify concurrency and decide the level at which to exploit it
Goal: enough tasks to keep processes busy, but not too many
  The number of tasks available at a time is an upper bound on achievable speedup

Page 9: Lecture 2 (Mapping Applications to Multi-core Arch)

Limited Concurrency: Amdahl's Law
Most fundamental limitation on parallel speedup
If a fraction s of the sequential execution is inherently serial, speedup <= 1/s
Example: 2-phase calculation
  sweep over an n-by-n grid and do some independent computation
  sweep again and add each value into a global sum
Time for the first phase = n^2/p; the second phase is serialized at the global variable, so its time = n^2
  Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
Trick: divide the second phase into two
  accumulate into a private sum during the sweep
  add the per-process private sums into the global sum
Parallel time is then n^2/p + n^2/p + p, and speedup is at best 2n^2 p / (2n^2 + p^2)
  e.g. for n = 1000 and p = 100, the bound improves from about 1.98 to about 99.5

Page 10: Lecture 2 (Mapping Applications to Multi-core Arch)

Pictorial Depiction

[Figure: work done concurrently versus time for (a) the fully serialized second phase (n^2/p followed by n^2), (b) the second phase serialized only at the global variable, and (c) private sums accumulated during the sweep, giving parallel time n^2/p + n^2/p + p.]

Page 11: Lecture 2 (Mapping Applications to Multi-core Arch)

Concurrency Profiles
Cannot usually divide a program into a strictly serial and a strictly parallel part
Area under the curve is the total work done, or the time with 1 processor
Horizontal extent is a lower bound on time (with infinite processors)
Speedup is the ratio:
  Speedup(p) = (sum over k of f_k * k) / (sum over k of f_k * ceil(k/p)), where f_k is the amount of work with concurrency k
Base case (Amdahl's law): Speedup(p) = 1 / (s + (1 - s)/p)
Amdahl's law applies to any overhead, not just limited concurrency

[Figure: concurrency profile of a sample application, plotting concurrency against clock cycle number.]

Page 12: Lecture 2 (Mapping Applications to Multi-core Arch)

Assignment
Specifying the mechanism to divide work among processes
  E.g. which process computes forces on which stars, or which rays
  Together with decomposition, also called partitioning
  Balance the workload, reduce communication and management cost
Structured approaches usually work well
  Code inspection (parallel loops) or understanding of the application
  Well-known heuristics
  Static versus dynamic assignment
As programmers, we worry about partitioning first
  Usually independent of architecture or programming model
  But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it

Page 13: Lecture 2 (Mapping Applications to Multi-core Arch)

Orchestration
Includes:
  Naming data
  Structuring communication
  Synchronization
  Organizing data structures and scheduling tasks temporally
Goals:
  Reduce the cost of communication and synchronization as seen by processors
  Preserve locality of data reference (including data structure organization)
  Schedule tasks to satisfy dependences early
  Reduce the overhead of parallelism management
Closest to the architecture (and programming model and language)
  Choices depend a lot on the communication abstraction and efficiency of primitives
  Architects should provide appropriate primitives efficiently

Page 14: Lecture 2 (Mapping Applications to Multi-core Arch)

Mapping
After orchestration, we already have a parallel program
Two aspects of mapping:
  Which processes will run on the same processor, if necessary
  Which process runs on which particular processor (mapping to a network topology)
One extreme: space-sharing
  Machine divided into subsets, only one application at a time in a subset
  Processes can be pinned to processors, or left to the OS
Another extreme: complete resource management control given to the OS
  The OS uses the performance techniques we will discuss later
The real world is between the two
  The user specifies desires in some aspects; the system may ignore them
Usually adopt the view: process <-> processor

Page 15: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallelizing Computation vs. Data
The view above is centered around computation
  Computation is decomposed and assigned (partitioned)
Partitioning data is often a natural view too
  Computation follows data: owner computes
  Grid example; data mining; High Performance Fortran (HPF)
But not general enough
  The distinction between computation and data is stronger in many applications (Barnes-Hut, Raytrace, seen later)
  Retain the computation-centric view; data access and communication are part of orchestration

Page 16: Lecture 2 (Mapping Applications to Multi-core Arch)

High-level Goals
High performance (speedup over the sequential program)
But low resource usage and development effort
Implications for algorithm designers and architects
  Algorithm designers: high performance, low resource needs
  Architects: high performance, low cost, reduced programming effort
    e.g. gradually improving performance with programming effort may be preferable to a sudden threshold after large programming effort

Table 2.1 Steps in the Parallelization Process and Their Goals
  Step           Architecture-Dependent?   Major Performance Goals
  Decomposition  Mostly no                 Expose enough concurrency, but not too much
  Assignment     Mostly no                 Balance workload; reduce communication volume
  Orchestration  Yes                       Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
  Mapping        Yes                       Put related processes on the same processor if necessary; exploit locality in the network topology

Page 17: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallelization of an Example Program
Motivating problems all lead to large, complex programs
Examine a simplified version of a piece of the Ocean simulation
  Iterative equation solver
Illustrate the parallel program in a low-level parallel language
  C-like pseudocode with simple extensions for parallelism
  Expose basic communication and synchronization primitives that must be supported
  State of most real parallel programming today

Page 18: Lecture 2 (Mapping Applications to Multi-core Arch)

Grid Solver Example
Simplified version of the solver in the Ocean simulation
Gauss-Seidel (near-neighbor) sweeps to convergence
  Interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep
  Updates done in place in the grid, and the difference from the previous value computed
  Accumulate partial diffs into a global diff at the end of every sweep
  Check whether the error has converged (to within a tolerance parameter)
  If so, exit the solver; if not, do another sweep
Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

Page 19: Lecture 2 (Mapping Applications to Multi-core Arch)

1.  int n;                  /* size of matrix: (n+2)-by-(n+2) elements */
2.  float **A, diff = 0;

3.  main()
4.  begin
5.      read(n);            /* read input parameter: matrix size */
6.      A <- malloc(a 2-d array of size n+2 by n+2 doubles);
7.      initialize(A);      /* initialize the matrix A somehow */
8.      Solve(A);           /* call the routine to solve equation */
9.  end main

10. procedure Solve(A)      /* solve the equation system */
11.     float **A;          /* A is an (n+2)-by-(n+2) array */
12. begin
13.     int i, j, done = 0;
14.     float diff = 0, temp;
15.     while (!done) do            /* outermost loop over sweeps */
16.         diff = 0;               /* initialize maximum difference to 0 */
17.         for i <- 1 to n do      /* sweep over nonborder points of grid */
18.             for j <- 1 to n do
19.                 temp = A[i,j];  /* save old value of element */
20.                 A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                           A[i,j+1] + A[i+1,j]);   /* compute average */
22.                 diff += abs(A[i,j] - temp);
23.             end for
24.         end for
25.         if (diff/(n*n) < TOL) then done = 1;
26.     end while
27. end procedure

Page 20: Lecture 2 (Mapping Applications to Multi-core Arch)

Decomposition
A simple way to identify concurrency is to look at loop iterations
  Dependence analysis; if not enough concurrency, then look further
Not much concurrency here at this level (all loops sequential)
Examine fundamental dependences, ignoring loop structure
  Concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
  Retain the loop structure and use point-to-point synch; problem: too many synch operations
  Restructure the loops and use global synch; imbalance and too much synch

Page 21: Lecture 2 (Mapping Applications to Multi-core Arch)

Exploit Application Knowledge
Reorder the grid traversal: red-black ordering
  Different ordering of updates: may converge quicker or slower
  Red sweep and black sweep are each fully parallel
  Global synch between them (conservative but convenient)
  Ocean uses red-black; we use a simpler, asynchronous ordering to illustrate:
    no red-black, simply ignore dependences within a sweep
    sequential order same as original; the parallel program is nondeterministic

[Figure: grid with alternating red and black points.]

Page 22: Lecture 2 (Mapping Applications to Multi-core Arch)

Decomposition Only
Decomposition into elements: degree of concurrency n^2
To decompose into rows, make the line 18 loop sequential; degree n
for_all leaves assignment to the system
  but implicit global synch at the end of each for_all loop

15. while (!done) do                /* a sequential loop */
16.     diff = 0;
17.     for_all i <- 1 to n do      /* a parallel loop nest */
18.         for_all j <- 1 to n do
19.             temp = A[i,j];
20.             A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                       A[i,j+1] + A[i+1,j]);
22.             diff += abs(A[i,j] - temp);
23.         end for_all
24.     end for_all
25.     if (diff/(n*n) < TOL) then done = 1;
26. end while

Page 23: Lecture 2 (Mapping Applications to Multi-core Arch)

Assignment
Static assignments (given decomposition into rows):
  Block assignment of rows: row i is assigned to process floor(i / (n/p))
  Cyclic assignment of rows: process i is assigned rows i, i+p, and so on
Dynamic assignment:
  Get a row index, work on the row, get a new row, and so on
Static assignment into rows reduces concurrency (from n to p)
  Block assignment reduces communication by keeping adjacent rows together
Let's dig into orchestration under three programming models

[Figure: block assignment of contiguous groups of rows to processes P0, P1, P2, ...]

Page 24: Lecture 2 (Mapping Applications to Multi-core Arch)

Data Parallel Solver

1.   int n, nprocs;         /* grid size (n+2)-by-(n+2) and number of processes */
2.   float **A, diff = 0;

3.   main()
4.   begin
5.       read(n); read(nprocs);     /* read input grid size and number of processes */
6.       A <- G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.       initialize(A);             /* initialize the matrix A somehow */
8.       Solve(A);                  /* call the routine to solve equation */
9.   end main

10.  procedure Solve(A)             /* solve the equation system */
11.      float **A;                 /* A is an (n+2)-by-(n+2) array */
12.  begin
13.      int i, j, done = 0;
14.      float mydiff = 0, temp;
14a.     DECOMP A[BLOCK,*,nprocs];
15.      while (!done) do           /* outermost loop over sweeps */
16.          mydiff = 0;            /* initialize maximum difference to 0 */
17.          for_all i <- 1 to n do     /* sweep over non-border points of grid */
18.              for_all j <- 1 to n do
19.                  temp = A[i,j];     /* save old value of element */
20.                  A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                            A[i,j+1] + A[i+1,j]);   /* compute average */
22.                  mydiff += abs(A[i,j] - temp);
23.              end for_all
24.          end for_all
24a.         REDUCE(mydiff, diff, ADD);
25.          if (diff/(n*n) < TOL) then done = 1;
26.      end while
27.  end procedure

Page 25: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Address Space Solver
Single Program Multiple Data (SPMD)
Assignment controlled by the values of variables used as loop bounds

[Figure: each process runs Solve on its portion of the grid; all processes sweep, then jointly test convergence.]

Page 26: Lecture 2 (Mapping Applications to Multi-core Arch)

1.   int n, nprocs;         /* matrix dimension and number of processors to be used */
2a.  float **A, diff;       /* A is global (shared) array representing the grid */
                            /* diff is global (shared) maximum difference in current sweep */
2b.  LOCKDEC(diff_lock);    /* declaration of lock to enforce mutual exclusion */
2c.  BARDEC(bar1);          /* barrier declaration for global synchronization between sweeps */

3.   main()
4.   begin
5.       read(n); read(nprocs);     /* read input matrix size and number of processes */
6.       A <- G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.       initialize(A);             /* initialize A in an unspecified way */
8a.      CREATE(nprocs-1, Solve, A);
8.       Solve(A);                  /* main process becomes a worker too */
8b.      WAIT_FOR_END(nprocs-1);    /* wait for all child processes created to terminate */
9.   end main

10.  procedure Solve(A)
11.      float **A;         /* A is entire n+2-by-n+2 shared array, as in the sequential program */
12.  begin
13.      int i, j, pid, done = 0;
14.      float temp, mydiff = 0;            /* private variables */
14a.     int mymin = 1 + (pid * n/nprocs);  /* assume that n is exactly divisible by */
14b.     int mymax = mymin + n/nprocs - 1;  /* nprocs for simplicity here */

15.      while (!done) do           /* outer loop over sweeps */
16.          mydiff = diff = 0;     /* set global diff to 0 (okay for all to do it) */
16a.         BARRIER(bar1, nprocs); /* ensure all reach here before anyone modifies diff */
17.          for i <- mymin to mymax do     /* for each of my rows */
18.              for j <- 1 to n do         /* for all nonborder elements in that row */
19.                  temp = A[i,j];
20.                  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                           A[i,j+1] + A[i+1,j]);
22.                  mydiff += abs(A[i,j] - temp);
23.              endfor
24.          endfor
25a.         LOCK(diff_lock);       /* update global diff if necessary */
25b.         diff += mydiff;
25c.         UNLOCK(diff_lock);
25d.         BARRIER(bar1, nprocs); /* ensure all reach here before checking if done */
25e.         if (diff/(n*n) < TOL) then done = 1;   /* check convergence; all get same answer */
25f.         BARRIER(bar1, nprocs);
26.      endwhile
27.  end procedure

Page 27: Lecture 2 (Mapping Applications to Multi-core Arch)

Notes on SAS Program
SPMD: not lockstep, or even necessarily the same instructions
Assignment controlled by the values of variables used as loop bounds
  unique pid per process, used to control assignment
"Done" condition evaluated redundantly by all processes
Code that does the update is identical to the sequential program
  each process has a private mydiff variable
The most interesting special operations are for synchronization
  accumulations into the shared diff have to be mutually exclusive
  why the need for all the barriers?

Page 28: Lecture 2 (Mapping Applications to Multi-core Arch)

Need for Mutual Exclusion
Code each process executes:
  load the value of diff into register r1
  add the register r2 to register r1
  store the value of register r1 into diff
A possible interleaving:
  P1                                  P2
  r1 <- diff   {P1 gets 0 in its r1}
                                      r1 <- diff   {P2 also gets 0}
  r1 <- r1+r2  {P1 sets its r1 to 1}
                                      r1 <- r1+r2  {P2 sets its r1 to 1}
  diff <- r1   {P1 sets diff to 1}
                                      diff <- r1   {P2 also sets diff to 1}
Need the sets of operations to be atomic (mutually exclusive)

Page 29: Lecture 2 (Mapping Applications to Multi-core Arch)

Global Event Synchronization
BARRIER(nprocs): wait here till nprocs processes get here
  Built using lower-level primitives
  Global sum example: wait for all to accumulate before using the sum
  Often used to separate phases of computation

  Process P_1            Process P_2            Process P_nprocs
  set up eqn system      set up eqn system      set up eqn system
  Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)
  solve eqn system       solve eqn system       solve eqn system
  Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)
  apply results          apply results          apply results
  Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)

A conservative form of preserving dependences, but easy to use
WAIT_FOR_END(nprocs-1)

Page 30: Lecture 2 (Mapping Applications to Multi-core Arch)

Pt-to-pt Event Synch (Not Used Here)
One process notifies another of an event so it can proceed
  Common example: producer-consumer (bounded buffer)
  Concurrent programming on a uniprocessor: semaphores
  Shared address space parallel programs: semaphores, or ordinary variables used as flags

  P1                             P2
  A = 1;                     a:  while (flag is 0) do nothing;
  b: flag = 1;                   print A;

Busy-waiting or spinning
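On a real machine, using an ordinary variable as the flag is unsafe because the compiler and hardware may reorder the writes to A and flag. A hedged C11 sketch of the same handshake (A and flag come from the slide; the release/acquire ordering is an implementation detail not in the original):

  #include <stdio.h>
  #include <stdatomic.h>

  int A = 0;
  atomic_int flag = 0;

  void producer(void)            /* P1 */
  {
      A = 1;                     /* produce the data */
      atomic_store_explicit(&flag, 1, memory_order_release);  /* b: flag = 1 */
  }

  void consumer(void)            /* P2 */
  {
      while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
          ;                      /* a: busy-wait (spin) until flag is set */
      printf("%d\n", A);         /* guaranteed to observe A = 1 */
  }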

Page 31: Lecture 2 (Mapping Applications to Multi-core Arch)

Group Event Synchronization
Subset of processes involved
Can use flags or barriers (involving only the subset)
Concept of producers and consumers
Major types:
  Single-producer, multiple-consumer
  Multiple-producer, single-consumer
  Multiple-producer, multiple-consumer

Page 32: Lecture 2 (Mapping Applications to Multi-core Arch)

Message Passing Grid Solver
Cannot declare A to be a shared array any more
Need to compose it logically from per-process private arrays
  Usually allocated in accordance with the assignment of work
  A process assigned a set of rows allocates them locally
Transfers of entire rows between traversals
Structurally similar to SAS (e.g. SPMD), but orchestration is different
  Data structures and data access/naming
  Communication
  Synchronization

Page 33: Lecture 2 (Mapping Applications to Multi-core Arch)

1.   int pid, n, b;         /* process id, matrix dimension and number of processors to be used */
2.   float **myA;
3.   main()
4.   begin
5.       read(n); read(nprocs);     /* read input matrix size and number of processes */
8a.      CREATE(nprocs-1, Solve);
8b.      Solve();                   /* main process becomes a worker too */
8c.      WAIT_FOR_END(nprocs-1);    /* wait for all child processes created to terminate */
9.   end main

10.  procedure Solve()
11.  begin
13.      int i, j, pid, n' = n/nprocs, done = 0;
14.      float temp, tempdiff, mydiff = 0;      /* private variables */
6.       myA <- malloc(a 2-d array of size [n/nprocs + 2] by n+2);   /* my assigned rows of A */
7.       initialize(myA);           /* initialize my rows of A, in an unspecified way */

15.      while (!done) do
16.          mydiff = 0;            /* set local diff to 0 */
16a.         if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.         if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.         if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.         if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
             /* border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*] */
17.          for i <- 1 to n' do    /* for each of my (nonghost) rows */
18.              for j <- 1 to n do /* for all nonborder elements in that row */
19.                  temp = myA[i,j];
20.                  myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                            myA[i,j+1] + myA[i+1,j]);
22.                  mydiff += abs(myA[i,j] - temp);
23.              endfor
24.          endfor
             /* communicate local diff values and determine if done;
                can be replaced by reduction and broadcast */
25a.         if (pid != 0) then             /* process 0 holds global total diff */
25b.             SEND(mydiff, sizeof(float), 0, DIFF);
25c.             RECEIVE(done, sizeof(int), 0, DONE);
25d.         else                           /* pid 0 does this */
25e.             for i <- 1 to nprocs-1 do  /* for each other process */
25f.                 RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.                 mydiff += tempdiff;    /* accumulate into total */
25h.             endfor
25i.             if (mydiff/(n*n) < TOL) then done = 1;
25j.             for i <- 1 to nprocs-1 do  /* for each other process */
25k.                 SEND(done, sizeof(int), i, DONE);
25l.             endfor
25m.         endif
26.      endwhile
27.  end procedure

Page 34: Lecture 2 (Mapping Applications to Multi-core Arch)

Notes on Message Passing Program
Use of ghost rows
Receive does not transfer data, send does
  Unlike SAS, which is usually receiver-initiated (a load fetches the data)
Communication done at the beginning of an iteration, so no asynchrony
Communication in whole rows, not an element at a time
Core similar, but indices/bounds in local rather than global space
Synchronization through sends and receives
  Update of global diff and event synch for the done condition
  Could implement locks and barriers with messages
Can use REDUCE and BROADCAST library calls to simplify the code:

     /* communicate local diff values and determine if done, using reduction and broadcast */
25b. REDUCE(0, mydiff, sizeof(float), ADD);
25c. if (pid == 0) then
25i.     if (mydiff/(n*n) < TOL) then done = 1;
25k. endif
25m. BROADCAST(0, done, sizeof(int), DONE);
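For comparison, in a concrete message-passing library such as MPI the reduction and broadcast of the convergence test collapse into a single collective. A hedged sketch, assuming MPI is available (MPI is not part of the original pseudocode; check_converged is an illustrative helper):

  #include <mpi.h>

  /* mydiff holds this process's partial sum of |A[i,j] - temp|. */
  int check_converged(float mydiff, int n, float tol)
  {
      float diff;
      /* Sum the partial diffs from all processes and give every process
       * the result: REDUCE + BROADCAST in one call.                    */
      MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
      return (diff / ((float)n * n)) < tol;    /* same done flag on all ranks */
  }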

Page 35: Lecture 2 (Mapping Applications to Multi-core Arch)

Send and Receive Alternatives
Can extend functionality: stride, scatter-gather, groups
Semantic flavors: based on when control is returned
  Affect when data structures or buffers can be reused at either end
  Affect event synch (mutual exclusion by fiat: only one process touches the data)
  Affect ease of programming and performance
Synchronous messages provide built-in synch through the match
  Separate event synchronization is needed with asynchronous messages
With synchronous messages, our code is deadlocked. Fix?

  Send/Receive
    Synchronous
    Asynchronous
      Blocking asynchronous
      Nonblocking asynchronous

Page 36: Lecture 2 (Mapping Applications to Multi-core Arch)

Orchestration: Summary
Shared address space
  Shared and private data explicitly separate
  Communication implicit in access patterns
  No correctness need for data distribution
  Synchronization via atomic operations on shared data
  Synchronization explicit and distinct from data communication
Message passing
  Data distribution among local address spaces needed
  No explicit shared structures (implicit in communication patterns)
  Communication is explicit
  Synchronization implicit in communication (at least in the synchronous case)
  Mutual exclusion by fiat

Page 37: Lecture 2 (Mapping Applications to Multi-core Arch)

Correctness in Grid Solver Program
Decomposition and assignment are similar in SAS and message passing
Orchestration is different
  Data structures, data access/naming, communication, synchronization

                                          SAS        Msg-Passing
  Explicit global data structure?         Yes        No
  Assignment independent of data layout?  Yes        No
  Communication                           Implicit   Explicit
  Synchronization                         Explicit   Implicit
  Explicit replication of border rows?    No         Yes

Page 38: Lecture 2 (Mapping Applications to Multi-core Arch)

Programming for Performance

Chapter 3, David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Page 39: Lecture 2 (Mapping Applications to Multi-core Arch)

Outline
Programming techniques for performance
  Partitioning for performance
  Relationship of communication, data locality and architecture
  Programming for performance
For each issue:
  Techniques to address it, and tradeoffs with previous issues
  Application to the grid solver
  Some architectural implications
Components of execution time as seen by the processor
  What the workload looks like to the architecture, and how it relates to software issues
Implications for programming models

Page 40: Lecture 2 (Mapping Applications to Multi-core Arch)

Partitioning for Performance
Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work
Even these algorithmic issues trade off:
  Minimize communication => run on 1 processor => extreme load imbalance
  Maximize load balance => random assignment of tiny tasks => no control over communication
  A good partition may imply extra work to compute or manage it
The goal is to compromise
  Fortunately, often not difficult in practice

Page 41: Lecture 2 (Mapping Applications to Multi-core Arch)

Load Balance and Synch Wait Time
Limit on speedup:
  Speedup_problem(p) < Sequential Work / Max Work on any Processor
Work includes data access and other costs
Not just equal work: processors must also be busy at the same time
Four parts to load balance and reducing synch wait time:
  1. Identify enough concurrency
  2. Decide how to manage it
  3. Determine the granularity at which to exploit it
  4. Reduce serialization and the cost of synchronization

Page 42: Lecture 2 (Mapping Applications to Multi-core Arch)

Identifying Concurrency
Techniques seen for the equation solver:
  Loop structure, fundamental dependences, new algorithms
Data parallelism versus function parallelism
Often see orthogonal levels of parallelism; e.g. VLSI routing

[Figure: VLSI routing hierarchy; wire W2 expands to segments S21..S26, and segment S23 expands to routes.]

Page 43: Lecture 2 (Mapping Applications to Multi-core Arch)

Identifying Concurrency (Cont'd)
Function parallelism:
  Entire large tasks (procedures) that can be done in parallel, on the same or different data
  e.g. different independent grid computations in Ocean
  Pipelining, as in video encoding/decoding, or polygon rendering
  Degree usually modest and does not grow with input size
  Difficult to load balance
  Often used to reduce synch between data parallel phases
Most scalable programs are data parallel (per this loose definition)
  Function parallelism reduces synch between data parallel phases

Page 44: Lecture 2 (Mapping Applications to Multi-core Arch)

Deciding How to Manage Concurrency
Static versus dynamic techniques
Static:
  Algorithmic assignment based on input; won't change
  Low runtime overhead
  Computation must be predictable
  Preferable when applicable (except in multiprogrammed/heterogeneous environments)
Dynamic:
  Adapt at runtime to balance load
  Can increase communication and reduce locality
  Can increase task management overheads

Page 45: Lecture 2 (Mapping Applications to Multi-core Arch)

Dynamic Assignment
Profile-based (semi-static):
  Profile work distribution at runtime, and repartition dynamically
  Applicable in many computations, e.g. Barnes-Hut, some graphics
Dynamic tasking:
  Deal with unpredictability in the program or environment (e.g. Raytrace)
    computation, communication, and memory system interactions
    multiprogramming and heterogeneity
    used by runtime systems and the OS too
  Pool of tasks; take and add tasks until done
  e.g. "self-scheduling" of loop iterations (shared loop counter)
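"Self-scheduling" can be as simple as a shared counter that every worker advances atomically to claim its next iteration. A minimal C11 sketch under that assumption (process_row is a hypothetical placeholder for the real per-row work):

  #include <stdatomic.h>

  extern void process_row(int row);     /* placeholder for the real work */

  static atomic_int next_row = 0;       /* shared loop counter */

  void worker(int n)
  {
      for (;;) {
          /* fetch_add makes "read the counter, advance the counter" atomic,
           * so no two workers ever claim the same row.                     */
          int i = atomic_fetch_add(&next_row, 1);
          if (i >= n)
              break;                    /* pool of iterations exhausted */
          process_row(i);
      }
  }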

Page 46: Lecture 2 (Mapping Applications to Multi-core Arch)

Dynamic Tasking with Task Queues
Centralized versus distributed queues
Task stealing with distributed queues
  Can compromise communication and locality, and increase synchronization
  Whom to steal from, how many tasks to steal, ...
  Termination detection
  Maximum imbalance related to task size

[Figure: (a) a centralized task queue, into which all processes insert and from which all remove tasks; (b) distributed task queues, one per process, where each process inserts into and removes from its own queue and others may steal.]

Page 47: Lecture 2 (Mapping Applications to Multi-core Arch)

Determining Task Granularity
Task granularity: amount of work associated with a task
General rule:
  Coarse-grained => often less load balance
  Fine-grained => more overhead; often more communication and contention
Communication and contention are actually affected by assignment, not size
Overhead is affected by size itself too, particularly with task queues

Page 48: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Serialization
Be careful about assignment and orchestration (including scheduling)
Event synchronization
  Reduce use of conservative synchronization
    e.g. point-to-point instead of barriers, or finer granularity of point-to-point
  But fine-grained synch is more difficult to program, and means more synch operations
Mutual exclusion
  Separate locks for separate data
    e.g. locking records in a database: lock per process, record, or field
    lock per task in a task queue, not per queue
    finer grain => less contention/serialization, more space, less reuse
  Smaller, less frequent critical sections
    don't do reading/testing in the critical section, only modification
    e.g. searching for a task to dequeue in a task queue, building a tree
  Stagger critical sections in time

Page 49: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Inherent Communication
Communication is expensive!
Measure: communication-to-computation ratio
Focus here on inherent communication
  Determined by the assignment of tasks to processes
  Later we will see that actual communication can be greater
Assign tasks that access the same data to the same process
Solving communication and load balance together is NP-hard in the general case
  But simple heuristic solutions work well in practice
  Applications have structure!

Page 50: Lecture 2 (Mapping Applications to Multi-core Arch)

Domain Decomposition
Works well for scientific, engineering, graphics, ... applications
Exploits the local-biased nature of physical problems
  Information requirements are often short-range
  Or long-range but falling off with distance
Simple example: nearest-neighbor grid computation
  Perimeter-to-area communication-to-computation ratio (area-to-volume in 3-d)
  Depends on n and p: decreases with n, increases with p

[Figure: n-by-n grid partitioned into square blocks of side n/sqrt(p), one per process P0..P15.]

Page 51: Lecture 2 (Mapping Applications to Multi-core Arch)

Domain Decomposition (Cont'd)
Best domain decomposition depends on information requirements
Nearest-neighbor example: block versus strip decomposition
  Communication-to-computation ratio: 4*sqrt(p)/n for block, 2*p/n for strip
  Retain the block decomposition from here on
Application dependent: strip may be better in other cases
  e.g. particle flow in a tunnel

[Figure: n-by-n grid divided into p square blocks of side n/sqrt(p) versus p horizontal strips of n/p rows each.]

Page 52: Lecture 2 (Mapping Applications to Multi-core Arch)

Finding a Domain Decomposition
Static, by inspection
  Must be predictable: grid example
Static, but not by inspection
  Input-dependent, requires analyzing the input structure
  e.g. sparse matrix computations, data mining (assigning itemsets)
Semi-static (periodic repartitioning)
  Characteristics change, but slowly; e.g. Barnes-Hut
Static or semi-static, with dynamic task stealing
  Initial decomposition, but highly unpredictable; e.g. ray tracing

Page 53: Lecture 2 (Mapping Applications to Multi-core Arch)

Other Techniques
Scatter decomposition, e.g. the initial partition in Raytrace
Preserve locality in task stealing
  Steal large tasks for locality, steal from the same queues, ...

[Figure: domain decomposition versus scatter decomposition; in scatter decomposition the domain is divided into many small patches, and patches labeled 1-4 are dealt out to the four processes so each process's work is spread across the whole domain.]

Page 54: Lecture 2 (Mapping Applications to Multi-core Arch)

Implications of Comm-to-Comp Ratio
Architects examine application needs to see where to spend money
If the denominator is execution time, the ratio gives average bandwidth needs
If the denominator is operation count, the ratio gives the extremes in impact of latency and bandwidth
  Latency: assume no latency hiding
  Bandwidth: assume all latency hidden
  Reality is somewhere in between
The actual impact of communication depends on its structure and cost as well
Need to keep communication balanced across processors as well

  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)

Page 55: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Extra Work
Common sources of extra work:
  Computing a good partition (e.g. partitioning in Barnes-Hut or sparse matrix)
  Using redundant computation to avoid communication
  Task, data and process management overhead (applications, languages, runtime systems, OS)
  Imposing structure on communication (coalescing messages, allowing effective naming)
Architectural implications:
  Reduce the need for extra work by making communication and orchestration efficient

  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

Page 56: Lecture 2 (Mapping Applications to Multi-core Arch)

Memory-oriented View of Performance
Multiprocessor as an extended memory hierarchy, as seen by a given processor
Levels in the extended hierarchy:
  Registers, caches, local memory, remote memory (topology)
  Glued together by the communication architecture
  Levels communicate at a certain granularity of data transfer
Need to exploit spatial and temporal locality in the hierarchy
  Otherwise extra communication may also be caused
  Especially important since communication is expensive

Page 57: Lecture 2 (Mapping Applications to Multi-core Arch)

Uniprocessor Optimization
Performance depends heavily on the memory hierarchy
Time spent by a program:
  Time_prog(1) = Busy(1) + Data Access(1)
  Divide by cycles to get the CPI equation
Data access time can be reduced by:
  Optimizing the machine: bigger caches, lower latency, ...
  Optimizing the program: temporal and spatial locality

Page 58: Lecture 2 (Mapping Applications to Multi-core Arch)

Extended Hierarchy
Idealized view: local cache hierarchy + single main memory
But reality is more complex
  Centralized memory: caches of other processors
  Distributed memory: some local, some remote; plus network topology
  Management of levels:
    caches managed by hardware
    main memory depends on the programming model
      SAS: data movement between local and remote is transparent
      message passing: explicit
Levels closer to the processor are lower latency and higher bandwidth
Improve performance through architecture or program locality
  Tradeoff with parallelism; need good node performance and parallelism

Page 59: Lecture 2 (Mapping Applications to Multi-core Arch)

Artifactual Communication in the Extended Hierarchy
Accesses not satisfied in the local portion cause communication
  Inherent communication, implicit or explicit, causes transfers
    determined by the program
  Artifactual communication
    determined by program implementation and architecture interactions
    poor allocation of data across distributed memories
    unnecessary data in a transfer
    unnecessary transfers due to system granularities
    redundant communication of data
    finite replication capacity (in cache or main memory)
Inherent communication assumes unlimited capacity, small transfers, and perfect knowledge of what is needed
More on artifactual communication later; first consider replication-induced communication further

Page 60: Lecture 2 (Mapping Applications to Multi-core Arch)

Communication and Replication
Communication induced by finite capacity is the most fundamental artifact
  Like cache size and miss rate or memory traffic in uniprocessors
  The extended memory hierarchy view is useful for this relationship
View as a three-level hierarchy for simplicity
  Local cache, local memory, remote memory (ignore network topology)
Classify "misses" in the "cache" at any level as for uniprocessors:
  compulsory or cold misses (no size effect)
  capacity misses (size effect)
  conflict or collision misses (size effect)
  communication or coherence misses (no size effect)
Each may be helped or hurt by large transfer granularity (spatial locality)

Page 61: Lecture 2 (Mapping Applications to Multi-core Arch)

Orchestration for Performance
Reducing the amount of communication:
  Inherent: change logical data sharing patterns in the algorithm
  Artifactual: exploit spatial and temporal locality in the extended hierarchy
    Techniques often similar to those on uniprocessors
Structuring communication to reduce cost
Let's examine techniques for both...

Page 62: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Artifactual Communication
Message passing model
  Communication and replication are both explicit
  Even artifactual communication is in explicit messages
Shared address space model
  More interesting from an architectural perspective
  Occurs transparently due to interactions of program and system
    sizes and granularities in the extended memory hierarchy
Use the shared address space to illustrate the issues

Page 63: Lecture 2 (Mapping Applications to Multi-core Arch)

Exploiting Temporal Locality
Structure the algorithm so working sets map well to the hierarchy
  Often techniques to reduce inherent communication do well here
  Schedule tasks for data reuse once assigned
Multiple data structures in the same phase
  e.g. database records: local versus remote
Solver example: blocking
  More useful when there is O(n^(k+1)) computation on O(n^k) data
  Many linear algebra computations (factorization, matrix multiply)

[Figure: (a) unblocked access pattern in a sweep versus (b) blocked access pattern with B = 4.]
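A hedged sketch of what blocking the solver sweep can look like in C: the j loop is tiled into strips of B columns so that values reused by neighboring rows stay in cache across the strip (B, the array layout, and the function name are illustrative; as with the asynchronous solver earlier, this changes the update order within a sweep):

  #define B 4                                /* block (tile) width */

  /* Blocked sweep: walk the grid in vertical strips of B columns so the
   * neighbors touched on row i are still cached when row i+1 is updated. */
  void blocked_sweep(int n, float A[n + 2][n + 2], float *diff)
  {
      for (int jj = 1; jj <= n; jj += B)
          for (int i = 1; i <= n; i++)
              for (int j = jj; j <= n && j < jj + B; j++) {
                  float temp = A[i][j];
                  A[i][j] = 0.2f * (A[i][j] + A[i][j - 1] + A[i - 1][j]
                                    + A[i][j + 1] + A[i + 1][j]);
                  *diff += (A[i][j] > temp) ? A[i][j] - temp : temp - A[i][j];
              }
  }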

Page 64: Lecture 2 (Mapping Applications to Multi-core Arch)

Exploiting Spatial Locality
Besides capacity, granularities are important:
  Granularity of allocation
  Granularity of communication or data transfer
  Granularity of coherence
Major spatial-related causes of artifactual communication:
  Conflict misses
  Data distribution/layout (allocation granularity)
  Fragmentation (communication granularity)
  False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures
  Fix problems by modifying data structures, or layout/alignment
Examine later in the context of architectures
  One simple example here: data distribution in the SAS solver

Page 65: Lecture 2 (Mapping Applications to Multi-core Arch)

Spatial Locality Example
Repeated sweeps over a 2-d grid, each time adding 1 to the elements
Natural 2-d versus higher-dimensional array representation

[Figure: (a) with a two-dimensional array, pages straddle partition boundaries, making memory hard to distribute well, and cache blocks straddle partition boundaries; (b) with a four-dimensional array, contiguity in the memory layout keeps pages and cache blocks within a partition.]
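The "four-dimensional array" trick is only an indexing change: allocate the grid as blocks-of-blocks so that each process's partition is one contiguous region. A hedged sketch (nb is the number of block rows/columns, roughly sqrt(p); the helper names are illustrative):

  #include <stdlib.h>

  /* 2-d view:  A[i][j], row-major; one process's block is scattered
   *            across many pages and cache lines.
   * 4-d view:  A4[bi][bj][i][j], where (bi,bj) selects the block owned
   *            by a process and (i,j) indexes inside it; the whole block
   *            is stored contiguously.                                  */
  float *alloc_blocked(int nb, int bsize)
  {
      /* nb*nb blocks, each bsize*bsize elements, stored block after block */
      return malloc((size_t)nb * nb * bsize * bsize * sizeof(float));
  }

  static inline float *elem(float *A4, int nb, int bsize,
                            int bi, int bj, int i, int j)
  {
      return &A4[(((size_t)bi * nb + bj) * bsize + i) * bsize + j];
  }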

Page 66: Lecture 2 (Mapping Applications to Multi-core Arch)

Tradeoffs with Inherent Communication
Partitioning the grid solver: blocks versus rows
  Blocks still have a spatial locality problem on remote data
  Rowwise can perform better despite a worse inherent comm-to-comp ratio
The result depends on n and p

[Figure: good spatial locality on nonlocal accesses at a row-oriented boundary; poor spatial locality on nonlocal accesses at a column-oriented boundary.]

Page 67: Lecture 2 (Mapping Applications to Multi-core Arch)

Example Performance Impact
Equation solver on SGI Origin2000

[Figure: two plots of speedup versus number of processors (1 to 31) comparing 2D, 4D, and row-wise partitionings, with and without round-robin (rr) page placement.]

Page 68: Lecture 2 (Mapping Applications to Multi-core Arch)

Structuring Communication
Given the amount of communication (inherent or artifactual), the goal is to reduce its cost
Cost of communication as seen by a process:
  C = f * (o + l + n_c/(m*B) + t_c - overlap)
    f = frequency of messages
    o = overhead per message (at both ends)
    l = network delay per message
    n_c = total data sent
    m = number of messages
    B = bandwidth along the path (determined by network, NI, assist)
    t_c = cost induced by contention per message
    overlap = amount of latency hidden by overlap with computation or communication
The portion in parentheses is the cost of a message (as seen by the processor)
  That portion, ignoring overlap, is the latency of a message
Goal: reduce the terms in latency and increase overlap
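Written out as code, the per-process cost model above is a single expression; a hedged transcription with every parameter passed in explicitly (the function name and units are illustrative):

  /* C = f * (o + l + nc/(m*B) + tc - overlap) */
  double comm_cost(double f, double o, double l, double nc,
                   double m, double B, double tc, double overlap)
  {
      double per_msg = o + l + nc / (m * B) + tc - overlap;  /* cost of one message */
      return f * per_msg;
  }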

Page 69: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Overhead
Can reduce the number of messages m or the overhead per message o
o is usually determined by hardware or system software
  The program should try to reduce m by coalescing messages
  More control when communication is explicit
Coalescing data into larger messages:
  Easy for regular, coarse-grained communication
  Can be difficult for irregular, naturally fine-grained communication
    may require changes to the algorithm and extra work
    coalescing data and determining what to send, and to whom

Page 70: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Contention
All resources have nonzero occupancy
  Memory, communication controller, network link, etc.
  Can only handle so many transactions per unit time
Effects of contention:
  Increased end-to-end cost for messages
  Reduced available bandwidth for individual messages
  Causes imbalances across processors
Particularly insidious performance problem
  Easy to ignore when programming
  Slows down messages that don't even need that resource, by causing other dependent resources to also congest
  The effect can be devastating: don't flood a resource!

Page 71: Lecture 2 (Mapping Applications to Multi-core Arch)

Overlapping Communication
Cannot afford to stall for high latencies
  Even on uniprocessors!
Overlap with computation or communication to hide latency
Requires extra concurrency (slackness) and higher bandwidth
Techniques:
  Prefetching
  Block data transfer
  Proceeding past communication
  Multithreading

Page 72: Lecture 2 (Mapping Applications to Multi-core Arch)

Summary of Tradeoffs
Different goals often have conflicting demands
  Load balance
    fine-grain tasks
    random or dynamic assignment
  Communication
    usually coarse-grain tasks
    decompose to obtain locality: not random/dynamic
  Extra work
    coarse-grain tasks
    simple assignment
  Communication cost:
    big transfers: amortize overhead and latency
    small transfers: reduce contention

Page 73: Lecture 2 (Mapping Applications to Multi-core Arch)

Relationship between Perspectives

  Parallelization step(s)                  Performance issue                             Processor time component
  Decomposition/assignment/orchestration   Load imbalance and synchronization            Synch wait
  Decomposition/assignment                 Extra work                                    Busy-overhead
  Decomposition/assignment                 Inherent communication volume                 Data-remote
  Orchestration                            Artifactual communication and data locality   Data-local
  Orchestration/mapping                    Communication structure

Page 74: Lecture 2 (Mapping Applications to Multi-core Arch)

Summary

  Speedup(p) < [Busy(1) + Data(1)] / [Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p)]

Goal is to reduce the denominator components
Both the programmer and the system have a role to play
Architecture cannot do much about load imbalance or too much communication
But it can:
  reduce the incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  reduce artifactual communication
  provide efficient naming for flexible assignment
  allow effective overlapping of communication

Page 75: Lecture 2 (Mapping Applications to Multi-core Arch)

Multi-Threading

Parallel Programming on Shared Memory Multiprocessors Using PThreads

Chapter 2, Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006

Page 76: Lecture 2 (Mapping Applications to Multi-core Arch)

Outline of Multi-Threading Topics
Threads
  Terminology
  OS-level view
  Hardware-level threads
Threading as a parallel programming model
  Types of thread-level parallel programs
  Implementation issues

Page 77: Lecture 2 (Mapping Applications to Multi-core Arch)

Threads: Definition
A discrete sequence of related instructions
Executed independently of other such sequences
Every program has at least one thread
  Initializes
  Executes instructions
  May create other threads
Each thread maintains its current state
The OS maps a thread to hardware resources

Page 78: Lecture 2 (Mapping Applications to Multi-core Arch)

System View of Threads
Thread computational model layers:
  User-level threads
  Kernel-level threads
  Hardware threads

Page 79: Lecture 2 (Mapping Applications to Multi-core Arch)

Flow of Threads in an Execution Environment
Defining and preparing stage
Operating stage
  Created and managed by the OS
Execution stage

Page 80: Lecture 2 (Mapping Applications to Multi-core Arch)


Threads Inside the OS

Page 81: Lecture 2 (Mapping Applications to Multi-core Arch)


Processors, Processes, and Threads

A processor runs threads from one or more processes, each of which contains one or more threads

Page 82: Lecture 2 (Mapping Applications to Multi-core Arch)


Mapping Models of Threads to Processors: 1:1 Mapping

Page 83: Lecture 2 (Mapping Applications to Multi-core Arch)


Mapping Models of Threads to Processors: M:1 Mapping

Page 84: Lecture 2 (Mapping Applications to Multi-core Arch)


Mapping Models of Threads to Processors: M:N Mapping

Page 85: Lecture 2 (Mapping Applications to Multi-core Arch)


Threads Inside the Hardware

Page 86: Lecture 2 (Mapping Applications to Multi-core Arch)

Thread Creation
Multiple threads inside a process
  Share the same address space, file descriptors, etc.
  Operate independently
  Need their own stack space
Who handles thread creation details?
  Not the programmer
  Typically handled at the system level
    OS support for threads
    Threading libraries
The same is true for thread management

Page 87: Lecture 2 (Mapping Applications to Multi-core Arch)


Stack Layout for a Multi-Threaded Process

Page 88: Lecture 2 (Mapping Applications to Multi-core Arch)


Thread State Diagram

Page 89: Lecture 2 (Mapping Applications to Multi-core Arch)

Thread Implementation
Often implemented as a thread package
  Operations to create and destroy threads
  Synchronization mechanisms
Approaches to implementing a thread package:
  Implement it as a thread library that executes entirely in user mode
  Have the kernel be aware of threads and schedule them

Page 90: Lecture 2 (Mapping Applications to Multi-core Arch)

Thread Implementation (2)
Characteristics of a user-level thread library:
  Cheap to create and destroy threads
  Switching thread context can be done in just a few instructions
    Need to save and restore CPU registers only
    No need to change memory maps, flush the TLB, do CPU accounting, etc.
  Drawback: a blocking system call will block all threads in the process
Solution to blocking: implement threads in the OS kernel

Page 91: Lecture 2 (Mapping Applications to Multi-core Arch)

Kernel Implementations of Threads
High price to solve the blocking problem
  Every thread operation will require a system call:
    Thread creation
    Thread deletion
    Thread synchronization
  Thread switching now becomes as expensive as a process context switch

Page 92: Lecture 2 (Mapping Applications to Multi-core Arch)

Kernel Implementations of Threads (2)
Lightweight processes (LWP)
  A hybrid form of user- and kernel-level threads
  An LWP runs in the context of a (heavy-weight) process
  There can be several LWPs, each with its own scheduler and stack
  The system also offers a user-level thread package for the usual operations (creation, deletion, and synchronization)
  The assignment of a user-level thread to an LWP is hidden from the programmer
  The LWP handles the scheduling for multiple threads

Page 93: Lecture 2 (Mapping Applications to Multi-core Arch)

LWP Implementation
The thread table is shared among LWPs
  Protected through mutexes
  No kernel intervention for LWP synchronization
When an LWP finds a runnable thread
  It switches context to that thread
  Done entirely in user space
When a thread makes a blocking system call:
  The OS might block one LWP
  It may switch to another LWP, which allows other threads to continue

Page 94: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallel Programming with Threads

Overview of POSIX threads, data races and types of synchronization

Page 95: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Memory Programming: Several Thread Libraries
PTHREADS is the POSIX standard
  Solaris threads are very similar
  Relatively low level
  Portable but possibly slow
OpenMP is a newer standard
  Support for scientific programming on shared memory
  http://www.openMP.org
Multiple other efforts by specific vendors

Page 96: Lecture 2 (Mapping Applications to Multi-core Arch)

Overview of POSIX Threads
POSIX: Portable Operating System Interface for UNIX
  Interface to operating system utilities
PThreads: the POSIX threading interface
  System calls to create and synchronize threads
  Should be relatively uniform across UNIX-like OS platforms
PThreads contain support for
  Creating parallelism
  Synchronizing
  No explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread

Page 97: Lecture 2 (Mapping Applications to Multi-core Arch)

POSIX Thread Creation
Signature:
  int pthread_create(pthread_t *,
                     const pthread_attr_t *,
                     void * (*)(void *),
                     void *);
Example call:
  errorcode = pthread_create(&thread_id,
                             &thread_attribute,
                             &thread_fun,
                             &fun_arg);

Page 98: Lecture 2 (Mapping Applications to Multi-core Arch)

POSIX Thread Creation (2)
thread_id is the thread id or handle (used to halt the thread, etc.)
thread_attribute holds various attributes
  Standard default values are obtained by passing a NULL pointer
thread_fun is the function to be run (takes and returns void*)
fun_arg is an argument that can be passed to thread_fun when it starts
errorcode will be set nonzero if the create operation fails

Page 99: Lecture 2 (Mapping Applications to Multi-core Arch)

Simple Threading Example

#include <stdio.h>
#include <pthread.h>

void* SayHello(void *foo) {
    printf("Hello, world!\n");
    return NULL;
}

int main() {
    pthread_t threads[16];
    int tn;
    for (tn = 0; tn < 16; tn++) {
        pthread_create(&threads[tn], NULL, SayHello, NULL);
    }
    for (tn = 0; tn < 16; tn++) {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}

Compile using gcc -lpthread

Page 100: Lecture 2 (Mapping Applications to Multi-core Arch)

Loop Level Parallelism
Many scientific applications have parallelism in loops
With threads:
  ... my_stuff[n][n];
  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          ... pthread_create(update_cell, ..., my_stuff[i][j]);
But the overhead of thread creation is nontrivial
  update_cell should have a significant amount of work, 1/p-th of the total if possible
  Also need to pass i and j to the thread (see the sketch below)
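One common way to get both the 1/p-sized chunks and the loop indices into each thread is to pass a small per-thread struct. A hedged sketch, reusing the slide's my_stuff and update_cell names as placeholders (the chunking scheme and struct layout are illustrative):

  #include <pthread.h>

  #define P 4                              /* number of worker threads */

  typedef struct {
      int row_begin, row_end;              /* this thread's chunk of i values */
      int n;
      double **my_stuff;                   /* shared data, from the slide */
  } work_t;

  extern void update_cell(double **my_stuff, int i, int j);   /* placeholder */

  static void *worker(void *arg)
  {
      work_t *w = (work_t *)arg;
      for (int i = w->row_begin; i < w->row_end; i++)    /* ~1/P of the rows */
          for (int j = 0; j < w->n; j++)
              update_cell(w->my_stuff, i, j);
      return NULL;
  }

  void parallel_update(double **my_stuff, int n)
  {
      pthread_t tid[P];
      work_t w[P];
      for (int t = 0; t < P; t++) {
          w[t] = (work_t){ t * n / P, (t + 1) * n / P, n, my_stuff };
          pthread_create(&tid[t], NULL, worker, &w[t]);  /* one thread per chunk */
      }
      for (int t = 0; t < P; t++)
          pthread_join(tid[t], NULL);
  }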

Page 101: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Data and Threads
Variables declared outside of main are shared
Objects allocated on the heap may be shared (if a pointer is passed)
Variables on the stack are private:
  passing pointers to these to other threads can cause problems

Page 102: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Data and Threads (2)
Often done by creating a large "thread data" struct
  Passed into all threads as an argument
Simple example:

  char *message = "Hello World!\n";
  pthread_create(&thread1,
                 NULL,
                 (void*)&print_fun,
                 (void*) message);

Page 103: Lecture 2 (Mapping Applications to Multi-core Arch)

Setting Attribute Values
Once an initialized attribute object exists, changes can be made. For example:
To change the stack size for a thread to 8192 (before calling pthread_create), do this:
  pthread_attr_setstacksize(&my_attributes, (size_t)8192);
To get the stack size, do this:
  size_t my_stack_size;
  pthread_attr_getstacksize(&my_attributes, &my_stack_size);

Slide source: Theewara Vorakosit

Page 104: Lecture 2 (Mapping Applications to Multi-core Arch)

Other Attributes
Detached state: set if no other thread will use pthread_join to wait for this thread (improves efficiency)
Scheduling parameter(s): in particular, thread priority
Scheduling policy: FIFO or round robin
Contention scope: with what threads does this thread compete for a CPU
Stack address: explicitly dictate where the stack is located
Lazy stack allocation: allocate on demand (lazy) or all at once, "up front"

Page 105: Lecture 2 (Mapping Applications to Multi-core Arch)

Data Race Example
The problem is a race condition on variable s in the program
A race condition or data race occurs when:
  two processors (or two threads) access the same variable, and at least one does a write
  the accesses are concurrent (not synchronized), so they could happen simultaneously

  static int s = 0;

  Thread 1                     Thread 2
  for i = 0, n/2-1             for i = n/2, n-1
      s = s + f(A[i])              s = s + f(A[i])

Page 106: Lecture 2 (Mapping Applications to Multi-core Arch)


Basic Types of Synchronization: Barrier
Barrier: global synchronization
Especially common when running multiple copies of the same function in parallel, SPMD ("Single Program Multiple Data")
Simple use of barriers: all threads hit the same one

    work_on_my_subgrid();
    barrier;
    read_neighboring_values();
    barrier;

Page 107: Lecture 2 (Mapping Applications to Multi-core Arch)


Barrier (2)
More complicated: barriers on branches (or loops)

    if (tid % 2 == 0) {
        work1();
        barrier;
    } else {
        barrier;
    }

Barriers are not provided in all thread libraries

Page 108: Lecture 2 (Mapping Applications to Multi-core Arch)


Creating and Initializing a Barrier
To (dynamically) initialize a barrier, use code similar to this (which sets the number of threads to 3):

    pthread_barrier_t b;
    pthread_barrier_init(&b, NULL, 3);

The second argument specifies an object attribute; using NULL yields the default attributes.

Page 109: Lecture 2 (Mapping Applications to Multi-core Arch)


Creating and Initializing a Barrier
To wait at a barrier, a process executes:

    pthread_barrier_wait(&b);

This barrier could have been statically initialized by assigning an initial value created using the macro PTHREAD_BARRIER_INITIALIZER(3)
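Putting the two calls together, a minimal sketch of the SPMD pattern from the earlier barrier slide. Pthread barriers are an optional POSIX feature, so this assumes a platform that provides them (older systems may also need a feature-test macro such as _XOPEN_SOURCE 600):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3

    static pthread_barrier_t b;

    static void *phase_worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: phase 1 (work_on_my_subgrid)\n", id);
        pthread_barrier_wait(&b);        /* nobody proceeds until all finish phase 1 */
        printf("thread %ld: phase 2 (read_neighboring_values)\n", id);
        pthread_barrier_wait(&b);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&b, NULL, NTHREADS);   /* NULL attr = defaults, count = 3 */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, phase_worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&b);
        return 0;
    }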

Page 110: Lecture 2 (Mapping Applications to Multi-core Arch)


Basic Types of Synchronization: Mutexes
Mutexes: mutual exclusion, aka locks
Threads are working mostly independently but need to access a common data structure

    lock *l = alloc_and_init();   /* shared */
    acquire(l);
    access data
    release(l);

Page 111: Lecture 2 (Mapping Applications to Multi-core Arch)


Mutexes (2)
Java and other languages have lexically scoped synchronization, similar to the cobegin/coend vs. fork-and-join tradeoff
Semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion
Locks only affect the processors using them: pair-wise synchronization

Page 112: Lecture 2 (Mapping Applications to Multi-core Arch)


Mutexes in POSIX Threads
To create a mutex:

    #include <pthread.h>
    pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;   /* static initialization */
    /* or, dynamically: pthread_mutex_init(&amutex, NULL); */

To use it:

    pthread_mutex_lock(&amutex);
    /* ... critical section ... */
    pthread_mutex_unlock(&amutex);

Both calls return an int error code (0 on success).
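As a sketch, here is how a mutex can remove the data race on the shared sum s from the earlier example; f, A, and the thread ranges are placeholders:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000

    static int A[N];
    static int s = 0;
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static int f(int x) { return x * x; }        /* placeholder for the real f */

    typedef struct { int lo, hi; } range_t;

    static void *sum_range(void *arg)
    {
        range_t *r = (range_t *)arg;
        int local = 0;                           /* accumulate privately ... */
        for (int i = r->lo; i < r->hi; i++)
            local += f(A[i]);
        pthread_mutex_lock(&s_lock);             /* ... then update s under the lock */
        s += local;
        pthread_mutex_unlock(&s_lock);
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) A[i] = 1;
        pthread_t t1, t2;
        range_t r1 = { 0, N / 2 }, r2 = { N / 2, N };
        pthread_create(&t1, NULL, sum_range, &r1);
        pthread_create(&t2, NULL, sum_range, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %d\n", s);
        return 0;
    }

Accumulating into a private local variable and taking the lock once per thread keeps lock contention low compared with locking inside the loop.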

Page 113: Lecture 2 (Mapping Applications to Multi-core Arch)


Mutexes in POSIX Threads (2)
To deallocate a mutex:

    int pthread_mutex_destroy(pthread_mutex_t *mutex);

Multiple mutexes may be held, but this can lead to deadlock:

    Thread 1        Thread 2
    lock(a)         lock(b)
    lock(b)         lock(a)
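A common way to avoid this deadlock is to impose a single global lock order and make every thread acquire mutexes in that order. The sketch below orders locks by address; lock_pair and unlock_pair are hypothetical helpers, not part of the pthreads API:

    #include <pthread.h>
    #include <stdint.h>

    /* Acquire two mutexes in a fixed global order (here: by address) so that
     * no two threads can ever wait on each other in a cycle. */
    void lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
    {
        if ((uintptr_t)m1 > (uintptr_t)m2) {
            pthread_mutex_t *tmp = m1; m1 = m2; m2 = tmp;
        }
        pthread_mutex_lock(m1);
        pthread_mutex_lock(m2);
    }

    void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
    {
        pthread_mutex_unlock(m1);
        pthread_mutex_unlock(m2);
    }

    /* Both thread1 and thread2 now call lock_pair(&a, &b) (in either argument
     * order); the locks are always taken in the same global order. */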

Page 114: Lecture 2 (Mapping Applications to Multi-core Arch)


Summary of Programming with Threads
POSIX Threads are based on OS features
Can be used from multiple languages
Familiar language for most of the program
Ability to share data is convenient
Pitfalls:
Intermittent data race bugs are very nasty to find
Deadlocks are usually easier to find, but can also be intermittent
OpenMP is commonly used today as an alternative

Page 115: Lecture 2 (Mapping Applications to Multi-core Arch)

Multi-Threaded Distributed Application Examples

Distributed Operating Systems, by Andrew S. Tanenbaum

Page 116: Lecture 2 (Mapping Applications to Multi-core Arch)


Multithreaded Clients
Distribution transparency:
Needed when a distributed system operates in a wide-area network environment
Need some mechanism to hide communication latency
Multithreading on the client side is useful:
One connection per thread
If one thread is blocked, others can do useful work
More responsive to the user
Example: a web browser
One thread connected to a server can fetch an HTML document
Another thread connected to the same server can fetch images while the first displays the text, scroll bars, etc.

Page 117: Lecture 2 (Mapping Applications to Multi-core Arch)


Multithreaded Servers (1)
A multithreaded server can be organized in a dispatcher/worker model: a dispatcher thread accepts incoming requests and hands each one to an idle worker thread (see the sketch below)
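A minimal sketch of the dispatcher/worker idea using a bounded request queue protected by a mutex and condition variables; the request type, queue size, and handle_request are hypothetical, and a real server would read requests from the network:

    #include <pthread.h>
    #include <stdio.h>

    #define QSIZE    8
    #define NWORKERS 3

    static int queue[QSIZE];                 /* hypothetical "requests" (just ints here) */
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t qlock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;

    static void handle_request(int req) { printf("handled request %d\n", req); }

    static void *worker(void *arg)           /* worker thread: block until work arrives */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (count == 0)
                pthread_cond_wait(&nonempty, &qlock);
            int req = queue[head];
            head = (head + 1) % QSIZE;
            count--;
            pthread_cond_signal(&nonfull);
            pthread_mutex_unlock(&qlock);
            handle_request(req);             /* may block on I/O without stalling the others */
        }
        return NULL;
    }

    static void dispatch(int req)            /* dispatcher: hand the request to some worker */
    {
        pthread_mutex_lock(&qlock);
        while (count == QSIZE)
            pthread_cond_wait(&nonfull, &qlock);
        queue[tail] = req;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&qlock);
    }

    int main(void)
    {
        pthread_t w[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&w[i], NULL, worker, NULL);
        for (int req = 0; req < 10; req++)   /* stand-in for the dispatcher's network loop */
            dispatch(req);
        pthread_exit(NULL);                  /* a server runs forever; workers keep waiting */
    }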

Page 118: Lecture 2 (Mapping Applications to Multi-core Arch)


Multithreaded Servers (2)
Three ways to construct a server:

    Model                       Characteristics
    Threads                     Parallelism, blocking system calls
    Single-threaded process     No parallelism, blocking system calls
    Finite-state machine        Parallelism, nonblocking system calls

Page 119: Lecture 2 (Mapping Applications to Multi-core Arch)


Clients
Anatomy of a client process:
User interface
A major task for most clients is to interact with human users
Provides a means to interact with a remote server
An important class: Graphical User Interfaces (GUIs)
Client-side software for distribution transparency
Example: the X Window System
Used to control bit-mapped devices: monitor, keyboard, and a pointing device
The X kernel (X server) contains the hardware-specific details and device drivers
X uses an event-driven approach:
Captures events from devices
Provides an interface in the form of Xlib for GUI/graphics applications
Two types of applications: normal and window manager

Page 120: Lecture 2 (Mapping Applications to Multi-core Arch)


The X-Window System
The basic organization of the X Window System

Page 121: Lecture 2 (Mapping Applications to Multi-core Arch)


User Interface: Compound Documents
The function of a user interface is more than just interacting with users!
It may allow multiple applications to share a single graphical window and use that window to exchange data through user actions
Typical examples:
Drag and drop: drag an icon representing a file onto a trash can icon; the application associated with the trash can will be activated to delete the file
In-place editing: an image within a text document in a word processor; clicking on the image can activate a drawing tool
The compound-document notion of a user interface:
A collection of different documents (text, images, spreadsheets)
Seamlessly integrated through the user interface
Different applications operate on different parts of the document

Page 122: Lecture 2 (Mapping Applications to Multi-core Arch)


Client-Side Software for Distribution Transparency
A possible approach to transparent replication of a remote object using a client-side solution:
A proxy replicates requests to all replicated servers and forms a single response for the client application (replication transparency)
Failure transparency is also possible through client middleware

Page 123: Lecture 2 (Mapping Applications to Multi-core Arch)


Servers
Organization of a server process:
Design issues of a server
Object servers: alternatives for invoking objects, object adapter
General design of a server:
Iterative server: handles all requests itself; if necessary, returns a response to the requesting user
Concurrent server: does not handle the request itself, but passes it to a separate thread or process and waits for the next request (see the sketch below)
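A sketch of a concurrent server in the thread-per-connection style: the main loop only accepts a connection, hands it to a new detached thread, and immediately waits for the next request. The echo behaviour and port number are placeholders, and error checking is omitted for brevity:

    #include <pthread.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>
    #include <unistd.h>

    static void *handle_client(void *arg)            /* one thread per accepted connection */
    {
        int fd = (int)(long)arg;
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);       /* placeholder request handling ... */
        if (n > 0)
            write(fd, buf, n);                       /* ... echo it back as the "response" */
        close(fd);
        return NULL;
    }

    int main(void)
    {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(12345);         /* hypothetical port */
        bind(listener, (struct sockaddr *)&addr, sizeof addr);
        listen(listener, 16);

        for (;;) {                                   /* concurrent server main loop */
            int client = accept(listener, NULL, NULL);
            if (client < 0)
                continue;
            pthread_t tid;
            pthread_create(&tid, NULL, handle_client, (void *)(long)client);
            pthread_detach(tid);                     /* do not join; go wait for the next request */
        }
    }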

Page 124: Lecture 2 (Mapping Applications to Multi-core Arch)


Servers: General Design Issues
Client-to-server binding using a daemon, as in DCE
Client-to-server binding using a superserver, as in UNIX
Other distinctions: stateless server vs. stateful server

Page 125: Lecture 2 (Mapping Applications to Multi-core Arch)


Key Takeaways of this Session
A wealth of knowledge exists about developing parallel applications, on legacy parallel architectures and for high-performance computing (HPC) applications
These techniques are applicable to multi-core:
Similar decomposition, assignment, orchestration, and mapping
Shared address space programming
A wider range of applications is the topic for the next session