Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms
Lecture 2: Mapping Applications to Multi-core Architectures

Page 1: Lecture 2 (Mapping Applications to Multi-core Arch)

Programming Multi-Core Processors based Embedded Systems
A Hands-On Experience on Cavium Octeon based Platforms
Lecture 2: Mapping Applications to Multi-core Architectures

Page 2: Lecture 2 (Mapping Applications to Multi-core Arch)

Course Outline
Introduction
Multi-threading on multi-core processors
Developing parallel applications
  Introduction to POSIX based multi-threading
  Multi-threaded application examples
Applications for multi-core processors
  Application layer computing on multi-core
  Performance measurement and tuning

Page 3: Lecture 2 (Mapping Applications to Multi-core Arch)

Agenda for Today
Mapping applications to multi-core architectures
Parallel programming using threads
POSIX multi-threading
Using multi-threading for parallel programming

Page 4: Lecture 2 (Mapping Applications to Multi-core Arch)

Mapping Applications to Multi-Core Architectures

Chapter 2, David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Page 5: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallelization
Assumption: a sequential algorithm is given
  Sometimes a very different algorithm is needed, but that is beyond our scope
Pieces of the job:
  Identify work that can be done in parallel
  Partition work and perhaps data among processes
  Manage data access, communication and synchronization
  Note: work includes computation, data access and I/O
Main goal: speedup (plus low programming effort and resource needs)
  Speedup(p) = Performance(p) / Performance(1)
  For a fixed problem: Speedup(p) = Time(1) / Time(p)

Page 6: Lecture 2 (Mapping Applications to Multi-core Arch)

Steps in Creating a Parallel Program

4 steps: Decomposition, Assignment, Orchestration, Mapping
Done by the programmer or by system software (compiler, runtime, ...)
Issues are the same, so assume the programmer does it all explicitly

[Figure: the sequential computation is decomposed into tasks, tasks are assigned to processes (partitioning), orchestration turns them into a parallel program, and processes are mapped onto processors p0..p3.]

Page 7: Lecture 2 (Mapping Applications to Multi-core Arch)

Some Important Concepts
Task:
  Arbitrary piece of undecomposed work in a parallel computation
  Executed sequentially; concurrency is only across tasks
  E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
  Fine-grained versus coarse-grained tasks
Process (thread):
  Abstract entity that performs the tasks assigned to it
  Processes communicate and synchronize to perform their tasks
Processor:
  Physical engine on which a process executes
  Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors

Page 8: Lecture 2 (Mapping Applications to Multi-core Arch)

Decomposition
Break up the computation into tasks to be divided among processes
  Tasks may become available dynamically
  The number of available tasks may vary with time
i.e., identify concurrency and decide the level at which to exploit it
Goal: enough tasks to keep processes busy, but not too many
  The number of tasks available at a time is an upper bound on achievable speedup

Page 9: Lecture 2 (Mapping Applications to Multi-core Arch)

Limited Concurrency: Amdahl's Law
Most fundamental limitation on parallel speedup
If a fraction s of the sequential execution is inherently serial, speedup <= 1/s
Example: 2-phase calculation
  sweep over an n-by-n grid and do some independent computation
  sweep again and add each value into a global sum
Time for the first phase = n^2/p; the second phase is serialized at the global variable, so its time = n^2
  Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
Trick: divide the second phase into two
  accumulate into a private sum during the sweep
  add the per-process private sums into the global sum
Parallel time is then n^2/p + n^2/p + p, and speedup is at best 2n^2 p / (2n^2 + p^2)
  e.g. for n = 1000 and p = 100, the bound improves from about 1.98 to about 99.5

Page 10: Lecture 2 (Mapping Applications to Multi-core Arch)

Pictorial Depiction

[Figure: work done concurrently versus time for (a) the fully serialized second phase (n^2/p followed by n^2), (b) the second phase serialized only at the global variable, and (c) private sums accumulated during the sweep, giving parallel time n^2/p + n^2/p + p.]

Page 11: Lecture 2 (Mapping Applications to Multi-core Arch)

Concurrency Profiles
Cannot usually divide a program into a strictly serial and a strictly parallel part
Area under the curve is the total work done, or the time with 1 processor
Horizontal extent is a lower bound on time (with infinite processors)
Speedup is the ratio:
  Speedup(p) = (sum over k of f_k * k) / (sum over k of f_k * ceil(k/p)), where f_k is the amount of work with concurrency k
Base case (Amdahl's law): Speedup(p) = 1 / (s + (1 - s)/p)
Amdahl's law applies to any overhead, not just limited concurrency

[Figure: concurrency profile of a sample application, plotting concurrency against clock cycle number.]

Page 12: Lecture 2 (Mapping Applications to Multi-core Arch)

Assignment
Specifying the mechanism to divide work among processes
  E.g. which process computes forces on which stars, or which rays
  Together with decomposition, also called partitioning
  Balance the workload, reduce communication and management cost
Structured approaches usually work well
  Code inspection (parallel loops) or understanding of the application
  Well-known heuristics
  Static versus dynamic assignment
As programmers, we worry about partitioning first
  Usually independent of architecture or programming model
  But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it

Page 13: Lecture 2 (Mapping Applications to Multi-core Arch)

Orchestration
Includes:
  Naming data
  Structuring communication
  Synchronization
  Organizing data structures and scheduling tasks temporally
Goals:
  Reduce the cost of communication and synchronization as seen by processors
  Preserve locality of data reference (including data structure organization)
  Schedule tasks to satisfy dependences early
  Reduce the overhead of parallelism management
Closest to the architecture (and programming model and language)
  Choices depend a lot on the communication abstraction and efficiency of primitives
  Architects should provide appropriate primitives efficiently

Page 14: Lecture 2 (Mapping Applications to Multi-core Arch)

Mapping
After orchestration, we already have a parallel program
Two aspects of mapping:
  Which processes will run on the same processor, if necessary
  Which process runs on which particular processor (mapping to a network topology)
One extreme: space-sharing
  Machine divided into subsets, only one application at a time in a subset
  Processes can be pinned to processors, or left to the OS
Another extreme: complete resource management control given to the OS
  The OS uses the performance techniques we will discuss later
The real world is between the two
  The user specifies desires in some aspects; the system may ignore them
Usually adopt the view: process <-> processor

Page 15: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallelizing Computation vs. Data
The view above is centered around computation
  Computation is decomposed and assigned (partitioned)
Partitioning data is often a natural view too
  Computation follows data: owner computes
  Grid example; data mining; High Performance Fortran (HPF)
But not general enough
  The distinction between computation and data is stronger in many applications (Barnes-Hut, Raytrace, seen later)
  Retain the computation-centric view; data access and communication are part of orchestration

Page 16: Lecture 2 (Mapping Applications to Multi-core Arch)

High-level Goals
High performance (speedup over the sequential program)
But low resource usage and development effort
Implications for algorithm designers and architects
  Algorithm designers: high performance, low resource needs
  Architects: high performance, low cost, reduced programming effort
    e.g. gradually improving performance with programming effort may be preferable to a sudden threshold after large programming effort

Table 2.1 Steps in the Parallelization Process and Their Goals
  Step           Architecture-Dependent?   Major Performance Goals
  Decomposition  Mostly no                 Expose enough concurrency, but not too much
  Assignment     Mostly no                 Balance workload; reduce communication volume
  Orchestration  Yes                       Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
  Mapping        Yes                       Put related processes on the same processor if necessary; exploit locality in the network topology

Page 17: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallelization of an Example Program
Motivating problems all lead to large, complex programs
Examine a simplified version of a piece of the Ocean simulation
  Iterative equation solver
Illustrate the parallel program in a low-level parallel language
  C-like pseudocode with simple extensions for parallelism
  Expose basic communication and synchronization primitives that must be supported
  State of most real parallel programming today

Page 18: Lecture 2 (Mapping Applications to Multi-core Arch)

Grid Solver Example
Simplified version of the solver in the Ocean simulation
Gauss-Seidel (near-neighbor) sweeps to convergence
  Interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep
  Updates done in place in the grid, and the difference from the previous value computed
  Accumulate partial diffs into a global diff at the end of every sweep
  Check whether the error has converged (to within a tolerance parameter)
  If so, exit the solver; if not, do another sweep
Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

Page 19: Lecture 2 (Mapping Applications to Multi-core Arch)

1.  int n;                  /* size of matrix: (n+2)-by-(n+2) elements */
2.  float **A, diff = 0;

3.  main()
4.  begin
5.      read(n);            /* read input parameter: matrix size */
6.      A <- malloc(a 2-d array of size n+2 by n+2 doubles);
7.      initialize(A);      /* initialize the matrix A somehow */
8.      Solve(A);           /* call the routine to solve equation */
9.  end main

10. procedure Solve(A)      /* solve the equation system */
11.     float **A;          /* A is an (n+2)-by-(n+2) array */
12. begin
13.     int i, j, done = 0;
14.     float diff = 0, temp;
15.     while (!done) do            /* outermost loop over sweeps */
16.         diff = 0;               /* initialize maximum difference to 0 */
17.         for i <- 1 to n do      /* sweep over nonborder points of grid */
18.             for j <- 1 to n do
19.                 temp = A[i,j];  /* save old value of element */
20.                 A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                           A[i,j+1] + A[i+1,j]);   /* compute average */
22.                 diff += abs(A[i,j] - temp);
23.             end for
24.         end for
25.         if (diff/(n*n) < TOL) then done = 1;
26.     end while
27. end procedure

Page 20: Lecture 2 (Mapping Applications to Multi-core Arch)

Decomposition
A simple way to identify concurrency is to look at loop iterations
  Dependence analysis; if not enough concurrency, then look further
Not much concurrency here at this level (all loops sequential)
Examine fundamental dependences, ignoring loop structure
  Concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
  Retain the loop structure and use point-to-point synch; problem: too many synch operations
  Restructure the loops and use global synch; imbalance and too much synch

Page 21: Lecture 2 (Mapping Applications to Multi-core Arch)

Exploit Application Knowledge
Reorder the grid traversal: red-black ordering
  Different ordering of updates: may converge quicker or slower
  Red sweep and black sweep are each fully parallel
  Global synch between them (conservative but convenient)
  Ocean uses red-black; we use a simpler, asynchronous ordering to illustrate:
    no red-black, simply ignore dependences within a sweep
    sequential order same as original; the parallel program is nondeterministic

[Figure: grid with alternating red and black points.]

Page 22: Lecture 2 (Mapping Applications to Multi-core Arch)

Decomposition Only
Decomposition into elements: degree of concurrency n^2
To decompose into rows, make the line 18 loop sequential; degree n
for_all leaves assignment to the system
  but implicit global synch at the end of each for_all loop

15. while (!done) do                /* a sequential loop */
16.     diff = 0;
17.     for_all i <- 1 to n do      /* a parallel loop nest */
18.         for_all j <- 1 to n do
19.             temp = A[i,j];
20.             A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                       A[i,j+1] + A[i+1,j]);
22.             diff += abs(A[i,j] - temp);
23.         end for_all
24.     end for_all
25.     if (diff/(n*n) < TOL) then done = 1;
26. end while

Page 23: Lecture 2 (Mapping Applications to Multi-core Arch)

Assignment
Static assignments (given decomposition into rows):
  Block assignment of rows: row i is assigned to process floor(i / (n/p))
  Cyclic assignment of rows: process i is assigned rows i, i+p, and so on
Dynamic assignment:
  Get a row index, work on the row, get a new row, and so on
Static assignment into rows reduces concurrency (from n to p)
  Block assignment reduces communication by keeping adjacent rows together
Let's dig into orchestration under three programming models

[Figure: block assignment of contiguous groups of rows to processes P0, P1, P2, ...]

Page 24: Lecture 2 (Mapping Applications to Multi-core Arch)

Data Parallel Solver

1.   int n, nprocs;         /* grid size (n+2)-by-(n+2) and number of processes */
2.   float **A, diff = 0;

3.   main()
4.   begin
5.       read(n); read(nprocs);     /* read input grid size and number of processes */
6.       A <- G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.       initialize(A);             /* initialize the matrix A somehow */
8.       Solve(A);                  /* call the routine to solve equation */
9.   end main

10.  procedure Solve(A)             /* solve the equation system */
11.      float **A;                 /* A is an (n+2)-by-(n+2) array */
12.  begin
13.      int i, j, done = 0;
14.      float mydiff = 0, temp;
14a.     DECOMP A[BLOCK,*,nprocs];
15.      while (!done) do           /* outermost loop over sweeps */
16.          mydiff = 0;            /* initialize maximum difference to 0 */
17.          for_all i <- 1 to n do     /* sweep over non-border points of grid */
18.              for_all j <- 1 to n do
19.                  temp = A[i,j];     /* save old value of element */
20.                  A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                            A[i,j+1] + A[i+1,j]);   /* compute average */
22.                  mydiff += abs(A[i,j] - temp);
23.              end for_all
24.          end for_all
24a.         REDUCE(mydiff, diff, ADD);
25.          if (diff/(n*n) < TOL) then done = 1;
26.      end while
27.  end procedure

Page 25: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Address Space Solver
Single Program Multiple Data (SPMD)
Assignment controlled by the values of variables used as loop bounds

[Figure: each process runs Solve on its portion of the grid; all processes sweep, then jointly test convergence.]

Page 26: Lecture 2 (Mapping Applications to Multi-core Arch)

1.   int n, nprocs;         /* matrix dimension and number of processors to be used */
2a.  float **A, diff;       /* A is global (shared) array representing the grid */
                            /* diff is global (shared) maximum difference in current sweep */
2b.  LOCKDEC(diff_lock);    /* declaration of lock to enforce mutual exclusion */
2c.  BARDEC(bar1);          /* barrier declaration for global synchronization between sweeps */

3.   main()
4.   begin
5.       read(n); read(nprocs);     /* read input matrix size and number of processes */
6.       A <- G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.       initialize(A);             /* initialize A in an unspecified way */
8a.      CREATE(nprocs-1, Solve, A);
8.       Solve(A);                  /* main process becomes a worker too */
8b.      WAIT_FOR_END(nprocs-1);    /* wait for all child processes created to terminate */
9.   end main

10.  procedure Solve(A)
11.      float **A;         /* A is entire n+2-by-n+2 shared array, as in the sequential program */
12.  begin
13.      int i, j, pid, done = 0;
14.      float temp, mydiff = 0;            /* private variables */
14a.     int mymin = 1 + (pid * n/nprocs);  /* assume that n is exactly divisible by */
14b.     int mymax = mymin + n/nprocs - 1;  /* nprocs for simplicity here */

15.      while (!done) do           /* outer loop over sweeps */
16.          mydiff = diff = 0;     /* set global diff to 0 (okay for all to do it) */
16a.         BARRIER(bar1, nprocs); /* ensure all reach here before anyone modifies diff */
17.          for i <- mymin to mymax do     /* for each of my rows */
18.              for j <- 1 to n do         /* for all nonborder elements in that row */
19.                  temp = A[i,j];
20.                  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                           A[i,j+1] + A[i+1,j]);
22.                  mydiff += abs(A[i,j] - temp);
23.              endfor
24.          endfor
25a.         LOCK(diff_lock);       /* update global diff if necessary */
25b.         diff += mydiff;
25c.         UNLOCK(diff_lock);
25d.         BARRIER(bar1, nprocs); /* ensure all reach here before checking if done */
25e.         if (diff/(n*n) < TOL) then done = 1;   /* check convergence; all get same answer */
25f.         BARRIER(bar1, nprocs);
26.      endwhile
27.  end procedure

Page 27: Lecture 2 (Mapping Applications to Multi-core Arch)

Notes on SAS Program
SPMD: not lockstep, or even necessarily the same instructions
Assignment controlled by the values of variables used as loop bounds
  unique pid per process, used to control assignment
"Done" condition evaluated redundantly by all processes
Code that does the update is identical to the sequential program
  each process has a private mydiff variable
The most interesting special operations are for synchronization
  accumulations into the shared diff have to be mutually exclusive
  why the need for all the barriers?

Page 28: Lecture 2 (Mapping Applications to Multi-core Arch)

Need for Mutual Exclusion
Code each process executes:
  load the value of diff into register r1
  add the register r2 to register r1
  store the value of register r1 into diff
A possible interleaving:
  P1                                  P2
  r1 <- diff   {P1 gets 0 in its r1}
                                      r1 <- diff   {P2 also gets 0}
  r1 <- r1+r2  {P1 sets its r1 to 1}
                                      r1 <- r1+r2  {P2 sets its r1 to 1}
  diff <- r1   {P1 sets diff to 1}
                                      diff <- r1   {P2 also sets diff to 1}
Need the sets of operations to be atomic (mutually exclusive)

Page 29: Lecture 2 (Mapping Applications to Multi-core Arch)

Global Event Synchronization
BARRIER(nprocs): wait here till nprocs processes get here
  Built using lower-level primitives
  Global sum example: wait for all to accumulate before using the sum
  Often used to separate phases of computation

  Process P_1            Process P_2            Process P_nprocs
  set up eqn system      set up eqn system      set up eqn system
  Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)
  solve eqn system       solve eqn system       solve eqn system
  Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)
  apply results          apply results          apply results
  Barrier(name, nprocs)  Barrier(name, nprocs)  Barrier(name, nprocs)

A conservative form of preserving dependences, but easy to use
WAIT_FOR_END(nprocs-1)

Page 30: Lecture 2 (Mapping Applications to Multi-core Arch)

Pt-to-pt Event Synch (Not Used Here)
One process notifies another of an event so it can proceed
  Common example: producer-consumer (bounded buffer)
  Concurrent programming on a uniprocessor: semaphores
  Shared address space parallel programs: semaphores, or ordinary variables used as flags

  P1                             P2
  A = 1;                     a:  while (flag is 0) do nothing;
  b: flag = 1;                   print A;

Busy-waiting or spinning
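On a real machine, using an ordinary variable as the flag is unsafe because the compiler and hardware may reorder the writes to A and flag. A hedged C11 sketch of the same handshake (A and flag come from the slide; the release/acquire ordering is an implementation detail not in the original):

  #include <stdio.h>
  #include <stdatomic.h>

  int A = 0;
  atomic_int flag = 0;

  void producer(void)            /* P1 */
  {
      A = 1;                     /* produce the data */
      atomic_store_explicit(&flag, 1, memory_order_release);  /* b: flag = 1 */
  }

  void consumer(void)            /* P2 */
  {
      while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
          ;                      /* a: busy-wait (spin) until flag is set */
      printf("%d\n", A);         /* guaranteed to observe A = 1 */
  }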

Page 31: Lecture 2 (Mapping Applications to Multi-core Arch)

Group Event Synchronization
Subset of processes involved
Can use flags or barriers (involving only the subset)
Concept of producers and consumers
Major types:
  Single-producer, multiple-consumer
  Multiple-producer, single-consumer
  Multiple-producer, multiple-consumer

Page 32: Lecture 2 (Mapping Applications to Multi-core Arch)

Message Passing Grid Solver
Cannot declare A to be a shared array any more
Need to compose it logically from per-process private arrays
  Usually allocated in accordance with the assignment of work
  A process assigned a set of rows allocates them locally
Transfers of entire rows between traversals
Structurally similar to SAS (e.g. SPMD), but orchestration is different
  Data structures and data access/naming
  Communication
  Synchronization

Page 33: Lecture 2 (Mapping Applications to Multi-core Arch)

1.   int pid, n, b;         /* process id, matrix dimension and number of processors to be used */
2.   float **myA;
3.   main()
4.   begin
5.       read(n); read(nprocs);     /* read input matrix size and number of processes */
8a.      CREATE(nprocs-1, Solve);
8b.      Solve();                   /* main process becomes a worker too */
8c.      WAIT_FOR_END(nprocs-1);    /* wait for all child processes created to terminate */
9.   end main

10.  procedure Solve()
11.  begin
13.      int i, j, pid, n' = n/nprocs, done = 0;
14.      float temp, tempdiff, mydiff = 0;      /* private variables */
6.       myA <- malloc(a 2-d array of size [n/nprocs + 2] by n+2);   /* my assigned rows of A */
7.       initialize(myA);           /* initialize my rows of A, in an unspecified way */

15.      while (!done) do
16.          mydiff = 0;            /* set local diff to 0 */
16a.         if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.         if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.         if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.         if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
             /* border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*] */
17.          for i <- 1 to n' do    /* for each of my (nonghost) rows */
18.              for j <- 1 to n do /* for all nonborder elements in that row */
19.                  temp = myA[i,j];
20.                  myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                            myA[i,j+1] + myA[i+1,j]);
22.                  mydiff += abs(myA[i,j] - temp);
23.              endfor
24.          endfor
             /* communicate local diff values and determine if done;
                can be replaced by reduction and broadcast */
25a.         if (pid != 0) then             /* process 0 holds global total diff */
25b.             SEND(mydiff, sizeof(float), 0, DIFF);
25c.             RECEIVE(done, sizeof(int), 0, DONE);
25d.         else                           /* pid 0 does this */
25e.             for i <- 1 to nprocs-1 do  /* for each other process */
25f.                 RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.                 mydiff += tempdiff;    /* accumulate into total */
25h.             endfor
25i.             if (mydiff/(n*n) < TOL) then done = 1;
25j.             for i <- 1 to nprocs-1 do  /* for each other process */
25k.                 SEND(done, sizeof(int), i, DONE);
25l.             endfor
25m.         endif
26.      endwhile
27.  end procedure

Page 34: Lecture 2 (Mapping Applications to Multi-core Arch)

Notes on Message Passing Program
Use of ghost rows
Receive does not transfer data, send does
  Unlike SAS, which is usually receiver-initiated (a load fetches the data)
Communication done at the beginning of an iteration, so no asynchrony
Communication in whole rows, not an element at a time
Core similar, but indices/bounds in local rather than global space
Synchronization through sends and receives
  Update of global diff and event synch for the done condition
  Could implement locks and barriers with messages
Can use REDUCE and BROADCAST library calls to simplify the code:

     /* communicate local diff values and determine if done, using reduction and broadcast */
25b. REDUCE(0, mydiff, sizeof(float), ADD);
25c. if (pid == 0) then
25i.     if (mydiff/(n*n) < TOL) then done = 1;
25k. endif
25m. BROADCAST(0, done, sizeof(int), DONE);
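For comparison, in a concrete message-passing library such as MPI the reduction and broadcast of the convergence test collapse into a single collective. A hedged sketch, assuming MPI is available (MPI is not part of the original pseudocode; check_converged is an illustrative helper):

  #include <mpi.h>

  /* mydiff holds this process's partial sum of |A[i,j] - temp|. */
  int check_converged(float mydiff, int n, float tol)
  {
      float diff;
      /* Sum the partial diffs from all processes and give every process
       * the result: REDUCE + BROADCAST in one call.                    */
      MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
      return (diff / ((float)n * n)) < tol;    /* same done flag on all ranks */
  }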

Page 35: Lecture 2 (Mapping Applications to Multi-core Arch)

Send and Receive Alternatives
Can extend functionality: stride, scatter-gather, groups
Semantic flavors: based on when control is returned
  Affect when data structures or buffers can be reused at either end
  Affect event synch (mutual exclusion by fiat: only one process touches the data)
  Affect ease of programming and performance
Synchronous messages provide built-in synch through the match
  Separate event synchronization is needed with asynchronous messages
With synchronous messages, our code is deadlocked. Fix?

  Send/Receive
    Synchronous
    Asynchronous
      Blocking asynchronous
      Nonblocking asynchronous

Page 36: Lecture 2 (Mapping Applications to Multi-core Arch)

Orchestration: Summary
Shared address space
  Shared and private data explicitly separate
  Communication implicit in access patterns
  No correctness need for data distribution
  Synchronization via atomic operations on shared data
  Synchronization explicit and distinct from data communication
Message passing
  Data distribution among local address spaces needed
  No explicit shared structures (implicit in communication patterns)
  Communication is explicit
  Synchronization implicit in communication (at least in the synchronous case)
  Mutual exclusion by fiat

Page 37: Lecture 2 (Mapping Applications to Multi-core Arch)

Correctness in Grid Solver Program
Decomposition and assignment are similar in SAS and message passing
Orchestration is different
  Data structures, data access/naming, communication, synchronization

                                          SAS        Msg-Passing
  Explicit global data structure?         Yes        No
  Assignment independent of data layout?  Yes        No
  Communication                           Implicit   Explicit
  Synchronization                         Explicit   Implicit
  Explicit replication of border rows?    No         Yes

Page 38: Lecture 2 (Mapping Applications to Multi-core Arch)

Programming for Performance

Chapter 3, David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Page 39: Lecture 2 (Mapping Applications to Multi-core Arch)

Outline
Programming techniques for performance
  Partitioning for performance
  Relationship of communication, data locality and architecture
  Programming for performance
For each issue:
  Techniques to address it, and tradeoffs with previous issues
  Application to the grid solver
  Some architectural implications
Components of execution time as seen by the processor
  What the workload looks like to the architecture, and how it relates to software issues
Implications for programming models

Page 40: Lecture 2 (Mapping Applications to Multi-core Arch)

Partitioning for Performance
Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work
Even these algorithmic issues trade off:
  Minimize communication => run on 1 processor => extreme load imbalance
  Maximize load balance => random assignment of tiny tasks => no control over communication
  A good partition may imply extra work to compute or manage it
The goal is to compromise
  Fortunately, often not difficult in practice

Page 41: Lecture 2 (Mapping Applications to Multi-core Arch)

Load Balance and Synch Wait Time
Limit on speedup:
  Speedup_problem(p) < Sequential Work / Max Work on any Processor
Work includes data access and other costs
Not just equal work: processors must also be busy at the same time
Four parts to load balance and reducing synch wait time:
  1. Identify enough concurrency
  2. Decide how to manage it
  3. Determine the granularity at which to exploit it
  4. Reduce serialization and the cost of synchronization

Page 42: Lecture 2 (Mapping Applications to Multi-core Arch)

Identifying Concurrency
Techniques seen for the equation solver:
  Loop structure, fundamental dependences, new algorithms
Data parallelism versus function parallelism
Often see orthogonal levels of parallelism; e.g. VLSI routing

[Figure: VLSI routing hierarchy; wire W2 expands to segments S21..S26, and segment S23 expands to routes.]

Page 43: Lecture 2 (Mapping Applications to Multi-core Arch)

Identifying Concurrency (Cont'd)
Function parallelism:
  Entire large tasks (procedures) that can be done in parallel, on the same or different data
  e.g. different independent grid computations in Ocean
  Pipelining, as in video encoding/decoding, or polygon rendering
  Degree usually modest and does not grow with input size
  Difficult to load balance
  Often used to reduce synch between data parallel phases
Most scalable programs are data parallel (per this loose definition)
  Function parallelism reduces synch between data parallel phases

Page 44: Lecture 2 (Mapping Applications to Multi-core Arch)

Deciding How to Manage Concurrency
Static versus dynamic techniques
Static:
  Algorithmic assignment based on input; won't change
  Low runtime overhead
  Computation must be predictable
  Preferable when applicable (except in multiprogrammed/heterogeneous environments)
Dynamic:
  Adapt at runtime to balance load
  Can increase communication and reduce locality
  Can increase task management overheads

Page 45: Lecture 2 (Mapping Applications to Multi-core Arch)

Dynamic Assignment
Profile-based (semi-static):
  Profile work distribution at runtime, and repartition dynamically
  Applicable in many computations, e.g. Barnes-Hut, some graphics
Dynamic tasking:
  Deal with unpredictability in the program or environment (e.g. Raytrace)
    computation, communication, and memory system interactions
    multiprogramming and heterogeneity
    used by runtime systems and the OS too
  Pool of tasks; take and add tasks until done
  e.g. "self-scheduling" of loop iterations (shared loop counter)
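"Self-scheduling" can be as simple as a shared counter that every worker advances atomically to claim its next iteration. A minimal C11 sketch under that assumption (process_row is a hypothetical placeholder for the real per-row work):

  #include <stdatomic.h>

  extern void process_row(int row);     /* placeholder for the real work */

  static atomic_int next_row = 0;       /* shared loop counter */

  void worker(int n)
  {
      for (;;) {
          /* fetch_add makes "read the counter, advance the counter" atomic,
           * so no two workers ever claim the same row.                     */
          int i = atomic_fetch_add(&next_row, 1);
          if (i >= n)
              break;                    /* pool of iterations exhausted */
          process_row(i);
      }
  }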

Page 46: Lecture 2 (Mapping Applications to Multi-core Arch)

Dynamic Tasking with Task Queues
Centralized versus distributed queues
Task stealing with distributed queues
  Can compromise communication and locality, and increase synchronization
  Whom to steal from, how many tasks to steal, ...
  Termination detection
  Maximum imbalance related to task size

[Figure: (a) a centralized task queue, into which all processes insert and from which all remove tasks; (b) distributed task queues, one per process, where each process inserts into and removes from its own queue and others may steal.]

Page 47: Lecture 2 (Mapping Applications to Multi-core Arch)

Determining Task Granularity
Task granularity: amount of work associated with a task
General rule:
  Coarse-grained => often less load balance
  Fine-grained => more overhead; often more communication and contention
Communication and contention are actually affected by assignment, not size
Overhead is affected by size itself too, particularly with task queues

Page 48: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Serialization
Be careful about assignment and orchestration (including scheduling)
Event synchronization
  Reduce use of conservative synchronization
    e.g. point-to-point instead of barriers, or finer granularity of point-to-point
  But fine-grained synch is more difficult to program, and means more synch operations
Mutual exclusion
  Separate locks for separate data
    e.g. locking records in a database: lock per process, record, or field
    lock per task in a task queue, not per queue
    finer grain => less contention/serialization, more space, less reuse
  Smaller, less frequent critical sections
    don't do reading/testing in the critical section, only modification
    e.g. searching for a task to dequeue in a task queue, building a tree
  Stagger critical sections in time

Page 49: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Inherent Communication
Communication is expensive!
Measure: communication-to-computation ratio
Focus here on inherent communication
  Determined by the assignment of tasks to processes
  Later we will see that actual communication can be greater
Assign tasks that access the same data to the same process
Solving communication and load balance together is NP-hard in the general case
  But simple heuristic solutions work well in practice
  Applications have structure!

Page 50: Lecture 2 (Mapping Applications to Multi-core Arch)

Domain Decomposition
Works well for scientific, engineering, graphics, ... applications
Exploits the local-biased nature of physical problems
  Information requirements are often short-range
  Or long-range but falling off with distance
Simple example: nearest-neighbor grid computation
  Perimeter-to-area communication-to-computation ratio (area-to-volume in 3-d)
  Depends on n and p: decreases with n, increases with p

[Figure: n-by-n grid partitioned into square blocks of side n/sqrt(p), one per process P0..P15.]

Page 51: Lecture 2 (Mapping Applications to Multi-core Arch)

Domain Decomposition (Cont'd)
Best domain decomposition depends on information requirements
Nearest-neighbor example: block versus strip decomposition
  Communication-to-computation ratio: 4*sqrt(p)/n for block, 2*p/n for strip
  Retain the block decomposition from here on
Application dependent: strip may be better in other cases
  e.g. particle flow in a tunnel

[Figure: n-by-n grid divided into p square blocks of side n/sqrt(p) versus p horizontal strips of n/p rows each.]

Page 52: Lecture 2 (Mapping Applications to Multi-core Arch)

Finding a Domain Decomposition
Static, by inspection
  Must be predictable: grid example
Static, but not by inspection
  Input-dependent, requires analyzing the input structure
  e.g. sparse matrix computations, data mining (assigning itemsets)
Semi-static (periodic repartitioning)
  Characteristics change, but slowly; e.g. Barnes-Hut
Static or semi-static, with dynamic task stealing
  Initial decomposition, but highly unpredictable; e.g. ray tracing

Page 53: Lecture 2 (Mapping Applications to Multi-core Arch)

Other Techniques
Scatter decomposition, e.g. the initial partition in Raytrace
Preserve locality in task stealing
  Steal large tasks for locality, steal from the same queues, ...

[Figure: domain decomposition versus scatter decomposition; in scatter decomposition the domain is divided into many small patches, and patches labeled 1-4 are dealt out to the four processes so each process's work is spread across the whole domain.]

Page 54: Lecture 2 (Mapping Applications to Multi-core Arch)

Implications of Comm-to-Comp Ratio
Architects examine application needs to see where to spend money
If the denominator is execution time, the ratio gives average bandwidth needs
If the denominator is operation count, the ratio gives the extremes in impact of latency and bandwidth
  Latency: assume no latency hiding
  Bandwidth: assume all latency hidden
  Reality is somewhere in between
The actual impact of communication depends on its structure and cost as well
Need to keep communication balanced across processors as well

  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)

Page 55: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Extra Work
Common sources of extra work:
  Computing a good partition (e.g. partitioning in Barnes-Hut or sparse matrix)
  Using redundant computation to avoid communication
  Task, data and process management overhead (applications, languages, runtime systems, OS)
  Imposing structure on communication (coalescing messages, allowing effective naming)
Architectural implications:
  Reduce the need for extra work by making communication and orchestration efficient

  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

Page 56: Lecture 2 (Mapping Applications to Multi-core Arch)

Memory-oriented View of Performance
Multiprocessor as an extended memory hierarchy, as seen by a given processor
Levels in the extended hierarchy:
  Registers, caches, local memory, remote memory (topology)
  Glued together by the communication architecture
  Levels communicate at a certain granularity of data transfer
Need to exploit spatial and temporal locality in the hierarchy
  Otherwise extra communication may also be caused
  Especially important since communication is expensive

Page 57: Lecture 2 (Mapping Applications to Multi-core Arch)

Uniprocessor Optimization
Performance depends heavily on the memory hierarchy
Time spent by a program:
  Time_prog(1) = Busy(1) + Data Access(1)
  Divide by cycles to get the CPI equation
Data access time can be reduced by:
  Optimizing the machine: bigger caches, lower latency, ...
  Optimizing the program: temporal and spatial locality

Page 58: Lecture 2 (Mapping Applications to Multi-core Arch)

Extended Hierarchy
Idealized view: local cache hierarchy + single main memory
But reality is more complex
  Centralized memory: caches of other processors
  Distributed memory: some local, some remote; plus network topology
  Management of levels:
    caches managed by hardware
    main memory depends on the programming model
      SAS: data movement between local and remote is transparent
      message passing: explicit
Levels closer to the processor are lower latency and higher bandwidth
Improve performance through architecture or program locality
  Tradeoff with parallelism; need good node performance and parallelism

Page 59: Lecture 2 (Mapping Applications to Multi-core Arch)

Artifactual Communication in the Extended Hierarchy
Accesses not satisfied in the local portion cause communication
  Inherent communication, implicit or explicit, causes transfers
    determined by the program
  Artifactual communication
    determined by program implementation and architecture interactions
    poor allocation of data across distributed memories
    unnecessary data in a transfer
    unnecessary transfers due to system granularities
    redundant communication of data
    finite replication capacity (in cache or main memory)
Inherent communication assumes unlimited capacity, small transfers, and perfect knowledge of what is needed
More on artifactual communication later; first consider replication-induced communication further

Page 60: Lecture 2 (Mapping Applications to Multi-core Arch)

Communication and Replication
Communication induced by finite capacity is the most fundamental artifact
  Like cache size and miss rate or memory traffic in uniprocessors
  The extended memory hierarchy view is useful for this relationship
View as a three-level hierarchy for simplicity
  Local cache, local memory, remote memory (ignore network topology)
Classify "misses" in the "cache" at any level as for uniprocessors:
  compulsory or cold misses (no size effect)
  capacity misses (size effect)
  conflict or collision misses (size effect)
  communication or coherence misses (no size effect)
Each may be helped or hurt by large transfer granularity (spatial locality)

Page 61: Lecture 2 (Mapping Applications to Multi-core Arch)

Orchestration for Performance
Reducing the amount of communication:
  Inherent: change logical data sharing patterns in the algorithm
  Artifactual: exploit spatial and temporal locality in the extended hierarchy
    Techniques often similar to those on uniprocessors
Structuring communication to reduce cost
Let's examine techniques for both...

Page 62: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Artifactual Communication
Message passing model
  Communication and replication are both explicit
  Even artifactual communication is in explicit messages
Shared address space model
  More interesting from an architectural perspective
  Occurs transparently due to interactions of program and system
    sizes and granularities in the extended memory hierarchy
Use the shared address space to illustrate the issues

Page 63: Lecture 2 (Mapping Applications to Multi-core Arch)

Exploiting Temporal Locality
Structure the algorithm so working sets map well to the hierarchy
  Often techniques to reduce inherent communication do well here
  Schedule tasks for data reuse once assigned
Multiple data structures in the same phase
  e.g. database records: local versus remote
Solver example: blocking
  More useful when there is O(n^(k+1)) computation on O(n^k) data
  Many linear algebra computations (factorization, matrix multiply)

[Figure: (a) unblocked access pattern in a sweep versus (b) blocked access pattern with B = 4.]
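A hedged sketch of what blocking the solver sweep can look like in C: the j loop is tiled into strips of B columns so that values reused by neighboring rows stay in cache across the strip (B, the array layout, and the function name are illustrative; as with the asynchronous solver earlier, this changes the update order within a sweep):

  #define B 4                                /* block (tile) width */

  /* Blocked sweep: walk the grid in vertical strips of B columns so the
   * neighbors touched on row i are still cached when row i+1 is updated. */
  void blocked_sweep(int n, float A[n + 2][n + 2], float *diff)
  {
      for (int jj = 1; jj <= n; jj += B)
          for (int i = 1; i <= n; i++)
              for (int j = jj; j <= n && j < jj + B; j++) {
                  float temp = A[i][j];
                  A[i][j] = 0.2f * (A[i][j] + A[i][j - 1] + A[i - 1][j]
                                    + A[i][j + 1] + A[i + 1][j]);
                  *diff += (A[i][j] > temp) ? A[i][j] - temp : temp - A[i][j];
              }
  }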

Page 64: Lecture 2 (Mapping Applications to Multi-core Arch)

Exploiting Spatial Locality
Besides capacity, granularities are important:
  Granularity of allocation
  Granularity of communication or data transfer
  Granularity of coherence
Major spatial-related causes of artifactual communication:
  Conflict misses
  Data distribution/layout (allocation granularity)
  Fragmentation (communication granularity)
  False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures
  Fix problems by modifying data structures, or layout/alignment
Examine later in the context of architectures
  One simple example here: data distribution in the SAS solver

Page 65: Lecture 2 (Mapping Applications to Multi-core Arch)

Spatial Locality Example
Repeated sweeps over a 2-d grid, each time adding 1 to the elements
Natural 2-d versus higher-dimensional array representation

[Figure: (a) with a two-dimensional array, pages straddle partition boundaries, making memory hard to distribute well, and cache blocks straddle partition boundaries; (b) with a four-dimensional array, contiguity in the memory layout keeps pages and cache blocks within a partition.]
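The "four-dimensional array" trick is only an indexing change: allocate the grid as blocks-of-blocks so that each process's partition is one contiguous region. A hedged sketch (nb is the number of block rows/columns, roughly sqrt(p); the helper names are illustrative):

  #include <stdlib.h>

  /* 2-d view:  A[i][j], row-major; one process's block is scattered
   *            across many pages and cache lines.
   * 4-d view:  A4[bi][bj][i][j], where (bi,bj) selects the block owned
   *            by a process and (i,j) indexes inside it; the whole block
   *            is stored contiguously.                                  */
  float *alloc_blocked(int nb, int bsize)
  {
      /* nb*nb blocks, each bsize*bsize elements, stored block after block */
      return malloc((size_t)nb * nb * bsize * bsize * sizeof(float));
  }

  static inline float *elem(float *A4, int nb, int bsize,
                            int bi, int bj, int i, int j)
  {
      return &A4[(((size_t)bi * nb + bj) * bsize + i) * bsize + j];
  }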

Page 66: Lecture 2 (Mapping Applications to Multi-core Arch)

Tradeoffs with Inherent Communication
Partitioning the grid solver: blocks versus rows
  Blocks still have a spatial locality problem on remote data
  Rowwise can perform better despite a worse inherent comm-to-comp ratio
The result depends on n and p

[Figure: good spatial locality on nonlocal accesses at a row-oriented boundary; poor spatial locality on nonlocal accesses at a column-oriented boundary.]

Page 67: Lecture 2 (Mapping Applications to Multi-core Arch)

Example Performance Impact
Equation solver on SGI Origin2000

[Figure: two plots of speedup versus number of processors (1 to 31) comparing 2D, 4D, and row-wise partitionings, with and without round-robin (rr) page placement.]

Page 68: Lecture 2 (Mapping Applications to Multi-core Arch)

Structuring Communication
Given the amount of communication (inherent or artifactual), the goal is to reduce its cost
Cost of communication as seen by a process:
  C = f * (o + l + n_c/(m*B) + t_c - overlap)
    f = frequency of messages
    o = overhead per message (at both ends)
    l = network delay per message
    n_c = total data sent
    m = number of messages
    B = bandwidth along the path (determined by network, NI, assist)
    t_c = cost induced by contention per message
    overlap = amount of latency hidden by overlap with computation or communication
The portion in parentheses is the cost of a message (as seen by the processor)
  That portion, ignoring overlap, is the latency of a message
Goal: reduce the terms in latency and increase overlap
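Written out as code, the per-process cost model above is a single expression; a hedged transcription with every parameter passed in explicitly (the function name and units are illustrative):

  /* C = f * (o + l + nc/(m*B) + tc - overlap) */
  double comm_cost(double f, double o, double l, double nc,
                   double m, double B, double tc, double overlap)
  {
      double per_msg = o + l + nc / (m * B) + tc - overlap;  /* cost of one message */
      return f * per_msg;
  }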

Page 69: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Overhead
Can reduce the number of messages m or the overhead per message o
o is usually determined by hardware or system software
  The program should try to reduce m by coalescing messages
  More control when communication is explicit
Coalescing data into larger messages:
  Easy for regular, coarse-grained communication
  Can be difficult for irregular, naturally fine-grained communication
    may require changes to the algorithm and extra work
    coalescing data and determining what to send, and to whom

Page 70: Lecture 2 (Mapping Applications to Multi-core Arch)

Reducing Contention
All resources have nonzero occupancy
  Memory, communication controller, network link, etc.
  Can only handle so many transactions per unit time
Effects of contention:
  Increased end-to-end cost for messages
  Reduced available bandwidth for individual messages
  Causes imbalances across processors
Particularly insidious performance problem
  Easy to ignore when programming
  Slows down messages that don't even need that resource, by causing other dependent resources to also congest
  The effect can be devastating: don't flood a resource!

Page 71: Lecture 2 (Mapping Applications to Multi-core Arch)

Overlapping Communication
Cannot afford to stall for high latencies
  Even on uniprocessors!
Overlap with computation or communication to hide latency
Requires extra concurrency (slackness) and higher bandwidth
Techniques:
  Prefetching
  Block data transfer
  Proceeding past communication
  Multithreading

Page 72: Lecture 2 (Mapping Applications to Multi-core Arch)

Summary of Tradeoffs
Different goals often have conflicting demands
  Load balance
    fine-grain tasks
    random or dynamic assignment
  Communication
    usually coarse-grain tasks
    decompose to obtain locality: not random/dynamic
  Extra work
    coarse-grain tasks
    simple assignment
  Communication cost:
    big transfers: amortize overhead and latency
    small transfers: reduce contention

Page 73: Lecture 2 (Mapping Applications to Multi-core Arch)

Relationship between Perspectives

  Parallelization step(s)                  Performance issue                             Processor time component
  Decomposition/assignment/orchestration   Load imbalance and synchronization            Synch wait
  Decomposition/assignment                 Extra work                                    Busy-overhead
  Decomposition/assignment                 Inherent communication volume                 Data-remote
  Orchestration                            Artifactual communication and data locality   Data-local
  Orchestration/mapping                    Communication structure

Page 74: Lecture 2 (Mapping Applications to Multi-core Arch)

Summary

  Speedup(p) < [Busy(1) + Data(1)] / [Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p)]

Goal is to reduce the denominator components
Both the programmer and the system have a role to play
Architecture cannot do much about load imbalance or too much communication
But it can:
  reduce the incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  reduce artifactual communication
  provide efficient naming for flexible assignment
  allow effective overlapping of communication

Page 75: Lecture 2 (Mapping Applications to Multi-core Arch)

Multi-Threading

Parallel Programming on Shared Memory Multiprocessors Using PThreads

Chapter 2, Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006

Page 76: Lecture 2 (Mapping Applications to Multi-core Arch)

Outline of Multi-Threading Topics
Threads
  Terminology
  OS-level view
  Hardware-level threads
Threading as a parallel programming model
  Types of thread-level parallel programs
  Implementation issues

Page 77: Lecture 2 (Mapping Applications to Multi-core Arch)

Threads: Definition
A discrete sequence of related instructions
Executed independently of other such sequences
Every program has at least one thread
  Initializes
  Executes instructions
  May create other threads
Each thread maintains its current state
The OS maps a thread to hardware resources

Page 78: Lecture 2 (Mapping Applications to Multi-core Arch)

System View of Threads
Thread computational model layers:
  User-level threads
  Kernel-level threads
  Hardware threads

Page 79: Lecture 2 (Mapping Applications to Multi-core Arch)

Flow of Threads in an Execution Environment
Defining and preparing stage
Operating stage
  Created and managed by the OS
Execution stage

Page 80: Lecture 2 (Mapping Applications to Multi-core Arch)


Threads Inside the OS

Page 81: Lecture 2 (Mapping Applications to Multi-core Arch)


Processors, Processes, and Threads

A processor runs threads from one or more processes, each of which contains one or more threads

Page 82: Lecture 2 (Mapping Applications to Multi-core Arch)


Mapping Models of Threads to Processors: 1:1 Mapping

Page 83: Lecture 2 (Mapping Applications to Multi-core Arch)


Mapping Models of Threads to Processors: M:1 Mapping

Page 84: Lecture 2 (Mapping Applications to Multi-core Arch)


Mapping Models of Threads to Processors: M:N Mapping

Page 85: Lecture 2 (Mapping Applications to Multi-core Arch)


Threads Inside the Hardware

Page 86: Lecture 2 (Mapping Applications to Multi-core Arch)

Thread Creation
Multiple threads inside a process
  Share the same address space, file descriptors, etc.
  Operate independently
  Need their own stack space
Who handles thread creation details?
  Not the programmer
  Typically handled at the system level
    OS support for threads
    Threading libraries
The same is true for thread management

Page 87: Lecture 2 (Mapping Applications to Multi-core Arch)


Stack Layout for a Multi-Threaded Process

Page 88: Lecture 2 (Mapping Applications to Multi-core Arch)


Thread State Diagram

Page 89: Lecture 2 (Mapping Applications to Multi-core Arch)

Thread Implementation
Often implemented as a thread package
  Operations to create and destroy threads
  Synchronization mechanisms
Approaches to implementing a thread package:
  Implement it as a thread library that executes entirely in user mode
  Have the kernel be aware of threads and schedule them

Page 90: Lecture 2 (Mapping Applications to Multi-core Arch)

Thread Implementation (2)
Characteristics of a user-level thread library:
  Cheap to create and destroy threads
  Switching thread context can be done in just a few instructions
    Need to save and restore CPU registers only
    No need to change memory maps, flush the TLB, do CPU accounting, etc.
  Drawback: a blocking system call will block all threads in the process
Solution to blocking: implement threads in the OS kernel

Page 91: Lecture 2 (Mapping Applications to Multi-core Arch)

Kernel Implementations of Threads
High price to solve the blocking problem
  Every thread operation will require a system call:
    Thread creation
    Thread deletion
    Thread synchronization
  Thread switching now becomes as expensive as a process context switch

Page 92: Lecture 2 (Mapping Applications to Multi-core Arch)

Kernel Implementations of Threads (2)
Lightweight processes (LWP)
  A hybrid form of user- and kernel-level threads
  An LWP runs in the context of a (heavy-weight) process
  There can be several LWPs, each with its own scheduler and stack
  The system also offers a user-level thread package for the usual operations (creation, deletion, and synchronization)
  The assignment of a user-level thread to an LWP is hidden from the programmer
  The LWP handles the scheduling for multiple threads

Page 93: Lecture 2 (Mapping Applications to Multi-core Arch)

LWP Implementation
The thread table is shared among LWPs
  Protected through mutexes
  No kernel intervention for LWP synchronization
When an LWP finds a runnable thread
  It switches context to that thread
  Done entirely in user space
When a thread makes a blocking system call:
  The OS might block one LWP
  It may switch to another LWP, which allows other threads to continue

Page 94: Lecture 2 (Mapping Applications to Multi-core Arch)

Parallel Programming with Threads

Overview of POSIX threads, data races and types of synchronization

Page 95: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Memory Programming: Several Thread Libraries
PTHREADS is the POSIX standard
  Solaris threads are very similar
  Relatively low level
  Portable but possibly slow
OpenMP is a newer standard
  Support for scientific programming on shared memory
  http://www.openMP.org
Multiple other efforts by specific vendors

Page 96: Lecture 2 (Mapping Applications to Multi-core Arch)

Overview of POSIX Threads
POSIX: Portable Operating System Interface for UNIX
  Interface to operating system utilities
PThreads: the POSIX threading interface
  System calls to create and synchronize threads
  Should be relatively uniform across UNIX-like OS platforms
PThreads contain support for
  Creating parallelism
  Synchronizing
  No explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread

Page 97: Lecture 2 (Mapping Applications to Multi-core Arch)

POSIX Thread Creation
Signature:
  int pthread_create(pthread_t *,
                     const pthread_attr_t *,
                     void * (*)(void *),
                     void *);
Example call:
  errorcode = pthread_create(&thread_id,
                             &thread_attribute,
                             &thread_fun,
                             &fun_arg);

Page 98: Lecture 2 (Mapping Applications to Multi-core Arch)

POSIX Thread Creation (2)
thread_id is the thread id or handle (used to halt the thread, etc.)
thread_attribute holds various attributes
  Standard default values are obtained by passing a NULL pointer
thread_fun is the function to be run (takes and returns void*)
fun_arg is an argument that can be passed to thread_fun when it starts
errorcode will be set nonzero if the create operation fails

Page 99: Lecture 2 (Mapping Applications to Multi-core Arch)

Simple Threading Example

#include <stdio.h>
#include <pthread.h>

void* SayHello(void *foo) {
    printf("Hello, world!\n");
    return NULL;
}

int main() {
    pthread_t threads[16];
    int tn;
    for (tn = 0; tn < 16; tn++) {
        pthread_create(&threads[tn], NULL, SayHello, NULL);
    }
    for (tn = 0; tn < 16; tn++) {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}

Compile using gcc -lpthread

Page 100: Lecture 2 (Mapping Applications to Multi-core Arch)

Loop Level Parallelism
Many scientific applications have parallelism in loops
With threads:
  ... my_stuff[n][n];
  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          ... pthread_create(update_cell, ..., my_stuff[i][j]);
But the overhead of thread creation is nontrivial
  update_cell should have a significant amount of work, 1/p-th of the total if possible
  Also need to pass i and j to the thread (see the sketch below)
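One common way to get both the 1/p-sized chunks and the loop indices into each thread is to pass a small per-thread struct. A hedged sketch, reusing the slide's my_stuff and update_cell names as placeholders (the chunking scheme and struct layout are illustrative):

  #include <pthread.h>

  #define P 4                              /* number of worker threads */

  typedef struct {
      int row_begin, row_end;              /* this thread's chunk of i values */
      int n;
      double **my_stuff;                   /* shared data, from the slide */
  } work_t;

  extern void update_cell(double **my_stuff, int i, int j);   /* placeholder */

  static void *worker(void *arg)
  {
      work_t *w = (work_t *)arg;
      for (int i = w->row_begin; i < w->row_end; i++)    /* ~1/P of the rows */
          for (int j = 0; j < w->n; j++)
              update_cell(w->my_stuff, i, j);
      return NULL;
  }

  void parallel_update(double **my_stuff, int n)
  {
      pthread_t tid[P];
      work_t w[P];
      for (int t = 0; t < P; t++) {
          w[t] = (work_t){ t * n / P, (t + 1) * n / P, n, my_stuff };
          pthread_create(&tid[t], NULL, worker, &w[t]);  /* one thread per chunk */
      }
      for (int t = 0; t < P; t++)
          pthread_join(tid[t], NULL);
  }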

Page 101: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Data and Threads
Variables declared outside of main are shared
Objects allocated on the heap may be shared (if a pointer is passed)
Variables on the stack are private:
  passing pointers to these to other threads can cause problems

Page 102: Lecture 2 (Mapping Applications to Multi-core Arch)

Shared Data and Threads (2)
Often done by creating a large "thread data" struct
  Passed into all threads as an argument
Simple example:

  char *message = "Hello World!\n";
  pthread_create(&thread1,
                 NULL,
                 (void*)&print_fun,
                 (void*) message);

Page 103: Lecture 2 (Mapping Applications to Multi-core Arch)

Setting Attribute Values
Once an initialized attribute object exists, changes can be made. For example:
To change the stack size for a thread to 8192 (before calling pthread_create), do this:
  pthread_attr_setstacksize(&my_attributes, (size_t)8192);
To get the stack size, do this:
  size_t my_stack_size;
  pthread_attr_getstacksize(&my_attributes, &my_stack_size);

Slide source: Theewara Vorakosit

Page 104: Lecture 2 (Mapping Applications to Multi-core Arch)

Other Attributes
Detached state: set if no other thread will use pthread_join to wait for this thread (improves efficiency)
Scheduling parameter(s): in particular, thread priority
Scheduling policy: FIFO or round robin
Contention scope: with what threads does this thread compete for a CPU
Stack address: explicitly dictate where the stack is located
Lazy stack allocation: allocate on demand (lazy) or all at once, "up front"

Page 105: Lecture 2 (Mapping Applications to Multi-core Arch)

Data Race Example
The problem is a race condition on variable s in the program
A race condition or data race occurs when:
  two processors (or two threads) access the same variable, and at least one does a write
  the accesses are concurrent (not synchronized), so they could happen simultaneously

  static int s = 0;

  Thread 1                     Thread 2
  for i = 0, n/2-1             for i = n/2, n-1
      s = s + f(A[i])              s = s + f(A[i])

Page 106: Lecture 2 (Mapping Applications to Multi-core Arch)


Basic Types of Synchronization: Barrier
Barrier: global synchronization
Especially common when running multiple copies of the same function in parallel, SPMD ("Single Program Multiple Data")
Simple use of barriers: all threads hit the same one

    work_on_my_subgrid();
    barrier;
    read_neighboring_values();
    barrier;

Page 107: Lecture 2 (Mapping Applications to Multi-core Arch)


Barrier (2)
More complicated: barriers on branches (or loops)

    if (tid % 2 == 0) {
        work1();
        barrier;
    } else {
        barrier;
    }

Barriers are not provided in all thread libraries

Page 108: Lecture 2 (Mapping Applications to Multi-core Arch)


Creating and Initializing a Barrier
To (dynamically) initialize a barrier, use code similar to this (which sets the number of threads to 3):

    pthread_barrier_t b;
    pthread_barrier_init(&b, NULL, 3);

The second argument specifies an object attribute; using NULL yields the default attributes.

Page 109: Lecture 2 (Mapping Applications to Multi-core Arch)


Creating and Initializing a Barrier
To wait at a barrier, a process executes:

    pthread_barrier_wait(&b);

This barrier could have been statically initialized by assigning an initial value created using the macro PTHREAD_BARRIER_INITIALIZER(3)
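Putting the two calls together, a minimal sketch of the SPMD pattern from the earlier barrier slide. Pthread barriers are an optional POSIX feature, so this assumes a platform that provides them (older systems may also need a feature-test macro such as _XOPEN_SOURCE 600):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3

    static pthread_barrier_t b;

    static void *phase_worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: phase 1 (work_on_my_subgrid)\n", id);
        pthread_barrier_wait(&b);        /* nobody proceeds until all finish phase 1 */
        printf("thread %ld: phase 2 (read_neighboring_values)\n", id);
        pthread_barrier_wait(&b);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&b, NULL, NTHREADS);   /* NULL attr = defaults, count = 3 */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, phase_worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&b);
        return 0;
    }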

Page 110: Lecture 2 (Mapping Applications to Multi-core Arch)


Basic Types of Synchronization: Mutexes
Mutexes: mutual exclusion, aka locks
Threads are working mostly independently but need to access a common data structure

    lock *l = alloc_and_init();   /* shared */
    acquire(l);
    access data
    release(l);

Page 111: Lecture 2 (Mapping Applications to Multi-core Arch)


Mutexes (2)
Java and other languages have lexically scoped synchronization, similar to the cobegin/coend vs. fork-and-join tradeoff
Semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion
Locks only affect the processors using them: pair-wise synchronization

Page 112: Lecture 2 (Mapping Applications to Multi-core Arch)


Mutexes in POSIX Threads
To create a mutex:

    #include <pthread.h>
    pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;   /* static initialization */
    /* or, dynamically: pthread_mutex_init(&amutex, NULL); */

To use it:

    pthread_mutex_lock(&amutex);
    /* ... critical section ... */
    pthread_mutex_unlock(&amutex);

Both calls return an int error code (0 on success).
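As a sketch, here is how a mutex can remove the data race on the shared sum s from the earlier example; f, A, and the thread ranges are placeholders:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000

    static int A[N];
    static int s = 0;
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static int f(int x) { return x * x; }        /* placeholder for the real f */

    typedef struct { int lo, hi; } range_t;

    static void *sum_range(void *arg)
    {
        range_t *r = (range_t *)arg;
        int local = 0;                           /* accumulate privately ... */
        for (int i = r->lo; i < r->hi; i++)
            local += f(A[i]);
        pthread_mutex_lock(&s_lock);             /* ... then update s under the lock */
        s += local;
        pthread_mutex_unlock(&s_lock);
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) A[i] = 1;
        pthread_t t1, t2;
        range_t r1 = { 0, N / 2 }, r2 = { N / 2, N };
        pthread_create(&t1, NULL, sum_range, &r1);
        pthread_create(&t2, NULL, sum_range, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %d\n", s);
        return 0;
    }

Accumulating into a private local variable and taking the lock once per thread keeps lock contention low compared with locking inside the loop.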

Page 113: Lecture 2 (Mapping Applications to Multi-core Arch)


Mutexes in POSIX Threads (2)
To deallocate a mutex:

    int pthread_mutex_destroy(pthread_mutex_t *mutex);

Multiple mutexes may be held, but this can lead to deadlock:

    Thread 1        Thread 2
    lock(a)         lock(b)
    lock(b)         lock(a)
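A common way to avoid this deadlock is to impose a single global lock order and make every thread acquire mutexes in that order. The sketch below orders locks by address; lock_pair and unlock_pair are hypothetical helpers, not part of the pthreads API:

    #include <pthread.h>
    #include <stdint.h>

    /* Acquire two mutexes in a fixed global order (here: by address) so that
     * no two threads can ever wait on each other in a cycle. */
    void lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
    {
        if ((uintptr_t)m1 > (uintptr_t)m2) {
            pthread_mutex_t *tmp = m1; m1 = m2; m2 = tmp;
        }
        pthread_mutex_lock(m1);
        pthread_mutex_lock(m2);
    }

    void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
    {
        pthread_mutex_unlock(m1);
        pthread_mutex_unlock(m2);
    }

    /* Both thread1 and thread2 now call lock_pair(&a, &b) (in either argument
     * order); the locks are always taken in the same global order. */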

Page 114: Lecture 2 (Mapping Applications to Multi-core Arch)


Summary of Programming with Threads
POSIX Threads are based on OS features
Can be used from multiple languages
Familiar language for most of the program
Ability to share data is convenient
Pitfalls:
Intermittent data race bugs are very nasty to find
Deadlocks are usually easier to find, but can also be intermittent
OpenMP is commonly used today as an alternative

Page 115: Lecture 2 (Mapping Applications to Multi-core Arch)

Multi-Threaded Distributed Application Examples

Distributed Operating Systems, by Andrew S. Tanenbaum

Page 116: Lecture 2 (Mapping Applications to Multi-core Arch)


Multithreaded Clients
Distribution transparency:
Needed when a distributed system operates in a wide-area network environment
Need some mechanism to hide communication latency
Multithreading on the client side is useful:
One connection per thread
If one thread is blocked, others can do useful work
More responsive to the user
Example: a web browser
One thread connected to a server can fetch an HTML document
Another thread connected to the same server can fetch images while the first displays the text, scroll bars, etc.

Page 117: Lecture 2 (Mapping Applications to Multi-core Arch)


Multithreaded Servers (1)
A multithreaded server can be organized in a dispatcher/worker model: a dispatcher thread accepts incoming requests and hands each one to an idle worker thread (see the sketch below)
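A minimal sketch of the dispatcher/worker idea using a bounded request queue protected by a mutex and condition variables; the request type, queue size, and handle_request are hypothetical, and a real server would read requests from the network:

    #include <pthread.h>
    #include <stdio.h>

    #define QSIZE    8
    #define NWORKERS 3

    static int queue[QSIZE];                 /* hypothetical "requests" (just ints here) */
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t qlock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;

    static void handle_request(int req) { printf("handled request %d\n", req); }

    static void *worker(void *arg)           /* worker thread: block until work arrives */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (count == 0)
                pthread_cond_wait(&nonempty, &qlock);
            int req = queue[head];
            head = (head + 1) % QSIZE;
            count--;
            pthread_cond_signal(&nonfull);
            pthread_mutex_unlock(&qlock);
            handle_request(req);             /* may block on I/O without stalling the others */
        }
        return NULL;
    }

    static void dispatch(int req)            /* dispatcher: hand the request to some worker */
    {
        pthread_mutex_lock(&qlock);
        while (count == QSIZE)
            pthread_cond_wait(&nonfull, &qlock);
        queue[tail] = req;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&qlock);
    }

    int main(void)
    {
        pthread_t w[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&w[i], NULL, worker, NULL);
        for (int req = 0; req < 10; req++)   /* stand-in for the dispatcher's network loop */
            dispatch(req);
        pthread_exit(NULL);                  /* a server runs forever; workers keep waiting */
    }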

Page 118: Lecture 2 (Mapping Applications to Multi-core Arch)


Multithreaded Servers (2)
Three ways to construct a server:

    Model                       Characteristics
    Threads                     Parallelism, blocking system calls
    Single-threaded process     No parallelism, blocking system calls
    Finite-state machine        Parallelism, nonblocking system calls

Page 119: Lecture 2 (Mapping Applications to Multi-core Arch)


Clients
Anatomy of a client process:
User interface
A major task for most clients is to interact with human users
Provides a means to interact with a remote server
An important class: Graphical User Interfaces (GUIs)
Client-side software for distribution transparency
Example: the X Window System
Used to control bit-mapped devices: monitor, keyboard, and a pointing device
The X kernel (X server) contains the hardware-specific details and device drivers
X uses an event-driven approach:
Captures events from devices
Provides an interface in the form of Xlib for GUI/graphics applications
Two types of applications: normal and window manager

Page 120: Lecture 2 (Mapping Applications to Multi-core Arch)


The X-Window System
The basic organization of the X Window System

Page 121: Lecture 2 (Mapping Applications to Multi-core Arch)


User Interface: Compound Documents
The function of a user interface is more than just interacting with users!
It may allow multiple applications to share a single graphical window and use that window to exchange data through user actions
Typical examples:
Drag and drop: drag an icon representing a file onto a trash can icon; the application associated with the trash can will be activated to delete the file
In-place editing: an image within a text document in a word processor; clicking on the image can activate a drawing tool
The compound-document notion of a user interface:
A collection of different documents (text, images, spreadsheets)
Seamlessly integrated through the user interface
Different applications operate on different parts of the document

Page 122: Lecture 2 (Mapping Applications to Multi-core Arch)


Client-Side Software for Distribution Transparency
A possible approach to transparent replication of a remote object using a client-side solution:
A proxy replicates requests to all replicated servers and forms a single response for the client application (replication transparency)
Failure transparency is also possible through client middleware

Page 123: Lecture 2 (Mapping Applications to Multi-core Arch)


Servers
Organization of a server process:
Design issues of a server
Object servers: alternatives for invoking objects, object adapter
General design of a server:
Iterative server: handles all requests itself; if necessary, returns a response to the requesting user
Concurrent server: does not handle the request itself, but passes it to a separate thread or process and waits for the next request (see the sketch below)
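A sketch of a concurrent server in the thread-per-connection style: the main loop only accepts a connection, hands it to a new detached thread, and immediately waits for the next request. The echo behaviour and port number are placeholders, and error checking is omitted for brevity:

    #include <pthread.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>
    #include <unistd.h>

    static void *handle_client(void *arg)            /* one thread per accepted connection */
    {
        int fd = (int)(long)arg;
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);       /* placeholder request handling ... */
        if (n > 0)
            write(fd, buf, n);                       /* ... echo it back as the "response" */
        close(fd);
        return NULL;
    }

    int main(void)
    {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(12345);         /* hypothetical port */
        bind(listener, (struct sockaddr *)&addr, sizeof addr);
        listen(listener, 16);

        for (;;) {                                   /* concurrent server main loop */
            int client = accept(listener, NULL, NULL);
            if (client < 0)
                continue;
            pthread_t tid;
            pthread_create(&tid, NULL, handle_client, (void *)(long)client);
            pthread_detach(tid);                     /* do not join; go wait for the next request */
        }
    }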

Page 124: Lecture 2 (Mapping Applications to Multi-core Arch)


Servers: General Design Issues
Client-to-server binding using a daemon, as in DCE
Client-to-server binding using a superserver, as in UNIX
Other distinctions: stateless server vs. stateful server

Page 125: Lecture 2 (Mapping Applications to Multi-core Arch)


Key Takeaways of this Session
A wealth of knowledge exists about developing parallel applications, on legacy parallel architectures and for high-performance computing (HPC) applications
These techniques are applicable to multi-core:
Similar decomposition, assignment, orchestration, and mapping
Shared address space programming
A wider range of applications is the topic for the next session