CHAPTER 2: PARALLEL PROGRAMMING BACKGROUND

“By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs and understand some rules of thumb for scaling the performance of parallel applications.”



TRADITIONAL PARALLEL MODELS

Serial Model: SISD
Parallel Models: SIMD, MIMD, MISD*

S = Single, M = Multiple, I = Instruction, D = Data


VOCABULARY & NOTATION (2.1)

Task vs. Data: tasks are instructions that operate on data, modifying existing data or creating new data

A parallel computation runs multiple tasks, which must be coordinated and managed

Dependencies
Data dependency: a task requires data produced by another task
Control dependency: events/steps must occur in a given order (e.g. I/O)


TASK MANAGEMENT – FORK-JOIN

Fork: splits a control flow, creating a new, parallel control flow

Join: control flows are synchronized and merged back into one
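The pattern can be sketched with Python threads; the worker function and the data here are illustrative, not from the text:

```python
import threading

def worker(chunk, results, i):
    # Each forked control flow computes its own partial result.
    results[i] = sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]
results = [0] * len(chunks)

# Fork: split control flow by creating one new thread per chunk.
threads = [threading.Thread(target=worker, args=(c, results, i))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()

# Join: synchronize and merge the control flows back into one.
for t in threads:
    t.join()

total = sum(results)  # safe to combine: every worker has finished
```

The join is what makes reading results safe; without it, the merged flow could observe partial results.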


GRAPHICAL NOTATION – FIG. 2.1

Figure legend: Task, Data, Fork, Join, Dependency


STRATEGIES (2.2)

Data Parallelism: the best strategy for scalable parallelism, i.e. parallelism that grows as the data set/problem size grows

Split the data set across a set of processors, with a task processing each subset

More data, more tasks
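A sketch of that splitting step in Python; the split helper and the summing task are hypothetical stand-ins:

```python
def split(data, p):
    """Split data into p roughly equal chunks, one per processor."""
    n = len(data)
    return [data[i * n // p:(i + 1) * n // p] for i in range(p)]

chunks = split(list(range(10)), 4)
# Each chunk would be processed by its own task; as the data grows,
# the chunks (and hence the parallel work) grow with it.
partials = [sum(c) for c in chunks]  # stand-in for the per-task work
total = sum(partials)
```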


STRATEGIES

Control Parallelism (or Functional Decomposition)
Different program functions run in parallel
Not scalable: the best speedup is a constant factor

As the data grows, the parallelism doesn't
May have less overhead, or none


REGULAR VS. IRREGULAR PARALLELISM

Regular: tasks are similar, with predictable dependencies (e.g. matrix multiplication)

Irregular: tasks differ in ways that create unpredictable dependencies (e.g. a chess program)

Many problems contain combinations of both


HARDWARE MECHANISMS (2.3)

The two most important mechanisms:
Thread Parallelism: implemented in hardware using a separate flow of control for each worker; supports regular, irregular, and functional decomposition

Vector Parallelism: implemented in hardware with one flow of control operating on multiple data elements; supports regular and some irregular parallelism


BRANCH STATEMENTS

Detrimental to:
• Parallelism
• Locality
• Pipelining
How?


MASKING: ALL CONTROL PATHS ARE EXECUTED, BUT UNSELECTED RESULTS ARE MASKED OUT AND NOT USED

if (a & 1)
    a = 3*a + 1
else
    a = a / 2

The if/else contains branch statements. Masking: both parts are executed in parallel, and only one result is kept.

p = (a & 1)
t = 3*a + 1
if (p) a = t
t = a / 2
if (!p) a = t

No branches: a single flow of control. Masking works as if the code were written this way.
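The transformed form can be emulated in ordinary Python to show the idea: every element executes both arms, and the predicate selects the surviving result (a sketch of the concept, not real vector hardware):

```python
def masked_step(values):
    # One flow of control applied uniformly to all elements.
    results = []
    for a in values:
        p = a & 1           # predicate: is a odd?
        t_then = 3 * a + 1  # "then" arm: always computed
        t_else = a // 2     # "else" arm: always computed
        results.append(t_then if p else t_else)  # masked select
    return results
```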


MACHINE MODELS (2.4)

Core
Functional Units
Registers
Cache memory (multiple levels)


CACHE MEMORY

Blocks (cache lines): the amount fetched at once
Bandwidth: the amount transferred concurrently

Latency: the time to complete a transfer
Cache Coherence: consistency among copies


VIRTUAL MEMORY

Memory system = disk storage + chip memory
Allows programs larger than physical memory to run
Allows multiprocessing
Swaps pages
Hardware maps logical to physical addresses
Data locality is important to efficiency
Page Fault, Thrashing


PARALLEL MEMORY ACCESS

Caches (multiple)
NUMA: Non-Uniform Memory Access
PRAM: Parallel Random Access Machine, a theoretical model that assumes uniform memory access times


PERFORMANCE ISSUES (2.4.2)

Data Locality
Choose code segments that fit in cache
Design to use data in close proximity
Align data with cache lines (blocks)
Dynamic grain size is a good strategy


PERFORMANCE ISSUES

Arithmetic Intensity: a large number of on-chip compute operations for every off-chip memory access

Otherwise, communication overhead is high

Related: grain size


FLYNN’S CATEGORIES

Serial Model
SISD

Parallel Models
SIMD: array processors, vector processors

MIMD: heterogeneous computers, clusters

MISD*: not useful


CLASSIFICATION BASED ON MEMORY

Shared Memory: each processor accesses a common memory
Access issues
No message passing
Each processor usually has a small local memory

Distributed Memory: each processor has its own local memory

Explicit messages are sent between processors


EVOLUTION (2.4.4)

GPU: graphics accelerators, now general purpose

Offload: running computations on an accelerator, GPU, or co-processor rather than the regular CPUs

Heterogeneous: different kinds of hardware working together

Host Processor: handles distribution, I/O, etc.


PERFORMANCE (2.5)

Various interpretations of performance:
Reduce the total time for a computation (latency)

Increase the rate at which a series of results is computed (throughput)

Reduce power consumption
*Performance target


LATENCY & THROUGHPUT (2.5.1)

Latency: the time to complete a task
Throughput: the rate at which tasks are completed, in units per time (e.g. jobs per hour)
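A small idealized model makes the distinction concrete; the function and the stage times below are illustrative assumptions, not from the text:

```python
def pipeline_metrics(stage_times):
    # Latency: one task must pass through every stage in turn.
    latency = sum(stage_times)
    # Throughput: once the pipeline is full, the slowest stage sets
    # the rate at which finished tasks emerge (idealized model).
    throughput = 1.0 / max(stage_times)
    return latency, throughput

lat, thr = pipeline_metrics([1.0, 2.0, 1.0])
# A single job takes 4 time units, yet a job completes every 2 units.
```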


OMIT SECTION 2.5.3 – POWER


SPEEDUP & EFFICIENCY (2.5.2)

Sp = T1 / Tp
T1: time to complete on 1 processor
Tp: time to complete on P processors

REMEMBER: “time” means number of instructions

E = Sp / P = T1 / (P * Tp)

E = 1 is “perfect”

Linear speedup: occurs when the algorithm runs P times faster on P processors
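The definitions translate directly into code; the timings below are hypothetical:

```python
def speedup(t1, tp):
    # Sp = T1 / Tp
    return t1 / tp

def efficiency(t1, tp, p):
    # E = Sp / P = T1 / (P * Tp)
    return t1 / (p * tp)

# Suppose a job takes 100 s serially and 30 s on 4 processors.
sp = speedup(100.0, 30.0)       # about 3.33
e = efficiency(100.0, 30.0, 4)  # about 0.83; E = 1 would be perfect
```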


SUPERLINEAR SPEEDUP (P.57)

Efficiency > 1
Very rare
Often due to hardware effects (e.g. more aggregate cache)
Working in parallel may eliminate some work that the serial version performs


AMDAHL & GUSTAFSON-BARSIS (2.5.4, 2.5.5)

Amdahl: speedup is limited by the amount of serial work required

G-B: as problem size grows, parallel work grows faster than serial work, so speedup increases

See examples
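Both laws reduce to one-line formulas; in this sketch, serial_fraction is the fraction s of the work that must run serially:

```python
def amdahl_speedup(serial_fraction, p):
    # Fixed problem size: the serial part limits the speedup,
    # which can never exceed 1/s no matter how large p grows.
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(serial_fraction, p):
    # Scaled problem size: S = p - s*(p - 1), which keeps growing
    # with p because the parallel work grows with the problem.
    s = serial_fraction
    return p - s * (p - 1)
```

With s = 0.1, Amdahl caps the speedup at 10, while the Gustafson-Barsis scaled speedup on 100 processors is 100 - 0.1*99 = 90.1.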


WORK

Work: the total operations (time) for a task
T1 = Work
Ideally, P * Tp = Work, so T1 = P * Tp ?? Rare due to ???


WORK-SPAN MODEL (2.5.6)

Describes dependencies among tasks and allows times to be estimated
Represents tasks as a DAG (Figure 2.8)
Critical Path: the longest path through the DAG
Span: the time of the Critical Path, the minimum possible parallel time

Assumes Greedy Task Scheduling – no wasted resources, time

Parallel Slack – excess parallelism, more tasks than can be scheduled at once


WORK-SPAN MODEL

Speedup <= Work/Span

Upper bound: ?? No more than…
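The bound can be checked on a tiny example; the diamond-shaped DAG and unit task costs below are hypothetical:

```python
def span(dag, cost):
    """Length of the critical (longest) path through a task DAG.

    dag maps each task to its successors; cost maps task -> time.
    """
    memo = {}
    def longest_from(t):
        if t not in memo:
            memo[t] = cost[t] + max((longest_from(s) for s in dag[t]),
                                    default=0)
        return memo[t]
    return max(longest_from(t) for t in dag)

# Diamond DAG: a forks to b and c, which join at d; unit costs.
dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
cost = {t: 1 for t in dag}
work = sum(cost.values())  # 4 units of total work
s = span(dag, cost)        # 3: the critical path a -> b -> d
bound = work / s           # speedup can be at most 4/3 here
```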


ASYMPTOTIC COMPLEXITY (2.5.7)

For comparing algorithms!
Time complexity: describes the growth of execution time in terms of input size

Space complexity: describes the growth of memory requirements in terms of input size

Ignores constants; machine independent


BIG OH NOTATION (P.66)

Big Oh of F(n): an upper bound
O(F(n)) = { G(n) | there exist positive constants c and n0 such that |G(n)| ≤ c·F(n) for all n ≥ n0 }

*Memorize


BIG OMEGA & BIG THETA

Big Omega (Ω): functions that define a lower bound

Big Theta (Θ): functions that define a tight bound, both upper and lower
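All three bounds can be written in the same style as the Big Oh definition:

```latex
O(F(n))      = \{\, G(n) \mid \exists\, c > 0,\ n_0 > 0 : |G(n)| \le c\,F(n) \ \text{for all } n \ge n_0 \,\}
\Omega(F(n)) = \{\, G(n) \mid \exists\, c > 0,\ n_0 > 0 : |G(n)| \ge c\,F(n) \ \text{for all } n \ge n_0 \,\}
\Theta(F(n)) = O(F(n)) \cap \Omega(F(n))
```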


CONCURRENCY VS. PARALLEL

Parallel: work is actually occurring at the same time; limited by the number of processors

Concurrent: tasks are in progress at the same time but not necessarily executing; “unlimited”

Omit 2.5.8 & most of 2.5.9


PITFALLS OF PARALLEL PROGRAMMING (2.6)

Pitfalls: issues that can cause problems
Synchronization is often required
Too little synchronization: non-determinism
Too much: reduces scaling, increases time, and may cause deadlock


RACE CONDITIONS (2.6.1)

A situation in which the final result depends on the order in which tasks complete their work

Occurs when concurrent tasks share a memory location and at least one of them writes to it

Unpredictable: races don't always cause errors
Interleaving: instructions from two or more tasks are executed in an alternating manner


RACE CONDITIONS ~ EXAMPLE 2.2

Task A:
A = X
A += 1
X = A

Task B:
B = X
B += 2
X = B

Assume X is initially 0.

What are the possible results?

So, Tasks A & B are not REALLY independent!
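The possible outcomes of Example 2.2 can be found by simulating every legal interleaving: each task's three instructions stay in program order, but the two tasks may alternate freely (a sketch, not how real hardware schedules threads):

```python
from itertools import combinations

def run(order):
    # Execute one interleaving; order is a list of 'A'/'B' choices.
    x = 0                     # shared X, initially 0
    local = {'A': 0, 'B': 0}  # each task's private variable
    step = {'A': 0, 'B': 0}   # next instruction index per task
    add = {'A': 1, 'B': 2}
    for task in order:
        s = step[task]
        if s == 0:
            local[task] = x            # A = X   (or B = X)
        elif s == 1:
            local[task] += add[task]   # A += 1  (or B += 2)
        else:
            x = local[task]            # X = A   (or X = B)
        step[task] += 1
    return x

# An interleaving is a choice of which 3 of the 6 slots run task A.
outcomes = {run(['A' if i in a_slots else 'B' for i in range(6)])
            for a_slots in combinations(range(6), 3)}
# outcomes == {1, 2, 3}
```

Three different final values of X are possible, which is exactly why the tasks are not really independent.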


RACE CONDITIONS ~ EXAMPLE 2.3

Task A:
X = 1
A = Y

Task B:
Y = 1
B = X

Assume X & Y are initially 0.

What are the possible results?


SOLUTIONS TO RACE CONDITIONS (2.6.2)

Mutual exclusion, locks, semaphores, atomic operations: mechanisms that prevent simultaneous access to a memory location, allowing one task to complete its access before another starts

Does not always solve the problem – may still depend upon which task executes first


DEADLOCK (2.6.3)

A situation in which two or more processes cannot proceed because each is waiting on another: execution STOPs

Recommendations for avoidance:
Avoid mutual exclusion
Hold at most one lock at a time
Acquire locks in the same order
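The last rule can be sketched with two Python locks: both tasks acquire them in the same global order, so no circular wait can form (the task names and shared list are illustrative):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def task(name):
    # Every task takes lock_a before lock_b: a single global order.
    # If one task took them in the reverse order, each could end up
    # holding one lock while waiting forever for the other.
    with lock_a:
        with lock_b:
            done.append(name)

threads = [threading.Thread(target=task, args=(n,))
           for n in ('task1', 'task2')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```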


DEADLOCK – NECESSARY & SUFFICIENT CONDITIONS

1. Mutual Exclusion Condition: the resources involved are non-shareable.
Explanation: at least one resource must be held in a non-shareable mode; that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.

2. Hold and Wait Condition: a requesting process already holds resources while waiting for the requested resources.
Explanation: there must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently held by other processes.

3. No Preemption Condition: resources already allocated to a process cannot be preempted.
Explanation: resources cannot be forcibly removed from a process; they are held until used to completion or released voluntarily by the process holding them.

4. Circular Wait Condition: the processes in the system form a circular list or chain where each process in the list is waiting for a resource held by the next process in the list.


STRANGLED SCALING (2.6.4)

Fine-grain locking: using many locks on small sections instead of one lock on a large section

Notes:
One large lock is faster to manage but blocks other processes

Time consideration for set/release of many locks

Example: lock row of matrix, not entire matrix


LACK OF LOCALITY (2.6.5)

Two assumptions for good locality:
Temporal locality: the same location will be accessed again soon

Spatial locality: a nearby location will be accessed soon

Reminder: a cache line is the block that is retrieved on a miss
Currently, a cache miss costs roughly 100 cycles


LOAD IMBALANCE (2.6.6)

Uneven distribution of work over processors

Related to the decomposition of the problem
Few vs. many tasks: what are the implications?


OVERHEAD (2.6.7)

Always present in parallel processing: launching tasks, synchronizing

A small vs. a large number of processors ~ implications???

~the end of chapter 2~