CHAPTER 2: PARALLEL PROGRAMMING BACKGROUND
"By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling the performance of parallel applications."
TRADITIONAL PARALLEL MODELS
- Serial model: SISD
- Parallel models: SIMD, MIMD, MISD*
Key: S = Single, M = Multiple, I = Instruction, D = Data
VOCABULARY & NOTATION (2.1)
- Task vs. data: tasks are instructions that operate on data, modifying it or creating new data
- Parallel computation: multiple tasks that must be coordinated & managed
- Dependencies
  - Data dependency: a task requires data produced by another task
  - Control dependency: events/steps must occur in a certain order (e.g., I/O)
TASK MANAGEMENT – FORK-JOIN
- Fork: split the control flow, creating a new control flow
- Join: control flows are synchronized & merged (see the sketch below)
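A minimal fork-join sketch using C++ std::thread (an illustration of the concept only; the names & computation are mine, and the book's examples use higher-level constructs such as TBB and Cilk Plus):

    #include <iostream>
    #include <thread>

    int main() {
        int partial = 0;
        // Fork: create a new control flow that runs alongside main
        std::thread child([&partial] { partial = 21 + 21; });
        // ... the main flow could do independent work here ...
        child.join();  // Join: synchronize & merge the flows
        std::cout << "result = " << partial << "\n";
    }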
GRAPHICAL NOTATION – FIG. 2.1
Symbols for: task, data, fork, join, dependency
STRATEGIES (2.2)
Data parallelism: the best strategy for scalable parallelism, i.e., parallelism that grows as the data set/problem size grows
- Split the data set over a set of processors, with a task processing each subset
- More data means more tasks (see the sketch below)
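A minimal data-parallel sketch that splits an array across a fixed number of threads (the worker count & chunking scheme are illustrative assumptions, not the book's code):

    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1'000'000, 1);
        const unsigned nthreads = 4;                 // illustrative worker count
        std::vector<long long> partial(nthreads, 0);
        std::vector<std::thread> workers;
        const std::size_t chunk = data.size() / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t == nthreads - 1) ? data.size() : begin + chunk;
            // Each task processes its own subset of the data
            workers.emplace_back([&, begin, end, t] {
                partial[t] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0LL);
            });
        }
        for (auto& w : workers) w.join();            // join all forked flows
        std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << "\n";
    }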
STRATEGIES
Control parallelism (functional decomposition): different program functions run in parallel
- Not scalable: the best speedup is a constant factor
- As data grows, the parallelism doesn't
- May have less or no overhead
REGULAR VS. IRREGULAR PARALLELISM
- Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication)
- Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program)
- Many problems contain combinations of both
HARDWARE MECHANISMS (2.3)
The two most important:
- Thread parallelism: implemented in HW using a separate flow of control for each worker; supports regular & irregular parallelism and functional decomposition
- Vector parallelism: implemented in HW with one flow of control operating on multiple data elements; supports regular and some irregular parallelism
BRANCH STATEMENTS – DETRIMENTAL TO PARALLELISM
• Locality
• Pipelining
• HOW?
MASKING - ALL CONTROL PATHS ARE EXECUTED BUT RESULTS ARE MASKED OUT – NOT USED
    if (a & 1)
        a = 3*a + 1;
    else
        a = a / 2;
The if/else contains branch statements. Masking: both parts are executed, keeping only one result:
    p = (a & 1);
    t = 3*a + 1;
    if (p) a = t;
    t = a / 2;
    if (!p) a = t;
No branches, a single flow of control: masking works as if the code were written this way.
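A branch-free form of the same computation that a compiler can vectorize, written as a loop (a sketch; the function name, array & bounds are mine):

    void step(int* a, int n) {
        for (int i = 0; i < n; ++i) {
            int p    = a[i] & 1;         // predicate
            int odd  = 3 * a[i] + 1;     // "then" path, always computed
            int even = a[i] / 2;         // "else" path, always computed
            a[i] = p ? odd : even;       // select: keeps only one result
        }
    }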
MACHINE MODELS (2.4)
- Core
- Functional units
- Registers
- Cache memory: multiple levels
CACHE MEMORY
- Blocks (cache lines): the amount fetched at once
- Bandwidth: the amount that can be transferred concurrently
- Latency: the time to complete a transfer
- Cache coherence: consistency among copies
VIRTUAL MEMORY
- Memory system = disk storage + chip memory
- Allows programs larger than memory to run; allows multiprocessing
- Swaps pages; HW maps logical to physical addresses
- Data locality is important to efficiency
- Terms: page fault, thrashing
PARALLEL MEMORY ACCESS
- Caches (multiple)
- NUMA: Non-Uniform Memory Access
- PRAM: Parallel Random-Access Machine, a theoretical model that assumes uniform memory access times
PERFORMANCE ISSUES (2.4.2)
Data locality
- Choose code segments that fit in cache
- Design to use data in close proximity
- Align data with cache lines (blocks)
- Dynamic grain size is a good strategy
PERFORMANCE ISSUES
Arithmetic intensity: a large number of on-chip compute operations for every off-chip memory access
- Otherwise, communication overhead is high
- Related: grain size (see the worked example below)
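A quick worked illustration (my numbers, not the text's): SAXPY, y[i] = a*x[i] + y[i], performs 2 floating-point operations per element while moving about 12 bytes (load x[i] & y[i], store y[i], at 4 bytes each), for an arithmetic intensity of roughly 2/12 ≈ 0.17 ops/byte. That is low, so the loop is memory-bound.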
FLYNN’S CATEGORIES
- Serial model: SISD
- Parallel models:
  - SIMD: array processors, vector processors
  - MIMD: heterogeneous computers, clusters
  - MISD*: not useful
CLASSIFICATION BASED ON MEMORY
- Shared memory: each processor accesses a common memory
  - Access issues; no message passing; each processor usually has a small local memory
- Distributed memory: each processor has its own local memory
  - Explicit messages are sent between processors
EVOLUTION (2.4.4)
- GPU: graphics accelerators, now general purpose
- Offload: running computations on an accelerator, GPU, or co-processor rather than the regular CPUs
- Heterogeneous: different hardware working together
- Host processor: handles distribution, I/O, etc.
PERFORMANCE (2.5)
Various interpretations of performance:
- Reduce the total time for a computation: latency
- Increase the rate at which a series of results is computed: throughput
- Reduce power consumption
*Each is a different performance target
LATENCY & THROUGHPUT (2.5.1)
- Latency: the time to complete a task
- Throughput: the rate at which tasks are completed, in units per time (e.g., jobs per hour)
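An illustrative example (mine, not the book's): a pipelined car wash may take 10 minutes per car (latency) yet finish a car every 2 minutes (throughput of 30 cars/hour), so improving throughput does not necessarily improve latency.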
OMIT SECTION 2.5.3 – POWER
SPEEDUP & EFFICIENCY (2.5.2)
Sp = T1 / Tp
- T1: time to complete on 1 processor
- Tp: time to complete on P processors
REMEMBER: "time" means number of instructions
E = Sp / P = T1 / (P * Tp)
- E = 1 is "perfect"
Linear speedup: occurs when the algorithm runs P times faster on P processors
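A quick worked example (illustrative numbers): with T1 = 100 s and T4 = 30 s on P = 4 processors, S4 = 100/30 ≈ 3.3 and E = 3.3/4 ≈ 0.83.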
SUPERLINEAR SPEEDUP (P.57)
- Efficiency > 1; very rare
- Often due to HW variations (cache)
- Working in parallel may eliminate some work that is done when serial
AMDAHL & GUSTAFSON-BARSIS (2.5.4, 2.5.5)
- Amdahl: speedup is limited by the amount of serial work required
- Gustafson-Barsis: as problem size grows, parallel work grows faster than serial work, so speedup increases
- See examples & the formulas below
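For reference, the standard formulas, with f as the serial fraction (standard results, not the slide's notation): Amdahl gives Sp <= 1 / (f + (1 - f)/P), so with f = 0.1 even unlimited processors give Sp <= 10. Gustafson-Barsis, which measures f on the scaled parallel run, gives Sp = f + P(1 - f), so f = 0.1 & P = 100 give Sp = 90.1.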
WORK
- Work: total operations (time) for a task, so T1 = Work
- Ideally P * Tp = Work, i.e., T1 = P * Tp. Is that achievable? Rarely, due to overhead, load imbalance & serial sections
WORK-SPAN MODEL (2.5.6)
- Describes dependencies among tasks & allows times to be estimated
- Represents tasks as a DAG (Figure 2.8)
- Critical path: the longest path; span: the minimum time, i.e., the time of the critical path
- Assumes greedy task scheduling: no wasted resources or time
- Parallel slack: excess parallelism, more tasks than can be scheduled at once
WORK-SPAN MODEL
Speedup <= Work/Span
Upper bound: no matter how many processors are used, speedup can be no more than Work/Span
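A quick worked example (illustrative numbers): a task DAG with Work = 100 operations and Span = 10 (the critical path) gives Speedup <= 100/10 = 10, no matter how many processors are available.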
ASYMPTOTIC COMPLEXITY (2.5.7)
For comparing algorithms!
- Time complexity: growth of execution time in terms of input size
- Space complexity: growth of memory requirements in terms of input size
- Ignores constants; machine independent
BIG OH NOTATION (P.66)
Big Oh of F(n) – upper bound:
O(F(n)) = { G(n) | there exist positive constants c & n0 such that |G(n)| ≤ c·F(n) for all n ≥ n0 }
*Memorize
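A worked instance (my example): 3n^2 + 5n is in O(n^2), since choosing c = 4 & n0 = 5 gives 3n^2 + 5n ≤ 4n^2 for all n ≥ 5.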
BIG OMEGA & BIG THETA
Big Omega – Functions that define Lower Bound
Big Theta – Functions that define a Tight Bound – Both Upper & Lower Bounds
CONCURRENCY VS. PARALLEL
- Parallel: work actually occurring at the same time; limited by the number of processors
- Concurrent: tasks in progress at the same time but not necessarily executing; "unlimited"
Omit 2.5.8 & most of 2.5.9
PITFALLS OF PARALLEL PROGRAMMING (2.6)
- Pitfalls = issues that can cause problems
- Synchronization is often required
  - Too little: non-determinism
  - Too much: reduces scaling, increases time & may cause deadlock
RACE CONDITIONS (2.6.1)
- Situation in which the final result depends upon the order in which tasks complete their work
- Occurs when concurrent tasks share a memory location & there is a write operation
- Unpredictable: races don't always cause errors
- Interleaving: instructions from 2 or more tasks are executed in an alternating manner
RACE CONDITIONS ~ EXAMPLE 2.2
Task A: A = X; A += 1; X = A
Task B: B = X; B += 2; X = B
Assume X is initially 0.
What are the possible results?
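(Working it out: if Task A finishes before B starts, X = 3; likewise B before A gives X = 3; but if both read X = 0 before either writes, the final value is 1 or 2 depending on which write lands last. So X can end up 1, 2, or 3.)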
So, Tasks A & B are not REALLY independent!
RACE CONDITIONS ~ EXAMPLE 2.3
Task A: X = 1; A = Y
Task B: Y = 1; B = X
Assume X & Y are initially 0.
What are the possible results?
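(Working it out: under simple interleaving at least one task writes before the other reads, so (A, B) can be (0, 1), (1, 0), or (1, 1). The outcome (0, 0) is impossible under interleaving, yet real hardware that reorders memory operations can actually produce it.)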
SOLUTIONS TO RACE CONDITIONS (2.6.2)
- Mutual exclusion, locks, semaphores, atomic operations: mechanisms that prevent concurrent access to a memory location, allowing one task to complete its access before another starts
- Does not always solve the problem: the result may still depend upon which task executes first (see the sketch below)
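A minimal sketch of guarding Example 2.2's shared variable with a mutex (C++; the variable names follow the example, the rest is illustrative):

    #include <iostream>
    #include <mutex>
    #include <thread>

    int X = 0;
    std::mutex m;

    int main() {
        std::thread A([] {
            std::lock_guard<std::mutex> g(m);  // one task at a time
            int a = X; a += 1; X = a;
        });
        std::thread B([] {
            std::lock_guard<std::mutex> g(m);
            int b = X; b += 2; X = b;
        });
        A.join(); B.join();
        std::cout << X << "\n";  // now always 3, though which task ran first still varies
    }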
DEADLOCK (2.6.3)
Situation in which 2 or more processes cannot proceed due to waiting on each other – STOP
Recommendations for avoidance (see the sketch after the conditions below):
- Avoid mutual exclusion
- Hold at most 1 lock at a time
- Acquire locks in the same order
DEADLOCK – NECESSARY & SUFFICIENT CONDITIONS
1. Mutual exclusion condition: the resources involved are non-shareable. Explanation: at least one resource must be held in a non-shareable mode, that is, only one process at a time claims exclusive control of the resource; if another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold-and-wait condition: a requesting process already holds resources while waiting for the requested resources. Explanation: there must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently held by other processes.
3. No-preemption condition: resources already allocated to a process cannot be preempted. Explanation: resources cannot be forcibly removed from a process; they are released only when used to completion or voluntarily by the process holding them.
4. Circular-wait condition: the processes in the system form a circular list or chain in which each process is waiting for a resource held by the next process in the list.
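A minimal sketch of the "acquire locks in the same order" recommendation, using C++ std::scoped_lock, which acquires multiple locks without deadlock (the account type & transfer function are illustrative assumptions):

    #include <mutex>

    struct Account { double balance = 0.0; std::mutex m; };

    // Locking both mutexes via std::scoped_lock breaks the circular-wait
    // condition: two concurrent transfers in opposite directions cannot
    // each grab one lock and wait forever for the other.
    void transfer(Account& from, Account& to, double amount) {
        std::scoped_lock lock(from.m, to.m);  // deadlock-free multi-lock
        from.balance -= amount;
        to.balance   += amount;
    }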
STRANGLED SCALING (2.6.4)
- Fine-grain locking: use of many locks on small sections, not 1 lock on a large section
- Notes:
  - 1 large lock is faster to manage but blocks other processes
  - Setting/releasing many locks takes time
- Example: lock a row of the matrix, not the entire matrix (see the sketch below)
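A minimal sketch of the row-locking example (the matrix layout & update function are illustrative assumptions):

    #include <mutex>
    #include <vector>

    struct Matrix {
        int rows, cols;
        std::vector<double> data;
        std::vector<std::mutex> row_locks;  // one lock per row, not one global lock
        Matrix(int r, int c) : rows(r), cols(c), data(r * c), row_locks(r) {}
    };

    // Tasks updating different rows proceed in parallel; only tasks
    // touching the same row serialize.
    void add_to_row(Matrix& m, int row, double v) {
        std::lock_guard<std::mutex> g(m.row_locks[row]);
        for (int j = 0; j < m.cols; ++j)
            m.data[row * m.cols + j] += v;
    }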
LACK OF LOCALITY (2.6.5)
Two assumptions behind good locality:
- Temporal locality: the same location will be accessed again soon
- Spatial locality: nearby locations will be accessed soon
Reminder: a cache line is the block that is retrieved; currently a cache miss costs ~100 cycles (see the sketch below)
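A minimal sketch of how traversal order affects spatial locality (the matrix size is illustrative; C/C++ arrays are row-major):

    const int N = 1024;
    double a[N][N];

    double sum_row_major() {        // good: walks memory contiguously,
        double s = 0;               // using every element of each cache line
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                s += a[i][j];
        return s;
    }

    double sum_col_major() {        // bad: strides N*8 bytes per access,
        double s = 0;               // touching a new cache line every time
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                s += a[i][j];
        return s;
    }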
LOAD IMBALANCE (2.6.6)
Uneven distribution of work over processors
- Related to the decomposition of the problem
- Few vs. many tasks: what are the implications?
OVERHEAD (2.6.7)
- Always present in parallel processing: launching tasks, synchronizing
- Smaller vs. larger numbers of processors: what are the implications?
~the end of chapter 2~