1
Virtues of Good (Parallel) Software
- Concurrency: able to exploit concurrency in the algorithm/problem/hardware
- Scalability: resilient to increasing processor count
- Locality: more frequent access to local data than to remote data
- Modularity: employ abstraction and modular design
2
Two Basic Requirements for a Parallel Program
- Safety: produce correct results. The result computed on P processors and on 1 processor must be IDENTICAL.
- Liveness: able to proceed and finish; free of deadlock.
3
Sources of Overhead
- Execution time: the time that elapses from when the first processor starts executing on the problem to when the last processor completes execution.
  Execution time = computation time + communication time + idle time
- Communication / interprocess interaction: usually the main source of overhead.
  T_comm = t_s + t_w*L, where t_s is the per-message startup time, t_w the per-word transfer time, and L the message length in words.
  Minimize the volume and frequency of communications; overlap computation with communication (see the sketch below).
- Idling: lack of computation or lack of data, caused by load imbalance, synchronization, the presence of serial components, or waiting on remote data.
- Replicated computation: the trade-off is to communicate or to replicate.
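To make the cost model concrete, here is a minimal sketch of T_comm = t_s + t_w*L in Python; the parameter values are illustrative assumptions, not measurements:

```python
# Linear communication-cost model from this slide: T_comm = t_s + t_w * L.
# t_s: per-message startup (latency) time; t_w: per-word transfer time;
# L: message length in words. The constants below are made-up examples.

def comm_time(num_messages, total_words, t_s=1e-5, t_w=1e-8):
    """Time to send total_words of data split over num_messages messages."""
    return num_messages * t_s + t_w * total_words

# Fewer, larger messages pay the startup cost t_s less often:
words = 1_000_000
print(comm_time(num_messages=1000, total_words=words))  # many small messages
print(comm_time(num_messages=10,   total_words=words))  # few large messages
```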
4
Speedup & Efficiency
- Relative speedup: the factor by which the execution time is reduced on multiple processors: S(p) = T_1/T_p, where T_1 is the execution time on one processor and T_p is the execution time on p processors.
- Absolute speedup: T_1 is instead the uniprocessor time of the best-known sequential algorithm.
- In theory S(p) <= p. Embarrassingly parallel (EP) problems need no communication among CPUs. Superlinear speedup does occur in practice (e.g., due to cache effects).
- Efficiency: the fraction of time that processors spend doing useful work: E = S/p = T_1/(p*T_p)
- Parallel cost: p*T_p. Parallel overhead: T_o = p*T_p - T_1.
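These definitions translate directly into code. A small sketch (the timings are made-up examples):

```python
# Speedup, efficiency, parallel cost, and parallel overhead from measured
# timings, using the definitions above.

def metrics(t1, tp, p):
    s = t1 / tp                # relative speedup S(p) = T_1 / T_p
    e = s / p                  # efficiency E = S / p = T_1 / (p * T_p)
    cost = p * tp              # parallel cost
    overhead = p * tp - t1     # parallel overhead T_o = p*T_p - T_1
    return s, e, cost, overhead

s, e, cost, t_o = metrics(t1=100.0, tp=15.0, p=8)
print(f"S = {s:.2f}, E = {e:.2f}, cost = {cost:.1f}, T_o = {t_o:.1f}")
# S = 6.67, E = 0.83, cost = 120.0, T_o = 20.0
```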
5
Amdahl’s Law
This is for a fixed problem size.
T_p = alpha*T_1/p + (1-alpha)*T_1
S(p) = 1 / ((1-alpha) + alpha/p)
S -> 1/(1-alpha) as p -> infinity:
  alpha = 90%   => S <= 10
  alpha = 99%   => S <= 100
  alpha = 99.9% => S <= 1000
alpha: fraction of operations in the serial code that can be parallelized; p: number of processors
Amdahl's law was long seen as a "mental block" against massively parallel processing.
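A short sketch of the law, showing how the speedup saturates at 1/(1-alpha):

```python
# Amdahl's law for a fixed problem size:
# S(p) = 1 / ((1 - alpha) + alpha / p),
# where alpha is the parallelizable fraction of the serial execution.

def amdahl_speedup(alpha, p):
    return 1.0 / ((1.0 - alpha) + alpha / p)

for alpha in (0.90, 0.99, 0.999):
    # As p -> infinity, S approaches the limit 1 / (1 - alpha).
    print(alpha, round(amdahl_speedup(alpha, p=10**9), 1),
          "limit:", round(1.0 / (1.0 - alpha), 1))
```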
6
Gustafson’s Law
This is for a scaled problem size, or constant run time.
T_1 = (1-alpha)*T_p + p*alpha*T_p
S = (1-alpha) + alpha*p
As the problem size increases, the fraction of parallel operations increases.
alpha: fraction of time spent on parallel operations in the parallel program
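For contrast with Amdahl's law, a sketch of the scaled speedup, which grows linearly with p:

```python
# Gustafson's law for a scaled problem size (constant run time):
# S = (1 - alpha) + alpha * p,
# where alpha is the fraction of time the parallel program spends
# on parallel operations.

def gustafson_speedup(alpha, p):
    return (1.0 - alpha) + alpha * p

for p in (10, 100, 1000):
    print(p, gustafson_speedup(alpha=0.99, p=p))  # 9.91, 99.01, 990.01
```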
7
Iso-Efficiency Function
- For a fixed problem size N, as P increases, the growth of the speedup S slows down or levels off, and the efficiency E decreases.
- For fixed P, as N increases, S increases and E increases.
- As P increases, one can increase the problem size N such that the efficiency is kept constant. This N(p) for fixed efficiency is called the iso-efficiency function.
- The rate of increase of N(p), dN/dp, measures the scalability of a parallel program: a smaller rate of increase => more scalable (see the sketch below).
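A toy illustration, assuming a hypothetical cost model in which the total overhead grows as p*log2(p); for E = N/(N + T_o) the iso-efficiency function has a closed form:

```python
# Iso-efficiency for an assumed (illustrative) cost model: work W(N) = N,
# total overhead T_o(p) = c * p * log2(p), efficiency E = W / (W + T_o).

import math

def iso_n(p, e_target=0.8, c=10.0):
    """Problem size N that keeps efficiency at e_target on p processors.
    From E = N / (N + T_o):  N = T_o * E / (1 - E)."""
    t_o = c * p * math.log2(p)
    return t_o * e_target / (1.0 - e_target)

for p in (2, 8, 32, 128):
    # N must grow roughly like p*log(p) to hold efficiency constant.
    print(p, round(iso_n(p)))  # 80, 960, 6400, 35840
```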
8
Parallel Program Design
PCAM model (I. Foster): Partitioning, Communication, Agglomeration, Mapping
- Partitioning and communication address concurrency and scalability.
- Agglomeration and mapping address locality and other performance-related issues.
9
Partitioning
- Decompose the computation to be performed, and the data operated on by this computation, into small tasks.
- Purpose: expose opportunities for parallel execution.
- Ignore practical issues such as the number of processors in the target machine.
- Avoid replicating computation and data.
- Focus: define a large number of small tasks in order to yield a fine-grained decomposition of the problem.
  A fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms.
- Maximize concurrency.
10
Partitioning
- A good partition divides both the computation associated with a problem and the data this computation operates on.
- Domain/data decomposition: focus first on the data. Partition the data associated with the problem, then associate computations with the partitioned data.
- Functional decomposition: focus first on the computation. Decompose the computations to be performed, then deal with the data the decomposed computations work on.
11
Domain Decomposition
- Decompose the data first, then the associated computations: "owner computes".
- Outcome: tasks comprising some data and a set of operations on that data.
- Some operations may require data from several tasks => communication.
- The data can be input data, output data, intermediate data, or all of them.
- Rule of thumb: focus first on the largest data structure, or the data structure accessed most frequently.
- Mesh-based problems:
  - Structured mesh: 1D, 2D, 3D decompositions (see the sketch below)
  - Unstructured mesh: graph-partitioning tools such as METIS
- Favor the most aggressive decomposition possible at this stage.
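A minimal "owner computes" sketch for the structured 1-D case, assigning balanced contiguous blocks of cells to tasks (a common convention, not the only one):

```python
# Block decomposition of a 1-D mesh of n cells over p tasks.

def block_range(n, p, rank):
    """Half-open index range [lo, hi) owned by task `rank`; block sizes
    differ by at most one cell, so the load stays balanced."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

n, p = 10, 4
for rank in range(p):
    print(rank, block_range(n, p, rank))  # (0,3), (3,6), (6,8), (8,10)
```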
12
Functional Decomposition
- Focus first on the computation to be performed; divide the computations into disjoint tasks.
- Then consider the data associated with each sub-task:
  - Data requirements may be disjoint => done.
  - Data may overlap significantly => communications; may just as well try domain decomposition.
- Provides an alternative way of thinking about the problem; a hybrid decomposition may be best, e.g. multi-physics simulations: an overall functional decomposition, with a domain decomposition within each component.
13
Partitioning: Questions to Ask
- Does your partition define more tasks (an order of magnitude more?) than the number of processors of the target machine?
  No => reduced flexibility in subsequent stages.
- Does your partition avoid redundant computation and storage requirements?
  No => may not be scalable to large problems.
- Are the tasks of comparable size?
  No => hard to allocate equal amounts of work to CPUs => load imbalance.
- Does the number of tasks scale with the problem size?
  Ideal: increased problem size => increase in the number of tasks.
  No => may not be able to solve larger problems with more processors.
- Have you identified alternative partitions?
  Maximize flexibility; try both domain and functional decompositions.
14
Communication
- Purpose: determine the interaction among tasks.
- Distribute communication operations among many tasks.
- Organize communication operations in a way that permits concurrent execution.
- Four categories of communication:
  - Local/global:
    - Local: each task communicates with a small set of other tasks (its neighbors).
    - Global: communicate with many or all other tasks.
15
Communication
  - Structured/unstructured:
    - Structured: a task and its neighbors form a regular structure, e.g. a grid or tree.
    - Unstructured: communication is represented by arbitrary graphs.
  - Static/dynamic:
    - Static: the identity of communication partners does not change over time.
    - Dynamic: the identity of partners is determined by data computed at runtime and may be highly variable.
  - Synchronous/asynchronous:
    - Synchronous: requires coordination between communication partners.
    - Asynchronous: without cooperation.
16
Task Dependency Graph
- Task dependencies: one task cannot start until some other task(s) finish, e.g. the output of one task is the input of another task.
- Represented by the task dependency graph:
  - Directed and acyclic
  - Nodes: tasks (task size as the weight of the node)
  - Directed edges: dependencies among tasks
17
Task Dependency Graph
- Degree of concurrency: the number of tasks that can run concurrently.
  - Maximum degree of concurrency: the maximum number of tasks that can be executed simultaneously at any given time.
  - Average degree of concurrency: the average number of tasks that can run concurrently over the duration of the program.
- Critical path: the longest vertex-weighted directed path between any pair of start and finish nodes.
- Critical path length: the sum of the vertex weights along the critical path.
- Average degree of concurrency = total amount of work / critical path length (see the sketch below).
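These definitions can be checked on a small hypothetical graph; the sketch below computes the critical path length and the average degree of concurrency:

```python
# Critical path and average degree of concurrency of a task dependency
# graph (a vertex-weighted DAG). The example graph is hypothetical.

def critical_path_length(weights, deps):
    """Longest vertex-weighted path; `deps` maps a task to its dependents."""
    memo = {}
    def longest_from(v):  # weight of the longest path starting at v
        if v not in memo:
            memo[v] = weights[v] + max(
                (longest_from(u) for u in deps.get(v, [])), default=0)
        return memo[v]
    return max(longest_from(v) for v in weights)

weights = {"a": 4, "b": 2, "c": 2, "d": 3}        # task sizes
deps = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}  # a -> {b, c} -> d

cp = critical_path_length(weights, deps)          # a -> b -> d: 4+2+3 = 9
total_work = sum(weights.values())                # 11
print("critical path length:", cp)
print("average degree of concurrency:", total_work / cp)  # 11/9 ~ 1.22
```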
18
Task Interaction Graph
- Even independent tasks may need to interact, e.g. to share data.
- The interaction graph captures interaction patterns among tasks:
  - Nodes: tasks
  - Edges: communications / interactions
- Usually contains the task dependency graph as a sub-graph.
[Figure: example task interaction graph]
19
Communication: Questions to Ask
- Do all tasks perform the same number of communication operations?
  Unbalanced communication => poor scalability; distribute communications equitably.
- Does each task communicate only with a small number of neighbors?
  May need to re-formulate global communication in terms of local communication structures.
- Can communications proceed concurrently? Can the computations associated with different tasks proceed concurrently?
  No => may need to re-order computations / communications.
20
Agglomeration
- Improve performance: combine tasks to reduce the task interaction strength, increase locality, and increase the computation and communication granularity. Also determine whether it is worthwhile to replicate data/computation.
- Dependent tasks will be combined; independent tasks may also be agglomerated to increase granularity.
- Goals: reduce communication cost; retain flexibility with respect to scalability and mapping decisions.
21
Increasing Granularity
- Coarse grain usually performs better:
  - Send less data (reduce the volume of communication).
  - Use fewer messages when sending the same amount of data (reduce the frequency of communication).
- Surface-to-volume effects (see the sketch below):
  - Communication cost is usually proportional to the surface area of a subdomain.
  - Computation cost is usually proportional to its volume.
  - As task size increases, the amount of communication per unit of computation decreases.
  - Higher-dimensional decompositions are usually more efficient than lower-dimensional ones, due to the reduced surface area for a given volume.
- Replicated computation: may trade replicated computation for reduced communication or execution time.
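A back-of-the-envelope check of the surface-to-volume argument for an n x n grid with ghost-cell exchange (interior tasks only, and p a perfect square; both are simplifying assumptions):

```python
# Communication-to-computation ratio for 1-D strip vs 2-D block
# decompositions of an n x n grid on p tasks.

import math

def comm_1d(n, p):
    return 2 * n                  # two full rows of n ghost cells per task

def comm_2d(n, p):
    side = n / math.sqrt(p)       # each block is side x side
    return 4 * side               # four block edges of ghost cells

n, p = 1024, 64
compute = n * n / p               # "volume": cells per task, same either way
print("1-D comm/comp:", comm_1d(n, p) / compute)   # 0.125
print("2-D comm/comp:", comm_2d(n, p) / compute)   # 0.03125, 4x lower
```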
22
Agglomeration: Questions to Ask
- Has agglomeration reduced communication costs by increasing locality?
- If computation is replicated, have you verified that the benefits of replication outweigh its costs for a range of problem sizes and processor counts?
- If data is replicated, have you verified that this does not compromise scalability?
- Do the tasks have similar computation and communication costs after agglomeration? (Load balance.)
- Does the number of tasks still scale with the problem size?
23
Mapping
- Map tasks to processors or processes. If the number of tasks is larger than the number of processors, more than one task may need to be placed on a single processor.
- Goal: minimize the total execution time.
  - Place tasks that execute concurrently on different processors.
  - Place tasks that communicate frequently on the same processor.
- In the general case there is no computationally tractable algorithm for the mapping problem; it is NP-complete.
- In SPMD-style programs: one task per processor.
24
Parallel Algorithm Models
- Data-parallel model: processors perform similar operations on different data.
- Work/task-pool model (replicated workers): a pool of tasks and a number of processors. A processor can remove a task from the pool and work on it, and may generate new tasks during the computation and add them to the pool (see the sketch below).
- Master-slave / manager-worker model: master processors generate work and allocate it to worker processors.
- Pipeline / producer-consumer model: a stream of data passes through a succession of processors, each of which performs some task on it.
- Hybrid model: a combination of two or more models.
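A minimal sketch of the work-pool model using Python's standard library; the "task" (squaring a number) is a stand-in for real work, and dynamic task generation is omitted to keep termination simple:

```python
# Replicated workers pulling tasks from a shared pool (queue).

import multiprocessing as mp

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:           # sentinel: the pool is drained
            break
        results.put(item * item)   # remove a task from the pool and run it

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    nworkers, njobs = 4, 20
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(nworkers)]
    for proc in procs:
        proc.start()
    for n in range(njobs):         # fill the pool with tasks
        tasks.put(n)
    for _ in range(nworkers):      # one sentinel per worker
        tasks.put(None)
    out = [results.get() for _ in range(njobs)]
    for proc in procs:
        proc.join()
    print(sorted(out))
```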