Parallel Real-Time Systems
Parallel Computing Overview
References (Will be expanded as needed)
• Website for Parallel & Distributed Computing:
www.cs.kent.edu/~jbaker/PDC-F08/
– Selected slides from “Introduction to Parallel
Computing”
• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004.
– Chapter 1 is posted on website
• Selim Akl, “Parallel Computation: Models and Methods”, Prentice Hall, 1997. Updated online version available on website.
Outline
• Why use parallel computing
• Moore’s Law
• Modern parallel computers
• Flynn’s Taxonomy
• Seeking Concurrency
• Data clustering case study
• Programming parallel computers
Why Use Parallel Computers
• Solve compute-intensive problems faster
– Make infeasible problems feasible
– Reduce design time
• Solve larger problems in same amount of time
– Improve answer’s precision
– Reduce design time
• Increase memory size
– More data can be kept in memory
– Dramatically reduces the slowdown caused by accessing external storage, which increases computation time
• Gain competitive advantage
1989 Grand Challenges to
Computational Science Categories
• Quantum chemistry, statistical mechanics, and
relativistic physics
• Cosmology and astrophysics
• Computational fluid dynamics and turbulence
• Materials design and superconductivity
• Biology, pharmacology, genome sequencing, genetic
engineering, protein folding, enzyme activity, and cell
modeling
• Medicine, and modeling of human organs and bones
• Global weather and environmental modeling
Weather Prediction
• Atmosphere is divided into 3D cells
• Data includes temperature, pressure, humidity, wind speed and direction, etc.
– Recorded at regular time intervals in each cell
• There are about 5×10³ cells of 1-mile cubes.
• A modern computer would take over 100 days to perform the calculations needed for a 10-day forecast.
• Details are in Ian Foster’s 1995 online textbook
– Designing and Building Parallel Programs
– Included in the Parallel Reference List, which will be posted on the website.
Moore’s Law
• In 1965, Gordon Moore [87] observed that the transistor density of chips doubled every year.
– That is, the area required per transistor was being halved yearly.
– This is an exponential rate of increase.
• By the late 1980’s, the doubling period had
slowed to 18 months.
• Reducing the silicon area causes the speed of the processors to increase.
• Moore’s law is sometimes stated: “The
processor speed doubles every 18 months”
Microprocessor Revolution
[Figure: speed (log scale) vs. time, showing curves for supercomputers, mainframes, minis, and micros]
Some Definitions
• Concurrent – Sequential events or processes which seem to occur or progress at the same time.
• Parallel – Events or processes which occur or progress at the same time.
• Parallel computing provides simultaneous
execution of operations within a single parallel computer
• Distributed computing provides simultaneous execution of operations across a number of systems.
Flynn’s Taxonomy
• Best known classification scheme for parallel
computers.
• Classifies a computer by the parallelism it exhibits in its
– Instruction stream
– Data stream
• A sequence of instructions (the instruction
stream) manipulates a sequence of operands
(the data stream)
• The instruction stream (I) and the data stream
(D) can be either single (S) or multiple (M)
• Four combinations: SISD, SIMD, MISD, MIMD
SISD
• Single Instruction, Single Data
• Usual sequential computer is primary example
– i.e., uniprocessors
– Note: co-processors don’t count as more processors
• Concurrent processing allowed
– Instruction prefetching
– Pipelined execution of instructions
– Independent concurrent tasks can execute different
sequences of operations.
SIMD
• Single instruction, multiple data
• One instruction stream is broadcast to all
processors
• Each processor, also called a processing
element (or PE), is very simplistic and is
essentially an ALU;
– PEs do not store a copy of the program
nor have a program control unit.
• Individual processors can be inhibited from
participating in an instruction (based on a
data test).
SIMD (cont.)
• All active processors execute the same instruction synchronously, but on different data
• On a memory access, all active
processors must access the same location
in their local memory.
• The data items form an array (or vector)
and an instruction can act on the complete
array in one cycle.
SIMD (cont.)
• Quinn calls this architecture a processor
array.
• Examples include
– The STARAN and the MPP (Dr. Batcher, architect)
– The Connection Machine CM-2, built by Thinking Machines.
How to View a SIMD Machine
• Think of soldiers all in a unit.
• The commander selects certain soldiers
as active.
– For example, every even numbered row.
• The commander barks out an order to all
the active soldiers, who execute the order
synchronously.
MISD
• Multiple instruction streams, single data stream
• Primarily corresponds to multiple redundant
computation, say for reliability.
• Quinn argues that a systolic array is an example of a MISD structure (pp. 55–57)
• Some authors include pipelined architecture in
this category
• This category does not receive much attention
from most authors, so we won’t discuss it
further.
MIMD
• Multiple instruction, multiple data
• Processors are asynchronous and can
independently execute different programs
on different data sets.
• Communications are handled either
– through shared memory (multiprocessors)
– by use of message passing (multicomputers)
• MIMD’s are considered by many
researchers to include the most powerful,
least restricted computers.
MIMD (cont. 2/4)
• Have major communication costs
– When compared to SIMDs
– Internal ‘housekeeping activities’ are often overlooked
• Maintaining distributed memory & distributed databases
• Synchronization or scheduling of tasks
• Load balancing between processors
• The SPMD method of programming MIMDs
– All processors execute the same program.
– SPMD stands for single program, multiple data.
– An easy method to program when the number of processors is large.
– While the processors have the same code, each can be executing a different part of it at any point in time.
MIMD (cont 3/4)
• A more common technique for programming MIMDs is to use multi-tasking
– The problem solution is broken up into various tasks.
– Tasks are distributed among processors initially.
– If new tasks are produced during execution, these may be handled by the parent processor or distributed to other processors
– Each processor can execute its collection of tasks concurrently.
• If some of its tasks must wait for results from other tasks or for new data, the processor will focus on its remaining tasks.
– Larger programs usually require a load balancing algorithm to rebalance tasks between processors
– Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks
• E.g., on critical path, more important, earlier deadline, etc.
MIMD (cont 4/4)
• Recall, there are two principal types of
MIMD computers:
– Multiprocessors (with shared memory)
– Multicomputers (message passing)
• Both are important and will be covered in
greater detail next.
Multiprocessors (Shared Memory MIMDs)
• Consists of two types
– Centralized Multiprocessors
• Also called UMA (Uniform Memory Access)
• Symmetric Multiprocessor or SMP
– Distributed Multiprocessors
• Also called NUMA (Nonuniform Memory
Access)
Centralized Multiprocessors (SMPs)
[Figure: identical CPUs connected by a bus to a common block of memory]
• Consists of identical CPUs connected by a bus to a common block of memory.
• Each processor requires the same amount
of time to access memory.
• Usually limited to a few dozen processors
due to memory bandwidth.
• SMPs and clusters of SMPs are currently
very popular
Distributed Multiprocessors (NUMA)
[Figure: CPUs, each with an attached local memory, connected by an interconnection network; together the memories form one shared address space]
• Has a distributed memory system
• Each memory location has the same address for
all processors.
– Access time to a given memory location varies
considerably for different CPUs.
• Normally, fast caches are used to reduce the problem of different memory access times for different processors.
– This creates the problem of ensuring that all copies of the same data item in different memory locations stay identical.
Multicomputers (Message-Passing MIMDs)
• Processors are connected by a network
– Usually an interconnection network
– Also, may be connected by Ethernet links or a bus.
• Each processor has a local memory and can only access its own local memory.
• Data is passed between processors using messages, when specified by the program.
Multicomputers (cont)
• Message passing between processors is handled by a message-passing library (e.g., MPI, PVM)
• The problem is divided into processes or
tasks that can be executed concurrently
on individual processors.
• Each processor is normally assigned
multiple tasks.
Multiprocessors vs Multicomputers
• Programming disadvantages of message-
passing
– Programmers must make explicit
message-passing calls in the code
– This is low-level programming and is error
prone.
– Data is not shared but copied, which
increases the total data size.
– Data Integrity: difficulty in maintaining
correctness of multiple copies of data item.
Multiprocessors vs Multicomputers
(cont)
• Programming advantages of message-passing
– No problem with simultaneous access to data.
– Allows different PCs to operate on the same
data independently.
– Allows PCs on a network to be easily
upgraded when faster processors become
available.
• Mixed “distributed shared memory” systems
exist
– An example is a cluster of SMPs.
Types of Parallel Execution
• Data parallelism
• Control/Job/Functional parallelism
• Pipelining
• Virtual parallelism
Data Parallelism
• All tasks (or processors) apply the same set of operations to different data.
• Example:
for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
• Operations may be executed concurrently
• Accomplished on SIMDs by having all active processors execute the operations synchronously.
• Can be accomplished on MIMDs by assigning 100/p tasks to each processor and having each processor calculate its share asynchronously.
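
The loop above can be written directly in C with OpenMP (one of the tools named in the references). This is a minimal sketch, with the array contents invented for the example:

#include <stdio.h>

#define N 100                       /* iterations from the example loop */

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {   /* sample data, chosen arbitrarily */
        b[i] = i;
        c[i] = 2.0 * i;
    }

    /* Data parallelism: every thread applies the same operation,
       each to its own share of the iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %.1f\n", a[99]);
    return 0;
}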
Supporting MIMD Data Parallelism
• SPMD (single program, multiple data)
programming is not really data parallel
execution, as processors typically execute
different sections of the program concurrently.
• Data parallel programming can be strictly
enforced when using SPMD as follows:
– Processors execute the same block of instructions
concurrently but asynchronously
– No communication or synchronization occurs within
these concurrent instruction blocks.
– Each instruction block is normally followed by a
synchronization and communication block of steps
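
A sketch of this enforced data-parallel style in C with MPI: compute_block is a hypothetical stand-in for the concurrent instruction block, and the synchronization/communication block that follows it uses standard MPI calls:

#include <mpi.h>

/* Hypothetical concurrent instruction block: runs asynchronously on
   each process, with no communication inside. */
void compute_block(int rank, double *local, int n) {
    for (int i = 0; i < n; i++)
        local[i] += rank;           /* stand-in work on the local data */
}

void data_parallel_step(double *local, int n) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1. All processes execute the same block, concurrently but
          asynchronously, on their own data. */
    compute_block(rank, local, n);

    /* 2. Synchronization and communication block between steps. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, local, n, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
}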
MIMD Data Parallelism (cont.)
• Strict data parallel programming is unusual
for MIMDs, as the processors usually
execute independently, running their own
local program.
Data Parallelism Features
• Each processor performs the same data computation on different data sets
• Computations can be performed either synchronously or asynchronously
• Defn: Grain Size is the average number of computations performed between communication or synchronization steps – See Quinn textbook, page 411
• Data parallel programming usually results in smaller grain size computation
– SIMD computation is considered to be fine-grain
– MIMD data parallelism is usually considered to be medium-grain
Control/Job/Functional Parallelism
• Independent tasks apply different operations to different data elements
• Example:
a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s − m²
• The first and second statements may execute concurrently
• The third and fourth statements may execute concurrently
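
A sketch of this control parallelism in C with OpenMP sections (the variable names follow the example above):

#include <stdio.h>

int main(void) {
    double a, b, m, s, v;

    /* The first and second statements are independent of each other. */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = 2;
        #pragma omp section
        b = 3;
    }

    /* The third and fourth statements are independent of each other,
       but both need a and b, so they run in a second phase. */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2;
        #pragma omp section
        s = (a * a + b * b) / 2;
    }

    v = s - m * m;              /* needs both m and s */
    printf("v = %.2f\n", v);    /* prints v = 0.25 */
    return 0;
}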
Control Parallelism Features
• Problem is divided into different non-
identical tasks
• Tasks are divided between the processors
so that their workload is roughly balanced
• Parallelism at the task level is considered
to be coarse grained parallelism
Data Dependence Graph
• Can be used to identify data parallelism and job parallelism.
• See page 11.
• Most realistic jobs contain both parallelisms
– Can be viewed as branches in data parallel tasks
– If no path exists from vertex u to vertex v, then job parallelism can be used to execute the tasks u and v concurrently.
– If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently.
For example, “mow lawn” becomes
• Mow N lawn
• Mow S lawn
• Mow E lawn
• Mow W lawn
• If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously.
• Similarly, if several people are available to “edge lawn” and “weed garden”, then we can use data parallelism to provide more concurrency.
Pipelining
• Divide a process into stages
• Produce several items simultaneously
Compute Partial Sums
Consider the for loop:
p[0] ← a[0]
for i ← 1 to 3 do
    p[i] ← p[i-1] + a[i]
endfor
• This computes the partial sums:
p[0] = a[0]
p[1] = a[0] + a[1]
p[2] = a[0] + a[1] + a[2]
p[3] = a[0] + a[1] + a[2] + a[3]
• The loop is not data parallel as there are dependencies.
• However, we can stage the calculations in order to achieve some parallelism.
Partial Sums Pipeline
[Figure: a four-stage pipeline; a[0] enters an assignment stage, a[1], a[2], a[3] enter successive addition stages, and the partial sums p[0], p[1], p[2], p[3] emerge from the stages in turn]
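
This pipeline maps naturally onto message passing. A sketch in C with MPI, where process i holds a[i], receives the partial sum from process i-1, adds its element, and forwards the result (the sample values of a are invented):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = rank + 1.0;      /* sample data: a[i] = i + 1 */
    double p = a;               /* on process 0, p[0] = a[0] */

    if (rank > 0) {             /* receive p[i-1] from the previous stage */
        double prev;
        MPI_Recv(&prev, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        p = prev + a;           /* p[i] = p[i-1] + a[i] */
    }
    if (rank < size - 1)        /* pass p[i] on to the next stage */
        MPI_Send(&p, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

    printf("process %d: p = %.1f\n", rank, p);
    MPI_Finalize();
    return 0;
}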
Virtual Parallelism
• In data parallel applications, it is often simpler to initially
design an algorithm or program assuming one data item
per processor.
– Particularly useful for SIMD programming
• If more processors are needed than the actual machine provides, each processor is given a block of ⌈n/p⌉ or ⌊n/p⌋ data items
– Typically, this requires a routine adjustment in the program.
– Will result in a slowdown in running time of at least a factor of ⌈n/p⌉.
• Called virtual parallelism, since each processor plays the role of several processors.
• A SIMD computer has been built that automatically converts code to handle ⌈n/p⌉ items per processor.
– The Wavetracer SIMD computer.
Slides from Parallel
Architecture Section
See www.cs.kent.edu/~jbaker/PDC-F08/
References
• Slides in this section are taken from the Parallel
Architecture Slides at site
www.cs.kent.edu/~jbaker/PDC-F08/
• Book reference is Chapter 2 of Quinn’s textbook.
Interconnection Networks
• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Different interconnection networks define
different parallel machines.
• The interconnection network’s properties influence the type of algorithm used for various machines, as they affect how data is routed.
Terminology for Evaluating
Switch Topologies
• We need to evaluate four characteristics of a network in order to understand its effectiveness
• These are
– The diameter
– The bisection width
– The edges per node
– The constant edge length
• We’ll define these and see how they affect algorithm choice.
• Then we will introduce several different interconnection networks.
Terminology for Evaluating
Switch Topologies
• Diameter – Largest distance between two
switch nodes.
– A low diameter is desirable
– It puts a lower bound on the complexity of parallel algorithms which require communication between arbitrary pairs of nodes.
Terminology for Evaluating
Switch Topologies
• Bisection width – The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves.
– Or within 1 node of one-half if the number of processors is odd.
• High bisection width is desirable.
• In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of an algorithm.
Terminology for Evaluating
Switch Topologies
• Number of edges per node
– It is best if the maximum number of edges/node is a
constant independent of network size, as this allows
the processor organization to scale more easily to a
larger number of nodes.
– Degree is the maximum number of edges per node.
• Constant edge length? (yes/no)
– Again, for scalability, it is best if the nodes and edges
can be laid out in 3D space so that the maximum
edge length is a constant independent of network
size.
Three Important Interconnection Networks
• We will consider the following three well-known interconnection networks:
– 2-D mesh
– linear network
– hypercube
• All three of these networks have been used to build commercial parallel computers.
2-D Meshes
Note: Circles represent switches and squares
represent processors in all these slides.
2-D Mesh Network
• Switches arranged into a 2-D lattice or grid
• Communication allowed only between
neighboring switches
• Torus: Variant that includes wraparound
connections between switches on edge of
mesh
Evaluating 2-D Meshes
(Assumes the mesh is square; n = number of processors)
• Diameter: Θ(n^(1/2))
– Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.
• Bisection width: Θ(n^(1/2))
– Places a lower bound on algorithms that require distribution of data to all nodes.
• Max number of edges per switch: 4 (the degree)
• Constant edge length? Yes
• Does this scale well? Yes
Linear Network
• Switches arranged into a 1-D mesh
• Corresponds to a row or column of a 2-D mesh
• Ring: A variant that allows a wraparound connection between switches on the end.
• The linear and ring networks have many applications
• Essentially supports a pipeline in both directions
• Although these networks are very simple, they support many optimal algorithms.
Evaluating Linear and Ring Networks
• Diameter
– Linear : n-1 or Θ(n)
– Ring: n/2 or Θ(n)
• Bisection width:
– Linear: 1 or Θ(1)
– Ring: 2 or Θ(1)
• Degree for switches:
– 2
• Constant edge length?
– Yes
• Does this scale well?
– Yes
Hypercube (also called binary n-cube)
[Figure: a hypercube with n = 2^d processors & switches for d = 4; the nodes are labeled with the 4-bit addresses 0000 through 1111]
Hypercube with n = 2^d Processors
• Number of nodes is a power of 2
• Node addresses are 0, 1, …, n − 1
• Node i is connected to the d nodes whose addresses differ from i in exactly one bit position.
• Example: node k = 0111 is connected to 1111, 0011, 0101, and 0110
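
The neighbor relation is easy to compute: flipping bit b of a node's address with XOR gives its neighbor across dimension b. A small C sketch (the function name is mine):

#include <stdio.h>

/* Print the d neighbors of node i in a d-dimensional hypercube. */
void print_neighbors(unsigned i, int d) {
    for (int b = 0; b < d; b++)
        printf("across dimension %d: %u\n", b, i ^ (1u << b));
}

int main(void) {
    /* Node 0111 (= 7) in a 4-cube: prints 6, 5, 3, 15,
       i.e. 0110, 0101, 0011, 1111, matching the example above. */
    print_neighbors(7, 4);
    return 0;
}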
Growing a Hypercube
[Figure: hypercubes of increasing dimension, each built from two copies of the previous one]
Note: For d = 4, it is called a 4-dimensional cube.
Evaluating the Hypercube Network with n = 2^d Nodes
• Diameter: d = log n
• Bisection width: n / 2
• Edges per node: log n
• Constant edge length? No.
– The length of the longest edge increases as n increases.
MIMD Message-Passing
Slides are still from Parallel
Architecture Unit at “Parallel &
Distributed Computing” website
Some Interconnection Network Terminology (1/2)
References: Wilkinson et al. & Grama et al.; also, earlier slides on architecture & networks.
A link is the connection between two nodes.
• A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed.
• The link between two nodes can be either bidirectional or use two directional links.
• Can assume either one wire that carries one bit or parallel wires (one wire for each bit in word).
• The above choices do not have a major impact on the concepts presented in this course.
Network Terminology (2/2)
• The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).
• The network latency is the time required to transfer a message through the network.
– The communication latency is the total time required to send a message, including software overhead and interface delay.
– The message latency or startup time is the time required to send a zero-length message.
• Includes software & hardware overhead, such as
– Choosing a route
– packing and unpacking the message
Store-and-forward Packet
Switching
• Message is divided into “packets” of information
• Each packet includes source and destination addresses.
• Packets cannot exceed a fixed, maximum size (e.g., 1000 bytes).
• A packet is stored in a node in a buffer until it can move to the next node.
• Different packets typically follow different routes but are re-assembled at the destination, as the packets arrive.
• Movements of packets is asynchronous.
Packet Switching (cont)
• At each node, the destination information is examined and used to select which node to forward the packet to.
• Routing algorithms (often probabilistic) are used
to avoid hot spots and to minimize traffic jams.
• Significant latency is created by storing each
packet in each node it reaches.
• Latency increases linearly with the length of the
route.
Slides from Performance
Analysis
References on Performance Evaluation
• Slides are from www.cs.kent.edu/~jbaker/PDC-F08/ on the topic of Performance Evaluation.
• Selim Akl, “Parallel Computation: Models and Methods”, Prentice Hall, 1997. Updated online version available through website.
• Michael Quinn, Parallel Programming in C with MPI and OpenMP, Ch. 7, McGraw Hill, 2004.
Outline
• Speedup
• Superlinearity Issues
• Speedup Analysis
• Cost
• Efficiency
• Amdahl’s Law
• Gustafson’s Law
Speedup
• Speedup measures the increase in speed (i.e., the decrease in running time) gained through parallelism. The number of PEs is given by n.
• S(n) = ts/tp, where
– ts is the running time on a single processor, using the fastest known sequential algorithm
– tp is the running time using a parallel computer with n PEs.
• In simplest terms,
Speedup = Sequential running time / Parallel running time
Linear Speedup Usually Optimal
• Speedup is linear if S(n) = Θ(n)
• Claim: The maximum possible speedup for parallel computers with n PEs is n.
• Usual pseudo-proof (assume ideal conditions):
– Assume a computation is partitioned perfectly into n processes of equal duration.
– Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.).
– Under these ideal conditions, the parallel computation executes n times faster than the sequential computation, and the parallel running time will be ts/n.
– Then the parallel speedup in this “ideal situation” is
S(n) = ts/(ts/n) = n
Linear Speedup Usually Optimal (cont)
• This argument shows that for traditional problems, linear speedup is optimal.
• The argument is valid for traditional problems, but is invalid for some types of nontraditional problems.
Speedup Usually Smaller Than Linear
• Unfortunately, the best speedup possible for most applications is considerably smaller than n
– The “ideal conditions” performance mentioned in earlier argument is usually unattainable.
– Normally, some parts of programs are sequential and allow only one PE to be active.
– Sometimes a significant number of processors are idle for certain portions of the program. For example
• Some PEs may be waiting to receive or to send data during parts of the program.
• Congestion may occur during message passing
Superlinear Speedup
• Superlinear speedup occurs when S(n) > n
– Occasionally speedup that appears to be superlinear may occur, but it can be explained by other reasons, such as
• the extra memory in the parallel system.
• a sub-optimal sequential algorithm being compared to the parallel algorithm.
• “luck”, in the case of an algorithm that has a random aspect in its design (e.g., random selection)
Superlinearity (cont)
• Selim Akl has given a multitude of examples establishing that superlinear algorithms are required for many non-standard problems.
• Examples include “nonstandard” problems involving
– real-time requirements, where meeting deadlines is part of the problem requirements.
– problems where all the data is not initially available, but has to be processed after it arrives.
– problems that are natural to solve using parallelism, and for which sequential solutions are inefficient.
Execution time for parallel portion
[Figure: the computation component of a nontrivial parallel algorithm, a decreasing function of the number of processors used (time vs. processors)]
Time for MIMD communication
[Figure: the communication component of a nontrivial parallel algorithm, an increasing function of the number of processors (time vs. processors)]
Combining Parallel & MIMD Communication Times
[Figure: combining these, for a fixed problem size there is an optimum number of processors that minimizes overall execution time (time vs. processors)]
MIMD Speedup Plot
[Figure: speedup vs. processors; speedup reaches a maximum and then drops as processors increase]
Cost
• The cost of a parallel algorithm (or program) is
Cost = Parallel running time × #processors
• The cost of a parallel algorithm should be compared to the running time of a sequential algorithm.
– Cost removes the advantage of parallelism by charging for each additional processor.
– A parallel algorithm whose cost is big-O of the running time of an optimal sequential algorithm is called cost-optimal.
Efficiency
Efficiency = Speedup / Processors used
Efficiency = Sequential execution time / (Processors used × Parallel execution time)
For traditional problems, 0 ≤ Efficiency ≤ 1
Equivalently, Efficiency = Sequential running time / Cost
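
A worked example with invented numbers: suppose the fastest sequential program takes 100 seconds and a parallel version on 8 processors takes 25 seconds. Then

Speedup = 100 / 25 = 4
Cost = 25 × 8 = 200
Efficiency = Speedup / Processors = 4 / 8 = 0.5
(equivalently, Sequential running time / Cost = 100 / 200 = 0.5)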
Amdahl’s Law
• Having a detailed understanding of
Amdahl’s law is not essential for this
course.
• However, having a brief, non-technical
introduction to this important law could be
useful.
Amdahl's Law
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup achievable by a parallel computer with n processors is

S(n) ≤ 1 / (f + (1 − f)/n)

• The word “law” is often used by computer scientists when it is an observed phenomenon (e.g., Moore’s Law) and not a theorem that has been proven in a strict sense.
• It is easy to prove Amdahl’s law for “traditional problems”, but it is not valid for “non-traditional problems”.
Example 1
• Assume 95% of a program’s execution time occurs inside a loop that can be executed in parallel.
• Amdahl’s law shows that the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs is less than 6:

Speedup ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9
Example 2
• Assume 5% of a parallel program’s execution time is spent within inherently sequential code.
• Amdahl’s law shows that the maximum speedup achievable by this program, regardless of how many PEs are used, is

lim (p→∞) 1 / (0.05 + (1 − 0.05)/p) = 1 / 0.05 = 20
Amdahl's Law
• The argument used in the proof of Amdahl’s law assumes that speedup cannot be superlinear, so the proof is invalid for “non-traditional” problems.
• Sometimes Amdahl’s law is just stated as S(n) ≤ 1/f
• Note that S(n) never exceeds 1/f and approaches 1/f as n increases.
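
A small C sketch makes it easy to tabulate how quickly the bound approaches 1/f (the sequential fraction and processor counts below are arbitrary sample values):

#include <stdio.h>

/* Amdahl's law: S(n) <= 1 / (f + (1 - f)/n),
   where f is the fraction of inherently sequential operations. */
double amdahl_bound(double f, int n) {
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void) {
    int procs[] = {1, 8, 64, 1024};
    for (int i = 0; i < 4; i++)
        printf("f = 0.05, n = %4d: S(n) <= %.2f\n",
               procs[i], amdahl_bound(0.05, procs[i]));
    return 0;   /* the bound approaches 1/f = 20 as n grows */
}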
Consequences of Amdahl's Limitations to Parallelism
• For a long time, Amdahl’s law was viewed as a fatal limit to the usefulness of parallelism.
• A key flaw in these early arguments is that they were unaware of the impact of Gustafson’s Law:
• Gustafson’s Law: The proportion of the computations that are sequential normally decreases as the problem size increases.
• Note: Gustafson’s law is an “observed phenomenon” and not a theorem.
• The negative impact of Amdahl’s law disappears as the problem size increases.
Limitations of Amdahl's Law
• It is now generally accepted by parallel computing professionals that Amdahl’s law is not a serious limit to the benefit and future of parallel computing.
• Note that Amdahl’s law shows that efforts to further reduce the fraction of the code that is sequential may pay off in huge performance gains.
Parallel MIMD Algorithm Design
Reference: Chapter 3, Quinn textbook
References
• Slides at www.cs.kent.edu/~jbaker/PDC-F08/ on Parallel Algorithm Design.
• Chapter 3 of Quinn’s textbook.
Task/Channel Model
• This model is intended for MIMDs (i.e., multiprocessors and multicomputers) and not for SIMDs.
• Parallel computation = set of tasks
• A task consists of a
– Program
– Local memory
– Collection of I/O ports
• Tasks interact by sending messages through channels
– A task can send local data values to other tasks via output ports
– A task can receive data values from other tasks via input ports.
• The local memory contains the program’s instructions and its private data
Task/Channel Model
• A channel is a message queue that connects one task’s output port with another task’s input port.
• Data values appear in the input port in the same order in which they are placed in the channel’s output queue.
• A task is blocked if it tries to receive a value at an input port and the value isn’t available.
– The blocked task must wait until the value is received.
• A process sending a message is never blocked, even if previous messages it has sent on the channel have not been received yet.
• Thus, receiving is a synchronous operation and sending is an asynchronous operation.
Task/Channel Model
• Local accesses of private data are assumed to be easily distinguished from nonlocal data access done over channels.
• Local accesses should be considered much faster than nonlocal accesses.
• In this model:– The execution time of a parallel algorithm is the
period of time a task is active.
– The starting time of a parallel algorithm is when all tasks simultaneously begin executing.
– The finishing time of a parallel algorithm is when the last task has stopped executing.
Task/Channel Model
[Figure: a directed graph whose vertices are tasks and whose directed edges are channels]
A parallel computation can be viewed as a directed graph.
Foster’s Design Methodology
• Ian Foster has proposed a 4-step process for designing parallel algorithms for machines that fit the task/channel model.
– Foster’s online textbook is a useful resource here
• It encourages the development of scalable algorithms by delaying machine-dependent considerations until the later steps.
• The 4 design steps are called:
– Partitioning
– Communication
– Agglomeration
– Mapping
Foster's Methodology
[Figure: the four design steps: partitioning, communication, agglomeration, and mapping]
Partitioning
• Partitioning: Dividing the computation and data into pieces
• Domain decomposition – one approach
– Divide the data into pieces
– Determine how to associate computations with the data
– Focuses on the largest and most frequently accessed data structure
• Functional decomposition – another approach
– Divide the computation into pieces
– Determine how to associate data with the computations
– This often yields tasks that can be pipelined.
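
As a small illustration of domain decomposition, Quinn's textbook uses block-decomposition formulas for splitting an n-element array among p tasks. A sketch in C (the function names and sample values are mine):

#include <stdio.h>

/* Block decomposition of an n-element array among p tasks:
   task i gets elements block_low..block_high. The pieces are
   balanced to within one element even when p doesn't divide n. */
long block_low(long i, long p, long n)  { return i * n / p; }
long block_high(long i, long p, long n) { return (i + 1) * n / p - 1; }

int main(void) {
    long n = 10, p = 3;     /* sample values */
    for (long i = 0; i < p; i++)
        printf("task %ld: elements %ld..%ld\n",
               i, block_low(i, p, n), block_high(i, p, n));
    return 0;
}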
Example Domain Decompositions
Think of the primitive tasks as processors. In the 1st, each 2D slice is mapped onto one processor of a system using 3 processors. In the second, a 1D slice is mapped onto a processor. In the last, an element is mapped onto a processor.
The last leaves more primitive tasks and is usually preferred.
Example Functional Decomposition
Partitioning Checklist for Evaluating
the Quality of a Partition
• At least 10x more primitive tasks than processors in target computer
• Minimize redundant computations and redundant data storage
• Primitive tasks are roughly the same size
• Number of tasks an increasing function of problem size
• Remember – we are talking about MIMDs here which typically have a lot less processors than SIMDs.
Communication
• Determine the values passed among tasks
• There are two kinds of communication:
• Local communication
– A task needs values from a small number of other tasks
– Create channels illustrating the data flow
• Global communication
– A significant number of tasks contribute data to perform a computation
– Don’t create channels for them early in design
Communication (cont.)
• Communication is part of the parallel computation overhead, since it is something sequential algorithms do not have to do.
– Costs are larger if some (MIMD) processors have to be synchronized.
• SIMD algorithms have much smaller communication overhead because
– Much of the SIMD data movement is between the control unit and the PEs on broadcast/reduction circuits
• This is especially true for associative computers
– Parallel data movement along the interconnection network involves lockstep (i.e., synchronous) moves.
Communication Checklist for Judging
the Quality of Communications
• Communication operations should be balanced among tasks
• Each task communicates with only a small group of neighbors
• Tasks can perform communications concurrently
• Tasks can perform computations concurrently
What We Have Hopefully at This Point – and What We Don't Have
• The first two steps look for parallelism in the problem.
• However, the design obtained at this point probably doesn’t map well onto a real machine.
• If the number of tasks greatly exceeds the number of processors, the overhead will be strongly affected by how the tasks are assigned to the processors.
• Now we have to decide what type of computer we are targeting
– Is it a centralized multiprocessor or a multicomputer?
– What communication paths are supported?
– How must we combine tasks in order to map them effectively onto processors?
Agglomeration
• Agglomeration: Grouping tasks into larger tasks
• Goals
– Improve performance
– Maintain scalability of the program
– Simplify programming – i.e., reduce software engineering costs.
• In MPI programming, a goal is
– to lower communication overhead.
– often to create one agglomerated task per processor
• By agglomerating primitive tasks that communicate with each other, communication is eliminated, as the needed data is local to a processor.
Agglomeration Can Improve
Performance
• It can eliminate communication between
primitive tasks agglomerated into consolidated
task
• It can combine groups of sending and receiving
tasks
Scalability
• Assume we are manipulating a 3D matrix of size 8 × 128 × 256, and
– our target machine is a centralized multiprocessor with 4 CPUs.
• Suppose we agglomerate the 2nd and 3rd dimensions. Can we run on our target machine?
– Yes, because we can have tasks that are each responsible for a 2 × 128 × 256 submatrix.
– Suppose we change to a target machine that is a centralized multiprocessor with 8 CPUs. Could our previous design basically work?
– Yes, because each task could handle a 1 × 128 × 256 submatrix.
Scalability (cont)
– However, what if we go to more than 8 CPUs? Would our design change if we had agglomerated the 2nd and 3rd dimensions of the 8 × 128 × 256 matrix?
– Yes.
• This says that the decision to agglomerate the 2nd and 3rd dimensions has, in the long run, the drawback that the code’s portability to more CPUs is impaired.
Reducing Software Engineering Costs
• Software engineering – the study of techniques to bring very large projects in on time and on budget.
• One purpose of agglomeration is to look for places where existing sequential code for a task might exist.
• Use of that code helps bring down the cost of developing a parallel algorithm from scratch.
Agglomeration Checklist for Checking the
Quality of the Agglomeration
• Locality of parallel algorithm has increased
• Replicated computations take less time than communications they replace
• Data replication doesn’t affect scalability
• All agglomerated tasks have similar computational and communications costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
• Tradeoff between agglomeration and code modifications costs is reasonable
Mapping
• Mapping: The process of assigning tasks to processors
• Centralized multiprocessor: Mapping done by operating system
• Distributed memory system: Mapping done by user
• Conflicting goals of mapping
– Maximize processor utilization – i.e. the average percentage of time the system’s processors are actively executing tasks necessary for solving the problem.
– Minimize interprocessor communication
Mapping Example
(a) is a task/channel graph showing the needed
communications over channels.
(b) shows a possible mapping of the tasks to 3 processors.
Mapping Example
If all tasks require the same amount of time and each CPU has the same capability, this mapping would mean the middle processor will take twice as long as the other two.
Optimal Mapping
• Optimality is with respect to processor utilization and interprocessor communication.
• Finding an optimal mapping is NP-hard.
• Must rely on heuristics applied either manually or by the operating system.
• It is the interaction of the processor utilization and communication that is important.
• For example, with p processors and n tasks, putting all tasks on 1 processor makes interprocessor communication zero, but utilization is 1/p.
A Mapping Decision Tree (Quinn’s Suggestions – Details on pg 72)
• Static number of tasks
– Structured communication
• Constant computation time per task
– Agglomerate tasks to minimize communications
– Create one task per processor
• Variable computation time per task
– Cyclically map tasks to processors
– Unstructured communication
• Use a static load balancing algorithm
• Dynamic number of tasks
– Frequent communication between tasks
• Use a dynamic load balancing algorithm
– Many short-lived tasks. No internal communication
• Use a run-time task-scheduling algorithm
Mapping Checklist to Judge the Quality of a
Mapping
• Consider designs based on one task per
processor and multiple tasks per processor.
• Evaluate static and dynamic task allocation
• If dynamic task allocation chosen, the task
allocator (i.e., manager) is not a bottleneck to
performance
• If static task allocation chosen, ratio of tasks to
processors is at least 10:1
Boundary Value Problem
Example to illustrate use of
Foster’s design method
Boundary Value Problem
[Figure: a rod surrounded by insulation, with each end in contact with ice water]
Problem:
The ends of a rod of length 1 are in contact with ice water at 0° C. The initial temperature at distance x from the end of the rod is 100 sin(πx). (These are the boundary values.)
The rod is surrounded by heavy insulation, so the temperature changes along the length of the rod are a result of heat transfer at the ends of the rod and heat conduction along the length of the rod.
We want to model the temperature at any point on the rod as a function of time.
• Over time the rod gradually cools.
• A partial differential equation (PDE) models the temperature at any point of the rod at any point in time.
• PDEs can be hard to solve directly, but a method called the finite difference method is one way to approximate a good solution using a computer.
• The derivative of f at a point x is defined by the limit:
f'(x) = lim (h→0) [f(x + h) – f(x)] / h
• If h is a fixed non-zero value (i.e., we don’t take the limit), then the above expression is called a finite difference.
Finite differences approach differential quotients as h goes to zero; thus, we can use finite differences to approximate derivatives.
This is often used in numerical analysis, especially in numerical ordinary differential equations and numerical partial differential equations, which aim at the numerical solution of ordinary and partial differential equations respectively.
The resulting methods are called finite-difference methods.
An Example of Using a Finite Difference Method for an ODE (Ordinary Differential Equation)
Given f'(x) = 3f(x) + 2, the fact that
[f(x + h) – f(x)] / h approximates f'(x)
can be used to iteratively calculate an approximation to f(x).
In our case, a finite difference method finds the temperature at a fixed number of points in the rod at various time intervals.
The smaller the steps in space and time, the better the approximation.
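
A sketch of that iteration in C: rearranging the finite difference gives f(x + h) ≈ f(x) + h·f'(x) = f(x) + h·(3f(x) + 2). The step size and the initial value f(0) = 1 are invented for the illustration:

#include <stdio.h>

int main(void) {
    double h = 0.01;    /* smaller steps give a better approximation */
    double f = 1.0;     /* assumed initial condition f(0) = 1 */

    /* March from x = 0 to x = 1, advancing f by h * f'(x) each step,
       with f'(x) = 3 f(x) + 2 taken from the ODE. */
    for (double x = 0.0; x < 1.0; x += h)
        f += h * (3.0 * f + 2.0);

    printf("approximate f(1) = %f\n", f);
    return 0;
}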
Rod Cools as Time Progresses
A finite difference method computes these
temperature approximations (vertical axis) at various
points along the rod (horizontal axis) for different
times between 0 and 3.
The Finite Difference Approximation Requires the Following Data Structure
A matrix is used where columns represent positions and rows represent times. The element u(i,j) contains the temperature at position i on the rod at time j.
At each end of the rod the temperature is always 0. At time 0, the temperature at point x is 100 sin(πx).
Finite Difference Method Actually Used
• We have seen that for small h, we may approximate f'(x) by
f'(x) ≈ [f(x + h) – f(x)] / h
• It can be shown that in this case, for small h,
f''(x) ≈ [f(x + h) – 2f(x) + f(x − h)] / h²
• Let u(i,j) represent the matrix element containing the temperature at position i on the rod at time j.
• Using the above approximations, it is possible to determine a positive value r so that
u(i,j+1) ≈ r·u(i−1,j) + (1 – 2r)·u(i,j) + r·u(i+1,j)
• In the finite difference method, the algorithm computes the temperatures for the next time period using the above approximation.
Partitioning Step
• This one is fairly easy to identify initially.
• There is one data item (i.e., a temperature) per grid point in the matrix.
• Let’s associate one primitive task with each grid point.
• A primitive task would be the calculation of u(i,j+1) as shown on the last slide.
• This gives us a two-dimensional domain decomposition.
Communication Step
• Next, we identify the communication pattern between primitive tasks.
• Each interior primitive task needs three incoming and three outgoing channels, because to calculate
u(i,j+1) = r·u(i−1,j) + (1 – 2r)·u(i,j) + r·u(i+1,j)
the task needs u(i−1,j), u(i,j), and u(i+1,j)
– i.e., 3 incoming channels,
and u(i,j+1) will be needed by 3 other tasks
– i.e., 3 outgoing channels.
• Tasks on the sides don’t need as many channels, but we really need to worry about the interior nodes.
Agglomeration Step
We now have the task/channel graph below:
[Figure: the 2-D grid of primitive tasks, one per (position, time) pair]
It should be clear this is not a good situation, even if we had enough processors: the top row depends on values from the bottom rows.
Be careful when designing a parallel algorithm that you don’t think you have parallelism when the tasks are sequential.
Collapse the Columns in the 1st Agglomeration Step
Before: the task/channel graph represents each task as computing one temperature for a given position and time.
After: the task/channel graph represents each task as computing the temperature at a particular position for all time steps.
Mapping Step
This graph shows only a few intervals; we are using one processor per task.
For the sake of a good approximation, we may want many more intervals than we have processors.
We go back to the decision tree on page 72 to see if we can do better when we want more intervals than we have available processors.
Note: On a large SIMD with an interconnection network (which the ASC emulator doesn’t have), we might stop here, as we could possibly have enough processors.
Use Decision Tree (See the earlier slide on the decision tree, or page 72 of Quinn)
• The number of tasks is static once we decide on how many intervals we want to use.
• The communication pattern among the tasks is regular – i.e., structured.
• Each task performs the same computations.
• Therefore, the decision tree says to create one task per processor by agglomerating primitive tasks so that computation workloads are balanced and communication is minimized.
• So, we will associate a contiguous piece of the rod with each task by dividing the rod into n pieces of size h, where n is the number of processors we have.
Comment: We can decide how to design the algorithm without use of the decision tree as well.
Pictorially
Our previous task/channel graph assumed 10 consolidated tasks, one per interval.
If we now assume 3 processors, each task is responsible for a contiguous block of intervals.
Note this maintains the possibility of using some kind of nearest-neighbor interconnection network and eliminates unnecessary communication.
What interconnection networks would work well?
Agglomeration and Mapping
[Figure: primitive tasks grouped by agglomeration and then mapped onto processors]
End of Unit
• This unit covered an overview of general
topics on parallel computing.
• Slides were taken from website for my
“Parallel and Distributed Computing”
course.
• This website is at
www.cs.kent.edu/~jbaker/PDC-F08/ and
can be used for reference, if desired.