2/11/2003 platforms 1
Parallel Programming Platforms
CS 442/EECE 432, Brian T. Smith, UNM, CS Dept.
Spring, 2003
The Standard Serial Model
• Von Neumann model
• called by Flynn's taxonomy a single instruction, single data stream (SISD) machine
– the programming model is one sequential list of instructions of the form:

loop until finished
    fetch instruction
    decode instruction
    fetch operands
    execute instruction
end loop
Typical Von Neumann Machines And Their Evolution
(Figure: four machine diagrams)
• A simple (early) sequential computer: processor connected to memory
• A sequential computer with memory interleaving: processor connected to memory banks 1, 2, 3
• A sequential computer with cache and memory interleaving: processor, cache, and memory banks 1, 2, 3
• Replacing the CPU with a pipelined processor with d stages s_{d-1}, s_{d-2}, ..., s_0
Avoidance Of Bottlenecks
A Bottleneck
• Many accesses to memory
• Memory latency too long
• CPU too slow
The Solution
• Multiple memory banks
• Caches with blocks of size 1
• Caches with larger blocks
• Pipelined execution units
• Multiple execution units
• Fused add/multiply
• Multiple processors sharing one memory
The Result
• All of these solutions represent implicit parallelism; that is,
• Parallelism implemented in/by the hardware
• The programmer has no way to specify/control its use -- it is available full-time
• The program, by the way it is written or the particular algorithm used, either benefits from the parallelism or inhibits the parallel execution
• For the special case of data parallel languages such as C*, Fortran D, and HPF (High Performance Fortran, an extension of Fortran 90), the program constructs encourage the use of the hardware parallelism
• Still requires good optimizing/cognizant compilers
• Example (where A, B, and C are arrays), constructs such as:
A = B + C
S = sum(B*C)
are explicit parallel constructs (the order of evaluation is unspecified)
• OpenMP, an extension to Fortran and C, also provides specific language support for parallelism, but the way parallelism is specified is explicit in the program
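The semantics of such whole-array constructs can be sketched in Python (a hypothetical illustration of the idea, not the Fortran/C* constructs themselves): every element of A = B + C, and every product inside S = sum(B*C), is independent, so a runtime is free to evaluate them in any order or in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

A = [1.0, 2.0, 3.0, 4.0]
B = [10.0, 20.0, 30.0, 40.0]
C = [100.0, 200.0, 300.0, 400.0]

with ThreadPoolExecutor() as pool:
    # A = B + C: every element is independent, so the pool may compute
    # them in any order -- the result is the same either way
    A = list(pool.map(lambda bc: bc[0] + bc[1], zip(B, C)))

    # S = sum(B*C): the products are independent; only the final
    # reduction imposes an order here
    S = sum(pool.map(lambda bc: bc[0] * bc[1], zip(B, C)))
```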
Avoidance Continued
• Processing still too slow (costly)
• Multiple processors, each with their own memory (COTS)
• Parallel distributed machines with an interconnection fabric -- sometimes called multi-computers
– Programmed typically via a message passing software layer (usually a library of procedures)
• Parallelism is specified explicitly in the program
• Standard examples are MPI, PVM, and many vendor-specific support libraries such as MPL (IBM) and NX (Intel)
Parallel Architectures -- Control Mechanism
• Replace the CPU by many processors
– Centralized control or synchronization mechanism -- the concept sometimes referred to as a vector/array machine
• Issues the same instruction to all processors
– Each processor has different data, so that different computations are performed
– Single instruction, multiple data machine (SIMD)
– Uses the data parallelism idea (C*, Fortran D, HPF)
– No centralized control mechanism -- the concept sometimes referred to as a multi-computer
– Multiple CPUs and memories
» Completely separate instruction streams and no synchronization
» Multiple instruction, multiple data (MIMD)
– SPMD -- a programming paradigm to simplify use of an MIMD architecture
» Single program, multiple data (SPMD) with NO synchrony
SIMD vs MIMD Diagrams
Note:
• SIMD: one global control unit
• MIMD: independent control units, one for each processor element (PE)
SIMD Program With Synchrony
Suppose divide takes longer than assignment. Then, with every processor executing

if (B == 0)
    C = A
else
    C = A/B
D = ...

(Figure: the contents of A, B, C on each processor at three instants)
• Initially -- Proc 0: A=5, B=0, C=0; Proc 1: A=4, B=2, C=0; Proc 2: A=1, B=1, C=0; Proc 3: A=0, B=0, C=0
• After C = A (issued to all, stored only where B == 0) -- Proc 0: C=5; Proc 3: C=0; Procs 1 and 2 are masked
• After C = A/B (the mask is flipped) -- Proc 1: C=2; Proc 2: C=1; Procs 0 and 3 are masked
(Figure: time line, top to bottom, for Procs #0-#3 -- every processor evaluates the test in lockstep; X marks a processor that is idle (masked store or execution) during the branch it does not take; all processors reach D = ... together, only after the slow divide completes)
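The masked lockstep execution above can be sketched as a simulation (a hypothetical illustration; real SIMD masking happens in hardware):

```python
def simd_branch(A, B):
    """One control unit issues every instruction to all PEs; a mask
    decides which PEs actually store the result (the rest sit idle)."""
    n = len(A)
    C = [0] * n
    mask = [b == 0 for b in B]       # the test: if (B == 0)
    for i in range(n):               # "C = A" is issued to ALL PEs...
        if mask[i]:                  # ...but only unmasked PEs store
            C[i] = A[i]
    mask = [not m for m in mask]     # flip the mask for the else branch
    for i in range(n):               # "C = A/B" is issued to ALL PEs
        if mask[i]:
            C[i] = A[i] / B[i]
    return C                         # only now can "D = ..." begin

# The slide's values: Procs 0..3 hold A = 5,4,1,0 and B = 0,2,1,0
C = simd_branch([5, 4, 1, 0], [0, 2, 1, 0])
```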
SPMD -- No Synchrony
Suppose divide takes longer than assignment. Then, with every processor executing

if (B == 0)
    C = A
else
    C = A/B
D = ...

(Figure: the contents of A, B, C on each processor at three instants)
• Initially -- Proc 0: A=5, B=0, C=0; Proc 1: A=4, B=2, C=0; Proc 2: A=1, B=1, C=0; Proc 3: A=0, B=0, C=0
• After C = A and C = A/B -- Proc 0: C=5; Proc 1: C=2; Proc 2: C=1; Proc 3: C=0
• After D = ... -- every processor has proceeded to D = ... on its own, as soon as its branch finished
(Figure: time line for Procs #0-#3 -- each processor evaluates the test and executes only its own branch; the C = A processors reach D = ... early)
• No masking of processors
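By contrast, the SPMD behavior above can be sketched with threads (a hypothetical illustration; D = C + 1 stands in for the slide's unspecified D = ...): each "processor" runs the same code on its own data and proceeds to D immediately, with no masking and no lockstep.

```python
import threading

def spmd_body(rank, A, B, C, D, done):
    # Every thread runs this SAME program on its own slice of the data
    if B[rank] == 0:
        C[rank] = A[rank]            # cheap branch
    else:
        C[rank] = A[rank] / B[rank]  # slow branch
    D[rank] = C[rank] + 1            # no barrier: "D = ..." starts at once
    done.append(rank)                # completion order is unpredictable

A, B = [5, 4, 1, 0], [0, 2, 1, 0]
C, D, done = [0] * 4, [0] * 4, []
threads = [threading.Thread(target=spmd_body, args=(r, A, B, C, D, done))
           for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```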
Parallel Architectures -- Address-Space Organization
• Message-Passing Architecture
• Each processor has its own private memory, with its own addresses for memory locations
• The programming paradigm is to pass messages from task to task to coordinate computational tasks
• Shared-Address-Space Architecture
• All processors share the same address space
– Location n is the same place for all processors
– Access memory via a switch, bus, or memory router (depending on the manufacturer)
• Software uses a shared memory programming paradigm
– Sometimes specialized languages such as C* and HPF are used
» In such cases, the compiler generates code to synchronize access to the shared address space
Typical Message Passing Machine
Note:
• Memory is local, associated with each processor
• For a processor to read or write another processor's memory, a message consisting of the data is sent and received
Typical Shared Address Space Machines
Emulating Message Passing Using A Shared-Address Machine
• Messages are passed notionally as follows:
• Write the message to a shared space
• Tell the receiver where the message is
• The receiver reads the message and places it in its own space
• Emulating shared memory on a distributed memory machine, by contrast, needs hardware support to be effective (e.g., SGI Origin)
• Thus, message passing is viewed as the more general programming paradigm
• Learn and implement messaging on either local memory or shared memory
• Implement your program in the message passing paradigm and use/write a message passing library
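The three steps above can be sketched with threads and a queue (a minimal sketch; a real implementation adds synchronization, buffering, and message tags):

```python
import queue
import threading

shared_space = {}          # memory visible to both sender and receiver
mailbox = queue.Queue()    # tells the receiver where the message is

def send(tag, data):
    shared_space[tag] = list(data)   # 1. write the message to shared space
    mailbox.put(tag)                 # 2. tell the receiver where it is

def receive(private):
    tag = mailbox.get()                    # learn the message's location
    private.extend(shared_space.pop(tag))  # 3. copy it into private space

private = []               # the receiver's "own space"
t = threading.Thread(target=receive, args=(private,))
t.start()
send("msg0", [3, 1, 4])
t.join()
```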
The Other Hardware Component -- The Interconnection Network
• How the components (processors/memories) are connected to each other
• Static
» Fixed, never changes with time
» Wires/fiber connecting the processor/memory pairs
» Limitation -- physical space for the connections
• Dynamic
» Can be changed with time and execution
» Wires/fiber connected to a switch
» The connections in the switch are changed by the program to suit the algorithm being used
» Limitation -- congestion in the paths of the switch, depending on how carefully the user program is designed
Flynn's Taxonomy
• General computational activities can be classified by the way the instruction streams and data streams are handled
– Aside: some modern processor chips support both streams with caches and different data paths -- both kept in RAM
– Single-Instruction, Single-Data (SISD)
– Single-Instruction, Multiple-Data (SIMD)
– Multiple-Instruction, Multiple-Data (MIMD)
– Single-Program, Multiple-Data (SPMD)
SPMD
• Each processor has its own instruction and data stream (as in a MIMD machine)
• BUT each processor loads and runs the same executable, though it does not necessarily execute the same instructions at the same time
• Typical program (PC -- program counter):

if( my_id == 0 ) then   ! I am the master
    ! Set up for myself and workers
    ...
    ! Perform some work myself
    ...
else                    ! I am a worker
    ! Start doing useful work from master
    ...
endif

(Figure: the PCs for Proc #0, Proc #1, and Proc #2 point at different lines of this same program)
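A minimal Python sketch of this branch-on-id structure (the role names and task labels are hypothetical; a real SPMD program would get my_id from the runtime, e.g. an MPI rank):

```python
def spmd_main(my_id, num_procs):
    """Every process runs this same entry point; behavior branches on id."""
    if my_id == 0:                    # I am the master
        # Set up work for myself and the workers
        plan = {p: f"task-{p}" for p in range(num_procs)}
        return ("master", plan[0])    # perform some work myself
    else:                             # I am a worker
        # Start doing useful work handed out by the master
        return ("worker", f"task-{my_id}")

# Simulate four processes all entering the same main
roles = [spmd_main(p, 4) for p in range(4)]
```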
Summary -- Parallel Issues
• Control: SIMD vs. MIMD
• Coordination: synchronous vs. asynchronous
• Memory Organization: private vs. shared
• Address Space: local vs. global
• Memory Access: uniform vs. non-uniform
• Granularity: power of each processor
• Scalability: efficiency with number of processors
• Interconnection Network: topology, routed, switched
Categories Of Parallel Processors
• Vector or array processor
• SMP: symmetric multiprocessor
• MPP: massively parallel processor
• DSM: distributed shared memory
• Networked cluster of workstations or PCs
• Hybrid combinations
• SMP or MPP with vector processors
• Networked clusters of SMPs
An Idealized Parallel Computer
• Called the parallel random access machine (PRAM)
• p processors and a global memory
• Memory access time is uniform for all processors to any memory location (all processors share the same address space)
• The processors are synchronized by a single global clock
– Thus, a synchronous shared memory MIMD machine
• But the PRAM model divides into different classes depending on how multiple processors read and/or write the same memory location
PRAM Subclasses
• Exclusive-read, exclusive-write (EREW)
• No concurrent reads or writes
• This is the subclass with the least concurrency
• Concurrent-read, exclusive-write (CREW)
• Multiple processor reads of the same address allowed
• No concurrent writes -- multiple processor writes must be serialized -- one at a time
• Exclusive-read, concurrent-write (ERCW)
• No concurrent reads allowed -- reads must be serialized
• Concurrent writes allowed
• Concurrent-read, concurrent-write (CRCW)
• Concurrent reads allowed
• Concurrent writes allowed
Concurrency Issue
• Concurrent reads are not a problem
• But concurrent writes must be arbitrated
• Four useful arbitration protocols to be analyzed are:
– Common -- write only if the current writes are all for the same value -- otherwise the writes fail
– Arbitrary -- arbitrarily let one processor write; the other processors' writes fail
– Priority -- the processors are ordered in priority, the processor with the highest priority succeeds in writing, and all the others fail
– Sum -- the sum of all the processors' write values is written into the location and no processor fails
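The four arbitration rules can be sketched as follows (a hypothetical model: each write is a (processor_id, value) pair aimed at one location, and a lower id means higher priority):

```python
def crcw_write(old, writes, protocol):
    """Resolve concurrent writes to one location under a CRCW PRAM rule."""
    if not writes:
        return old
    if protocol == "common":
        values = {v for _, v in writes}
        return values.pop() if len(values) == 1 else old  # all must agree
    if protocol == "arbitrary":
        return writes[0][1]       # let any one write succeed
    if protocol == "priority":
        return min(writes)[1]     # lowest processor id wins
    if protocol == "sum":
        return sum(v for _, v in writes)
    raise ValueError(protocol)
```

Here a failed write is modeled as the location keeping its old value.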
Dynamic Interconnection Networks
• Consider an EREW PRAM with p processors and m memory locations
• We must connect every processor to every memory location
• Therefore, we need O(mp) switch elements
– Say, a switch in the form of a crossbar
» nonblocking, but uses mp switch elements
– This is very costly to build -- prohibitively so
– Compromise by placing the memory in banks
– Say b banks, with m/b memory locations per bank
– It is considerably less costly than the above situation -- O(pb) switch elements
– Its disadvantage is that while one processor is reading/writing a location in a bank, all other processors are blocked from that bank
– The Cray Y-MP was built with this kind of switch
Crossbar Switch
• Non-blocking in the sense that a processor's connection to a particular memory bank does NOT block another processor's connection to a different memory bank
• Contention arises when more than one processor requires an item from the same memory bank -- a bottleneck is created
Crossbar Switch Properties
• The number of switch elements is O(pb)
• It makes no sense to have b less than p
– Thus, the cost is Ω(p²)
» This is notation for a lower bound -- that is, there exist constants c and N such that the cost ≥ c·p² for all p ≥ N
• It is not cost effective to have b near m
• A frequently configured system has b as some modest and constant multiple of p, often 1
– For this case, the cost is O(p²)
Bus-Based Properties
• A bus is a connection between all processors and all memory banks
• A request is made to the bus to fetch or send data
– The bus arbitrates all requests and performs the data transfer when it is not busy with other requests
– The bus is limited in the amount of data that can be sent at once
– As the number of processors and memory banks increases, processors increasingly wait for data to be transferred
– This results in what is called saturation of the bus
Use Of Caches
• The saturation issue is alleviated by processor caches as follows
• Assuming the program needs data in address-local chunks, the architecture transmits a block of items at consecutive addresses instead of just one item at a time
– The time to send a block is often nearly the same as the time to get the first item (high latency, large bandwidth)
– By sending chunks at a time, it reduces the number of requests to the bus for data from memory and thus reduces the saturation of the bus
• Drawback: the caches are copies of memory, and the same copied data may be in different caches
– Consider writing into one of them
» All caches must have the same value for the same memory location (be made coherent) to maintain the integrity of the computation
Bus Based Interconnection
• Caches introduce the requirement for cache coherency
The Compromise Connection
• Crossbars scale in performance but become very expensive as they scale up
• Buses are inexpensive but are non-scalable because of saturation
• Is there anything that has moderate cost but is scalable?
• Yes: the multistage switch
– A set of stages whose cost does not grow quickly with increases in the number of processors -- thus, not so costly
» The set of stages grows with the number of processors in a modest way to avoid saturation
– The elements in the stages are simple, so that they are inexpensive yet can grow to avoid severe saturation
Multi-Stage Network
• The stages are switches
• E.g., an omega network is made up from a collection of 2x2 (sub)switches, each in the pass-through or crossover configuration
Cost And Performance Vs Number Of Processors
• Crossbars cost the most and buses the least as the number of processors increases
• BUT crossbars perform the best and buses the worst as the number of processors increases
• Multistage switches are the compromise and are frequently used
The Omega Network
• Assume b = p
• The number of stages is log₂(p)
• The number of switches in a stage is p/2
• Each switch is very simple
• p/2 crossover 2x2 switches -- each switch:
– has 2 inputs and 2 outputs (hence p/2 switches per stage)
– is either in pass-through or crossover configuration
» The configuration is dynamic, determined stage by stage by the corresponding bit of the source and destination addresses of the current message
– Each stage has p inputs and p outputs
• The connections into the 0-th stage and between all stages are arranged in what is called a perfect shuffle -- see next slide
Perfect Shuffle
• Position i is connected to position j, where

j = 2i            for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p    for p/2 ≤ i ≤ p − 1

(equivalently, for p a power of two, j is a one-bit left rotation of the binary representation of i)
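The mapping can be sketched and checked directly (a small helper, not from the slides):

```python
def perfect_shuffle(i, p):
    """Map input position i to output position j for a p-way shuffle:
    j = 2i for i < p/2, and j = 2i + 1 - p otherwise."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

# For p = 8: positions 0..7 map to 0, 2, 4, 6, 1, 3, 5, 7
wiring = [perfect_shuffle(i, 8) for i in range(8)]
```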
Configuration For 2x2 Switch
• Suppose a particular 2x2 switch in the k-th stage encounters a message from the processor numbered a going to the processor numbered b. Then this switch is in:
– pass-through configuration when the k-th bits of a and b are the same
– crossover configuration when the k-th bits of a and b are different
Complete Omega Network For 8x8 Switch
(Figure: processors on the left, memory banks on the right, three switch stages between; each box is a 2x2 crossover switch)
The trace of the path for a message from processor 0 to bank 5 (i.e., (000) → (101)) is:
– Stage 0, switch 0 in crossover configuration
• message goes to stage 1, switch 1
– Stage 1, switch 1 in pass-through configuration
• message goes to stage 2, switch 2
– Stage 2, switch 2 in crossover configuration
• message goes to output 101 (5)
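The stage-by-stage bit comparison can be sketched as a routing trace (a hypothetical helper; switch indices are omitted):

```python
def omega_route(src, dst, p):
    """Return the stage-by-stage switch settings for a message in a
    p-input omega network (log2(p) stages of 2x2 switches)."""
    stages = p.bit_length() - 1            # log2(p) stages
    settings = []
    for k in range(stages):
        shift = stages - 1 - k             # stage k examines bit k,
        a = (src >> shift) & 1             # counted from the most
        b = (dst >> shift) & 1             # significant end
        settings.append("pass-through" if a == b else "crossover")
    return settings

# The slide's trace: processor 0 (000) to bank 5 (101)
trace = omega_route(0, 5, 8)
```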
But An Omega Switch May Block
• Consider 2 messages sent at the same time
• One from processor 2 (010) to bank 7 (111)
• The other from processor 6 (110) to bank 4 (100)
– They both route to switch 2, stage 0
– One requires the switch to be in pass-through configuration
– The other requires the switch to be in crossover configuration
– Both messages end up on the second output port of this switch
– Either the switch blocks, or the link from the second output of switch 2, stage 0 to switch 1, stage 1 overloads (blocks -- two messages simultaneously try to traverse the same wire or link)
Blocking In An Omega Network
(Figure: the blocking scenario above shown on the 8x8 omega network)
Static Interconnection Networks
• No switches present (the examples here are connected processors, not connected processors and banks, but the situation is equivalent)
• The connections are fixed, essentially by wires directly connecting processors
• In effect, the processor is the switch, selecting the wire the message is to traverse
• In contrast, in a switch, the message has a header which is read by the switch as the message is received; the message is routed by analyzing the sender and receiver addresses in the header
– There is potentially less overhead with static interconnection networks. On the other hand, the processor somehow has to know where to route the message.
The Completely Connected Network
• Like a crossbar switch (see the Complete Connection figure below)
• Every disjoint pair of processors can communicate in one hop without creating a blocking situation
• In addition, the completely connected network can have one processor communicate with all others simultaneously (as long as the processor supports this)
• But it has physical scaling problems:
» a chaotic collection of wires for even a modest number of processors (consider, say, 128 processors), where it is physically difficult to find space for the connections
A Star Connection
• One processor can communicate with all others in one hop
• All other pairs of processors require two hops
– The center processor thus becomes a bottleneck if the communication is between the "starlets"
– This network behaves very much like a bus
» congestion is possible and likely
• Many Ethernet clusters in offices without switches behave like this
– With switches, the backplanes have very large bandwidths to ameliorate the congestion (saturation) conditions
A Complete Connection And A Star Connection Static Network
• Complete connection
» everyone-to-everyone connection -- no preferences
• Star connection
» the boss/worker model -- the center is the boss
A Linear And Ring Connected Static Network
• Linear
– Have to pass the message in the correct direction
» Message goes through intermediary processors (multiple hops)
• Ring
– For the shortest path, you have to pass the message in the correct direction
» Message still goes through intermediary processors (multiple hops)
2-D Mesh, 2-D Wrapped Mesh, And 3-D Mesh Connected Static Networks
• (b) is also called a 2-d wraparound mesh or 2-d torus
• If not square, it is called a 2-d rectangular mesh or torus
• These kinds of configurations have been built, including a 3-d torus
Tree Networks With Message Routing
• Linear arrays and rings are special cases of trees
• (a) is a complete binary tree with a static network
» The processors at the interior connections are also routers or switches
• (b) is a dynamic tree network
» The processors are only at the leaves -- the interior connections are only switches
Message Routing In Trees
• Routing in a dynamic tree
• The message goes up the tree until it finds the root of the smallest subtree that contains both the sender and receiver
• Then the message goes down the tree to the receiver
• For a static tree, the same algorithm works, except that the message may not need to go down the tree
• Note that there is congestion in trees
• The upper parts of the tree get more traffic than the lower parts
• To ameliorate the congestion, fatten the links at the upper parts of the tree
» Called a fat tree interconnection
Dynamic Fat Tree Network
• This tree doubles the number of paths at each level as you go higher in the tree
• The CM-5 used a dynamic fat tree with 100s of processors
• I believe it did not double in width at all levels
Hypercube Connections
• A cube of dimension d with 2^d processors
– For d = 1, it is a linear configuration (linear array) with 2 (= 2^1) processors, one at each end
– For d = 2, it is the configuration of a square, with processors on the corners (4 = 2^2 of them)
– For d = 3, it is the configuration of a cube, with processors on the corners (8 = 2^3 of them)
– ... (and so on -- can you guess what d = 4 is?)
• The very special properties of such configurations are:
– For any d-dimensional cube, every processor is connected to d other processors
– The number of steps (connection segments) between any pair of processors is at most d
– In fact, the shortest distance is the Hamming distance between the binary representations of the processor numbers, assuming the numbering is from 0 to 2^d − 1
• Note that a d-dimensional hypercube is made up from two (d−1)-dimensional hypercubes, connected at their corresponding positions
1-D, 2-D, 3-D Hypercube Networks
Distinct Partitions Of A 3-D Hypercube
• There are always d distinct partitions of a d-dimensional hypercube into two subcubes of dimension d−1
More Properties Of Hypercubes
• Two processors are neighbors (connected by a direct link) if the binary representations of their processor numbers differ in exactly one bit:
– Recall the Hamming distance between s and t is the number of 1-bits in s ⊕ t (where ⊕ is exclusive or -- 1 only in positions where the bits differ)
– Thus, s and t are neighbors if and only if s ⊕ t has exactly one 1-bit
• Consider the set of all processors that agree on the same k bits (any subset of the d bits). Then:
• This set of processors forms a (d−k)-dimensional hypercube
– It is a sub-hypercube of the original hypercube
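The neighbor and distance properties can be checked directly (a small sketch with hypothetical helper names):

```python
def hamming(s, t):
    """Shortest path length between nodes s and t in a hypercube."""
    return bin(s ^ t).count("1")

def neighbors(s, d):
    """The d direct neighbors of node s: flip each of its d bits."""
    return [s ^ (1 << k) for k in range(d)]

# In a 3-D hypercube, nodes 0 (000) and 5 (101) differ in two bits,
# so the shortest route between them uses 2 links
dist = hamming(0b000, 0b101)
```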
Sub-cubes
k-ary d-cube Networks
• Instead of 2 processors in each dimension of the cube (as with the hypercube), consider k processors connected in a ring:
• A d-dimensional torus with k processors in each dimension is a k-ary d-cube
• k linearly connected processors form a k-ary 1-cube
Evaluating Static Interconnection Networks
• Cost and performance measures of networks
– Diameter:
• Maximum distance between any two processors
– distance here is the shortest path between the two processors
• Examples:
» for a 2-d hypercube, it is 2
» for a completely connected network, it is 1
» for a star network, it is 2
» for a linear array of size p, it is p−1
» for a ring of size p, it is ⌊p/2⌋
» for a p × q mesh, it is p + q − 2
» for a d-dimensional hypercube, it is d
» for a hypercube of p processors, it is log₂ p
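These diameters can be verified by brute force on small instances -- a sketch using breadth-first search over an adjacency list:

```python
from collections import deque

def diameter(adj):
    """Maximum over all node pairs of the shortest-path length (hops)."""
    best = 0
    for start in adj:
        dist = {start: 0}
        frontier = deque([start])
        while frontier:                       # BFS from each start node
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        best = max(best, max(dist.values()))
    return best

def ring(p):
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}

def hypercube(d):
    return {i: [i ^ (1 << k) for k in range(d)] for i in range(2 ** d)}
```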
Evaluating Static Networks Continued
– Connectivity:
• a measure of the number of paths between any two processors
– the higher the connectivity, the more choice in routing messages and the less chance of contention
• arc connectivity:
– the least number of arcs (connections) that must be removed to break the network into disconnected networks
– Examples:
» for a star, linear array, and tree network, it is 1
» for a completely connected network of p processors, it is p−1
» for a d-dimensional hypercube, it is d
» for a ring, it is 2
» for a wraparound mesh, it is 4
Evaluating Static Networks Continued
– Bisection width, channel width, channel rate, and bisection bandwidth:
• Bisection width is:
– the least number of arcs that must be removed to break the network into two partitions with an equal number of processors in each partition
• Channel width is:
– the maximum number of bits that can be transmitted simultaneously over a link connecting two processors
» it is equal to the number of wires (arcs) connecting the two processors
• Channel rate is:
– the rate at which bits can be transferred over a single wire (bits/sec)
• Channel bandwidth is:
– the peak rate at which a channel can operate; that is, the product channel rate × channel width
• Bisection bandwidth is:
– the peak rate across all the arcs that form the bisection; it is the product bisection width × channel bandwidth
Examples Of Bisection Widths
• Bisection widths:
– of a d-dimensional hypercube with p processors, it is 2^{d−1} or p/2
– of a p×p mesh, it is p
– of a p×p wraparound mesh, it is 2p
– of a tree, it is 1
– of a star, it is 1 (by convention -- you cannot make two equal partitions)
– of a completely connected network of p processors, it is p²/4
Cost Of Static Networks
• There are two common measures:
• The number of arcs (links)
– for a d-dimensional hypercube with p = 2^d processors, it is (p log p)/2 -- the solution to a recurrence (c_d = 2c_{d−1} + 2^{d−1}, c_0 = 0)
– for a linear array and for trees of p processors, it is p−1
• Or the bisection bandwidth
– It is related to measures of the minimal size (area or volume) of the packaging needed to build the network
– For example, if the bisection width is w, for a:
» 2-dimensional package, a lower bound on the area is Θ(w²)
» 3-dimensional package, a lower bound on the volume of the network is Θ(w^{3/2})
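The recurrence for the hypercube's link count can be checked directly (joining two (d−1)-cubes adds 2^{d−1} links):

```python
def links(d):
    """c_d = 2*c_(d-1) + 2^(d-1), c_0 = 0: the d-cube's link count."""
    return 0 if d == 0 else 2 * links(d - 1) + 2 ** (d - 1)

# Closed form: with p = 2^d processors the count is (p log2 p)/2 = d*2^(d-1)
counts = [links(d) for d in range(5)]
```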
Summary: Characteristics Of Static Network Topologies

Network                  | Diameter        | Bisection Width | Arc Connectivity | Cost (No. of Links)
Complete                 | 1               | p²/4            | p−1              | p(p−1)/2
Star                     | 2               | 1               | 1                | p−1
Complete binary tree     | 2 log((p+1)/2)  | 1               | 1                | p−1
Linear                   | p−1             | 1               | 1                | p−1
Ring                     | ⌊p/2⌋           | 2               | 2                | p
2-D no-wrap mesh         | 2(√p − 1)       | √p              | 2                | 2(p − √p)
2-D wrap mesh            | 2⌊√p/2⌋         | 2√p             | 4                | 2p
Hypercube                | log p           | p/2             | log p            | (p log p)/2
Wrap k-ary d-cube        | d⌊k/2⌋          | 2k^{d−1}        | 2d               | dp
Evaluating Dynamic Interconnection Networks
• What is the diameter for dynamic networks?
• Nodes are now defined as both processors and switches
– both have a delay connected with processing a message that passes through them
• Thus, the diameter is defined as the maximum number of nodes between any two processors
• What is connectivity for dynamic networks?
• Similarly, it is defined as the minimum number of connections that must be removed to partition the network into two unreachable parts
• What is bisection width for dynamic networks?
• Defined as the minimum number of edges (connections) that must be removed to partition the network into two halves with an equal number of processors
• What is the cost of a dynamic network?
• It depends on the link cost and the switch cost
• The switch cost, in practice, dominates, so the cost becomes the number of switches
An Example Of Bisection Width
(Figure: 4 processors (P) and 4 switches (S) joined by links; lines A, B, and C are cuts through the network)
• Lines A, B, C are 3 cuts that each cross the same number (4) of connections and partition the processors into two equal-sized sets
– This is the minimum number of connections for any such cut
– Thus, the bisection width is 4
Summary: Characteristics Of Dynamic Network Topologies

Network       | Diameter | Bisection Width | Arc Connectivity | Cost (No. of Links)
Crossbar      | 1        | p               | 1                | p²
Omega network | log p    | p/2             | 2                | p/2
Dynamic tree  | 2 log p  | 1               | 2                | p−1
Cache Coherence In Multiprocessor Systems
• A complicated issue even with uni-processors, particularly when there are multiple levels of cache and caches are refreshed in blocks or lines
• For multiprocessors, the problem is worse
• As well as multiple copies, there are multiple processors trying to perform reads and writes of the same memory locations
– There are two frequently used protocols to ensure the integrity of the computation
• Invalidate protocol:
» Invalidate all locations that are copies of a location when one of the copies is changed. Update the invalid locations only when needed -- this is the most frequently implemented currently
• Update protocol:
» Update all locations that are copies of a location when one of the copies is written
Cache Coherence In Multiprocessor Systems
(Figure: two-processor example, x initially 1 in memory)
• Both protocols start the same way: P0 and P1 each execute load x, so both caches and memory hold x = 1
• Invalidate protocol: P0 executes write #3, x -- P0's cache now holds x = 3, memory still holds x = 1, and P1's copy is marked invalid
• Update protocol: P0 executes write #3, x -- P0's cache, memory, and P1's copy are all updated to x = 3
Property Of The Update Protocol
• Compare with the invalidate protocol:
– Suppose a processor reads a location once (thus placing it in its cache) and never accesses it again
– Another processor reads and updates that location many times (also in its cache)
» Behavior under the invalidate protocol: the second processor causes the first processor's cache location to be marked invalid, and it is not necessarily ever updated in the first processor's cache -- continually updating the first processor's cache, as the update protocol would, would be a waste of effort
The False Sharing Issue
– Caused by cache lines -- the other addresses in the line that are brought into the cache when any single value is loaded into the processor's cache
• Sharing of data among processors now occurs when a second processor accesses a data item in the same line, but not the same value
– Thus a whole block (line) of data is shared and located in two or more caches
» Now consider that each processor repeatedly writes into a different item in the same cache line
» It looks like the items are being shared (thus "false" sharing), but they are not
» Updates to one item in the line require the whole line to be invalidated or updated, as the protocol requires, when in fact there is no sharing of the same item
– It turns out the cost for the update protocol is slightly less in this case
– But the tradeoff between communication overheads (updating) and idling (stalling for invalidates) is better for the invalidate protocol
Maintaining Coherence With The Invalidate Protocol
• For analysis and understanding, consider 3 states for a memory address:
• Shared state
» Two or more processors have loaded the memory location
• Invalid state
» At least one processor has updated the value, and all other processors mark their copies with this state
• Dirty state
» When a processor modifies a value of which there are copies in other processors, its copy is marked with this state
» The processor holding a value in this state is the source processor for any updates to this value, when needed
State Diagram For The 3-State Coherence Invalidate Protocol
(Figure: state diagram over the states Invalid, Shared, and Dirty, with transitions labeled read, write, C_read, C_write, and flush)
Legend of state changes:
-- read, write: processor actions
-- C_read, C_write: coherence actions
-- flush: may occur when a processor replaces a dirty item in a cache-replacement action
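A minimal sketch of these state transitions for a single memory block (a hypothetical class; a real protocol works per cache line and in hardware):

```python
class InvalidateBlock:
    """One memory block tracked across n caches with states I, S, D."""

    def __init__(self, n, value=0):
        self.state = ["I"] * n      # every cache starts with no copy
        self.cache = [None] * n
        self.mem = value

    def read(self, p):
        if self.state[p] == "I":                 # miss: fetch a copy
            owner = [q for q, s in enumerate(self.state) if s == "D"]
            if owner:                            # C_read: the dirty owner
                self.mem = self.cache[owner[0]]  # flushes to memory...
                self.state[owner[0]] = "S"       # ...and becomes shared
            self.cache[p] = self.mem
            self.state[p] = "S"
        return self.cache[p]

    def write(self, p, value):
        for q in range(len(self.state)):         # C_write: invalidate
            if q != p:                           # every other copy
                self.state[q] = "I"
        self.cache[p] = value
        self.state[p] = "D"                      # writer's copy is dirty
```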
Parallel Program Execution On A 3-State Coherence System Using The Invalidate Protocol
(Figure: a step-by-step trace with x = 5 and y = 12 initially in global memory; the columns are the instruction at processor 0, the instruction at processor 1, and the variables with their states (S, D, or I) at processor 0, at processor 1, and in global memory)
• Instructions at processor 0: read x; x = x + 1; read y; x = x + y; x = x + 1
• Instructions at processor 1: read y; y = y + 1; read x; y = x + y; y = y + 1
• In the trace, each read leaves the variable shared (S) in the reader's cache, and each write makes the writer's copy dirty (D) while the other copies become invalid (I)
Implementation Techniques For Cache Coherence
• Three frequently used methods are:
• Snoopy systems
• Directory-based (snoopy) systems
• Distributed directory-based systems
Snoopy Cache Systems
• On broadcast interconnection networks with a bus or a ring:
• Each processor snoops on the bus, looking for transactions that affect its cache
• The cache has tag bits that specify the cache line state
– When a processor with a dirty item sees a read request for that item, it takes control of the request and sends the data out
– When a processor sees a write to an item it holds, it invalidates its copy
Diagram Of A Snoopy Bus
(Figure: each processor has a cache with tags and snoop hardware; all the snoop hardware watches a shared address/data bus, with a dirty line, connecting the processors to memory)
Performance Of Snoopy Caches
• Extensively studied and implemented
• Implementation is simple and straightforward
» Can be easily added to existing bus-based systems
• Good performance properties in the sense that:
» If different caches access different data (the expected case), it performs well
» Once a line is designated dirty, the processor with the dirty tag can continue using the data without penalty
» Also, computations that only read shared data perform well
• Poor performance in the case that the processors are reading and updating the same data value
» This generates many coherence operations across the bus
» Because it is a shared bus (between processors and with data movement), the coherence operations saturate the bus
37
2/11/2003 platforms 73
Directory-Based Systems
• Directory-based systems provide a solution to this problem
  » With snoopy buses, each processor must continually monitor the bus for updates of interest
  » The solution is to direct the updates only to the processors that need them
• This is done with a directory associated with each block of memory, specifying which processors have a copy of each block and whether the block is shared
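As a rough illustration, the directory bookkeeping for one block might look like the following sketch; the class and method names are hypothetical, not taken from any real system.

```python
# Hypothetical sketch of one centralized directory entry: a state tag
# plus one presence bit per processor, as pictured on the next slide.
class DirectoryEntry:
    def __init__(self, num_procs):
        self.state = "U"                      # U = uncached
        self.presence = [False] * num_procs   # one bit per processor

    def record_read(self, proc):
        """A processor reads the block: mark it present and shared."""
        self.presence[proc] = True
        self.state = "S"

    def record_write(self, proc):
        """A processor writes the block: it becomes the sole owner."""
        self.presence = [p == proc for p in range(len(self.presence))]
        self.state = "D"

# Shared state for processors 0 and 3, as in the diagram that follows.
entry = DirectoryEntry(4)
entry.record_read(0)
entry.record_read(3)
```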
2/11/2003 platforms 74
Centralized Directory-Based System
[Diagram: processors, each with a cache, connected through an interconnection network to the directory and memory; each memory block has a state field, presence bits, and data. Example entry: state S with the presence bits for processors 0 and 3 set, i.e., shared by processors 0 and 3.]
2/11/2003 platforms 75
Typical Scenario For A Directory-Based Snoopy System
• Consider the scenario of slide 63
• Both processors access x (with value 1)
  » x is moved to each cache and is marked shared -- directory entry: S, presence 1100, data 1
• Processor 0 executes a store to x (with value 3)
  » x in the directory is marked dirty
  » The presence bits for all other processors are turned off -- directory entry: D, presence 1000, data 1
  » Processor 0 can access the changed x at will (memory is not updated)
• Processor 1 accesses x
  » x carries a dirty tag, and the directory shows that processor 0 has it
  » Processor 0 updates the memory block and sends the updated x to processor 1
  » The presence bits for processors 0 and 1 are set, and x is marked shared again -- directory entry: S, presence 1100, data 3
2/11/2003 platforms 76
Performance Of Directory-Based Systems
• Implementation is more complex
• Good performance properties (as before) in the sense that:
  » If different caches access different data (the usually expected case), it performs well
  » Once an item is designated dirty, the processor holding the dirty tag can continue to use the data without penalty
  » Computations that only read shared data also perform well
• Poor performance when the processors are reading and updating the same data value
• Encounters two kinds of overhead:
  » Propagation of state and generation of state, which produce communication and contention problems, respectively
  » The contention is on the directory -- too many requests to the directory for information
  » The communication is on the bus -- the solution is to increase its bandwidth
• There is an issue with the directory -- it takes space
  » Reduce the size of the directory by increasing the block length of the cache line -- BUT this may increase false-sharing overhead
2/11/2003 platforms 77
Distributed Directory Schemes
• Contention on the directory of states can be ameliorated by distributing the directory
• The distribution is performed consistently with the memory distribution
  » The state and presence bits are kept near/in each piece of the distributed memory
  » Each processor maintains the coherence of its own memory
• Behavior:
  » A first read is a request from the user processor to the owner processor for the block, and state information is set in the owner's directory
  » A write by a user propagates an invalidate to the owner, and the owner forwards that state to all other processors sharing the data
• The effect:
  » The directory is distributed, and contention is only for the owner processor -- not the previous situation, where several processors contend for the information in one directory
2/11/2003 platforms 78
Distributed Directory-Based System
[Diagram: processor/cache pairs, each with its own memory and its own presence-bits/state directory, connected by an interconnection network.]
2/11/2003 platforms 79
Performance Of Distributed Directory Schemes
• Better performance
  » Because this design permits O(p) simultaneous coherence operations
• Scales much better than the simple snoopy or centralized directory-based systems
  » Now the latency and bandwidth of the network become the bottleneck
2/11/2003 platforms 80
Message Passing Costs
• Startup time ts:
  » Time to process the message at the send and receive nodes
  » Includes the time for adding the header, trailer, and error-correction information, executing the routing algorithm, and interfacing between node and router
• Per-hop time th:
  » Time for the header of the message (over one link) to leave one node and arrive at the next node
  » Also called node latency
  » Dominated by the time to determine which output buffer or channel to route the message to
• Per-word transfer time tw:
  » Equal to 1/r, where r is the channel bandwidth measured in words per second
  » Includes network and buffering overheads
2/11/2003 platforms 81
Store-Forward Routing Scheme
• To send a message down a link:
  » The entire message is stored at the receiver before any further processing can begin at the receiver -- or of other messages at the sender
• Total communication time:
  » For a message of size m traversing l links, it is:
      tcomm = ts + (m tw + th) l
  » For typical algorithms, th is small compared with m tw, even for small m, and so th is ignored. That is:
      tcomm = ts + m tw l
2/11/2003 platforms 82
Packet Routing
• Waiting for the entire message is inefficient
• For LANs, WANs, and long-haul networks, the message is broken down into packets
  – Intermediate nodes wait only on small pieces (packets), not the whole message
  – Reduces the overhead for handling an error
    » Only a packet gets resent, not the entire message
  – Allows packets to take different paths
  – Error correction is applied to smaller pieces and so is more effective
  – BUT there are overheads, because:
    » each packet has to carry routing, error-correction, and sequencing information
• The advantages outweigh the overhead incurred
2/11/2003 platforms 83
A Cost Model For Packets
• Each packet has size r + s, where:
  – r is the amount of original message data carried in the packet
  – s is the size of the additional information needed to handle packets
• The time to packetize the message is proportional to the size of the message: m tw1
• Let the number of hops be l
• Let the per-word network transfer time be tw2, with latency th
  – Time to receive the first packet of the message: th l + tw2(r + s)
  – There are m/r – 1 remaining packets to send
  – Thus the total time is (simplified):
      tcomm = ts + th l + tw m,  where tw = tw1 + tw2(1 + s/r)
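The final formula can be checked with a small helper; the parameter names follow the slide, and the numeric values used in testing are made up.

```python
def t_packet(ts, th, tw1, tw2, m, l, s, r):
    """Simplified packet-routing cost from the slide:
    t_comm = t_s + t_h*l + t_w*m, where t_w = t_w1 + t_w2*(1 + s/r)."""
    tw = tw1 + tw2 * (1 + s / r)   # effective per-word time
    return ts + th * l + tw * m
```

Note how the packet overhead s enters only through the ratio s/r: larger data payloads per packet dilute the per-packet bookkeeping cost.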
2/11/2003 platforms 84
Cut-Through Routing
• Packets, but:
  » The processor-interconnect network is limited in generality, size, and amount of noise or interference
  » Thus, the overheads can be reduced because:
    – No routing information is needed
    – No sequencing information is needed, as the packets are transmitted and received in order
    – Errors can be associated with the whole message instead of with packets
    – Errors occur less frequently, so simpler schemes can be used
• The term cut-through routing is applied to this simpler packetizing of the messages
  » The packets are fixed-size and are called flow control digits, or flits
    – Smaller than long-haul network packets, as they carry no headers
2/11/2003 platforms 85
Cut-Through Routing Continued
• Tracer packets establish the route:
  – A tracer packet is sent ahead to initiate the route for the message -- it sets the path for the flits to follow
  – The flits then pass down the path one after the other
  – They are not buffered at each node but are passed on as soon as their arrival is complete, without waiting for the next one -- this reduces the memory and memory-bandwidth needs at the node
2/11/2003 platforms 86
Store-Forward And Cut-Through Routing
[Diagram: a timing comparison of store-forward and cut-through routing for a message traveling through intermediate nodes to processor P4.]
2/11/2003 platforms 87
A Cost Model For C-T Routing
• Assume the number of links is l
• Assume the number of words in the message is m
• Assume the startup time is ts
• There is clearly a time to start up and shut down the flit pipeline, proportional to l, which must be included in the per-hop time below
• There is also the time to send the tracer packet, also proportional to l, which must be included in the per-hop time
• Assume the per-word transmission time is tw
  » Then the cost for m words is m tw
• Assume the per-hop time is l th
  » It clearly depends on the number of links in this way, because each processor is simultaneously handling a different flit
• Then, the communication cost is:
    tcomm = ts + l th + m tw
  – Compare this with the store-forward time tcomm = ts + m tw l
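A quick numeric sketch shows why the m·l product hurts store-forward; all parameter values below are hypothetical.

```python
def t_store_forward(ts, tw, m, l):
    """Store-forward cost (t_h ignored): t_s + m*t_w*l.
    Note the multiplicative m*l factor."""
    return ts + m * tw * l

def t_cut_through(ts, th, tw, m, l):
    """Cut-through cost: t_s + l*t_h + m*t_w.
    The message-size and distance terms are additive, not multiplied."""
    return ts + l * th + m * tw

# Hypothetical parameters: the gap grows with the product m*l.
ts, th, tw = 100.0, 1.0, 0.5
m, l = 1000, 8
sf = t_store_forward(ts, tw, m, l)    # 100 + 1000*0.5*8 = 4100
ct = t_cut_through(ts, th, tw, m, l)  # 100 + 8 + 500    = 608
```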
2/11/2003 platforms 88
Performance Comparison And Issues
• For C-T routing vs S-F routing:
  » S-F has the compounding factor m l
  » C-T routing is linear in m and l
• Size of flits:
  » Too small implies processing lots of flits, so the processing time must be very fast
  » Too large means more memory and memory bandwidth are needed, or latency is increased
  – Thus, there is a tradeoff in the design of the routers
    » Flits are typically 4 bits to 32 bytes
• Congestion and contention:
  » C-T routing can deadlock, particularly when heavily loaded
  – The solution is message buffering and/or careful routing
2/11/2003 platforms 89
Deadlock In Cut-Through Routing
• Destinations of messages 0, 1, 2, and 3 are processors A, B, C, and D, respectively
• The flit from message 0 occupies path CB
• It cannot progress because the flit from message 3 occupies path BA
• And so on -- the messages block one another in a cycle
2/11/2003 platforms 90
A Simplified Cost Model For C-T Routing
• The cost model for C-T routing is:
    tcomm = ts + l th + m tw
• To minimize this cost, we might design our software so that we:
  – Communicate in bulk (reduce the effect of ts)
    » This reduces the number of messages, so that the number of times we pay for ts is reduced -- appropriate because ts is usually large relative to th and tw
  – Minimize the data volume (reduce the term m tw)
    » Reduce the amount of communication
  – Minimize the distance traveled by the messages (reduce l)
    » This reduces the term l th
2/11/2003 platforms 91
A Simplified Cost Model Continued
• On the other hand:
  » Message-passing libraries such as MPI give users very little control over the mapping between the programmer's logical processors and the machine's physical processors
  » Message-passing implementations may use 2-hop schemes, picking the middle node at random to reduce contention
  » Most switches nowadays have essentially equal link times between any two processors
  » Thus, l th is small compared with the other terms
• For these reasons, we use the simplified cost model:
    tcomm = ts + m tw
  – That is, drop the l th term
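The bulk-communication advice from the previous slide follows directly from this model; the sketch below uses made-up parameter values.

```python
def t_simplified(ts, tw, m):
    """Simplified model: t_comm = t_s + m*t_w (the l*t_h term dropped)."""
    return ts + m * tw

# Sending k messages of m words pays the startup cost k times;
# one bulk message of k*m words pays it once, saving (k-1)*t_s.
ts, tw, m, k = 100.0, 0.5, 1000, 10
many_small = k * t_simplified(ts, tw, m)   # k separate startups
one_bulk = t_simplified(ts, tw, k * m)     # a single startup
```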
2/11/2003 platforms 92
Communication Costs For Shared-Address-Space
• The bottom line is that there is no uniform and simple model except for very special cases and architectures -- so we give up on a general treatment
• The issues that cause this are:
  – Memory layout
    » Determined by the system and compiler
  – Size of caches, particularly when caches are small
  – Details of the invalidate and/or update protocols for the caches
  – Spatial locality of data
    » Cache-line sizes vary, and the compiler locates the data
  – Pre-fetching by the compiler
  – False sharing (depends on threading, scheduling, and the compiler)
  – Contention for resources (cache update lines, directories, etc.)
    » Depends on the execution scheduler
2/11/2003 platforms 93
Routing Mechanisms For Interconnection Networks
• Routing mechanism:
  » The process whereby the network determines what route or routes to choose for a message, and the way it routes the messages
  » May use the state of the network (how busy parts of it are) to determine the path
• Minimal routing:
  » A path of least length is used -- usually computed directly from the source and destination addresses
  » Minimal routing can lead to congestion, so non-minimal routing is often used
• Non-minimal routing:
  » A path that is not of least length is chosen
  » To avoid congestion, the route is selected randomly or uses network state information
2/11/2003 platforms 94
Routing Mechanisms Continued
• Routing can be deterministic:
  » Always the same route, given the addresses of the source and destination
• Routing can be adaptive:
  » Tries to avoid congestion and delay, depending on what is currently in use (or may be a 2-hop scheme with the middle node selected at random)
• Dimension-ordered routing:
  » Uses the dimension properties of the interconnection network to determine the path -- routes via minimal paths along dimensions
    » For a mesh, it is called X-Y routing
    » For a hypercube, it is called E-cube routing
2/11/2003 platforms 95
X-Y Routing For A Mesh
• Send the message along the row of the source processor until it reaches the X value of the destination processor
• Then, send the message along the column to the destination processor
  – This is a minimal path
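X-Y routing can be sketched in a few lines; the function name and the (x, y) coordinate convention are assumptions, and no wraparound links are used.

```python
def xy_route(src, dst):
    """X-Y routing on a mesh: move along the row (X direction) first,
    then along the column (Y direction).  Nodes are (x, y) pairs;
    returns the list of nodes visited, source included."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    x, y = sx, sy
    while x != dx:                 # travel along the row first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then along the column
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

The hop count is |dx − sx| + |dy − sy|, so the path is minimal.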
2/11/2003 platforms 96
E-Cube Routing
• Let Ps and Pd be the binary labels of nodes s and d
  – Then Ps ⊕ Pd marks the dimensions in which they differ (the number of 1 bits is the Hamming distance)
• Its 1s indicate which dimensions to send the message along
  – The next dimension is computed from the destination address and the address of the current node where the message is; that is:
    » Compute Ps ⊕ Pd and send the message from Ps along the dimension corresponding to the least significant 1 bit
    » The message is now at processor Pq; now form Pq ⊕ Pd
    » Send the message from Pq along the dimension of the least significant 1 bit of Pq ⊕ Pd, and the message is at a new Pq
    » Repeat until the message arrives at its destination
    » NOTE: the assumption is that the message always carries the address of its destination
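The procedure above maps directly to code; this sketch assumes node labels are plain integers, with XOR playing the role of ⊕.

```python
def ecube_route(src, dst):
    """E-cube routing on a hypercube: at each node, XOR the current
    label with the destination and correct the least significant
    differing bit.  Returns the list of node labels visited."""
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst
        lsb = diff & -diff      # isolate the least significant 1 bit
        cur ^= lsb              # move one hop along that dimension
        path.append(cur)
    return path
```

The number of hops equals the Hamming distance between the two labels, so the route is minimal.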
2/11/2003 platforms 97
E-Cube Routing For 3-D Cube
• Form the exclusive-or of the source and destination labels
  – Send the message from the source processor to the neighbor determined by correcting the least significant differing bit, and repeat at each new node until the destination is reached
2/11/2003 platforms 98
Embedding Other Networks Into A Hypercube Network
• Using all of the processors of a hypercube and an appropriate subset of the connections, can the processors be configured to look like other networks?
• This is an important process, because an algorithm may naturally fit one network configuration while the actual machine uses a different one
• The problem is a general graph problem:
  » Given two graphs G(V,E) and G'(V',E'):
    – map each vertex in V onto one or more vertices in V'
    – map each edge in E onto one or more edges in E'
  » The vertices are processors and the edges are network links
• Yes, as shown in the following cases and slides
2/11/2003 platforms 99
Example Of A Bad Case
• Take a 4×4 mesh and map the nodes at random to another 4×4 mesh
• The original arrangement was designed to avoid congestion of communication
  – Each link is used just once over all pairs of communication
• The particular random arrangement can congest up to 5 messages on one link:
  – k communicates with g, o, and l down the k-h link
  – j communicates with i along the k-h link
  – d communicates with h along the k-h link
  » The text suggests 6 links, but I do not see that case
2/11/2003 platforms 100
Diagram Of Bad Case
• Up to 5 paths are mapped to the same link, assuming the original code performed only nearest-neighbor communication
[Diagram, left: the original 4×4 mesh with nodes a through p, with only nearest-neighbor communication on the dotted lines. Right: the processors mapped at random to the same grid; k communicates with g, o, and l down the k-h link; d communicates with h along the k-h link; j communicates with i along the k-h link.]
2/11/2003 platforms 101
Properties Of Such Mappings
• The maximum number of edges in E mapped onto a single edge in E':
  » This is called the congestion
  » It measures the amount of traffic required on an edge in G'
• The maximum number of edges in E', joined together, corresponding to a single edge in E:
  » This is called the dilation
  » It measures the increased delay in G' caused by traversing multiple links
• The ratio of the number of processors in V' to the number of processors in V:
  » This is called the expansion
• The cases described next all have an expansion of 1
2/11/2003 platforms 102
Embedding A Linear Array Into A Hypercube
• Consider a linear array (or ring) of 2^d processors
  » For convenience, the linear-array processors are labeled from 0 to 2^d – 1
• Processor i in the linear array maps to processor number G(i, d) in the hypercube, where G is defined by:
    G(0, 1) = 0
    G(1, 1) = 1
    G(i, x+1) = G(i, x)                          if i < 2^x
    G(i, x+1) = 2^x + G(2^(x+1) – 1 – i, x)      if i ≥ 2^x
• G is called the binary reflected Gray code (RGC)
• The (d+1)-bit Gray codes are derived from the d-bit Gray codes as follows:
  » For d+1, take two copies of the d-bit codes
  » For one copy of the d-bit codes, prefix a 0 bit
  » For the other copy, reflect the d-bit codes and prefix a 1 bit
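The reflect-and-prefix rule can be written directly; this sketch returns the d-bit codes as bit strings, in Gray-code order.

```python
def rgc(d):
    """Binary reflected Gray code of d bits, built by the reflect-and-
    prefix rule: copy the (d-1)-bit codes with a 0 prefix, then append
    the reflected copy with a 1 prefix."""
    if d == 1:
        return ["0", "1"]
    prev = rgc(d - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]
```

Consecutive codes (including the wraparound from the last code to the first) differ in exactly one bit, which is what makes the ring embedding work.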
2/11/2003 platforms 103
A Ring Of 8 Processors Embedded Into a 3-d Hypercube
[Diagram: the Gray-code table G(i,3) for i = 0 to 7, giving the order in the ring, and the corresponding ring embedded in a 3-d hypercube from node 0 to the last node (7). This mapping has a dilation of 1 and a congestion of 1.]
2/11/2003 platforms 104
A Ring Of 8 Processors Embedded Into a 3-d Hypercube
• Note:
  » Hypercube processors in consecutive rows of the Gray-code table differ by one bit -- thus, they are adjacent in the cube
  – Thus, each edge in the linear array maps to one and only one edge in the hypercube
  – Thus, the dilation (the maximum number of edges in E' an edge in E is mapped to) is 1
  – Also, the congestion (the maximum number of edges in E mapped onto a single edge in E') is 1, by the requirements of G
2/11/2003 platforms 105
Meshes Embedded Into Hypercubes
• We consider the mesh to be a wraparound mesh of size 2^r × 2^s and the hypercube to be of dimension r+s
• We use the properties of the mapping of a ring to a hypercube -- each row and each column is a ring -- as follows:
  » For processor (i,j) in the mesh (the processor at the intersection of row i and column j), the binary hypercube processor number is G(i,r) || G(j,s), where || is the concatenation of two binary strings
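The mapping above can be sketched as follows; rgc rebuilds the Gray code of slide 102, and string concatenation plays the role of ||.

```python
def rgc(d):
    """d-bit binary reflected Gray code, as on slide 102."""
    if d == 1:
        return ["0", "1"]
    prev = rgc(d - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

def mesh_to_hypercube(i, j, r, s):
    """Map node (i, j) of a 2^r x 2^s wraparound mesh to its hypercube
    label: the concatenation G(i, r) || G(j, s)."""
    return rgc(r)[i] + rgc(s)[j]
```

Neighbors in a row differ only in the G(j, s) half, and by one bit, so every mesh link maps to a single hypercube link.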
2/11/2003 platforms 106
Meshes Embedded Into Cubes
[Diagram: examples of the mesh-to-hypercube mapping.]
2/11/2003 platforms 107
Embedding A Square p-Processor Mesh Into A p-Processor Ring
• The number of links in a p-processor square wraparound mesh is 2·√p·√p = 2p
  – √p links in each row, with √p rows
  – Considering the columns, there are just as many links again
  – Thus, there is a total of 2p links
• The number of links in a p-processor ring is p
• Thus, there must be congestion
• The natural mappings from mesh to ring and ring to mesh are shown on the next slide
2/11/2003 platforms 108
Mesh To Linear And Vice Versa
Legend: bold lines are links in the linear array; dotted lines are links in the mesh
• Linear array onto mesh: congestion 1, dilation 1
• Mesh onto linear array (for p = 16): congestion 5 = √p + 1, dilation 7 = 2√p – 1
[Diagram: the congestion shows as 5 crossings of one link; the dilation shows as 7 bold edges.]
2/11/2003 platforms 109
Can We Do Better?
• What is a lower bound for the congestion?
  » Recall that congestion is the maximum number of edges in E mapped onto a single edge in E'
• It turns out to be √p, so it is possible to do better, but not that much better
• A simple argument, but overly optimistic:
  » The mesh has 2p links; the linear array has p links
  » Therefore, a congestion of 2 seems possible
• Proof (using bisection width, not link counts):
  – The bisection width of a 2-D mesh is √p
  – The bisection width of a linear array is 1
  – Thus, the congestion is at best √p
2/11/2003 platforms 110
Hypercubes Embedded Into 2-D Meshes
• Assume p nodes in the hypercube, and assume p is an even power of 2
• Treat the hypercube as √p sub-hypercubes, each with √p nodes
  » Let d = log2 p -- d is even by assumption
  » Consider the sub-hypercubes with the least significant d/2 bits varying and the first d/2 bits fixed
• Map each of these sub-hypercubes to a row of the √p × √p mesh; each row has √p nodes, and there are √p rows
  – Use the inverse of the mapping from a linear array to a hypercube on slide 102
  – Connect the nodes column-wise, so that the nodes with the same d/2 least significant bits are in the same column
    » Notice that this is an arrangement similar to that for the rows
16-Node Hypercube To 16-Node 2-D Mesh
[Diagram: a 16-node hypercube mapped to a 4×4 mesh. Within each row, the nodes follow the Gray-code sequence 00, 01, 11, 10; nodes with the same least significant two bits (e.g., all nodes numbered 10) lie in the same column.]
See the textbook for a 32-node example, Figure 2.33, page 72.
2/11/2003 platforms 112
Congestion Is √p/2 And Is Best
• The argument is as follows:
  – The congestion of this mapping onto the 2-D mesh is √p/2
• Proof:
  – The bisection width of a √p-node hypercube is √p/2
  – The bisection width of a row (a linear array) is 1
  – The congestion of a row is thus √p/2 (the ratio (√p/2)/1)
  – Similarly, for the columns, the congestion is √p/2
  – Because the row and column mappings affect disjoint sets of links, the overall congestion is the same, namely √p/2
• The lower bound for the congestion is √p/2
• Proof:
  – (bisection width of the hypercube)/(bisection width of the mesh) = (p/2)/√p = √p/2
• Thus, because the lower bound equals the achieved congestion, this is the best mapping of a hypercube onto a 2-D mesh
2/11/2003 platforms 113
Processor-Processor Mapping And Fat Interconnection Networks
• The previous examples were mappings of dense networks to sparse networks
  » The congestion was larger than one
  » If both networks have the same bandwidth on each link, this could be disastrous
  » The denser network usually has higher dimensionality, which is costly: complicated layouts, wire crossings, variable wire lengths
  » However, the sparser network is simpler, and it is easier to make its congested links fatter
• Example:
  – The congestion of the mapping from the hypercube to the 2-D mesh with the same number of processors is √p/2
  – Widen the paths of the mesh by a factor of √p/2
  – Then the two networks have the same bisection bandwidth
  – The disadvantage is that the diameter of the mesh is larger than that of the cube
2/11/2003 platforms 114
Cost Performance Tradeoffs -- The Mesh Is Better For The Same Cost
• Consider a fattened mesh (fattened so as to make the network costs the same) and a hypercube with the same number of processors
• Assume the cost of the network is proportional to the number of wires
  » Increase the links of the p-node wraparound mesh by a factor of (log p)/4 wires, which gives the p-node hypercube and the p-node wraparound mesh the same cost
• Let's compare the average communication costs
• Let lav be the average distance between any two nodes
  – For the 2-D mesh, this distance is √p/2 links
  – For the hypercube, it is (log p)/2 links
2/11/2003 platforms 115
Cost Performance Tradeoffs Continued
• The average time for a message of length m (over lav hops) is:
  – For the 2-D mesh with cut-through routing: ts + th lav + tw m
    » Because the channel width has been increased by a factor of (log p)/4, the term tw m is decreased by the same factor, and lav is replaced by √p/2:
      tav_comm_2D = ts + th √p/2 + 4 tw m/(log p)
  – For the hypercube (with lav = (log p)/2):
      tav_comm_cube = ts + th (log p)/2 + tw m
• Comparison, for large m:
  – The average communication time is smaller for the 2-D mesh for p > 16
    » This is not true for the store-forward protocol on the mesh
    » This comparison is for the case where the cost is determined by the number of links
2/11/2003 platforms 116
Cost Performance Tradeoffs Continued
• Suppose the cost is determined by the bisection width
• This time, we increase the bandwidth of the mesh links by a factor of √p/4
  » This is the ratio of the bisection bandwidth of a hypercube to that of a wraparound mesh, each with p processors -- that is, (p/2)/(2√p) = √p/4 (see slide 58)
• Comparison:
  – For the 2-D mesh with cut-through routing (with lav = √p/2):
      tav_comm_2D = ts + th √p/2 + 4 tw m/√p
  – For the hypercube (with lav = (log p)/2):
      tav_comm_cube = ts + th (log p)/2 + tw m
  – Again, the average communication time is smaller for the 2-D mesh for p > 16
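The two averaged formulas can be compared numerically; the function names and the parameter values used in testing are made up, and the bisection-width fattening factor √p/4 is assumed.

```python
import math

def t_av_mesh(ts, th, tw, m, p):
    """Average C-T time on the fattened wraparound mesh (link bandwidth
    up by sqrt(p)/4, l_av = sqrt(p)/2):
    t_s + t_h*sqrt(p)/2 + 4*t_w*m/sqrt(p)."""
    return ts + th * math.sqrt(p) / 2 + 4 * tw * m / math.sqrt(p)

def t_av_cube(ts, th, tw, m, p):
    """Average C-T time on the hypercube (l_av = (log2 p)/2):
    t_s + t_h*(log2 p)/2 + t_w*m."""
    return ts + th * math.log2(p) / 2 + tw * m
```

For large m, the comparison reduces to 4/√p versus 1, which favors the mesh exactly when p > 16.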