ECE 5315
• Parallel programs used in class for assessing cache coherence protocols
Assessing Protocol Design
• The benchmark programs are executed on a multiprocessor simulator
• The observed state transitions determine the frequency of various events, such as cache misses and bus transactions
• Evaluate the effect of protocols in terms of design parameters (e.g., bandwidth requirements, cache block size, …)
• The analysis is based on the frequency of various events, not on absolute time (since it is a simulation)
State Transitions
• 16 Processors, 1MB 4-way set associative cache, 64B block, MESI protocol
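The state transitions the simulator counts can be sketched as a next-state table. The sketch below is an illustrative simplification of MESI for a single cache line, not the full protocol: real implementations also issue bus transactions (BusRd, BusRdX, BusUpgr, flushes) alongside each transition, and a read miss loads the line in state E when no other cache holds it (here the table assumes a sharer exists).

```python
# Minimal sketch of MESI next-state logic for one cache line.
# Events: local "read"/"write" (this cache) and snooped "bus_read"/"bus_readx"
# (transactions issued by other processors). Illustrative simplification.

MESI = {
    # (state, event) -> next state
    ("I", "read"):      "S",   # read miss; assumes another cache shares the
                               # line (otherwise the line would load as E)
    ("I", "write"):     "M",   # write miss: BusRdX, then modify
    ("S", "read"):      "S",
    ("S", "write"):     "M",   # BusUpgr: invalidate other sharers
    ("E", "read"):      "E",
    ("E", "write"):     "M",   # silent upgrade: no bus transaction needed
    ("M", "read"):      "M",
    ("M", "write"):     "M",
    # snooped transactions from other processors
    ("M", "bus_read"):  "S",   # flush dirty data, downgrade
    ("E", "bus_read"):  "S",
    ("S", "bus_read"):  "S",
    ("M", "bus_readx"): "I",   # another processor wants to write
    ("E", "bus_readx"): "I",
    ("S", "bus_readx"): "I",
}

state = "I"
for event in ["read", "write", "bus_read", "bus_readx"]:
    state = MESI[(state, event)]
print(state)  # follows I -> S -> M -> S -> I
```

Tallying how often each (state, event) pair fires in a benchmark trace is exactly what yields the miss and bus-transaction frequencies discussed above.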
Bandwidth Requirements
• 200 MIPS/MFLOPS, 1MB cache
• III – MESI protocol
• 3St – MSI with BusUpgr
• 3St-RdEx – MSI with BusRdX
[Figure: Bus traffic (MB/s) on the data and address buses for the parallel program workloads (Barnes, LU, Ocean, Radix, Radiosity, Raytrace) and the multiprogram workloads (Appl-Code, Appl-Data, OS-Code, OS-Data) under the III, 3St, and 3St-RdEx protocols]
Cache-Miss Types
• Cold miss --- occurs on the first reference to a memory block by a processor (compulsory miss)
• Capacity miss --- occurs when all the blocks referenced during the execution of a program do not fit in the cache
• Collision miss --- occurs in caches with less than full associativity, i.e., the referenced block does not fit in its set (conflict miss)
• Coherence miss --- occurs when blocks of data are shared among multiple processors
– True sharing: a word in a cache block produced by one processor is used by another processor
– False sharing: words accessed by different processors happen to be placed in the same block
Sharing Misses: Illustration
• True Sharing Miss
– One processor writes some words in a cache block
– The same block in other processors' caches is invalidated
– A second processor reads one of the modified words (read miss)
• False Sharing Miss
– One processor writes some words in a cache block
– The same block in other processors' caches is invalidated
– A second processor reads a different word in the same cache block
Sharing Misses
• True Sharing Miss
– Reduced by increasing the cache block size, when the workload has spatial locality
• False Sharing Miss
– Increases as the cache block size increases
– Would not occur if the cache block size were one word
– The current trend is toward larger cache block sizes, which potentially increases false sharing misses
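To make the block-size effect concrete, here is a minimal sketch (a toy invalidation model, not a protocol simulator) that counts false-sharing invalidations in a short access trace for two block sizes. The trace format and the "a copy is false-shared if the invalidated processor never touched the written word" rule are illustrative assumptions.

```python
# Toy model: any write by one processor invalidates every other processor's
# copy of the block. We count only the invalidations that are due to false
# sharing, i.e., the invalidated processor never touched the written word.

def false_sharing_invalidations(accesses, block_words):
    """accesses: list of (processor, word_address, is_write) in program order."""
    touched = {}  # (proc, block) -> set of words used in the current valid copy
    false_inval = 0
    for proc, addr, is_write in accesses:
        block = addr // block_words
        touched.setdefault((proc, block), set()).add(addr)
        if is_write:
            for (p, b) in list(touched):
                if p != proc and b == block:
                    if addr not in touched[(p, b)]:
                        false_inval += 1      # victim never used this word
                    del touched[(p, b)]       # copy invalidated either way
    return false_inval

# P0 repeatedly writes word 0 while P1 reads word 1: disjoint words
trace = [(1, 1, False), (0, 0, True), (1, 1, False), (0, 0, True)]
print(false_sharing_invalidations(trace, block_words=1))  # 0: one word per block
print(false_sharing_invalidations(trace, block_words=4))  # 2: words share a block
```

With one-word blocks the two processors never interfere; with four-word blocks every write by P0 needlessly invalidates P1's copy, which is exactly the trend described above.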
Classification of Cache Misses
• First reference to memory block by this processor?
– Yes: first access system-wide?
• Yes → 1. cold
• No: written before?
– No → 2. cold
– Yes: modified word(s) accessed during lifetime?
• No → 3. false-sharing-cold
• Yes → 4. true-sharing-cold
– No: reason for elimination of last copy?
• Invalidation: old copy with state = invalid still there?
– No: modified word(s) accessed during lifetime?
• No → 5. false-sharing-inval-cap
• Yes → 6. true-sharing-inval-cap
– Yes: modified word(s) accessed during lifetime?
• No → 7. pure-false-sharing
• Yes → 8. pure-true-sharing
• Replacement: has block been modified since replacement?
– No: modified word(s) accessed during lifetime?
• No → 9. pure-capacity
• Yes → 10. true-sharing-capacity
– Yes: modified word(s) accessed during lifetime?
• No → 11. false-sharing-cap-inval
• Yes → 12. true-sharing-cap-inval
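The classification questions can be sketched as a single decision function. This is an illustrative encoding of the tree; the argument names are made up here, and arguments that do not apply to a given branch may be left as None.

```python
# Sketch of the miss-classification decision tree as a function.
def classify_miss(first_ref, first_systemwide=None, written_before=None,
                  modified_accessed=None, by_invalidation=None,
                  invalid_copy_present=None, modified_since_repl=None):
    sharing = "true-sharing" if modified_accessed else "false-sharing"
    if first_ref:                           # cold branch (cases 1-4)
        if first_systemwide or not written_before:
            return "cold"                   # cases 1 and 2
        return f"{sharing}-cold"            # cases 3 and 4
    if by_invalidation:                     # last copy removed by invalidation
        if invalid_copy_present:
            return f"pure-{sharing}"        # cases 7 and 8
        return f"{sharing}-inval-cap"       # cases 5 and 6
    # last copy removed by replacement
    if not modified_since_repl:
        return ("true-sharing-capacity" if modified_accessed
                else "pure-capacity")       # cases 9 and 10
    return f"{sharing}-cap-inval"           # cases 11 and 12

print(classify_miss(True, first_systemwide=True))                  # cold
print(classify_miss(False, by_invalidation=True,
                    invalid_copy_present=True,
                    modified_accessed=True))                       # pure-true-sharing
```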
Impact of block size on miss rates (1MB cache)
• 16 processors, 1MB cache, 4-way set associative
• Cold, capacity, and true sharing misses tend to decrease with increasing block size
• False sharing misses tend to increase with block size
[Figure: Miss rates (%) broken down into cold, capacity, true sharing, false sharing, and upgrade components for Barnes, LU, Radiosity, Ocean, Radix, and Raytrace at block sizes of 8 to 256 bytes]
Impact of block size on miss rates (64KB cache)
• Overall miss rates increase
• Capacity misses are a much larger portion of overall misses
Impact of Block Size on Bus Traffic (1MB Cache)
• Data traffic quickly increases with block size
• Address bus traffic tends to decrease with block size
• Address traffic overhead comprises a significant fraction for small block sizes
• Traffic affects performance indirectly, through contention
[Figure: Bus traffic in bytes/instruction (bytes/FLOP for LU) on the data and address buses for Barnes, Radiosity, Raytrace, LU, Ocean, and Radix at block sizes of 8 to 256 bytes]
Impact of Block Size on Bus Traffic (64KB Cache)
• For Ocean, data traffic increases only slowly with block size (compared with the 1MB cache)
Drawbacks of Large Cache Blocks
• The trend toward larger cache block sizes is driven by the increasing density of processor and memory chips
• This trend bodes poorly for multiprocessor designs because of the potential increase in false sharing misses
Countering the effects of large block size
• Organize data structures or work assignments so that data accessed by different processes is not finely interleaved in the shared address space (software approach)
• Use sub-blocks within a cache block; one sub-block may be valid while others are invalid
• Use small cache blocks, but on a miss prefetch blocks beyond the accessed block
• Use an adjustable block size (complex)
• Delay propagating or applying invalidations from a processor until it has issued multiple writes
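The sub-block idea can be sketched with per-sub-block valid bits: a remote write invalidates only the affected sub-block, so a reader of a different sub-block does not miss. The class and method names below, and the four-sub-block granularity, are illustrative choices, not a real implementation.

```python
# Sketch of per-sub-block valid bits within one large cache line.
class SubBlockedLine:
    def __init__(self, n_sub=4):
        self.valid = [True] * n_sub    # one valid bit per sub-block

    def remote_write(self, sub):
        # a write by another processor invalidates only that sub-block,
        # not the whole line
        self.valid[sub] = False

    def hit(self, sub):
        return self.valid[sub]

line = SubBlockedLine()
line.remote_write(0)       # another processor writes sub-block 0
print(line.hit(1))         # True: sub-block 1 still valid, no false-sharing miss
print(line.hit(0))         # False: only the written sub-block misses
```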
Update-Based vs. Invalidation-Based Protocols
• Update-based protocols perform better if the processors that were using the data before it was updated are likely to use the new values in the future
• Invalidation-based protocols perform better if the processors are never going to use the new values in the future (since the update traffic is wasted)
Hybrid of Update and Invalidation (Mixed)
• Start with an update protocol and attach a counter to each block, initialized to a threshold k
• Whenever a cache block is accessed by the local processor, the counter is reset to k
• Every time an update is received for a block, the counter is decremented
• If the counter reaches zero, the block is locally invalidated
• The next time an update is generated, the block switches to the modified state and stops generating updates
• If some other processor then accesses the block, it switches back to the shared state and updates resume
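The counter mechanism above can be sketched for a single cached block as follows. The class and method names are illustrative; a real protocol would also handle the bus-side state changes.

```python
# Sketch of the counter-based hybrid (mixed) update/invalidate scheme for one
# cached block, with threshold k: local accesses reset the counter, received
# updates decrement it, and at zero the local copy is invalidated so updates
# are no longer sent to this cache.

class HybridBlock:
    def __init__(self, k=4):
        self.k = k
        self.counter = k
        self.valid = True

    def local_access(self):
        if self.valid:
            self.counter = self.k      # local use: block is worth keeping

    def remote_update(self):
        if self.valid:
            self.counter -= 1
            if self.counter == 0:
                self.valid = False     # k updates with no local use: invalidate

blk = HybridBlock(k=4)
for _ in range(3):
    blk.remote_update()
blk.local_access()          # local use resets the counter to 4
for _ in range(4):
    blk.remote_update()
print(blk.valid)            # False: four updates arrived with no local use
```

When every cache holding the block has invalidated it this way, the writer stops generating updates, which is how the scheme converges to invalidate-like behavior for data that is no longer shared.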
Update vs. Invalidate: Miss Rates
• k = 4 for mixed
• Lots of coherence misses: updates help
• Lots of capacity misses: updates hurt (data is kept in the cache uselessly)
[Figure: Miss rates (%) broken down into cold, capacity, true sharing, and false sharing components for LU, Ocean, Raytrace, and Radix under invalidate (inv), update (upd), and mixed (mix) protocols]
Update Protocols
• For applications with significant capacity miss rates, the misses increase with an update protocol
• False sharing decreases with an update protocol
• The traffic associated with updates is quite substantial (many bus transactions vs. one for an invalidation)
• The increased traffic can cause contention and can greatly increase the cost of misses
• Update protocols have greater problems on scalable systems
• The trend is away from update-based protocols as the default