ECE 5315
• Parallel programs used in class for assessing cache coherence protocols
Assessing Protocol Design
• The benchmark programs are executed on a multiprocessor simulator
• The observed state transitions determine the frequency of various events, such as cache misses and bus transactions
• Evaluate the effect of protocols in terms of design parameters (e.g., bandwidth requirements, cache block size, …)
• The analysis is based on the frequency of various events, not on absolute time (since it is a simulation)
State Transitions
• 16 Processors, 1MB 4-way set associative cache, 64B block, MESI protocol
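The state transitions the simulator counts can be sketched as a next-state table. The sketch below is an illustrative simplification of MESI for a single cache line, not the full protocol: real implementations also issue bus transactions (BusRd, BusRdX, BusUpgr, flushes) alongside each transition, and a read miss loads the line in state E when no other cache holds it (here the table assumes a sharer exists).

```python
# Minimal sketch of MESI next-state logic for one cache line.
# Events: local "read"/"write" (this cache) and snooped "bus_read"/"bus_readx"
# (transactions issued by other processors). Illustrative simplification.

MESI = {
    # (state, event) -> next state
    ("I", "read"):      "S",   # read miss; assumes another cache shares the
                               # line (otherwise the line would load as E)
    ("I", "write"):     "M",   # write miss: BusRdX, then modify
    ("S", "read"):      "S",
    ("S", "write"):     "M",   # BusUpgr: invalidate other sharers
    ("E", "read"):      "E",
    ("E", "write"):     "M",   # silent upgrade: no bus transaction needed
    ("M", "read"):      "M",
    ("M", "write"):     "M",
    # snooped transactions from other processors
    ("M", "bus_read"):  "S",   # flush dirty data, downgrade
    ("E", "bus_read"):  "S",
    ("S", "bus_read"):  "S",
    ("M", "bus_readx"): "I",   # another processor wants to write
    ("E", "bus_readx"): "I",
    ("S", "bus_readx"): "I",
}

state = "I"
for event in ["read", "write", "bus_read", "bus_readx"]:
    state = MESI[(state, event)]
print(state)  # follows I -> S -> M -> S -> I
```

Tallying how often each (state, event) pair fires in a benchmark trace is exactly what yields the miss and bus-transaction frequencies discussed above.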
Bandwidth Requirements
• 200 MIPS/MFLOPS, 1MB cache
• III – MESI protocol
• 3St – MSI with BusUpgr
• 3St-RdEx – MSI with BusRdX
[Figure: Bus traffic (MB/s) on the data and address buses for the parallel program workloads (Barnes, LU, Ocean, Radix, Radiosity, Raytrace) and the multiprogram workloads (Appl-Code, Appl-Data, OS-Code, OS-Data) under the III, 3St, and 3St-RdEx protocols]
Cache-Miss Types
• Cold miss --- occurs on the first reference to a memory block by a processor (compulsory miss)
• Capacity miss --- occurs when all the blocks referenced during the execution of a program do not fit in the cache
• Collision miss --- occurs in caches with less than full associativity, i.e., the referenced block does not fit in its set (conflict miss)
• Coherence miss --- occurs when blocks of data are shared among multiple processors
– True sharing: a word in a cache block produced by one processor is used by another processor
– False sharing: words accessed by different processors happen to be placed in the same block
Sharing Misses: Illustration
• True Sharing Miss
– One processor writes some words in a cache block
– The same block in other processors' caches is invalidated
– A second processor reads one of the modified words (read miss)
• False Sharing Miss
– One processor writes some words in a cache block
– The same block in other processors' caches is invalidated
– A second processor reads a different word in the same cache block
Sharing Misses
• True Sharing Miss
– Reduced by increasing the cache block size, when the workload has spatial locality
• False Sharing Miss
– Increases as the cache block size increases
– Would not occur if the cache block size were one word
– The current trend is toward larger cache block sizes, which potentially increases false sharing misses
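To make the block-size effect concrete, here is a minimal sketch (a toy invalidation model, not a protocol simulator) that counts false-sharing invalidations in a short access trace for two block sizes. The trace format and the "a copy is false-shared if the invalidated processor never touched the written word" rule are illustrative assumptions.

```python
# Toy model: any write by one processor invalidates every other processor's
# copy of the block. We count only the invalidations that are due to false
# sharing, i.e., the invalidated processor never touched the written word.

def false_sharing_invalidations(accesses, block_words):
    """accesses: list of (processor, word_address, is_write) in program order."""
    touched = {}  # (proc, block) -> set of words used in the current valid copy
    false_inval = 0
    for proc, addr, is_write in accesses:
        block = addr // block_words
        touched.setdefault((proc, block), set()).add(addr)
        if is_write:
            for (p, b) in list(touched):
                if p != proc and b == block:
                    if addr not in touched[(p, b)]:
                        false_inval += 1      # victim never used this word
                    del touched[(p, b)]       # copy invalidated either way
    return false_inval

# P0 repeatedly writes word 0 while P1 reads word 1: disjoint words
trace = [(1, 1, False), (0, 0, True), (1, 1, False), (0, 0, True)]
print(false_sharing_invalidations(trace, block_words=1))  # 0: one word per block
print(false_sharing_invalidations(trace, block_words=4))  # 2: words share a block
```

With one-word blocks the two processors never interfere; with four-word blocks every write by P0 needlessly invalidates P1's copy, which is exactly the trend described above.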
Classification of Cache Misses
• First reference to memory block by this processor?
– Yes: first access system-wide?
• Yes → 1. cold
• No: written before?
– No → 2. cold
– Yes: modified word(s) accessed during lifetime?
• No → 3. false-sharing-cold
• Yes → 4. true-sharing-cold
– No: reason for elimination of last copy?
• Invalidation: old copy with state = invalid still there?
– No: modified word(s) accessed during lifetime?
• No → 5. false-sharing-inval-cap
• Yes → 6. true-sharing-inval-cap
– Yes: modified word(s) accessed during lifetime?
• No → 7. pure-false-sharing
• Yes → 8. pure-true-sharing
• Replacement: has block been modified since replacement?
– No: modified word(s) accessed during lifetime?
• No → 9. pure-capacity
• Yes → 10. true-sharing-capacity
– Yes: modified word(s) accessed during lifetime?
• No → 11. false-sharing-cap-inval
• Yes → 12. true-sharing-cap-inval
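The classification questions can be sketched as a single decision function. This is an illustrative encoding of the tree; the argument names are made up here, and arguments that do not apply to a given branch may be left as None.

```python
# Sketch of the miss-classification decision tree as a function.
def classify_miss(first_ref, first_systemwide=None, written_before=None,
                  modified_accessed=None, by_invalidation=None,
                  invalid_copy_present=None, modified_since_repl=None):
    sharing = "true-sharing" if modified_accessed else "false-sharing"
    if first_ref:                           # cold branch (cases 1-4)
        if first_systemwide or not written_before:
            return "cold"                   # cases 1 and 2
        return f"{sharing}-cold"            # cases 3 and 4
    if by_invalidation:                     # last copy removed by invalidation
        if invalid_copy_present:
            return f"pure-{sharing}"        # cases 7 and 8
        return f"{sharing}-inval-cap"       # cases 5 and 6
    # last copy removed by replacement
    if not modified_since_repl:
        return ("true-sharing-capacity" if modified_accessed
                else "pure-capacity")       # cases 9 and 10
    return f"{sharing}-cap-inval"           # cases 11 and 12

print(classify_miss(True, first_systemwide=True))                  # cold
print(classify_miss(False, by_invalidation=True,
                    invalid_copy_present=True,
                    modified_accessed=True))                       # pure-true-sharing
```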
Impact of block size on miss rates (1MB cache)
• 16 processors, 1MB cache, 4-way set associative
• Cold, capacity, and true sharing misses tend to decrease with increasing block size
• False sharing misses tend to increase with block size
[Figure: Miss rates (%) broken down into cold, capacity, true sharing, false sharing, and upgrade components for Barnes, LU, Radiosity, Ocean, Radix, and Raytrace at block sizes of 8 to 256 bytes]
Impact of block size on miss rates (64KB cache)
• Overall miss rates increase
• Capacity misses are a much larger portion of overall misses
Impact of Block Size on Bus Traffic (1MB Cache)
• Data traffic quickly increases with block size
• Address bus traffic tends to decrease with block size
• Address traffic overhead comprises a significant fraction for small block sizes
• Traffic affects performance indirectly, through contention
[Figure: Bus traffic in bytes/instruction (bytes/FLOP for LU) on the data and address buses for Barnes, Radiosity, Raytrace, LU, Ocean, and Radix at block sizes of 8 to 256 bytes]
Impact of Block Size on Bus Traffic (64KB Cache)
• For Ocean, data traffic increases only slowly with block size (compared with the 1MB cache)
Drawbacks of Large Cache Blocks
• The trend toward larger cache block sizes is driven by the increasing density of processor and memory chips
• This trend bodes poorly for multiprocessor designs because of the potential increase in false sharing misses
Countering the effects of large block size
• Organize data structures or work assignments so that data accessed by different processes is not finely interleaved in the shared address space (software approach)
• Use sub-blocks within a cache block; one sub-block may be valid while others are invalid
• Use small cache blocks, but on a miss prefetch blocks beyond the accessed block
• Use an adjustable block size (complex)
• Delay propagating or applying invalidations from a processor until it has issued multiple writes
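The sub-block idea can be sketched with per-sub-block valid bits: a remote write invalidates only the affected sub-block, so a reader of a different sub-block does not miss. The class and method names below, and the four-sub-block granularity, are illustrative choices, not a real implementation.

```python
# Sketch of per-sub-block valid bits within one large cache line.
class SubBlockedLine:
    def __init__(self, n_sub=4):
        self.valid = [True] * n_sub    # one valid bit per sub-block

    def remote_write(self, sub):
        # a write by another processor invalidates only that sub-block,
        # not the whole line
        self.valid[sub] = False

    def hit(self, sub):
        return self.valid[sub]

line = SubBlockedLine()
line.remote_write(0)       # another processor writes sub-block 0
print(line.hit(1))         # True: sub-block 1 still valid, no false-sharing miss
print(line.hit(0))         # False: only the written sub-block misses
```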
Update-Based vs. Invalidation-Based Protocols
• Update-based protocols perform better if the processors that were using the data before it was updated are likely to use the new values in the future
• Invalidation-based protocols perform better if the processors are never going to use the new values in the future (since the update traffic is wasted)
Hybrid of Update and Invalidation (Mixed)
• Start with an update protocol and attach a counter to each block, initialized to a threshold k
• Whenever a cache block is accessed by the local processor, the counter is reset to k
• Every time an update is received for a block, the counter is decremented
• If the counter reaches zero, the block is locally invalidated
• The next time an update is generated, the block switches to the modified state and stops generating updates
• If some other processor then accesses the block, it switches back to the shared state and updates resume
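The counter mechanism above can be sketched for a single cached block as follows. The class and method names are illustrative; a real protocol would also handle the bus-side state changes.

```python
# Sketch of the counter-based hybrid (mixed) update/invalidate scheme for one
# cached block, with threshold k: local accesses reset the counter, received
# updates decrement it, and at zero the local copy is invalidated so updates
# are no longer sent to this cache.

class HybridBlock:
    def __init__(self, k=4):
        self.k = k
        self.counter = k
        self.valid = True

    def local_access(self):
        if self.valid:
            self.counter = self.k      # local use: block is worth keeping

    def remote_update(self):
        if self.valid:
            self.counter -= 1
            if self.counter == 0:
                self.valid = False     # k updates with no local use: invalidate

blk = HybridBlock(k=4)
for _ in range(3):
    blk.remote_update()
blk.local_access()          # local use resets the counter to 4
for _ in range(4):
    blk.remote_update()
print(blk.valid)            # False: four updates arrived with no local use
```

When every cache holding the block has invalidated it this way, the writer stops generating updates, which is how the scheme converges to invalidate-like behavior for data that is no longer shared.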
Update vs. Invalidate: Miss Rates
• k = 4 for mixed
• Lots of coherence misses: updates help
• Lots of capacity misses: updates hurt (data is kept in the cache uselessly)
[Figure: Miss rates (%) broken down into cold, capacity, true sharing, and false sharing components for LU, Ocean, Raytrace, and Radix under invalidate (inv), update (upd), and mixed (mix) protocols]
Update Protocols
• For applications with significant capacity miss rates, the misses increase with an update protocol
• False sharing decreases with an update protocol
• The traffic associated with updates is quite substantial (many bus transactions vs. one for an invalidation)
• The increased traffic can cause contention and can greatly increase the cost of misses
• Update protocols have greater problems on scalable systems
• The trend is away from update-based protocols as the default