CSE 661 PAPER PRESENTATION: PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING. By C. J. Hughes et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)


Page 1

CSE 661 PAPER PRESENTATION

PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING
By C. J. Hughes et al.

Presented By
SALAMI, Hamza Onoruoiza
g201002240

Page 2

OUTLINE OF PRESENTATION

• Throughput Computing
• Benchmarks Used
• Degree of Sharing of L2 Caches in Benchmarks
• Cache Designs Considered
• Experimental Setup
• Results (Performance and Energy)
• Possible Improvements
• Final Results
• Conclusion, Comments and Questions

Page 3

THROUGHPUT COMPUTING

• Performing a huge number of computations with large amounts of parallelism
• Also known as GPGPU

Page 4

BENCHMARKS USED

• Working set: 64KB – 2MB
• 64 threads, each with a private 32KB cache
• 256KB L2 cache
• L2 smaller than 2MB may result in bad performance

[Figure: L1 miss rate without prefetching]

Page 5

BENCHMARKS USED (2)

[Figure: L1 miss rate without prefetching]

Page 6

DEGREE OF SHARING OF L2 DEGREE OF SHARING OF L2 CACHE IN BENCHMARKCACHE IN BENCHMARK

SHARING DEGREE◦ Spatial: Fraction of each cache line accessed. Most data

private except for svm◦ Temporal: Fraction of accesses to line. Shared data is

prevalente.g pcg. 0.1% of lines involved in global R/W giving 19.2% of L2 cache accesses
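The two sharing metrics above can be computed from an L2 access trace. This is a hypothetical sketch (the trace format and all names are assumptions, not from the paper): a line counts as shared if more than one thread touches it, and we report both the fraction of lines that are shared and the fraction of accesses they receive.

```python
# Hypothetical sketch: temporal sharing degree from an L2 access trace.
# Trace format and all names are assumptions, not from the paper.
from collections import defaultdict

def sharing_stats(trace):
    """trace: iterable of (thread_id, line_addr) L2 accesses."""
    sharers = defaultdict(set)      # line -> set of threads touching it
    accesses = defaultdict(int)     # line -> number of accesses
    for tid, line in trace:
        sharers[line].add(tid)
        accesses[line] += 1
    shared = [l for l in sharers if len(sharers[l]) > 1]
    frac_lines = len(shared) / len(sharers)
    frac_accesses = sum(accesses[l] for l in shared) / sum(accesses.values())
    return frac_lines, frac_accesses

trace = [(0, 0xA), (1, 0xA), (0, 0xB), (0, 0xA), (2, 0xC)]
frac_lines, frac_acc = sharing_stats(trace)
# Line 0xA is shared by threads 0 and 1: 1 of 3 lines, 3 of 5 accesses
```

This is how a pcg-like profile shows up: a tiny fraction of shared lines can still dominate the access count.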

Page 7

CACHE DESIGNS CONSIDERED

ASSUMPTIONS
• Two-level caching (private L1, varying L2); inclusive caches
• Directory-based coherence
• Tiled design (Tile = Core + Private Caches + Switch)

Page 8

CACHE DESIGNS CONSIDERED (2)

1) PRIVATE LLC
• LLC is in each tile, private to the tile's core
• Most flexible design (replicas of a cache line can exist in all LLCs simultaneously)
• Fewer unique cache lines => more LLC misses
• Each tile contains a tag directory
• Hash function (cache block address) = home tile
• The home tile provides info on which LLC(s) hold the required data
• Cache-to-cache transfer takes place
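The home-tile lookup above can be sketched in a few lines. The paper only says a hash of the cache-block address selects the home tile; the specific modulo hash and the constants here are assumptions for illustration.

```python
# Hypothetical home-tile mapping: the slides say hash(cache block address)
# = home tile; the modulo hash and constants below are assumptions.
LINE_BYTES = 64   # cache line size (assumption)
NUM_TILES = 64    # matches the 64-thread setup on the benchmarks slide

def home_tile(addr):
    """Map a physical address to the tile whose directory tracks its line."""
    block = addr // LINE_BYTES   # drop the offset within the line
    return block % NUM_TILES     # simple modulo hash over the tiles

# Two addresses inside the same 64-byte line share a home tile:
assert home_tile(0x1000) == home_tile(0x103F)
```

On a miss, the requester asks this home tile, which knows which LLC(s) hold the line and arranges a cache-to-cache transfer.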

Page 9

CACHE DESIGNS CONSIDERED (3)

2) UNCONTROLLED REPLICATION
• Similar to private LLC
• Tries to increase the number of unique lines
• Eviction of a cache block with one sharer? Move the block to its home tile. Already in its home tile? Evict it from the chip.

Page 10

CACHE DESIGNS CONSIDERED (4)

3) CONTROLLED REPLICATION
• Builds on Uncontrolled Replication
• Tries to further increase the number of unique lines
• Each block has a reference bit; reference bit = 1 => likely part of the working set
• Duplicate copies of cache blocks not in active use are favored for LRU eviction
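The victim-selection rule above can be sketched as follows. This is a hypothetical sketch of the policy only: the way metadata layout (`ref`, `dup`) and the LRU ordering are assumptions, since the slides describe just the preference, not the mechanism.

```python
# Hypothetical sketch of Controlled Replication's victim choice in one LLC
# set. Field names ('ref' = reference bit, 'dup' = a replica exists in
# another LLC) and the LRU-ordered list are assumptions.
def pick_victim(ways):
    """ways: list of way metadata dicts, ordered from LRU to MRU."""
    # First choice: an unreferenced duplicate -- dropping it loses nothing,
    # since another on-chip copy survives and the line is not in active use.
    for i, way in enumerate(ways):
        if way['dup'] and not way['ref']:
            return i
    # Otherwise fall back to plain LRU (index 0).
    return 0

ways = [{'ref': 1, 'dup': False},   # active, only copy
        {'ref': 0, 'dup': True},    # inactive duplicate
        {'ref': 1, 'dup': True}]    # active duplicate
assert pick_victim(ways) == 1
```

Favoring inactive duplicates is what lets Controlled Replication keep more unique lines on chip than Uncontrolled Replication.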

Page 11

CACHE DESIGNS CONSIDERED (5)

4) NO REPLICATION
• Limited flexibility
• Cache lines reside in at most one LLC at a time
• Shared lines are held in the line's home tile's LLC (=> easy accessibility)
• Private lines are held in the user's LLC (the RDP points to the line's location)
• Eviction of a private line, or an increased number of sharers, returns the block to its home LLC

Page 12

CACHE DESIGNS CONSIDERED (6)

5) SHARED
• Least flexibility
• All cache lines reside in their home LLC
• Easy to find lines
• Increased average access latency and on-die traffic for private lines

Page 13

CACHE DESIGNS CONSIDERED (7)

[Table: the five designs (Private, Uncontrolled Replication, Controlled Replication, No Replication, Shared) compared along four axes: Flexibility, Reduction in on-die bandwidth usage, Effective cache capacity (number of unique blocks), and Reduction in off-die bandwidth usage. Flexibility and on-die bandwidth savings are highest for Private and lowest for Shared; effective capacity and off-die bandwidth savings trend the opposite way.]

Page 14

EXPERIMENTAL SETUP

• A simulator is used
• L1 has a hardware stride prefetcher

Energy consumption components:
• Storage energy: tag and cache line accesses to the LLC, tag directory and RDP
• On-die data messages
• On-die coherence messages
• Off-die accesses
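The four components above suggest a simple additive energy model: multiply each event count from the simulator by a per-event energy cost and sum. This is a hypothetical sketch; the per-event values below are made-up placeholders, not the paper's numbers.

```python
# Hypothetical additive energy model over the four components listed above.
# Per-event energies are placeholder values, NOT the paper's numbers.
ENERGY_PJ = {
    'storage':     10.0,   # one tag + line access (LLC / directory / RDP)
    'on_die_data': 25.0,   # one on-die data message
    'coherence':    5.0,   # one on-die coherence message
    'off_die':    500.0,   # one off-die (memory) access -- dominant per event
}

def total_energy(counts):
    """counts: event name -> number of such events during the run."""
    return sum(ENERGY_PJ[event] * n for event, n in counts.items())

run = {'storage': 1000, 'on_die_data': 200, 'coherence': 400, 'off_die': 10}
# 10*1000 + 25*200 + 5*400 + 500*10 = 22000 pJ
assert total_energy(run) == 22000.0
```

A model of this shape makes the later energy results intuitive: designs trade cheap-but-numerous on-die messages against rare-but-expensive off-die accesses.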

Page 15

RESULTS (PERFORMANCE)

• The least flexible designs offer better performance!
• Least flexible designs:
  ◦ High throughput to heavily read/written lines (on a miss, the home tile responds directly; no acknowledgement is needed)
  ◦ A single write causes invalidation for readers (less impact for centralized-data designs, worse for flexible designs)
• Flexible designs:
  ◦ No centralized data storage
  ◦ No overlapped cache-to-cache transfers; the directory receives an acknowledgement from the sending tile before processing another request

Page 16

RESULTS (ENERGY)

• Flexible designs consume significantly less energy than the other designs!
• Flexible designs minimize on-die traffic because of replication.
• Off-die traffic increases (fewer unique lines), but most lines have few sharers (see Figure 1).
• On-die traffic for No Replication is better than for Shared due to data migration.

• Off-die traffic decreases as we move from Private to Uncontrolled Replication to Controlled Replication (more unique lines are kept on chip).

Page 17

RESULTS SO FAR…

• Flexible designs are more energy efficient.

• Less flexible designs offer better performance.

• Controlled Replication uses the least energy.

• Can we improve its parallelism for handling multiple reads of the same cache line?

Page 18

POSSIBLE IMPROVEMENTS

Tag Directory Buffer
◦ Small, fully associative buffer added to the tag directory to hold clean lines having at least 3 shared readers (similar to the Shared design)

Tag Directory Buffer All
◦ Similar to Tag Directory Buffer
◦ In this case, all read-shared lines are placed in the tag directory buffer

A four-entry buffer of 256 bytes is used.
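The Tag Directory Buffer admission rule can be sketched as follows. This is a hypothetical sketch: the class shape, field names, and LRU replacement are assumptions; only the four-entry capacity and the "clean, at least 3 readers" condition come from the slides.

```python
# Hypothetical sketch of the Tag Directory Buffer: a tiny fully associative
# buffer at the tag directory admitting clean lines with >= 3 readers.
# Only the 4-entry capacity and admission rule come from the slides.
from collections import OrderedDict

class TagDirectoryBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.buf = OrderedDict()            # line_addr -> data, LRU order

    def maybe_admit(self, line, data, clean, num_readers):
        if not clean or num_readers < 3:
            return False                    # only clean, widely read lines
        if len(self.buf) >= self.entries:
            self.buf.popitem(last=False)    # evict the LRU entry
        self.buf[line] = data
        return True

tdb = TagDirectoryBuffer()
assert tdb.maybe_admit(0xA, b'..', clean=True, num_readers=3)
assert not tdb.maybe_admit(0xB, b'..', clean=True, num_readers=2)
```

The idea is that reads of these hot shared lines are then served directly at the directory, Shared-style, without serialized cache-to-cache transfers.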

Page 19

POSSIBLE IMPROVEMENTS (2)

Sharing Migration
◦ Similar to Tag Directory Buffer
◦ However, uses the home tile's LLC instead of a buffer

Sharing Migration All
◦ Similar to Tag Directory Buffer All
◦ Uses the home tile's LLC instead of a buffer

Parallel Reads
◦ Allows simultaneous (overlapped) cache-to-cache transfers of the same cache line for reads

Page 20

FINAL RESULTS

Tag Directory Buffer provides the highest performance and close to the least energy consumption. See also Figure 3.

Page 21

CONCLUDING REMARKS, COMMENTS AND QUESTIONS

THANK YOU
