CSE 661 PAPER PRESENTATION: PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING. By C. J. Hughes et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)


Page 1

CSE 661 PAPER PRESENTATION

PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING
By C. J. Hughes et al.

Presented By
SALAMI, Hamza Onoruoiza
g201002240

Page 2

OUTLINE OF PRESENTATION

• Throughput Computing
• Benchmarks Used
• Degree of Sharing of L2 Caches in Benchmarks
• Cache Designs Considered
• Experimental Setup
• Results (Performance and Energy)
• Possible Improvements
• Final Results
• Conclusion, Comments and Questions

Page 3

THROUGHPUT COMPUTING

• Performing a huge number of computations with large amounts of parallelism
• Also known as GPGPU

Page 4

BENCHMARKS USED

• Working set: 64KB – 2MB
• 64 threads, each with a private 32KB cache
• 256KB L2 cache
• L2 smaller than 2MB may result in bad performance

[Figure: L1 miss rate without prefetching]

Page 5

BENCHMARKS USED (2)

[Figure: L1 miss rate without prefetching]

Page 6

DEGREE OF SHARING OF L2 DEGREE OF SHARING OF L2 CACHE IN BENCHMARKCACHE IN BENCHMARK

SHARING DEGREE◦ Spatial: Fraction of each cache line accessed. Most data

private except for svm◦ Temporal: Fraction of accesses to line. Shared data is

prevalente.g pcg. 0.1% of lines involved in global R/W giving 19.2% of L2 cache accesses
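The two sharing metrics above can be computed from an L2 access trace. This is a hypothetical sketch (the trace format and all names are assumptions, not from the paper): a line counts as shared if more than one thread touches it, and we report both the fraction of lines that are shared and the fraction of accesses they receive.

```python
# Hypothetical sketch: temporal sharing degree from an L2 access trace.
# Trace format and all names are assumptions, not from the paper.
from collections import defaultdict

def sharing_stats(trace):
    """trace: iterable of (thread_id, line_addr) L2 accesses."""
    sharers = defaultdict(set)      # line -> set of threads touching it
    accesses = defaultdict(int)     # line -> number of accesses
    for tid, line in trace:
        sharers[line].add(tid)
        accesses[line] += 1
    shared = [l for l in sharers if len(sharers[l]) > 1]
    frac_lines = len(shared) / len(sharers)
    frac_accesses = sum(accesses[l] for l in shared) / sum(accesses.values())
    return frac_lines, frac_accesses

trace = [(0, 0xA), (1, 0xA), (0, 0xB), (0, 0xA), (2, 0xC)]
frac_lines, frac_acc = sharing_stats(trace)
# Line 0xA is shared by threads 0 and 1: 1 of 3 lines, 3 of 5 accesses
```

This is how a pcg-like profile shows up: a tiny fraction of shared lines can still dominate the access count.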

Page 7

CACHE DESIGNS CONSIDERED

ASSUMPTIONS
• Two-level caching (private L1, varying L2); inclusive caches
• Directory-based coherence
• Tiled design (Tile = Core + Private Caches + Switch)

Page 8

CACHE DESIGNS CONSIDERED (2)

1) PRIVATE LLC
• LLC is in each tile, private to the tile's core
• Most flexible design (replicas of a cache line can exist in all LLCs simultaneously)
• Fewer unique cache lines => more LLC misses
• Each tile contains a tag directory
• Hash function (cache block address) = home tile
• The home tile provides info on which LLC(s) hold the required data
• Cache-to-cache transfer takes place
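The home-tile lookup above can be sketched in a few lines. The paper only says a hash of the cache-block address selects the home tile; the specific modulo hash and the constants here are assumptions for illustration.

```python
# Hypothetical home-tile mapping: the slides say hash(cache block address)
# = home tile; the modulo hash and constants below are assumptions.
LINE_BYTES = 64   # cache line size (assumption)
NUM_TILES = 64    # matches the 64-thread setup on the benchmarks slide

def home_tile(addr):
    """Map a physical address to the tile whose directory tracks its line."""
    block = addr // LINE_BYTES   # drop the offset within the line
    return block % NUM_TILES     # simple modulo hash over the tiles

# Two addresses inside the same 64-byte line share a home tile:
assert home_tile(0x1000) == home_tile(0x103F)
```

On a miss, the requester asks this home tile, which knows which LLC(s) hold the line and arranges a cache-to-cache transfer.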

Page 9

CACHE DESIGNS CONSIDERED (3)

2) UNCONTROLLED REPLICATION
• Similar to private LLC
• Tries to increase the number of unique lines
• Eviction of a cache block with one sharer? Move the block to its home tile. Already in its home tile? Evict it from the chip.

Page 10

CACHE DESIGNS CONSIDERED (4)

3) CONTROLLED REPLICATION
• Builds on Uncontrolled Replication
• Tries to further increase the number of unique lines
• Each block has a reference bit; reference bit = 1 => likely part of the working set
• Duplicate copies of cache blocks not in active use are favored for LRU eviction
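The victim-selection rule above can be sketched as follows. This is a hypothetical sketch of the policy only: the way metadata layout (`ref`, `dup`) and the LRU ordering are assumptions, since the slides describe just the preference, not the mechanism.

```python
# Hypothetical sketch of Controlled Replication's victim choice in one LLC
# set. Field names ('ref' = reference bit, 'dup' = a replica exists in
# another LLC) and the LRU-ordered list are assumptions.
def pick_victim(ways):
    """ways: list of way metadata dicts, ordered from LRU to MRU."""
    # First choice: an unreferenced duplicate -- dropping it loses nothing,
    # since another on-chip copy survives and the line is not in active use.
    for i, way in enumerate(ways):
        if way['dup'] and not way['ref']:
            return i
    # Otherwise fall back to plain LRU (index 0).
    return 0

ways = [{'ref': 1, 'dup': False},   # active, only copy
        {'ref': 0, 'dup': True},    # inactive duplicate
        {'ref': 1, 'dup': True}]    # active duplicate
assert pick_victim(ways) == 1
```

Favoring inactive duplicates is what lets Controlled Replication keep more unique lines on chip than Uncontrolled Replication.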

Page 11

CACHE DESIGNS CONSIDERED (5)

4) NO REPLICATION
• Limited flexibility
• Cache lines reside in at most one LLC at a time
• Shared lines are held in the line's home tile's LLC (=> easy accessibility)
• Private lines are held in the user's LLC (the RDP points to the line's location)
• Eviction of a private line, or an increased number of sharers, returns the block to its home LLC

Page 12

CACHE DESIGNS CONSIDERED (6)

5) SHARED
• Least flexibility
• All cache lines reside in their home LLC
• Easy to find lines
• Increased average access latency and on-die traffic for private lines

Page 13

CACHE DESIGNS CONSIDERED (7)

[Table: the five designs (Private, Uncontrolled Replication, Controlled Replication, No Replication, Shared) compared along four axes: Flexibility, Reduction in on-die bandwidth usage, Effective cache capacity (number of unique blocks), and Reduction in off-die bandwidth usage. Flexibility and on-die bandwidth savings are highest for Private and lowest for Shared; effective capacity and off-die bandwidth savings trend the opposite way.]

Page 14

EXPERIMENTAL SETUP

• A simulator is used
• L1 has a hardware stride prefetcher

Energy consumption components:
• Storage energy: tag and cache line accesses to the LLC, tag directory and RDP
• On-die data messages
• On-die coherence messages
• Off-die accesses
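The four components above suggest a simple additive energy model: multiply each event count from the simulator by a per-event energy cost and sum. This is a hypothetical sketch; the per-event values below are made-up placeholders, not the paper's numbers.

```python
# Hypothetical additive energy model over the four components listed above.
# Per-event energies are placeholder values, NOT the paper's numbers.
ENERGY_PJ = {
    'storage':     10.0,   # one tag + line access (LLC / directory / RDP)
    'on_die_data': 25.0,   # one on-die data message
    'coherence':    5.0,   # one on-die coherence message
    'off_die':    500.0,   # one off-die (memory) access -- dominant per event
}

def total_energy(counts):
    """counts: event name -> number of such events during the run."""
    return sum(ENERGY_PJ[event] * n for event, n in counts.items())

run = {'storage': 1000, 'on_die_data': 200, 'coherence': 400, 'off_die': 10}
# 10*1000 + 25*200 + 5*400 + 500*10 = 22000 pJ
assert total_energy(run) == 22000.0
```

A model of this shape makes the later energy results intuitive: designs trade cheap-but-numerous on-die messages against rare-but-expensive off-die accesses.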

Page 15

RESULTS (PERFORMANCE)

• The least flexible designs offer better performance!
• Least flexible designs:
  ◦ High throughput to heavily read/written lines (on a miss, the home tile responds directly; no acknowledgement is needed)
  ◦ A single write causes invalidation for readers (less impact for centralized-data designs, worse for flexible designs)
• Flexible designs:
  ◦ No centralized data storage
  ◦ No overlapped cache-to-cache transfers; the directory receives an acknowledgement from the sending tile before processing another request

Page 16

RESULTS (ENERGY)

• Flexible designs consume significantly less energy than the other designs!
• Flexible designs minimize on-die traffic because of replication.
• Off-die traffic increases (fewer unique lines), but most lines have few sharers (see Figure 1).
• On-die traffic for No Replication is better than for Shared due to data migration.

• Off-die traffic decreases as we move from Private to Uncontrolled Replication to Controlled Replication (more unique lines are kept on chip).

Page 17

RESULTS SO FAR…

• Flexible designs are more energy efficient.

• Less flexible designs offer better performance.

• Controlled Replication uses the least energy.

• Can we improve its parallelism for handling multiple reads of the same cache line?

Page 18

POSSIBLE IMPROVEMENTS

Tag Directory Buffer
◦ Small, fully associative buffer added to the tag directory to hold clean lines having at least 3 shared readers (similar to the Shared design)

Tag Directory Buffer All
◦ Similar to Tag Directory Buffer
◦ In this case, all read-shared lines are placed in the tag directory buffer

A four-entry buffer of 256 bytes is used.
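The Tag Directory Buffer admission rule can be sketched as follows. This is a hypothetical sketch: the class shape, field names, and LRU replacement are assumptions; only the four-entry capacity and the "clean, at least 3 readers" condition come from the slides.

```python
# Hypothetical sketch of the Tag Directory Buffer: a tiny fully associative
# buffer at the tag directory admitting clean lines with >= 3 readers.
# Only the 4-entry capacity and admission rule come from the slides.
from collections import OrderedDict

class TagDirectoryBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.buf = OrderedDict()            # line_addr -> data, LRU order

    def maybe_admit(self, line, data, clean, num_readers):
        if not clean or num_readers < 3:
            return False                    # only clean, widely read lines
        if len(self.buf) >= self.entries:
            self.buf.popitem(last=False)    # evict the LRU entry
        self.buf[line] = data
        return True

tdb = TagDirectoryBuffer()
assert tdb.maybe_admit(0xA, b'..', clean=True, num_readers=3)
assert not tdb.maybe_admit(0xB, b'..', clean=True, num_readers=2)
```

The idea is that reads of these hot shared lines are then served directly at the directory, Shared-style, without serialized cache-to-cache transfers.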

Page 19

POSSIBLE IMPROVEMENTS (2)

Sharing Migration
◦ Similar to Tag Directory Buffer
◦ However, uses the home tile's LLC instead of a buffer

Sharing Migration All
◦ Similar to Tag Directory Buffer All
◦ Uses the home tile's LLC instead of a buffer

Parallel Reads
◦ Allows simultaneous (overlapped) cache-to-cache transfers of the same cache line for reads

Page 20

FINAL RESULTS

Tag Directory Buffer provides the highest performance and close to the least energy consumption. See also Figure 3.

Page 21

CONCLUDING REMARKS, COMMENTS AND QUESTIONS

THANK YOU
