Page 1:

Optimizing Communication and Capacity in 3D Stacked Cache Hierarchies

Aniruddha Udipi

N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni and D. Newell

University of Utah and Intel STL

Page 2:

Motivation

• Many-core designs require large cache capacity for performance

• SRAM has low latency and consumes less power

• DRAM has 8X density but poor latency/power characteristics

• Can we design a hybrid SRAM-DRAM cache to take advantage of both technologies?

• Can we build a customized on-chip network specifically targeted at such a design?

Page 3:

Proposal - 3D Stacked Hybrid Cache

[Figure: a DRAM bank stacked directly above an SRAM bank]

• Not an option in conventional 2D design

• 3D Mixed-process stacking enables a single vertical SRAM/DRAM bank

Page 4:

Executive Summary

• 3D stacked hybrid cache design

• Synergistic proposals to improve performance and power efficiency

– Optimizing Capacity
  • Reconfigurable cache hierarchy
– Optimizing Communication
  • Page coloring for effective data placement - reduced communication
  • Tailor-made on-chip interconnection network - quicker communication

• Up to 62% performance increase

Page 5:

Outline

• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page Coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions

Page 6:

3D Technology

+ Mixed process integration possible

+ High speed vertical interconnects

- Thermal Issues

Source: Black et al. MICRO’06

[Figure: cross-section of two stacked dies (bulk Si, active Si, and metal layers) joined by die-to-die vias, with through-silicon vias (TSVs), I/O bumps, and a heat sink.]

Page 7:

Baseline Model

Lower die - 16 Processing cores

Upper die - 16 SRAM banks, with a grid-based on-chip network

Page 8:

Outline

• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page Coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions

Page 9:

Technique I - Reconfigurable hierarchy

• Increase capacity by stacking a DRAM bank on each SRAM cache bank, reconfigure bank size based on demand

• More compelling with 3D and NUCA
  – Spare capacity on die 3 does not intrude on the layout of the second die or steal capacity from neighboring caches
  – The cache is already partitioned into NUCA banks; additional banks do not complicate the logic too much
  – Access time grows less than linearly with capacity
  – Dramatic increase in capacity; no gradation, only two choices

• Turn off the DRAM for small working-set sizes

Page 10:

Proposed Reconfigurable Cache Model

[Figure: proposed three-die stack]
• Die containing 16 cores
• Die containing 16 SRAM banks and the tree interconnect
• Die containing 16 DRAM banks and no interconnect
• Inter-die via pillar to send requests from a core to the L2 SRAM (not shown: one pillar for each core)
• Inter-die via pillar to access the portion of L2 in DRAM (not shown: one pillar per sector)

Page 11:

Proposed Reconfiguration Policy

• Simple heuristic for enabling/disabling a DRAM bank: every reconfiguration interval,
  – If usage is low and the cache-bank miss rate is low, disable the DRAM bank above
  – If usage is high and the cache-bank miss rate is high, enable the DRAM bank above
• The reconfiguration interval is 10 million cycles
• All cores are stalled for 100K cycles during reconfiguration
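A minimal sketch of how this per-bank heuristic might look in a simulator; the 10M-cycle interval and 100K-cycle stall come from the slide, while the usage and miss-rate thresholds and all names are illustrative assumptions:

#include <cstdint>

// Per-bank counters gathered over one reconfiguration interval.
struct BankStats {
  uint64_t accesses;   // demand accesses to this SRAM bank
  uint64_t misses;     // misses in this bank
};

constexpr uint64_t kIntervalCycles = 10'000'000;  // reconfiguration interval (from the slide)
constexpr uint64_t kStallCycles    = 100'000;     // cores stalled while reconfiguring (from the slide)
constexpr uint64_t kUsageThreshold = 100'000;     // "high usage" cutoff -- placeholder value
constexpr double   kMissThreshold  = 0.05;        // "high miss rate" cutoff -- placeholder value

// Decide whether the DRAM bank stacked above this SRAM bank should be enabled
// for the next interval.
bool reconfigure(const BankStats& s, bool dramEnabled) {
  double missRate = s.accesses ? double(s.misses) / double(s.accesses) : 0.0;
  bool highUsage  = s.accesses > kUsageThreshold;
  bool highMiss   = missRate   > kMissThreshold;
  if (highUsage && highMiss)   return true;   // working set overflows SRAM: enable the DRAM above
  if (!highUsage && !highMiss) return false;  // small working set: power the DRAM bank off
  return dramEnabled;                         // otherwise keep the current configuration
}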

Page 12:

Cache Organization


[Figure: SRAM bank (tag array, data array, adaptive arrays) stacked below a DRAM bank. The adaptive arrays become tag arrays for the ways in DRAM. Low access pressure: DRAM off, 4 SRAM ways, 1 MB total capacity. High access pressure: DRAM on, 2 SRAM ways + 32 DRAM ways, 9 MB total capacity.]

Page 13:

Cache Organization

• SRAM banks have three memory arrays – tag array, data array, adaptive array (can act as both tag & data)

• Whenever the DRAM banks are switched on, tags for the DRAM ways are implemented in part of the SRAM
  – Quick tag lookup
• The increased capacity manifests as additional ways
  – Cache lines in SRAM need not be flushed on reconfiguration
  – Two ways of data remain available at low latency; moving MRU data to these ways further increases efficiency
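A minimal sketch of the lookup flow this organization implies, with all tags (SRAM ways and DRAM ways) probed in the SRAM die; the 4/32 way counts follow the organization figure, and the names and structure sizes are illustrative assumptions:

#include <array>
#include <cstdint>

// One set of a reconfigurable bank: 4 SRAM ways plus up to 32 DRAM ways whose
// tags live in the SRAM adaptive arrays (so only 2 SRAM ways hold data when
// the DRAM above is enabled).
struct Set {
  bool dramEnabled;
  std::array<uint64_t, 4>  sramTags;
  std::array<uint64_t, 32> dramTags;   // meaningful only when dramEnabled
};

enum class Hit { SramWay, DramWay, Miss };

// Because every tag is in SRAM, one parallel tag lookup (modeled sequentially
// here) decides whether the data comes from the fast SRAM die, the DRAM die
// above, or the next level of the hierarchy.
Hit lookup(const Set& set, uint64_t tag) {
  for (uint64_t t : set.sramTags)
    if (t == tag) return Hit::SramWay;        // data read from the SRAM die
  if (set.dramEnabled)
    for (uint64_t t : set.dramTags)
      if (t == tag) return Hit::DramWay;      // data fetched over the inter-die pillar
  return Hit::Miss;
}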

Page 14:

Why is this better than an L2/L3 hierarchy?

• An additional access penalty is incurred on an L2 miss before the L3 is accessed to service the request

– In our scheme, we look up all tags in parallel, in the SRAM

• An additional level implies additional coherence complexity

• Our experiments show a non-trivial performance degradation when implementing the SRAM/DRAM as an L2/L3 hierarchy compared to our scheme

Page 15:

Outline

• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page Coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions

Page 16:

Technique II - Page Coloring

• The OS controls which physical page number is assigned to each virtual page, and therefore part of the cache index

• The page-color bits can be manipulated to redirect cache-line placement

[Figure: address breakdown. Cache view: tag | index | offset. Physical address: physical page number | page offset. The page color is the part of the physical page number that overlaps the cache index.]
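A minimal sketch of the bit manipulation this makes possible; the 4 KB page size and 16 banks match the methodology slide, while treating the page-color bits directly as the bank index, and the helper names, are illustrative assumptions:

#include <cstdint>

constexpr unsigned kPageOffsetBits = 12;  // 4 KB pages (methodology slide)
constexpr unsigned kBankBits       = 4;   // 16 cache banks

// Page color: the low-order bits of the physical page number that also fall
// inside the cache index -- the bits the OS can steer by choosing a frame.
uint64_t pageColor(uint64_t physAddr) {
  return (physAddr >> kPageOffsetBits) & ((1u << kBankBits) - 1);
}

// When allocating (or migrating) a frame for a virtual page, the OS picks a
// frame whose color equals the desired bank, e.g. the bank nearest the owning
// core for private data or a central bank for shared data.
uint64_t recolorFrame(uint64_t framePageNumber, uint64_t targetBank) {
  return (framePageNumber & ~uint64_t((1u << kBankBits) - 1)) | targetBank;
}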

Page 17:

Page Coloring

• Page coloring is employed to map data to banks based on proximity to the cores that access them

• We assume an offline oracle page-coloring implementation

• Policies depend upon 2 criteria:
  – Knowledge of a page being private or shared
  – Knowledge of a page being data or code

• More capacity pressure on banks carrying shared data

Page 18:

Proposed Page Coloring Schemes

(Figure legend: shared data + code, private page, shared data, private code)

• Share4:D+I - shared data & code mapped to the central 4 banks
• Rp:I+Share4:D - shared data mapped to the central 4 banks; code replicated
• Share16:D+I - shared data & code distributed across all 16 banks
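A minimal sketch of how these three policies could be expressed as a bank-selection function; taking banks 5, 6, 9, 10 as the central banks follows the per-bank hit results later in the talk, and everything else (including names) is an illustrative assumption:

#include <vector>

enum class Scheme { Share4_DI, RpI_Share4_D, Share16_DI };

struct PageInfo {
  bool shared;  // private vs. shared (known to the oracle coloring policy)
  bool code;    // code page vs. data page
  int  owner;   // core whose local bank holds a private page (0..15)
};

static const std::vector<int> kCentralBanks = {5, 6, 9, 10};

static std::vector<int> allBanks() {
  std::vector<int> all;
  for (int b = 0; b < 16; ++b) all.push_back(b);
  return all;
}

// Returns the banks a page of this class may be colored into.
std::vector<int> candidateBanks(Scheme s, const PageInfo& p) {
  if (!p.shared) return { p.owner };        // private pages stay in the owner's local bank
  switch (s) {
    case Scheme::Share4_DI:                 // shared data & code -> central 4 banks
      return kCentralBanks;
    case Scheme::RpI_Share4_D:              // shared code replicated (a copy reachable
      return p.code ? allBanks()            // from every bank); shared data -> central banks
                    : kCentralBanks;
    case Scheme::Share16_DI:                // shared data & code spread over all 16 banks
      return allBanks();
  }
  return {};
}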

Page 19:

Outline

• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page Coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions

Page 20:

Technique III - Interconnection network

[Figure: the proposed TREE network of links and routers connecting the banks; compared to the baseline grid, routers are saved.]

Page 21:

On-chip tree network

• Predictable traffic pattern

• Data moves between the core and either the shared central banks or its private bank directly overhead

• Decreased router overhead

• Saves energy and time
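A first-order sketch of where the savings come from; the per-hop energies, delays, and hop counts below are placeholder assumptions (the real numbers would come from the network and CACTI models), not values from the talk:

#include <cstdio>

constexpr double kRouterEnergy_nJ = 0.50;  // energy per router traversal (placeholder)
constexpr double kLinkEnergy_nJ   = 0.10;  // energy per link traversal (placeholder)
constexpr double kRouterDelay_cyc = 3.0;   // pipelined router latency (placeholder)
constexpr double kLinkDelay_cyc   = 1.0;   // link latency (placeholder)

struct PathCost { double energy_nJ; double delay_cyc; };

// Cost of a message that crosses `routers` routers and `links` links.
PathCost cost(int routers, int links) {
  return { routers * kRouterEnergy_nJ + links * kLinkEnergy_nJ,
           routers * kRouterDelay_cyc + links * kLinkDelay_cyc };
}

int main() {
  // Illustrative core-to-central-bank path: a few routers/links in the grid
  // versus fewer router traversals on the tree.
  PathCost grid = cost(3, 3);
  PathCost tree = cost(1, 2);
  std::printf("grid: %.2f nJ, %.0f cycles\n", grid.energy_nJ, grid.delay_cyc);
  std::printf("tree: %.2f nJ, %.0f cycles\n", tree.energy_nJ, tree.delay_cyc);
}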

Page 22:

Synergy between proposals

[Figure: synergy between page coloring, the tree network, and the hybrid 3D cache]

- No search (S-NUCA)
- Radiating traffic pattern
- No spills into neighboring banks
- Increased bank capacity with low latency

Page 23:

Outline

• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page Coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions

Page 24:

Methodology

• Intel ManySim trace-based simulator
• CACTI cache model for area, power and access latencies
• HotSpot 4.0 for thermal evaluation

• 16 cores, 32nm process, 4GHz clock
• 4KB page granularity
• 1MB SRAM bank and 8MB DRAM bank

• SAP, SPECjbb, TPC-C and TPC-E commercial multi-threaded workload traces

Page 25:

Workload Characterization

Sharing Characterization

[Chart: percentage sharing of code and data pages for SAP, SPECjbb, TPC-C, TPC-E, and the average.]

• The working-set size of code pages is 0.6% of that of data pages

• Code pages account for 57% of accesses on average

Page 26:

Page Coloring Evaluation

[Chart: relative performance of the page-coloring schemes (y-axis 0 to 2).]

Capacity constraint favors distributing shared pages

Code Replication favorable when capacity is available

Page 27:

Interconnect Evaluation

[Chart: breakdown of bank accesses into local, sibling, and distant banks for BASE, Share4:D+I, Share4:D, and Share16:D+I.]

Network power savings up to 48%

Most accesses are local due to code replication

Most accesses are random

Page 28:

Hybrid Cache Evaluation

[Chart: relative performance of the evaluated configurations (y-axis 0 to 1.8).]

[Figure: evaluated configurations - Base-No-PC: cores + one SRAM die (L2); Base-2x-No-PC: cores + two SRAM dies (L2); Base-3-level: cores + SRAM die (L2) + DRAM die (L3); Proposed chip: cores + SRAM die + DRAM die forming a single reconfigurable L2.]

The reconfigurable cache (with code replication) performs 55% better than Base-1

~5% IPC drop, in exchange for power savings

Page 29:

SRAM-DRAM Hits without Reconfiguration

[Chart: fraction of hits served by SRAM vs. DRAM ways in each of the 16 cache banks.]

Most accesses are to SRAM ways except in shared banks (5,6,9,10)

Page 30:

SRAM-DRAM Hits with Reconfiguration


[Chart: fraction of hits served by SRAM vs. DRAM ways in each of the 16 cache banks, with reconfiguration enabled.]

Page 31:

Reconfiguration Policy

[Chart: percentage of time the DRAM bank above each cache bank (0-15) is enabled, for SAP, TPC-C, TPC-E, and SPECjbb.]

Shared Banks have DRAM always enabled

SPECjbb - DRAM always enabled, since the majority of its pages are private

Page 32:

Related Work

• Reconfigurable caches in 2D - Ranganathan et al. (ISCA '00), Balasubramonian et al. (MICRO '00), Zhang et al. (ISCA '03)
• 3D cache hierarchies - Liu et al. (IEEE D&T '05), Loi et al. (DAC '06), Kgil et al. (ASPLOS '06), Loh (ISCA '08)
• Page coloring for NUCA - Cho et al. (MICRO '06), Awasthi et al. (HPCA '09), Chaudhuri (HPCA '09)
• 3D NUCA interconnects - Li et al. (ISCA '06)

• Ours is the first paper to propose a hybrid SRAM/DRAM cache, a targeted tree network, and the combination of all of these into a 3D hierarchy

Page 33:

Key Contributions

• A synergistic cache design

• Communication- and capacity-optimized 3D cache
  – Reconfigurable cache to improve performance while reducing power
  – OS-based page coloring for reduced communication
  – Tailor-made on-chip network for quicker communication

• Significant increase in efficiency
  – Performance improvement of up to 62%
  – Network power savings of up to 48%

• Typical thermal impact: +7 °C

Page 34:

Thank you.

• Questions?