8/3/2019 Anoop Thomas
HIGH PERFORMANCE AND LOW POWER HYBRID CACHE ARCHITECTURES FOR CMPs
Presented by: ANOOP THOMAS, Reg No: 98911037, VLSI & Embedded Systems
Seminar Guide: Sunith C K
MTech VLSI & ES 1
Introduction
A multi-core processor is a single component with two or more independent processor cores.
Chip Multiprocessor (CMP): the cores are integrated onto a single integrated circuit die.
Need For Cache Memory
Cache memories minimize the performance gap between high-speed processors and slow off-chip memory.
Cache subsystems, particularly on-chip, with multiple layers of large caches are common in CMPs.
The processor-memory performance gap. [web]
Current Schemes
Performance can be improved through Non-Uniform Cache Architecture (NUCA).
A large cache is divided into multiple banks with different access latencies determined by their physical locations relative to the source of the request.
Static NUCA: a cache line is statically mapped into banks, with the low-order bits of the index determining the bank.
Dynamic NUCA: any given line can be mapped to one of several banks based on the mapping policy.
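As a rough illustration of the static-NUCA mapping above, the home bank can be computed from the low-order bits of the set index. The bank count and line size below are illustrative assumptions, not values from the slides:

```python
# Sketch of S-NUCA bank selection: each line has exactly one home bank,
# chosen by the low-order bits of the set index.
# NUM_BANKS and LINE_SIZE are illustrative assumptions.
NUM_BANKS = 16   # assumed power of two
LINE_SIZE = 64   # assumed bytes per cache line

def snuca_bank(address: int) -> int:
    """Map a physical address to its fixed S-NUCA bank."""
    set_index = address // LINE_SIZE   # drop the block-offset bits
    return set_index % NUM_BANKS       # low-order index bits pick the bank

# Consecutive cache lines interleave across banks:
assert snuca_bank(0x0000) == 0
assert snuca_bank(0x0040) == 1
```

Because the mapping is fixed, a hot line may live in a distant bank; D-NUCA relaxes exactly this restriction.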
NUCA fails in case of large cache
It only utilizes the varied access latency of cache banks, due to their physical location, to improve performance.
Cache banks are of the same size, process, and circuit technology.
The overall cache size available is fixed for a given memory technology.
Comparison of different memory technologies.[4]
>No single memory technology by itself is efficient.
Memory Hierarchy generally used. [web]
Hybrid Cache Memory Architectures
A cache designed using differing memory technologies performs better than one using a single technology.
Hybrid cache architectures allow levels in the cache hierarchy to be constructed from different memory technologies.
Inter-cache HCA (LHCA): the levels in a cache hierarchy can be made of disparate memory technologies.
Region-based intra-cache HCA (RHCA): a single level of cache can be partitioned into multiple regions, each of a different memory technology.
STT-RAM with SRAM can be used to form ahybrid cache architecture for chip
multiprocessors with low power consumption andhigh performance.
Overview of LHCA and RHCA. [4]
STT-RAM based HCA
STT-RAM is non-volatile.
Read speed is comparable to that of SRAM (depending on the design).
Higher density than SRAM.
Its disadvantages are long write latency and high dynamic power consumption.
Background of STT-RAM
The information carrier inside MRAM is the Magnetic Tunnel Junction (MTJ).
A conceptual view of MTJ. [4]
An illustration of an MRAM cell. [4]
The MTJ is the storage element; an NMOS transistor is used as the access controller. They are connected in series.
Write operation
>A positive voltage difference between bit line and source line for writing 0.
>A negative voltage difference for writing 1.
Read operation
>The NMOS is enabled and a voltage (Vbl - Vsl) is applied between BL and SL; it is usually negative and small.
STT-RAM alone is not used as cache memory
A large number of writes to the last level cache (LLC) occurs for most CMP applications.
Due to the long write latency and very high dynamic power consumption, using STT-RAM alone is not advised.
Hybrid Cache Architecture using STT-RAM
STT-RAM and SRAM can be used together to form an HCA.
STT-RAM has low leakage power and high density.
With smart cache management policies,low power consumption and high
performance can be obtainedsimultaneously.
Basic Architecture
Each core is configured with private L1 instruction and data caches.
The LLC is shared, consisting of multiple cache banks connected through an interconnection network.
Each bank is either an STT-RAM or an SRAMbank.
SRAM banks are shared by all cores.
STT-RAM banks are logically divided into groups, one private to each core.
Shared SRAM banks are organized into DNUCA.
Hybrid Cache Architecture. [2]
24 STT-RAM banks are logically divided into 8 groups.
Each of these groups consists of 3 STT-RAM banks.
Each core is privately allocated one logical STT-RAM group.
SRAM is included to make write operationsmore efficient.
SRAM banks are shared by all on-chip cores.
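The grouping above (24 STT-RAM banks, 8 logical groups of 3, one group per core) can be sketched as follows; the linear bank numbering is an illustrative assumption:

```python
# Sketch of the logical STT-RAM grouping above: 24 banks divided into
# 8 groups of 3; group i is private to core i. The linear bank
# numbering is an illustrative assumption.
NUM_CORES = 8
BANKS_PER_GROUP = 3
STT_BANKS = list(range(NUM_CORES * BANKS_PER_GROUP))  # banks 0..23

def private_group(core: int) -> list:
    """STT-RAM banks in the logical group private to `core`."""
    start = core * BANKS_PER_GROUP
    return STT_BANKS[start:start + BANKS_PER_GROUP]

assert private_group(0) == [0, 1, 2]
assert private_group(7) == [21, 22, 23]
```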
Each hybrid LLC is implemented with 4 sub-banks.
Each STT-RAM sub-bank is configured with a
sub-bank write buffer to speed up long latencywrite operations.
Cache bank structure of hybrid cache. [2]
Micro-Architectural Mechanism
Private STT-RAM groups are used to reduce power-hungry remote block accesses.
For a core running memory-intensive workloads, the private STT-RAM group may not accommodate the large working set.
Neighborhood Group Caching
Neighboring cores share their private STT-
RAM groups with each other based on theHCA.
E.g., Core 1 can share its STT-RAM banks with its one-hop neighbors: cores 0, 2 and 5.
Neighborhood sharing can obtain more balanced capacity and access latency than purely private or purely shared schemes.
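The one-hop example is consistent with an 8-core mesh of 4 columns by 2 rows; assuming that layout (an inference from the example, not stated on the slides), the neighbor set can be sketched as:

```python
# Sketch of one-hop neighborhood group caching on an assumed 4x2 mesh
# of 8 cores (layout inferred from the example: core 1's one-hop
# neighbors are cores 0, 2 and 5).
COLS, ROWS = 4, 2

def one_hop_neighbors(core: int) -> set:
    """Cores whose private STT-RAM groups `core` may also use."""
    r, c = divmod(core, COLS)
    neighbors = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            neighbors.add(nr * COLS + nc)
    return neighbors

assert one_hop_neighbors(1) == {0, 2, 5}   # matches the slide's example
```

Redefining the neighborhood (e.g. to two hops) is how the scheme scales to larger meshes.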
NGC is scalable for future CMPs by carefullydefining the neighborhood.
The energy-aware read and write policies help the HCA optimize power consumption without sacrificing performance.
Flow graph for the whole micro architecturalmechanism is shown in the next slide.
Flow graph of proposed micro-architecture mechanisms. [2]
Energy-Aware Write
When a write miss occurs, the target block is loaded from low level memory and put into an SRAM bank.
Write hits to SRAM are directly served by the corresponding SRAM bank.
Write hits to STT-RAM banks are served by the block swapping mechanism.
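A minimal sketch of the write policy above, with Python dicts standing in for SRAM and STT-RAM banks (the data structures and function name are assumptions for illustration):

```python
# Sketch of the energy-aware write policy above. Python dicts stand in
# for SRAM and STT-RAM banks; the function name is an assumption.
def handle_write(addr, data, sram, sttram):
    """Dispatch one write according to the energy-aware write policy."""
    if addr in sram:          # write hit in SRAM: served directly by
        sram[addr] = data     # the corresponding SRAM bank
    elif addr in sttram:      # write hit in STT-RAM: served in place;
        sttram[addr] = data   # block swapping may later move the line
    else:                     # write miss: the target block is loaded
        sram[addr] = data     # from low level memory into an SRAM bank
```

A write miss therefore always fills SRAM, keeping write-intensive lines out of the slow-to-write STT-RAM banks.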
Energy-Aware Read
When a read miss occurs, the target block is fetched from low level memory and put into the local STT-RAM group.
Read hits to STT-RAM are served by the local group or by the neighboring groups.
For read hits on an SRAM bank, Active Block Migration is used to serve the request.
Block Swapping
Cache lines with intensive write operations are migrated from STT-RAM to SRAM.
Migration causes an original line in SRAM to bereplaced.
If the replaced SRAM line is valid, the two lines in SRAM and STT-RAM are swapped.
Future accesses to this line will hit in STT-RAM, which reduces long-latency accesses to low level memory.
An invalid line is directly written back to memory.
Swapping is activated when a block in STT-RAM is accessed by two consecutive writes or accumulatively accessed by three writes.
Each cache line is extended with a 2-bit swapping counter and a 1-bit cross-access counter to control data swapping between STT-RAM and SRAM.
State transitions of block swapping. [2]
Once a block is loaded into STT-RAM, both counters are set to zero.
A block swap occurs when the cross-access counter is 0 and the swapping counter is 10, or when the cross-access counter is 1 and the swapping counter is 11.
When a read occurs while the swapping counter is 01, the cross-access counter is set to 1 to indicate that this block is accessed by both read and write operations.
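The counter rules above form a small per-line state machine. A minimal sketch, assuming writes increment the swapping counter and encoding accesses as 'r'/'w' characters (both assumptions for illustration):

```python
# Sketch of the per-line swapping counters above: a 2-bit swapping
# counter plus a 1-bit cross-access counter. Encoding accesses as
# 'r'/'w' characters is an illustrative assumption.
class SwapState:
    def __init__(self):
        self.count = 0   # 2-bit swapping counter, zeroed on STT-RAM fill
        self.cross = 0   # 1-bit cross-access counter

    def access(self, kind: str) -> bool:
        """Record one access; True means 'swap this block to SRAM now'."""
        if kind == "w":
            self.count = min(self.count + 1, 3)
        elif kind == "r" and self.count == 1:
            self.cross = 1   # block sees both reads and writes
        # swap on (cross=0, counter=10b) or (cross=1, counter=11b)
        return (self.cross == 0 and self.count == 2) or \
               (self.cross == 1 and self.count == 3)

s = SwapState()
assert [s.access(a) for a in "ww"] == [False, True]     # 2 consecutive writes
s = SwapState()
assert [s.access(a) for a in "wrww"] == [False, False, False, True]  # 3 writes
```

A write-only stream swaps after two writes; a mixed read/write stream needs a third write, matching the two rules on the slide.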
Active Block Migration
Upon a read hit in SRAM, migration of the cache line from SRAM to STT-RAM occurs.
Blocks in SRAM are divided into two types:
>Blocks fetched from low level memory.
>Blocks swapped from STT-RAM banks.
The cross-access counter is used to differentiate these blocks:
>For blocks from low level memory it is set to 0.
>For blocks swapped from STT-RAM it is set to 1.
State transitions of Active Block Migration [4]
A block fetched from low level memory will be migrated into STT-RAM when a read request hits on this block.
LAZY ACTIVE MIGRATION: The blocks swapped from STT-RAM are migrated back into STT-RAM when they are accumulatively read two more times than they are written.
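Assuming "read twice more than the write" means the accumulated reads exceed the writes by at least two (an interpretation, not stated precisely on the slides), the lazy-migration trigger can be sketched as:

```python
# Sketch of the lazy active-migration trigger above, assuming "read
# twice more than the write" means reads exceed writes by at least 2.
def should_migrate_back(reads: int, writes: int) -> bool:
    """True when a block swapped into SRAM should return to STT-RAM."""
    return reads >= writes + 2

assert not should_migrate_back(reads=2, writes=1)
assert should_migrate_back(reads=3, writes=1)
```

Waiting for a read-dominated history avoids ping-ponging a write-heavy line back into the slow-to-write STT-RAM.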
Results and Analysis
Main simulation parameters considered. [4]
POWER ANALYSIS
The main power components in STT-RAM are dynamic power and the leakage power of peripheral circuits.
Using STT-RAM together with the low-power read and write policies, the hybrid scheme consumes less power than conventional SNUCA and DNUCA.
Power comparison normalized by SNUCA. [2]
PERFORMANCE ANALYSIS
The performance of the hybrid scheme is betterthan conventional SNUCA and DNUCA.
Block replication causes large numbers of low-latency local hits in private STT-RAM groups, and hence IPC is improved.
Due to the high density of STT-RAM and the capacity efficiency of the NGC scheme, the hybrid scheme reduces the massive long-latency on-chip remote accesses and off-chip accesses during execution.
Average IPC comparison normalized by SNUCA. [2]
Conclusion
HCA greatly reduces power and increases performance compared to the conventional SRAM-only on-chip cache technology.
By combining various memory technologies, it is possible to build a hybrid cache system with better performance.
With the help of proposed micro-architecturalmechanisms, the hybrid scheme is adaptive
to variations of workloads.
References
[1] Frank Vahid, Tony D. Givargis, Embedded System Design: A Unified Hardware/Software Introduction.
[2] Jianhua Li, C. J. Xue, Yinlong Xu, STT-RAM based energy-efficiency hybrid cache for CMPs. In VLSI and System-on-Chip, 2011 IEEE/IFIP 19th International Conference, pages 31-36, 2011.
[3] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, K. Yamane, et al., A novel nonvolatile memory with spin torque transfer magnetization switching: spin-RAM. In IEEE International Electron Devices Meeting, pages 459-462, 2005.
[4] Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ramakrishnan Rajamony, Yuan Xie, Hybrid cache architecture with disparate memory technologies. In ISCA 2009: Proceedings of the 36th Annual International Symposium on Computer Architecture. Online. Available: isca09.cs.columbia.edu/pres/04.pdf
References contd.
[5] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In The 16th IEEE International Symposium on High-Performance Computer Architecture, pages 239-249, 2009.
[6] Video lecture on Digital Computer Organization, Lec-18: Cache Memory Architecture. Online. Available: http://nptel.iitm.ac.in/video.php?subjectId=117105078
THANK YOU