8/3/2019 Anoop Thomas
HIGH PERFORMANCE AND LOW POWER HYBRID CACHE ARCHITECTURES FOR CMPs
Presented by: ANOOP THOMAS, Reg No: 98911037, VLSI & Embedded Systems
Seminar Guide: Sunith C K
MTech VLSI & ES 1
Introduction
A multi-core processor is a single component with two or more independent processor cores.
Chip Multiprocessor (CMP): the cores are integrated onto a single integrated circuit die.
Need For Cache Memory
Cache memories minimize the performance gap between high-speed processors and slow off-chip memory.
Cache subsystems, particularly on-chip, with multiple layers of large caches are common in CMPs.
The processor-memory performance gap. [web]
Current Schemes
Performance can be improved through Non-Uniform Cache Architecture (NUCA).
A large cache is divided into multiple banks with different access latencies determined by their physical locations relative to the source of the request.
Static NUCA: a cache line is statically mapped into banks, with the low-order bits of the index determining the bank.
Dynamic NUCA: any given line can be mapped to one of several banks based on the mapping policy.
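As a rough illustration of the static-NUCA mapping above, the home bank can be computed from the low-order bits of the set index. The bank count and line size below are illustrative assumptions, not values from the slides:

```python
# Sketch of S-NUCA bank selection: each line has exactly one home bank,
# chosen by the low-order bits of the set index.
# NUM_BANKS and LINE_SIZE are illustrative assumptions.
NUM_BANKS = 16   # assumed power of two
LINE_SIZE = 64   # assumed bytes per cache line

def snuca_bank(address: int) -> int:
    """Map a physical address to its fixed S-NUCA bank."""
    set_index = address // LINE_SIZE   # drop the block-offset bits
    return set_index % NUM_BANKS       # low-order index bits pick the bank

# Consecutive cache lines interleave across banks:
assert snuca_bank(0x0000) == 0
assert snuca_bank(0x0040) == 1
```

Because the mapping is fixed, a hot line may live in a distant bank; D-NUCA relaxes exactly this restriction.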
NUCA fails in case of large cache
It only utilizes the varied access latency of cache banks, due to their physical location, to improve performance.
Cache banks are of the same size, process, and circuit technology.
The overall cache size available is fixed for a given memory technology.
Comparison of different memory technologies.[4]
>No single memory technology by itself is efficient.
Memory Hierarchy generally used. [web]
Hybrid Cache Memory Architectures
A cache designed using differing memory technologies performs better than one using a single technology.
Hybrid cache architectures allow levels in the cache hierarchy to be constructed from different memory technologies.
Inter-cache HCA (LHCA): the levels in a cache hierarchy can be made of disparate memory technologies.
Region-based intra-cache HCA (RHCA): a single level of cache can be partitioned into multiple regions, each of a different memory technology.
STT-RAM with SRAM can be used to form ahybrid cache architecture for chip
multiprocessors with low power consumption andhigh performance.
Overview of LHCA and RHCA. [4]
STT-RAM based HCA
STT-RAM is non-volatile.
Read speed is comparable to that of SRAM (depending on the design).
Higher density than SRAM.
Its disadvantages are long write latency and high dynamic power consumption.
Background of STT-RAM
The information carrier inside MRAM is the Magnetic Tunnel Junction (MTJ).
A conceptual view of MTJ. [4]
An illustration of an MRAM cell. [4]
The MTJ is the storage element; an NMOS transistor is used as the access controller. They are connected in series.
Write operation
>A positive voltage difference between bit line and source line for writing 0.
>A negative voltage difference for writing 1.
Read operation
>The NMOS is enabled and a voltage (Vbl - Vsl) is applied between BL and SL; it is usually negative and small.
STT-RAM alone is not used as cache memory
A large number of writes to the last level cache (LLC) occurs for most CMP applications.
Due to the long write latency and very high dynamic power consumption, using STT-RAM alone is not advised.
Hybrid Cache Architecture using STT-RAM
STT-RAM and SRAM can be used together to form an HCA.
STT-RAM has low leakage power and high density.
With smart cache management policies,low power consumption and high
performance can be obtainedsimultaneously.
Basic Architecture
Each core is configured with private L1 instruction and data caches.
The LLC is shared, consisting of multiple cache banks connected through an interconnection network.
Each bank is either an STT-RAM or an SRAMbank.
SRAM banks are shared by all cores.
STT-RAM banks are logically divided into groups, one private to each core.
Shared SRAM banks are organized into DNUCA.
Hybrid Cache Architecture. [2]
24 STT-RAM banks are logically divided into 8 groups.
Each of these groups consists of 3 STT-RAM banks.
Each core is privately allocated one logical STT-RAM group.
SRAM is included to make write operationsmore efficient.
SRAM banks are shared by all on-chip cores.
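The grouping above (24 STT-RAM banks, 8 logical groups of 3, one group per core) can be sketched as follows; the linear bank numbering is an illustrative assumption:

```python
# Sketch of the logical STT-RAM grouping above: 24 banks divided into
# 8 groups of 3; group i is private to core i. The linear bank
# numbering is an illustrative assumption.
NUM_CORES = 8
BANKS_PER_GROUP = 3
STT_BANKS = list(range(NUM_CORES * BANKS_PER_GROUP))  # banks 0..23

def private_group(core: int) -> list:
    """STT-RAM banks in the logical group private to `core`."""
    start = core * BANKS_PER_GROUP
    return STT_BANKS[start:start + BANKS_PER_GROUP]

assert private_group(0) == [0, 1, 2]
assert private_group(7) == [21, 22, 23]
```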
Each hybrid LLC is implemented with 4 sub-banks.
Each STT-RAM sub-bank is configured with a
sub-bank write buffer to speed up long latencywrite operations.
Cache bank structure of hybrid cache. [2]
Micro-Architectural Mechanism
Private STT-RAM groups are used to reduce power-hungry remote block accesses.
For a core running memory-intensive workloads, the private STT-RAM group may not accommodate the large working set.
Neighborhood Group Caching
Neighboring cores share their private STT-
RAM groups with each other based on theHCA.
E.g., Core 1 can share its STT-RAM banks with its one-hop neighbors: cores 0, 2 and 5.
Neighborhood sharing can obtain more balanced capacity and access latency than purely private or purely shared schemes.
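The one-hop example is consistent with an 8-core mesh of 4 columns by 2 rows; assuming that layout (an inference from the example, not stated on the slides), the neighbor set can be sketched as:

```python
# Sketch of one-hop neighborhood group caching on an assumed 4x2 mesh
# of 8 cores (layout inferred from the example: core 1's one-hop
# neighbors are cores 0, 2 and 5).
COLS, ROWS = 4, 2

def one_hop_neighbors(core: int) -> set:
    """Cores whose private STT-RAM groups `core` may also use."""
    r, c = divmod(core, COLS)
    neighbors = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            neighbors.add(nr * COLS + nc)
    return neighbors

assert one_hop_neighbors(1) == {0, 2, 5}   # matches the slide's example
```

Redefining the neighborhood (e.g. to two hops) is how the scheme scales to larger meshes.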
NGC is scalable for future CMPs by carefullydefining the neighborhood.
The energy-aware read and write policies help the HCA optimize power consumption without sacrificing performance.
Flow graph for the whole micro architecturalmechanism is shown in the next slide.
Flow graph of proposed micro-architecture mechanisms. [2]
Energy-Aware Write
When a write miss occurs, the target block is loaded from low level memory and put into an SRAM bank.
Write hits to SRAM are directly served by the corresponding SRAM bank.
Write hits to STT-RAM banks are served by the block swapping mechanism.
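A minimal sketch of the write policy above, with Python dicts standing in for SRAM and STT-RAM banks (the data structures and function name are assumptions for illustration):

```python
# Sketch of the energy-aware write policy above. Python dicts stand in
# for SRAM and STT-RAM banks; the function name is an assumption.
def handle_write(addr, data, sram, sttram):
    """Dispatch one write according to the energy-aware write policy."""
    if addr in sram:          # write hit in SRAM: served directly by
        sram[addr] = data     # the corresponding SRAM bank
    elif addr in sttram:      # write hit in STT-RAM: served in place;
        sttram[addr] = data   # block swapping may later move the line
    else:                     # write miss: the target block is loaded
        sram[addr] = data     # from low level memory into an SRAM bank
```

A write miss therefore always fills SRAM, keeping write-intensive lines out of the slow-to-write STT-RAM banks.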
Energy-Aware Read
When a read miss occurs, the target block is fetched from low level memory and put into the local STT-RAM group.
Read hits to STT-RAM are served by the local group or by the neighboring groups.
For read hits on an SRAM bank, Active Block Migration is used to serve the request.
Block Swapping
Cache lines with intensive write operations are migrated from STT-RAM to SRAM.
Migration causes an original line in SRAM to bereplaced.
If the replaced SRAM line is valid, the two lines in SRAM and STT-RAM are swapped.
Future accesses to this line will hit in STT-RAM, which reduces long-latency accesses to low level memory.
An invalid line is directly written back to memory.
Swapping is activated when a block in STT-RAM is accessed by two consecutive writes or accumulatively accessed by three writes.
Each cache line is extended with a 2-bit swapping counter and a 1-bit cross-access counter to control data swapping between STT-RAM and SRAM.
State transitions of block swapping. [2]
Once a block is loaded into STT-RAM, both counters are set to zero.
A block swap occurs when the cross-access counter is 0 and the swapping counter is 10, or when the cross-access counter is 1 and the swapping counter is 11.
When a read occurs while the swapping counter is 01, the cross-access counter is set to 1 to indicate that this block is accessed by both read and write operations.
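The counter rules above form a small per-line state machine. A minimal sketch, assuming writes increment the swapping counter and encoding accesses as 'r'/'w' characters (both assumptions for illustration):

```python
# Sketch of the per-line swapping counters above: a 2-bit swapping
# counter plus a 1-bit cross-access counter. Encoding accesses as
# 'r'/'w' characters is an illustrative assumption.
class SwapState:
    def __init__(self):
        self.count = 0   # 2-bit swapping counter, zeroed on STT-RAM fill
        self.cross = 0   # 1-bit cross-access counter

    def access(self, kind: str) -> bool:
        """Record one access; True means 'swap this block to SRAM now'."""
        if kind == "w":
            self.count = min(self.count + 1, 3)
        elif kind == "r" and self.count == 1:
            self.cross = 1   # block sees both reads and writes
        # swap on (cross=0, counter=10b) or (cross=1, counter=11b)
        return (self.cross == 0 and self.count == 2) or \
               (self.cross == 1 and self.count == 3)

s = SwapState()
assert [s.access(a) for a in "ww"] == [False, True]     # 2 consecutive writes
s = SwapState()
assert [s.access(a) for a in "wrww"] == [False, False, False, True]  # 3 writes
```

A write-only stream swaps after two writes; a mixed read/write stream needs a third write, matching the two rules on the slide.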
Active Block Migration
Upon a read hit in SRAM, migration of the cache line from SRAM to STT-RAM occurs.
Blocks in SRAM are divided into two types:
>Blocks fetched from low level memory.
>Blocks swapped from STT-RAM banks.
The cross-access counter is used to differentiate these blocks:
>For blocks from low level memory it is set to 0.
>For blocks swapped from STT-RAM it is set to 1.
State transitions of Active Block Migration [4]
A block fetched from low level memory will be migrated into STT-RAM when a read request hits on this block.
LAZY ACTIVE MIGRATION: The blocks swapped from STT-RAM are migrated back into STT-RAM when they are accumulatively read two more times than they are written.
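Assuming "read twice more than the write" means the accumulated reads exceed the writes by at least two (an interpretation, not stated precisely on the slides), the lazy-migration trigger can be sketched as:

```python
# Sketch of the lazy active-migration trigger above, assuming "read
# twice more than the write" means reads exceed writes by at least 2.
def should_migrate_back(reads: int, writes: int) -> bool:
    """True when a block swapped into SRAM should return to STT-RAM."""
    return reads >= writes + 2

assert not should_migrate_back(reads=2, writes=1)
assert should_migrate_back(reads=3, writes=1)
```

Waiting for a read-dominated history avoids ping-ponging a write-heavy line back into the slow-to-write STT-RAM.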
Results and Analysis
Main simulation parameters considered. [4]
POWER ANALYSIS
The main power components in STT-RAM are dynamic power and the leakage power of peripheral circuits.
Using STT-RAM together with the low-power read and write policies, the hybrid scheme consumes less power than conventional SNUCA and DNUCA.
Power comparison normalized by SNUCA. [2]
PERFORMANCE ANALYSIS
The performance of the hybrid scheme is betterthan conventional SNUCA and DNUCA.
Block replication causes large numbers of low-latency local hits in private STT-RAM groups, and hence IPC is improved.
Due to the high density of STT-RAM and the capacity efficiency of the NGC scheme, the hybrid scheme reduces the massive long-latency on-chip remote accesses and off-chip accesses during execution.
Average IPC comparison normalized by SNUCA. [2]
Conclusion
HCA greatly reduces power and increases performance compared to the conventional SRAM-only on-chip cache technology.
By combining various memory technologies, it is possible to build a hybrid cache system with better performance.
With the help of proposed micro-architecturalmechanisms, the hybrid scheme is adaptive
to variations of workloads.
References
[1] Frank Vahid, Tony D. Givargis, Embedded System Design: A Unified Hardware/Software Introduction.
[2] Jianhua Li, C. J. Xue, Yinlong Xu, STT-RAM based energy-efficiency hybrid cache for CMPs. In VLSI and System-on-Chip, 2011 IEEE/IFIP 19th International Conference, pages 31-36, 2011.
[3] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, K. Yamane, et al., A novel nonvolatile memory with spin torque transfer magnetization switching: spin-RAM. In IEEE International Electron Devices Meeting, pages 459-462, 2005.
[4] Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ramakrishnan Rajamony, Yuan Xie, Hybrid cache architecture with disparate memory technologies. In ISCA 2009: Proceedings of the 36th Annual International Symposium on Computer Architecture. Online. Available: isca09.cs.columbia.edu/pres/04.pdf
References contd.
[5] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In The 16th IEEE International Symposium on High-Performance Computer Architecture, pages 239-249, 2009.
[6] Video lecture on Digital Computer Organization, Lec-18: Cache Memory Architecture. Online. Available: http://nptel.iitm.ac.in/video.php?subjectId=117105078
THANK YOU