7/31/2019 14 Memory Interconnect
1/25
Memory-Interconnect Design
Bharadwaj Amrutur
AMD x86-64
- 32nm process with high-k metal gate
- 35 million xtors, 3GHz, 2W to 25W
- 8T memory cell (as opposed to a 6T cell)
- Read followed by write in the same cycle for the L1 D$
- Shallow bitlines: 8 cells/line
[Jotwani et.al., ISSCC 2010]
Sun 16-Core SPARC
- TSMC 40 nm, 11 Cu levels, 1B xtors
- 8 threads x 16 cores; 4-way glueless connection of 4 chips
- Unified 6MB L2; each L2 bank is 384KB
- Crossbar: 461GB/s
[Shin et.al., ISSCC 2010]
IBM Eight-core POWER7
- 45nm CMOS SOI process, 1.2B xtors
- 11 Cu layers with low-k dielectric
- 32MB embedded DRAM for the L3$
- L2: 8-way, 256KB per core
- L1: 32KB each I$ and D$, 2-cycle access time
- GPR: 4R/4W, 112-entry, 64+8b register file
- L3: 8MB eDRAM per L2 in each core, 8-way; small SRAM directory (probably to select the way); 25-cycle load-to-use latency; 16B/cycle bandwidth to/from L2
[Wendel et.al., ISSCC 2010]
IBM wirespeed 16-core processor
- 16 cores, 4 threads per core
- 45nm SOI
- H/W accelerators
- Shared bus also used for power management
- eDRAM L2 cache: 4 x 292Kb x 1200 blocks x 16; 3x SRAM density, 1/5th SRAM power
- Dynamic voltage scaling: 0.7V and higher
- Can hook up 4 of these chips to scale up to 64 cores
- 65W at 2.0GHz
[Johnson et.al., ISSCC 2010]
Intel 48-core in 45nm
- 45nm CMOS, 1.3B xtors
- 48 IA-32 cores, 2 per tile; 6x4 mesh with a router in each tile
- L1: 16KB each for I$ and D$
- L2: 256KB, unified, 4-way associative, write-back; 10-cycle hit, SECDED
- 64-entry TLB + 256-entry LUT extension
- 16KB message-passing buffer to support MPI and OpenMP
- 5-port virtual cut-through router, 16B flits
[Howard et.al., ISSCC 2010]
View from the processor

[Figure: processor-memory interface, with signals Clk, MemOp, Address, WriteData and ReadData between the Processor and the Memory.]

Memory operations (MemOp):
- DLX: Load, Store
- Other RISC processors: Prefetch, Load/Store coprocessor, Cache flush, Synchronization

The address is 32 bits or 64 bits (modern processors). The data bus width is 64 bits (accesses can be in bytes, 32 bits, or 64 bits).
The Gap
[Figure: processor vs. DRAM performance, 1980-2000, on a log scale from 1 to 1000. CPU performance grows ~60%/yr ("Moore's Law"); DRAM performance grows ~7%/yr ("Less' Law?"). The processor-memory performance gap grows ~50%/yr.]
From Kubiatowicz/UCB
Closing the gap
Use fast, high-speed RAMs close to the processor: caches.

Caches take up ~90% of the transistors in the processor chip!

[Figure: memory hierarchy pyramid with Processor Registers at the top, then L1 $, L2 $, Main Memory (DRAM), and Disk at the bottom; faster toward the top, bigger toward the bottom.]
Memory Hierarchy Characteristics
Level                 Size               Latency / Bandwidth            Integration
Proc Regs             16-128 x 64-bit    ~1 cycle,         ~1000 Gb/s   Chip
L1 $                  4KB-32KB           1 cycle,          ~400 Gb/s    Chip
L2 $                  1MB-8MB            5-10 cycles,      ~200 Gb/s    Chip/Package
Main Memory (DRAM)    1GB-64GB           40-100 cycles,    ~50 Gb/s     PCB
Disk                  80GB-few TB        1000s of cycles,  ~1 Gb/s      Box
Memory Hierarchy
Exercise: Find Power/Mbps/bit for each layer of the memory hierarchy. Plot Power/Mbps versus Bits as well as Bits^0.5. Which is better?
Register File
[Figure: register file with separate read and write decoders. The read address drives read word lines RWL0-RWL31 and the write address drives write word lines WWL0-WWL31; read bit lines R0-R63 and write bit lines W0-W63 run through the cell array.]
Register File
- Can add more ports: each extra port costs one access switch and bitline per cell, plus one decoder
- Wire dominated: a register file cell can be 10x bigger than an SRAM cell (as used in the L1/L2 caches), hence register files are kept small
- Register files are explicitly visible to the processor, unlike caches
- Access latency can be a fraction of a clock cycle, to allow reading and execution (or execution and writeback) in the same cycle
- Easy to scale up the word width (64/128/256/512), at a power cost
Cache concept

Small, fast storage to exploit spatial and temporal locality.

- Found in other places too: file caches, name caches, etc.
- H/W managed: programming is easy
- Consider the memory as a sequence of lines (also known as blocks); a line can contain multiple bytes
- The cache stores a subset of the lines from main memory
- The cache is searched first to satisfy a memory access request: a hit returns fast, a miss incurs a penalty

[Figure: main memory lines 0-15 and a 4-line cache; main memory lines are temporarily stored in the cache.]
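The idea above can be sketched in a few lines of Python: memory is a sequence of lines, and the cache holds a subset of them, searched first on every access. The 16-line memory and 4-line cache mirror the figure; the eviction policy here (oldest inserted) is an arbitrary illustrative choice, not something the slides specify.

```python
class TinyCache:
    def __init__(self, capacity):
        self.lines = {}          # line address -> line data: a subset of memory
        self.capacity = capacity

    def access(self, line_addr, memory):
        if line_addr in self.lines:                 # cache is searched first
            return self.lines[line_addr], "hit"     # a hit returns fast
        if len(self.lines) >= self.capacity:        # a miss incurs a penalty:
            self.lines.pop(next(iter(self.lines)))  # evict some line...
        self.lines[line_addr] = memory[line_addr]   # ...and fetch from memory
        return self.lines[line_addr], "miss"

memory = [f"line{i}" for i in range(16)]   # main memory lines 0..15
cache = TinyCache(capacity=4)              # cache lines 0..3
```

A first access to a line misses and fills the cache; a repeated access hits.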
Average Memory Access Time
Program execution time is given as:

CPUtime = IC x (ALUops/Instr x CPI_ALUops + MemAccess/Instr x AMAT) x Cycletime

Average Memory Access Time (AMAT) is given as:

AMAT = HitTime + MissRate x MissPenalty

- HitTime and MissPenalty are in numbers of clock cycles
- IC is the Instruction Count of the program
- To reduce AMAT, reduce HitTime, MissRate and MissPenalty
- HitTime is usually the lowest possible: 1 cycle
- MissPenalty is a function of the upper levels of the memory hierarchy
- MissRate is a function of cache size and associativity, which also impact Cycletime: hence an optimization problem
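The two formulas above, written as executable definitions. The numbers in the example are illustrative only, not taken from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # AMAT = HitTime + MissRate x MissPenalty (times in clock cycles)
    return hit_time + miss_rate * miss_penalty

def cpu_time(ic, alu_per_instr, cpi_alu, mem_per_instr,
             hit_time, miss_rate, miss_penalty, cycle_time):
    # CPUtime = IC x (ALUops/Instr x CPI_ALUops
    #                 + MemAccess/Instr x AMAT) x Cycletime
    return ic * (alu_per_instr * cpi_alu
                 + mem_per_instr * amat(hit_time, miss_rate, miss_penalty)) * cycle_time

# Example: a 1-cycle hit, 5% miss rate and 40-cycle penalty give
# AMAT = 1 + 0.05 * 40 = 3 cycles.
```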
Exercise
Write the corresponding equation for the energy consumed by a program.
Cache issues
- Where should a line be placed in the cache?
- How is a line searched for in the cache?
- Which line should be replaced on a cache miss?
- What to do on a write?
Block Placement: Direct Map
Direct mapped: main memory lines 0-15 map into a 4-line cache.

[Figure: cache line 0 <- memory lines 0, 4, 8, 12; cache line 1 <- 1, 5, 9, 13; cache line 2 <- 2, 6, 10, 14; cache line 3 <- 3, 7, 11, 15.]
Direct Mapped: Placement
Direct mapped: main memory lines 0-15 map into a 4-line cache.

[Figure: cache line 0 <- memory lines 0, 4, 8, 12; cache line 1 <- 1, 5, 9, 13; cache line 2 <- 2, 6, 10, 14; cache line 3 <- 3, 7, 11, 15.]

The formula is:

index = lineAddress mod cacheSize    (cacheSize is in lines)
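The placement formula, checked against the figure's mapping (4 cache lines, 16 memory lines):

```python
def direct_map_index(line_address, cache_size=4):
    # index = lineAddress mod cacheSize (cacheSize in lines)
    return line_address % cache_size

# Memory lines 0, 4, 8, 12 all compete for cache line 0;
# lines 3, 7, 11, 15 all compete for cache line 3.
```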
Direct Mapped: Search
[Figure: the direct-mapped cache, now with Cache Tag and Data arrays. The 32-bit address (bits 31..0) is split into three fields: Tag | CacheIndex | ByteSel. The CacheIndex selects a cache line, and the stored Cache Tag is compared against the address Tag.]
Direct Mapped: Search
The 32-bit address (bits 31..0) is split into Tag | CacheIndex | ByteSel.

[Figure: a decoder driven by CacheIndex selects one Tag/Data entry; a comparator (=) between the stored tag and the address Tag generates Hit/Miss.]

What is missing?
Direct Mapped: Search
The 32-bit address (bits 31..0) is split into Tag | CacheIndex | ByteSel.

[Figure: the same search datapath, now with a Valid bit per entry; Valid gates the tag comparison to generate Hit/Miss.]
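The lookup can be sketched in software. The field widths below are hypothetical (32-byte lines giving 5 ByteSel bits, 4 cache lines giving 2 index bits), chosen only to make the address split concrete; the key point is that a hit requires both a tag match and a set valid bit.

```python
BYTE_SEL_BITS = 5    # hypothetical: 32-byte lines
INDEX_BITS = 2       # hypothetical: 4 cache lines

def split_address(addr):
    # 32-bit address = Tag | CacheIndex | ByteSel (bits 31..0)
    byte_sel = addr & ((1 << BYTE_SEL_BITS) - 1)
    index = (addr >> BYTE_SEL_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_SEL_BITS + INDEX_BITS)
    return tag, index, byte_sel

def lookup(addr, tags, valid):
    tag, index, _ = split_address(addr)
    # Hit = Valid AND (stored tag == address tag); the Valid bit is
    # exactly what the previous slide's datapath was missing.
    return valid[index] and tags[index] == tag
```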
Block Placement: 2-Way Associative

Main memory lines 0-15 map into a 4-line cache organized as 2 sets of 2 ways each.

[Figure: Set 0 can hold memory lines 0, 4, 8, 12 and 2, 6, 10, 14; Set 1 can hold memory lines 1, 5, 9, 13 and 3, 7, 11, 15. Within each set, a line can go in either of the two locations.]
Block Placement: 2-Way Associative

[Figure: Set 0 holds the even memory lines (0, 4, 8, 12 and 2, 6, 10, 14); Set 1 holds the odd memory lines (1, 5, 9, 13 and 3, 7, 11, 15). Within each set, a line can go in either of the two locations.]

The formula is:

setIndex = lineAddress mod (cacheSize / Associativity)

Note: other formulae for mapping into the cache/set index are possible.
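The set-index formula, checked against the figure (4 cache lines, 2-way associative, hence 2 sets):

```python
def set_index(line_address, cache_size=4, associativity=2):
    # setIndex = lineAddress mod (cacheSize / Associativity)
    return line_address % (cache_size // associativity)

# Even memory lines map to set 0, odd memory lines to set 1.
```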
2-Way Associative: Search

[Figure: address split into Tag | CacheIndex | ByteSel (bits 31..0). Two ways, each with Valid, Tag and Data arrays and a decoder; each way has its own comparator (=) producing Hit/Miss_Set0 and Hit/Miss_Set1, and a tristate driver so that only the hitting way drives the data bus.]

Exercises:
a) Complete the wiring
b) How do you generate the final Hit/Miss signal?
c) Extend the design to a fully associative cache
d) What happens to MissRate with associativity?
e) What happens to MissRate with size?
f) What happens to cycle time with associativity and size?
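A behavioral sketch of the 2-way search, which also suggests one possible answer to exercise (b): the final Hit/Miss is the OR of the per-way hit signals. Each way holds (valid, tag, data) entries; the tristate drivers are modeled by letting only the hitting way supply the data. The entry contents are made-up illustrative values.

```python
def two_way_lookup(tag, index, ways):
    """ways: two lists of (valid, tag, data) entries, one per way."""
    hits = [w[index][0] and w[index][1] == tag for w in ways]
    hit = any(hits)      # final Hit/Miss = OR of Hit/Miss_Set0, Hit/Miss_Set1
    data = None
    for h, way in zip(hits, ways):
        if h:            # tristate driver: only the hitting way drives the bus
            data = way[index][2]
    return hit, data

# Two sets, two ways; entry format: (valid, tag, data)
way0 = [(True, 0x11, "A"), (False, 0, None)]
way1 = [(True, 0x22, "B"), (False, 0, None)]
```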