Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Respin: Rethinking Near-Threshold Multiprocessor Design
with Non-Volatile Memory
XiangPan,AnysBacha,andRaduTeodorescuComputerArchitectureResearchLab
h"p://arch.cse.ohio-state.edu
Universal Demand for Low Power
2
• Mobility• Ba"erylife
• Performance• Powerconstraints
• Energycost• Environment
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Near-Threshold Operation
3
Voltage(V)
NTVdd
NominalVdd
Vth
Frequency
10.80.40.30.20
~1/10~1/100
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
• PerformancedegradaNon
Challenges in Near-Threshold
4
0
0.002
0.004
0.006
0.008
0.01
0.012
0 0.5 1 1.5 2
Prob
abili
ty d
istri
butio
n
Relative frequency
900mV, Vth σ/µ= 9%900mV, Vth σ/µ=12%400mV, Vth σ/µ= 9%400mV, Vth σ/µ=12%
00.10.20.30.40.50.60.70.80.91
NominalVdd Near-ThresholdVdd
Frac=on
ofTotalChipPo
wer
DynamicandLeakagePowerBreakdownfora64-CoreCMP
dynamic-core
dynamic-cache
leakage-core
leakage-cache
• AmplifiedprocessvariaNon
• Leakagepowerdominates
• TheiniNalideaofRespin–BuildcachesinNT-CMP
with“leakage-free”non-volaNlememoriesto
reducepowerconsumpNon
~39%
~73%
~35%
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
TotalLeakage
TotalLeakage
CacheLeakage
Outline
• BasicsofNVM• RespinArchitecture• MethodologyandEvaluaNon• Conclusion
5Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Non-Volatile Memory Basics • Non-VolaNlity–ResistanceasdatarepresentaNon(e.g.PCRAM,STT-RAM,ReRAM,etc.)
• Near-ZeroLeakagePower–Goodfitforfuturepower-constrainedcompuNng
• HighDensity–Greatdesigncandidateinthebigdataera
• GoodPerformance–Feasibleforon-chipstoragereplacement
6Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
STT-RAM • UniquefeaturesofSTT-RAM:fastreadspeed,lowreadenergy,unlimitedwriteendurance,andgoodcompaNbilitywithCMOStechnology
• Shortcomings:longwritelatencyandhighwriteenergy
7
AnN-parallelstate:high-resistance,represenNnga“1”
Parallelstate:low-resistance,represenNnga“0”
MTJ(MagneNcTunnelJuncNon)
STT-RAMCell(1T-1MTJ)
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
STT-RAM is Good Fit for NT-CMP
8
NT-CMP STT-RAM
Highleakagepower Near-zeroleakagepower
LowoperaNngfrequency
Longlatencywrites
Largenumbersofcoresrequiringhigh
cachecapacity
Highdensity(~4xdenserthanSRAM)
UnreliablefuncNonalunits
Robustfromsoh-errors
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Outline
• BasicsofNVM• RespinArchitecture• MethodologyandEvaluaNon• Conclusion
9Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Respin Architecture
10
Cluster 0
Core 0 Core 1 Core 2 Core 7Core 6Core 5Core 4
Core 8 Core 9 Core 10 Core11 Core 15Core 14Core 13Core 12
Core 3
Cluster 2
Core 0 Core 1 Core 2 Core 7Core 6Core 5Core 4
Core 8 Core 9 Core 10 Core11 Core 15Core 14Core 13Core 12
Core 3
Cluster 3
Core 0 Core 1 Core 2 Core 7Core 6Core 5Core 4
Core 8 Core 9 Core 10 Core11 Core 15Core 14Core 13Core 12
Core 3
Cluster 1
Core 0 Core 1 Core 2 Core 7Core 6Core 5Core 4
Core 8 Core 9 Core 10 Core11 Core 15Core 14Core 13Core 12
Core 3
L1D L1IL2 Cache L2 Cache
Core 0 Core 1 Core 2 Core 7Core 6Core 5Core 4
Core 8 Core 9 Core 10 Core11 Core 15Core 14Core 13Core 12
Core 3
NT Vdd High Vdd
(a)
L3 Cache
(b)
NT Vdd High Vdd
• CoresoperateatNT-Vddrailwithlowfrequencies• CachesarebuiltwithSTT-RAMandoperateathigh-Vddrailmakingreadspeed
extremelyfast
• Clustered-CMPwithfastSTT-RAMreadenableswithin-clustersharedL1cache
design,removingcoherencecostsRespin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory
Xiang Pan, Anys Bacha, and Radu Teodorescu
Shared Cache Controller
11
Cache CLK (0.4ns)
Core0 (1.6ns) Core1 (2.4ns) Core2 (1.6ns) Core3 (1.6ns) Core4 (2.0ns)
Core0 (1.6ns) Core1 (2.4ns) Core2 (1.6ns) Core3 (1.6ns) Core4 (2.0ns)
cycle 0 cycle 1 cycle 2 cycle 3 cycle 4service C0
(a)
(b)
service C2 service C4service C3half-miss C3
cannot service C3 in time
service C1
Cache CLK (0.4ns)
cycle 0 cycle 1 cycle 2 cycle 3 cycle 4
half-miss C3
cycle 5 cycle 6
Level-shifting overhead
Level-shifting overhead
Request serviced
cycle 5 cycle 6
00011 01111
00011 00111
00001 00001
00001 00001 half-miss
00111 00011 00001
00011
00011
service C0 service C2 service C4service C3 service C1
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Dynamic Core Consolidation
12
• HighprocessvariaNonandleakageinNT-CMPleadtofastcoresmore
energy-efficientthanslowones
• Dynamicallyconsolidatethreads
ontomoreefficientcoreswith
greedysearchatrunNmecanfurthersaveenergy
• Mechanismimplementedin
systemfirmware,energy-per-
instrucNonusedasgreedyselecNonmetric,andinstrucNon
countusedasevaluaNoninterval
System Firmware (ACPI)
Core Remapper
Filter Power Control
VC 0VC 2VC 3VC 4VC 13VC 14VC 15 . . .
Virtual Cores
VC 1
Retired Inst.Energy
01…
1514…
WeightPhy. ID
01…
32…
Phy. IDVirt. IDEnergy Table ID Map
OFFOFFVC 3VC 15VC 13VC 14 . . .
Physical Cores
C 0C 1C 2C 3C 4C 13C 14C 15 . . .
VC 1VC 2
VC 0VC 4
OFFOFF…
Status
Virtual Core Monitor
OS
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Greedy Selection
13
• AgreedyapproachisusedtosearchforopNmalsystemconfiguraNonsat
applicaNonrunNme
• Energy-Per-InstrucNon(EPI)asevaluaNonmetric
• InstrucNoncountasevaluaNoninterval
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Outline
• BasicsofNVM• RespinArchitecture• MethodologyandEvalua=on• Conclusion
14Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Methodology
15
Level Size(mm²) BlockSize Associa=vity Read/WritePorts
L1I(Private/SharedwithinCluster) 16KB(Private)/256KB(Sharedwithin
Cluster)32B
2-way
1/1
L1D(Private/SharedwithinCluster)
4-way
L2(SharedwithinCluster)
8MB(Small)/16MB(Medium)/32MB(Large)
64B 8-way
L3(SharedwithinChip)
24MB(Small)/48MB(Medium)/96MB(Large)
128B 16-way
Table1.SummaryofCacheParameters.
VddRail Area(mm²)
Read/WriteLatency(ns)
Read/WriteEnergy(pJ)
LeakagePower(mW)
SRAM(16KB×16) Low(0.65V)0.9176
1.337 2.578 573
SRAM(256KB)High(1.0V)
0.5336 42.41 881
STT-RAM(256KB) 0.2451 0.3774/5.208 29.32/209.3 114
Table2.ComparisonofSRAMvs.STT-RAMTechnologyParameters.Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory
Xiang Pan, Anys Bacha, and Radu Teodorescu
Methodology
16
CMPArchitecture
Cores 64out-of-order
Fetch/Issue/CommitWidth 2/2/2
RegisterFileSize 76int,56fp
InstrucNonWindowSize 56int,24fp
ReorderBufferSize 80entries
Load/StoreQueueSize 38entries
NoCInterconnect 2DTorus
CoherenceProtocol MESI
ConsistencyModel ReleaseConsistency
Technology 22nm
NT-Vdd 0.4V(Core),0.65V(Cache)
Nominal-Vdd 1.0V
CoreFrequencyRange 375MHz–725MHz
MedianCoreFrequency 500MHz
VariaNonParameters
Vthstd.dev./mean(σ/μ) 12%(Chip),10%(Cluster)
Table3.SummaryofExperimentalParameters.
• SimulaNonFramework:
• SESCforarchitecturalsimulaNon
• CACTI,McPAT,andNVSimfor
latency,power,energy,andareasimulaNon
• Benchmarks:
• SPLASH2andPARSEC
• MainEvaluatedConfiguraNon:
• 64-coreCMPwithfour16-coreclusters
• MediumsizeL2andL3caches
• 0.4nssharedL1cachereadlatency
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Power and Performance
17
SmallCache MediumCacheLargeCache
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
1.3
Normalized
TotalChipPo
wer
PowerReduc=onofProposedDesignwithThreeL2/L3CacheSizes
leakage dynamic
~3%~14%
~23%
• Respinachieved14%powerreducNonand11%performanceimprovementwithmediumsizedcache
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
1.0
0.22
0.90 0.89
0
0.2
0.4
0.6
0.8
1
GEOMEAN
Normalized
Execu=on
Tim
e
Rela=veExecu=onTimeforMediumCacheSize
PR-SRAM-NT HP-SRAM-CMP SH-SRAM-Nom SH-STT
Energy Consumption
18
• Formediumsizedcache,Respinachieved22%energysavingswiththebasicshared
STT-RAMcachedesignplusaddiNonal10%withcoreconsolidaNonenabled
0.87
0.78
0.69
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
SmallCache MediumCache LargeCache
Normalized
Ene
rgyCo
nsum
p=on
EnergyConsump=onforThreeL2/L3CacheSizes
PR-SRAM-NT SH-SRAM-Nom SH-STT
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
1.0
1.39
1.12
0.78
0.680.65
0.76
0.99
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
1.3
1.4
GEOMEAN
Normalized
Ene
rgyCo
nsum
p=on
EnergyConsump=onforDifferentCoreConsolida=onConfigura=ons
PR-SRAM-NT HP-SRAM-CMP SH-SRAM-Nom SH-STTSH-STT-CC SH-STT-CC-Oracle PR-STT-CC SH-STT-CC-OS
Core Consolidation Analysis
19
• Inmostcasesourgreedyalgorithmmatcheswellwiththeoraclewhileinveryfewcasessub-
opNmalselecNonbecomesthebarriertoslowdownthepaceofourgreedymechanism
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
lu(SPLASH2):greedyachieved29%energysavingswhileoracle
achieved38%
0
2
4
6
8
10
12
14
16
0 50 100 150 200 250 300 350 400 450 500
Num
ber
of
OF
F C
ore
s
Number of Million Instructions
SH-STT-CC SH-STT-CC-Oracle
radix(SPLASH2):greedyachieved48%energysavingswhileoracleachieved50%
0
2
4
6
8
10
12
14
16
0 20 40 60 80 100 120 140 160
Num
ber
of
OF
F C
ore
s
Number of Million Instructions
SH-STT-CC SH-STT-CC-Oracle
Sub-op=mal
Shared Cache Access Load
20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1 2 3 >=4
Percen
tageofTotalCache
Cycles
NumofRequestsArrivingatDL1Cache
SharedDL1CacheU=liza=onRateinOneNT-CMPCluster
blackscholes
cholesky
radix
raytrace
streamcluster
AVERAGE0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 >=3
Percen
tageofD
L1ReadHitReq
uests
NumofCoreCyclesServiced
Frac=onofReadHitRequestsServicedbytheSharedDL1in1,2,ormorecycles
blackscholes
cholesky
radix
raytrace
streamcluster
AVERAGE
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
• Morethan95%oftheincomingrequestscanbeservicedinone
processorcorecycle
Sensitivity Study on Cluster Size
21
ClusterSize(#cores) SharedL1(I/D)Size(KB) PerformanceGain(%)
4 64 4.82
8 128 6.29
16 256 10.81
32 512 2.50
Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
• 16-coreclusterprovidesthebesttrade-offbetweendatasharingamong
coresandsharedcacheaccessload,giving~11%performancegain
Number of Active Cores in Dynamic Core Consolidation
22Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16
barnescholesky
fft lu oceanradiosity
radixraytrace
water-nsquared
blackscholes
bodytrackstreamcluster
swaptions
AVG
Aver
age
Num
ber
of
Act
ive
Core
s
SH-STT-CC SH-STT-CC-Oracle
• StrongphasedbehaviorcanbeobservedacrossallapplicaNons,making
ourdynamiccoreconsolidaNonmechanismveryuseful
Outline
• BasicsofNVM• RespinArchitecture• MethodologyandEvaluaNon• Conclusion
23Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Conclusion • Thefirstworktoexploretheuseofnon-volaNlecachesinnear-thresholdchipmulN-processors
• AnovelarchitecturedesignedtoenhanceNT-CMPperformanceandreduce
energyconsumpNonbysharingL1cachesandimplemenNngdynamiccore
consolidaNonmechanism
• AchievedenergyreducNonby33%andimprovedperformanceby11%
24Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu
Questions?
Thankyou!
25Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory Xiang Pan, Anys Bacha, and Radu Teodorescu