From Crash-and-Recover to Sense-and-Adapt: Our Evolving Models of Computing Machines
Rajesh K. Gupta, UC San Diego
To a software designer, all chips look alike
To a hardware engineer, a chip is delivered as per contract in a data-sheet.
COMPUTERS ARE BUILT ON STUFF THAT IS IMPERFECT AND…
Reality is
Changing From Chiseled Objects to Molecular Assemblies
Courtesy: P. Gupta, UCLA
45nm Implementation of Leon3 Processor Core
Engineers Know How to “Sandbag”
• PVTA margins add to guardbands
– Static process variation: effective transistor channel length and threshold voltage
– Dynamic variations: temperature fluctuations, supply voltage droops, and device aging (NBTI, HCI)
[Figure: clock period divided into actual circuit delay and guardband; guardband components: temperature, aging, VCC droop, across-wafer frequency]
Uncertainty Means Unpredictability
• VLSI Designer: Eliminate It
– Capture physics into models
– Statistical or plain-old Monte Carlo
– Manufacturing, temperature effects
• Architect: Average it out
– Workload (dynamic) variations
• Software, OS: Deny It
– Simplify, re-organize OS/tasks, breaking these into parts that are precise (W.C.) and imprecise (Ave.)
Each doing their own thing, massive overdesign…
Simulate ‘degraded’ netlist with model input changes (ΔVth)
Deterministic simulations capture known physical processes (e.g., aging)
Multiple (Monte-Carlo) simulations wrapped around a nominal model
Let us step back a bit: HW-SW Stack
Hardware Abstraction Layer (HAL)
Operating System
Application, Application
[Figure: hardware behavior plotted over time or part; the stack sits on overdesigned hardware]
20x in sleep power, 50% in performance
40% larger chip, 35% more active power, 60% more sleep power
What if?
Hardware Abstraction Layer (HAL)
Operating System
Application, Application
[Figure: hardware behavior plotted over time or part; the stack sits on underdesigned hardware]
New Hardware-Software Interface..
Hardware Abstraction Layer (HAL)
Operating System
Application, Application
[Figure: hardware behavior plotted over time or part, with minimal variability handling in hardware]
Underdesigned Hardware
Opportunistic Software
Traditional Fault-tolerance
UNO Computing Machines Seek Opportunities based on Sensing Results
Variability manifestations:
- faulty cache bits
- delay variation
- power variation
Variability signatures:
- cache bit map
- CPU speed-power map
- memory access time
- ALU error rates
Do Nothing (Elastic User, Robust App)
Change Algorithm Parameters (Codec Setting, Duty Cycle Ratio)
Change Algorithm Implementation (Alternate code path, Dynamic recompilation)
Change Hardware Operating Point (Disabling parts of the cache, Changing V-f)
Sensors
Models
Metadata Mechanisms: Reflection, Introspection
UnO Computing Machines: Taxonomy of Underdesign
[Figure: taxonomy of underdesign. Labels: nominal design; performance constraints; manufacturing; manufactured die; hardware characterization tests; burn-in; signature; manufactured die with stored signatures; die-specific adaptation; hardware; software. Puneet Gupta/UCLA]
Several Fundamental Questions
• How do we distinguish between code that needs to be accurate and code that does not?
– How fine-grained are these distinctions (or do they have to be)?
• How do we communicate this information across the stack in a manner that is robust and portable?
– And error-controllable (= safe).
• What is the model of error that should be used in designing UNO machines?
Building Machines that Move from Crash & Recover to Sense & Adapt
Expedition Grand Challenge & Questions
“Can microelectronic variability be controlled and utilized in building better computer systems?”
Three Goals:
a. Address fundamental technical challenges (understand the problem)
b. Create experimental systems (proof-of-concept prototypes)
c. Pursue educational and broader-impact opportunities (ensure training for future talent)
What are the most effective ways to detect variability?
What are software-visible manifestations?
What are software mechanisms to exploit variability?
How can designers and tools leverage adaptation?
How do we verify and test hw-sw interfaces?
Thrusts traverse institutions on testbed vehicles seeding various projects
Group A: Signature Detection and Generation
Characterizing variability in power consumption for modern computing platforms, and implications
Runtime support and software adaptation for variable hardware
Probabilistic analysis of faulty hardware
Understanding and exploiting variability in flash memory devices
FPGA-based variability simulator
Group B: Variability Mitigation Measures
Mitigating variability in solid-state storage devices
Hardware solutions to better understand and exploit variability
VarEmu emulation-based testbed for variability-aware software
Variability-aware opportunistic system software stack
Application robustification for stochastic processors
Group C: Opportunistic Software and Abstractions
Effective error resilience
Negative bias temperature instability and electromigration
Memory-variability aware runtime systems
Design-dependent ring oscillator and software testbed
Executing programs under relaxed semantics
Instruction-level Vulnerability (ILV)
Sequence-level Vulnerability (SLV)
Procedure-level Vulnerability (PLV)
Task-level Vulnerability (TLV)
Monitor manifestations from the instruction level up to the task level.
Observe and Control Variability Across Stack
By the time we get to TLV, we are in a parallel software context: we can instruct the OpenMP scheduler, or even create an abstraction for programmers to express irregular and unstructured parallelism (code refactoring).
The steps to build variability abstractions up to the SW layer
[ILV,SLV,PLV,TLV] Rahimi et al, DATE’12, ISLPED’12, TC’13, DATE’13
Closer to HW: Uncertainty Manifestations
• The most immediate manifestations of variability are in path delay and power variations.
– Path delay variations have been addressed extensively in delay fault detection by the test community.
• With variability, it is possible to do better by focusing on the actual mechanisms.
– For instance, a major source of timing variation is voltage droops, and errors matter when these end up in a state change.
Combine these two observations and you get a rich literature in recent years for handling variability-induced errors: Razor, EDA, TRC, …
Detecting and Correcting Timing Errors
• Detect error, tune supply voltage to reach an error rate, borrow time, stretch clock
– Exploit detection circuits (e.g., voltage droops), double sampling with shadow latches, exploit data dependence on circuit delays
– Enable reduction in voltage margin
– Manage timing guardbands and voltage margins
– Tunable replicas allow non-intrusive operation.
[Figure: voltage droop]
Sensing: Razor, RazorII, EDS, Bubble Razor
Double Sampling (Razor I) [Ernst’03]
Transition Detector with Time Borrowing [Bowman’09]
Razor II [Das’09]
Double Sampling with Time Borrowing [Bowman’09]
EDS [Bowman’11]
Task Ingredients: Model, Sense, Predict, Adapt
I. Sense & Adapt
Observation using in situ monitors (Razor, EDS) with cycle-by-cycle corrections (leveraging CMOS knobs or replay)
II. Predict & Prevent
Relying on external or replica monitors; model-based rules derive an adaptive guardband to prevent errors
[Figure: paths with delays of 1ns, 3ns, 4ns, and 5ns; panel I: sensors sense (detect) and adapt (correct); panel II: sensors plus a model prevent errors]
CHARACTERIZE, MODEL, PREDICT
Don’t Fear Errors: Bits Flip, Instructions Don’t Always Execute Correctly
Bit Error Rate, Timing Error Rate, Instruction Error Rate, ….
Characterize Instructions and Instruction Sequences for Vulnerability to timing errors
Characterize LEON3 in 65nm TSMC across full range of operating conditions: (-40°C−125°C, 0.72V−1.1V)
Dynamic variations cause the critical path delay to increase by a factor of 6.1×.
[Figure: critical path (ns) across operating conditions]
[Figure: ILV and SLV characterization flows. Shared labels: LEON3 VHDL; Design Compiler; IC Compiler; TSMC 45nm libraries; timing constraints; Verilog netlist; netlist + parasitics; PrimeTime; netlist + SDF at the desired (V,T), with corners (0.81V,125˚C), (0.81V,-40˚C), (0.72V,125˚C), (0.99V,-40˚C); ModelSim (vsim). ILV flow: operand distribution, instruction generator, testbench, ILV metadata. SLV flow: intolerant apps, gcc, Pin, high-frequency sequences, operand distribution, sequence generator, testbench, SLV metadata. Design time: adaptive guardbanding with PLUT module and adaptive CLK.]
Generate ILV, SLV “Metadata”
• The ILV (SLV) for each instruction i (sequence i) at every operating condition is quantified:

$$\mathrm{ILV}(i, V, T, \mathit{cycle\_time}) = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathrm{Violation}_j$$

$$\mathrm{SLV}(i, V, T, \mathit{cycle\_time}) = \frac{1}{M_i} \sum_{j=1}^{M_i} \mathrm{Violation}_j$$

$$\mathrm{Violation}_j = \begin{cases} 1 & \text{if any stage violates at cycle } j \\ 0 & \text{otherwise} \end{cases}$$

– where N_i (M_i) is the total number of clock cycles in the Monte Carlo simulation of instruction i (sequence i) with random operands.
– Violation_j indicates whether there is a violated stage at clock cycle j or not.
• ILV_i (SLV_i) is defined as the total number of violated cycles over the total simulated cycles for instruction i (sequence i).
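The ILV/SLV definition is a plain ratio of violated cycles to simulated cycles. A minimal Python sketch, assuming the simulation exports one boolean flag per clock cycle; the function name `vulnerability` and the sample flags are hypothetical and not taken from the slides:

```python
from typing import Sequence


def vulnerability(violations: Sequence[bool]) -> float:
    """Fraction of simulated cycles in which any pipeline stage violated timing.

    The same ratio serves for ILV (cycles of one instruction) and
    SLV (cycles of one instruction sequence).
    """
    if not violations:
        raise ValueError("need at least one simulated cycle")
    violated = sum(1 for v in violations if v)
    return violated / len(violations)


# Hypothetical per-cycle flags for one instruction at one (V, T, cycle_time) point.
flags = [False, True, False, False, True, False, False, False]
ilv = vulnerability(flags)  # 2 violated cycles out of 8
```

One such value would be stored per instruction (or sequence) and per operating condition, which is what the slides call the metadata.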
Now, I am going to make a jump over characterization data…
[Figure: number of failed paths (×10000) in each pipeline stage (Fetch, Decode, Reg. acc., Execute, Memory, Write back); one panel compares supply voltages 0.72V, 0.88V, and 1.10V, the other compares temperatures -40°C, 0°C, and 125°C; panel labels: VDD = 1.1V, T = 125°C]
Observe: The execute and memory stages are sensitive to V/T variations, and also exhibit a large number of critical paths in comparison to the rest of the processor.
Hypothesis: We anticipate that the instructions that significantly exercise the execute and memory stages are likely to be more vulnerable to V/T variations.
Instruction-level Vulnerability (ILV)
Connect the dots from paths to Instructions
For SPARC V8 instructions, (V, T, F) are varied and ILVi is evaluated for every instruction i with random operands; SLVi is evaluated for each high-frequency sequence i of instructions.
• For every operating condition:
ILV (3rd Class) ≥ ILV (2nd Class) ≥ ILV (1st Class)
SLV (Class II) ≥ SLV (Class I)
[Table: ILV and SLV classification for integer SPARC V8 ISA, mapping SPARC instruction groups (ALU, MEM, CONTROL, HW MUL/DIV) to classes of ILV (Class I, II, III) and classes of SLV (Class I, II)]
ILV AND SLV: Partition them into groups according to their vulnerability to timing errors
• ILV: 1st class = logical and arithmetic; 2nd class = memory; 3rd class = multiply and divide.
• SLV: Class I = logical and arithmetic; Class II = mixtures of memory, logic, control.
• Sequences are the top 20 high-frequency sequences from 80 billion dynamic instructions of 32 benchmarks.
APPLY: STATICALLY TO ACHIEVE HIGHER INSTRUCTION THROUGHPUT, LOWER POWER
Use Instruction Vulnerabilities to Generate Better Code, Call/Returns
Now Use ILV, SLV to Dynamically Adapt Guardbands
[Figure: the VA Compiler takes the application code, SLV, ILV, and the application type as inputs]
I. Error-tolerant applications
– Duplication of critical instructions
– Satisfying the fidelity metric
II. Error-intolerant applications
– Increasing the percentage of Class I sequences, i.e., increasing the number of arithmetic instructions relative to memory and control-flow instructions, e.g., through loop unrolling
Adaptive Guardbanding
[Figure: LEON3 pipeline (IF, ID, RA, EX, ME, WB) with I$ and D$; the CPM supplies (V,T) and Seqi arrives via memory-mapped I/O; the PLUT drives adaptive clocking of the LEON3 core clock; the flow is split into compile time and runtime]
• Adaptive clock scaling for each class of sequences mitigates the conservative inter- and intra-corner guardbanding.
• At runtime, in every cycle, the PLUT module sends the desired frequency to the adaptive clocking circuit, using the characterized SLV metadata of the current sequence and the operating condition monitored by the CPM.
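The runtime behavior described here amounts to a table lookup keyed by sequence class and operating condition. A rough Python sketch of that idea follows; the table contents, the names `PLUT` and `cycle_time_ns`, and the worst-case fallback are assumptions made for illustration, not values or interfaces from the slides:

```python
from typing import Dict, Tuple

OperatingPoint = Tuple[float, int]  # (supply voltage in V, temperature in degrees C)

# Hypothetical characterized cycle times in ns, standing in for SLV metadata.
PLUT: Dict[Tuple[str, OperatingPoint], float] = {
    ("ClassI", (0.81, 125)): 1.0,
    ("ClassII", (0.81, 125)): 1.3,
    ("ClassI", (0.72, 125)): 1.6,
    ("ClassII", (0.72, 125)): 2.0,
}

# Assumed fallback: use the slowest characterized clock for unknown entries.
WORST_CASE_NS = max(PLUT.values())


def cycle_time_ns(seq_class: str, op: OperatingPoint) -> float:
    """Cycle time for the current sequence class at the monitored (V, T)."""
    return PLUT.get((seq_class, op), WORST_CASE_NS)


fast = cycle_time_ns("ClassI", (0.81, 125))
slow = cycle_time_ns("ClassII", (0.81, 125))
unknown = cycle_time_ns("ClassII", (0.99, -40))  # not in the table
```

Falling back to the worst case keeps the sketch safe for uncharacterized points; a real design would characterize every corner it can reach.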
Utilizing SLV at Compile Time
• Applying loop unrolling produces a longer chain of ALU instructions, and as a result the percentage of Class I sequences is increased up to 41%, and 31% on average.
• Hence, adaptive guardbanding benefits from this compiler transformation to further reduce the guardband for Class I sequences.
[Figure: percentage of sequences (%) that are Class I, broken down by sequence length from 2 to 7; panel a: without loop unrolling; panel b: with loop unrolling]
Effectiveness of Adaptive Guardbanding
• Using online SLV coupled with offline compiler techniques enables the processor to achieve a 1.6× average speedup for intolerant applications compared to recent work [Hoang’11], by adapting the cycle time for dynamic variations (inter-corner) and different instruction sequences (intra-corner).
• Adaptive guardbanding achieves up to 1.9× performance improvement for error-tolerant (probabilistic) applications in comparison to the traditional worst-case design.
[Figure: normalized throughput at (0.72V,125°), (0.81V,0°), and (0.81V,125°), two panels]
Example: Procedure Hopping in Clustered CPU, Each core with its voltage domain
• Statically characterize procedure for PLV
• A core increases voltage if monitored delay is high
• A procedure hops from one core to another if its voltage variation is high
• Less than 1% cycle overhead in EEMBC.
[Figure: cluster of 16 cores (Core0 … Core15), each with VA-VDD-hopping, CPM, and PSS, connected through logarithmic interconnects and level shifters to I$ banks (I$B0 … I$Bi-1) and TCDM banks (TCDMB0 … TCDMBj-1); supply rails: low, typical, and high VDD; DFS with f and f+180° clocks; SHM]

Per-core values shown in the figure:

Core   VDD = 0.81V   VDD = 0.99V   VA-VDD-Hopping = (0.81V, 0.99V)
f0     862           1408          862
f1     909           1389          909
f2     870           1408          870
f3     847           1370          847
f4     826           1370          1370
f5     855           1408          855
f6     877           1408          877
f7     893           1408          893
f8     820           1370          1370
f9     826           1370          1370
f10    909           1389          909
f11    847           1370          847
f12    901           1408          901
f13    917           1408          917
f14    847           1389          847
f15    901           1389          901
HW/SW Collaborative Architecture to Support Intra-cluster Procedure Hopping
Caller (Core i):
    …
    call ProcX            // conventional compile
    call ProcX@Caller     // VA-compile
    …
ProcX@Caller:
    if (calculate_PLV ≤ PLV_threshold)
        call ProcX
    else
        create_shared_stack_layout
        set_PHIT_for_ProcX
        send_broadcast_req
        set_timer
        wait_on_ack_or_timer
    …
Broadcast_ack_ISR:
    if (statusX_PHIT == done)
        load_context&return_from_SSPX

Callee (Core k):
    …
Broadcast_req_ISR:
    ProcX@Callee = search_in_PHIT
    call ProcX@Callee
    …
ProcX@Callee:
    if (calculate_PLV ≤ PLV_threshold)
        set_statusX_PHIT = running
        load_context&param_from_SSPX
        set_all_param&pointers
        call ProcX
        store_context_to_SSPX
        set_statusX_PHIT = done
        send_broadcast_ack
    else
        resume_normal_execution

[Figure: caller core i and callee core k, each with an operating condition monitor and interrupt controller, sharing the L1 I$ (holding ProcX, ProcX@Caller, ProcX@Callee) and the TCDM (holding stacks, the shared stack, the shared local heap, and the PHIT)]
• The code is easily accessible via the shared-L1 I$.
• The data and parameters are passed through the shared stack in TCDM (Tightly Coupled Data Memory).
• A procedure hopping information table (PHIT) keeps the status for a migrated procedure.
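The caller-side decision in this protocol can be modeled in a few lines. The Python sketch below is a simplification under stated assumptions: it replaces the broadcast and acknowledgement interrupts with a direct scan over per-core PLV values, picks the lowest-PLV willing core (the slides do not say which responder wins), and returns `None` to stand for the timer expiring. The names `choose_core` and `plv_by_core` are hypothetical.

```python
from typing import Dict, Optional


def choose_core(plv: Dict[int, float], caller: int, threshold: float) -> Optional[int]:
    """Return the core that should run the procedure, or None if no core qualifies."""
    if plv[caller] <= threshold:
        return caller  # PLV is acceptable locally: ordinary call, no hop
    # Stand-in for the broadcast request: other cores accept only if their
    # own PLV for this procedure is within the threshold.
    willing = [core for core, value in plv.items()
               if core != caller and value <= threshold]
    if not willing:
        return None  # models the timer expiring without an acknowledgement
    # Assumed tie-break: lowest PLV first, then lowest core id.
    return min(willing, key=lambda core: (plv[core], core))


# Hypothetical PLV values for one procedure on a four-core cluster.
plv_by_core = {0: 0.12, 1: 0.02, 2: 0.05, 3: 0.30}
stay = choose_core(plv_by_core, caller=1, threshold=0.10)
hop = choose_core(plv_by_core, caller=0, threshold=0.10)
nobody = choose_core(plv_by_core, caller=3, threshold=0.01)
```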
APPLY: MODEL, SENSE, AND ADAPT DYNAMICALLY
Combine Characterization with Online Recognition
Consider a Full Permutation of PVTA Parameters
• 10 32-bit integer and 15 single-precision FP Functional Units (FUs)
• For each FUi working with tclk and given PVTA variations, we define the Timing Error Rate (TER):

$$\mathrm{TER}(FU_i, t_{clk}, V, T, P, A) = \frac{\mathrm{CriticalPaths}(FU_i, t_{clk}, V, T, P, A)}{\mathrm{Paths}(FU_i)} \times 100$$

[Figure: Timing Error Rate analysis flow. Labels: FUs Verilog; DesignWare libs; 45nm corner libs; Design Compiler; IC Compiler; netlist & SPEF; 45nm process VA libs; variable parameters (voltage, temperature, process, tclk, Vth); PrimeTime SSTA & STA; MATLAB linear classifier; parametric model]

Parameter         Start Point   End Point   Step     # of Points
Voltage           0.88V         1.10V       0.01V    23
Temperature       0°C           120°C       10°C     13
Process (σWID)    0%            9.6%        3.2%     4
Aging (∆Vth)      0mV           100mV       25mV     5
tclk              0.2ns         5.0ns       0.2ns    25
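To get a feel for the size of a full permutation, the sweep table can be turned into a point count. The Python sketch below assumes every combination of the five parameters is analyzed for each FU; the dictionary keys, the integer units chosen to avoid floating-point rounding, and the helper names are illustrative assumptions.

```python
from math import prod  # Python 3.8+


def n_points(start: int, end: int, step: int) -> int:
    """Number of samples in an inclusive sweep, using integer units."""
    return (end - start) // step + 1


# Sweep ranges from the table, rescaled to integers (mV, degrees C, 0.1 %, mV, ps).
SWEEP = {
    "voltage_mV": (880, 1100, 10),
    "temperature_C": (0, 120, 10),
    "process_sigma_tenth_pct": (0, 96, 32),
    "aging_dVth_mV": (0, 100, 25),
    "tclk_ps": (200, 5000, 200),
}

counts = {name: n_points(*sweep) for name, sweep in SWEEP.items()}
total_points = prod(counts.values())  # combinations per functional unit


def ter_percent(critical_paths: int, total_paths: int) -> float:
    """TER as defined on the slide: failing paths over all paths, in percent."""
    return 100.0 * critical_paths / total_paths
```

With the table's ranges this gives 23 × 13 × 4 × 5 × 25 = 149,500 combinations per FU, which is why a fitted parametric model is attractive compared with storing every point.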
Parametric Model Fitting
• We used supervised learning (linear discriminant analysis) to generate a parametric model at the FU level that relates PVTA parameter variations and tclk to classes of TER.
• On average over all FUs, the resubstitution error is 0.036, meaning the models classify nearly all data correctly.
• For extra characterization points, the model makes correct estimates for 97% of out-of-sample data. The remaining 3% is misclassified to the high-error-rate class, CH, and thus still gets a safe guardband.
HFG ASIC Analysis Flow for TER
[Figure: (PVTA, tclk) → parametric model (linear discriminant analysis) → TER → TER class]
Classes of TER:
Class0 (C0): TER = 0%
ClassLow (CL): 33% ≥ TER > 0%
ClassMedium (CM): 66% ≥ TER > 33%
ClassHigh (CH): 100% ≥ TER > 66%
Linear discriminant analysis (x is the vector of PVTA parameters and tclk, of dimension d; μ_k and Σ_k are the mean and covariance of class k):

$$\hat{y} = \arg\min_{y=1,\dots,K} \sum_{k=1}^{K} \hat{P}(k \mid x)\, C(y \mid k)$$

$$P(x \mid k) = \frac{1}{\left((2\pi)^{d}\,|\Sigma_k|\right)^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x-\mu_k)^{T}\,\Sigma_k^{-1}\,(x-\mu_k)\right)$$

$$\hat{P}(k \mid x) = \frac{P(x \mid k)\,P(k)}{P(x)}$$

$$\mathrm{cost}(k) = \sum_{i=1}^{K} \hat{P}(i \mid x)\, C(k \mid i)$$
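The TER class boundaries and the minimum-expected-cost decision rule can be illustrated without any fitting machinery. The Python sketch below takes the posterior probabilities as given; the asymmetric cost matrix that penalizes underestimating the class is an assumption chosen to show how a cost choice can bias decisions toward the safe CH class, and is not claimed to be what the authors used.

```python
from typing import List, Sequence

CLASSES = ["C0", "CL", "CM", "CH"]


def ter_class(ter: float) -> str:
    """Map a TER percentage to its class using the slide's boundaries."""
    if not 0.0 <= ter <= 100.0:
        raise ValueError("TER must be a percentage in [0, 100]")
    if ter == 0.0:
        return "C0"
    if ter <= 33.0:
        return "CL"
    if ter <= 66.0:
        return "CM"
    return "CH"


def min_cost_class(posterior: Sequence[float], cost: Sequence[Sequence[float]]) -> int:
    """Index y minimizing sum_k posterior[k] * cost[y][k], where cost[y][k] = C(y|k)."""
    expected: List[float] = [
        sum(p * cost[y][k] for k, p in enumerate(posterior))
        for y in range(len(posterior))
    ]
    return min(range(len(expected)), key=expected.__getitem__)


K = len(CLASSES)
zero_one = [[0.0 if y == k else 1.0 for k in range(K)] for y in range(K)]
# Hypothetical asymmetric cost: predicting a lower class than the truth costs 10x.
safe = [[0.0 if y == k else (10.0 if y < k else 1.0) for k in range(K)] for y in range(K)]

posterior = [0.1, 0.6, 0.2, 0.1]  # hypothetical P(k|x) for one operating point
plain_choice = CLASSES[min_cost_class(posterior, zero_one)]
safe_choice = CLASSES[min_cost_class(posterior, safe)]
```

With 0/1 cost the rule picks the most probable class; with the asymmetric cost the same posterior is pushed to CH, trading some guardband for safety.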
Delay Variation and TER Characterization
• During design time the delay of the FP adder has a large uncertainty of [0.73ns, 1.32ns], since the actual values of PVTA parameters are unknown.
[Figure: tree that narrows the delay (ns) as parameters become known, (P,?,?,?) → (P,A,?,?) → (P,A,T,?) → (P,A,T,V), branching over P(σWID) = 0% … 9.6%, A(∆Vth) = 0mV … 100mV, T = 0°C … 120°C, and V = 0.88V … 1.10V]
[Figure: timing error rate (%) versus temperature (°C) and VDD (V) in four panels: (P(σWID) = 0%, A(∆Vth) = 0mV), (P(σWID) = 0%, A(∆Vth) = 100mV), (P(σWID) = 9.6%, A(∆Vth) = 0mV), (P(σWID) = 9.6%, A(∆Vth) = 100mV)]
Hierarchical Sensors Observability
• The question is: what mix of monitors would be useful?
• The more sensors we provide for an FU, the more its conservative guardband can be reduced.
[Figure: tclk (ns) for FP_exp, FP_add, and INT_mac with P_sensor, PA_sensors, PAT_sensors, and PATV_sensors]
• The guardband of the FP adder can be reduced by up to:
– 8% (P_sensor)
– 24% (PA_sensors)
– 28% (PAT_sensors)
– 44% (PATV_sensors)
• In-situ PVT sensors impose 1−3% area overhead [Bowman’09]
• Five replica PVT sensors increase area by 0.2% [Lefurgy’11]
• The banks of 96 NBTI aging sensors occupy less than 0.01% of the core's area [Singh’11]
Online Utilization of Guardbanding
The control system tunes the clock frequency through an online model-based rule.
[Figure: offline, TER raw data for each FU (FUi, FUj, FUk) pass through the classifier to a parametric model that, given PATV_config and target_TER, fills LUTs with rows of (P, A, T, V, tclk); online, PATV sensor readings P (2-bit), A (3-bit), T (3-bit), V (3-bit) and the instruction index the LUTs, and the max tclk (5-bit) goes to the CLK control; other labels: GPU, SIMD, IF]
1. Fine-grained: instruction-by-instruction monitoring and adaptation that uses signals of PATV sensors from individual FUs
2. Coarse-grained: kernel-level monitoring that uses representative PATV sensors for the entire execution stage of the pipeline
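The figure's bit widths suggest that quantized sensor readings form a compact lookup index. The Python sketch below shows one way such an index could be packed; the field order, the function names, and the reading of "max" as taking the largest (slowest) tclk code across FUs for coarse-grained control are all assumptions, not details confirmed by the slides.

```python
from typing import Iterable

# Bit widths as labeled in the figure: P (2), A (3), T (3), V (3) = 11 bits total.
FIELD_BITS = (("P", 2), ("A", 3), ("T", 3), ("V", 3))
TCLK_BITS = 5


def pack_patv(p: int, a: int, t: int, v: int) -> int:
    """Pack quantized sensor codes into an 11-bit LUT index (P in the top bits)."""
    index = 0
    for (name, bits), value in zip(FIELD_BITS, (p, a, t, v)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} code {value} does not fit in {bits} bits")
        index = (index << bits) | value
    return index


def coarse_tclk_code(per_fu_codes: Iterable[int]) -> int:
    """Assumed coarse-grained rule: the largest tclk code among the FUs wins."""
    codes = list(per_fu_codes)
    if any(not 0 <= c < (1 << TCLK_BITS) for c in codes):
        raise ValueError("tclk code does not fit in 5 bits")
    return max(codes)


lut_size = 1 << sum(bits for _, bits in FIELD_BITS)  # entries per instruction/FU
top_index = pack_patv(3, 7, 7, 7)
kernel_code = coarse_tclk_code([4, 9, 6])
```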
Throughput benefit of HFG
[Figure: throughput (GIPS) with P_sensor, PA_sensors, PAT_sensors, and PATV_sensors, two panels]
Kernel-level monitoring improves throughput by 70% from P to PATV sensors. Target TER=0
Instruction-level monitoring improves throughput by 1.8-2.1X.
PUTTING IT TOGETHER: COORDINATED ADAPTATION TO PROPAGATE ERRORS TOWARDS APPLICATION
Consider shared 8-FPU 16-core architectures
Accurate, Approximate Operating Modes
• Accurate mode: every pipeline uses (with 3.8% area overhead)
– EDS circuit sensors to detect any timing errors
– ECU to correct errors using a multiple-issue operation replay mechanism (without changing frequency)
[Figure: shared FPU behind a slave port with ADD/SUB, MUL, and DIV pipes; each pipe has stages S1, S2, … (up to S18), EDS + ECU, an FLV block, and opmode, Opnd1&2, res, and done signals; oprt]
Modeled after STM P2012 16-core machine
Accuracy-Configurable Architecture
• In the approximate mode
– The pipeline disables the EDS sensors on the N least significant bits of the fraction, where N is reprogrammable through a memory-mapped register.
– The sign and the exponent bits are always protected by EDS.
– Thus the pipeline ignores any timing error within the N least significant bits of the fraction and saves the recovery cost.
• Switching between modes partially disables/enables the error detection circuits on N bits of the fraction, so the FP pipeline can efficiently execute subsequent interleaved accurate or approximate software blocks.
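The approximate-mode rule can be stated as a predicate on the bit position where a timing error is detected. The Python sketch below assumes IEEE 754 single precision (1 sign bit, 8 exponent bits, 23 fraction bits) with bit 0 as the fraction's least significant bit; the function names are hypothetical. The bound in `max_relative_error` follows from the fact that corrupting only the N lowest fraction bits of a normalized number changes it by less than 2^(N-23) of its magnitude.

```python
FRACTION_BITS = 23  # IEEE 754 single precision: bits 0..22 fraction, 23..30 exponent, 31 sign


def needs_recovery(error_bit: int, n: int, approximate: bool) -> bool:
    """True if a timing error at `error_bit` must be corrected by replay."""
    if not 0 <= error_bit < 32:
        raise ValueError("error_bit must be in 0..31")
    if not 0 <= n <= FRACTION_BITS:
        raise ValueError("N can cover fraction bits only")
    if not approximate:
        return True  # accurate mode: every detected error is corrected
    return error_bit >= n  # errors in the N lowest fraction bits are ignored


def max_relative_error(n: int) -> float:
    """Upper bound on relative error when the N lowest fraction bits are unprotected."""
    return 2.0 ** (n - FRACTION_BITS)


ignored = needs_recovery(error_bit=5, n=20, approximate=True)
fraction_msb = needs_recovery(error_bit=22, n=20, approximate=True)
exponent_lsb = needs_recovery(error_bit=23, n=20, approximate=True)
bound = max_relative_error(20)
```

Because N never exceeds 23, sign and exponent bits always fall on the protected side, matching the second sub-bullet.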
Fine-grain Interleaving Possible Through Coordination and Controlled Approximation
• Architecture: accuracy-reconfigurable FPUs that are shared among tightly-coupled processors and support online FPV characterization
• Compiler: OpenMP pragmas for approximate FP computations; profiling technique to identify tolerable error significance and error rate
• Runtime: the scheduler utilizes FPV metadata and promotes FPUs to accurate mode, or demotes them to approximate mode, depending upon the code region requirements.
Either ignore the timing errors (in approximate regions) or reduce the frequency of errors by assigning computations to correctable hardware resources, at a cost.
Ensure that ignoring errors is safe through a set of rules.
FP Vulnerability Dynamically Monitored and Controlled by ECU
• The percentage of cycles with timing errors, as reported by EDS sensors, is captured as FPV metadata.
• Metadata is visible to the software through memory-mapped registers.
• Enables the runtime scheduler to perform online selection of the best FP pipeline candidates
– Low-FPV units for accurate blocks, or steer errors without correction to the application.
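Selecting the best pipeline for an accurate block reduces to picking the available unit with the lowest FPV. A small Python sketch, assuming the FPV values have already been read from the memory-mapped registers into a dictionary; the unit names, the `busy` set, and the tie-break by name are hypothetical:

```python
from typing import AbstractSet, Dict, Optional


def pick_fpu(fpv_percent: Dict[str, float],
             busy: AbstractSet[str] = frozenset()) -> Optional[str]:
    """Return the free FPU with the lowest FPV, or None if all are busy."""
    candidates = {unit: fpv for unit, fpv in fpv_percent.items() if unit not in busy}
    if not candidates:
        return None
    # Lowest FPV wins; the unit name breaks ties so the choice is deterministic.
    return min(candidates, key=lambda unit: (candidates[unit], unit))


# Hypothetical FPV metadata: percentage of cycles with timing errors per FPU.
fpv = {"fpu0": 2.5, "fpu1": 0.4, "fpu2": 0.4, "fpu3": 1.1}
first = pick_fpu(fpv)
second = pick_fpu(fpv, busy={"fpu1"})
none_free = pick_fpu(fpv, busy=set(fpv))
```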
OpenMP Compiler Extension
#pragma omp accurate
    structured-block
#pragma omp approximate [clause]
    structured-block
clause: error_significance_threshold(<value N>)

#pragma omp parallel
{
  #pragma omp accurate
  #pragma omp for
  for (i = K/2; i < (IMG_M - K/2); ++i) {            // iterate over image
    for (j = K/2; j < (IMG_N - K/2); ++j) {
      float sum = 0;
      int ii, jj;
      for (ii = -K/2; ii <= K/2; ++ii) {             // iterate over kernel
        for (jj = -K/2; jj <= K/2; ++jj) {
          float data = in[i+ii][j+jj];
          float coef = coeffs[ii+K/2][jj+K/2];
          float result;
          #pragma omp approximate error_significance_threshold(20)
          {
            result = data * coef;
            sum += result;
          }
        }
      }
      out[i][j] = sum / scale;
    }
  }
}

Code snippet for Gaussian filter utilizing OpenMP variability-aware directives

The two statements in the approximate block are lowered to runtime calls:

int ID = GOMP_resolve_FP(GOMP_APPROX, GOMP_MUL, 20);  // invokes the runtime FPU scheduler
GOMP_FP(ID, data, coef, &result);                     // programs the FPU
ID = GOMP_resolve_FP(GOMP_APPROX, GOMP_ADD, 20);
GOMP_FP(ID, sum, result, &sum);
FPV Metadata can even drive synthesis!
[Figure: flow from source code with OpenMP approximate directives (annotated source code) and profiling on input data, through controlled approximation analysis (error rate, error significance, fidelity (PSNR)), to approximate-aware timing constraint generation (error significance threshold N, error rate threshold), then design-time hardware FPU synthesis & optimization and the runtime library scheduler]
• Tight timing: utilizing fast leaky standard cells (low-VTH) for these paths
• Relaxed timing (N bits): utilizing the regular and slow standard cells (regular-VTH and high-VTH) for the rest of paths since errors can be ignored!
Save Recovery Time, Energy using FPV Monitoring (TSMC 45nm)
• Error-tolerant applications: Gaussian, Sobel filters
– PSNR results show error significance threshold at N=20 while maintaining >30 dB
• 36% more energy efficient FPUs, recovery cycles reduced by 46%
• 5 kernel codes as error-intolerant applications
– 22% average energy savings.

ARM v6 core          16            TCDM banks       16
I$ size (per core)   16KB          TCDM latency     2 cycles
I$ line              4 words       TCDM size        256 KB
Latency hit          1 cycle       L3 latency       ≥ 60 cycles
Latency miss         ≥ 59 cycles   L3 size          256MB
Shared-FPUs          8             FP ADD latency   2
FP MUL latency       2             FP DIV latency   18
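The fidelity target above is expressed as PSNR. For reference, here is a minimal Python sketch of the standard PSNR definition, 10·log10(peak² / MSE); the peak value of 255 assumes 8-bit pixel data, which the slides do not state, and the sample pixel lists are made up.

```python
import math
from typing import Sequence


def psnr_db(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(peak * peak / mse)


# Made-up example: every pixel of the approximate output is off by one level.
accurate_out = [10.0, 20.0, 30.0, 40.0]
approx_out = [11.0, 21.0, 31.0, 41.0]
quality = psnr_db(accurate_out, approx_out)  # MSE = 1
acceptable = quality > 30.0  # the >30 dB target quoted on the slide
```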
Expedition Experimental Platforms & Artifacts
• Interesting and unique challenges in building research testbeds that drive our explorations
– Mock-ups don’t go far since variability is at the heart of microelectronic scaling. Need platforms that capture scaling and integration aspects.
• Testbeds to observe (Molecule, GreenLight, Ming), control (Oven, ERSA)
Ming the Merciless
ERSA@BEE3
Molecule
Red Cooper
Red Cooper Testbed
• Customized chip with processor + speed/leakage sensors
• Testbed board to finish the sensor feedback loop on board
• Used in building a duty-cycled OS based on variability sensors
[Figure: system stack (applications; runtime; microarchitecture and compilers; CPU, memory, storage, accelerators, energy source (batteries), network) with variability sources (vendor, process, ambient, aging) and manifestations (power, performance, errors)]
Ferrari Chip: Closing Loop On-Chip
• On-chip sensors
– Memory-mapped I/O and control
– Leakage sensors, DDROs, temperature sensors, reliability sensors
• Better support for OS and software
[Figure: chip block diagram with ARM Cortex-M3, JTAG, AMBA bus, GPIO, timers, PLL, RO CLK, config, 64 kB IMEM, 176 kB DMEM, ECC, counters, 8 banks of sensors (N/P leak, temp, oxide), 19 DDROs, and SensOut; die layout showing ARM Cortex-M3, IMEM, DMEM, and PLL; DUT device and reference device; system stack as on the previous slide]
Available since April 2013
Sense-and-Adapt Fundamentally Alters The Stack
• Machines that consist of parts with variations in performance, power and reliability
• Machines that incorporate sensing circuits
• Machines w/ interfaces to change ongoing computation & structures
• New machine models: QoS or Relaxed Reliability parts.
Thank You!
http://variability.org
The Variability Expedition
An NSF Expeditions in Computing Project
Rajesh K. Gupta
Nikil Dutt, UCI
Puneet Gupta, UCLA
Mani Srivastava, UCLA
Steve Swanson, UCSD
Lara Dolecek, UCLA
Subhasish Mitra, Stanford
YY Zhou, UCSD
Tajana Rosing, UCSD
Alex Nicolau, UCI
Ranjit Jhala, UCSD
Sorin Lerner, UCSD
Rakesh Kumar, UIUC
Dennis Sylvester, UMich
Yuvraj Agarwal, CMU
Lucas Wanner, UCLA