From Crash-and-Recover to Sense-and-Adapt: Our Evolving Models of Computing Machines


Rajesh K. Gupta, UC San Diego

To a software designer, all chips look alike

To a hardware engineer, a chip is delivered as per contract in a data-sheet.


COMPUTERS ARE BUILT ON STUFF THAT IS IMPERFECT AND…

Reality is

Changing From Chiseled Objects to Molecular Assemblies

Courtesy: P. Gupta, UCLA

45nm Implementation of Leon3 Processor Core

Engineers Know How to “Sandbag”

• PVTA margins add to guardbands
– Static Process variation: effective transistor channel length and threshold voltage
– Dynamic variations: Temperature fluctuations, supply Voltage droops, and device Aging (NBTI, HCI)

Figure: the clock period covers the actual circuit delay plus a guardband; guardband labels: Temperature, Aging, VCC Droop, Across-wafer Frequency.

Uncertainty Means Unpredictability

• VLSI Designer: Eliminate It
– Capture physics into models
– Statistical or plain-old Monte Carlo
– Manufacturing, temperature effects

• Architect: Average It Out
– Workload (dynamic) variations

• Software, OS: Deny It
– Simplify, re-organize OS/tasks, breaking these into parts that are precise (worst case) and imprecise (average)

Each doing their own thing, massive overdesign…

Simulate ‘degraded’ netlist with model input changes (∆Vth)

Deterministic simulations capture known physical processes (e.g., aging)

Multiple (Monte-Carlo) simulations wrapped around a nominal model

Let us step back a bit: HW-SW Stack

Figure: the stack, bottom to top: Hardware Abstraction Layer (HAL), Operating System, Applications.





Figure (x-axis: time or part): overdesigned hardware

20x in sleep power, 50% in performance

40% larger chip, 35% more active power, 60% more sleep power

What if?




Figure (x-axis: time or part): underdesigned hardware

New Hardware-Software Interface..

Figure labels (x-axis: time or part): Applications; Opportunistic Software; Underdesigned Hardware; minimal variability handling in hardware; Traditional Fault-tolerance.

UNO Computing Machines Seek Opportunities based on Sensing Results

Variability manifestations:
- faulty cache bits
- delay variation
- power variation

Variability signatures:
- cache bit map
- CPU speed-power map
- memory access time
- ALU error rates

• Do Nothing (Elastic User, Robust App)
• Change Algorithm Parameters (Codec Setting, Duty Cycle Ratio)
• Change Algorithm Implementation (Alternate code path, Dynamic recompilation)
• Change Hardware Operating Point (Disabling parts of the cache, Changing V-f)

Sensors

Models

Metadata Mechanisms: Reflection, Introspection
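The four responses on this slide can be read as an escalation policy driven by a sensed signature. The sketch below is only an illustration of that reading: the `uno_signature` fields, the function name `uno_choose_action`, and the numeric thresholds are all assumptions made for the example, not part of the UnO work.

```c
#include <stdbool.h>

/* The four adaptation options from the slide, least to most intrusive. */
typedef enum {
    DO_NOTHING,
    CHANGE_ALGORITHM_PARAMETERS,
    CHANGE_ALGORITHM_IMPLEMENTATION,
    CHANGE_HARDWARE_OPERATING_POINT
} uno_action;

/* Hypothetical signature: both fields are assumptions for illustration. */
typedef struct {
    double alu_error_rate; /* fraction of ALU operations observed in error */
    bool   app_is_robust;  /* true if the user/application tolerates errors */
} uno_signature;

/* Pick an action from the sensed signature.
 * The 1% and 10% thresholds are placeholders, not measured values. */
uno_action uno_choose_action(const uno_signature *s)
{
    if (s->app_is_robust || s->alu_error_rate == 0.0)
        return DO_NOTHING;
    if (s->alu_error_rate < 0.01)
        return CHANGE_ALGORITHM_PARAMETERS;
    if (s->alu_error_rate < 0.10)
        return CHANGE_ALGORITHM_IMPLEMENTATION;
    return CHANGE_HARDWARE_OPERATING_POINT;
}
```

A real policy would be calibrated per application; the point here is only that the software layer reads metadata and selects among a small set of responses.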

UnO Computing Machines: Taxonomy of Underdesign

Figure (Puneet Gupta/UCLA), labels: Nominal Design; Performance Constraints; Manufacturing; Manufactured Die; Hardware Characterization; Tests; Burn In; Signature; Manufactured Die With Stored Signatures; Die Specific Adaptation; Hardware; Software.

Several Fundamental Questions

• How do we distinguish code that needs to be accurate from code that does not?
– How fine-grained are these distinctions (or do they have to be)?

• How do we communicate this information across the stack in a manner that is robust and portable?
– And error controllable (= safe).

• What is the model of error that should be used in designing UNO machines?

Building Machines that Move from Crash & Recover to Sense & Adapt

Expedition Grand Challenge & Questions

“Can microelectronic variability be controlled and utilized in building better computer systems?”

Three Goals:
a. Address fundamental technical challenges (understand the problem)
b. Create experimental systems (proof-of-concept prototypes)
c. Educational and broader-impact opportunities (ensure training for future talent)

What are the most effective ways to detect variability?

What are software-visible manifestations?

What are software mechanisms to exploit variability?

How can designers and tools leverage adaptation?

How do we verify and test hw-sw interfaces?

Thrusts traverse institutions on testbed vehicles, seeding various projects

Group A: Signature Detection and Generation

Characterizing variability in power consumption for modern computing platforms, and implications

Runtime support and software adaptation for variable hardware

Probabilistic analysis of faulty hardware

Understanding and exploiting variability in flash memory devices

FPGA-based variability simulator

Group B: Variability Mitigation Measures

Mitigating variability in solid-state storage devices

Hardware solutions to better understand and exploit variability

VarEmu emulation-based testbed for variability-aware software

Variability-aware opportunistic system software stack

Application robustification for stochastic processors

Group C: Opportunistic Software and Abstractions

Effective error resilience

Negative bias temperature instability and electromigration

Memory-variability aware runtime systems

Design-dependent ring oscillator and software testbed

Executing programs under relaxed semantics

Instruction-level Vulnerability (ILV)

Sequence-level Vulnerability (SLV)

Procedure-level Vulnerability (PLV)

Task-level Vulnerability (TLV)

Monitor manifestations from the instruction level to the task level.

Observe and Control Variability Across Stack

By the time we get to TLV, we are in a parallel software context: we can instruct the OpenMP scheduler, or even create an abstraction for programmers to express irregular and unstructured parallelism (code refactoring).

The steps to build variability abstractions up to the SW layer

[ILV,SLV,PLV,TLV] Rahimi et al, DATE’12, ISLPED’12, TC’13, DATE’13

Closer to HW: Uncertainty Manifestations

• The most immediate manifestations of variability are in path delay and power variations.
– Path delay variations have been addressed extensively in delay fault detection by the test community.

• With variability, it is possible to do better by focusing on the actual mechanisms.
– For instance, a major source of timing variation is voltage droops, and errors matter when these end up in a state change.

Combine these two observations and you get a rich literature in recent years for handling variability-induced errors: Razor, EDS, TRC, …

Detecting and Correcting Timing Errors

• Detect error, tune supply voltage to reach an error rate, borrow time, stretch clock
– Exploit detection circuits (e.g., for voltage droops), double sampling with shadow latches, and the data dependence of circuit delays
– Enable reduction in voltage margin
– Manage timing guardbands and voltage margins
– Tunable replica circuits allow non-intrusive operation

Figure labels: voltage droop.
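The idea of tuning the supply voltage to reach an error rate can be sketched as one step of a simple feedback controller. This is a minimal sketch under stated assumptions: the function name `tune_vdd_mv`, the millivolt units, the fixed step size, and the clamping limits are all hypothetical and are not taken from Razor or any cited design.

```c
/* One control step: compare the observed timing-error rate against a target
 * and move the supply voltage by one step. All parameters are illustrative
 * assumptions; voltages are in millivolts. */
int tune_vdd_mv(int vdd_mv, unsigned errors, unsigned cycles,
                double target_rate, int step_mv, int vmin_mv, int vmax_mv)
{
    /* Observed error rate over the last window; 0 if no cycles were seen. */
    double rate = cycles ? (double)errors / (double)cycles : 0.0;
    int next = vdd_mv;

    if (rate > target_rate)
        next = vdd_mv + step_mv;   /* too many errors: add voltage margin */
    else if (rate < target_rate)
        next = vdd_mv - step_mv;   /* headroom available: remove margin */

    /* Keep the result inside the allowed operating range. */
    if (next < vmin_mv) next = vmin_mv;
    if (next > vmax_mv) next = vmax_mv;
    return next;
}
```

For instance, with a 1% target, a window of 50 errors in 1000 cycles raises a 900 mV supply to 910 mV, while an error-free window lowers it to 890 mV.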

Sensing: Razor, RazorII, EDS, Bubble Razor

Double Sampling (Razor I) [Ernst’03]

Transition Detector with Time Borrowing [Bowman’09]

Razor II [Das’09]

Double Sampling with Time Borrowing [Bowman’09]

EDS [Bowman’11]

Task Ingredients: Model, Sense, Predict, Adapt

I. Sense & Adapt
Observation using in situ monitors (Razor, EDS) with cycle-by-cycle corrections (leveraging CMOS knobs or replay).

II. Predict & Prevent
Relying on external or replica monitors, model-based rules derive an adaptive guardband to prevent errors.

Figure labels: Sensors, Model; Sense (detect), Adapt (correct), Prevent; timing annotations 1ns, 3ns, 4ns, 5ns.

CHARACTERIZE, MODEL, PREDICT

Don’t Fear Errors: Bits Flip, Instructions Don’t Always Execute Correctly

Bit Error Rate, Timing Error Rate, Instruction Error Rate, ….

Characterize Instructions and Instruction Sequences for Vulnerability to timing errors

Characterize LEON3 in 65nm TSMC across the full range of operating conditions: (-40°C−125°C, 0.72V−1.1V)

Critical path (ns)

Dynamic variations cause the critical path delay to increase by a factor of 6.1×.

Figure: ILV and SLV characterization flows at design time. Labels: LEON3 VHDL; Design Compiler; IC Compiler; TSMC 45nm libraries; timing constraints; Verilog netlist; netlist + parasitics; PrimeTime; netlist + SDF at (0.72V,125˚C), (0.81V,-40˚C), (0.81V,125˚C), (0.99V,-40˚C), or a desired (V,T); ModelSim Vsim. ILV flow: operand distribution, instruction generator, testbench, ILV metadata. SLV flow: intolerant apps, gcc, Pin, high-frequency sequences, operand distribution, sequence generator, testbench, SLV metadata. Also labeled: adaptive guardbanding, PLUT module, adaptive CLK.

Generate ILV, SLV “Metadata”

• The ILV (SLV) for each instruction_i (sequence_i) at every operating condition is quantified:

$$\mathrm{ILV}(i, V, T, \mathit{cycle\_time}) = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathrm{Violation}_j$$

$$\mathrm{SLV}(i, V, T, \mathit{cycle\_time}) = \frac{1}{M_i} \sum_{j=1}^{M_i} \mathrm{Violation}_j$$

$$\mathrm{Violation}_j = \begin{cases} 1 & \text{if any stage violates at cycle } j \\ 0 & \text{otherwise} \end{cases}$$

– where N_i (M_i) is the total number of clock cycles in the Monte Carlo simulation of instruction_i (sequence_i) with random operands.

– Violation_j indicates whether there is a violated stage at clock cycle j or not.

• ILV_i (SLV_i) is defined as the total number of violated cycles over the total simulated cycles for instruction_i (sequence_i).
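The ILV/SLV definition is a ratio of violated cycles to simulated cycles, which the following sketch computes from a per-cycle violation trace. The function name `vulnerability` and the byte-array representation of the trace are assumptions for illustration; the source obtains the trace from gate-level Monte Carlo simulation, which is not modeled here.

```c
#include <stddef.h>

/* Fraction of simulated cycles in which at least one pipeline stage had a
 * timing violation. violation[j] is nonzero if cycle j violated.
 * Returns 0.0 for an empty trace. */
double vulnerability(const unsigned char *violation, size_t n)
{
    size_t violated = 0;
    for (size_t j = 0; j < n; ++j) {
        if (violation[j])
            ++violated;
    }
    return n ? (double)violated / (double)n : 0.0;
}
```

The same routine serves for ILV (trace of one instruction, N_i cycles) and SLV (trace of one sequence, M_i cycles); one value would be stored per operating condition and cycle time.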

Now, I am going to make a jump over characterization data…

Figure: number of failed paths (x 10000) per pipeline stage (Fetch, Decode, Reg. acc., Execute, Memory, Write back), in two charts: one for 0.72V, 0.88V, and 1.10V, and one for -40°C, 0°C, and 125°C.

Observe: The execute and memory stages are sensitive to V/T variations, and also exhibit a large number of critical paths in comparison to the rest of the processor.

Hypothesis: Instructions that significantly exercise the execute and memory stages are likely to be more vulnerable to V/T variations. Hence Instruction-level Vulnerability (ILV).

Chart conditions: VDD = 1.1V; T = 125°C

Connect the dots from paths to Instructions

For SPARC V8 instructions, (V, T, F) are varied and ILV_i is evaluated for every instruction_i with random operands; SLV_i is evaluated for a high-frequency sequence_i of instructions.

• For every operating condition:

ILV (3rd Class) ≥ ILV (2nd Class) ≥ ILV (1st Class)

SLV (Class II) ≥ SLV (Class I)

Table: ILV and SLV classification for integer SPARC V8 ISA. Columns: SPARC Instructions (ALU, MEM, CONTROL, HW MUL/DIV); Classes of ILV (ClassI, ClassII, ClassIII); Classes of SLV (ClassI, ClassII).

ILV AND SLV: Partition instructions and sequences into groups according to their vulnerability to timing errors

ILV: 1st class = logical and arithmetic; 2nd class = memory; 3rd class = multiply and divide.

SLV: Class I = logical and arithmetic; Class II = mixtures of memory, logic, control. SLV covers the top 20 high-frequency sequences from 80 billion dynamic instructions of 32 benchmarks.

APPLY: STATICALLY TO ACHIEVE HIGHER INSTRUCTION THROUGHPUT, LOWER POWER

Use Instruction Vulnerabilities to Generate Better Code, Call/Returns


Now Use ILV, SLV to Dynamically Adapt Guardbands

Figure labels (compile time): Application Code, App. type, ILV, SLV, VA Compiler.

I. Error-tolerant applications
– Duplication of critical instructions
– Satisfying the fidelity metric

II. Error-intolerant applications
– Increasing the percentage of ClassI sequences, i.e., increasing the number of arithmetic instructions relative to memory and control-flow instructions, e.g., through loop unrolling

Adaptive Guardbanding

Figure labels (runtime): LEON3 pipeline stages IF, ID, RA, EX, ME, WB; I$; D$; CPM reporting (V,T); PLUT receiving Seq_i via memory-mapped I/O; adaptive clocking; LEON3 core clock.

• Adaptive clock scaling for each class of sequences mitigates the conservative inter- and intra-corner guardbanding.

• At runtime, in every cycle, the PLUT module sends the desired frequency to the adaptive clocking circuit, using the characterized SLV metadata of the current sequence and the operating condition monitored by the CPM.
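The PLUT behavior described above amounts to a table lookup keyed by the monitored operating condition and the SLV class of the current sequence. The sketch below illustrates that lookup only; the table dimensions, the index encoding, and every period value are made-up placeholders, not characterization data from the LEON3 study.

```c
/* Hypothetical table sizes: two operating conditions, two SLV classes. */
enum { NUM_CONDITIONS = 2, NUM_SLV_CLASSES = 2 };

/* Clock period in picoseconds per (condition, SLV class).
 * Values are invented for illustration. Row 1 stands for the slower
 * condition, column 1 for the more vulnerable sequence class. */
static const unsigned plut_period_ps[NUM_CONDITIONS][NUM_SLV_CLASSES] = {
    { 1000, 1200 },
    { 1500, 1800 },
};

/* Return the clock period to request from the adaptive clocking circuit.
 * Out-of-range inputs fall back to the most conservative (longest) entry. */
unsigned plut_lookup_period_ps(unsigned condition, unsigned slv_class)
{
    if (condition >= NUM_CONDITIONS || slv_class >= NUM_SLV_CLASSES)
        return plut_period_ps[NUM_CONDITIONS - 1][NUM_SLV_CLASSES - 1];
    return plut_period_ps[condition][slv_class];
}
```

Falling back to the longest period on unknown inputs keeps the sketch safe by construction: an unrecognized condition never shortens the clock.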

Utilizing SLV at Compile Time

• Loop unrolling produces a longer chain of ALU instructions; as a result, the percentage of ClassI sequences increases by up to 41%, and by 31% on average.

• Hence, adaptive guardbanding benefits from this compiler transformation, which further reduces the guardband for ClassI sequences.
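To make the transformation concrete, here is a hand-unrolled loop next to its rolled form. It is a generic illustration of loop unrolling by a factor of four, not code from the study; the function names are invented for the example. Unrolling replaces three of every four loop-control branches with straight-line arithmetic, which is the effect the slide relies on to lengthen ALU-only sequences.

```c
#include <stddef.h>

/* Rolled form: one compare-and-branch per element. */
int sum_rolled(const int *a, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Unrolled by 4: one compare-and-branch per four elements, so the loop body
 * is a longer run of arithmetic instructions. */
int sum_unrolled4(const int *a, size_t n)
{
    int s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)   /* remainder loop for n not divisible by 4 */
        s += a[i];
    return s;
}
```

Both functions return the same result for any input; only the instruction mix of the generated code differs.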

Figure: percentage of sequences (%) that are ClassI, broken down by sequence length (2 to 7): (a) without loop unrolling; (b) with loop unrolling.

Effectiveness of Adaptive Guardbanding

• Using online SLV coupled with offline compiler techniques enables the processor to achieve 1.6× average speedup for intolerant applications compared to recent work [Hoang’11], by adapting the cycle time for dynamic variations (inter-corner) and different instruction sequences (intra-corner).

Figure: normalized throughput at (0.72V,125°), (0.81V,0°), and (0.81V,125°), in two charts.

• Adaptive guardbanding achieves up to 1.9× performance improvement for error-tolerant (probabilistic) applications in comparison to the traditional worst-case design.

Example: Procedure Hopping in Clustered CPU, Each core with its voltage domain

• Statically characterize each procedure for PLV
• A core increases its voltage if the monitored delay is high
• A procedure hops from one core to another if its voltage variation is high
• Less than 1% cycle overhead in EEMBC

Figure labels: Core0 to Core15, each with CPM, PSS, VA-VDD-hopping, and level shifters; Low VDD, Typical VDD, High VDD; I$ banks B0 to Bi-1 and TCDM banks B0 to Bj-1, each behind a Log. Interc.; SHM; DFS with clocks f and f+180°.

Table: per-core frequencies f0 to f15 under three supply configurations.

Core   VDD = 0.81V   VDD = 0.99V   VA-VDD-Hopping (0.81V, 0.99V)
f0     862           1408          862
f1     909           1389          909
f2     870           1408          870
f3     847           1370          847
f4     826           1370          1370
f5     855           1408          855
f6     877           1408          877
f7     893           1408          893
f8     820           1370          1370
f9     826           1370          1370
f10    909           1389          909
f11    847           1370          847
f12    901           1408          901
f13    917           1408          917
f14    847           1389          847
f15    901           1389          901

HW/SW Collaborative Architecture to Support Intra-cluster Procedure Hopping

Caller core:

    …
    call ProcX             // conventional compile
    call ProcX@Caller      // VA-compile
    …

    ProcX@Caller:
      if (calculate_PLV ≤ PLV_threshold)
        call ProcX
      else
        create_shared_stack_layout
        set_PHIT_for_ProcX
        send_broadcast_req
        set_timer
        wait_on_ack_or_timer
    …

    Broadcast_ack_ISR:
      if (statusX_PHIT == done)
        load_context&return_from_SSPX

Callee core:

    …
    Broadcast_req_ISR:
      ProcX@Callee = search_in_PHIT
      call ProcX@Callee
    …

    ProcX@Callee:
      if (calculate_PLV ≤ PLV_threshold)
        set_statusX_PHIT = running
        load_context&param_from_SSPX
        set_all_param&pointers
        call ProcX
        store_context_to_SSPX
        set_statusX_PHIT = done
        send_broadcast_ack
      else
        resume_normal_execution

Figure labels: Caller Core_i and Callee Core_k, each with an operating condition monitor and an interrupt controller; shared L1 I$ holding ProcX, ProcX@Caller, and ProcX@Callee; TCDM holding the shared local heap, stacks, shared stack, and PHIT.

• The code is easily accessible via the shared-L1 I$.
• The data and parameters are passed through the shared stack in TCDM (Tightly Coupled Data Memory).
• A procedure hopping information table (PHIT) keeps the status for a migrated procedure.

APPLY: MODEL, SENSE, AND ADAPT DYNAMICALLY

Combine Characterization with Online Recognition

Consider a Full Permutation of PVTA Parameters

• 10 32-bit integer and 15 single-precision FP functional units (FUs)

• For each FU_i working with t_clk under given PVTA variations, we define the Timing Error Rate (TER):

$$\mathrm{TER}(FU_i, t_{clk}, V, T, P, A) = \frac{\mathrm{CriticalPaths}(FU_i, t_{clk}, V, T, P, A)}{\mathrm{Paths}(FU_i)} \times 100$$

Figure: TER analysis flow. Labels: FUs Verilog; DesignWare libs; Design Compiler; IC Compiler; 45nm corner libs; 45nm process VA libs; netlist & SPEF; PrimeTime (SSTA & STA); variable parameters (voltage, temperature, process, Vth, tclk); timing error rate analysis; MATLAB linear classifier; parametric model.

Parameter        Start Point   End Point   Step    # of Points
Voltage          0.88V         1.10V       0.01V   23
Temperature      0°C           120°C       10°C    13
Process (σWID)   0%            9.6%        3.2%    4
Aging (∆Vth)     0mV           100mV       25mV    5
tclk             0.2ns         5.0ns       0.2ns   25

Parametric Model Fitting

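• We used supervised learning (linear discriminant analysis) to generate a parametric model at the FU level that relates PVTA parameter variations and t_clk to classes of TER.

• On average, across all FUs, the resubstitution error is 0.036, meaning the models classify nearly all data correctly.

• For extra characterization points, the model makes correct estimates for 97% of out-of-sample data. The remaining 3% is misclassified into the high-error-rate class, CH, and therefore still receives a safe guardband.

Figure: HFG ASIC analysis flow for TER, from (PVTA, tclk) to TER to TER class, using linear discriminant analysis as the parametric model.

Classes of TER:

TER range             TER class
TER = 0%              Class0 (C0)
0% < TER ≤ 33%        ClassLow (CL)
33% < TER ≤ 66%       ClassMedium (CM)
66% < TER ≤ 100%      ClassHigh (CH)

Linear discriminant analysis:

$$\hat{y} = \arg\min_{y=1,\dots,K} \sum_{k=1}^{K} \hat{P}(k \mid x)\, C(y \mid k)$$

$$P(x \mid k) = \frac{1}{(2\pi |\Sigma_k|)^{0.5}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{T} \Sigma_k^{-1} (x-\mu_k)\right)$$

$$\hat{P}(k \mid x) = \frac{P(x \mid k)\, P(k)}{P(x)}$$

$$\mathrm{cost}(k) = \sum_{i=1}^{K} \hat{P}(i \mid x)\, C(k \mid i)$$

Here x is the observation (PVTA parameters and t_clk), μ_k and Σ_k are the mean and covariance of class k, and C(y|k) is the cost of classifying an observation as y when its true class is k.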
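The TER definition and the four TER classes can be written down directly. Note that this sketch thresholds an already known TER value; it is not the linear discriminant model, which predicts the class from PVTA parameters and t_clk. The type and function names are assumptions for illustration.

```c
/* The four TER classes named on the slide. */
typedef enum { TER_C0, TER_CL, TER_CM, TER_CH } ter_class;

/* TER in percent: share of an FU's paths that are critical (failing)
 * at the given clock period and PVTA point. Returns 0 if paths == 0. */
double ter_percent(unsigned critical_paths, unsigned paths)
{
    return paths ? 100.0 * critical_paths / paths : 0.0;
}

/* Map a TER percentage to its class using the slide's boundaries:
 * 0%, (0, 33], (33, 66], (66, 100]. */
ter_class ter_classify(double ter)
{
    if (ter <= 0.0)  return TER_C0;
    if (ter <= 33.0) return TER_CL;
    if (ter <= 66.0) return TER_CM;
    return TER_CH;
}
```

Labels produced this way over the characterization grid are the kind of training targets a classifier such as LDA would be fitted to.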

Delay Variation and TER Characterization

• At design time, the delay of the FP adder has a large uncertainty of [0.73ns, 1.32ns], since the actual values of the PVTA parameters are unknown.

Figure: tree of delay (ns) as PVTA parameters become known, from (P,?,?,?) to (P,A,?,?) to (P,A,T,?) to (P,A,T,V), spanning P(σWID) = 0% to 9.6%, A(∆Vth) = 0mV to 100mV, T = 0°C to 120°C, and V = 0.88V to 1.10V.

Figure: timing error rate (%) versus temperature (°C) and VDD (V), in four panels: (P(σWID) = 0%, A(∆Vth) = 0mV); (P(σWID) = 0%, A(∆Vth) = 100mV); (P(σWID) = 9.6%, A(∆Vth) = 0mV); (P(σWID) = 9.6%, A(∆Vth) = 100mV).

Hierarchical Sensors Observability

• The question is: what mix of monitors would be useful?

• The more sensors we provide for an FU, the more its conservative guardband can be reduced.

Figure: t_clk (ns) for FP_exp, FP_add, and INT_mac with P_sensor, PA_sensors, PAT_sensors, and PATV_sensors.

• The guardband of the FP adder can be reduced by up to:
– 8% (P_sensor)
– 24% (PA_sensors)
– 28% (PAT_sensors)
– 44% (PATV_sensors)

In-situ PVT sensors impose 1−3% area overhead [Bowman’09].
Five replica PVT sensors increase area by 0.2% [Lefurgy’11].
The banks of 96 NBTI aging sensors occupy less than 0.01% of the core's area [Singh’11].

Online Utilization of Guardbanding

The control system tunes the clock frequency through an online model-based rule.

Figure labels: offline: TER raw data, classifier, parametric model, PATV_config, target_TER, LUTs with columns P, A, T, V, tclk; online: PATV sensors of FU_i, FU_j, FU_k reporting P (2-bit), A (3-bit), T (3-bit), V (3-bit); instruction; max; tclk (5-bit); CLK control; GPU SIMD IF.

1. Fine-grained: instruction-by-instruction monitoring and adaptation that uses signals of PATV sensors from individual FUs

2. Coarse-grained: kernel-level monitoring that uses representative PATV sensors for the entire execution stage of the pipeline
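The bit widths on this slide (P 2-bit, A 3-bit, T 3-bit, V 3-bit, tclk 5-bit) suggest a lookup table addressed by an 11-bit sensor word. The sketch below packs such an index and decodes a 5-bit clock code; the field order within the index and the code-to-period mapping (25 codes covering 0.2 ns to 5.0 ns in 0.2 ns steps, matching the characterization range) are assumptions made for the example.

```c
/* Pack quantized sensor readings into an 11-bit LUT index.
 * Assumed layout, most to least significant: P[1:0] A[2:0] T[2:0] V[2:0].
 * Inputs are masked to their field widths. */
unsigned patv_index(unsigned p, unsigned a, unsigned t, unsigned v)
{
    return ((p & 0x3u) << 9) | ((a & 0x7u) << 6) | ((t & 0x7u) << 3) | (v & 0x7u);
}

/* Decode a 5-bit tclk code into picoseconds, assuming code k means
 * 0.2 ns * (k + 1). Codes above 24 are clamped to the longest period. */
unsigned tclk_code_to_ps(unsigned code)
{
    if (code > 24u)
        code = 24u;
    return 200u * (code + 1u);
}
```

With this layout the table has 2^11 = 2048 entries of 5 bits each per FU; whether the actual design stores the table this way is not stated in the source.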

Throughput benefit of HFG

Figure: throughput (GIPS) with P_sensor, PA_sensors, PAT_sensors, and PATV_sensors, in two charts.

Kernel-level monitoring improves throughput by 70% from P to PATV sensors (target TER = 0).

Instruction-level monitoring improves throughput by 1.8-2.1X.

PUTTING IT TOGETHER: COORDINATED ADAPTATION TO PROPAGATE ERRORS TOWARDS APPLICATION

Consider shared 8-FPU 16-core architectures


Accurate, Approximate Operating Modes

• Accurate mode: every pipeline uses the following (with 3.8% area overhead)
– EDS circuit sensors to detect any timing errors
– ECU to correct errors using a multiple-issue operation replay mechanism (without changing frequency)

Figure labels: slave port with signals opmode, oprt, Opnd1&2, res, done; ADD/SUB pipe, MUL pipe, and DIV pipe, each with EDS + ECU over stages S1, S2, … (S18 shown) and an FLV output.

Modeled after STM P2012 16-core machine

Accuracy-Configurable Architecture

• In the approximate mode
– The pipeline disables the EDS sensors on the N least significant bits of the fraction, where N is reprogrammable through a memory-mapped register.

– The sign and the exponent bits are always protected by EDS.

– Thus the pipeline ignores any timing error within the N least significant bits of the fraction and saves on the recovery cost.

• Switching between modes partially disables or enables the error detection circuits on N bits of the fraction, so the FP pipeline can efficiently execute subsequent interleaved accurate or approximate software blocks.
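The rule for the approximate mode can be stated as a predicate on the bit position where a timing error is detected. The sketch assumes the IEEE 754 single-precision layout (bit 31 sign, bits 30 to 23 exponent, bits 22 to 0 fraction) and assumes, purely for illustration, that the detection logic reports a single bit position; the function name is hypothetical.

```c
#include <stdbool.h>

/* Decide whether a detected timing error must be recovered.
 *   bit         : position of the erroneous result bit, 0 (LSB) to 31 (sign)
 *   n           : programmed error significance threshold (number of
 *                 unprotected least significant fraction bits)
 *   approximate : true when the pipeline is in approximate mode
 * In accurate mode every error is recovered. In approximate mode, errors in
 * the n least significant fraction bits are ignored; the exponent and sign
 * always remain protected because n is capped at the 23-bit fraction width. */
bool fp_error_needs_recovery(unsigned bit, unsigned n, bool approximate)
{
    if (!approximate)
        return true;
    if (n > 23u)
        n = 23u;
    return bit >= n;
}
```

With N = 20, as in the Gaussian filter example, an error at fraction bit 5 is ignored while one at bit 20 or in the exponent still triggers recovery.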

Fine-grain Interleaving Possible Through Coordination and Controlled Approximation

Architecture: accuracy-reconfigurable FPUs that are shared among tightly-coupled processors and support online FPV characterization

Compiler: OpenMP pragmas for approximate FP computations; profiling technique to identify tolerable error significance and error rate

Runtime: the scheduler utilizes FPV metadata and promotes FPUs to accurate mode, or demotes them to approximate mode, depending upon the code region requirements.

Either ignore the timing errors (in approximate regions) or reduce the frequency of errors by assigning computations to correctable hardware resources, at a cost.

Ensure that ignoring errors is safe through a set of rules.

FP Vulnerability Dynamically Monitored and Controlled by ECU

• The percentage of cycles with timing errors, as reported by EDS sensors, is captured as FPV metadata

• Metadata is visible to the software through memory-mapped registers.

• Enables the runtime scheduler to perform online selection of the best FP pipeline candidates
– Low-FPV units for accurate blocks, or steer errors without correction to the application.
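Selecting a pipeline for an accurate block can be sketched as picking the unit with the lowest reported FPV. The array-of-doubles view of the metadata and the function name are assumptions for the example; in the described system the values would be read from memory-mapped registers, which is not modeled here.

```c
#include <stddef.h>

/* Return the index of the FP unit with the lowest FPV (fraction of cycles
 * with timing errors). Ties go to the lowest index. Returns 0 when n == 0,
 * so callers should pass at least one unit. */
size_t select_lowest_fpv(const double *fpv, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (fpv[i] < fpv[best])
            best = i;
    }
    return best;
}
```

A scheduler following the slide would use such a choice for accurate blocks, since fewer detected errors mean fewer recovery replays, and could hand the remaining units to approximate blocks.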

OpenMP Compiler Extension

Directive syntax:

    #pragma omp accurate
        structured-block
    #pragma omp approximate [clause]
        structured-block

Clause: error_significance_threshold (<value N>)

    #pragma omp parallel
    {
      #pragma omp accurate
      #pragma omp for
      for (i = K/2; i < (IMG_M - K/2); ++i) {        // iterate over image
        for (j = K/2; j < (IMG_N - K/2); ++j) {
          float sum = 0;
          int ii, jj;
          for (ii = -K/2; ii <= K/2; ++ii) {         // iterate over kernel
            for (jj = -K/2; jj <= K/2; ++jj) {
              float data = in[i+ii][j+jj];
              float coef = coeffs[ii+K/2][jj+K/2];
              float result;
              #pragma omp approximate error_significance_threshold(20)
              {
                result = data * coef;
                sum += result;
              }
            }
          }
          out[i][j] = sum/scale;
        }
      }
    }

Code snippet for Gaussian filter utilizing OpenMP variability-aware directives

The two statements in the approximate block are lowered to runtime calls:

    // result = data * coef;
    int ID = GOMP_resolve_FP (GOMP_APPROX, GOMP_MUL, 20);
    GOMP_FP (ID, data, coef, &result);

    // sum += result;
    int ID = GOMP_resolve_FP (GOMP_APPROX, GOMP_ADD, 20);
    GOMP_FP (ID, sum, result, &sum);

Annotations on the calls: invokes the runtime FPU scheduler; programs the FPU.

FPV Metadata can even drive synthesis!

Figure: design flow. Source code is annotated with OpenMP approximate directives; profiling on input data and controlled approximation analysis relate error rate and error significance to fidelity (PSNR), yielding an error significance threshold (N) and an error rate threshold; approximate-aware timing constraint generation feeds design-time hardware FPU synthesis & optimization; the runtime library scheduler is also shown.

– Tight timing: utilizing fast leaky standard cells (low-VTH) for these paths
– Relaxed timing (N bits): utilizing the regular and slow standard cells (regular-VTH and high-VTH) for the rest of the paths, since errors can be ignored!

Save Recovery Time, Energy using FPV Monitoring (TSMC 45nm)

• Error-tolerant applications: Gaussian, Sobel filters
– PSNR results show the error significance threshold at N=20 while maintaining >30 dB

• 36% more energy-efficient FPUs, recovery cycles reduced by 46%

• 5 kernel codes as error-intolerant applications
– 22% average energy savings

Architectural parameters:

ARM v6 core          16            TCDM banks       16
I$ size (per core)   16KB          TCDM latency     2 cycles
I$ line              4 words       TCDM size        256 KB
Latency hit          1 cycle       L3 latency       ≥ 60 cycles
Latency miss         ≥ 59 cycles   L3 size          256MB
Shared-FPUs          8             FP ADD latency   2
FP MUL latency       2             FP DIV latency   18

Expedition Experimental Platforms & Artifacts

• Interesting and unique challenges in building research testbeds that drive our explorations
– Mock-ups don’t go far, since variability is at the heart of microelectronic scaling. We need platforms that capture scaling and integration aspects.

• Testbeds to observe (Molecule, GreenLight, Ming) and to control (Oven, ERSA)

Ming the Merciless

ERSA@BEE3

Molecule

Red Cooper

Red Cooper Testbed

• Customized chip with processor + speed/leakage sensors
• Testbed board to finish the sensor feedback loop on board
• Used in building a duty-cycled OS based on variability sensors

Figure labels: variability sources (Vendor, Process, Ambient, Aging); stack layers (Applications; Runtime; Microarchitecture and Compilers; CPU, Mem, Storage, Accelerators; Energy Source (Batteries), Network); manifestations (Power, Performance, Errors).

Ferrari Chip: Closing Loop On-Chip

Figure: chip block diagram. Labels: ARM Cortex-M3; AMBA bus; JTAG; GPIO; Timers; PLL; RO CLK; Config; 64 kB IMEM; 176 kB DMEM; ECC; Counters; 8 banks of sensors (N/P Leak, Temp, Oxide); 19 DDROs; SensOut; DUT device; Ref. device.

• On-chip sensors
– Memory-mapped I/O and control
– Leakage sensors, DDROs, temperature sensors, reliability sensors

• Better support for OS and software

Available since April 2013

Sense-and-Adapt Fundamentally Alters The Stack

• Machines that consist of parts with variations in performance, power, and reliability

• Machines that incorporate sensing circuits

• Machines with interfaces to change ongoing computation & structures

• New machine models: QoS or Relaxed Reliability parts

Thank You!

http://variability.org

The Variability Expedition

An NSF Expeditions in Computing Project

Rajesh K. Gupta
Nikil Dutt, UCI
Puneet Gupta, UCLA
Mani Srivastava, UCLA
Steve Swanson, UCSD
Lara Dolecek, UCLA
Subhasish Mitra, Stanford
YY Zhou, UCSD
Tajana Rosing, UCSD
Alex Nicolau, UCI
Ranjit Jhala, UCSD
Sorin Lerner, UCSD
Rakesh Kumar, UIUC
Dennis Sylvester, UMich
Yuvraj Agarwal, CMU
Lucas Wanner, UCLA
