Task 2410.001 CLASH - Cross-Layer Accelerated Self-Healing: Circadian Rhythms for Resilient Electronic Systems

Xinfei Guo, Alec Roelke, Mircea R. Stan

ECE Department, University of Virginia, Charlottesville, VA

Overview

Wearout issues: BTI (bias temperature instability), HCI (hot-carrier injection), TDDB (time-dependent dielectric breakdown), EM (electromigration), etc.

Increase design margin and worsen metrics

Cross-layer issues

Both reversible and irreversible parts

Previous solutions

Tolerate: design for the worst case

Compensate: dynamically adapt to wearout

Slow: reduce the stress during operation

Passive recovery

This project: repair wearout completely

Accelerated & active recovery

Circadian rhythms for FULL recovery

Cross-layer implementations

Accelerated Self-Healing [DAC ’14]

[Figure (a), (b): test setup with interface board, chip, thermal chamber, and data sampling; circuit under test (CUT, 75 LUTs) with a 16-bit counter, reference clock, enable (En), and reset (rst)]

Main idea: Sleep → proactive recovery

Some effects of wearout (e.g., BTI and EM) can be reversed by applying techniques such as high temperature, negative voltage, UV light, or reverse current. This makes effective accelerated self-healing fundamentally possible.

Test setup: commercial 40 nm FPGA chips

Accelerated testing methodology

Knobs: voltage (V), temperature (T), AC/DC stress, sleep/active mode
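The four knobs listed above define an accelerated test space. The minimal sketch below enumerates the full-factorial test matrix. The voltage and temperature values are hypothetical placeholders, not the settings used on the 40 nm FPGA setup.

```python
from itertools import product
from typing import NamedTuple


class Condition(NamedTuple):
    """One accelerated-test condition built from the four knobs."""
    voltage_v: float      # supply voltage in volts (hypothetical values below)
    temperature_c: float  # chamber temperature in degrees C (hypothetical values below)
    stress: str           # "AC" or "DC" stress
    mode: str             # "active" (stress) or "sleep" (recovery)


# Hypothetical knob settings, for illustration only.
VOLTAGES_V = [1.2, 1.4]
TEMPERATURES_C = [25.0, 110.0]
STRESS_TYPES = ["AC", "DC"]
MODES = ["active", "sleep"]


def build_test_matrix():
    """Return every combination of the V, T, AC/DC, and sleep/active knobs."""
    return [Condition(*combo)
            for combo in product(VOLTAGES_V, TEMPERATURES_C, STRESS_TYPES, MODES)]


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(len(matrix))  # 16 (2 x 2 x 2 x 2)
    for condition in matrix[:3]:
        print(condition)
```

A full factorial grows quickly. In practice some combinations (for example, AC vs. DC stress while a chip sleeps) may be redundant and could be pruned.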

Results

“Sleep When Getting Tired” [ASPDAC ’16]

Main idea: the boundary between reversible and irreversible wearout is “soft”

Irreversible wearout can be recovered through acceleration

Frequency dependency of accelerated wearout & recovery

“Sleep when getting tired” to FULLY avoid irreversible wearout

Negative “turbo boost” at the system level
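The “sleep when getting tired” idea can be illustrated with a deliberately simple toy model. The sketch assumes that degradation grows linearly while a chip is active and shrinks linearly during accelerated recovery, never going below the fresh state. Real wearout and recovery kinetics are not linear, and all rates and schedules here are hypothetical. The peak degradation stands in for the design margin a chip would need; the model does not reproduce the measured results.

```python
def peak_degradation(hours, active_hours, sleep_hours, wear_per_hour, heal_per_hour):
    """Toy model: return the peak degradation over `hours` of operation.

    The chip repeats `active_hours` of stress followed by `sleep_hours` of
    accelerated recovery. Degradation grows linearly while active and shrinks
    linearly while asleep, floored at 0 (the fresh state).
    Use sleep_hours=0 for the no-recovery baseline.
    """
    degradation = 0.0
    peak = 0.0
    period = active_hours + sleep_hours
    for hour in range(hours):
        if hour % period < active_hours:
            degradation += wear_per_hour          # stress while active
        else:
            degradation = max(0.0, degradation - heal_per_hour)  # recover while asleep
        peak = max(peak, degradation)
    return peak


if __name__ == "__main__":
    month = 30 * 24  # 720 hours; all parameters below are hypothetical
    no_recovery = peak_degradation(month, active_hours=24, sleep_hours=0,
                                   wear_per_hour=1.0, heal_per_hour=5.0)
    circadian = peak_degradation(month, active_hours=20, sleep_hours=4,
                                 wear_per_hour=1.0, heal_per_hour=5.0)
    print(no_recovery, circadian, no_recovery / circadian)  # 720.0 20.0 36.0
```

In this toy setting, a daily sleep period long enough to heal all of that day's wear keeps the peak bounded by one day of stress. Without recovery, the peak keeps growing over the whole simulated lifetime.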

Results

Recovery under different “circadian rhythms”

Reduction of design margin: >60X

Average performance improvement –

Close to the fresh state throughout the whole lifetime

Circuit-level implementations

Negative voltages: a charge-pump negative voltage generator

[Figure: charge pump with capacitors C1 and C2 fed from Vdd and clocked by clk1 and clk2; a charging phase and a charge-redistribution phase produce Vout]

High temperatures: wearout-aware power gating

[Figure: 8-core chip (Core 1 to Core 8) with a shared L3 cache; sleeping cores (“Zzzzzz...”) are exposed to heat]

Embeddable Wearout Sensors [SELSE ’15]

Metastable-element based

Tracks both wearout and (accelerated) recovery

Tracks multiple paths simultaneously

Aware of wearout-induced path reranking

Used together with multiple dynamic management schemes
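The sketch below shows why tracking several paths matters. It applies hypothetical per-path slowdowns to four paths, named path0 to path3 after the sensor figure labels, and re-ranks them by delay. The delays and slowdown fractions are invented for illustration. The point is only that a path that is not critical when fresh can become critical after wearout.

```python
# Hypothetical fresh path delays (ns) and wearout-induced slowdowns.
FRESH_DELAYS_NS = {"path0": 10.0, "path1": 9.8, "path2": 9.5, "path3": 9.0}
SLOWDOWN = {"path0": 0.02, "path1": 0.05, "path2": 0.10, "path3": 0.05}


def age_paths(fresh_delays_ns, slowdown):
    """Apply a fractional slowdown to each path (0.10 means 10% slower)."""
    return {path: delay * (1.0 + slowdown[path])
            for path, delay in fresh_delays_ns.items()}


def rank_paths(delays_ns):
    """Return path names from slowest (most critical) to fastest."""
    return sorted(delays_ns, key=delays_ns.get, reverse=True)


if __name__ == "__main__":
    aged = age_paths(FRESH_DELAYS_NS, SLOWDOWN)
    print(rank_paths(FRESH_DELAYS_NS))  # ['path0', 'path1', 'path2', 'path3']
    print(rank_paths(aged))             # ['path2', 'path1', 'path0', 'path3']
```

Suppose a sensor monitored only the fresh critical path (path0). In this example it would miss that path2 has become the slowest path after aging.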

[Figure: wearout sensor timing (track, poll1 to poll3, discharge, ref) for path0 to path3, with 10%, 5%, and 2% levels; outputs outa and outb]

A CLASH System [JVLSI]

[Figure: frequency vs. time (hours, days, years, end of life) comparing a fresh chip, wearout with no recovery, and active/sleep cycling (“negative turbo boost”), with the design margin and average frequency marked]

Utilize core redundancy in the dark-silicon era

Optimal scheduling/load balancing

Introduce accelerated & active recovery as a new design knob for cross-layer resilience

Tradeoff between lifetime, power, and performance
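The minimal sketch below shows how core redundancy can be turned into recovery time. It rotates an 8-core system in which only 6 cores must be active per epoch. The least-worn-first policy, the per-epoch wear and heal amounts, and the function name `schedule` are assumptions for illustration. This is not the proactive scheduler used in this work.

```python
def schedule(num_cores, num_active, epochs, wear, heal, rotate=True):
    """Return each core's accumulated wear after `epochs` scheduling rounds.

    Every round, `num_active` cores run the workload and gain `wear`; the
    remaining (dark) cores sleep and recover by `heal`, never below 0 (fresh).
    rotate=True picks the least-worn cores each round; rotate=False always
    uses cores 0..num_active-1 (a static allocation baseline).
    """
    state = [0.0] * num_cores
    for _ in range(epochs):
        if rotate:
            # Least-worn-first: tired cores are put to sleep to recover.
            active = set(sorted(range(num_cores), key=lambda c: state[c])[:num_active])
        else:
            active = set(range(num_active))
        for core in range(num_cores):
            if core in active:
                state[core] += wear
            else:
                state[core] = max(0.0, state[core] - heal)
    return state


if __name__ == "__main__":
    # Hypothetical: 8 cores, 6 needed at a time, 100 epochs.
    static = schedule(8, 6, 100, wear=1.0, heal=3.0, rotate=False)
    rotating = schedule(8, 6, 100, wear=1.0, heal=3.0, rotate=True)
    print(max(static), max(rotating))  # 100.0 3.0
```

With a static allocation, the same six cores age continuously and the two spare cores stay idle. Rotating lets every core spend part of its time recovering, which trades some scheduling overhead for a much lower worst-case wear.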

[Figure: old scan cell (MUX selecting Scan_in via SE, feeding a D flip-flop with clk, rst, and Q) vs. new scan cell with an added MUX (sel) and a SENSOR on the core]

[Figure: sensor outputs used closed-loop (adaptive solutions or accelerated active recovery) or open-loop (reported to the user)]

Embedded in an ASIC design flow

[Figure: power gating block: logic blocks with a sleep enable (SleepEn), a boosted Vdd (vdd_high), and a negative voltage generator]

2016 SLD Review

[Figure: accelerated recovery: heating element, core logic, critical path, and output]

[Figure: negative voltage generator output: Vout = -300.6 mV (638 ns marker); ripple 4.36 mV / 4.33 mV (1.45%); clock frequency = 66.7 MHz]
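Assume the ripple percentage is taken relative to the magnitude of the output voltage. Under that assumption, the 4.36 mV ripple on the -300.6 mV output reproduces the 1.45% figure above.

```python
def ripple_percent(ripple_mv, vout_mv):
    """Ripple as a percentage of the output-voltage magnitude."""
    return 100.0 * ripple_mv / abs(vout_mv)


if __name__ == "__main__":
    print(f"{ripple_percent(4.36, -300.6):.2f}%")  # 1.45%
```

The other reading (4.33 mV) gives about 1.44%. The reported 1.45% therefore appears to correspond to the 4.36 mV value.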

[Figure: frequency (MHz, axis 24.5 to 26) during 48 hours of accelerated stress followed by 12 hours of accelerated & active recovery; 72.4% recovered]

An example (measured) in which about 72.4% of the wearout is recovered by accelerated self-healing techniques in only ¼ of the stress time.
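One plausible way to express such a recovery percentage is the fraction of lost frequency that recovery regains. This assumes wearout is measured as frequency degradation, since the figure's axis is frequency in MHz. The frequencies below are hypothetical, not the measured data.

```python
def recovered_fraction(f_fresh_mhz, f_stressed_mhz, f_recovered_mhz):
    """Fraction of the stress-induced frequency loss regained by recovery."""
    lost = f_fresh_mhz - f_stressed_mhz
    regained = f_recovered_mhz - f_stressed_mhz
    return regained / lost


if __name__ == "__main__":
    # Hypothetical frequencies, not the measured data.
    print(recovered_fraction(26.0, 25.0, 25.5))  # 0.5
    # Recovery time relative to stress time: 12 h / 48 h.
    print(12 / 48)  # 0.25
```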

Accelerated self-healing space exploration (model)

Core-level implementations

Active blocks healing inactive elements (e.g., dark silicon & redundant resources)
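The sketch below is a toy version of “active blocks healing inactive elements.” It uses a hypothetical 2x4 floor plan of 8 cores and counts how many active neighbors each sleeping core has, as a crude proxy for the heat it receives. It then picks the sleeping set that maximizes this count for the least-heated sleeping core. The grid layout, core numbering, and neighbor-count heat proxy are all assumptions; a real implementation would use a thermal model.

```python
from itertools import combinations

# Hypothetical floor plan: 8 cores in a 2x4 grid, numbered 1..8 row by row.
ROWS, COLS = 2, 4


def neighbors(core):
    """Yield the cores directly above, below, left of, or right of `core`."""
    r, c = divmod(core - 1, COLS)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            yield nr * COLS + nc + 1


def heating_score(sleeping):
    """For each sleeping core, count its active neighbors (a crude heat proxy)."""
    return {core: sum(1 for n in neighbors(core) if n not in sleeping)
            for core in sleeping}


def pick_sleepers(num_sleeping):
    """Choose sleeping cores so that the least-heated one has the most active neighbors."""
    cores = range(1, ROWS * COLS + 1)
    return max(combinations(cores, num_sleeping),
               key=lambda group: min(heating_score(set(group)).values()))


if __name__ == "__main__":
    sleepers = pick_sleepers(2)
    print(sleepers)  # (2, 7)
    print(heating_score(set(sleepers)))
```

Placing sleeping cores apart and surrounded by active ones (here, two non-adjacent interior cores) gives each of them three heat-producing neighbors. Adjacent sleepers would shield each other.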

[Figure: NoC implementation: cores with sensors connected through routers; proactive scheduler & active recovery blocks perform core allocation and load balancing for applications, sending outputs to active cores and applying heat and negative voltage (-V) to sleep cores for accelerated recovery]

[Figure: cross-layer accelerated self-healing stack. Circuit: wearout sensors, heating elements, negative voltage generator, accelerated/active recovery enable (EN). Architecture: dark silicon, redundant resources, program counters. System: virtual sensors, load balancer, proactive scheduler]

[Figure data: recovery times of 23.2 min, 57.8 min, 344.3 min, and 104.5 min]

Cross-layer Accelerated Self-Healing

Illustration of “Negative Turbo Boost”

A potential implementation of cross-layer accelerated self-healing in a NoC system

Recovery time after 12 hours of constant stress under regular operating conditions (no accelerated stress)
