

Task 2410.001 CLASH - Cross-Layer Accelerated Self-Healing: Circadian Rhythms for Resilient Electronic Systems

Xinfei Guo, Alec Roelke, Mircea R. Stan

ECE Department, University of Virginia, Charlottesville, VA

Overview

Wearout Issues
BTI, HCI, TDDB, EM, etc.
Increase design margin and worsen metrics
Cross-layer issues
Both reversible and irreversible parts

Previous Solutions
Tolerate - Design for the worst case
Compensate - Dynamically adapt to wearout
Slow - Reduce the stress during operation
Passive recovery

This Project
Repair wearout completely
Accelerated & Active Recovery
Circadian Rhythms for FULL recovery
Cross-Layer Implementations

Accelerated Self-Healing [DAC ’14]

[Figure: test setup, panels (a) and (b). Labels: Interface Board, Chip, Data Sampling, Thermal Chamber, Circuit Under Test (CUT), 75 LUTs, 16-b Counter; signals: fref, clk, in, En, rst, Cout.]

Main Idea

Sleep → Proactive Recovery

Some of the effects of wearout (e.g. BTI and EM) can be reversed by applying several techniques (high temperature, negative voltage, UV light, reverse current, etc.), which makes effective accelerated self-healing fundamentally possible.

Test Setup

Commercial 40nm FPGA chips

Accelerated Testing Methodology

Knobs: V, T, AC/DC, Sleep/Active

Results

“Sleep When Getting Tired” [ASPDAC ’16]

Main Idea

The boundary between reversible & irreversible is “soft”

Irreversible wearout can be recovered through acceleration

Frequency dependency of accelerated wearout & recovery

“Sleep when getting tired” to FULLY avoid the irreversible wearout

Negative “turbo” boost at the system level
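The "sleep when getting tired" idea can be illustrated with a toy simulation. This is only a sketch, not the model from the cited papers: the linear stress rate, the recovery rate, and the `soft_limit` past which wearout is treated as permanent are all hypothetical parameters, chosen to show why sleeping before the limit is reached keeps wearout fully recoverable.

```python
def simulate(active_hours, sleep_hours, cycles,
             stress_rate=1.0,     # hypothetical wearout units added per active hour
             recovery_rate=4.0,   # hypothetical units healed per sleep hour
             soft_limit=10.0):    # hypothetical level past which wearout hardens
    """Return (reversible, irreversible) wearout after repeating the schedule."""
    reversible = 0.0
    irreversible = 0.0
    for _ in range(cycles):
        for _ in range(active_hours):
            reversible += stress_rate
            # Toy "soft boundary": anything above the limit becomes permanent.
            if reversible > soft_limit:
                irreversible += reversible - soft_limit
                reversible = soft_limit
        for _ in range(sleep_hours):
            # Accelerated recovery only heals the reversible part.
            reversible = max(0.0, reversible - recovery_rate)
    return reversible, irreversible


# Same totals (48 active hours, 12 sleep hours), two different rhythms.
long_run = simulate(active_hours=48, sleep_hours=12, cycles=1)
short_runs = simulate(active_hours=8, sleep_hours=2, cycles=6)
print("one long run:  ", long_run)
print("six short runs:", short_runs)
```

Under these assumed parameters the single long run accumulates permanent wearout, while the schedule that sleeps early and often returns to the fresh state after every cycle.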

Results

Recovery under Different “circadian rhythms”

Reduction of Design Margin >60X

Average Performance Improvement: close to the fresh state through the whole lifetime

Circuit-level implementations

Negative Voltages

A charge-pump negative voltage generator

[Figure: charge-pump schematic. Labels: Vdd, clk1, clk2, C1, C2, Vout; phases: charging, charge redistribution.]

High Temperatures

Wearout-aware Power Gating

[Figure: eight cores (Core 1 to Core 8) with a shared L3 cache, annotated with "Zzzzzz..." and "Heat".]

Embeddable Wearout Sensors [SELSE ’15]

Metastable element based

Track both wearout and (accelerated) recovery

Track multiple paths simultaneously

Be aware of wearout-induced path reranking

Used together with multiple dynamic management techniques
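Path reranking is the reason a sensor has to track several paths rather than only the one that is critical when fresh. The sketch below illustrates this with made-up numbers: the path names, fresh delays, and degradation percentages are hypothetical and are not taken from the sensor design.

```python
def rank_paths(fresh_delay_ns, degradation):
    """Return path names ordered from slowest (most critical) to fastest."""
    aged = {name: delay * (1.0 + degradation[name])
            for name, delay in fresh_delay_ns.items()}
    return sorted(aged, key=aged.get, reverse=True)


# Hypothetical fresh delays (ns) and fractional slowdowns after wearout.
fresh = {"path0": 10.0, "path1": 9.6, "path2": 9.0}
no_wear = {"path0": 0.00, "path1": 0.00, "path2": 0.00}
wear = {"path0": 0.02, "path1": 0.10, "path2": 0.05}

fresh_rank = rank_paths(fresh, no_wear)
aged_rank = rank_paths(fresh, wear)
print("fresh:", fresh_rank)
print("aged: ", aged_rank)
```

With these assumed values, path1 overtakes path0 as the critical path once wearout is applied, so a sensor attached only to path0 would underestimate the timing degradation.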

[Figure: wearout sensor operation for path0 to path3. Labels: track, poll1, poll2, poll3, discharg, ref, outa, outb; percentages shown: 10%, 5%, 2%.]

A CLASH System [JVLSI]

[Figure: frequency vs. time (hours, days, years, end of life). Labels: Fresh, Negative “Turbo-boost”, Design Margin, Average Frequency, Wearout, No recovery, Active, Sleep.]

Utilize core redundancy in the dark silicon era

Optimal scheduling/load balancing

Introduce Accelerated & Active Recovery as a new design knob for cross-layer resilience

Tradeoff between lifetime, power, and performance
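One way to picture scheduling with redundant cores is a policy that keeps the least worn cores active and sends the rest to sleep for accelerated recovery. The sketch below is a hypothetical policy, not the scheduler described in the CLASH system: the `allocate` function, the normalized wearout readings, and the number of active cores are all assumptions for illustration.

```python
def allocate(wearout_by_core, active_needed):
    """Split cores into (active, sleeping) lists, keeping the least worn active."""
    # Sort by wearout reading, breaking ties by core id for determinism.
    ranked = sorted(wearout_by_core, key=lambda core: (wearout_by_core[core], core))
    active = sorted(ranked[:active_needed])
    sleeping = sorted(ranked[active_needed:])
    return active, sleeping


# Hypothetical normalized sensor readings for eight cores (0 = fresh).
readings = {1: 0.8, 2: 0.1, 3: 0.5, 4: 0.9, 5: 0.2, 6: 0.3, 7: 0.7, 8: 0.4}
active, sleeping = allocate(readings, active_needed=5)
print("active cores:  ", active)
print("sleeping cores:", sleeping)
```

Rerunning the allocation as readings change rotates cores between work and recovery, which is where the lifetime, power, and performance tradeoff shows up: more sleeping cores means more recovery but less throughput.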

[Figure: Old Scan Cell vs. New Scan Cell. Old cell: FF and MUX with signals D, Scan_in, SE, clk, rst, Q. New cell: adds SENSOR and sel. Other labels: Core, Close-loop, Open-loop, To user, Adaptive solutions or Accelerated Active Recovery.]

Embedded in an ASIC design flow

[Figure: power gating circuit. Labels: Sleep, Sleep En, vdd, vdd_high, Negative Voltage, Boosted Vdd Generator, Logic Blocks, Power Gating Block.]

2016 SLD Review

[Figure: heating element for accelerated recovery. Labels: 2X, 1X, Accelerated Recovery, Heating Element, Core Logic, output, Critical path.]

[Figure: negative voltage generator output waveform. Annotations: -300.6 mV, 638 ns, 4.36 mV, 4.33 mV. Ripple: 1.45%. Clock frequency = 66.7 MHz.]
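The waveform annotations are consistent with a simple ripple calculation. The sketch below assumes that 4.36 mV is the peak-to-peak ripple and that -300.6 mV is the settled output level; the poster does not state this interpretation, but under that reading the ratio reproduces the quoted ripple figure.

```python
v_out_mv = -300.6      # annotated output level (assumed to be the settled value)
ripple_pp_mv = 4.36    # annotated amplitude (assumed to be peak-to-peak ripple)
clock_mhz = 66.7       # annotated clock frequency

# Ripple as a percentage of the output magnitude.
ripple_percent = ripple_pp_mv / abs(v_out_mv) * 100.0

# Period of the pump clock in nanoseconds (1000 / MHz).
clock_period_ns = 1e3 / clock_mhz

print(f"ripple = {ripple_percent:.2f}%")
print(f"clock period = {clock_period_ns:.2f} ns")
```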

[Figure: measured frequency (MHz, axis from 24.5 to 26) during accelerated stress for 48 hours followed by accelerated & active recovery for 12 hours; annotation: 72.4%.]

An example where about 72.4% of wearout is recovered by accelerated self-healing techniques in only ¼ of the stress time (measured).
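A recovery percentage like this can be read as the share of the frequency loss that is regained. The sketch below assumes that definition, which the poster does not spell out, and uses placeholder frequencies picked inside the plot's 24.5 to 26 MHz axis range so that the arithmetic lands on 72.4%. They are not the measured data points. Only the 48-hour and 12-hour durations come from the figure.

```python
def recovered_fraction(f_fresh, f_stressed, f_recovered):
    """Share of the stress-induced frequency loss regained after recovery."""
    return (f_recovered - f_stressed) / (f_fresh - f_stressed)


# Placeholder frequencies in MHz (NOT measured values from the poster).
f_fresh = 26.0
f_stressed = 25.0
f_recovered = 25.724

frac = recovered_fraction(f_fresh, f_stressed, f_recovered)

# Durations taken from the figure labels.
stress_hours = 48
recovery_hours = 12
time_ratio = recovery_hours / stress_hours

print(f"recovered: {frac:.1%} in {time_ratio:.2f} of the stress time")
```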

Accelerated self-healing space exploration (model)

Core-level implementations

Active blocks healing inactive elements (e.g. Dark Silicon & Redundant Resources)

[Figure: NoC system with per-core sensors and routers. Labels: Core, Sensor, From sensors, To cores, Proactive Scheduler, Accelerated & Active Recovery, Apply to sleep cores, Sleep Cores, Active Cores, Applications, Core Allocation, Load Balancer, Heat for accelerated recovery.]

[Figure: cross-layer stack. Circuit: Wearout Sensors, Heating Elements, Negative Voltage Generator, -V Active Recovery EN, Accelerated Recovery EN. Architecture: Dark Silicon, Program Counters (+1), Redundant Resources. System: Virtual Sensors, Load Balancer, Proactive Scheduler.]

Values annotated in the plots: 23.2 min, 57.8 min, 344.3 min, 104.5 min

Figure captions:

Cross-layer Accelerated Self-Healing

Illustration of “Negative Turbo Boost”

A potential implementation of Cross-layer Accelerated Self-Healing in a NoC system

Recovery time after 12-hour constant stress under regular operation condition (no accelerated stress)