1
ID. 20 Towards Wearout-aware and Accelerated Self-healing Digital Systems Student: Xinfei Guo; Advisor: Mircea R. Stan ECE Department, University of Virginia, USA PHD FORUM Motivation Wearout Issues BTI, HCI, TDDB, EM, etc. More significant with extremely scaling technology Increase design margin and worsen metrics Cross-layer Issues Both Reversible and Irreversible Part Previous Solutions Design for the worst case (Guard band) Hard to predict wearout; The worst case becomes even worse; Power, performance and area (PPA) overhead. Track and monitor them, dynamically adapt to wearout Through the whole life time; The average case is skewed; Power, performance and area (PPA) overhead. Reduce the stress during operation Not applicable for high performance systems. Lack irreversible wearout solution The boundary between reversible and irreversible is unclear Let the system sleep when it gets tired Completely Avoid Irreversible Wearout [2] Reversible vs. Irreversible Wearout Permanent part of wearout exits even for BTI Majority of the trapped electrons are at low energy Trap energy is much higher (~2eV) Fast traps: Reversible; Slow traps: Irreversible Natural Recovery vs. Accelerated Recovery Natural Recovery: 0V, Room Temperature Accelerated Recovery: High Temperature, Negative Voltage, Other energy sources (e.g. UV) The irreversible part can be recovered! The boundary is “soft” and controllable! Circadian Rhythms: Completely Avoid Irreversible Wearout! The irreversible wearout is totally gone under an optimized sleep/active ratio Reduced Design Margin (O(ln(days)) vs. O(ln(years))) Improved average performance increase with time! & 0 0 0 Q Q SET CL R S R Slow! Timing Error! Failure! High Power! Vth Transistor State ON OFF Time Vth Net Increase Stress Recovery Stress Recovery Stress Recovery Gate Oxide Trapping De-Trapping Traps Charge Carrier Body Source Drain Accelerated Self-Healing [1] Main Idea Sleep should be used as an active recovery period for future electronics. Electronic systems will benefit from such sleep periods with active rejuvenation during which some of the effects of wearout (e.g. BTI) can be reversed by several techniques (high temperature, negative voltage, UV light, reverse current, etc.), thus leading to effective self-healing. Test Setup Commercial 40nm FPGA chips Accelerated Testing Methodology Knobs: V, T, AC/DC, Sleep/Active Measurement Results and On-chip solutions Recovery rate increased significantly Utilize “Dark Silicon On-Chip Negative Voltage On-chip reconfigurable elements Reversible Weaout Irreversible Weaout Boundary? Cross-layer Optimization Infrastructure Modeling Circuit-level: Transient Simulation, be compatible with circuit simulators (e.g. SPICE, Spectre) [4]; Architecture-level: physically aware parameterized high-level modes that are integrated with simulators like gem5; System-level: Optimized scheduling algorithms that trade off between lifetime and other metrics, like energy efficiency. [1] X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of Accelerated Self- Healing Techniques,Proc. of ACM/IEEE Design Automation Conference (DAC), June, 2014. [2] X. Guo, M. Stan, “Let the system sleep before getting tired Avoid irreversible wearout by periodic accelerated rejuvenation, Submitted. [3] X. Guo, M. Stan, “MCPENS: Multiple-Critical-Path Embeddable NBTI Sensors for Dynamic Wearout Management,” IEEE Workshop on Silicon Errors in LogicSystem Effects (SELSE-11), April, 2015. [4] A. Roelke, X. Guo, M. Stan, “A SPICE-Compatible BTI Transient Model Considering Accelerated Self-healing,” Ongoing. [5] M. Stan, X. Guo, A. Roelke, “Modeling and Experimental Demonstration of Accelerated Self- Healing Techniques in CMOS Circuits,” Proc. of GOMAC Tech, March, 2015. Case Measurement(%) Model(%) 20°C and 0V 0.66% 1% 20°C and -0.3V 16.7% 14.4% 110°C and 0V 28.7% 29.6% 110°C and -0.3V 72.4% 70% 16-b Counter fref clk in Cout 16 En En 75 LUTs Circuit Under Test (CUT) rst ref out osc d f C f T 4 1 2 1 Test configuration Measurement Setup Thermal Chamber (Chip Inside) Motherboard Data Sampling Chip FPGA Board and Interface Board FPGA Chip To FPGA Programmer To Mother Board Programmer To PC Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 7 Shared L3 Cache Core 8 Zzzzzz... Zzzzzz... Heat Heat Heat Heat Heat Heat On-chip Solution 3: Utilize Dark Silicon in Multicore Logic vdd vdd_high On-chip Solution 1: Negative voltage to recover header On-chip Solution 2: Reconfigurable self-healing tree Logic Block Logic Block Logic Block Logic Block Logic Block Logic Block Self-healing Tree Reconfigurable Self-heating Blocks Probability 19.072 19.082 19.092 19.102 19.112 19.122 19.132 19.142 0 100 200 300 400 Frequency(MHz) Recovery Time (min) 110 C and -0.3V 0V and Room Temperature(20C) Fresh Irreversible Part from Accelerated Recovery Irreversible Part from Regular Recovery Irreversible Part under different recovery conditions 18.95 19 19.05 19.1 0 200 400 600 800 1000 1200 1400 Frequency(MHz) Time (min) 1 hr vs. 1 hr 4 hrs. vs. 4 hrs 2hrs vs. 2hrs 6 hrs vs. 6 hrs 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 1 hr vs. 1 hr Case 2 hrs vs. 2 hrs. Case 4 hrs vs. 4 hrs Case 6 hrs vs. 6 hrs Case Permanent Part (MHz) Electron Energy Distribution at Room Temperature 19.03 19.04 19.05 19.06 19.07 19.08 19.09 19.1 0 100 200 300 400 500 600 700 800 Frequency(MHz) Time (min) Stress for 6 hours Reversible Part Recovered Part Accelerated Recovery for 6 hours Sequentiality of reversible and irreversible wearout Irreversible wearout for different cases 19.02 19.03 19.04 19.05 19.06 19.07 19.08 6 hr vs. 6 hr 4 hrs. vs. 4 hrs 2 hrs vs. 2 hrs 1 hr vs. 1 hr Average Frequency(MHZ) Average Frequency for 1 Day 6 hr vs. 6 hr 4 hrs. vs. 4 hrs 2 hrs vs. 2 hrs 1 hr vs. 1 hr Average Frequency for 2 Days Device Level Circuit Level Architecture Level System Level Accelerated Self-healing Embeddable Wearout Sensors [3] Track both wearout and recovery Small, fast and accurate Wearout-induced Path Re-ranking Adaptive Solutions DVFS Body Bias Circadian Rhythms Core 2 Core 1 Accelerated Self-healing MCPENS Path<N:0> MCPENS Path<N:0> Core 3 MCPENS Path<N:0> Core 5 Core 4 MCPENS Path<N:0> MCPENS Path<N:0> Core 6 MCPENS Path<N:0> A Wearout-aware and Accelerated Self-healing Robust System 24.5 24.7 24.9 25.1 25.3 25.5 25.7 25.9 26.1 Frequency(MHz) Wearout for 48 hours Accelerated Recovery for 12 hours Illustration of wearout vs. accelerated self-healing 72.4% ~ kT (0.026eV) at room temperature Energy that causes detrapping

100cm by 100cm Poster Template - people.Virginia.EDUpeople.virginia.edu/~xg2dt/papers/DAC PhD Forum Poster_Xinfei_Mircea.pdf · [2] X. Guo, M. Stan, “Letthe system sleep before

Embed Size (px)

Citation preview

Page 1: 100cm by 100cm Poster Template - people.Virginia.EDUpeople.virginia.edu/~xg2dt/papers/DAC PhD Forum Poster_Xinfei_Mircea.pdf · [2] X. Guo, M. Stan, “Letthe system sleep before

ID. 20 Towards Wearout-aware and Accelerated

Self-healing Digital SystemsStudent: Xinfei Guo; Advisor: Mircea R. Stan

ECE Department, University of Virginia, USA

PHD FORUM

MotivationWearout Issues BTI, HCI, TDDB, EM, etc.

More significant with extremely scaling technology

Increase design margin and worsen metrics

Cross-layer Issues

Both Reversible and Irreversible Part

Previous Solutions Design for the worst case (Guard band)

Hard to predict wearout;

The worst case becomes even worse;

Power, performance and area (PPA) overhead.

Track and monitor them, dynamically adapt to wearout

Through the whole life time;

The average case is skewed;

Power, performance and area (PPA) overhead.

Reduce the stress during operation

Not applicable for high performance systems.

Lack irreversible wearout solution

The boundary between reversible and irreversible is unclear

Let the system sleep when it gets tired –

Completely Avoid Irreversible Wearout [2]Reversible vs. Irreversible Wearout Permanent part of wearout exits even for BTI

Majority of the trapped electrons are at low energy

Trap energy is much higher (~2eV)

Fast traps: Reversible; Slow traps: Irreversible

Natural Recovery vs. Accelerated Recovery Natural Recovery: 0V, Room Temperature

Accelerated Recovery: High Temperature,

Negative Voltage, Other energy sources (e.g. UV)

The irreversible part can be recovered!

The boundary is “soft” and controllable!

Circadian Rhythms: Completely Avoid Irreversible Wearout! The irreversible wearout is totally gone under an optimized sleep/active ratio

Reduced Design Margin (O(ln(days)) vs. O(ln(years)))

Improved average performance increase with time!

&0

0

0

Q

QSET

CLR

S

R

Slow!

Timing Error!

Failure!High Power!

∆V

th

Tra

nsis

tor

Sta

te

ON

OFF

Time

Vth Net Increase

Stress Recovery Stress Recovery Stress Recovery

Gate

Oxide

Trapping De-Trapping

Traps

Charge

Carrier

Body

Source Drain

Accelerated Self-Healing [1]Main Idea

Sleep should be used as an active recovery period for future electronics.

Electronic systems will benefit from such sleep periods with active

rejuvenation during which some of the effects of wearout (e.g. BTI) can be

reversed by several techniques (high temperature, negative voltage, UV light,

reverse current, etc.), thus leading to effective self-healing.

Test Setup

Commercial 40nm FPGA chips

Accelerated Testing Methodology

Knobs: V, T, AC/DC, Sleep/Active

Measurement Results and On-chip solutions

Recovery rate increased significantly

Utilize “Dark Silicon”

On-Chip Negative Voltage

On-chip reconfigurable elements

Reversible Weaout Irreversible Weaout

Boundary?

Cross-layer Optimization InfrastructureModelingCircuit-level: Transient Simulation, be compatible

with circuit simulators (e.g. SPICE, Spectre) [4];

Architecture-level: physically aware

parameterized high-level modes that

are integrated with simulators like gem5;

System-level: Optimized scheduling

algorithms that trade off between lifetime

and other metrics, like energy efficiency.

[1] X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of Accelerated Self-

Healing Techniques,” Proc. of ACM/IEEE Design Automation Conference (DAC), June, 2014.

[2] X. Guo, M. Stan, “Let the system sleep before getting tired – Avoid irreversible wearout by

periodic accelerated rejuvenation, ” Submitted.

[3] X. Guo, M. Stan, “MCPENS: Multiple-Critical-Path Embeddable NBTI Sensors for

Dynamic Wearout Management,” IEEE Workshop on Silicon Errors in Logic–System

Effects (SELSE-11), April, 2015.

[4] A. Roelke, X. Guo, M. Stan, “A SPICE-Compatible BTI Transient Model Considering

Accelerated Self-healing,” Ongoing.

[5] M. Stan, X. Guo, A. Roelke, “Modeling and Experimental Demonstration of Accelerated Self-

Healing Techniques in CMOS Circuits,” Proc. of GOMAC Tech, March, 2015.

Case Measurement(%) Model(%)

20°C and 0V 0.66% 1%

20°C and -0.3V 16.7% 14.4%

110°C and 0V 28.7% 29.6%

110°C and -0.3V 72.4% 70%

16-b

Counter

fref clk

in

Cout16

EnEn

75 LUTs

Circuit Under Test (CUT)rst

refoutosc

dfCf

T4

1

2

1

Test configuration

Measurement Setup

Thermal Chamber

(Chip Inside)

Motherboard

Data Sampling

Chip

FPGA Board and Interface Board

FPGA Chip

To FPGA

Programmer

To Mother Board

ProgrammerTo PC

Core 6

Core 1 Core 2 Core 3 Core 4

Core 5 Core 7

Shared L3 Cache

Core 8

Zzzzzz...

Zzzzzz...

Heat Heat

Heat

Heat

Heat Heat

On-chip Solution 3: Utilize Dark Silicon in Multicore

Logic

vdd

vdd_high

On-chip Solution 1: Negative

voltage to recover headerOn-chip Solution 2:

Reconfigurable self-healing tree

Logic

Block

Logic

Block

Logic

Block

Logic

Block

Logic

Block

Logic

BlockSelf-healing Tree

Reconfigurable

Self-heating Blocks

Pro

bab

ilit

y

19.072

19.082

19.092

19.102

19.112

19.122

19.132

19.142

0 100 200 300 400

Fre

qu

ency

(MH

z)

Recovery Time (min)

110 C and -0.3V 0V and Room Temperature(20C) Fresh

Irreversible Part from

Accelerated Recovery

Irreversible Part from

Regular Recovery

Irreversible Part under different recovery conditions

18.95

19

19.05

19.1

0 200 400 600 800 1000 1200 1400

Fre

qu

ency

(MH

z)

Time (min)

1 hr vs. 1 hr 4 hrs. vs. 4 hrs 2hrs vs. 2hrs 6 hrs vs. 6 hrs

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

1 hr vs. 1 hr

Case

2 hrs vs. 2 hrs.

Case

4 hrs vs. 4 hrs

Case

6 hrs vs. 6 hrs

Case

Per

man

ent

Par

t

(MH

z)

Electron Energy Distribution at

Room Temperature

19.03

19.04

19.05

19.06

19.07

19.08

19.09

19.1

0 100 200 300 400 500 600 700 800

Fre

qu

ency

(MH

z)

Time (min)

Stress for 6 hours

Reversible

Part Recovered

Part

Accelerated Recovery for 6 hours

Sequentiality of reversible and irreversible wearout

Irreversible wearout for different cases19.02

19.03

19.04

19.05

19.06

19.07

19.08

6 hr vs. 6

hr

4 hrs. vs.

4 hrs

2 hrs vs.

2 hrs

1 hr vs. 1

hr

Av

erag

e F

req

uen

cy(M

HZ

) Average Frequency for 1 Day

6 hr vs. 6

hr

4 hrs. vs.

4 hrs

2 hrs vs.

2 hrs

1 hr vs. 1

hr

Average Frequency for 2 Days

Device Level

Circuit Level

Architecture Level

System Level

Accelerated

Self-healing

Embeddable Wearout Sensors [3]• Track both wearout and recovery

• Small, fast and accurate

• Wearout-induced Path Re-rankingAdaptive Solutions

DVFS

Body BiasCircadian

Rhythms

Core 2Core 1

Accelerated

Self-healing

MCPENS

Path<N:0>

MCPENS

Path<N:0>

Core 3

MCPENS

Path<N:0>

Core 5Core 4

MCPENS

Path<N:0>

MCPENS

Path<N:0>

Core 6

MCPENS

Path<N:0>

A Wearout-aware and

Accelerated Self-healing Robust System

24.5

24.7

24.9

25.1

25.3

25.5

25.7

25.9

26.1

Fre

qu

ency

(MH

z)

Wearout for 48 hours

Accelerated

Recovery for

12 hours

Illustration of wearout vs. accelerated self-healing

72.4%

~ kT (0.026eV) at

room temperature

Energy that causes

detrapping