Transcript
Page 1: 100cm by 100cm Poster Template - people.virginia.edu/~xg2dt/papers/DAC PhD Forum Poster_Xinfei_Mircea.pdf

ID. 20 Towards Wearout-aware and Accelerated Self-healing Digital Systems

Student: Xinfei Guo; Advisor: Mircea R. Stan

ECE Department, University of Virginia, USA

PHD FORUM

Motivation

Wearout Issues: BTI, HCI, TDDB, EM, etc.

More significant with extreme technology scaling

Increase design margin and worsen metrics

Cross-layer Issues

Both Reversible and Irreversible Part

Previous Solutions

Design for the worst case (guard band)

Hard to predict wearout;

The worst case becomes even worse;

Power, performance and area (PPA) overhead.

Track and monitor them, dynamically adapt to wearout

Through the whole life time;

The average case is skewed;

Power, performance and area (PPA) overhead.

Reduce the stress during operation

Not applicable for high performance systems.

Lack of a solution for irreversible wearout

The boundary between reversible and irreversible wearout is unclear

Let the system sleep when it gets tired: Completely Avoid Irreversible Wearout [2]

Reversible vs. Irreversible Wearout

A permanent part of wearout exists even for BTI

Majority of the trapped electrons are at low energy

Trap energy is much higher (~2eV)

Fast traps: Reversible; Slow traps: Irreversible
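The two energy figures on the poster (thermal energy of about kT ≈ 0.026 eV at room temperature versus a trap energy of roughly 2 eV) can be put side by side with a short calculation. The sketch below assumes a plain Boltzmann factor exp(-E/kT) as a stand-in for how likely a trapped carrier is to escape; that model and the function names are illustrative assumptions, not the poster's own model.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def thermal_energy_ev(temp_c):
    """Return kT in eV for a temperature given in degrees Celsius."""
    return K_B_EV_PER_K * (temp_c + 273.15)

def boltzmann_factor(energy_ev, temp_c):
    """exp(-E/kT): assumed proxy for the chance of escaping a trap of depth E."""
    return math.exp(-energy_ev / thermal_energy_ev(temp_c))

TRAP_ENERGY_EV = 2.0  # "~2eV" from the poster

kt_room = thermal_energy_ev(20.0)   # natural recovery temperature
kt_hot = thermal_energy_ev(110.0)   # accelerated recovery temperature
gain = (boltzmann_factor(TRAP_ENERGY_EV, 110.0)
        / boltzmann_factor(TRAP_ENERGY_EV, 20.0))

print(f"kT at 20 C:  {kt_room:.4f} eV")
print(f"kT at 110 C: {kt_hot:.4f} eV")
print(f"Boltzmann-factor gain from heating: {gain:.2e}")
```

Under this assumed model, raising the temperature from 20°C to 110°C increases the escape factor for a 2 eV trap by many orders of magnitude, which is consistent in spirit with high temperature being listed as a recovery knob.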

Natural Recovery vs. Accelerated Recovery

Natural Recovery: 0V, Room Temperature

Accelerated Recovery: High Temperature, Negative Voltage, Other energy sources (e.g. UV)

The irreversible part can be recovered!

The boundary is “soft” and controllable!

Circadian Rhythms: Completely Avoid Irreversible Wearout!

The irreversible wearout is totally gone under an optimized sleep/active ratio

Reduced Design Margin (O(ln(days)) vs. O(ln(years)))

Improved average performance, which increases with time!
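To make the O(ln(days)) vs. O(ln(years)) comparison concrete, here is a minimal sketch. It assumes a log-time wearout model, shift = A * ln(1 + t / tau), with hypothetical values A = 1 and tau = 1 s, a one-day active period, and a ten-year lifetime; none of these parameters come from the poster.

```python
import math

def log_shift(stress_seconds, amplitude=1.0, tau_seconds=1.0):
    """Assumed log-time wearout model: shift = A * ln(1 + t / tau)."""
    return amplitude * math.log(1.0 + stress_seconds / tau_seconds)

DAY = 24 * 3600
YEAR = 365 * DAY

# With periodic rejuvenation, wearout only accumulates over one active period.
margin_with_sleep = log_shift(1 * DAY)
# Without it, the margin must cover wearout accumulated over the whole lifetime.
margin_worst_case = log_shift(10 * YEAR)

ratio = margin_with_sleep / margin_worst_case
print(f"margin with sleep / worst-case margin: {ratio:.2f}")
```

Because the growth is logarithmic, the saving depends strongly on the assumed tau and period lengths, so the ratio printed here is only indicative.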

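[Figure: consequences of wearout on a digital circuit: Slow! Timing Error! Failure! High Power!]

[Figure: ∆Vth vs. Time as the transistor state alternates between ON (Stress) and OFF (Recovery), leaving a net Vth increase]

[Figure: transistor cross-section (Gate, Oxide, Source, Drain, Body) showing Trapping and De-Trapping of charge carriers in traps]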

Accelerated Self-Healing [1]

Main Idea

Sleep should be used as an active recovery period for future electronics. Electronic systems will benefit from such sleep periods with active rejuvenation during which some of the effects of wearout (e.g. BTI) can be reversed by several techniques (high temperature, negative voltage, UV light, reverse current, etc.), thus leading to effective self-healing.

Test Setup

Commercial 40nm FPGA chips

Accelerated Testing Methodology

Knobs: V, T, AC/DC, Sleep/Active

Measurement Results and On-chip solutions

Recovery rate increased significantly

Utilize “Dark Silicon”

On-Chip Negative Voltage

On-chip reconfigurable elements

[Figure: Reversible Wearout | Boundary? | Irreversible Wearout]

Cross-layer Optimization Infrastructure

Modeling

Circuit-level: Transient simulation, compatible with circuit simulators (e.g. SPICE, Spectre) [4];

Architecture-level: physically aware parameterized high-level models that are integrated with simulators like gem5;

System-level: Optimized scheduling algorithms that trade off between lifetime and other metrics, like energy efficiency.
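As a rough illustration of the system-level idea, the sketch below rotates which cores sleep (and heal) in each epoch. It is a plain round-robin stand-in, not the optimized scheduling algorithm the poster refers to; the function name, the eight-core count, and the two-cores-per-epoch budget are hypothetical.

```python
def rotate_sleep_schedule(num_cores, sleeping_per_epoch, num_epochs):
    """Return, for each epoch, the set of core indices that sleep and heal."""
    schedule = []
    start = 0
    for _ in range(num_epochs):
        # Pick a contiguous window of cores, wrapping around the core list.
        asleep = {(start + i) % num_cores for i in range(sleeping_per_epoch)}
        schedule.append(asleep)
        start = (start + sleeping_per_epoch) % num_cores
    return schedule

schedule = rotate_sleep_schedule(num_cores=8, sleeping_per_epoch=2, num_epochs=4)

# Count how many healing periods each core received over the schedule.
sleep_counts = [sum(core in epoch for epoch in schedule) for core in range(8)]

for epoch, asleep in enumerate(schedule):
    print(f"epoch {epoch}: sleeping cores {sorted(asleep)}")
```

A real scheduler would weigh sensor readings, workload, and energy against lifetime; the round-robin only guarantees that every core gets the same number of healing periods.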

[1] X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques,” Proc. of ACM/IEEE Design Automation Conference (DAC), June, 2014.

[2] X. Guo, M. Stan, “Let the system sleep before getting tired – Avoid irreversible wearout by periodic accelerated rejuvenation,” Submitted.

[3] X. Guo, M. Stan, “MCPENS: Multiple-Critical-Path Embeddable NBTI Sensors for Dynamic Wearout Management,” IEEE Workshop on Silicon Errors in Logic–System Effects (SELSE-11), April, 2015.

[4] A. Roelke, X. Guo, M. Stan, “A SPICE-Compatible BTI Transient Model Considering Accelerated Self-healing,” Ongoing.

[5] M. Stan, X. Guo, A. Roelke, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques in CMOS Circuits,” Proc. of GOMAC Tech, March, 2015.

Case               Measurement (%)   Model (%)
20°C and 0V        0.66              1
20°C and -0.3V     16.7              14.4
110°C and 0V       28.7              29.6
110°C and -0.3V    72.4              70
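The agreement between measurement and model in the table above can be checked with a few lines of arithmetic. The sketch below copies the four rows as they appear, computes the absolute gap in percentage points for each case, and compares the strongest recovery condition with the weakest; the variable names are illustrative, and the table itself does not state what quantity the percentages measure.

```python
# (case, measured %, modeled %) copied from the table
rows = [
    ("20°C and 0V", 0.66, 1.0),
    ("20°C and -0.3V", 16.7, 14.4),
    ("110°C and 0V", 28.7, 29.6),
    ("110°C and -0.3V", 72.4, 70.0),
]

# Absolute gap between measurement and model, in percentage points.
abs_errors = {case: abs(measured - modeled) for case, measured, modeled in rows}
worst_case, worst_error = max(abs_errors.items(), key=lambda item: item[1])

# Ratio of the measured value at 110°C and -0.3V to the value at 20°C and 0V.
speedup = rows[-1][1] / rows[0][1]

for case, error in abs_errors.items():
    print(f"{case:<18} |measurement - model| = {error:.2f} points")
print(f"largest gap: {worst_case} ({worst_error:.2f} points)")
print(f"110°C and -0.3V vs. 20°C and 0V: {speedup:.0f}x")
```

The model stays within a few percentage points of the measurement in every case, and the combined high-temperature, negative-voltage condition measures roughly two orders of magnitude higher than the 20°C and 0V condition.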

[Figure: Test configuration. Circuit Under Test (CUT) of 75 LUTs with En and rst inputs and an osc output, a 16-b Counter (clk in, En, Cout16), and a reference frequency fref]

[Figure: Measurement Setup. Thermal Chamber (Chip Inside), Motherboard, Data Sampling Chip, FPGA Board and Interface Board, FPGA Chip; connection labels: To FPGA Programmer; To Mother Board; Programmer; To PC]

On-chip Solution 1: Negative voltage to recover header

[Figure: Logic block with vdd and vdd_high rails]

On-chip Solution 2: Reconfigurable self-healing tree

[Figure: six Logic Blocks, a Self-healing Tree, and Reconfigurable Self-heating Blocks]

On-chip Solution 3: Utilize Dark Silicon in Multicore

[Figure: Core 1 to Core 8 with a Shared L3 Cache; sleeping cores (Zzzzzz...) with Heat arrows around them]

[Figure: Irreversible Part under different recovery conditions. Frequency (MHz) vs. Recovery Time (min), 0 to 400. Curves: 110 C and -0.3V; 0V and Room Temperature (20C); Fresh. Annotations: Irreversible Part from Accelerated Recovery; Irreversible Part from Regular Recovery]

[Figure: Frequency (MHz) vs. Time (min), 0 to 1400, for four cases: 1 hr vs. 1 hr; 2 hrs vs. 2 hrs; 4 hrs vs. 4 hrs; 6 hrs vs. 6 hrs]

[Figure: Irreversible wearout for different cases. Permanent Part (MHz), 0 to 0.07, for the 1 hr vs. 1 hr, 2 hrs vs. 2 hrs, 4 hrs vs. 4 hrs, and 6 hrs vs. 6 hrs cases]

[Figure: Electron Energy Distribution at Room Temperature. Vertical axis: Probability]

[Figure: Sequentiality of reversible and irreversible wearout. Frequency (MHz) vs. Time (min), 0 to 800. Annotations: Stress for 6 hours; Accelerated Recovery for 6 hours; Reversible Part; Recovered Part]

[Figure: Average Frequency (MHz) for 1 Day and Average Frequency (MHz) for 2 Days, each for the 6 hr vs. 6 hr, 4 hrs vs. 4 hrs, 2 hrs vs. 2 hrs, and 1 hr vs. 1 hr cases]

[Figure: A Wearout-aware and Accelerated Self-healing Robust System. Layers: Device Level, Circuit Level, Architecture Level, System Level. Core 1 to Core 6, each with an MCPENS block labeled Path<N:0>; Accelerated Self-healing shown on the cores]

Embeddable Wearout Sensors [3]

• Track both wearout and recovery

• Small, fast and accurate

• Wearout-induced Path Re-ranking

Adaptive Solutions

DVFS

Body Bias

Circadian Rhythms

[Figure: Illustration of wearout vs. accelerated self-healing. Frequency (MHz), 24.5 to 26.1. Annotations: Wearout for 48 hours; Accelerated Recovery for 12 hours; 72.4%]

[Figure annotations for the electron energy distribution: ~kT (0.026 eV) at room temperature; Energy that causes detrapping]
