Oct. 2006 Defect Avoidance and Circumvention Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments


Oct. 2006 Defect Avoidance and Circumvention Slide 1

Fault-Tolerant Computing

Dealing with Low-Level Impairments


Oct. 2006 Defect Avoidance and Circumvention Slide 2

About This Presentation

Edition   Released    Revised   Revised
First     Oct. 2006

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami


Oct. 2006 Defect Avoidance and Circumvention Slide 3

Defect Avoidance and Circumvention


Oct. 2006 Defect Avoidance and Circumvention Slide 4


Oct. 2006 Defect Avoidance and Circumvention Slide 5 

Multilevel Model

Level         State            Impairment class
Component     Defective        Low-Level Impaired
Logic         Faulty           Low-Level Impaired
Information   Erroneous        Mid-Level Impaired
System        Malfunctioning   Mid-Level Impaired
Service       Degraded         High-Level Impaired
Result        Failed           High-Level Impaired

The unimpaired state is Ideal.

Legend: Initial Entry, Deviation, Remedy, Tolerance


Oct. 2006 Defect Avoidance and Circumvention Slide 6 

The Manufacturing Process for an IC Part

[Figure: process flow, with dimensions as labeled in the diagram]

Silicon crystal ingot (15-30 cm, 30-60 cm)

Slicer

Blank wafer with defects (0.2 cm)

Processing: 20-30 steps

Patterned wafer (100s of simple or scores of complex processors)

Dicer

Die (~1 cm)

Die tester

Good die (~1 cm)

Mounting

Microchip or other part

Part tester

Usable part to ship


Oct. 2006 Defect Avoidance and Circumvention Slide 7

Effect of Die Size on Yield

The dramatic decrease in yield with larger dies

[Figure: two wafers; 120 dies, 109 good vs. 26 dies, 15 good]

Die yield =def (Number of good dies) / (Total number of dies)

Die yield = Wafer yield × [1 + (Defect density × Die area) / a]^(-a)

Die cost = (Cost of wafer) / (Total number of dies × Die yield) = (Cost of wafer) × (Die area / Wafer area) / (Die yield)

The parameter a ranges from 3 to 4 for modern CMOS processes

Shown are some random defects; there are also bulk or clustered defects that affect a large region
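
The yield and cost formulas above can be tried out numerically. The following is a minimal sketch, not part of the original slides: the function names, the defect density of 0.5 per cm^2, the perfect wafer yield, and the choice a = 3 are all assumptions made for illustration.

```python
def die_yield(wafer_yield, defect_density, die_area, a=3.0):
    """Die yield = wafer yield * [1 + (defect density * die area) / a] ** (-a)."""
    return wafer_yield * (1.0 + defect_density * die_area / a) ** (-a)


def die_cost(wafer_cost, wafer_area, die_area, yield_value):
    """Die cost = wafer cost * (die area / wafer area) / die yield."""
    return wafer_cost * (die_area / wafer_area) / yield_value


# Hypothetical process: 0.5 defects per cm^2, wafer yield of 1, a = 3.
small = die_yield(1.0, 0.5, 1.0)   # 1 cm^2 die
large = die_yield(1.0, 0.5, 4.0)   # 4 cm^2 die
print(f"yield, 1 cm^2 die: {small:.3f}")
print(f"yield, 4 cm^2 die: {large:.3f}")

# Observed yields for the two example wafers (good dies / total dies).
print(f"120-die wafer: {109 / 120:.3f}")
print(f"26-die wafer:  {15 / 26:.3f}")
```

Under these assumed numbers, quadrupling the die area cuts the yield to roughly a third, so the cost per good die grows much faster than the die area.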


Oct. 2006 Defect Avoidance and Circumvention Slide 8 

Effects of Yield on Testing and Part Reliability

Assume die yield = 50%

Out of 2,000,000 dies manufactured, 1,000,000 are defective

To achieve the goal of 100 defects per million (DPM) in parts shipped, we must catch 999,900 of the 1,000,000 defective parts

Therefore, we need a test coverage of 99.99%
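
The arithmetic above can be checked with a short calculation. This is a sketch under the slide's assumptions (50% die yield, a target of 100 DPM); the function name is hypothetical, and the test is assumed never to reject a good part.

```python
def required_coverage(total_dies, die_yield, target_dpm):
    """Fraction of defective dies the test must catch to meet target_dpm.

    DPM is measured on shipped parts: escapes / (good + escapes).
    Assumes the test never rejects a good die.
    """
    good = total_dies * die_yield
    defective = total_dies - good
    frac = target_dpm / 1e6
    # Solve escapes / (good + escapes) = frac for the allowed escapes.
    max_escapes = good * frac / (1.0 - frac)
    return 1.0 - max_escapes / defective


coverage = required_coverage(2_000_000, 0.5, 100)
print(f"required test coverage: {coverage:.4%}")
```

Solving for the escapes exactly allows about 100.01 defective parts to slip through, which rounds to the slide's figure of 100 escapes and 99.99% coverage.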


Oct. 2006 Defect Avoidance and Circumvention Slide 9 

Examples of Random Defects in ICs

Resistive open due to unfilled via [R. Madge et al., IEEE D&T, 2003]

Particle embedded between layers


Oct. 2006 Defect Avoidance and Circumvention Slide 10 

Defect Modeling

Extra-material defects are modeled as circular areas

Pinhole defects are tiny breaches in the dielectric

From: http://www.see.ed.ac.uk/research/IMNS/papers/IEE_SMT95_Yield/IEEAbstract.html


Oct. 2006 Defect Avoidance and Circumvention Slide 11 

Sensitivity of Layouts to Defects

VLSI layout must be done with defect patterns and their impacts in mind

A balance must be struck with regard to sensitivity to different defect types

[Figure: layout examples of extra-material and missing-material defects, with a killer defect and a latent defect marked]

Actual photo of a missing-material defect

http://www.midasvision.com/v3.htm


Oct. 2006 Defect Avoidance and Circumvention Slide 12 

The Bathtub Curve

Many components fail early on because of residual or latent defects

Components may also wear out due to aging (less so for electronics)

In between the two high-mortality regions lies the useful life period

[Figure: failure rate vs. time, with curves for mechanical and electronic components. Regions: infant mortality (primarily due to latent defects), useful life (low, constant failure rate), end-of-life wearout]


Oct. 2006 Defect Avoidance and Circumvention Slide 13 

Survival Probability of Electronic Components

From: http://www.weibull.com/hotwire/issue21/hottopics21.htm

[Figure: percent of parts still working vs. time in years. Annotations: infant mortality; no significant wear-out]


Oct. 2006 Defect Avoidance and Circumvention Slide 14 

Burn-In and Stress Testing

From: http://www.weibull.com/hotwire/issue21/hottopics21.htm

[Figure: percent of parts still working vs. time in years]

Burn-in and stress tests are done in accelerated form

Difficult to perform on complex and delicate ICs without damaging good parts

Expensive “ovens” are required


Oct. 2006 Defect Avoidance and Circumvention Slide 15 

Defect Avoidance vs. Circumvention

Defect Avoidance
  • Defect awareness in design, particularly layout and routing
  • Extensive quality control during the manufacturing process
  • Comprehensive screening, including burn-in and stress tests

Defect Circumvention (Removal)
  • Built-in dynamic redundancy on the die or wafer
  • Identification of defective parts (visual inspection, testing, association)
  • Bypassing or reconfiguration via embedded switches

Defect Circumvention (Tolerance)
  • Built-in static redundancy on the die or wafer
  • Identification of defective parts (external test or self-test)
  • Adjustment or tuning of redundant structures


Oct. 2006 Defect Avoidance and Circumvention Slide 16 

Defect Bypassing via Reconfiguration

Works best when the system on die has regular, repetitive structure:
  • Memory
  • FPGA
  • Multicore chip
  • CMP (chip multiprocessor)

Irregular (random) logic implies greater redundancy due to replication:
  • Replicated structures must not be close to each other
  • They should not be very far either (wiring/switching overhead)


Oct. 2006 Defect Avoidance and Circumvention Slide 17

Defects in Memory Arrays

Defect circumvention (removal)
  • Provide several extra (spare) rows and/or columns
  • Route external connections to defect-free rows and columns
  • With m rows and s spares, can model as m-out-of-(m + s)
  • Somewhat more complex with both spare rows and columns (still combinational, though)

Defect circumvention (tolerance)
  • Error-correcting code
  • Modeling with coded scheme to be discussed at the info level

Methods in use since the 1970s; e.g., IBM’s defect-tolerant chip

[Figure: two memory arrays with spare rows, spare columns, and peripheral reconfiguration elements; a defective row and a defective column are marked]
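
The m-out-of-(m + s) model mentioned on this slide can be evaluated with a binomial sum. The sketch below is an illustration, not the course's own derivation: it assumes every row is defect-free independently with the same probability p and that the reconfiguration switches themselves are perfect. The function name and the sample numbers are hypothetical.

```python
from math import comb


def k_out_of_n(k, n, p):
    """Probability that at least k of n independent units are good.

    p is the probability that any one unit is defect-free.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical array: m = 4 rows needed, each row good with probability 0.9.
m, p = 4, 0.9
no_spare = k_out_of_n(m, m, p)        # s = 0: every row must be good
one_spare = k_out_of_n(m, m + 1, p)   # s = 1: any 4 of the 5 rows suffice
print(f"no spare row:  {no_spare:.5f}")
print(f"one spare row: {one_spare:.5f}")
```

With these assumed values a single spare row raises the probability of a usable array from about 0.656 to about 0.919.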


Oct. 2006 Defect Avoidance and Circumvention Slide 18 

Yield Improvement in Memory Arrays

Example of IBM’s experimental 16 Mb memory chip

Combines the use of spare rows/columns in memory arrays with ECC

Four quadrants, each with 16 spare rows & 24 spare columns

ECC corrects any single error via 9 check bits (137 data bits)

Bits assigned to the same word are separated by 8 bit positions

[Figure: yield (0 to 100) vs. average number of failing cells per chip (0 to 4000), with curves for ECC only, spares only, and ECC and spares]
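
One way to see why bits of the same word are kept 8 positions apart: a cluster of adjacent failing cells then lands in different words, each of which sees at most one error, and single-error correction can handle that. The sketch below assumes a simple modulo mapping of columns to words; that mapping and the function names are hypothetical and are not taken from the IBM design.

```python
from collections import Counter

INTERLEAVE = 8  # separation between bits of the same word (from the slide)


def word_of(column):
    """Hypothetical mapping: column c holds a bit of word (c mod 8)."""
    return column % INTERLEAVE


def correctable(failing_columns):
    """True if no word has more than one failing bit.

    Models a code that corrects any single error per word.
    """
    errors = Counter(word_of(c) for c in failing_columns)
    return all(count <= 1 for count in errors.values())


print(correctable(range(16, 24)))  # 8 adjacent failing cells
print(correctable(range(16, 25)))  # 9 adjacent failing cells
```

Under this assumed mapping, a run of up to 8 adjacent failing cells in a row is correctable, while a ninth adjacent cell puts a second error into one word.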


Oct. 2006 Defect Avoidance and Circumvention Slide 19

Defects in FPGAs

[Figure: (a) Portion of PAL with storable output (8-input ANDs, muxes, flip-flop); (b) Generic structure of an FPGA (I/O blocks, configurable logic blocks (CLBs), programmable connections)]

Defect circumvention (removal)
  • Provide several extra (spare) CLBs, I/O blocks, and connections
  • Route external connections to available blocks

Defect circumvention (tolerance)
  • Not applicable


Oct. 2006 Defect Avoidance and Circumvention Slide 20 

Defects in Multicore Chips or CMPs

Defect circumvention (removal)
  • Similar to FPGAs, except that processors are the replacement entities

Interprocessor interconnection network is the main challenge

Will discuss the switching and reconfiguration aspects in more detail when we get to the malfunction level in our multilevel model


Oct. 2006 Defect Avoidance and Circumvention Slide 21

Circumventing Defects in Processor Arrays


Oct. 2006 Defect Avoidance and Circumvention Slide 22

Defect Tolerance Schemes for Linear Arrays

A linear array with a spare processor and embedded switching

A linear array with a spare processor and reconfiguration switches

[Figures: two linear arrays of processors P0 to P3, each with I/O and test connections and a spare or defective processor; labels include "Bypassed" and "Mux"]
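
The bypassing idea for linear arrays can be sketched as a mapping from logical positions to physical processors. This is an illustration only: the function name, the five-processor array with four needed, and the assumption that any defective unit can be bypassed by its switches are not taken from the slides.

```python
def logical_chain(n_needed, n_physical, defective):
    """Return the physical indices that form the working linear array.

    Defective processors are bypassed; returns None when too few
    good processors remain to build a chain of n_needed units.
    """
    good = [i for i in range(n_physical) if i not in defective]
    if len(good) < n_needed:
        return None
    # Any good processors left over after the first n_needed stay unused spares.
    return good[:n_needed]


# Hypothetical array: 5 physical processors, 4 needed (one spare).
print(logical_chain(4, 5, set()))    # no defects: the last unit is the spare
print(logical_chain(4, 5, {2}))      # unit 2 is bypassed
print(logical_chain(4, 5, {1, 3}))   # two defects exceed the single spare
```

With one spare, any single defective processor can be bypassed; a second defect leaves too few units, which is why the 2D schemes on later slides add whole spare rows and columns.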


Oct. 2006 Defect Avoidance and Circumvention Slide 23

Defect Tolerance in 2D Arrays

Two types of reconfiguration switching for 2D arrays

[Figure: two switching arrangements for processors Pa, Pb, Pc, Pd, one of them using a mux]

Assumption: A defective unit can be bypassed in its row/column by means of a separate switching mechanism (not shown)


Oct. 2006 Defect Avoidance and Circumvention Slide 24

A Reconfiguration Scheme for 2D Arrays

Spare Row

Spare Column

A 5 × 5 working array salvaged from a 6 × 6 redundant mesh through reconfiguration switching

Seven defective processors in a 5 × 5 array and their associated compensation paths


Oct. 2006 Defect Avoidance and Circumvention Slide 25

Limits of Reconfigurability

A set of three defective nodes, one of which cannot be accommodated by the compensation-path method.

No compensation path exists for this faulty node

Extension: We can go beyond the 3-defect limit by providing spare rows on top and bottom and spare columns on either side