The Action Against Soft-Errors to Prevent Service Outages. Hidenori Iwashita, NTT Network Service Systems Laboratories. 2015 APAC QuEST Forum APAC Best Practices Conference, April 2015. Copyright 2015 QuEST Forum. All Rights Reserved.

The Action Against Soft-Errors to Prevent Service Outage


Page 1: The Action Against Soft-Errors to Prevent Service Outage

Copyright 2015 QuEST Forum. All Rights Reserved.

The action against Soft-errors

to prevent service outages

NTT Network Service Systems Laboratories

Hidenori Iwashita

2015 APAC QuEST Forum APAC Best Practices Conference

April 2015

Page 2: The Action Against Soft-Errors to Prevent Service Outage

Agenda

1. Soft error problems
   • Laboratory non-reproducible errors
   • Silent errors

2. Soft error mechanisms
   • Soft errors are caused by cosmic rays

3. The increase of soft errors
   • With miniaturization of LSI design rules, soft errors are increasing rapidly

4. Practices
   • Soft error test using a compact accelerator neutron source

5. Results

6. Conclusion
   • NTT can reduce service outages and failure recovery costs due to soft errors.

Page 3: The Action Against Soft-Errors to Prevent Service Outage

1. Soft error problems

Laboratory non-reproducible errors

[Diagram: Network System, Network operations center, Manufacturer factory]

① Error

② Alarm

③ Return

④ Tests

⑤ Test OK

Page 4: The Action Against Soft-Errors to Prevent Service Outage

1. Soft error problems

Silent errors

[Diagram: Network System, Network operations center]

① User complaint: "I can't connect!"

• Not alarmed

• Fault node unknown

Prolonged

Significant failure

Press release (Newspaper, TV)

Page 5: The Action Against Soft-Errors to Prevent Service Outage

2. Soft error mechanisms

Neutrons generated by cosmic rays

[Diagram: cosmic rays (high energy particles) from the Sun and from supernova explosions reach Earth. Nuclear reactions in the atmosphere: high energy particles destroy nuclei (O or N), producing neutrons, protons, muons, and π-mesons.]

Page 6: The Action Against Soft-Errors to Prevent Service Outage

2. Soft error mechanisms

Nuclear reactions in the device

[Diagram: neutrons strike the network system. A neutron destroys a silicon nucleus in the device, and the resulting secondary ions cause a soft error (bit error).]

Page 7: The Action Against Soft-Errors to Prevent Service Outage

3. The increase of soft errors

Miniaturization of LSI design rules (higher integration)

Soft errors increase

Past: only in space or in the sky

Current: at ground level

Page 8: The Action Against Soft-Errors to Prevent Service Outage

3. The increase of soft errors

How often do soft errors occur?

An FPGA contains large-capacity SRAM.

Because SRAM has a lower critical charge (it is more sensitive), soft errors occur more frequently.

Without soft error mitigation, an FPGA exceeds 10000 FIT.

E.g.

6 FPGAs per unit × 1000 units in the network

About 1.5 devices per day fail
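The arithmetic behind this example can be checked with a short calculation. The sketch below assumes the standard definition of FIT (one failure per 10^9 device-hours) and reads the slide as 6 FPGAs in each of 1000 units; the function name is illustrative only.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def failures_per_day(fit_per_device: float, devices: int) -> float:
    """Expected failures per day for a population of identical devices."""
    failures_per_hour = fit_per_device * devices / HOURS_PER_FIT_UNIT
    return failures_per_hour * 24


# Slide values: 10000 FIT per FPGA, 1000 units.
# Assumption: "FPGA×6" means 6 FPGAs per unit.
rate = failures_per_day(fit_per_device=10_000, devices=6 * 1000)
print(f"{rate:.2f} failures per day")
```

Under these assumptions the result is roughly 1.4 failures per day, consistent with the slide's "about 1.5".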

Page 9: The Action Against Soft-Errors to Prevent Service Outage

4. Practices

Developing and applying soft error countermeasures

Page 10: The Action Against Soft-Errors to Prevent Service Outage

4. Practices

Step 1. Specifying requirements

Planned network scale

E.g. 1000 units on the network

Specify requirements

E.g. 1 failure per month on the network

⇒ about 1300 FIT / unit
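The conversion from a network-level requirement to a per-unit FIT budget can be sketched as follows. This assumes the standard FIT definition (failures per 10^9 device-hours) and an average month of 730 hours; the month length and the function name are assumptions, not values from the slides.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_budget_per_unit(failures_per_month: float, units: int,
                        hours_per_month: float = 730.0) -> float:
    """Per-unit FIT budget that meets a network-wide failure target."""
    failures_per_unit_hour = failures_per_month / (hours_per_month * units)
    return failures_per_unit_hour * HOURS_PER_FIT_UNIT


# Slide example: 1 failure per month across 1000 units.
budget = fit_budget_per_unit(failures_per_month=1, units=1000)
print(f"{budget:.0f} FIT per unit")
```

With a 730-hour month this lands slightly above the slide's rounded figure of about 1300 FIT per unit; the exact value depends on the month length assumed.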

Page 11: The Action Against Soft-Errors to Prevent Service Outage

4. Practices

Step 2. Simulating soft errors

E.g.

Device      Design rule [nm]   Size [Mb]   Soft error rate [FIT]
CPU SRAM    65                 2           200
FPGA SRAM   28                 100         10000
ASIC SRAM   90                 2           150
DRAM ①      40                 500         10
DRAM ②      40                 500         10
DRAM ③      40                 500         10
DRAM ④      40                 500         10
SRAM ①      65                 10          1000
SRAM ②      65                 1           100
SRAM ③      65                 10          1000
SRAM ④      65                 2           200
SRAM ⑤      65                 10          1000
Flash Mem   90                 50          50

[Diagram: board (substrate) layout showing the FPGA, ASIC, CPU, SRAMs, DRAMs, and flash memory; four devices are marked "High".]

We use simulation to identify devices with high soft error rates.
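As a rough illustration of how the table can be used, the sketch below sums the per-device soft error rates and compares the total with the Step 1 requirement of about 1300 FIT per unit. The 1000 FIT cutoff for "high" is a hypothetical threshold chosen for this example; the slides do not state how devices were flagged.

```python
# (device, design rule [nm], size [Mb], soft error rate [FIT]) from the table
DEVICES = [
    ("CPU SRAM", 65, 2, 200),
    ("FPGA SRAM", 28, 100, 10000),
    ("ASIC SRAM", 90, 2, 150),
    ("DRAM ①", 40, 500, 10),
    ("DRAM ②", 40, 500, 10),
    ("DRAM ③", 40, 500, 10),
    ("DRAM ④", 40, 500, 10),
    ("SRAM ①", 65, 10, 1000),
    ("SRAM ②", 65, 1, 100),
    ("SRAM ③", 65, 10, 1000),
    ("SRAM ④", 65, 2, 200),
    ("SRAM ⑤", 65, 10, 1000),
    ("Flash Mem", 90, 50, 50),
]

BUDGET_FIT = 1300          # per-unit requirement from Step 1
HIGH_THRESHOLD_FIT = 1000  # hypothetical cutoff, not from the slides

# FIT values of independent devices add up to the unit-level rate.
total_fit = sum(fit for _, _, _, fit in DEVICES)
high_devices = [name for name, _, _, fit in DEVICES if fit >= HIGH_THRESHOLD_FIT]

print(f"Unit total: {total_fit} FIT (budget {BUDGET_FIT} FIT)")
print("Over budget" if total_fit > BUDGET_FIT else "Within budget")
print("High-rate devices:", ", ".join(high_devices))
```

With this cutoff four devices are flagged, which matches the number of "High" labels on the slide, although the slide text does not say which devices those labels mark.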

Page 12: The Action Against Soft-Errors to Prevent Service Outage

4. Practices

Step 3. Apply soft error countermeasures

(1) Reducing soft errors

Use devices with low soft error rates.

• MRAM
• Special device

Drawbacks: low spec, high cost

(2) Protection from soft errors

Use memory devices with error correction functions such as ECC (Error Correction Code).

• 1 bit correction, 2 bit detection: low cost
• 2 bit correction, 3 bit detection: high cost

(3) Recovery from soft errors

Systems automatically restart or overwrite if a soft error occurs.

• Firmware: low cost
• ASIC: long-term development

Select the soft error countermeasures appropriate to each function.
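The "1 bit correction, 2 bit detection" option is commonly realized as a SEC-DED code. As a minimal sketch of the idea, the example below uses an extended Hamming(8,4) code on a 4-bit value. This is an assumption for illustration: the slides do not say which code the devices use, and real memory ECC works on much wider words.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; 1, 2, 4: Hamming parity bits
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return sum(bit << i for i, bit in enumerate(code))


def decode(word: int):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    code = [(word >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the positions of all set bits (0 if valid).
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_error = sum(code) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        # Single-bit error; syndrome 0 means the overall parity bit flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), status


word = encode(0b1011)
print(decode(word))                # no error
print(decode(word ^ 0b00100000))   # one flipped bit is corrected
print(decode(word ^ 0b00100100))   # two flipped bits are detected only
```

A single soft error is repaired transparently, while a double error is reported so that a recovery mechanism such as a restart can take over.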

Page 13: The Action Against Soft-Errors to Prevent Service Outage

4. Practices

Step 4. Soft error tests with real products

We developed soft error testing technology using Hokkaido University's compact accelerator-driven neutron source.

[Photo: Hokkaido University's compact accelerator-driven neutron source]
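An accelerated neutron test is usually translated into a field soft error rate by scaling the observed error count by the ratio of natural flux to test fluence. The sketch below shows that generic conversion only; the slides do not describe the conversion NTT used. The reference flux of 13 n/cm²/h (neutrons above 10 MeV at sea level in New York City) is a commonly cited value, the test numbers are hypothetical, and differences between the accelerator spectrum and the terrestrial spectrum are ignored.

```python
REFERENCE_FLUX = 13.0  # n/cm^2/h, commonly cited sea-level reference (assumption)


def field_fit(errors_observed: int, fluence_n_per_cm2: float,
              reference_flux: float = REFERENCE_FLUX) -> float:
    """Estimate field soft error rate in FIT from an accelerated test."""
    # Error cross-section of the device under test, in cm^2.
    cross_section = errors_observed / fluence_n_per_cm2
    # Errors per hour in the field, scaled to failures per 10^9 hours.
    return cross_section * reference_flux * 1e9


# Hypothetical test: 50 errors observed after a fluence of 1e9 n/cm^2.
estimate = field_fit(errors_observed=50, fluence_n_per_cm2=1e9)
print(f"Estimated field rate: {estimate:.0f} FIT")
```

The benefit of an accelerated source is visible in the numbers: a fluence that takes years to accumulate in the field can be delivered in a short beam exposure.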

Page 14: The Action Against Soft-Errors to Prevent Service Outage

4. Practices

Step 4. Soft error tests with real products

Page 15: The Action Against Soft-Errors to Prevent Service Outage

5. Results

Comparison of neutron soft error rates

We measured the devices using the accelerator neutron source to confirm the reduction in soft error rate.

[Three bar charts, each on a 0 to 10 scale:]

• FPGA based device vs. ASIC based device
• w/o ECC function vs. w/ ECC function
• w/o auto recovery function vs. w/ auto recovery function

Reductions annotated on the charts: 80%, 90%, 80%

On the real network, the number of soft errors decreased substantially.

Page 16: The Action Against Soft-Errors to Prevent Service Outage

6. Conclusion

We successfully reproduced soft errors using a compact accelerator-driven neutron source.

We were able to investigate soft error tolerance, and check the fault detection process and the process of switching to a backup network system.

We conclude that NTT can reduce service outages and failure recovery costs due to soft errors.

Page 17: The Action Against Soft-Errors to Prevent Service Outage

Message

Have you ever experienced trouble with unknown causes on your network?

It might be caused by soft errors!

Soft errors can be dealt with!

We hope that carriers and manufacturers all over the world will be freed from this problem!

Page 18: The Action Against Soft-Errors to Prevent Service Outage

Special thanks:

Fujitsu, Ltd.

Hitachi, Ltd.

NEC corp.