22
BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems Zitao Chen , Guanpeng Li, Karthik Pattabiraman, Nathan DeBardeleben 1

BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning

SystemsZitao Chen, Guanpeng Li, Karthik Pattabiraman, Nathan DeBardeleben

1

Page 2: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Motivation•Machine learning taking computing by storm

• HPC-ML: precision medicine; earthquake simulation;

• Growing safety-critical applications.

• Reliability of ML becomes important.

2Image source: https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/

Page 3: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Soft Error • Transient hardware fault

• Growing in frequency: • Occur every 53 mins in 1m nodes [1]

• Manifested as a single bit-flip

• Silent data corruptions (SDCs)• Erroneous ML output.

• Safety standard for road vehicles:• ISO26262: 10 FIT

3[1] J. Dongarra, T. Herault, and Y. Robert, “Fault tolerance techniques for high-performance computing,” in Fault-Tolerance Techniques for HighPerformance Computing, 2015, pp. 3–85.

Page 4: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Existing solutions• Application-agnostic:

• Triple modular duplication (TMR) for execution units.

• Expensive: Hardware cost, performance (e.g., delay-based accidents).

• Application-specific:

• Random fault injection to guide protection (e.g., instruction duplication).

• Coarse-grained

4

Page 5: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Is Random FI Good Enough?

1 0 0 1 0 1 1 1

SDC Non-SDC

0 0 1 1 1 1 0 1

1. Where are the critical faults in the entire system?2. Are the critical faults uniformly distributed (or not)?

5

1 0 0 1 0 1 1 1

0 0 1 1 1 1 0 1

25% SDC

Randomly simulate bit-flip, and then obtain statistical error resilience

Page 6: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Our goal

An efficient approach to obtain fine-grained understanding of the error resilience of ML systems

6

Overhead

Erro

r cov

erag

eExhaustive FI

Random FI

Our goal

Page 7: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Contributions:

• Identify the property of ML computations which constrain the

fault propagation behaviors.

• Characterize the pattern of critical faults.

• Propose a Binary-FI approach to identify critical bits.

7

Page 8: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

ML framework - TensorFlow

• TensorFlow: framework for executing dataflow graphs

• ML algorithms expressed as dataflow graphs

• Others:

8Image source: https://www.easy-tensorflow.com/tf-tutorials/basics/graph-and-session

Page 9: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Fault model• Focus on inference phase• Faults at processor’s datapath (e.g. ALUs)• Interface-level fault injection (i.e. TensorFlow Operators) [2]

9

[2] Li et al. TensorFI: A Configurable Fault Injector for TensorFlow Applications. ISSREW’18).Image source: https://medium.com/@d3lm/understand-tensorflow-by-mimicking-its-api-from-scratch-faa55787170d

Bit-flip

Page 10: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

How to cause an SDC in ML• In ML, fault usually results in numerical change in the data.• Output by ML is usually determined by numerical magnitude.

• To cause SDC: large deviation at the output

10

Large positive deviation

Large negative deviation

Page 11: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Error propagation Co

nv

Pool FCConv

Conv

Pool

Conv

Conv

Pool

Conv

Conv

Conv

Pool

Conv

Conv

Conv

Pool

Conv

Conv FC FC

OutputInput

11

Individual EP function

• Error propagation (EP): from the fault occurrence to the output.

• Each EP function: large response to Large input, i.e., monotone

• Convolution function: ! ∗# = ∑&'('• Larger Input deviation: A > B

• Larger Output deviation: )(' ≥ +('

1 0 0 1 0 1 1 1

Bit-flip A Bit-flip B

VGG16

Page 12: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Individual EP function

• Common computations in DNNs, satisfy monotone property

• Why: ML tends generate large responses to ``target’’ class/feature

12

Page 13: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Error propagation example• One fault propagates into multiple faults.

13Image source: http://worldcomp-proceedings.com/proc/p2014/ABD3492.pdf

Single fault

Multiple faults

Page 14: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

All EP functions • Composite EP function:

• Break of monotonicity:

• We call it approximate monotone.

EP function 1 EP function N…

Composite function

14

OutputInput

Page 15: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

All EP functions (cont.)• Approximate the EP behavior as an approximate monotonic function.

Conv

Pool FCConv

Conv

Pool

Conv

Conv

Pool

Conv

Conv

Conv

Pool

Conv

Conv

Conv

Pool

Conv

Conv FC FC

OutputInput

Approximate monotonic function

Conv

Pool

Conv

Conv

Pool

Conv

Conv

Conv

ConvInput Output

Input (with faults) and output deviation are constrained by the approximate monotonicity

15

Page 16: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Characteristic of critical faults

If a fault at high-order bits does not lead to SDC (by FI), faults at lower-order bits would not lead to SDC (without FI), i.e, SDC

boundary

16

To cause an SDC Large output dev. Large Input dev. Fault at high-order bit

Approximate monotone

Page 17: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Binary fault injection (BinFI)

17

• Consider the effects of different faults as a sorted array.

0 0 0 0 0 0 0 64 32 16 8 4 2 1

Input Output deviation+

Search within a sorted array

64 32 16 8 4 2 1

Not result in SDCMove to higher-order bit

1Result in SDCMove to lower-order bit

2 3

Page 18: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Experimental setup

• SDC: Image misclassification; degree

of deviation.

• Fault injection tool: TensorFI

• 3 FI approaches: BinFI vs Random FI

vs Exhaustive FI

18FI tool: https://github.com/DependableSystemsLab/TensorFI

Dataset Dataset Description ML model

MNist Hand-written digits

Neural NetworkLeNet

Survive Prediction of patient surivval kNN

Cifar-10 General images AlexNet

ImageNet General images VGG16

Traffic sign Real-world traffic signs VGG11

Driving Real-world driving frames

Nvidia DaveComma.ai

Page 19: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Effects of SDCs

19

Single bit-flip can

cause undesirable

output in ML

Page 20: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Identification of critical bits• Recall: 99.56% (average)• Precision: 99.63% (average)

1. BinFI can identify most of

the critical bits.

2. Random FI is not desirable

for identifying critical bits.

20

5x more overhead than BinFI

Same overhead as BinFI

Page 21: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

21

Overhead of BinFI is ~20% of that by exhaustive FI

Overhead

• Data-width: 16, 32, 64 bits.• 99.5+% recall and precision

1. Overhead of BinFI grows linearly with the data width.

2. BinFI is agnostic to data width in identifying critical bits.

Page 22: BinFI: An Efficient Fault Injector for Safety-Critical Machine ...blogs.ubc.ca/karthik/files/2019/11/SC19-zitao.pdfMotivation •Machine learning taking computing by storm •HPC-ML:

Summary• Common ML functions exhibit monotonicity, which constrains the

fault propagation behaviors.

• Critical faults in ML programs tend to be clustered: If a fault at high-

order bit does not lead to SDC (by FI), faults at lower-order bits would

not lead to SDC (without FI)

Zitao Chen, Graduate student at University of British [email protected]

Artifact: https://github.com/DependableSystemsLab/TensorFI-BinaryFI22