
10/15/2018


Machine Learning in Test‐ A Journey To AI

ITC 2018 Tutorial ‐ML in Test ‐Wang 1

Li-C. Wang, University of California, Santa Barbara

Let’s Start With Four Questions

What kind of “AI?”

Why “A Journey To AI?”

How is “Machine Learning” applied in Test?

What does “Machine Learning” mean anyway?



What AI?


What Kind of “AI”?

1950 “Computing Machinery and Intelligence” – The Turing Test
Other AI: Thinking humanly, Thinking rationally, Acting rationally
“Artificial Intelligence - A Modern Approach” 3rd Edition, PRENTICE HALL SERIES


(Acting humanly – The Turing Test Approach)

Natural Language Processing

Knowledge Representation

Automated Reasoning

Computer Vision

Robotics

Machine Learning

AI


What Kind of “AI”? (Outline Of The Book)

Problem Solving
– Classical search, adversarial search, CSP
Knowledge Representation, Reasoning, Planning
– Prop. Logic, 1st-order Logic, Planning and Acting
Uncertain Knowledge and Reasoning
– Probabilistic reasoning, reasoning over time, Decision making
Learning
– Learning algorithms, Explanation-based learning, reinforcement learning
Communication, Perceiving, Acting
– NLP, Perception, Robotics


What Does “AI” Remind Us Of?

Problem Solving
– Classical search, adversarial search, CSP
Knowledge Representation, Reasoning, Planning
– Prop. Logic, 1st-order Logic, Planning and Acting
Uncertain Knowledge and Reasoning
– Probabilistic reasoning, reasoning over time, Decision making
Learning
– Learning algorithms, Explanation-based learning, reinforcement learning
Communication, Perceiving, Acting
– NLP, Perception, Robotics


Self‐Driving Car


Intelligent Vehicle Development

1959 – GM Firebird III
– A vision of autonomous driving
1990-95 – CMU Navlab landmark vehicle
1994 – Daimler VITA II (Vision Technology Application)
– Automatic collision avoidance
– Autonomous highway driving and lane change
1997 – Netherland PATH program
– Demonstrated platoon driving
1998-01 – NIST Demo III Experimental Unmanned Vehicle

DARPA Challenges on Autonomous Driving
– Mar 13, 2004 – no one completed 5% of the 142-mile off-road course
– Oct 8, 2005 – 5 completed, with Stanford’s “Stanley” winning the race
– Nov 2007 – Urban Challenge (CA) won by CMU’s “Boss”


DARPA Urban Challenge

GPS + DMI provide real-time positioning
LMS + Riegl provide information on 3D road structure and road surface (e.g. lanes)
Velodyne + LDLRS + Radar provide moving-vehicle information

(Figure: sensor layouts of CMU’s Boss and Stanford’s Junior. Labels: BOSCH Radar, IBEO Laser, Riegl Laser, Velodyne Laser, SICK LMS Laser, SICK LDLRS Laser, DMI (distance))


Lesson from Urban Challenge

The Urban Challenge presented significantly more difficulty than previous challenges

Advancement in sensor technology (off-the-shelf sensor availability) was a key reason for the success

However, sensor technology was not yet sufficient to cope with the noise and cost constraints of practical production of autonomous vehicles

Since then, technologies have improved considerably
– Better sensors
– Advances in pattern recognition
– More powerful hardware


Today’s Example ‐ Google’s Waymo

By 2012, more than 300K miles of test driving
In 2015, launched the fully self-driving car Firefly

In 2017, introduced fully self-driving Chrysler Hybrid minivans



Future Projection

Source: Victoria Transport Policy Institute, May 1st 2017

(Figure: projected milestones. Large-scale testing; taxi and car-share services; A-V lane; restricting human driving; infrastructure re-design)


The Three Components

Sensing: SR/LR Radars, LiDAR, Vision, Stereovision, U-Sonic, etc.
Perception: Lane detection, object recognition
Reasoning & Control: Free space calculation, path planning, speed/brake/rotation controls


(Figure: Self-Driving Car = Sensing, Perception, Reasoning & Control)


What The 3 Components Do

Sensing: Collect all relevant data
Perception: Recognize what the data mean
Reasoning & Control: Decide what to do next

(Figure: Sensing (SR-Radar, LR-Radar, LiDAR, Vision, Stereovision, U-Sonic, …) feeds the Perception component, which feeds Reasoning & Control (flowchart, optimization engine, decision-tree rules, …))

Operate In A “WORLD VIEW”

The system maintains a “WORLD VIEW” of the environment and reacts to it

Source: Google self-driving car pictures from the public domain


World View – Google Self‐Driving Car


(Google More Examples Online)

Why “Journey To AI”?



Applying ML in Design/Test (2003‐2013)


(Figure: where ML has been applied)

Data sources
– Pre-silicon (Design Automation), e.g. simulation data
– Post-silicon (Manufacturing Automation), e.g. test data
– In-field (System/in-field test), e.g. customer returns

Learning tasks (supervised and unsupervised learning)
– Classification, Regression, Transformation, Clustering, Outlier, Rule Learning

Applications
– Test cost reduction
– Functional verification
– Layout verification
– Design-silicon timing verification
– Po-Si Validation
– Yield
– Zero DPPM
– Fmax
– Speed test

IEEE Trans. On CAD Paper (Oct 2016):

“Experience of Data Analytics in EDA and Test – Principles, Promises, and Challenges”


Journey To AI

This is especially the case when the ML solution is deployed in design/test processes


Applying ML in a D&T application

In order to deploy a solution, it is not just about the ML tool – very often, we need a system to apply ML.

AI (Autonomous System)

Where we are now



The project (Intelligence Engineering Assistant)


The IEA research lab: https://iea.ece.ucsb.edu/

IEA is pronounced “Ai-Ya”

(Artificial Intelligence – Your Assistant)



What’s IEA like?


Tasks Performed By A Product Engineer


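(Figure: production flow and the engineer’s tasks. Flow stages: Manufacturing Process, Class Probe, Wafer Probe, Packaging, Burn-in, Final Test, Shipped to customers, Customer Returns. Production test data reach the engineer through an interface; a yield issue starts an analytic workflow that produces findings and a PPT presentation.)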


IEA for Yield Optimization


(Figure: the same production flow and task chain as on the previous slide: yield issue, analytic workflow, findings, PPT presentation)

IEA System to execute the tasks

<IEA Demo>



What’s IEA?


Concept‐Based Workflow Programming

To implement this workflow, we need several data-driven concept recognizers
– Yield excursion, low-yield, grid pattern, edge failing, correlation trend

IF (observe a yield excursion) {
    % enter the particular yield excursion context
    FOR (selected low-yield lots) {
        % enter the low-yield context
        SELECT a test bin with the largest yield loss;
        PERFORM e-test correlation analysis;
        IF (observe a correlation trend) {
            % enter the correlation context
            REPORT the trend plot;
            GENERATE stack-wafer plots;
            IF (observe a grid pattern) -> REPORT plots;
            IF (observe an edge failing pattern) -> REPORT plots;
        }
    }
}
…
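The workflow above can be sketched in Python. This is only an illustration of the control structure, not the IEA API: the `Lot` record, the recognizer functions, the 5-point yield threshold, and the toy lots are all assumptions made for this example, and the real recognizers are data-driven models rather than flags.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Lot:
    name: str
    yield_pct: float                  # lot yield in percent
    bin_loss: dict                    # test bin -> yield loss in percent
    has_trend: bool = False           # stand-in for a correlation-trend recognizer
    patterns: list = field(default_factory=list)  # stand-in for wafer-pattern recognizers

def low_yield_lots(lots, drop=5.0):
    # Hypothetical recognizer: lots more than `drop` points below the average yield.
    avg = mean(l.yield_pct for l in lots)
    return [l for l in lots if l.yield_pct < avg - drop]

def is_yield_excursion(lots, drop=5.0):
    # Hypothetical recognizer: an excursion exists if any lot is a low-yield lot.
    return bool(low_yield_lots(lots, drop))

def run_workflow(lots):
    report = []
    if is_yield_excursion(lots):                      # yield excursion context
        for lot in low_yield_lots(lots):              # low-yield context
            worst_bin = max(lot.bin_loss, key=lot.bin_loss.get)
            if lot.has_trend:                         # correlation context
                report.append((lot.name, worst_bin, "trend plot"))
                for pattern in ("grid", "edge failing"):
                    if pattern in lot.patterns:
                        report.append((lot.name, worst_bin, pattern + " pattern plots"))
    return report

lots = [
    Lot("lot1", 92.0, {"bin26": 3.0, "bin31": 1.0}),
    Lot("lot2", 91.0, {"bin26": 4.0, "bin31": 2.0}),
    Lot("lot3", 70.0, {"bin26": 20.0, "bin31": 5.0}, has_trend=True, patterns=["grid"]),
]
report = run_workflow(lots)
print(report)
```

The point of the sketch is that each IF or FOR is driven by a recognizer, so the workflow itself stays readable while the data-dependent judgment lives inside the recognizers.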


A New Programming Paradigm

IEA is a concept‐based workflow programming environment

An IEA system comprises three components
– API for workflow construction
– Library of concept recognizers
– Data processing and analysis tools

(Figure: Executable Workflow, Concept Recognizers, Data processing and analysis tools)

The Analogy

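Data Tools: Collect all relevant “data”
Concept Recognition: Recognize what the “data” mean
Workflow: Decide what to do next

(Figure: the self-driving components Sensing, Perception, and Reasoning & Control placed next to the IEA System components, including the Concept Recognizers and the Executable Workflow)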



Machine Learning in Test‐ A Closer Look At The Journey







This journey went through multiple stages …


At first, “Algorithmic Focus” – What is the “right” ML algorithm (tool) to apply?

Learning from “Data”

A machine learns from training data to build a model

The model is used to “predict” future “unseen” data

(Figure: “Data” → Machine Learning → Model)


ML Python Lib: http://scikit‐learn.org/


Dataset Format

A learning tool usually takes a dataset in matrix form (rows are samples, columns are features)
– Samples: examples to be reasoned on
– Features: aspects to describe a sample
– Vectors: resulting vector representing a sample
– Labels: behavior of interest to be learned (optional)

(Figure: a samples × features matrix; each row is a vector, with an optional column of labels)
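As a minimal sketch of this format, the toy dataset below stores four chips as rows and three measurements as columns. The feature names and values are made up for illustration only.

```python
# Hypothetical toy dataset: 4 chips (samples) x 3 test measurements (features).
feature_names = ["idd_ma", "vref_v", "delay_ns"]   # illustrative names only
X = [
    [10.2, 1.21, 3.4],   # each row is the vector of one sample
    [10.5, 1.19, 3.6],
    [14.8, 1.05, 5.1],
    [10.1, 1.20, 3.3],
]
y = [+1, +1, -1, +1]     # optional labels, one per sample

n_samples, n_features = len(X), len(X[0])

def column(j):
    """All samples' values for feature j."""
    return [row[j] for row in X]

print(n_samples, n_features, column(0))
```

Supervised tools consume both `X` and `y`; unsupervised tools consume `X` alone.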



Supervised Learning

Classification
– Labels represent classes (e.g. +1, -1: binary classes)
Regression
– Labels are numerical values (e.g. frequencies)

(Figure: a feature matrix with a column of labels)


Unsupervised Learning

Work on features
– Dimension reduction
– Transformation
Work on samples
– Clustering
– Novelty detection
– Density estimation

(Figure: a feature matrix with no y’s)



Take Classification


Basic Approaches for Classification

Nearest Neighbors
Linear Discriminant Analysis (LDA)
– Quadratic Discriminant Analysis (QDA)
Naïve Bayes
Decision Tree
– Random Forest
Support Vector Machine
– Linear
– Radial Basis Function (RBF)
Neural Networks



e.g. Nearest Neighbors


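$\hat{y}(\mathbf{x}) = \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} y_i$ = average of the k nearest neighbors to $\mathbf{x}$
– Uniform average, or weighted by the inverse of distance
– The user chooses k and a distance function in a given space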

Source: http://scikit‐learn.org/stable/auto_examples/neighbors/plot_classification.html#example‐neighbors‐plot‐classification‐py
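A minimal nearest-neighbor sketch, assuming Euclidean distance and a vote among the k nearest labels (the classification counterpart of the average above). The function name, the toy points, and the small constant that avoids division by zero are choices made for this example; it is not the scikit-learn implementation.

```python
import math
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Classify x by a vote among its k nearest training samples."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # uniform vote, or a vote weighted by the inverse of the distance
        votes[label] += 1.0 / (d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

X_train = [(0.0, 0.0), (2.0, 0.0), (2.1, 0.0)]
y_train = [+1, -1, -1]
query = (0.1, 0.0)

print(knn_predict(X_train, y_train, query, k=3))                  # uniform vote
print(knn_predict(X_train, y_train, query, k=3, weighted=True))   # inverse-distance vote
```

With uniform votes the two farther samples outvote the single close one; with inverse-distance weights the close sample dominates, which shows why the weighting choice matters.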

Linear Discriminant Analysis (LDA)

For each class, the mean and covariance are estimated based on the data
In LDA, the two covariances are assumed to be the same
– Otherwise, it is called Quadratic Discriminant Analysis (QDA)
In many cases, the difference between LDA and QDA is small

Class 1: model it as a Gaussian distribution $N(\mu_1, \Sigma_1)$
Class 2: model it as another Gaussian distribution $N(\mu_2, \Sigma_2)$

Decision function: $D(\mathbf{x}) = \dfrac{P(\mathbf{x} \text{ in class 1} \mid \text{given } \mathbf{x})}{P(\mathbf{x} \text{ in class 2} \mid \text{given } \mathbf{x})}$


Bayesian Inference – Naïve Bayes Classifier


posterior = prior × likelihood / evidence:

$$p(\text{class} \mid x_1, \ldots, x_n) = \frac{p(\text{class})\, p(x_1, \ldots, x_n \mid \text{class})}{p(x_1, \ldots, x_n)}$$

$$p(\text{class} \mid x_1, \ldots, x_n) \propto p(\text{class})\, p(x_1, \ldots, x_n \mid \text{class})$$

Independence assumption

The naïve Bayes classifier assumes that the features are mutually independent given the class
– This is usually not true, as we have seen in the test data

Also, if each $x_i$ is a continuous variable, we either need to estimate the probability density, or we need to discretize the value into ranges

$$p(\text{class} \mid x_1, \ldots, x_n) \propto p(\text{class})\, p(x_1 \mid \text{class}) \cdots p(x_n \mid \text{class})$$
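The factorized form can be sketched as follows, assuming each continuous feature is modeled by a per-class Gaussian density (one of the two options mentioned above). The function names, the variance floor, and the toy pass/fail data are assumptions for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate p(class) and a per-feature Gaussian p(x_j | class)."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for c, rows in groups.items():
        prior = len(rows) / len(X)
        stats = []
        for col in zip(*rows):                      # one column per feature
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9  # variance floor
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def nb_log_scores(model, x):
    """log p(class) + sum_j log p(x_j | class), i.e. the posterior up to a constant."""
    scores = {}
    for c, (prior, stats) in model.items():
        s = math.log(prior)
        for v, (mu, var) in zip(x, stats):
            s += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        scores[c] = s
    return scores

def nb_predict(model, x):
    scores = nb_log_scores(model, x)
    return max(scores, key=scores.get)

X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.4), (2.8, 0.6)]
y = ["pass", "pass", "pass", "fail", "fail", "fail"]
nb_model = fit_gaussian_nb(X, y)
print(nb_predict(nb_model, (1.1, 2.0)), nb_predict(nb_model, (3.1, 0.5)))
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.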

Decision Tree Classifier

An easy and popular learning algorithm: CART (1984 Breiman et al.)
Of course, the key question is how to measure “purity”

(Figure: find the best feature f and the decision rule f > c to split the dataset into 2 datasets with more purity; then recursively find the best split on each side)
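One split step can be sketched as below, assuming Gini impurity as the purity measure (a common choice in CART-style trees; the slide does not say which measure is used). The function names and toy data are made up for this example.

```python
def gini(labels):
    """Gini impurity: 0 for a pure set, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Return (feature index f, threshold c, weighted impurity) for the best rule x[f] > c."""
    best = (None, None, gini(y))
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            c = (lo + hi) / 2.0                       # candidate threshold between neighbors
            left = [yi for row, yi in zip(X, y) if row[f] <= c]
            right = [yi for row, yi in zip(X, y) if row[f] > c]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, c, score)
    return best

X = [(1.0, 7.0), (2.0, 3.0), (3.0, 8.0), (6.0, 2.0), (7.0, 9.0), (8.0, 4.0)]
y = [1, 1, 1, 2, 2, 2]
split = best_split(X, y)
print(split)
```

A full tree applies `best_split` recursively to the two resulting subsets until a stop criterion is met.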


CART Approach

Randomly select $m^{1/2}$ variables to be tried at each split node

Find the variable that splits the data the best (by a purity measure)

Stop criteria
1. The split has fully separated the subset
2. None of the variables can separate the subset any further

(Figure: example tree with split nodes x1 > c1, x2 > c2, x3 < c3 and leaves labeled Class 1 or Class 2)

Support Vector Machine (SVM)

An SVM tries to find an optimal weight vector (the “alpha” vector) for the samples
Ideally, many of the alpha values are zero
The samples with non-zero alpha are the “support vectors”

(Figure: a feature matrix with a column of labels and one alpha weight per sample)



What Is A SVM Model Like?

Suppose we have a similarity function that measures the similarity between two sample vectors

$k(\vec{x}, \vec{x}_i)$ measures the similarity between two vectors

A simple classifier is to compare the average similarity to class +1 samples to the average similarity to class -1 samples

$$f(\vec{x}) = \frac{1}{N_+} \sum_{i:\, y_i = +1} k(\vec{x}, \vec{x}_i) - \frac{1}{N_-} \sum_{i:\, y_i = -1} k(\vec{x}, \vec{x}_i)$$

If we weight each sample differently

$$f(\vec{x}) = b + \sum_i \alpha_i\, k(\vec{x}, \vec{x}_i)$$
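Both forms can be sketched with an RBF similarity function. The kernel choice, `gamma`, the toy points, and the hand-picked alpha values are assumptions for illustration; an actual SVM obtains the alpha vector and b by solving an optimization problem, which is not shown here.

```python
import math

def rbf_kernel(x, xi, gamma=0.5):
    """Similarity in (0, 1]: 1 when x equals xi, decaying with squared distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

def mean_similarity(X, y, x, kernel=rbf_kernel):
    """f(x) = average similarity to class +1 minus average similarity to class -1."""
    pos = [kernel(x, xi) for xi, yi in zip(X, y) if yi == +1]
    neg = [kernel(x, xi) for xi, yi in zip(X, y) if yi == -1]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def weighted_decision(X, alphas, b, x, kernel=rbf_kernel):
    """f(x) = b + sum_i alpha_i * k(x, x_i); samples with alpha_i == 0 drop out."""
    return b + sum(a * kernel(x, xi) for a, xi in zip(alphas, X) if a != 0.0)

X = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
y = [+1, +1, -1, -1]

f_pos = mean_similarity(X, y, (0.2, 0.5))    # near the +1 samples
f_neg = mean_similarity(X, y, (4.5, 4.0))    # near the -1 samples

uniform_alphas = [0.5, 0.5, -0.5, -0.5]      # reproduces the averaging classifier
sparse_alphas = [0.0, 1.0, -1.0, 0.0]        # hand-picked: only two "support vectors"
print(f_pos, f_neg, weighted_decision(X, sparse_alphas, 0.0, (0.2, 0.5)))
```

The sparse alpha vector shows the appeal of the weighted form: samples with zero weight do not have to be stored or evaluated at prediction time.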

Neural Networks

A neural network learning algorithm determines the best weights – most of them might be zero


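(Figure: a neuron applies a function f(·) to its inputs; the output is +1 if $w_1 x_1 + \cdots + w_n x_n > \theta$, otherwise -1. A network connects inputs to outputs through such neurons.)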


A Comparison of Classifiers

Algorithms are comparable on the 1st and 3rd examples
Performance on the 2nd example varies
In practical applications, a more complex algorithm is not necessarily better
Results also largely depend on the “space” the data is projected onto


Source: http://scikit‐learn.org/stable/

A Common Application

The learning model tries to replace the expensive flow with the cheaper one

(Figure: (N+M) sample parts go through a complex and expensive test flow, which puts N parts in Class 1 and M parts in Class 2. The same parts also go through a much cheaper test flow involving K tests, and a model is learned from the K test results with labels +1 and -1. For parts in production, the learning model assigns Class 1 or Class 2.)


Alternative Analog Test – A Classification Problem


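Dataset: m sample chips, n measurements $M_1, M_2, \ldots, M_n$

$$X = \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \\ \vdots \\ \vec{x}_m \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \text{ (Pass/Fail)}$$

$\vec{x} = [x_1\ x_2\ \cdots\ x_n]$ (chip under test) → Pass/Fail?

These measurements are low-cost alternative tests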

Unsupervised



Basic Clustering Algorithms

Clustering largely depends on
– The space the samples are projected onto
– The definition of the concept “similarity”


Source: http://scikit‐learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

Clustering: K-Means

K-Means
– User gives the number of clusters k
– The algorithm iteratively tries to find the “best” k centroids
Mini Batch K-Means (for speed reasons)
– Search is based on a subset of samples
– “Best” is evaluated based on all samples

Source: http://scikit-learn.org/stable/modules/clustering.html
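The iterative search for k centroids can be sketched with the classic assignment and update loop. This is a plain illustration with a naive deterministic start (the first k points), not the scikit-learn implementation; the function name and toy points are assumptions.

```python
import math

def kmeans(points, k, n_iter=20):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(p) for p in points[:k]]        # naive start: first k points
    assign = []
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # update step: each centroid moves to the mean of its assigned points
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])    # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign

points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
centroids, assign = kmeans(points, k=2)
print(centroids, assign)
```

The result depends on the starting centroids, which is why practical implementations usually try several starts and keep the best one.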


Transformation – Principal Component Analysis

Principal Component Analysis (PCA) – find directions where the data spread out with large variance
– 1st PC (PC1) – data spread out with the most variance
– 2nd PC (PC2) – data spread out with the 2nd most variance
– …

PCA is used for outlier analysis in test

(Figure: M samples s1 … sM, each a vector of N feature values f1 … fN, are re-projected by PCA into vectors r1 … rM in the PCA space spanned by PC1, PC2, …)

PCA for Outlier Analysis in Test

Each test is used to screen with a test limit
– Two tests essentially define a bounding box

Multivariate outliers are not screened by applying tests individually

(Figure: these outliers are not screened by the two tests individually)


Multivariate Outlier Analysis

Use PCA to re-project the data into a PCA space – then define the test limits in the PCA space
– Each PC becomes just another test individually

Also see Nik Sumikawa et al. (ITC 2012) – “Screening Customer Returns With Multivariate Test Analysis”

(Figure: limits that follow the shape of the data are what we desire; PCA helps achieve that)
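A two-test sketch of the idea, assuming 3-sigma limits on each axis and synthetic data in which the two tests are strongly correlated. For two dimensions the principal directions have a closed form, so no linear algebra library is needed. The data, the limit, and the function names are assumptions made for this example.

```python
import math

def pca_2d(points):
    """Return (mean, pc1, pc2) for 2-D data using the closed-form principal axis angle."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)     # direction of largest variance
    return (mx, my), (math.cos(theta), math.sin(theta)), (-math.sin(theta), math.cos(theta))

def project(points, mean, axis):
    """Score of each point along one axis of the PCA space."""
    return [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1] for p in points]

def outliers(values, n_sigma=3.0):
    """Indices outside mean +/- n_sigma * std: one test limit per axis."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sd]

# Synthetic chips: test 1 and test 2 track each other closely.
chips = []
for i in range(40):
    t = -2.0 + 4.0 * i / 39
    e = 0.05 * ((i % 5) - 2)
    chips.append((t + e, t - e))
chips.append((1.0, -1.0))    # index 40: inside both individual ranges, off the trend

test1 = [c[0] for c in chips]
test2 = [c[1] for c in chips]
mean, pc1, pc2 = pca_2d(chips)

print(outliers(test1), outliers(test2))              # limits on each test individually
print(outliers(project(chips, mean, pc2)))           # limit on PC2
```

The last chip passes both individual limits because each of its values is ordinary on its own; only its combination is unusual, and that shows up as a large PC2 score.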

Main Points to Keep In Mind

There are categories of machine learning problems

Many algorithms for each category

An algorithm usually requires user-input parameters
– Their values are often determined with cross-validation

For a practitioner
– 1. Formulate the problem as a machine learning problem
– 2. Pick an algorithm
– 3. Experimentally select the parameters to build the model
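Step 3 can be sketched as a small cross-validation loop that picks k for a nearest-neighbor classifier. The interleaved fold scheme, the candidate values, and the toy data with two deliberately flipped labels are assumptions for this example.

```python
import math

def knn_label(train, x, k):
    """Majority vote among the k nearest (point, label) pairs in train."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def cv_accuracy(data, k, n_folds=5):
    """Fraction of held-out samples classified correctly over n_folds interleaved folds."""
    correct = 0
    for fold in range(n_folds):
        held_out = data[fold::n_folds]
        train = [d for i, d in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_label(train, x, k) == label for x, label in held_out)
    return correct / len(data)

# Toy data: two groups on a line, with two labels flipped to act as noise.
data = [((0.1 * i, 0.0), -1) for i in range(10)] + [((5 + 0.1 * i, 0.0), +1) for i in range(10)]
data[3] = (data[3][0], +1)
data[16] = (data[16][0], -1)

candidates = (1, 3, 5, 7)                  # odd k avoids tied votes with two classes
cv_scores = {k: cv_accuracy(data, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

Each candidate is scored only on samples that were held out of training, so the selected parameter reflects performance on unseen data rather than fit to the training set.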



But, it is not just about the ML tool …


An Application Example – Fmax Prediction


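Dataset: m sample chips, n low-cost delay measurements $M_1, M_2, \ldots, M_n$

$$X = \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \\ \vdots \\ \vec{x}_m \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \text{ (Fmax)}$$

$\vec{x} = [x_1\ x_2\ \cdots\ x_n]$ (a new chip c) → Fmax of c

Delay measurements can be
– FF based, pattern based, path based, or RO based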


Example Algorithms For Regression

See Janine Chen et al. (ITC 2009) – “Data Learning Techniques and Methodology for Fmax Prediction”

– LSF method (linear model, over-fitting the training dataset)
– RG method (linear model, provides a way to avoid over-fitting)
– K-NN method (distance-based, over-fitting the training dataset)
– SVR method (distance-based, uses a kernel k( ) to calculate the distance, provides a way to avoid over-fitting)
– GP method (Bayesian version of the SVR method with the ability to estimate the prediction confidence)

(Figure: arrows between the methods are labeled “Improve on the over-fitting issue” (twice), “Replace linear model with a model in the form of a linear combination of kernel basis functions”, and “Combined with Bayesian inference”)

GP Was The Best! (Conformal Check)

See Janine Chen et al. (ITC 2009) – “Data Learning Techniques and Methodology for Fmax Prediction”

(Figure: predicted Fmax and actual Fmax for 45 conformal samples, sorted by predicted Fmax; y-axis: frequency; shown with the 99.7% confidence band and the 95.4% confidence band (dotted line))


System Fmax Prediction

Structural Fmax was measured on 1443 FFs
5/6 of the samples were used to build a predictive model for system Fmax
The model was applied to the remaining 1/6 of the samples
See Janine Chen et al. (ASP-DAC 2010) – “Correlating System Test Fmax with Structural Test Fmax and Process Monitoring Measurements”

(Figure: system Fmax versus predicted system Fmax, correlation = 0.98; histograms of Fmax frequency for lot1 (79 cores) and lot2 (74 cores))

Correlation        Lot1   Lot2
Predictive model   0.98   0.87
Best single FF     0.83   0.55

A Barrier for Deployment

Can’t deploy a model without having a consistent set of features across all lots


(Figure: accuracy (0 to 1) versus # of features (1443, 722, 316, 181, 91, 46, 23, 12) for Lot1 and Lot2, with the selected FF numbers per lot. Annotations: “Selecting 23 features gives the best result = 0.98” and “Selecting 46 features gives the best result = 0.80”)


In the 2nd stage, it was “Methodology Centric” – What should be the methodology to enable deployment of an ML-based solution/model?

Design‐Silicon Timing Correlation

Question to answer: Why do the Static Timing Analyzer (STA) and silicon path delays disagree?



Design‐Feature‐Based Analytics Framework

Answer: Those features cause the mismatch


(Figure: analytics flow. A design database (Verilog netlist, timing report, cell models, LEF/DEF, switching activity, SI model, temperature map, power analysis) feeds path selection and path encoding with design features. ATPG tests and test pattern simulation provide test data and path data. The resulting path vectors (dataset) go to learning, which reports features.)

Preparation – Feature Generation


Cell-Based Transistor-Based

Interconnect-Based

Features are potential sources of uncertainty on a path


Feature Generation


(Figure: example features. Coupling on a victim net; multiple input switching; dynamic effects: temperature and power noise (Eli Chiprout, Intel); location-based features with X and Y coordinates)

Binary Classification – Tree Learning

Design features extracted from STA timing reports and GDSII

Tree model: There are > 14 single vias between layers 4/5 and > 70 double vias between layers 5/6

One can validate a tree model by visualizing the colored scatter plot


(Figure: validation of the tree model. Scatter plot of normalized measured slack versus normalized expected slack, colored by the single via count on layers 4/5 of the clock path (scale 0 to 20))


Another Example (Joint work with AMD)

Fifteen 4-core processor parts
480 critical paths from AC delay tests (shown on top 4 freq. steps)
Many are STA non-critical (Slack > x)
See Janine Chen et al. (ITC 2010) – “Mining AC Delay Measurements for Understanding Speed-limiting Paths”

(Figure: path count (0 to 350) at frequency Step 1, Step 2, Step 3, Step 4)

Very Unbalanced Dataset (M >> k)

There are 12,248 STA-critical paths activated by patterns
– with slacks <= x
– They do not show up as critical paths in the top 4 frequencies

For 480 failing FFs from the top 4 frequency steps
– 158 silicon-critical but STA non-critical paths

Question: What is unique about the 158 paths?
– Use the 12,248 silicon non-critical paths as the basis for comparison

(Figure: 12,248 silicon non-critical paths vs. 158 silicon-critical paths)


Feature Creation

92 path features
– Basic information: for example, whether the path is a half-cycle path, transition directions on N/PMOS, etc.
– Timing statistics: delay information from the STA timing report.
– Usage of Vt devices: counts of the various Vt devices used on the path, i.e. high, low, regular, etc.
– Cell type: features of the usage of various important cell types.
– RC related: features capturing the load information on the path.
– MUX related: special features focusing on the usage of some MUX cells.
– Location: describing the location of the path.

16 features correlate to others


Potential Features as Causes

Multiple rules may reflect the same cause
– Rules #1, 2, 4, 5 all involve the CMAC feature
– They actually reflect the same cause (later confirmed)

Finding: The methodology is practical – it found something missed by the designer
– If one is willing to pay the price for developing the features



The Need for Domain Expert

A domain expert won’t accept a solution if he/she can’t see the value, or doesn’t understand it
– Interpretable and actionable model
– Added value to their existing solutions already in place

Let the methodology start with an expert, by
– Asking for a set of “reasonable” features
– Collecting sufficient data for learning feature importance

But …
– If the engineer knows what features are relevant, why even apply so-called “Machine Learning”?
– If they don’t know, how much data is needed?
– If collecting the data is hard, will it ever get done?
– If it is too costly, what’s the added value?


Modern ML is about learning the features



In the 3rd stage, “Application Driven” – Which application has a better chance to succeed with an ML solution?

One Application‐ Selective Burn‐In



Case Study 1

Product
– Analog and sensor product
Fail characteristics
– Fails were verified – we knew the root causes
Objective
– Identify pre-burn-in outlier models to capture them

(Figure: test flow: Wafer Test → Final Test (H/C) → Burn-In → Final Test (H/A); 7 verified fails with FA reports)


Case Study 1 – 2‐test models

All 7 models were verified with test and design knowledge
– The tests used in the model were testing the failing site

The study suggests that pre-burn-in test “signatures” exist for burn-in fails

(Figure: 2-test model scatter plots for Fail #1, Fail #2, Fail #4, Fail #7)



Case Study 1 ‐ Summary

To capture all 7 fails, only < 5% of the population requires burn-in (Is this good news?)

Fail #   Fail Site     Outlier Model   Kill %
1        Gate Oxide    2-test model    1%
2        Polysilicon   2-test model    0.44%
3        Gate Oxide    Univariate      1.28%
4        Polysilicon   2-test model    0.01%
5        Gate Oxide    Univariate      0.23%
6        Gate Oxide    Univariate      0.48%
7        Gate Oxide    2-test model    1.24%
Accumulated Kill %                     4.54%


Case Study 2

Product
– Automotive microcontroller
Fail characteristics
– Some fails had FA reports, but most did not
Objective
– Identify outlier models to capture them before burn-in

(Figure: test flow: Wafer Test → Burn-In → Final Test (H/A); 48 fails collected)

Category   Type   # of fails
A          1      8
A          2      3
A          3      9
A          4      2
B          1      15
C          1      11
Total             48



Case Study 2

Models are not selected based only on outlying properties
– The tests in use are testing the block containing the fails
– The model is shared by most of the fails
Difficulties
– Even when fails are given the same type, that is not necessarily the case
– Not sure if all 48 fails were true burn-in fails
– If we couldn’t find a good screen, what did we tell the product team? We needed supporting evidence to show that the part was not screenable with the tests

Category   Type   # of fails   Model            # of fails screened by the model   Kill %   Kill % to include the next fail
A          1      8            2-test model X   6                                  7.3%     24.9%
A          2      3            2-test model X   2                                  6.7%     16.8%
A          3      9            2-test model X   5                                  7.8%     11.4%
A          4      2            2-test model X   2                                  6.9%     -
B          1      15           Univariate       8                                  4.1%     9%
C          1      11           Univariate       7                                  7.5%     27.9%
Total             48           3 models         30 fails


Case Study 2

New tests became available later
– Allowing better models to be built



Case Study 2 – Additional Models

As new tests became available, more models could be found
– So this is not really an outlier modeling problem, is it?

(Figure: the same model from the previous slide projects many other fails as outliers; another model projects multiple fails as outliers)


Burn‐In Fail Prediction – Is It Possible?

Consider a target test T with many post-burn-in fails
– Observe a significant change in the measured value of T between pre- and post-burn-in
Only 2 fails were marginal in both the pre- and post-burn-in stages
Without the stress, how can one screen those fails?
– We need an alternative stress

(Figure: post-burn-in versus pre-burn-in measured value of test T)



Four Key Considerations

In this picture, I did not mention a “learning algorithm” because it was not as decisive a factor as these four in realizing a practical methodology for an application

(Figure: “Low-Hanging Fruits” and the four considerations: Domain Knowledge Availability, Data Availability, Learning Result Utilization, Added Value to Existing Flow)


Applying ML in Design/Test (Since 2013)





In the 4th stage, it was all about the question of “Added Value?”


Resolving A Production Yield Issue (2013)

An automotive SoC product
Problem: yield fluctuated over time
The issue could not be resolved for months after several design and test revisions, and several process tuning recipes
ITC 2014 paper: “Yield Optimization Using Advanced Statistical Correlation Methods”

(Figure, for illustration: yield of lots over time)


Objective: Discover Process Adjustments

Task: Discover strong correlations between a test fallout and a process parameter

Desire: Adjustment of the process parameter leads to improved yield and reduced yield fluctuation

(Figure: density distribution of yield, estimated based on 2000+ wafers, next to the desired outcome with yield optimization (what we want))



Basic Question

The correlation can exist in many different ways
– Statistical correlation or just an association
– Univariate or multivariate relationship

With many variables, this essentially is a search problem

(Figure: average # of fails, wafer to wafer, against some variable (factor), wafer to wafer: do they correlate?)

Initial Unsuccessful Search

The best results from searching a correlation between the number of fails and a process parameter are not enough to support a silicon design‐of‐experiment


(Figure: # of fails versus process parameter. Bin 26: Corr = 0.463. Test A: Corr = 0.456)


Need To Consider More Variables

The correlation may exist only in a certain region
Also, process parameters are measured at different locations on a wafer – their values can be different

(Figure: Test A and Test D wafer maps with measurement site locations, Site 1 to Site 5)

Consider More Variables

The left plot shows a correlation found only by considering fails with a specific test value
The right plot shows a correlation found only with the variance of a test distribution

(Figure: left, X4 category fails versus parameter PP1, correlation = -0.766; right, variance versus parameter PP1, correlation = -0.75)


Consider Temporal Effect

Temporal variations can mask a correlation
By unmasking the temporal effect, a high correlation is found

(Figure: # of X1–X3 types of fails versus PP5 parameter. Correlation without separation = 0.63; after separation, Corr = 0.75 and Corr = 0.86)
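The masking effect can be reproduced with a few synthetic wafers. The numbers below are invented and deliberately exaggerated; they are not the data behind the plot. Within each time period the fail count rises with the parameter, but the second period has a lower baseline, so pooling the periods hides the trend.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical wafers: (period, process parameter value, number of fails).
wafers = [("A", p, 10 + 2 * p) for p in (1, 2, 3, 4, 5)]
wafers += [("B", p, 2 * p - 4) for p in (5, 6, 7, 8, 9)]

def corr_for(period=None):
    rows = [w for w in wafers if period is None or w[0] == period]
    return pearson([w[1] for w in rows], [w[2] for w in rows])

pooled = corr_for()          # all wafers together
corr_a = corr_for("A")       # first period only
corr_b = corr_for("B")       # second period only
print(round(pooled, 3), round(corr_a, 3), round(corr_b, 3))
```

Separating the wafers by period before computing the correlation is the unmasking step; knowing that such a separation is meaningful is domain knowledge, not something the correlation tool provides.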

Risk Evaluation – “No Correlation”

Before the silicon experiment, we had to ensure that a recommended process change does not affect other test fallouts
– We don’t want to solve one issue and cause another
Our method: ensuring statistical independence between a process change and each of the other test fallouts
– Implementation is complicated – see the paper for details

(Figure: PP1 risk evaluation, risk (0 to 1) per bin, e.g. Bin 31; risk inspection and containment: mean test value versus PP1 average value, with test limits)


Silicon Result

Found several recommended process changes
– 3 process changes based on 5 process parameters
– 1 change applied to all experimental lots

Result: Significant yield improvement and reduction of the yield fluctuation

(Figure: yield density before the changes; yield for Before, ADJ #1, ADJ #2, and Both)


Finally, we observed “added value!”



But not so fast …

In the yield example, we and the product team had access to the same set of ML tools

So, why did we succeed when they did not?

Because we had the knowledge enabling us to conduct a more effective analytic process to apply the ML tools

It was that piece of knowledge that made the difference, not the tools in use


The True Value

To deploy a solution, I can’t just package the ML tools and give them to the product team

I needed to package my “knowledge” – How am I going to do that?



Between 2014-16, we were working on several more production lines in the context of yield optimization

(Unpublished, all internal)


We Need A System View To Apply ML (Since 2016)

So in short, why the system view?

Because we need domain knowledge


(Figure: a domain expert performs Input Preparation, Tool Invocation, and Result Evaluation)

We need automation of all three components


We Need A System View To Apply ML

Why do we need domain knowledge?

Mainly, because we have limited data


(Figure: an expert performs Input Preparation, Tool Invocation, and Result Evaluation)


What is so special about applying ML in view of “limited data?”

“Learning from Limited Data in VLSI CAD” – an upcoming book chapter ‐ preview at our lab web site: https://iea.ece.ucsb.edu/

Because ML relies on theoretical assumptions, and with limited data those assumptions are hard to meet in practice → we need domain knowledge to compensate for what ML lacks

(Figure: an expert performs Input Preparation, Tool Invocation, and Result Evaluation)


What Theoretical Assumptions for Machine Learning?


Classification, Machine Learning, Pattern Recognition

In machine learning, the Perceptron is widely considered one of the earliest examples to show that a machine can actually “learn”
SVM is based on statistical learning theory, which provides the necessary and sufficient conditions under which a machine is guaranteed to “learn”


Perceptron (1958 Rosenblatt – 2‐level neural network)

Back propagation (1975 Werbos – NN with hidden layer)

Kernel trick (1964 Aizerman et al.)

Support Vector Machine (1995 Vapnik et al.)

Gaussian Process for Regression (1996 Williams&Rasmussen)

Gaussian Process for Classification (1998 Williams&Barber)

SVM one‐class (1999 Scholkopf et al.)

Decision tree learning (1986 ID3)

Rule learning (1989 CN2)

Rule learning (1993 C4.5)

Random Forests (2001 Breiman)

Rule learning (2002 CN2‐Subgroup Discovery)

Decision tree learning (1984 CART)

Deep Learning (2012)


A Popular Dataset For Machine Learning Research

One of the most popular datasets used in ML research was the USPS dataset for hand-written postal code recognition
– e.g. When SVM was introduced, it substantially outperformed others based on this dataset

Question: What is the difference between this problem and yours?


Source: Hastie, et al. “The Elements of Statistical Learning” 2nd edition 2008 (very good introduction book)

Binary Classification

There are subspaces that are easy to classify (all algorithms agree)
One algorithm differs from another in how each partitions the subspace in the “grey area”
– What’s the “best way” to define the “orange-blue” boundary?

Source: Hastie, et al. “The Elements of Statistical Learning” 2nd edition 2008 (very good introduction book)

(Figure: orange space, blue space, and the grey area between them)


Model Complexity

You can always find a model that perfectly classifies the two classes of training samples (middle picture – based on nearest neighbor strategy)
– The model is usually complex

However, this may not be what you want
– Because your model is highly biased by the training data

Source: Hastie, et al. “The Elements of Statistical Learning” 2nd edition 2008 (very good introduction book)

(Figure: three decision boundaries: complex with a rough edge; complex and fragmented (nearest neighbor model); smooth)

Model Complexity Vs. Prediction Error

In learning, an algorithm tries to explore this tradeoff to avoid over‐fitting
There are two fundamental approaches
– Fixing a model complexity
• Find the best‐fit model to the training data
• e.g. Neural Network, equation based models
– Fixing a training error (say almost 0)
• Find the lowest‐complexity model (given ALL possible functional choices in a space)
• e.g. SVM


[Figure: prediction error (vertical axis) vs. model complexity (horizontal axis, low to high), with curves for error on the training samples and error on the validation samples; the over‐fitting region is marked]
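The tradeoff between model complexity and prediction error can be reproduced with a small experiment. The sketch below is only an illustration under assumed settings: the target function sin(πx), the noise level, the sample sizes, and the use of polynomial degree as the "complexity" knob are all choices made here, not taken from the tutorial.

```python
import numpy as np

def complexity_sweep(degrees=range(1, 10), n_train=20, n_val=200,
                     noise=0.3, seed=0):
    """Fit polynomials of increasing degree and record train/validation MSE."""
    rng = np.random.default_rng(seed)
    target = lambda x: np.sin(np.pi * x)          # assumed "true" function
    x_tr = rng.uniform(-1.0, 1.0, n_train)
    y_tr = target(x_tr) + noise * rng.normal(size=n_train)
    x_va = rng.uniform(-1.0, 1.0, n_val)
    y_va = target(x_va) + noise * rng.normal(size=n_val)

    rows = []
    for d in degrees:
        coef = np.polyfit(x_tr, y_tr, d)          # least-squares fit of degree d
        train_mse = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
        val_mse = np.mean((np.polyval(coef, x_va) - y_va) ** 2)
        rows.append((d, train_mse, val_mse))
    return rows

rows = complexity_sweep()
for d, train_mse, val_mse in rows:
    print(f"degree={d}  train MSE={train_mse:.4f}  validation MSE={val_mse:.4f}")
```

The training error can only go down as the degree grows, because each higher‐degree model contains the lower‐degree ones. The validation error has no such guarantee, and that gap is what the over‐fitting region of the plot refers to.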


Neural Network (Fixed Complexity)


Source: Hastie, et al. “The Elements of Statistical Learning” 2nd edition

K classes (outputs Y_1, …, Y_K), M hidden variables (Z_1, …, Z_M), P inputs (X_1, …, X_P)

$$Y_1 = b_{11} Z_1 + b_{12} Z_2 + \cdots + b_{1M} Z_M + b_{10}$$

$$Z_1 = \sigma(a_{11} X_1 + a_{12} X_2 + \cdots + a_{1P} X_P + a_{10})$$

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

A neural network model complexity is fixed by fixing the number of Z variables
Learning is by finding the best‐fit values for the parameters
– (M+1)K parameters
– (P+1)M parameters
e.g. Use the back propagation algorithm (1975 Werbos)
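The single‐hidden‐layer model and its parameter count can be written out directly. The sketch below assumes the layout on this slide (P inputs, M sigmoid hidden variables, K linear outputs). The array names `A`, `a0`, `B`, `b0` and the function names are illustrative choices, not part of the tutorial.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, A, a0, B, b0):
    """Forward pass: X is (n, P), A is (M, P), a0 is (M,), B is (K, M), b0 is (K,)."""
    Z = sigmoid(X @ A.T + a0)   # hidden variables Z_1..Z_M, shape (n, M)
    Y = Z @ B.T + b0            # outputs Y_1..Y_K, shape (n, K)
    return Y

def num_parameters(P, M, K):
    """(P+1)M hidden-layer parameters plus (M+1)K output-layer parameters."""
    return (P + 1) * M + (M + 1) * K

print(num_parameters(P=4, M=3, K=2))  # 5*3 + 4*2 = 23
```

Once M is chosen, the number of parameters is fixed, so learning reduces to searching for values of `A`, `a0`, `B`, and `b0`.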

What Causes Over‐Fitting Anyway?



Five Assumptions for Supervised Learning

(1) A restriction on H (otherwise, NFL)
(2) An assumption on D (i.e. not time‐varied, e.g. silicon data)
(3) Assuming size m is in order O(poly(n)), n: # of features
(4) Making sure a practical algorithm L exists
(5) Assuming a way to measure error, e.g. Err(f(x), h(x))


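[Figure: supervised learning setup with Sample Generator G (distribution D, m samples (x, y)), Function y = f(x), Hypothesis Space H, Learning Algorithm L, and output Hypothesis h approximating f; labels (1) to (5) mark the five assumptions]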

In Practice, Issue #1



Because we don’t know how complex H should be, we assume the most complex H we can afford in training


For a complex H we need a large amount of data, but we usually don’t know if we have enough in advance

Occam’s Razor Assumption

Hypothesis space: e.g. all possible assignments of weight values in a neural network (can be infinite)

Occam’s Razor: Find the “simplest” hypothesis that fits the data
– Hence, many machine learning algorithms solve a constrained minimization problem

Conceptually, Occam’s Razor means making a “smooth” assumption in the hypothesis space

[Figure: space of all hypotheses. Data is used to filter out some hypotheses; for the remaining, find the “simplest” hypothesis as the answer]
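One common way to turn "find the simplest hypothesis that fits the data" into an optimization problem is to penalize the size of the model parameters. The sketch below uses ridge regression as a stand‐in. This choice is an assumption made for illustration, since the tutorial does not name a specific formulation. The penalty weight `lam` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                       # synthetic inputs
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

norms = []
for lam in (0.01, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lam={lam:>6}  ||w||={norms[-1]:.4f}")
```

A larger `lam` trades fit for a smaller weight norm, which is one concrete reading of "simpler" in the hypothesis space.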


“Smooth” Assumption

ML would say the color of the area is blue
NFL would say “No, it can be any color”

[Figure: the area in question is marked “?”]


Another Example of “Smooth” Assumption

Occam’s razor would give you a simpler model and conclude “It is red”
NFL would say “can be either green or red”
– See IEEE TCAD 36(6) 2017 “Experience of Data Analytics in EDA and Test”

[Figure: cube over axes x, y, z with vertices 000 through 111 labeled Green or Red; one vertex is marked “?”]
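The NFL side of this argument can be checked by brute force on a three‐variable cube. The sketch below enumerates all Boolean functions of (x, y, z) and keeps those that agree with seven observed vertices. The labels used here (1 for one color, 0 for the other, assigned by the value of x) and the choice of the unseen vertex are hypothetical, since the actual coloring in the slide's figure is not recoverable from the text.

```python
from itertools import product

VERTICES = list(product([0, 1], repeat=3))  # the 8 corners (x, y, z) of the cube

def consistent_hypotheses(observed):
    """Return every Boolean function on the cube that agrees with the observed labels."""
    survivors = []
    for labels in product([0, 1], repeat=len(VERTICES)):  # 2**8 = 256 functions
        h = dict(zip(VERTICES, labels))
        if all(h[v] == c for v, c in observed.items()):
            survivors.append(h)
    return survivors

unseen = (1, 1, 1)
# Hypothetical observations: the label of each seen vertex equals its x bit.
observed = {v: v[0] for v in VERTICES if v != unseen}

survivors = consistent_hypotheses(observed)
predictions = sorted({h[unseen] for h in survivors})
print(len(survivors), predictions)  # 2 [0, 1]
```

Two hypotheses survive and they disagree on the unseen vertex, so the data alone cannot decide. Preferring the simpler rule "label = x" is what settles the answer, which is exactly the extra assumption that Occam's razor supplies.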



Because non‐convex optimization is hard, some heuristic is used, and the solution is often a local minimum




NP‐Hard
– The problem of finding the desired model is NP‐Hard
– This is proved by fixing the representation

Crypto‐Hard
– The problem is as hard as breaking a crypto function
– This is proved by not fixing the representation


Learning Complexity Hierarchy


[Figure: regions of function classes labeled No Free Lunch, All Possible Function Classes, Poly VC‐D, NP‐Hard, Crypto‐Hard, Efficiently Learnable, and Learnable; example classes shown: K‐term DNF formula, poly‐size circuit, poly‐size neural net, depth‐2 circuit]

But, ML Is So Successful, Isn’t It?

Speech recognition
Language translation
Computer vision
Autonomous vehicle
…

If learning is so difficult, how could these applications be so successful?



The Common Approach To Succeed in ML


[Figure: the learning complexity hierarchy (No Free Lunch, All Possible Functions, Poly VC‐D, NP‐Hard, Crypto‐Hard, Learnable), annotated with:]
– Assume the largest Deep Neural Network affordable
– Demand huge data for training!
– Demand special HW for speed!
– Assume your target is here

Well‐Known Issue With A NN Model

Adversarial Examples – A slightly perturbed input that causes the model to misclassify

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus, “Intriguing properties of neural networks,” arXiv:1312.6199v4, 2013
– Google scholar: >1K citations already
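A small perturbation that flips a model's decision can be shown even on a linear classifier. The sketch below uses the fast gradient sign method on a logistic model. Note that this is a swapped‐in technique chosen for brevity and is not the search procedure of the paper cited above. The weights, input, and step size `eps` are hypothetical values picked so that the effect is easy to verify by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Class 1 if the logistic output exceeds 0.5, else class 0."""
    return int(sigmoid(w @ x + b) > 0.5)

def fgsm(w, b, x, y, eps):
    """Move each input feature by eps in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w          # gradient of cross-entropy loss w.r.t. the input
    return x + eps * np.sign(grad_x)

# Hypothetical model: 100 features with weights of alternating sign.
w = 0.1 * np.where(np.arange(100) % 2 == 0, 1.0, -1.0)
b = 0.0
x = 0.5 + 0.02 * np.sign(w)       # w @ x = 0.2, so the model predicts class 1

x_adv = fgsm(w, b, x, y=1, eps=0.05)   # w @ x_adv = -0.3
print(predict(w, b, x), predict(w, b, x_adv), np.max(np.abs(x_adv - x)))
```

Each feature moves by at most 0.05 on values near 0.5, yet the many small changes add up across 100 features and push the score across the decision boundary.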



Adversarial Examples Are Easy To Find

MNIST Dataset, Source: Xiaowei Huang, Marta Kwiatkowska, Sen Wang, Min Wu “Safety Verification of Deep Neural Networks,” arXiv:1610.06940v3 , 5 May 2017 latest version


Current Findings with Adversarial Examples

Finding adversarial examples is easy; but defending against adversarial examples is hard

Adversarial training can improve performance; and ensemble learning helps

Adversarial examples might not be caused by “overfitting”; they are more likely caused by out‐of‐domain inputs

But when can we say we “verify” a CNN model?



In Summary …


In Summary, Four Barriers To Consider …

A result after considering those 4 barriers
– Data barrier
– Theoretical barrier
– Computational barrier
– Deployment barrier (over an existing solution)

The system is largely domain‐knowledge‐driven


[Figure: workflow of Input Preparation, Tool Invocation, and Result Evaluation, driven by an Expert]


The Yield Context


[Figure: the expert workflow (Input Preparation, Tool Invocation, Result Evaluation, Expert), annotated with three studies:]
1. What ML tools are useful and required in yield engineering (ITC 2014)
2. Learning the process by which a yield expert applies those ML tools (VTS 2017)
3. GAN‐based result recognizer (ITC 2018)

The AI System

The core of this AI system view is the autonomous execution of the workflow


[Figure: Input Preparation, Tool Invocation, and Result Evaluation inside The AI System, together with The Speech Recognition and NLP Component]

Integration ⇒ Autonomous Execution




[Figure: IEA Autonomous System, with the Workflow (Input Preparation, Tool Invocation, Result Evaluation), Executable Workflow, and Concept Recognizers]

Need automation of all three components


Recall: The Analogy

Data Tools: Collect all relevant “data”
Concept Recognition: What the “data” mean
Workflow: What to do next

[Figure: analogy between Sensing, Perception, Reasoning & Control and the IEA System with its Concept Recognizers and Executable Workflow]



“Intelligence” In IEA

The language interface is mostly used for queries of results after the IEA autonomous execution is completed


[Figure: scale from Level 0 through Level 1 to Level ∞, with labels Natural Language Interface, Manifestation, Autonomous System, Person Manifestation, and IEA]

Final Remarks ‐ Recent Trends in ML



Noticeable ML Applications In Recent Years


Self‐Driving Car
Mobile Google Translation
Smart Robot
AlphaGo (Google)

*The images on this slide were found in the public domain

Deep Learning for Image Recognition

ImageNet: Large Scale Visual Recognition Challenge (http://www.image‐net.org/challenges/LSVRC/)
– 1000 Object Classes, 1.4M Images

Year     Model                   Top‐5 error rate
2010     (not labeled)           28.20%
2011     (not labeled)           25.80%
2012     8‐Layer AlexNet         16.40%
2013     8‐Layer ZFNet           11.70%
2014     19‐Layer VGG            7.30%
2014     22‐Layer GoogLeNet      6.70%
2015     152‐Layer ResNet        3.57%
Human                            5.10%

2016 CUImage: 269 Layers





[Figure: the same ImageNet top‐5 error rate chart]

1st Enabler: The availability of a large dataset to enable the study of deeper neural networks




[Figure: the same ImageNet top‐5 error rate chart]

2nd Enabler: The availability of efficient hardware to enable training with such a large neural network


NN for Unsupervised Learning

GANs were one of the hottest topics in 2017

Ian J. Goodfellow, Jean Pouget‐Abadie, Mehdi Mirza, Bing Xu, David Warde‐Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks,” arXiv:1406.2661v1, 2014

Introductory site: https://deeplearning4j.org/generative‐adversarial‐network

Take a look at the rich applications of GANs: https://github.com/hindupuravinash/the‐gan‐zoo

Tips and tricks to implement GANs: https://github.com/soumith/ganhacks


A Recent DNN Study

Size and power efficiency are concerns


Source: Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis Of Deep Neural Network Models For Practical Applications,” arXiv:1605.07678v4 [cs.CV] 14 Apr 2017


Learning Hardware ‐ A Very Good Tutorial

Dr. Song Han, MIT (previously at Stanford University)
– https://stanford.edu/~songhan/talks.html

https://www.youtube.com/watch?v=Q0b‐CkejWcc&feature=youtu.be


Improvement on DL Models

Deep Learning models continue to “train more and get better!”

Image Recognition (16X improvement)
– ’12 AlexNet: 1.4 GFLOP, ∼16% err
– ’15 ResNet: 22.6 GFLOP, ∼3.5% err

Speech Recognition (10X improvement)
– ’14 DeepSp1: 80 GFLOP, ∼8% err
– ’15 DeepSp2: 465 GFLOP, ∼5% err

Source: S. Han’s tutorial at FPGA ’17



Three Issues With Increasing DNN Model Size

Training speed
– ResNet152 (Microsoft) – 1.5 weeks
– This turn‐around time limits model development

Energy efficiency
– AlphaGo (Google) – 1920 CPUs + 280 GPUs
– Huge electric bill for each run

Mobile application
– Model downloaded to mobile device (size limitation)
– Running the model needs speed and low power

In the future, it is expected that a brain‐scale system might integrate 10^11 neurons and 10^15 weight values


Hardware for Training ‐ Examples

CPU
– Intel Knights Landing (2016)
• 7 TFLOPS FP32, 16GB MCDRAM (400 GB/s), 14nm

GPU
– Nvidia PASCAL GP100 (2016)
• 10 TFLOPS FP32, 16GB HBM (750 GB/s), 16nm
– Nvidia Volta GV100 (2017)
• 15 TFLOPS FP32, 120 Tensor TFLOPS, 16GB HBM2 (900 GB/s), 12nm, 21B Transistors
• Matrix performance improves by >9 times

TPU
– Google Cloud TPU
• 180 TFLOPS in training
• “TPU pod” comprising 64 TPUs for 11.5 PetaFLOPS performance



Hardware for Inference ‐ Examples

Heavily researched area
– A‐eye (FPGA 2011), DianNao (2014), MIT Eyeriss (2016), Stanford EIE (2016), Google TPU, many start‐ups…

A common goal is to minimize memory access
– To improve performance and energy efficiency

Most of the practical works are based on novel microarchitecture design
– They do not necessarily pose new test problems at the component level


Example: Google TPU Architecture

Source: “In‐Datacenter Performance Analysis of a Tensor Processing Unit,” 2017 International Symposium on Computer Architecture

The TPU does not necessarily pose fundamentally new test problems, except that there are more memory elements distributed among the computation units


New Devices Under R&D

Source: C. D. Schuman, et al. “A Survey of Neuromorphic Computing and Neural Networks in Hardware” May 19, 2017


Why Memory Resistor (Memristor)?

It is believed that a von Neumann Architecture is reaching its performance limits
A Neuromorphic Architecture merges memory into the computation in a distributed way
– Memristor is an ideal device to implement synapses

[Figure: von Neumann Architecture (PU and Memory separated by a Bottleneck) compared with Neuromorphic Computing]


Neurocomputing Hardware

Dmitri B. Strukov: Mixed‐Signal Nanoelectronic Neurocomputing
– Analog/mixed‐signal design can offer far better performance for INFERENCE


Neurocomputing Hardware

Dmitri B. Strukov: High‐Performance Mixed‐Signal Neurocomputing with Nanoscale Floating‐Gate Memory Cell Arrays, IEEE Trans. Neural Networks & Learning Systems, 2018 (early access)
– The core component is a resistive crossbar with tunable conductance G, essentially an analog NVM, for vector‐by‐matrix multiplication

Test: What’s the requirement for testing?
Validation: What’s the impact of imperfect hardware on model inference?
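The validation question can be explored numerically before any silicon exists. In an ideal crossbar, input voltages v applied to the rows produce column currents I_j = Σ_i v_i G_ij (Ohm's law per cell, current summation per column). The sketch below compares that ideal product with one computed from perturbed conductances. The multiplicative Gaussian variation model, the conductance and voltage ranges, and the array size are all assumptions made here for illustration, not figures from the cited work.

```python
import numpy as np

def crossbar_vmm(v, G):
    """Column currents of an ideal crossbar: I_j = sum_i v_i * G_ij."""
    return np.asarray(v, dtype=float) @ np.asarray(G, dtype=float)

def perturb_conductance(G, rel_sigma, rng):
    """Scale each conductance by (1 + rel_sigma * N(0, 1)), a crude variation model."""
    G = np.asarray(G, dtype=float)
    return G * (1.0 + rel_sigma * rng.normal(size=G.shape))

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-5, size=(64, 10))   # hypothetical conductances (siemens)
v = rng.uniform(0.0, 0.2, size=64)           # hypothetical input voltages (volts)

ideal = crossbar_vmm(v, G)
for rel_sigma in (0.0, 0.01, 0.05, 0.2):
    noisy = crossbar_vmm(v, perturb_conductance(G, rel_sigma, rng))
    rel_err = np.linalg.norm(noisy - ideal) / np.linalg.norm(ideal)
    print(f"sigma={rel_sigma:.2f}  relative output error={rel_err:.4f}")
```

Feeding the perturbed outputs into the rest of a trained model would show how much device variation the inference accuracy can tolerate, which is one way to frame a test requirement.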


Summary


Summary

Machine Learning for Test
– Many works exist (but how many are practical?)
– For applications with limited data, need an AI system to apply machine learning

Test for Machine Learning
– Driven by self‐driving car and deep‐learning hardware
– Functional safety will remain a major concern in the foreseeable future



THANK YOU!

Questions?

