
Predictive analysis on Multivariate, Time Series datasets using Shapelets

Hemal Thakkar, Department of Computer Science, Stanford University

[email protected] | [email protected]

Abstract—Multivariate, time series analysis is a very common statistical application in many fields. Such analysis is also applied in large-scale industrial processes like batch processing, which is heavily used in the chemical, textile, pharmaceutical, food-processing and many other industrial domains. The goals of this project are: 1. Explore the use of supervised learning techniques such as logistic regression and linear SVM to reduce the dimensionality of the multivariate dependent variable to a single confidence score indicating whether a batch production run is on-spec or off-spec. 2. Instead of classical multivariate classification techniques like PCA/PLS or nearest-neighbour methods, explore the use of the novel concept of time series shapelets for predicting the qualitative outcome of a batch.

I. INTRODUCTION

Multivariate, time series analysis is a very common statistical application in many fields. Furthermore, with the arrival of the Industrial Internet (IIoT), more and more processes are being instrumented for better accuracy and predictability, producing a large amount of sensor data. Performing on-line, predictive analysis for such high-dimensional and time-varying data is already becoming a challenging task. Current research and applications for predictive analysis on batch datasets rely heavily on time warping and PLS-based techniques. The aim of this project is to simplify the process of predicting batch quality outcomes by using current research in the field of representative temporal sub-sequences, namely shapelets. This project focuses in particular on predicting the quality outcome of an industrial batch process using a combination of initial charge variables and time series information.

II. LITERATURE REVIEW AND PROBLEM DESCRIPTION

There has been an extensive amount of research on multivariate, time series analysis covering use cases like pattern recognition (1), classification (2), forecasting (3) and anomaly detection (4). Several machine learning techniques are used to address these problems. A comprehensive survey of the theoretical aspects of forecasting techniques is covered in (5).

This project focuses in particular on the problem of predicting the quality outcome for an industrial batch processing use case. For such cases, algorithms like dynamic time warping (6) are used with supervised learning techniques like nearest-neighbour search for finding similarities, querying and data mining of temporal sequences. We note that with such time warping there is a risk of losing important temporal variations in the time series data that could aid in predicting the qualitative outcome of the final batch. Thus, instead of using time warping, we propose the use of temporal sub-sequences which can act as representatives of a quality class.

III. DATA SET, NOTATIONS AND PROBLEM STATEMENT

To further elaborate on the problem statement we first introduce notation and describe terminology that will be used throughout the paper. To aid our notation we will use the dataset that was acquired to test and verify our analysis and results. We will be using a batch dataset from a chemical process; although the underlying chemistry is not of significant relevance here, one can refer to (7) for a better description of the underlying chemical process. The dataset we use is very similar to the one referred to in (7); hence in this paper we only describe the variables used for predictive analysis instead of the full details and background of the chemical process. In particular, we refer to a batch dataset b_i as a collection of the following 3 data structures: {z_i, x_i, y_i}. For our analysis we acquired a data set of more than 80 batches; however, after removing anomalous and missing data values, the algorithms are executed on 33 batches as described in the following sections. Consider the following notation and properties of the dataset (a minimal data-structure sketch in Python follows the list):

• Total number of batches: B = 33, represented by b_i for i = 1 ... 33.

• Initial chemical properties of each batch b_i: z_i ∈ R^k, where k is the number of initial chemical properties measured for each batch. These properties are measured at time t = 0 before the batch process starts and are also referred to as initial charge variables.

• Multivariate product quality parameters: y_i ∈ R^j, where j is the total number of quality measurements taken at the end of each batch process.

• Final product quality: Q_i ∈ {1, −1}, with 1 = On-spec, i.e. the final quality of batch i was as per the specifications, and −1 = Off-spec, i.e. the final quality of batch i was not as per the expected specifications.


Fig. 1: Batch trajectories

• Time series process variable trajectories as they evolve during the batch processing: x_i^n, where n is the number of trajectories or time series variables measured during the batch process. Thus x_i^(n,t) ∈ R is one reading of the nth process variable of the ith batch taken at time t. See figure 1 for a visual representation of 3 trajectories evolving from time t = 0 to 160.

• In our case, n = 10; t varies for each i but remains constant across all n variables; k = 11 and j = 11.

• Note that y_i and z_i may or may not represent the same chemical measurements. Also note that y_i ∈ R^j is a multidimensional measurement of the output quality parameters, whereas Q_i ∈ {1, −1} represents the final quality flag for each batch.

• For brevity we have not formally defined the concepts of time series, subsequence and shapelet in this project. Instead, we closely follow the definitions described in (9) and (10).
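To make the notation concrete, the following is a minimal Python sketch of how one batch record b_i = {z_i, x_i, y_i} and its quality flag Q_i could be represented; the class name, field names and the synthetic example are illustrative assumptions, not part of the original dataset.

from dataclasses import dataclass
import numpy as np

@dataclass
class Batch:
    """One batch b_i = {z_i, x_i, y_i} together with its final quality flag Q_i."""
    z: np.ndarray   # initial charge variables, shape (k,) with k = 11
    x: np.ndarray   # process trajectories, shape (n, T_i) with n = 10; T_i varies per batch
    y: np.ndarray   # end-of-batch quality measurements, shape (j,) with j = 11
    Q: int          # final quality flag: +1 = on-spec, -1 = off-spec

# Example: one synthetic batch with 160 time steps (shapes only, values are random)
rng = np.random.default_rng(0)
b = Batch(z=rng.normal(size=11), x=rng.normal(size=(10, 160)),
          y=rng.normal(size=11), Q=1)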

Given the variability within all n trajectories in x_i it is difficult to align the trajectories across batches based on any single variable; hence techniques like time warping as described in (6) are used for accurate comparison. However, such manipulation may introduce distortion in the time series values. Furthermore, it is difficult to identify noticeable landmarks within the dataset due to the variance across the n variable trajectories. To address this and exploit the temporal nature of the trajectories, we propose the use of shapelets as representative temporal subsequences to predict the final quality output Q_i.

Fig. 2: Confidence Scores

The prediction pipeline is structured as follows. Firstly, y_i is used to derive confidence scores as explained in section IV. Secondly, in section V we show a few sample shapelets identified manually, whereas in section VI we implement the algorithm described in (10) to learn high-entropy representative shapelets directly from the dataset. Lastly, in sections VII and VIII we summarize the results obtained by using a combination of z_i and shapelet distances to predict Q_i.


Fig. 3: Shapelets: A visual intuition on one process trajectory

IV. DATA PRE-PROCESSING AND LEARNING CONFIDENCE SCORES

In this section we simplify the multivariate output y_i to a uni-variate confidence score. For this purpose we implement regularized logistic regression as described in (11) and (12) with y_i as input to learn the classification output Q_i. The confidence score is simply θ^T y_i, i.e. the distance of each y_i from the separating hyperplane. This score indicates the degree of the on-spec/off-spec quality measure and transforms the data from {1, −1} to R. For brevity, the implementation details of logistic regression, data cleaning and removal of anomalies are not listed here; instead, figure 2 shows the final summary of positive (on-spec) or negative (off-spec) scores for each batch.
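A minimal sketch of this step, assuming scikit-learn is available, that Y is the (batches x j) matrix of end-of-batch quality measurements and Q the vector of +1/−1 flags; the regularization strength C is an illustrative placeholder, not a value reported in this paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

def confidence_scores(Y: np.ndarray, Q: np.ndarray, C: float = 1.0) -> np.ndarray:
    """Fit a regularized logistic regression on the quality vectors y_i and
    return the signed distance of each y_i from the separating hyperplane,
    used here as the per-batch confidence score."""
    clf = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    clf.fit(Y, Q)
    # decision_function returns theta^T y_i + b: positive -> on-spec, negative -> off-spec
    return clf.decision_function(Y)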

V. SHAPELETS, INTUITION AND VISUALIZATIONS

As per (9), shapelets are informally defined as temporal subsequences present in the time series data set which act as maximal representatives of a given class. In this section we provide visual motivation for using shapelets in our use case. After finding the confidence scores in section IV, we find the top-4 batches which can informally represent each class: a) highest degree of on-spec (maximum positive confidence) and b) highest degree of off-spec (lowest confidence). We find that batch Ids {66, 14, 68, 13} represent the positive class whereas {48, 45, 55, 53} represent the negative class, where all Ids are in decreasing order of their respective degrees. By simply visualizing the trajectories of each variable we find that we can derive a temporal subsequence which can uniquely represent a given class. For example, figure 3 shows one such trajectory variable. The green boxes represent the positive class, whereas the red boxes are very dissimilar to the positive pattern. We use this pattern as the representative of the positive class in this example.

Fig. 4: Pressure variance with confidence score. Diverging from Green (1: On-spec) to Purple (−1: Off-spec)

Fig. 5: Temperature variance with confidence score. Diverging from Green (1: On-spec) to Purple (−1: Off-spec)

Similarly, shapelets representing the negative class were also found. Furthermore, figures 4 and 5 display the time series data using the confidence scores as colors, to visualize the temporal placement and sequencing of on-spec (green) and off-spec (purple) batches.
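As a small illustration of how the representative batches and the color-coded trajectory plots of figures 4 and 5 could be produced, the following sketch assumes the Batch records and the confidence scores from the earlier sketches, with matplotlib as the plotting dependency and the PRGn colormap standing in for the green-to-purple coding; none of these choices are taken from the original implementation.

import numpy as np
import matplotlib.pyplot as plt

def representative_batches(scores: np.ndarray, top: int = 4):
    """Return indices of the most on-spec and most off-spec batches by score."""
    order = np.argsort(scores)              # ascending: most off-spec first
    return order[::-1][:top], order[:top]   # (positive class, negative class)

def plot_colored_trajectories(batches, scores, var: int = 0):
    """Plot one process variable for every batch, colored by confidence score."""
    norm = plt.Normalize(scores.min(), scores.max())
    for b, s in zip(batches, scores):
        plt.plot(b.x[var], color=plt.cm.PRGn(norm(s)), alpha=0.7)
    plt.xlabel("time step"); plt.ylabel(f"process variable {var}")
    plt.show()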

VI. LEARNING REPRESENTATIVE SHAPELETS DIRECTLY FROM DATA

In this section we describe the algorithm implemented for the identification of representative shapelets. We closely follow and refine the extensive research presented in (10). For brevity we have not re-written the methods developed in (10); instead we only describe the refinements applied to make the model usable for our dataset. We have used the binary classification model (see Section 3 of (10)) instead of the generic model described in its Section 5. We will refer to this binary classification model as the "LTS model". We recommend the reader go through (10) (mainly Sections 3 and 4) for a better understanding of the refinements performed in this project.


The implementation learns representative shapelets and associated weights directly from data. The learning is performed using stochastic gradient descent, optimizing on one time series in each iteration. The equations relevant to the updated algorithm for our dataset are described below. \hat{Q}_i represents the estimated quality of the ith batch:

\hat{Q}_i = W_0 + \sum_{k=1}^{K} M_{i,k} W_k \qquad (1)

L represents the length of one shapelet and hence the length of the segments of the time series data. D_{i,k,j} represents the distance between the jth data segment of the ith time series and the kth shapelet:

D_{i,k,j} = \frac{1}{L} \sum_{l=1}^{L} \left( x_i^{(n,\, j+l-1)} - S_{k,l} \right)^2 \qquad (2)

M_{i,k} represents the distance between a time series x_i^{(n,:)} and the kth shapelet S_k:

M_{i,k} = \min_{j = 1 \dots J} D_{i,k,j} \qquad (3)

In order to compute derivatives, M_{i,k} is estimated using the soft-minimum approximation \hat{M}_{i,k}:

\hat{M}_{i,k} = \frac{\sum_{j=1}^{J} D_{i,k,j}\, e^{\alpha D_{i,k,j}}}{\sum_{j'=1}^{J} e^{\alpha D_{i,k,j'}}} \qquad (4)

The loss functions are the same as equations 3 and 7 from (10). Below are the updated derivative equations, modified to address our dataset (I denotes the number of training batches):

\frac{\partial F_i}{\partial S_{k,l}} = \frac{\partial L(Q_i, \hat{Q}_i)}{\partial \hat{Q}_i}\, \frac{\partial \hat{Q}_i}{\partial \hat{M}_{i,k}} \sum_{j=1}^{J} \frac{\partial \hat{M}_{i,k}}{\partial D_{i,k,j}}\, \frac{\partial D_{i,k,j}}{\partial S_{k,l}} \qquad (5)

\frac{\partial L(Q_i, \hat{Q}_i)}{\partial \hat{Q}_i} = -\left( Q_i - \sigma(\hat{Q}_i) \right) \qquad (6)

\frac{\partial \hat{Q}_i}{\partial \hat{M}_{i,k}} = W_k \qquad (7)

\frac{\partial \hat{M}_{i,k}}{\partial D_{i,k,j}} = \frac{e^{\alpha D_{i,k,j}} \left( 1 + \alpha \left( D_{i,k,j} - \hat{M}_{i,k} \right) \right)}{\sum_{j'=1}^{J} e^{\alpha D_{i,k,j'}}} \qquad (8)

\frac{\partial D_{i,k,j}}{\partial S_{k,l}} = \frac{2}{L} \left( S_{k,l} - x_i^{(n,\, j+l-1)} \right) \qquad (9)

\frac{\partial F_i}{\partial W_k} = -\left( Q_i - \sigma(\hat{Q}_i) \right) \hat{M}_{i,k} + \frac{2 \lambda_W}{I} W_k \qquad (10)

\frac{\partial F_i}{\partial W_0} = -\left( Q_i - \sigma(\hat{Q}_i) \right) \qquad (11)
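The following is a minimal NumPy sketch of equations (2)-(4), assuming ts is one process-variable trajectory of a batch and S a single candidate shapelet of length L; alpha is the soft-minimum parameter (negative, as in Table II). It is an illustration, not the project's actual code.

import numpy as np

def segment_distances(ts: np.ndarray, S: np.ndarray) -> np.ndarray:
    """D_{i,k,j}: mean squared distance between shapelet S and every
    length-L segment of the trajectory ts (equation 2)."""
    L = len(S)
    J = len(ts) - L + 1
    segs = np.stack([ts[j:j + L] for j in range(J)])    # shape (J, L)
    return ((segs - S) ** 2).mean(axis=1)                # shape (J,)

def soft_min_distance(ts: np.ndarray, S: np.ndarray, alpha: float = -3.0):
    """Soft-minimum approximation M_hat_{i,k} (equation 4) together with the
    normalizer psi_{i,k} = sum_j exp(alpha * D_{i,k,j}), which the training
    loop can pre-compute in Phase 1 and reuse in Phase 2."""
    D = segment_distances(ts, S)
    w = np.exp(alpha * D)
    psi = w.sum()
    return (D * w).sum() / psi, psi, D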

Equations 1 to 11 form the basis of our implementation. Algorithm 1 described in (10) is updated to address:

• The variable nature of our time series data.

• Improved algorithmic efficiency: processing time is reduced by pre-calculating \hat{M}_{i,k} and \psi_{i,k} = \sum_{j'=1}^{J} e^{\alpha D_{i,k,j'}} (Phase 1), which can be reused when iterating through each time series reading (Phase 2).

Our version of the algorithm is described in Algorithm 1. The weights W_k, intercept W_0 and shapelet set S_k can be used to predict Q_i for any batch using Equation 1.

Algorithm 1 Learning Time Series Shapelets
Require: x_i^n, number of shapelets K, length of a shapelet L, regularization parameter λ_W, learning rate η, maximum number of iterations maxIter, soft-min parameter α

1:  procedure fit()
2:      ▷ Initialize centroids using K-Means clustering of the variable-length TS data
3:      S_{k,l} ← getKMeansCentroid(x_i^n)
4:      ▷ Initialize weights randomly
5:      W_k, W_0 ← random()
6:      for iter = 1 ... maxIter do
7:          for i = 1 ... I do
8:              for k = 1 ... K do
9:                  ▷ Phase 1: Prepare data for x_i and S_k
10:                 W_k ← W_k − η ∂F_i/∂W_k
11:                 M̂_{i,k}, ψ_{i,k} ← calcSoftMinDist(x_i^n, S_k)
12:                 ▷ Phase 2: Iterate through the TS data
13:                 for l = 1 ... L do
14:                     S_{k,l} ← S_{k,l} − η ∂F_i/∂S_{k,l}
15:                 end for
16:             end for
17:             W_0 ← W_0 − η ∂F_i/∂W_0
18:         end for
19:     end for
20:     return W_k, W_0, S_k
21: end procedure
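The following rough Python sketch mirrors the structure of Algorithm 1 for a single process variable, under the simplifying assumptions that every trajectory has been cut to a common length and that labels are encoded as 0/1 so that the logistic-loss gradient of equation (6) applies directly. It reuses the soft_min_distance helper from the earlier sketch and follows equations (5)-(11); it is meant to illustrate the algorithm, not to reproduce the project's exact implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_shapelets(X, Q, K=5, L=30, lam_W=0.002, eta=0.05, alpha=-3.0,
                  max_iter=300, seed=0):
    """Learn K shapelets of length L plus weights (W, W0) so that
    Q_hat_i = W0 + sum_k M_hat_{i,k} W_k predicts the batch quality.
    X: array (I, T), one process variable per batch; Q: array of 0/1 labels."""
    rng = np.random.default_rng(seed)
    I = len(X)
    # crude initialization: random segments as stand-ins for K-Means centroids
    starts = rng.integers(X.shape[1] - L + 1, size=K)
    S = np.stack([X[rng.integers(I)][j:j + L] for j in starts]).astype(float)
    W = rng.normal(scale=0.01, size=K)
    W0 = 0.0
    for _ in range(max_iter):
        for i in rng.permutation(I):
            # Phase 1: soft-min distances of batch i to every shapelet
            M_hat = np.empty(K); psi = np.empty(K); D_all = []
            for k in range(K):
                M_hat[k], psi[k], D = soft_min_distance(X[i], S[k], alpha)
                D_all.append(D)
            Q_hat = W0 + M_hat @ W
            err = -(Q[i] - sigmoid(Q_hat))                 # equation (6)
            # weight updates, equations (10) and (11)
            W -= eta * (err * M_hat + (2 * lam_W / I) * W)
            W0 -= eta * err
            # Phase 2: shapelet updates, equations (5), (8), (9)
            for k in range(K):
                D = D_all[k]
                dM_dD = np.exp(alpha * D) * (1 + alpha * (D - M_hat[k])) / psi[k]
                segs = np.stack([X[i][j:j + L] for j in range(len(D))])
                dD_dS = (2.0 / L) * (S[k][None, :] - segs)  # shape (J, L)
                S[k] -= eta * err * W[k] * (dM_dD @ dD_dS)
            # note: the W_k update above uses the current pass's M_hat,
            # whereas Algorithm 1 interleaves it inside the k-loop
    return S, W, W0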

LTS Model Initialization and Convergence: The loss function defined for the LTS model is non-convex in S and W, hence convergence of the model is not guaranteed when using optimization techniques like stochastic gradient descent. This, combined with the variable nature of, and the multiple time series within, one batch makes convergence even more difficult. Hence initializing the shapelets to true centroids is very important for convergence of the learning algorithm. Clustering techniques like K-Means and K-NN were evaluated for initialization. Both algorithms led to similar precision, but the K-NN technique required a larger processing time; hence K-Means was used for simple clustering of the data segments. Figure 6 shows a visualization of the centroids derived for one of the parameters using the K-Means algorithm.
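A short sketch of the K-Means-based initialization, assuming scikit-learn: all length-L sliding-window segments from the available trajectories of one process variable are pooled and clustered, and the K cluster centroids serve as the initial shapelets. The function and parameter names are illustrative, not taken from the original code.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_shapelet_init(trajectories, K: int = 5, L: int = 30,
                         random_state: int = 0) -> np.ndarray:
    """Pool every length-L segment from a list of variable-length trajectories
    and return the K cluster centroids as the initial shapelet set S (K x L)."""
    segments = [ts[j:j + L]
                for ts in trajectories
                for j in range(len(ts) - L + 1)]
    km = KMeans(n_clusters=K, n_init=10, random_state=random_state)
    km.fit(np.asarray(segments))
    return km.cluster_centers_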

VII. RESULTS

Various techniques were evaluated to predict Q_i, first using z alone and then using a combination of predictions obtained from the representative shapelets. Table I summarizes the results.


Feature and Model Selection | Test accuracy | Observation
Linear regression using z_i to predict Q_i | 0.56 | Slightly better than random guessing but not a usable model.
Logistic regression using z_i to predict Q_i | 0.64 | Better model, but still low accuracy. Shows that the initial conditions do not fully determine the final quality measure.
Predict Q_i using Algorithm 1 without using z | 0.67 | Shows only minor improvement; suggests that the time series alone may not be enough to get higher accuracy.
Logistic regression using z_i and M_{i,k} (equation 3) from learned shapelets to predict Q_i | 0.75 | Promising improvement without much work on parameter tuning. The only parameter that required tuning was the soft-min parameter α, since the model would not converge without it.
Logistic regression using z_i and M_{i,k} (equation 3) from learned shapelets to predict Q_i, combined with cross-validation and grid search on the hyper-parameters described in Table II | 0.80 | Significant improvement in accuracy, but the training time and complexity also increased significantly.

TABLE I: Test accuracy, defined as the average fraction of correct test predictions. Note that the test data had an equal number of +1 and −1 quality labels, hence the model did not suffer from an irregular recall measure.

Fig. 6: Centroids visualization

Also, for the model to converge and produce workable results, a considerable amount of effort was required to find a workable range of hyper-parameters; the ranges are summarized in Table II.
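A compact sketch of the best-performing configuration in Table I, assuming scikit-learn: the initial charge variables z_i are concatenated with the shapelet distances M_{i,k} and fed to a logistic regression whose regularization strength C is tuned by cross-validated grid search. The grid values shown are placeholders drawn from the workable range in Table II, not the exact values used.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_combined_model(Z: np.ndarray, M: np.ndarray, Q: np.ndarray):
    """Z: (batches, k) initial charge variables; M: (batches, K) shapelet
    distances M_{i,k}; Q: +1/-1 quality flags. Returns a fitted,
    cross-validated logistic regression on the concatenated features."""
    features = np.hstack([Z, M])
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.5, 0.6, 0.7]},   # workable range from Table II
        cv=5, scoring="accuracy")
    search.fit(features, Q)
    return search.best_estimator_, search.best_score_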

VIII. SUMMARY AND FUTURE WORK

Firstly, supervised learning techniques were used to improve visualization techniques on batch, multivariate, time series datasets, and the intuition developed was used to verify whether shapelets can be used for predicting a batch's quality outcome. Secondly, the LTS model algorithm was implemented to learn representative shapelets on the available datasets. Lastly, an array of machine learning techniques was used in conjunction with the available data and the learned shapelets to predict the quality outcome of batches.

Hyper-Parameter Description | Symbol | Workable Range
Regularization parameter for learning weights on z_i | C | 0.5 to 0.7
Soft minimum parameter | α | −1 to −5
Number of shapelets | K | 3 to 7
Length of each shapelet | L | 20 to 40
Shapelet learning rate | η_S | 0.05 to 0.1
Shapelet learning regularization parameter | λ_W | 0.001 to 0.003

TABLE II: Hyper-parameters and workable ranges

Below is a list of aspects which can be improved or worked upon in follow-up research projects:

• Shapelet learning can be improved in various ways, e.g. using alternatives to the soft-minimum function, finding α as a function of the time series vector length, or improving shapelet initialization by exploring different unsupervised learning techniques.

• Use a combination of time warping and shapelet learning techniques.

• Update Algorithm 1 to simultaneously learn weights on shapelet distances from all time series variables and the initial conditions z_i.


REFERENCES AND NOTES

1. Pattern Recognition and Classification for Multivariate Time Series
2. Multivariate Time Series Classification by Combining Trend-Based and Value-Based Approximations
3. Forecasting performance of multivariate time series models with full and reduced rank: an empirical examination
4. Detection and Characterization of Anomalies in Multivariate Time Series
5. Machine Learning Strategies for Time Series Prediction
6. Querying and mining of time series data: experimental comparison of representations and distance measures
7. Trouble-shooting of an industrial batch process using multivariate methods
8. Multivariate SPC Charts for Monitoring Batch Processes
9. Time Series Shapelets: A New Primitive for Data Mining
10. Learning Time-Series Shapelets
11. CS229 Notes: Supervised Learning, Classification
12. CS229 Notes: Regularization and Model Selection