Modelling procedures for directed network of data blocks

1


Agnar Höskuldsson, Centre for Advanced Data Analysis, Copenhagen

Data structures:

- Directed network of data blocks
- Input data blocks
- Output data blocks
- Intermediate data blocks

Methods

- Optimization procedures for each passage through the network
- Balanced optimization of fit and prediction (H-principle)
- Scores, loadings, loading weights, regression coefficients for each data block
- Methods of regression analysis applicable at each data block
- Evaluation procedures at each data block
- Graphic procedures at each data block

2

Chemometric methods

1. Regression estimation. Data X, Y. Traditional presentation: $Y_{est} = XB$, and standard deviations for B. Latent structure: $X = TP' + X_0$, where $X_0$ is not used; $Y = TQ' + Y_0$, where $Y_0$ is not explained.

2. Fit and precision. Both fit and precision are controlled.

3. Selection of score vectors. Score vectors should be as large as possible and describe Y as well as possible. Modelling stops when no more are found (cross-validation).

4. Graphic analysis of latent structure. Score and loading plots; plots of weight (and loading weight) vectors.


3

5. Covariance as measure of relationship. For scaled data, $X'Y$ measures the strength of the relationship. $X_1'Y = 0$ implies that $X_1$ is removed from the analysis.

6. Causal analysis. $T = XR$. From score plots we can draw inferences about the original measurement values. Control charts for score values can be related to contribution charts.

7. Analysis of X. Most of the analysis time is devoted to understanding the structure of X. Plots are marked with symbols to better identify points in score or loading plots.

8. Model validation. Cross-validation is used to validate the results. Bootstrapping (re-sampling from the data) is used to establish confidence intervals.


4

9. Different methods. Different types of data or situations may require different types of method. One is looking for interpretations of the latent structure found.

10. Theory generation. Results from the analysis are used to establish views or theories on the data. Results motivate further analysis (groupings, non-linearity, etc.).

5

Partitioning data, 1

[Figure: measurement data blocks X1, X2, ..., XL; response data blocks Y1, Y2; reference data blocks Z1, Z2, Z3.]

6

Partitioning data, 2

- There is often a natural sub-division of data.

- It is often required to study the role of a sub-block.

- A data block with few variables may 'disappear' among one with many variables; e.g. optical instruments often give many variables.

[Figure: instrumental data X partitioned into X1, X2, X3 and response data Y partitioned into Y1, Y2. Block labels: engineering, chemical, process, quality, chemical results.]

7

Path diagram 1

[Figure: path diagram connecting the data blocks X1, X2, X3, X4, X5, X6, X7.]

Examples:

- Production process
- Organisational data
- Diagram for sub-processes
- Causal diagram

8

Path diagram 2, schematic application of modelling

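[Figure: path diagram of the data blocks X1 to X7, with new samples x10, x20, x30 entering at X1, X2, X3.]

x10 is a new sample from X1, x20 is a new one from X2, and x30 is a new one from X3. How do they generate new samples for X4, X5, X6 and X7?

Resulting estimating equations:

$X_{4,est} = X_1 B_{14} + X_2 B_{24} + X_3 B_{34}$

$X_{5,est} = X_1 B_{15} + X_2 B_{25} + X_3 B_{35}$

$X_{6,est} = X_4 B_{46} + X_5 B_{56}$

$X_{7,est} = X_6 B_{67}$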
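The estimating equations can be applied in sequence to new samples x10, x20, x30. The following is a minimal sketch only: the function name `propagate`, the dictionary of coefficient matrices and the block widths are assumptions made for illustration, and the coefficients are random placeholders rather than estimates from data.

```python
import numpy as np

def propagate(x10, x20, x30, B):
    """Pass new samples from the input blocks through the path diagram.

    B maps (from_block, to_block) to a coefficient matrix; this container
    is a hypothetical choice for the sketch, not part of the slides.
    """
    x4 = x10 @ B[1, 4] + x20 @ B[2, 4] + x30 @ B[3, 4]
    x5 = x10 @ B[1, 5] + x20 @ B[2, 5] + x30 @ B[3, 5]
    x6 = x4 @ B[4, 6] + x5 @ B[5, 6]      # uses the estimates of X4 and X5
    x7 = x6 @ B[6, 7]                     # uses the estimate of X6
    return x4, x5, x6, x7

rng = np.random.default_rng(0)
# Illustrative number of variables per block (assumed, not from the slides).
k = {1: 3, 2: 4, 3: 2, 4: 3, 5: 2, 6: 2, 7: 1}
edges = [(1, 4), (2, 4), (3, 4), (1, 5), (2, 5), (3, 5), (4, 6), (5, 6), (6, 7)]
# Placeholder coefficients; in practice these come from the fitted network.
B = {(i, j): rng.normal(size=(k[i], k[j])) for i, j in edges}

x10, x20, x30 = (rng.normal(size=k[i]) for i in (1, 2, 3))
x4, x5, x6, x7 = propagate(x10, x20, x30, B)
```

Only the three input samples are needed; every later block is estimated from blocks earlier in the path.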

9

Path diagram 3

[Figure: path diagram of the data blocks X1 to X7, with the blocks aligned to the time points t1 and t2.]

Data blocks can be aligned to time. Modelling can start at time t2.

10

Notation and schematic illustrations

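[Figure: instrumental data X and response data Y, with weight vector w, score vector t, loading vector q and Y-score vector u.]

w: weight vector (to be found)

t: score vector, $t = Xw = w_1 x_1 + \dots + w_K x_K$

q: loading vector, $q = Y^T t = [(y_1^T t), \dots, (y_M^T t)]$

u: Y-score vector, $u = Yq = q_1 y_1 + \dots + q_M y_M$

Vectors are collected into matrices, e.g. $T = (t_1, \dots, t_A)$.

Adjustments:

$X \leftarrow X - t p^T/(t^T t)$

$Y \leftarrow Y - t q^T/(t^T t)$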
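The notation can be traced through one step numerically. This is a sketch under stated assumptions: the data are random, the weight vector w is taken as the dominant left singular vector of X'Y (a common PLS-type choice, not prescribed by the slide), and the X-loading is taken as p = X't.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))   # instrumental data, N x K
Y = rng.normal(size=(20, 2))   # response data, N x M

# Assumption: w is the dominant left singular vector of X'Y.
w = np.linalg.svd(X.T @ Y)[0][:, 0]

t = X @ w        # score vector, t = Xw
q = Y.T @ t      # loading vector, q = Y't
u = Y @ q        # Y-score vector, u = Yq
p = X.T @ t      # X-loading vector (assumed definition p = X't)
tt = t @ t

# Adjustments: remove the part of X and Y described by t.
X_new = X - np.outer(t, p) / tt
Y_new = Y - np.outer(t, q) / tt
```

After the adjustment the columns of both blocks are orthogonal to t, so the next score vector describes variation that t has not yet accounted for.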

11

Conjugate vectors 1

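[Figure: three schematic cases of a data block X with weight, score, loading and conjugate vectors.]

Case 1, conjugate vectors r: $t = Xw$, $p = X^T t$; $p_a^T r_b = 0$ for $a \neq b$.

Case 2, conjugate vectors r: $t = Xq$; $q_a^T r_b = 0$ for $a \neq b$.

Case 3, conjugate vectors r and s: $t = Xw$, $p = X^T v$; $p_a^T r_b = 0$ and $t_a^T s_b = 0$ for $a \neq b$.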

12

Conjugate vectors 2

The conjugate vectors $R = (r_1, r_2, \dots, r_A)$ satisfy $T = XR$.

Latent structure solution:

$X = T P^T + X_0$, where $X_0$ is the part of X that is not used.

$Y = T Q^T + Y_0$, where $Y_0$ is the part of Y that could not be explained.

$Y = T Q^T + Y_0 = X (R Q^T) + Y_0 = X B + Y_0$, for $B = R Q^T$.

The conjugate vectors are always computed together with the score vectors.

When the regression on score vectors has been computed, the regression on the original variables is computed as shown.
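The relations T = XR and B = RQ' can be checked numerically. The sketch below is illustrative only: the function name `latent_structure` is hypothetical, the weight vectors are chosen as dominant singular vectors (an assumed PLS-type choice), and the loadings are scaled by 1/(t't) so that X = TP' + X0 and Y = TQ' + Y0 hold literally.

```python
import numpy as np

def latent_structure(X, Y, n_components):
    """Return scores T, conjugate vectors R, Y-loadings Q and B = R Q'."""
    Xa, Ya = X.copy(), Y.copy()
    T, R, Q, P = [], [], [], []
    for _ in range(n_components):
        # Assumption: w is the dominant left singular vector of Xa'Ya.
        w = np.linalg.svd(Xa.T @ Ya)[0][:, 0]
        t = Xa @ w
        d = t @ t
        p = Xa.T @ t / d          # loadings scaled by 1/(t't)
        q = Ya.T @ t / d
        # Conjugate vector: rewrite t in terms of the original X, t = X r.
        r = w - sum(rb * (pb @ w) for rb, pb in zip(R, P))
        Xa = Xa - np.outer(t, p)  # adjust X
        Ya = Ya - np.outer(t, q)  # adjust Y
        T.append(t); R.append(r); Q.append(q); P.append(p)
    T, R, Q = (np.column_stack(M) for M in (T, R, Q))
    return T, R, Q, R @ Q.T

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
Y = rng.normal(size=(30, 2))
T, R, Q, B = latent_structure(X, Y, 3)
Y0 = Y - X @ B                    # part of Y that could not be explained
```

Because each r is built alongside its score vector, the regression on the original variables follows directly once the regression on the scores is known.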

13

Optimization procedure, 1

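Two data blocks, X1 and X2: find the weight vector w1, giving the score vector t1 and the loading vector q2, such that $|q_2|^2$ is maximal.

One data block, X1: find the weight vector w1, giving the score vector t1, such that $|t_1|^2$ is maximal.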
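Both criteria can be solved with a singular value decomposition if the weight vector is restricted to unit length. That restriction is an assumption of this sketch (the slide does not state the constraint), and the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal(size=(20, 5))
X2 = rng.normal(size=(20, 3))

# One data block: maximise |t1|^2 = |X1 w1|^2 over unit-length w1.
# The maximiser is the first right singular vector of X1.
_, s_one, Vt_one = np.linalg.svd(X1, full_matrices=False)
w_one = Vt_one[0]
t_one = X1 @ w_one

# Two data blocks: maximise |q2|^2 = |X2' X1 w1|^2 over unit-length w1.
# The maximiser is the first right singular vector of X2'X1.
_, s_two, Vt_two = np.linalg.svd(X2.T @ X1, full_matrices=False)
w_two = Vt_two[0]
t_two = X1 @ w_two
q2 = X2.T @ t_two
```

Under this constraint the one-block case gives the first principal component direction, and the two-block case gives a PLS-type weight vector.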

14

Three data blocks

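[Figure: schematic diagrams. (a) Blocks X, Y and Z with vectors w, t, ty, qy, tz, qz; the start is at X and the criterion is $|q_z|^2$ max; X is the basis from which Y is estimated, and Y is the basis from which Z is estimated. (b) Blocks X1, X2, X3, X4 with vectors t1, q2, t3, t4, q4.]

Adjustments:

t1 describes X1: $X_1 \leftarrow X_1 - t_1 p_1^T/(t_1^T t_1)$, $p_1 = X_1^T t_1$.

t1 describes X2: $X_2 \leftarrow X_2 - t_1 q_2^T/(t_1^T t_1)$, $q_2 = X_2^T t_1$.

q2 describes X3: $X_3 \leftarrow X_3 - t_3 q_2^T/(q_2^T q_2)$, $t_3 = X_3 q_2$.

t3 describes X4: $X_4 \leftarrow X_4 - t_3 q_4^T/(t_3^T t_3)$, $q_4 = X_4^T t_3$.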

15


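Two input and two output data blocks:

[Figure: input blocks X1 and X2 with weight vectors w1, w2 and score vectors t1, t2; output blocks X3 and X4 with loading vectors q13, q23, q14, q24.]

Find w1 and w2: $|q_{13} + q_{23} + q_{14} + q_{24}|^2$ max

Two input, one intermediate and one output data blocks:

[Figure: input blocks X1 and X2 with weight vectors w1, w2 and score vectors t1, t2; intermediate block X3 with loading vectors q13, q23; output block X4 with loading vectors q134, q234.]

Find w1 and w2: $|q_{134} + q_{234}|^2$ max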

16

Balanced optimization of fit and prediction (H-principle)

Linear regression

In linear regression we are looking for a weight vector w such that the resulting score vector $t = Xw$ is good.

The basic measure of quality is the prediction variance for a sample $x_0$. Under standard assumptions and negligible bias, it can be written as

$F(w) = \mathrm{Var}(y(x_0)) = k\,[1 - (y^T t)^2/(t^T t)]\,[1 + t_0^2/(t^T t)]$.

It can be shown that $F(cw) = F(w)$ for all $c > 0$. Choose c such that $t^T t = 1$. Then

$F(w) = k\,[1 - (y^T t)^2]\,[1 + t_0^2]$.

To get a prediction variance as small as possible, it is natural to choose w such that $(y^T t)^2$ becomes as large as possible:

maximize $(y^T t)^2$ = maximize $|q|^2$ (PLS regression)
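The scale invariance F(cw) = F(w) and the normalised form can be verified numerically. The sketch assumes that t0 is the score of the new sample, t0 = x0'w, that y is scaled to unit length and that k = 1; none of these is stated explicitly on the slide, and the function name `prediction_variance` is hypothetical.

```python
import numpy as np

def prediction_variance(w, X, y, x0, k=1.0):
    """F(w) = k [1 - (y't)^2/(t't)] [1 + t0^2/(t't)], t = Xw, t0 = x0'w."""
    t = X @ w
    t0 = x0 @ w                     # assumed meaning of t0
    tt = t @ t
    return k * (1 - (y @ t) ** 2 / tt) * (1 + t0 ** 2 / tt)

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
y = rng.normal(size=25)
y /= np.linalg.norm(y)              # assumption: y scaled to unit length
x0 = rng.normal(size=4)
w = rng.normal(size=4)

F_w = prediction_variance(w, X, y, x0)
F_cw = prediction_variance(3.7 * w, X, y, x0)   # rescaled weight vector

# Choose c such that t't = 1 and evaluate the simplified expression.
c = 1 / np.linalg.norm(X @ w)
t = X @ (c * w)
t0 = x0 @ (c * w)
F_unit = (1 - (y @ t) ** 2) * (1 + t0 ** 2)
```

The first factor falls as (y't)^2 grows while the second grows with t0^2, which is the balance between fit and prediction that the section title refers to.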

17


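Weighting along objects (rows) (same algorithm, but using the transposes):

[Figure: blocks X1 and X2 with weight vector v1, loading vector p1 and score vector t2.]

Task: find weight vector v1: maximize $|t_2|^2$

[Figure: blocks X1, X2 and X3 with weight vector v1, loading vector p1, score vector t2 and loading vector q3.]

Task: find weight vector v1: maximize $|q_3|^2$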

18


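[Figure: blocks X1, X2 and X3 with weight vector w1, score vector t1, loading vector p1, score vector t2 and loading vector q3.]

Task: find weight vector w1: maximize $|q_3|^2$, where

$q_3 = X_3^T t_2 = X_3^T X_2 p_1 = X_3^T X_2 X_1^T t_1 = X_3^T X_2 X_1^T X_1 w_1$

Regression equations:

$X_{3,est} = X_2 B_{23}$

$X_{2,est} = B_{12} X_1$

$X_{1,est} = X_1 B_{11}$

If p1 is a good weight vector for X2, a good result may be expected.

Pre-processing may be needed to find variables in X1 and in X2 that are highly correlated with each other.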
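The chain q3 = X3'X2 X1'X1 w1 is linear in w1, so the maximisation can be sketched with one singular value decomposition. The unit-length constraint on w1 is an assumption of this sketch, and the block sizes are arbitrary placeholders chosen so that the products are defined (X2 has as many columns as X1, and X3 has as many rows as X2).

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(size=(15, 6))
X2 = rng.normal(size=(10, 6))    # weighted by p1, so it shares K = 6 columns with X1
X3 = rng.normal(size=(10, 3))    # shares its 10 rows with X2

M = X3.T @ X2 @ X1.T @ X1        # q3 = M w1

# Assumption: w1 has unit length; |M w1|^2 is then maximised by the
# first right singular vector of M.
_, s, Vt = np.linalg.svd(M)
w1 = Vt[0]

# Step-by-step computation following the chain on the slide.
t1 = X1 @ w1
p1 = X1.T @ t1
t2 = X2 @ p1
q3 = X3.T @ t2
```

The step-by-step result agrees with the collapsed product M w1, and its length equals the largest singular value of M.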

19

Three types of reports

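Reports:

- How a data block is doing in a network.
- How a data block can be described by the data blocks that lead to it.
- How a data block can be described by one data block that leads to it.

[Figure: schematic diagrams of a block Xi and the preceding blocks Xi-1, Xi-2, Xi-3 that lead to it, one diagram per report type.]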

20

Production data, 1

[Figure: data blocks X1 and X2 leading to Y.]

X1: Process parameters, 8 variables

X2: NIR data, 1560 variables (reduced to 120)

Percentage of variation explained, by number of score vectors (No):

No   |X2|^2   |Y|^2    |X|^2    |Y|^2
1    78.961   51.483   74.969   51.964
2    91.538   67.559   86.786   69.553
3    96.351   76.291   91.627   80.643
4    97.942   81.383   95.373   85.058
5    98.620   83.900   95.919   89.056
6    98.967   85.705   97.054   90.050
7    99.205   87.917   97.508   91.990
8    99.294   90.472   97.990   93.455
9    99.349   92.183   98.667   94.020
10   99.426   92.947   98.896   94.708
11   99.606   93.084   99.103   95.082
12   99.657   93.376   99.202   95.740

X1 'disappears' in the NIR data X2.

21

Production data, 2

At each step:

[Figure: input blocks X1 and X2 with weight vectors w1, w2 and score vectors t1, t2, both leading to Y.]

Results for X1, process parameters: 5 score vectors explain 11.92% of Y.

No   Step   |Y|^2
1    1      4.957
2    2      9.315
3    5      10.393
4    6      10.929
5    8      11.920

Results for X2, NIR data: 12 score vectors explain 84.141% of Y.

No   Step   |Y|^2
1    1      51.483
2    2      69.121
3    3      73.070
4    4      76.506
5    5      78.669
6    6      80.923
7    7      82.129
8    8      82.552
9    9      83.132
10   10     83.590
11   11     83.881
12   12     84.141

In total, 96.06% = 11.92% + 84.14% of Y is explained.

At each step the score vectors are evaluated. Non-significant ones are excluded.

22

Production data, 3

[Figure: plot of estimated versus observed quality variable using only score vectors for process parameters; both axes run from -0.4 to 0.3. R2 = 0.7512.]

[Figure: diagram of X1 and X2 leading to Y, annotated with the R2-values 75.12%, 87.75% and 96.06%.]

The process parameters contribute marginally by 11.92%. But if only they were used, they would explain 75.12% of the variation of Y.

23

Directed network of data blocks

[Figure: directed network with input blocks, intermediate blocks and output blocks.]

- Input blocks: give weight vectors for initial score vectors.
- Intermediate blocks: are described by previous blocks and give score vectors for succeeding blocks.
- Output blocks: are described by previous blocks.

24

Magnitudes computed between two data blocks

[Figure: a data block Xi leading to a data block Xk.]

- Ti: score vectors
- Qi: loading vectors
- Bi: regression coefficients
- Measures of precision
- Measures of fit
- Etc.

Different views:

a) As a part of a path
b) If the results are viewed marginally
c) If only Xi → Xk

25

Stages in batch processes

[Figure: batch process data arranged by batches, time and stages; stage blocks X1, X2, ..., XK (stages 1, 2, ..., K) followed by the final quality Y.]

Paths:

- X1 → X2 → ... → XK → Y. Given a sample x10, the path model gives estimated samples for later blocks.
- [X1 X2 X3] → X4 → Y. Given values of (x10 x20 x30), estimates for values of x4 and y are given.
- [X1 X2 X3] → [X4 X5] → Y. Given values of (x10 x20 x30), estimates for values of (x4 x5) and y are given.

26

Schematic illustration of the modelling task for sequential processes

[Figure: stages of a sequential process with blocks X1, X2, X3, X4 and Y, labelled: initial conditions, known process parameters, now, next stage, later stages.]

27

Plots of score vectors

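[Figure: data blocks X1, X2, ..., XL with score vectors t1, t2, ..., tL, and score plots for X1, for X1 – X2 (t1 against t2) and for X1 – XL (t1 against tL).]

The plots will show how the changes are relative to the first data block.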

28

Graphic software to specify paths

[Figure: screen with the blocks X1, X2, X3, X4, X5, ..., XL.]

Blocks are dragged onto the screen and the relationships between them are specified.

29

Pre-processing of data

• Centring. If desired, centring of the data is carried out.

• Scaling. In the computations all variables are scaled to unit length (or unit standard deviation if centred). It is checked whether scaling disturbs the variable, e.g. if it is constant except for two values, or if the variable is at the noise level. When the analysis has been completed, values are scaled back so that units are in original values.

• Redundant variable. It is investigated whether a variable fails to contribute to the explanation of any of the variables that the present block leads to. If it is redundant, it is eliminated from the analysis.

• Redundant data block. It is investigated whether a data block can provide a significant description of the blocks that it is connected to later in the network. If it cannot contribute to the description of those blocks, it is removed from the network.
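The scaling step and the scaling back to original units can be sketched as follows. This is an illustration under assumptions: the data are simulated, no centring is applied, and ordinary least squares stands in for the regression method used in the network.

```python
import numpy as np

rng = np.random.default_rng(6)
# Simulated variables on very different scales (placeholder data).
X = rng.normal(size=(30, 4)) * np.array([1.0, 10.0, 0.1, 100.0])
y = X @ np.array([2.0, -0.3, 5.0, 0.01]) + 0.01 * rng.normal(size=30)

# Scale every variable to unit length (no centring in this sketch).
x_len = np.linalg.norm(X, axis=0)
y_len = np.linalg.norm(y)
Xs, ys = X / x_len, y / y_len

# Stand-in regression on the scaled data (ordinary least squares).
b_scaled = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Scale back so that the coefficients refer to the original units.
b = b_scaled * y_len / x_len

# Reference: the same regression fitted directly on the unscaled data.
b_direct = np.linalg.lstsq(X, y, rcond=None)[0]
```

For a full-rank least squares fit the back-scaled coefficients coincide with those fitted on the raw data; with latent-structure methods the scaling does change which score vectors are selected, which is why the checks in the list above matter.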

30

Post-processing of results

Score vectors computed in the passages through the network are evaluated in the analysis at one passage. Apart from the input blocks, the score vectors found between passages are not independent. The score vectors found in a relationship Xi → Xj are evaluated to see whether all are significant or some should be removed for this relationship.

Cross-validation as in standard regression methods.

Confidence intervals for parameters by resampling techniques.
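Confidence intervals by resampling can be sketched with a percentile bootstrap. The slides do not specify the resampling scheme, so everything here is an assumption: the function name `bootstrap_ci` is hypothetical, rows are resampled with replacement, and ordinary least squares stands in for the regression fitted between two blocks.

```python
import numpy as np

def bootstrap_ci(X, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for regression coefficients."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        # Stand-in estimator: ordinary least squares on the resampled rows.
        coefs[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    lo, hi = np.percentile(
        coefs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return lo, hi

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 3))                  # simulated placeholder data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
lo, hi = bootstrap_ci(X, y)
```

Replacing the least squares call with the latent-structure regression of interest gives intervals for its coefficients, at the cost of refitting the model once per resample.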

31

International workshop on

Multi-block and Path Methods

24. – 30. May 2009, Mijas, Malaga, Spain
