CALIBRATION Prof.Dr.Cevdet Demir cevdet@uludag.edu.tr


LINKING TWO SETS OF DATA TOGETHER

• Peak height to concentration

• Spectra to concentrations

• Taste to chemical constituents

• Biological activity to structure

• Biological classification to chromatographic peak areas

NORMALLY WE ARE INTERESTED IN SOME FUNDAMENTAL PARAMETER e.g. concentration or biological classification

WE TAKE SOME MEASUREMENTS e.g. spectra or chromatograms

WE WANT TO USE THESE MEASUREMENTS TO GIVE US A PREDICTION OF THE FUNDAMENTAL PARAMETER

UNIVARIATE CALIBRATION

One measurement e.g. a peak height

MULTIVARIATE CALIBRATION

Several measurements e.g. spectra

NOTATION

“x” block is measured data e.g. spectra, chromatograms, GCMS of biological extract, structural parameters

“c” block is what we are trying to predict e.g. concentration, species, acceptability of a product, taste

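Experimental design: X is the independent variable (e.g. concentration) and Y is the response (e.g. spectroscopic).

Calibration: C is the predicted parameter (e.g. concentration) and X is the measurement (e.g. spectroscopic).

[Diagram: the “c” and “x” blocks drawn as the pairings c and x, c and X, C and X.]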

MULTIVARIATE CALIBRATION IN ANALYTICAL CHEMISTRY

•Single component.

Example, concentration of chlorophyll a by uv/vis spectra.

•Mixture of components, all compounds known.

Example, mixture of pharmaceuticals, all pure compounds known.

•Mixture of components, only some compounds known.

Example, coal tar pitch volatiles in industrial waste studied by spectroscopy, only some known.

•Statistical parameters.

Example, protein in wheat by NIR spectroscopy.

UNIVARIATE CALIBRATION

“x” and “c” blocks consist of single measurements.

Traditional analytical chemistry

CLASSICAL CALIBRATION

x ≈ c . s

Unknown : s

s ≈ c+ . x

where c+ is the pseudo-inverse
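A minimal sketch of classical univariate calibration, assuming NumPy and made-up peak heights (the numbers are hypothetical, not from the lecture). It estimates s from the calibration set with the pseudo-inverse and then predicts the concentration of an unknown.

```python
import numpy as np

# Hypothetical calibration set: known concentrations and measured peak heights.
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # "c" block
x = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # "x" block

# Classical calibration: x ≈ c . s, so s ≈ c+ . x
c_col = c.reshape(-1, 1)                     # column vector, 5 x 1
s = (np.linalg.pinv(c_col) @ x).item()

# For a single column the pseudo-inverse reduces to (c'x) / (c'c).
s_direct = (c @ x) / (c @ c)

# Predict the concentration of an unknown from its measured response.
x_unknown = 5.0
c_predicted = x_unknown / s
print(f"s = {s:.4f}, predicted c = {c_predicted:.4f}")
```

Note that the model is fitted in the direction concentration to response, so the prediction step has to invert it.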

TREATMENT OF ERRORS IN CLASSICAL CALIBRATION

PROBLEMS

1. In a modern lab, dilution and sample preparation errors (in “c”) are probably bigger than spectroscopic errors (in “x”), because spectra are more reproducible. This differs from what classical statistics assumes.

2. We want to predict concentration from spectra etc., not vice versa.

Most classical textbooks in analytical chemistry and most spreadsheets incorrectly recommend classical calibration.

INVERSE CALIBRATION

c ≈ x . b

Unknown : b

b ≈ x+ . c

COMPARING FORWARD AND INVERSE CALIBRATION

[Figure: classical and inverse calibration lines compared on the same data; horizontal axis 0 to 10, vertical axis 0 to 40.]
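The difference between the two lines can be sketched numerically. This is an illustration on simulated data (the slope, noise level and seed are assumptions), fitting both models without an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: response proportional to concentration, plus noise.
c = np.linspace(1.0, 10.0, 10)
x = 3.5 * c + rng.normal(scale=2.0, size=c.size)

s = (c @ x) / (c @ c)    # classical: x ≈ c . s
b = (x @ c) / (x @ x)    # inverse:   c ≈ x . b

# Predict the concentration for one new response with each model.
x_new = 20.0
c_classical = x_new / s
c_inverse = x_new * b
print(f"classical: {c_classical:.3f}, inverse: {c_inverse:.3f}")

# b * s = (c'x)^2 / ((c'c)(x'x)) is at most 1 (Cauchy-Schwarz),
# so the two lines coincide only for noise-free proportional data.
print(f"b * s = {b * s:.4f}")
```

For positive responses the inverse prediction is therefore never larger than the classical one in this no-intercept setting.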

INCLUDING THE INTERCEPT : first column of “x” is 1s

c ≈ b0 + b1 x

c ≈ X . b

b ≈ X+ . c
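A short sketch of inverse calibration with an intercept, assuming NumPy and hypothetical data. The column of 1s is added explicitly and the result is cross-checked against `np.polyfit`.

```python
import numpy as np

# Hypothetical calibration set.
x = np.array([2.9, 5.1, 7.0, 9.1, 10.9])   # measured responses
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # known concentrations

# First column of X is 1s, second column is the measurement.
X = np.column_stack([np.ones_like(x), x])

# b = X+ . c gives [b0, b1]
b = np.linalg.pinv(X) @ c
c_fit = X @ b

# Cross-check: np.polyfit returns the slope first, then the intercept.
b1_ref, b0_ref = np.polyfit(x, c, 1)
print(f"b0 = {b[0]:.4f}, b1 = {b[1]:.4f}")
```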

HOW WELL IS THE MODEL PREDICTED?

Huge number of approaches

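• Root mean square error: divide the sum of squared residuals by the degrees of freedom, i.e. the number of samples minus 1 or 2 according to the number of parameters in the model.

Often expressed as a percentage of either the mean measurement or the standard deviation of the measurements.

$E = \sqrt{\sum_{i=1}^{I} (x_i - \hat{x}_i)^2 / d}$

where I is the number of samples and d is the number of degrees of freedom.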

• Correlation coefficient of predicted versus true – has problems if the number of samples is small.

• ANOVA and replicates analysis using lack-of-fit error, as discussed in the experimental design lectures.

• Leaving samples out and predicting them : cross-validation and testing will be discussed later.
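The root mean square error from the list above can be sketched as follows. The function name `rmse`, its `n_params` argument and the data are illustrative assumptions.

```python
import numpy as np

def rmse(observed, predicted, n_params=1):
    """Root mean square error, dividing by the degrees of freedom
    (number of samples minus number of model parameters)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    d = observed.size - n_params
    if d <= 0:
        raise ValueError("need more samples than model parameters")
    return np.sqrt(np.sum((observed - predicted) ** 2) / d)

# Hypothetical observed and predicted values for a two-parameter model.
observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

e = rmse(observed, predicted, n_params=2)
e_percent = 100.0 * e / observed.mean()   # as a percentage of the mean
print(f"E = {e:.4f} ({e_percent:.2f}% of the mean)")
```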

PROBLEMS

•Outliers can be a major difficulty. Graphical ways of looking for outliers are a big area in their own right.

•Outliers have undue influence on least squares models.

MULTIWAVELENGTH

 

Example : four compounds, four wavelengths.

MULTIPLE LINEAR REGRESSION (MLR)

C = X . B

Known

•X : a series of spectra

•C : concentrations

WAYS OF PERFORMING THE CALIBRATION

1. Producing a series of mixture spectra of known concentrations by weighing different amounts and adding together

2. Taking a series of spectra and calibrating against an independent method e.g. HPLC.

[Figure: UV/vis spectra, wavelength axis 220 to 400 nm.]

EXAMPLE : UV/VIS OF PAHs AT 4 WAVELENGTHS, NO WAVELENGTH IS UNIQUE

B = X+ . C

estimated [pyrene] = -3.870 A330 + 8.609 A335 – 5.098 A340 + 1.848 A345
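A sketch of how such coefficients arise from B = X+ . C, assuming NumPy. The four overlapping "pure spectra", the concentrations and the noise level are all simulated, so the numbers will not match the pyrene equation above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_compounds, n_wavelengths = 25, 4, 4

# Hypothetical overlapping pure spectra: no wavelength is unique to one compound.
idx = np.arange(n_compounds)
S = 0.5 ** np.abs(np.subtract.outer(idx, idx))       # 4 x 4

# Simulated mixtures: known concentrations and slightly noisy spectra.
C = rng.uniform(0.1, 1.0, size=(n_samples, n_compounds))
X = C @ S + rng.normal(scale=0.001, size=(n_samples, n_wavelengths))

# Inverse MLR: C ≈ X . B, so B = X+ . C
B = np.linalg.pinv(X) @ C                            # 4 x 4
C_hat = X @ B

# Column 0 of B holds one coefficient per wavelength for the first compound,
# the analogue of the four coefficients in the pyrene equation.
b_first = B[:, 0]
print(np.round(b_first, 3))
```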

Classical methods can also be used.

This can be done with knowledge of the pure spectra S.

This differs from calibration, where a series of mixtures is recorded.

Ĉ ≈ X . S+

MULTIPLE LINEAR REGRESSION

•Why use only 4 wavelengths?

•Why not 10 or 100 wavelengths?

More information – not arbitrary choice of wavelengths.

•Number of wavelengths can be greater than number of compounds.

[Diagram: block representation of the matrices X, C and B.]

Example

• 25 spectra

• 10 compounds

• 100 wavelengths

B = X+ . C

In this case

•B is a matrix of coefficients, 100 × 10

•X is a spectral matrix, 25 × 100

•C is a concentration matrix, 25 × 10

There are some technical problems with using inverse calibration in this case, and it often does not work.
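One plausible illustration of the difficulty (an assumption about what is meant, not a statement from the lecture): with 25 spectra and 100 wavelengths there are more coefficients than samples, so B = X+ . C reproduces any calibration concentrations exactly, even random ones. The data below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random placeholders with the dimensions from the example above.
X = rng.normal(size=(25, 100))   # 25 spectra, 100 wavelengths
C = rng.normal(size=(25, 10))    # 25 samples, 10 compounds

B = np.linalg.pinv(X) @ C
print(B.shape)                   # coefficient matrix, 100 x 10

# X has at most 25 independent rows, so the fit to C is exact
# regardless of whether C has anything to do with X.
rank = np.linalg.matrix_rank(X)
fit_error = np.abs(X @ B - C).max()
print(rank, fit_error)
```

An exact fit to the calibration set says nothing about how well unknowns will be predicted.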

Better approach

1. First predict the spectra S.

•Either they are known from the calibration of the pure standards

•Or they can be predicted from the mixture spectra

S ≈ C+ . X

2. Then use these predictions in a model (e.g. of unknowns)

C ≈ X . S+
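The two steps can be sketched as below, assuming NumPy and simulated Gaussian-shaped bands as stand-ins for real pure spectra. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_compounds, n_wavelengths = 25, 10, 100

# Hypothetical pure spectra: Gaussian bands centred at different wavelengths.
wl = np.arange(n_wavelengths)
centres = np.linspace(10, 90, n_compounds)
S_true = np.exp(-0.5 * ((wl[None, :] - centres[:, None]) / 4.0) ** 2)  # 10 x 100

# Simulated calibration mixtures.
C = rng.uniform(0.1, 1.0, size=(n_samples, n_compounds))
X = C @ S_true + rng.normal(scale=0.001, size=(n_samples, n_wavelengths))

# Step 1: estimate the spectra from the mixtures, S ≈ C+ . X
S_hat = np.linalg.pinv(C) @ X                     # 10 x 100

# Step 2: predict the concentrations of an unknown, C ≈ X . S+
c_unknown_true = rng.uniform(0.1, 1.0, size=(1, n_compounds))
x_unknown = c_unknown_true @ S_true + rng.normal(scale=0.001, size=(1, n_wavelengths))
c_unknown_hat = x_unknown @ np.linalg.pinv(S_hat)  # 1 x 10

print(np.round(c_unknown_hat - c_unknown_true, 4))
```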

MLR effectively models a spectrum as a sum of spectra of the components, e.g. for a 3 component model

Observed spectrum =

conc A × spectrum A +

conc B × spectrum B +

conc C × spectrum C

ENHANCEMENTS

• Selecting only certain variables, not all the wavelengths.

• Weighting of variables.

ERROR ANALYSIS

This now becomes more sophisticated.

In addition to errors in the “c” block (concentration errors), now also errors in the “x” block (reconstruction of spectra).

Discuss later.

LIMITATIONS AND PROBLEMS WITH MLR

• Number of experiments and number of wavelengths must never be less than number of compounds

• All significant compounds must be known. If still unknowns, then these are mixed up with the knowns. Problems if no pure standards and no reliable reference method. THIS IS THE BIGGEST LIMITATION.

•Sometimes extra wavelengths can be bad ones e.g. noise or background.

• The classical approach assumes that concentrations are perfectly known, i.e. errors in only one variable.

However, if information on all the significant compounds is known, then MLR is a simple and effective method.

PRINCIPAL COMPONENTS REGRESSION (PCR)

Do not need to know all components in advance, simply "how many components", and the compounds of interest.

Overcomes a major limitation of MLR

[Diagram: PCA decomposes X (samples × detector, e.g. wavelength) into scores T and loadings P; regression then relates T to the concentration vector c through r.]

c ≈ T . r

The first step is to perform PCA.

Obtain a scores matrix, retaining A components

The value of A may be a guess of the number of compounds in the mixture.

Then r = T+. c

Can extend to more than one concentration:

C ≈ T . R

R = T+ . C

Example

25 spectra taken at 100 wavelengths

We know about and want to predict 4 compounds

We think there are around 10 compounds in the mixture, 6 are unknown.

T is a matrix of dimensions 25 × 10

C is a matrix of dimensions 25 × 4

R is a matrix of dimensions 10 × 4
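A minimal PCR sketch with these dimensions, assuming NumPy, simulated Gaussian bands as spectra, and PCA computed by SVD without mean-centring (the lecture does not say whether the data are centred).

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_wavelengths, n_total, n_known, A = 25, 100, 10, 4, 10

# Simulated mixture spectra from 10 compounds (hypothetical Gaussian bands).
wl = np.arange(n_wavelengths)
centres = np.linspace(10, 90, n_total)
S = np.exp(-0.5 * ((wl[None, :] - centres[:, None]) / 4.0) ** 2)
C_all = rng.uniform(0.1, 1.0, size=(n_samples, n_total))
X = C_all @ S + rng.normal(scale=0.001, size=(n_samples, n_wavelengths))

# Only 4 of the 10 compounds are known to the analyst.
C = C_all[:, :n_known]                         # 25 x 4

# Step 1: PCA by SVD, retaining A components.
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :A] * sv[:A]                          # scores, 25 x 10
P = Vt[:A, :]                                  # loadings, 10 x 100

# Step 2: regression, R = T+ . C
R = np.linalg.pinv(T) @ C                      # 10 x 4
C_hat = T @ R

rms_error = np.sqrt(np.mean((C_hat - C) ** 2))
print(T.shape, R.shape, rms_error)
```

Scores for a new spectrum would be obtained by projecting it onto the same loadings, `x_new @ P.T`, before multiplying by R.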

Example of the calculation of the concentration of pyrene in a set of 25 uv/vis spectra containing 10 different PAHs.

How many PCA components to use? The fit to the calibration set gets better as more components are included.

ERRORS – “x” block

Simply as in PCA, look at eigenvalues as more principal components are calculated

[Figure: “x” block error (log scale) against number of principal components, 1 to 15.]

ERRORS – “c” block

Look at errors in calculation of concentrations – often different behaviour

[Figure: “c” block error (log scale) against number of principal components, 1 to 15.]

Predictions for pyrene concentration using 1, 5 and 10 principal components.

[Figure: three panels of predicted concentration against observed concentration, both axes 0 to 0.8, one panel for each number of principal components.]

Why not use a large number of PCA components?

Then one can get perfect prediction?

FALLACY : the idea is to predict unknowns, after the knowns have been modelled. Later PCs often model noise.
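The fallacy can be illustrated on simulated data: the calibration error can only go down as components are added, while the error on an independent test set need not. This sketch assumes NumPy, Gaussian bands as spectra and noise in both blocks; `make_set` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical pure spectra for 10 compounds at 100 wavelengths.
wl = np.arange(100)
centres = np.linspace(10, 90, 10)
S = np.exp(-0.5 * ((wl[None, :] - centres[:, None]) / 4.0) ** 2)

def make_set(n):
    """Simulate n mixture spectra and noisy reference values for compound 1."""
    C = rng.uniform(0.1, 1.0, size=(n, 10))
    X = C @ S + rng.normal(scale=0.01, size=(n, 100))
    c = C[:, 0] + rng.normal(scale=0.02, size=n)
    return X, c

X_cal, c_cal = make_set(25)     # calibration ("knowns")
X_test, c_test = make_set(25)   # independent test set ("unknowns")

# Loadings from PCA of the calibration spectra only.
_, _, Vt = np.linalg.svd(X_cal, full_matrices=False)

cal_err, test_err = [], []
for A in range(1, 26):
    P = Vt[:A, :]
    T = X_cal @ P.T                       # calibration scores
    r = np.linalg.pinv(T) @ c_cal         # r = T+ . c
    cal_err.append(np.sqrt(np.mean((T @ r - c_cal) ** 2)))
    T_test = X_test @ P.T                 # test scores from the same loadings
    test_err.append(np.sqrt(np.mean((T_test @ r - c_test) ** 2)))

for A in (1, 5, 10, 25):
    print(A, round(cal_err[A - 1], 4), round(test_err[A - 1], 4))
```

With all 25 components the calibration set is reproduced exactly, noise included, which is why calibration error alone cannot be used to choose the number of components.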

Choose the number of PCs equal to the number of compounds in the mixture? Methods for determining the number of PCs when this is unknown are described later.

Advantage over MLR - only partial knowledge necessary.

 

Disadvantage : assumes that all errors are in the "x" block.

Practical situation. 

•Modern instruments very reproducible.

•Volumetrics, measuring cylinders, syringes are inaccurate.

PARTIAL LEAST SQUARES (PLS)

This technique assumes that errors in both “x” and “c” block are equally significant.

[Diagram: block representation of X = T.P + E and c = T.q + f.]

What does this mean?

X = T.P + E

c = T.q + f

THERE IS A COMMON SCORES MATRIX FOR BOTH “x” AND “c” BLOCKS.

In PCR we calculate the scores just for the “x” block and then use a separate step for regression.

A big difference between PCR and PLS is that in PCR there is only one scores matrix, whereas for PLS (using 1 column) there is a different scores matrix for each compound.

The vector q is analogous to loadings.

PLS components have some analogies to PC components.

In PCA, each component consists of a

•scores vector

•loadings vector

•eigenvalue.

In PLS, each component consists of a

•scores vector

• “x” loadings vector (p)

• “c” loadings vector (q) – a single number

• magnitude.

FOR THE TECHNICALLY MINDED.

•Unlike eigenvalues, the magnitudes of successive PLS components do not necessarily decrease in size, although they do model the overall datasets.

•Unlike loadings for PCA, loadings in PLS are not orthogonal.

•In most cases PLS loadings are not normalised.

•There are many algorithms for PLS and it can be confusing.

ERROR ANALYSIS : similar principles to PCR but different curves for different compounds.

Sometimes different numbers of PLS components are used to model different compounds in one mixture.

[Figure: “x” errors and “c” errors against number of PLS components, 1 to 20; vertical axis 0 to 60.]

For a dataset consisting of 25 spectra observed at 27 wavelengths, for which 8 PLS components are calculated, there will be

•a T matrix of dimensions 25 × 8,

•a P matrix of dimensions 8 × 27,

•an E matrix of dimensions 25 × 27,

•a q vector of dimensions 8 × 1 and

•an f vector of dimensions 25 × 1.
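These dimensions can be checked with a small PLS1 sketch. There are many PLS algorithms; the one below is a common NIPALS-style variant, not necessarily the one used in the lecture. The function name `pls1` is hypothetical, the data are random placeholders, and mean-centring is omitted for brevity.

```python
import numpy as np

def pls1(X, c, n_components):
    """NIPALS-style PLS1: returns T, P, q, E, f with X = T.P + E and c = T.q + f."""
    E = np.array(X, dtype=float)             # working copy of the "x" block
    f = np.array(c, dtype=float)             # working copy of the "c" block, n x 1
    n, m = E.shape
    T = np.zeros((n, n_components))
    P = np.zeros((n_components, m))
    q = np.zeros((n_components, 1))
    for a in range(n_components):
        h = E.T @ f                          # weight vector, m x 1
        h /= np.linalg.norm(h)
        t = E @ h                            # scores, n x 1
        tt = (t.T @ t).item()
        p = E.T @ t / tt                     # "x" loadings, m x 1
        q_a = (f.T @ t).item() / tt          # "c" loading, a single number
        E -= t @ p.T                         # remove this component from X
        f -= q_a * t                         # remove this component from c
        T[:, [a]] = t
        P[[a], :] = p.T
        q[a, 0] = q_a
    return T, P, q, E, f

rng = np.random.default_rng(6)
X = rng.normal(size=(25, 27))                # placeholder for 25 spectra, 27 wavelengths
c = rng.normal(size=(25, 1))                 # placeholder concentrations

T, P, q, E, f = pls1(X, c, n_components=8)
print(T.shape, P.shape, E.shape, q.shape, f.shape)
```

The same scores matrix T appears in both reconstructions, `T @ P + E` for the "x" block and `T @ q + f` for the "c" block.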

PLS2 – when more than one “c” variable

[Diagram: block representation of X = T.P + E and C = T.Q + F.]

X = T.P + E

C = T.Q + F

Differences from PLS1

•C is now a matrix

•Q is also a matrix

•F is also a matrix

•Single scores for all compounds in the mixture.

•Theoretically PLS2 should perform better than PLS1 but in practice it often performs worse.

•Computationally faster, important 10 years ago.

•Useful for non-linear problems such as QSAR where there are interactions, but not so useful in analytical chemistry, which is very linear.

SUMMARY OF MAIN METHODS

• Univariate calibration

    •Classical

    •Inverse

•Multiple linear regression

•Principal components regression

•Partial least squares

    •PLS1

    •PLS2
