
Optimised Scoring in Proficiency Tests


Page 1: Optimised Scoring in Proficiency Tests

Optimised Scoring in Proficiency Tests

Michael Thompson
School of Biological and Chemical Sciences
Birkbeck College (University of London)
Malet Street, London WC1E 7HX
[email protected]

Page 2: Optimised Scoring in Proficiency Tests

Criteria for an ideal scoring method

• Adds value to raw results.

• Easily understandable, no arbitrary scaling transformation.

• Is transferable between different concentrations, analytes, matrices, and measurement principles.

Page 3: Optimised Scoring in Proficiency Tests

The z-score

$$z = \frac{x - x_A}{\sigma_p}$$

where $x$ is the result, $x_A$ is the "assigned value" (the scheme provider's best estimate of the true value), and $\sigma_p$ is the "target value" or "standard deviation for proficiency".
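For concreteness, the score is trivial to compute; here is a minimal Python sketch (the function and argument names are mine, not from the slides):

```python
def z_score(x, x_a, sigma_p):
    """z = (x - x_A) / sigma_p.

    x       -- participant's result
    x_a     -- assigned value (provider's best estimate of the true value)
    sigma_p -- 'target value', the standard deviation for proficiency
    """
    return (x - x_a) / sigma_p
```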

Page 4: Optimised Scoring in Proficiency Tests

Determining an assigned value

• Reference laboratory result

• Certified reference material(s)

• Formulation

• Consensus of participants’ results

Page 5: Optimised Scoring in Proficiency Tests

"Health warnings" about the consensus

• The consensus is not necessarily identical with the true value. PT providers and users have to be alert to this possibility.

• The consensus must have a sufficiently small uncertainty. This usually requires >20 participants.

Page 6: Optimised Scoring in Proficiency Tests

What exactly is a 'consensus'?

• Mean? - easy to calculate, but affected by outliers and asymmetry.

• Robust mean? - fairly easy to calculate, handles outliers but affected by strong asymmetry.

• Median? - easy to calculate, more robust for asymmetric distributions, but larger standard error than the robust mean.

• Mode? - intuitively good, handles strong skews, difficult to define, difficult to calculate.

Page 7: Optimised Scoring in Proficiency Tests

Finding a ‘consensus’—the tools of the trade

• Robust mean and standard deviation

• Kernel density mode and its standard error

• Mixture model representation

Page 8: Optimised Scoring in Proficiency Tests

Robust mean and standard deviation

• Robust statistics is applicable to datasets that look like normally distributed samples contaminated with outliers and stragglers (i.e., unimodal and roughly symmetric).

• The method downweights the otherwise large influence of outliers and stragglers on the estimates.

• It models the central 'reliable' part of the dataset.

• The estimates $\hat{\mu}_{rob},\ \hat{\sigma}_{rob}$ are found by a procedure, not a formula.

Page 9: Optimised Scoring in Proficiency Tests

Huber's H15 estimators

Given results $\mathbf{x}^{T} = (x_1\; x_2\; \cdots\; x_n)$:

Set $k = 1.5$, $p = 0$, $\hat{\mu}_0 = \mathrm{median}(x_i)$, $\hat{\sigma}_0 = 1.5 \times \mathrm{MAD}$.

Then iterate:

$$\tilde{x}_i = \begin{cases} \hat{\mu}_p - k\hat{\sigma}_p, & \text{if } x_i < \hat{\mu}_p - k\hat{\sigma}_p \\ \hat{\mu}_p + k\hat{\sigma}_p, & \text{if } x_i > \hat{\mu}_p + k\hat{\sigma}_p \\ x_i, & \text{if } \hat{\mu}_p - k\hat{\sigma}_p \le x_i \le \hat{\mu}_p + k\hat{\sigma}_p \end{cases}$$

$$\hat{\mu}_{p+1} = \mathrm{mean}(\tilde{x}), \qquad \hat{\sigma}^2_{p+1} = f(k)\,\mathrm{var}(\tilde{x})$$

If not converged, set $p \leftarrow p + 1$ and repeat.
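Below is a sketch of this iteration in Python. It assumes the correction factor 1.134 used in ISO 13528 Algorithm A in place of the slide's unspecified $f(k)$ (on the standard deviation, so $f(k) = 1.134^2$ on the variance); the function name and stopping rule are my own:

```python
import numpy as np

def huber_h15(x, k=1.5, max_iter=50):
    """Huber's H15 robust mean and standard deviation by iterative
    winsorisation (cf. ISO 13528 Algorithm A)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                        # mu_0 = median(x)
    sigma = 1.5 * np.median(np.abs(x - mu))  # sigma_0 = 1.5 * MAD
    for _ in range(max_iter):
        # Winsorise: pull outliers/stragglers in to the limits mu +/- k*sigma.
        xt = np.clip(x, mu - k * sigma, mu + k * sigma)
        mu_new = xt.mean()
        # 1.134 on the SD corresponds to the slide's f(k) on the variance:
        # it corrects the shrinkage caused by winsorising at k = 1.5.
        sigma_new = 1.134 * xt.std(ddof=1)
        if np.isclose(mu_new, mu) and np.isclose(sigma_new, sigma):
            return mu_new, sigma_new
        mu, sigma = mu_new, sigma_new
    return mu, sigma
```

The consensus uncertainty quoted on a later slide, $u(x_A) = \hat{\sigma}_{rob}/\sqrt{n}$, follows directly from the returned estimates.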

Page 10: Optimised Scoring in Proficiency Tests

When can I safely use robust estimates?

[Figure: example distributions along the measurement axis: skewed, bimodal, and heavy-tailed.]

Page 11: Optimised Scoring in Proficiency Tests

The robust mean as consensus

• The robust mean provides a useful consensus in the great majority of instances.

• The uncertainty of this consensus can be safely taken as

$$u(x_A) = \hat{\sigma}_{rob}\big/\sqrt{n}$$

Page 12: Optimised Scoring in Proficiency Tests

Finding a ‘consensus’—the tools of the trade

• Robust mean and standard deviation

• Kernel density mode and its standard error

• Mixture model representation

Page 13: Optimised Scoring in Proficiency Tests

The mode as a consensus

Can I use the mode? How many modes? Where are they?

Page 14: Optimised Scoring in Proficiency Tests

The normal kernel density for identifying a mode

$$y(x) = \frac{1}{nh}\sum_{i=1}^{n}\phi\!\left(\frac{x - x_i}{h}\right)$$

where $\phi$ is the standard normal density, $\phi(a) = \exp(-a^2/2)\big/\sqrt{2\pi}$.

Reference: AMC Technical Brief No. 4 (www.rsc.org/amc).
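A minimal Python sketch of this estimator (the names, grid padding, and grid resolution are my choices, not the Technical Brief's):

```python
import numpy as np

def kernel_density(x_grid, data, h):
    """y(x) = (1/nh) * sum_i phi((x - x_i) / h), phi = standard normal density."""
    u = (np.asarray(x_grid, dtype=float)[:, None]
         - np.asarray(data, dtype=float)) / h
    return (np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(data) * h)

def kernel_mode(data, h, n_grid=2001):
    """Take the principal mode as the grid point where the density peaks."""
    data = np.asarray(data, dtype=float)
    grid = np.linspace(data.min() - 3 * h, data.max() + 3 * h, n_grid)
    return grid[np.argmax(kernel_density(grid, data, h))]
```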

Page 15: Optimised Scoring in Proficiency Tests

A normal kernel

Page 16: Optimised Scoring in Proficiency Tests

A kernel density

Reference: AMC Technical Brief No. 4. (www.rsc.org/amc)

Page 17: Optimised Scoring in Proficiency Tests

Another kernel density: same data, different h

Reference: AMC Technical Brief No. 4. (www.rsc.org/amc)

Page 18: Optimised Scoring in Proficiency Tests

Uncertainty of the mode

• The uncertainty of the consensus can be estimated as the standard error of the mode by applying the bootstrap to the procedure.

• The bootstrap is a general procedure, based on resampling, for estimating standard errors of complex statistics.

• Reference: Bump-hunting for the proficiency tester – searching for multimodality. P J Lowthian and M Thompson, Analyst, 2002, 127, 1359-1364.
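A sketch of that bootstrap in Python, reusing the kernel_mode function sketched earlier (the number of resamples is an arbitrary choice):

```python
import numpy as np

def bootstrap_se_of_mode(data, h, n_boot=1000, seed=0):
    """Bootstrap standard error of the kernel-density mode: resample the
    data with replacement, re-find the mode, take the SD of the modes."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    modes = [kernel_mode(rng.choice(data, size=len(data), replace=True), h)
             for _ in range(n_boot)]
    return float(np.std(modes, ddof=1))
```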

Page 19: Optimised Scoring in Proficiency Tests

Finding a ‘consensus’—the tools of the trade

• Robust mean and standard deviation

• Kernel density mode and its standard error

• Mixture model representation

Page 20: Optimised Scoring in Proficiency Tests

Mixture models and consensus

• For each component you can calculate:
  - a mean
  - a variance
  - a proportion

Page 21: Optimised Scoring in Proficiency Tests

2-component normal mixture model and kernel density

Page 22: Optimised Scoring in Proficiency Tests

The normal mixture model

$$f(y) = \sum_{j=1}^{m} p_j f_j(y), \qquad \sum_{j=1}^{m} p_j = 1$$

$$f_j(y) = \exp\!\left(-\frac{(y-\mu_j)^2}{2\sigma_j^2}\right) \Big/ \left(\sigma_j \sqrt{2\pi}\right)$$

References: AMC Technical Brief No. 23, and AMC Software. M Thompson, Accred Qual Assur, 2006, 10, 501-505.
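As a small sketch, the mixture density itself can be evaluated in a few lines of Python (the function name and vectorised layout are mine):

```python
import numpy as np

def mixture_density(y, mu, sigma, p):
    """f(y) = sum_j p_j * f_j(y) with normal components f_j."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mu, sigma, p = (np.asarray(a, dtype=float) for a in (mu, sigma, p))
    f = (np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2))
         / (sigma * np.sqrt(2 * np.pi)))    # shape (n, m)
    return f @ p                            # weighted sum over components
```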

Page 23: Optimised Scoring in Proficiency Tests

Mixture models found by the maximum likelihood method (the EM algorithm)

• The M-step:

$$\hat{p}_j = \frac{1}{n}\sum_{i=1}^{n} P(j \mid y_i)$$

$$\hat{\mu}_j = \sum_{i=1}^{n} P(j \mid y_i)\, y_i \Big/ \sum_{i=1}^{n} P(j \mid y_i)$$

$$\hat{\sigma}_j^2 = \sum_{i=1}^{n} P(j \mid y_i)\,(y_i - \hat{\mu}_j)^2 \Big/ \sum_{i=1}^{n} P(j \mid y_i)$$

• The E-step:

$$P(j \mid y_i) = p_j f_j(y_i) \Big/ \sum_{l=1}^{m} p_l f_l(y_i)$$
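A compact Python sketch of the two steps for an m-component normal mixture; the initialisation and fixed iteration count are my own simplifications (a production fit would monitor the log-likelihood for convergence):

```python
import numpy as np

def em_normal_mixture(y, m=2, n_iter=200, seed=0):
    """Fit an m-component normal mixture by EM; returns (mu, sigma, p)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu = rng.choice(y, size=m, replace=False)  # crude random start
    sigma = np.full(m, y.std(ddof=1))
    p = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # E-step: responsibilities P(j | y_i), proportional to p_j * f_j(y_i).
        f = (np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2))
             / (sigma * np.sqrt(2 * np.pi)))   # shape (n, m)
        w = p * f
        w /= w.sum(axis=1, keepdims=True)
        # M-step: reweighted proportions, means, and variances.
        nj = w.sum(axis=0)
        p = nj / n
        mu = (w * y[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((w * (y[:, None] - mu) ** 2).sum(axis=0) / nj)
    return mu, sigma, p
```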

Page 24: Optimised Scoring in Proficiency Tests

Example datasets

Page 25: Optimised Scoring in Proficiency Tests

Example dataset 1

Page 26: Optimised Scoring in Proficiency Tests
Page 27: Optimised Scoring in Proficiency Tests

Number of modes vs smoothing factor h

Page 28: Optimised Scoring in Proficiency Tests
Page 29: Optimised Scoring in Proficiency Tests
Page 30: Optimised Scoring in Proficiency Tests

Statistics: dataset 1

Method                 Consensus   SD      se
Robust                 2.912       0.056   0.0056
Kernel density mode    2.912       -       0.0056
Mixture model          2.913       0.075   0.0075

Page 31: Optimised Scoring in Proficiency Tests

Skewed/multimodal distributions

• Skews and extra modes can arise when the participants' results come from two or more inconsistent methods.

• Skews can also arise as an artefact at low concentrations of analyte as a result of common data-recording practices.

• Rarely, skews can arise when the distribution is truly lognormal (e.g., in GMO determinations).

Page 32: Optimised Scoring in Proficiency Tests

Example dataset 2

Page 33: Optimised Scoring in Proficiency Tests
Page 34: Optimised Scoring in Proficiency Tests
Page 35: Optimised Scoring in Proficiency Tests
Page 36: Optimised Scoring in Proficiency Tests
Page 37: Optimised Scoring in Proficiency Tests
Page 38: Optimised Scoring in Proficiency Tests

What went wrong?

• Analyte defined as % fatty acid in oil.

• Most labs used an internal standard method.

• Hypothesis: other labs (incorrectly) reported results based on methyl ester peak area ratio.

• Incorrect results expected to be high by a factor of 1.05.

• Ratio of modes found = 1.04.

Page 39: Optimised Scoring in Proficiency Tests

Example 3—Ba in silicate rock

Page 40: Optimised Scoring in Proficiency Tests
Page 41: Optimised Scoring in Proficiency Tests

Choice of value for σp

• Robust standard deviation of participants' results in the round?

• From perception of how well similar methods perform?

• Legislation?

• Other?

Page 42: Optimised Scoring in Proficiency Tests

Self-referential scoring

Here the score is calculated from the participants' own results:

$$z = \frac{x - \hat{\mu}_{rob}}{\hat{\sigma}_{rob}}$$

• Nearly always, more than 90% of laboratories receive a z-score between −2 and +2.

• This suggests, to both provider and participants, that accuracy is generally OK, whether or not that is the case.

• No reference is made to end-user requirements.

• z-Scores for a participant cannot be meaningfully compared round-to-round.

Page 43: Optimised Scoring in Proficiency Tests

What more do we need?

• We need a method that evaluates the results in relation to their intended use, rather than merely describing them.

• We need a method in which a score of (say) −3.1 has a meaning independent of the analyte, matrix, or analytical method.

• We need a method based on: fitness for purpose.

Page 44: Optimised Scoring in Proficiency Tests

Fitness for purpose

• Fitness for purpose occurs when the uncertainty of the result $u_f$ gives best value for money.

• If the uncertainty is smaller than $u_f$, the analysis may be too expensive.

• If the uncertainty is larger than $u_f$, the cost and the probability of a mistaken decision will rise.

Page 45: Optimised Scoring in Proficiency Tests

Fitness for purpose

• The value of $u_f$ can sometimes be estimated objectively by decision-theory methods.

• Usually $u_f$ can simply be agreed between the laboratory and the customer by professional judgement.

• In the proficiency test context, $u_f$ should be determined by the scheme provider.

Reference: T Fearn, S A Fisher, M Thompson and S L R Ellison, Analyst, 2002, 127, 818-824.

Page 46: Optimised Scoring in Proficiency Tests

A score that meets all of the criteria

• If we now define a z-score thus:

$$z = \frac{x - \hat{\mu}_{rob}}{\sigma_p}, \quad \text{where } \sigma_p = u_f,$$

we have a z-score that is both robustified against extreme values and tells us about fitness for purpose.

• In an exactly compliant laboratory, scores of 2 < |z| < 3 will be encountered occasionally, and scores of |z| > 3 rarely.

• Better performers will receive fewer of these extreme z-scores, worse performers more.
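As a hypothetical worked example (the $u_f$ and participant result are invented; the consensus 2.912 is the dataset 1 robust mean tabulated earlier):

```python
x_rob = 2.912    # robust-mean consensus (dataset 1, earlier slide)
u_f = 0.10       # fitness-for-purpose uncertainty set by the provider (assumed)
result = 2.60    # one participant's result (assumed)

z = (result - x_rob) / u_f
print(f"z = {z:.2f}")   # z = -3.12: the kind of score a compliant lab sees only rarely
```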

Page 47: Optimised Scoring in Proficiency Tests

Conclusions—optimal scoring

• Use z-scores based on fitness for purpose.

• Estimate the consensus as the robust mean and its uncertainty as $\hat{\sigma}_{rob}/\sqrt{n}$, if the dataset is roughly symmetric.

• If the dataset is skewed and plausibly composite, use a kernel density or a mixture model to find a consensus.

Page 48: Optimised Scoring in Proficiency Tests

And finally…

• Each dataset is unique. It is impossible to define a sequence of statistical operations that will properly handle every eventuality.

• Statistics (in the right hands) assists, but cannot replace, professional judgement.

Page 49: Optimised Scoring in Proficiency Tests

Statistical References

• Mixture models
  M Thompson, Accred Qual Assur, 2006, 10, 501-505.
  AMC Technical Brief No. 23, 2006 (www.rsc.org/amc).

• Kernel densities
  B W Silverman, Density estimation for statistics and data analysis. Chapman and Hall, London, 1986.
  AMC Technical Brief No. 4, 2001 (www.rsc.org/amc).

• The bootstrap
  B Efron and R J Tibshirani, An introduction to the bootstrap. Chapman and Hall, London, 1993.
  AMC Technical Brief No. 8, 2001 (www.rsc.org/amc).

• Robust statistics
  Analytical Methods Committee, Analyst, 1989, 114, 1489.
  AMC Technical Brief No. 6, 2001 (www.rsc.org/amc).
  P J Rousseeuw, J Chemometr, 1991, 5, 1.

Page 50: Optimised Scoring in Proficiency Tests

General references

• The International Harmonised Protocol for Proficiency Testing in Analytical Chemistry Laboratories (revised). M Thompson, S L R Ellison and R Wood, Pure Appl Chem, 2006, 78, 145-196.

• R E Lawn, M Thompson and R F Walker, Proficiency testing in analytical chemistry. The Royal Society of Chemistry, Cambridge, 1997.

• ISO Guide 43, Proficiency testing by interlaboratory comparisons, Geneva, 1997.

• ISO Standard 13528, Statistical methods for use in proficiency testing by interlaboratory comparisons, Geneva, 2005.