Optimised Scoring in Proficiency Tests


Michael Thompson
School of Biological and Chemical Sciences
Birkbeck College (University of London)
Malet Street, London WC1E 7HX
m.thompson@bbk.ac.uk

Criteria for an ideal scoring method

• Adds value to raw results.

• Easily understandable, no arbitrary scaling transformation.

• Is transferable between different concentrations, analytes, matrices, and measurement principles.

The z-score

$$ z = \frac{x - x_A}{\sigma_p} $$

where $x$ is the participant's result, $x_A$ is the assigned value (the scheme provider's best estimate of the true value), and $\sigma_p$ is the "target value" or "standard deviation for proficiency".
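As a quick worked illustration of the formula (the numbers below are invented, not from any real round), a single z-score in Python:

```python
# Illustrative only: hypothetical numbers, not taken from any scheme.
x = 5.42        # participant's result
x_a = 5.10      # assigned value (provider's best estimate of the true value)
sigma_p = 0.25  # standard deviation for proficiency (the "target value")

z = (x - x_a) / sigma_p
print(f"z = {z:.2f}")   # prints: z = 1.28
```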

Determining an assigned value

• Reference laboratory result

• Certified reference material(s)

• Formulation

• Consensus of participants’ results

“Health warnings” about the consensus

• The consensus is not necessarily identical with the true value. PT providers and users have to be alert to this possibility.

• The consensus must have a sufficiently small uncertainty. This usually requires >20 participants.

What exactly is a ‘consensus’?

• Mean? - easy to calculate, but affected by outliers and asymmetry.

• Robust mean? - fairly easy to calculate, handles outliers but affected by strong asymmetry.

• Median? - easy to calculate, more robust for asymmetric distributions, but larger standard error than the robust mean.

• Mode? - intuitively good, handles strong skews, but difficult to define and difficult to calculate.

Finding a ‘consensus’—the tools of the trade

• Robust mean and standard deviation

• Kernel density mode and its standard error

• Mixture model representation

Robust mean and standard deviation

• Robust statistics is applicable to datasets that look like normally distributed samples contaminated with outliers and stragglers (i.e., unimodal and roughly symmetric).

• The method downweights the otherwise large influence of outliers and stragglers on the estimates.

• It models the central ‘reliable’ part of the dataset.

• The estimates are found by a procedure, not a formula.

Huber’s H15 estimators $\hat{\mu}_{rob}, \hat{\sigma}_{rob}$ for a dataset $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ are found iteratively:

1. Set $p = 0$, $k = 1.5$, $\hat{\mu}_0 = \mathrm{median}(x_i)$, $\hat{\sigma}_0 = 1.5 \times \mathrm{MAD}(x_i)$.

2. Winsorise the data:
$$ \tilde{x}_i = \begin{cases} \hat{\mu}_p - k\hat{\sigma}_p & \text{if } x_i < \hat{\mu}_p - k\hat{\sigma}_p \\ \hat{\mu}_p + k\hat{\sigma}_p & \text{if } x_i > \hat{\mu}_p + k\hat{\sigma}_p \\ x_i & \text{otherwise} \end{cases} $$

3. Update the estimates: $\hat{\mu}_{p+1} = \mathrm{mean}(\tilde{x})$ and $\hat{\sigma}^2_{p+1} = f(k)\,\mathrm{var}(\tilde{x})$, where $f(k)$ corrects for the winsorisation.

4. If not converged, set $p \leftarrow p + 1$ and repeat from step 2.
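A minimal numpy sketch of this iteration follows. The starting scale (1.5 × MAD), the convergence tolerance, and the use of the factor 1.134 (the usual consistency correction for k = 1.5, applied here to the winsorised standard deviation in place of f(k)) are my assumptions, not a prescription from the slides:

```python
import numpy as np

def huber_h15(x, k=1.5, tol=1e-6, max_iter=100):
    """Huber H15-type robust mean and SD by iterated winsorisation (sketch)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    sigma = 1.5 * np.median(np.abs(x - mu))          # sigma_0 from the MAD
    for _ in range(max_iter):
        x_tilde = np.clip(x, mu - k * sigma, mu + k * sigma)   # winsorise extremes
        mu_new, sigma_new = x_tilde.mean(), 1.134 * x_tilde.std(ddof=1)
        converged = abs(mu_new - mu) < tol and abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma

# Example: mostly well-behaved results plus two discordant values
mu_rob, sigma_rob = huber_h15([9.8, 10.1, 10.0, 9.9, 10.2, 10.05, 9.95, 13.5, 6.2])
```

The consensus uncertainty quoted on the later slide is then simply sigma_rob / sqrt(n).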

When can I safely use robust estimates?

[Figure: example distributions along the measurement axis: skewed, bimodal, and heavy-tailed]

The robust mean as consensus

• The robust mean provides a useful consensus in the great majority of instances.

• The uncertainty of this consensus can be safely taken as
$$ u(x_a) = \frac{\hat{\sigma}_{rob}}{\sqrt{n}} $$

Finding a ‘consensus’—the tools of the trade

• Robust mean and standard deviation

• Kernel density mode and its standard error

• Mixture model representation

The mode as a consensus

Can I use the mode? How many modes? Where are they?

The normal kernel density for identifying a mode

The kernel density at a point $x$ is
$$ y(x) = \frac{1}{nh}\sum_{i=1}^{n} \phi\!\left(\frac{x - x_i}{h}\right) $$
where $\phi$ is the standard normal density, $\phi(a) = \exp(-a^2/2)/\sqrt{2\pi}$, and $h$ is the smoothing parameter.

Reference: AMC Technical Brief No. 4 (www.rsc.org/amc)

[Figures: a single normal kernel, and the kernel density formed from the data]

Reference: AMC Technical Brief No. 4 (www.rsc.org/amc)

Another kernel density: same data, different h

Reference: AMC Technical Brief No. 4. (www.rsc.org/amc)
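A small numpy sketch of the kernel density above, plus a crude grid search for its modes. The grid limits, the resolution, and the example values of h are arbitrary choices for illustration (the AMC brief discusses how to pick h):

```python
import numpy as np

def kernel_density(x, data, h):
    """y(x) = (1/nh) * sum_i phi((x - x_i)/h), with phi the standard normal density."""
    data = np.asarray(data, dtype=float)
    u = (np.asarray(x, dtype=float)[..., None] - data) / h
    return np.exp(-0.5 * u ** 2).sum(axis=-1) / (len(data) * h * np.sqrt(2 * np.pi))

def modes(data, h, grid_points=2001):
    """Locate local maxima of the kernel density on a fine grid (crude but simple)."""
    data = np.asarray(data, dtype=float)
    grid = np.linspace(data.min() - 3 * h, data.max() + 3 * h, grid_points)
    y = kernel_density(grid, data, h)
    peaks = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    return grid[1:-1][peaks]

# Same data, different h: a larger h smooths away minor modes
rng = np.random.default_rng(1)
data = np.r_[rng.normal(2.90, 0.05, 35), rng.normal(3.05, 0.03, 8)]
for h in (0.01, 0.03, 0.08):
    print(f"h = {h}: {len(modes(data, h))} mode(s)")
```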

Uncertainty of the mode

• The uncertainty of the consensus can be estimated as the standard error of the mode by applying the bootstrap to the procedure.

• The bootstrap is a general procedure, based on resampling, for estimating standard errors of complex statistics.

• Reference: Bump-hunting for the proficiency tester – searching for multimodality. P J Lowthian and M Thompson, Analyst, 2002, 127, 1359-1364.
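A sketch of the bootstrap estimate of the mode’s standard error, reusing the kernel_density()/modes() helpers sketched earlier; the 200 resamples and the rule of keeping the highest-density mode are my choices, not the published procedure:

```python
import numpy as np

def bootstrap_se_of_mode(data, h, n_boot=200, seed=0):
    """Standard error of the principal kernel-density mode by resampling."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    boot_modes = []
    for _ in range(n_boot):
        sample = rng.choice(data, size=len(data), replace=True)  # resample with replacement
        m = modes(sample, h)                                     # all local maxima
        boot_modes.append(m[np.argmax(kernel_density(m, sample, h))])  # keep principal mode
    return np.std(boot_modes, ddof=1)
```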

Finding a ‘consensus’—the tools of the trade

• Robust mean and standard deviation

• Kernel density mode and its standard error

• Mixture model representation

Mixture models and consensus

• For each component you can calculate:
  - a mean
  - a variance
  - a proportion

2-component normal mixture model and kernel density

The normal mixture model

$$ f(y) = \sum_{j=1}^{m} p_j f_j(y), \qquad \sum_{j=1}^{m} p_j = 1, $$

where each component is normal,

$$ f_j(y) = \frac{1}{\sigma_j\sqrt{2\pi}}\exp\!\left(-\frac{(y - \mu_j)^2}{2\sigma_j^2}\right). $$

References: AMC Technical Brief No. 23, and AMC Software; M Thompson, Accred Qual Assur, 2006, 10, 501-505.

Mixture models found by the maximum likelihood method (the EM algorithm)

• The E-step computes the probability that result $y_i$ belongs to component $j$:
$$ P(j \mid y_i) = \frac{\hat{p}_j f_j(y_i)}{\sum_{j=1}^{m} \hat{p}_j f_j(y_i)} $$

• The M-step updates the parameters of each component:
$$ \hat{p}_j = \frac{1}{n}\sum_{i=1}^{n} P(j \mid y_i) $$
$$ \hat{\mu}_j = \frac{\sum_{i=1}^{n} P(j \mid y_i)\, y_i}{\sum_{i=1}^{n} P(j \mid y_i)} $$
$$ \hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} P(j \mid y_i)\,(y_i - \hat{\mu}_j)^2}{\sum_{i=1}^{n} P(j \mid y_i)} $$
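A compact numpy sketch of these E- and M-steps for an m-component normal mixture; the quantile-based initialisation and the fixed number of iterations are simplifications of my own (the AMC software referenced above is the real tool):

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_normal_mixture(y, m=2, n_iter=200):
    """EM for an m-component normal mixture: returns proportions, means, SDs."""
    y = np.asarray(y, dtype=float)
    p = np.full(m, 1.0 / m)                           # equal starting proportions
    mu = np.quantile(y, np.linspace(0.2, 0.8, m))     # spread-out starting means
    sigma = np.full(m, y.std(ddof=1))
    for _ in range(n_iter):
        # E-step: P(j | y_i) proportional to p_j * f_j(y_i)
        w = p * normal_pdf(y[:, None], mu, sigma)     # shape (n, m)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportion, mean and variance of each component
        nj = w.sum(axis=0)
        p = nj / len(y)
        mu = (w * y[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((w * (y[:, None] - mu) ** 2).sum(axis=0) / nj)
    return p, mu, sigma
```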

Example datasets

Example dataset 1

Number of modes vs smoothing factor h

Statistics: dataset 1

Method              | Estimate | SD    | SE
Robust              | 2.912    | 0.056 | 0.0056
Kernel density mode | 2.912    | –     | 0.0056
Mixture model       | 2.913    | 0.075 | 0.0075

Skewed/multimodal distributions

• Skews and extra modes can arise when the participants’ results come from two or more inconsistent methods.

• Skews can also arise as an artefact at low concentrations of analyte as a result of common data recording practices.

• Rarely, skews can arise when the distribution is truly lognormal (e.g., in GMO determinations).

Example dataset 2

What went wrong?

• Analyte defined as % fatty acid in oil.

• Most labs used an internal standard method.

• Hypothesis: other labs (incorrectly) reported results based on the methyl ester peak area ratio.

• Incorrect results expected to be high by a factor of 1.05.

• Ratio of modes found = 1.04.

Example 3—Ba in silicate rock

Choice of value for σp

• Robust standard deviation of participants’ results in the round?

• From a perception of how well similar methods perform?

• Legislation?

• Other?

Self-referential scoring

• Nearly always, more than 90% of laboratories receive a z-score between −2 and +2.

• This suggests, to both provider and participants, that accuracy is generally OK, whether or not that is the case.

• No reference is made to end-user requirements.

• z-Scores for a participant cannot be meaningfully compared round-to-round.

Self-referential scoring uses
$$ z = \frac{x - \hat{\mu}_{rob}}{\hat{\sigma}_{rob}}, $$
i.e., both the assigned value and $\sigma_p$ are taken from the participants’ own results.

What more do we need?

• We need a method that evaluates the results in relation to their intended use, rather than merely describing them.

• We need a method in which a score of (say) -3.1 has a meaning independent of the analyte, matrix, or analytical method.

• We need a method based on:

fitness for purpose.

Fitness for purpose

• Fitness for purpose occurs when the uncertainty of the result, $u_f$, gives best value for money.

• If the uncertainty is smaller than $u_f$, the analysis may be too expensive.

• If the uncertainty is larger than $u_f$, the cost and the probability of a mistaken decision will rise.

Fitness for purpose

• The value of $u_f$ can sometimes be estimated objectively by decision theory methods.

• Usually $u_f$ can be simply agreed between the laboratory and the customer by professional judgement.

• In the proficiency test context, $u_f$ should be determined by the scheme provider.

Reference: T Fearn, S A Fisher, M Thompson, and S L R Ellison, Analyst, 2002, 127, 818-824.

A score that meets all of the criteria

• If we now define a z-score as
$$ z = \frac{x - \hat{\mu}_{rob}}{\sigma_p}, \qquad \text{where } \sigma_p = u_f, $$
we have a z-score that is both robustified against extreme values and tells us about fitness for purpose.

• In an exactly compliant laboratory, scores of 2 < |z| < 3 will be encountered occasionally, and scores of |z| > 3 rarely.

• Better performers will receive fewer of these extreme z-scores, worse performers more.
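Putting the pieces together, a sketch of scoring a round against fitness for purpose (assuming the huber_h15() function outlined earlier and an invented value of u_f):

```python
import numpy as np

def ffp_z_scores(results, u_f):
    """z = (x - x_a) / sigma_p, with sigma_p fixed at u_f and x_a the robust mean."""
    results = np.asarray(results, dtype=float)
    x_a, _ = huber_h15(results)            # robust consensus as the assigned value
    return (results - x_a) / u_f

# Hypothetical round scored against a fitness-for-purpose uncertainty of 0.2 units
z = ffp_z_scores([9.8, 10.1, 10.0, 9.9, 10.2, 10.05, 9.95, 13.5, 6.2], u_f=0.2)
```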

Conclusions—optimal scoring

• Use z-scores based on fitness for purpose.

• Estimate the consensus as the robust mean and its uncertainty as $\hat{\sigma}_{rob}/\sqrt{n}$ if the dataset is roughly symmetric.

• If the dataset is skewed and plausibly composite, use a kernel density or a mixture model to find a consensus.

And finally…

• Each dataset is unique. It is impossible to define a sequence of statistical operations that will properly handle every eventuality.

• Statistics (in the right hands) assists, but cannot replace, professional judgement.

Statistical References

• Mixture models
  M Thompson, Accred Qual Assur, 2006, 10, 501-505.
  AMC Technical Brief No. 23, 2006 (www.rsc.org/amc).

• Kernel densities
  B W Silverman, Density estimation for statistics and data analysis. Chapman and Hall, London, 1986.
  AMC Technical Brief No. 4, 2001 (www.rsc.org/amc).

• The bootstrap
  B Efron and R J Tibshirani, An introduction to the bootstrap. Chapman and Hall, London, 1993.
  AMC Technical Brief No. 8, 2001 (www.rsc.org/amc).

• Robust statistics
  Analytical Methods Committee, Analyst, 1989, 114, 1489.
  AMC Technical Brief No. 6, 2001 (www.rsc.org/amc).
  P J Rousseeuw, J. Chemometrics, 1991, 5, 1.

General references

• The International Harmonised Protocol for Proficiency Testing in Analytical Chemistry Laboratories (revised), M Thompson, S L R Ellison and R Wood. Pure Appl. Chem., 2006, 78, 145-196.

• R E Lawn, M Thompson and R F Walker, Proficiency testing in analytical chemistry. The Royal Society of Chemistry, Cambridge, 1997.

• ISO Guide 43. Proficiency testing by interlaboratory comparisons, Geneva, 1997.

• ISO Standard 13528. Statistical methods for use in proficiency testing by interlaboratory comparisons, Geneva, 2005. (Recommended)