
CALIBRATION, PAST, PRESENT and FUTURE?

Jean-Claude DEVILLE

Ecole Nationale de la Statistique et de l’Analyse de l’Information / CREST / Laboratoire de Statistique d’Enquête,

Campus de Ker-Lann, 2 rue Blaise Pascal, 35170 BRUZ

0-Standard calibration principle

$x_k$ : p-vector of auxiliary variables; X : population total of the $x_k$

$d_k$ : design weights

We seek new weights $w_k$ satisfying the calibration equations:

$$\sum_s w_k x_k = X$$

The weights have the form

$$w_k = d_k F_k(x_k' \lambda), \qquad F_k(0) = 1$$

where $\lambda$ is a p-vector and the $F_k$ are regular functions of ONE variable verifying:

$$F_k'(0) = q_k > 0$$

Standard calibration principle

The calibrated estimator is approximately unbiased and its variance is given by the residual trick:

$$\mathrm{Var}\Big(\sum_s w_k y_k\Big) \approx \mathrm{Var}_{HT}\Big(\sum_s d_k e_k\Big)$$

where the $e_k$ are the residuals of the regression

$$y_k = x_k' \beta + e_k$$

with weights $q_k$.

The weights take their classical form because they are deduced by minimizing a distance function between the old and the new weights.
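To make the standard principle concrete, here is a minimal sketch of the linear case, assuming the calibration function $F_k(u) = 1 + q_k u$ (which satisfies $F_k(0) = 1$ and $F_k'(0) = q_k$). The function name `linear_calibration` and the toy data are hypothetical and are not taken from the slides.

```python
import numpy as np

def linear_calibration(d, x, X, q=None):
    """Linear calibration: w_k = d_k (1 + q_k x_k' lam) with sum_s w_k x_k = X."""
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    X_hat = x.T @ d                       # Horvitz-Thompson estimate of X
    T = (x * (d * q)[:, None]).T @ x      # sum_s d_k q_k x_k x_k'
    lam = np.linalg.solve(T, X - X_hat)   # calibration equations are linear in lam
    return d * (1.0 + q * (x @ lam))

# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
n = 50
x = np.column_stack([np.ones(n), rng.uniform(1.0, 5.0, n)])
d = np.full(n, 20.0)
X = np.array([1000.0, 3100.0])            # "known" totals, invented here
w = linear_calibration(d, x, X)
```

With other calibration functions (raking, logit) the equations are no longer linear in $\lambda$ and are usually solved iteratively; the linear case is the one that needs only a single linear solve.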

1 Generalized calibration

We start from functions $F_k : \mathbb{R}^p \to \mathbb{R}$ with $F_k(0) = 1$, and we seek weights having the form:

$$w_k = d_k F_k(\lambda)$$

Define

$$z_k = \frac{\partial F_k}{\partial \lambda}(0)$$

Therefore we have:

$$w_k = d_k \big(1 + z_k' \lambda + O(\|\lambda\|^2)\big)$$

The simplest case (linear) is obtained with $F_k(u) = 1 + z_k' u$; $z_k$ is then a variable with p components known on the sample.

• The usual case is the 'generalized linear' one, where we use only one function F, monotonic, regular, verifying F(0)=1.

The calibrated weights have the expression:

$$w_k = d_k F(z_k' \lambda)$$

The calibration equations are:

$$X = \sum_s d_k x_k F(z_k' \lambda)$$

With $T_{szx} = \sum_s d_k z_k x_k'$ and $\hat X = \sum_s d_k x_k$, we then get:

$$\lambda = (T_{szx}')^{-1} (X - \hat X) + O(\|X - \hat X\|^2)$$

Results are as in Deville-Särndal (1992):
- convergence and negligible bias.
- All the estimators having the same $z_k$ have the same asymptotic variance.
- It can be evaluated from the linear case, where we have:

$$\hat Y_C = \hat Y + (X - \hat X)' T_{szx}^{-1} \sum_s d_k z_k y_k = \hat Y + (X - \hat X)' \tilde\beta$$

where $\tilde\beta$ is the solution of $\sum_s d_k z_k (y_k - x_k' \tilde\beta) = 0$.

Generalized calibration

This is exactly the instrumental regression (Fuller (1987)) using the $z_k$ as instruments. The variance of the estimator is computed by the residual trick using the residuals of this regression. Variance estimation follows the same lines.

The "instruments" $z_k$ have to be known ONLY on the sample: they are NOT external auxiliary information.

Generalized calibration is one of the novelties included in CALMAR II (Sautory, Le Guennec (2003)).
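The link with instrumental regression can be checked numerically. The sketch below assumes the linear calibration function $F_k(u) = 1 + z_k' u$, calibrates on X using instruments $z_k$, and computes the instrumental-regression coefficient. The function name `instrument_calibration` and the simulated data are hypothetical; this is not the CALMAR II implementation.

```python
import numpy as np

def instrument_calibration(d, x, z, X):
    """Weights w_k = d_k (1 + z_k' lam) such that sum_s w_k x_k = X."""
    d, x, z = (np.asarray(a, dtype=float) for a in (d, x, z))
    X_hat = x.T @ d
    T_szx = (z * d[:, None]).T @ x              # sum_s d_k z_k x_k'
    lam = np.linalg.solve(T_szx.T, X - X_hat)   # lam = (T_szx')^{-1} (X - X_hat)
    return d * (1.0 + z @ lam), lam

# Simulated sample (assumed for illustration only)
rng = np.random.default_rng(1)
n = 40
u = rng.uniform(1.0, 5.0, n)
x = np.column_stack([np.ones(n), u])                            # calibration variables
z = np.column_stack([np.ones(n), u + rng.normal(0.0, 0.5, n)])  # instruments
y = 2.0 + 3.0 * u + rng.normal(0.0, 1.0, n)
d = np.full(n, 25.0)
X = np.array([1000.0, 3050.0])                                  # invented totals

w, lam = instrument_calibration(d, x, z, X)

# Instrumental-regression coefficient: solves sum_s d_k z_k (y_k - x_k' beta) = 0
T_szx = (z * d[:, None]).T @ x
beta = np.linalg.solve(T_szx, (z * d[:, None]).T @ y)
Y_cal = w @ y
Y_reg = d @ y + (X - x.T @ d) @ beta
```

In this linear case `Y_cal` and `Y_reg` coincide up to rounding, which is the identity $\hat Y_C = \hat Y + (X - \hat X)' \tilde\beta$ in code form.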

2-CALIBRATION FOR DEALING WITH NON-RESPONSE

• A parametric model for the response probabilities is defined by:

$$\Pr(k \in r \mid s) = P_k = 1 / G_k(\lambda_0)$$

In practice a generalized linear model: $G_k(\lambda) = G(z_k' \lambda)$

The calibration estimating equations are:

$$X = \sum_r d_k x_k G_k(\hat\lambda) = \sum_r d_k x_k G(z_k' \hat\lambda)$$

Non-response

Let $G_k(\hat\lambda) = G_k(\lambda_0) F_k(\delta)$ with $\delta = \hat\lambda - \lambda_0$.

The function $F_k(\delta) = G_k(\lambda_0 + \delta) / G_k(\lambda_0)$ is a calibration function.

If $\lambda_0$ were exactly known (as for two-phase sampling), we would be exactly in the case of generalized calibration, using the instruments:

$$z_k^* = \operatorname{grad} F_k(0) = \operatorname{grad}(\operatorname{Log} G_k)(\lambda_0)$$

They can be approximately computed by replacing $\lambda_0$ by $\hat\lambda$ (a consistent estimator).

Non-response

The usefulness of the $z^*$ variables is to reduce the bias caused by non-response. The x variables (on which we have auxiliary information), by contrast, contribute to variance reduction.

Application (see Le Guennec (2002) and Deville (2002)): the $z_k$ are observed, and the $x_k$ are the values of the same variable in the sampling frame.

REMARK: it is possible to include in the response model variables which are NOT observed for the non-respondents. In particular they may also be variables of interest. This gives interesting perspectives for 'non-ignorable' non-response.

A GOOD EXAMPLE/EXERCISE

Corrections intended to compensate for the effects of non-response require very precise knowledge of the factors that cause it. In particular, if what we want to measure directly influences the response probability, we are led to take risks with the data. Here is a small fictitious example: a group of students is surveyed about their drug use. The survey results are as follows:

          YES    NO    NON-RESPONSE   TOTAL
Boys       40    80        180         300
Girls      20   160        120         300
TOTAL      60   240        300         600

Naively, one would estimate the percentage of users as 60/(240+60) = 20%. This estimate is made under the assumption that the non-respondents behave like the respondents. But we notice that the response rate of the girls is higher than that of the boys. To correct for this, we compute the rate of users among the girls, 1/9, and among the boys, 3/9, and conclude that 2/9 = 22.2% of the observed student population are users. If we now believe that it is drug use itself that induces non-response, the model has two parameters $p_{yes}$ and $p_{no}$, the response probabilities of users and non-users respectively. Requiring that, within each sex, the estimated numbers of users and non-users add up to 300 gives two equations, 40/$p_{yes}$ + 80/$p_{no}$ = 300 and 20/$p_{yes}$ + 160/$p_{no}$ = 300, and we find that these probabilities are 0.2 and 0.8 respectively. The estimated number of users is therefore 200 among the boys and 100 among the girls, and the estimate of the overall percentage is 50%!
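The arithmetic of this example can be reproduced in a few lines. The sketch below treats the inverse response probabilities as unknowns and calibrates on the two known group sizes (boys and girls), the answer (yes/no) playing the role of the instrument. The variable names are illustrative only.

```python
import numpy as np

# Respondent counts: rows = boys, girls; columns = answered YES, answered NO
resp = np.array([[40.0, 80.0],
                 [20.0, 160.0]])
group_size = np.array([300.0, 300.0])     # known totals by sex

# Naive estimate: non-respondents assumed to behave like respondents
naive = resp[:, 0].sum() / resp.sum()

# Adjustment by sex: user rate within each sex, averaged over equal group sizes
by_sex = np.mean(resp[:, 0] / resp.sum(axis=1))

# Response depends on the answer itself: unknowns a = (1/p_yes, 1/p_no),
# calibration equations resp @ a = group_size
a = np.linalg.solve(resp, group_size)
p_yes, p_no = 1.0 / a
users = resp[:, 0] * a[0]                 # estimated users among boys, girls
share = users.sum() / group_size.sum()
```

The three estimates (`naive`, `by_sex`, `share`) come from the same table and differ only in the response model assumed, which is the point of the exercise.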

CALIBRATION ON IMPRECISE DATA

The auxiliary information X is now assumed to be uncertain (other surveys, expert estimates).

X and $\hat X$ are unbiased estimates of the same vector $X_0$, the variances of these two quantities being known or reliably estimated. The estimation can be understood as a shrunken (ridge) regression estimation: we look for a linear estimator of Y of the form

$$\tilde Y = \hat Y + B'(X - \hat X)$$

which leaves the estimation unbiased. If X is independent of $\hat X$ and $\hat Y$, the optimizing vector is evidently:

$$B = (\mathrm{Var}(\hat X) + \mathrm{Var}(X))^{-1} \mathrm{Cov}(\hat X, \hat Y)$$

A convenient approximation of this quantity, exact under simple random sampling, is, with $T_{sxx} = \sum_s d_k x_k x_k'$,

$$\hat B = (T_{sxx} + \mathrm{Var}(X))^{-1} \sum_s d_k x_k y_k$$

which gives the weights:

$$w_k = d_k \big(1 + (X - \hat X)' (T_{sxx} + \mathrm{Var}(X))^{-1} x_k\big)$$

In other words the regression is of the ridge type, and one can show that the variance of the estimator is

$$\mathrm{Var}(\tilde Y) = \mathrm{Var}(\hat Y_{GREG}) + B' \mathrm{Var}(X) B$$

It is estimated by that of the GREG estimator plus a known term.

It is interesting to note that this estimator also has an interpretation in terms of calibration. Applying it to the $x_k$, we indeed obtain:

$$\tilde X = T_{sxx} (T_{sxx} + \mathrm{Var}(X))^{-1} X + \mathrm{Var}(X) (T_{sxx} + \mathrm{Var}(X))^{-1} \hat X$$

that is, the (quasi-)optimal estimator formed by linear combination of X and $\hat X$. One can therefore say that the estimator $\tilde Y$ is calibrated on $\tilde X$, and deduce another expression for the weights and the variance in terms of $\tilde X$ instead of X.

Similar ideas can be developed in the framework of generalized calibration (with non-response).
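A small numerical sketch of these ridge-type weights follows, assuming $T_{sxx} = \sum_s d_k x_k x_k'$ and a known variance matrix Var(X). The function name `ridge_calibration_weights` and all numbers are invented for illustration.

```python
import numpy as np

def ridge_calibration_weights(d, x, X, var_X):
    """w_k = d_k (1 + (X - X_hat)' (T_sxx + Var(X))^{-1} x_k)."""
    d, x = np.asarray(d, dtype=float), np.asarray(x, dtype=float)
    X_hat = x.T @ d
    T_sxx = (x * d[:, None]).T @ x                   # sum_s d_k x_k x_k'
    lam = np.linalg.solve(T_sxx + var_X, X - X_hat)  # T_sxx + Var(X) is symmetric
    return d * (1.0 + x @ lam)

# Toy data (assumed for illustration only)
rng = np.random.default_rng(2)
n = 30
x = np.column_stack([np.ones(n), rng.uniform(1.0, 5.0, n)])
d = np.full(n, 10.0)
X = np.array([310.0, 950.0])           # imprecise external totals (invented)
var_X = np.diag([25.0, 400.0])         # their assumed variance matrix

w = ridge_calibration_weights(d, x, X, var_X)
X_tilde = x.T @ w                      # compromise between X and X_hat
```

Setting `var_X` to a zero matrix gives back exact linear calibration on X, while a very large `var_X` leaves the design weights almost unchanged.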

CALIBRATION AND INDIRECT SAMPLING

Indirect sampling (or the generalized weight share method, Lavallée (2002)) consists in sampling from a population $U^A$ linked to a population $U^B$ that it makes it possible to 'catch'. It leads to unbiased estimators, with known and estimable variance, for the variables of $U^B$. Using generalized calibration, one can also strengthen the 'natural' estimator by calibrating it simultaneously on known auxiliary totals of variables of $U^A$ and $U^B$. Most of the results can be found in Lavallée (2002), chapter 7. Calibration on information relating to several nested statistical units (households and individuals, for example) is a particular case of this approach.

CALIBRATION ON DISTRIBUTION FUNCTIONS

Ren (2000), Breidt and Opsomer (2000), Goga (2002, 2005)

Calibration on the distribution function of a continuous auxiliary variable is nothing other than a variant of poststratification using slices (intervals) of this variable. The question is to choose an estimator $\hat y_k$ of the expectation of $y_k$ conditional on $x_k$ (provided a meaning is given to this notion in the framework of finite populations). The estimator of the total of the $y_k$ is then

$$\hat Y_{fr} = \sum_U \hat y_k + \sum_s d_k (y_k - \hat y_k)$$

It is still a linear (weighted) estimator and, ideally, its variance is close to that of $\sum_s d_k (y_k - \hat y_k)$.

Calibration on several distribution functions has not given rise to any publication. It is an extension of the raking-ratio technique analogous to the extension of poststratification described above.
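As an illustration of the poststratification reading, the sketch below takes for $\hat y_k$ the weighted sample mean of y within slices of x, and evaluates the estimator as the sum of predictions over the population plus the weighted sample residuals. The function name `difference_estimator`, the slice boundaries and the data are assumptions made for the example; the cited papers use smoother predictors.

```python
import numpy as np

def difference_estimator(x_pop, x_s, y_s, d_s, edges):
    """sum_U yhat_k + sum_s d_k (y_k - yhat_k), with yhat_k = weighted slice mean."""
    bins_pop = np.digitize(x_pop, edges)   # slice index of every population unit
    bins_s = np.digitize(x_s, edges)       # slice index of every sampled unit
    n_bins = len(edges) + 1
    num = np.bincount(bins_s, weights=d_s * y_s, minlength=n_bins)
    den = np.bincount(bins_s, weights=d_s, minlength=n_bins)
    # Weighted mean of y per slice (0 for slices with no sampled unit)
    means = np.divide(num, den, out=np.zeros(n_bins), where=den > 0)
    return means[bins_pop].sum() + d_s @ (y_s - means[bins_s])

# Tiny deterministic example (invented data)
x_pop = np.arange(1, 13)                   # x known for the 12 population units
edges = np.array([4.5, 8.5])               # three slices of four units each
x_s = np.array([2, 3, 6, 7, 10, 11])
y_s = np.array([4.0, 6.0, 12.0, 14.0, 20.0, 22.0])
d_s = np.full(6, 2.0)

Y_fr = difference_estimator(x_pop, x_s, y_s, d_s, edges)
```

With slice means as predictors the residual term vanishes within every slice, so the estimator reduces to the usual poststratified estimator (slice size times slice mean, summed over slices).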

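INVERSE CALIBRATION AND OUTLIERS

(Ren and Chambers (2003))

We start by defining a robust estimator $\hat Y^*$ of the total Y. We then seek to replace the 'true' outlying values $y_k$, $k \in s_a$, by more 'normal' values $y_k^*$ such that

$$\hat Y^* = \sum_{k \in s \setminus s_a} w_k y_k + \sum_{k \in s_a} w_k y_k^*$$

The contribution of the outlying values to the 'robust' total is known and equals

$$\hat t_{y2} = \hat Y^* - \sum_{k \in s \setminus s_a} w_k y_k$$

The objective is therefore to impute values $y_k^*$, $k \in s_a$, such that

$$\sum_{k \in s_a} w_k y_k^* = \hat t_{y2}$$

In addition, we look for imputed values close to the true values. Setting, for $k \in s_a$, $y_k^* = y_k F_k(w_k \lambda)$ with $F_k(0) = 1$ and $F_k'(0) = q_k$, we recover a calibration problem where $\lambda$ is the solution of

$$\sum_{k \in s_a} w_k y_k F_k(w_k \lambda) = \hat t_{y2}$$

If, for example, F is linear, $F_k(u) = 1 + q_k u$, we find:

$$\hat\lambda = \frac{\hat t_{y2} - \sum_{k \in s_a} w_k y_k}{\sum_{k \in s_a} q_k w_k^2 y_k}, \qquad y_k^* = y_k (1 + q_k w_k \hat\lambda)$$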
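The linear case can be sketched as follows, assuming $F_k(u) = 1 + q_k u$, so that the imputed values are $y_k^* = y_k (1 + q_k w_k \lambda)$ and the equation for $\lambda$ is linear. The function name `inverse_calibration_linear` and the numbers are hypothetical, and this is only one reading of the approach of Ren and Chambers (2003).

```python
import numpy as np

def inverse_calibration_linear(y, w, t, q=None):
    """Impute y*_k = y_k (1 + q_k w_k lam) for outliers so that sum w_k y*_k = t."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    q = np.ones_like(y) if q is None else np.asarray(q, dtype=float)
    lam = (t - w @ y) / np.sum(q * w**2 * y)   # solves the linear equation in lam
    return y * (1.0 + q * w * lam)

# Outlying values, their weights, and their target contribution to the robust total
y_out = np.array([900.0, 1200.0, 5000.0])
w_out = np.array([10.0, 10.0, 10.0])
t_target = 40000.0                              # invented for illustration

y_star = inverse_calibration_linear(y_out, w_out, t_target)
```

Here the roles of the weights and of the variable are exchanged with respect to ordinary calibration: the weights are kept fixed and the values are adjusted. With equal $w_k$ and $q_k$, as in this toy data, the adjustment is a proportional shrinkage of the outlying values.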

3-ESTIMATION OF A NON-LINEAR FUNCTIONAL BY CALIBRATION ON A SET OF FUNCTIONALS (hypercalibration?)

Calibrated weights are designed for improving the estimation of totals. However, in practice, they are used for the whole estimation process, including non-linear estimations. This may be inefficient!

Example: the ratio estimator, with the weights $d_k X / \hat X$, improves the estimation of the totals Y and Z, but has no effect on the estimation of Y/Z.

hypercalibration

We shall try to improve the 'plug-in' estimator (Deville (1999)) $\hat T$ of a functional T using the known value of a (p-dimensional) functional S as well as its 'plug-in' estimate $\hat S$. Define two families of functionals $T(\lambda)$ and $S(\lambda)$ (where $\lambda$ is a p-dimensional parameter) such that $T(0) = T$ and $S(0) = S$, and $\lambda \mapsto S(\lambda)$ is a regular one-to-one function from a neighbourhood of 0 onto a neighbourhood of S. In particular (with a probability asymptotically tending to 1), there exists a unique solution $\hat\lambda$ such that $\hat S(\hat\lambda) = S$. The hypercalibrated estimator is $\hat T(\hat\lambda)$.

hypercalibration: example

A simple example: S, T and $\lambda$ have dimension 1, with $S(\lambda) = (1 + \lambda) S$ and $T(\lambda) = (1 + \lambda) T$.

This leads to the natural results ('ratio' estimator): $1 + \hat\lambda = S / \hat S$ and

$$\hat T(\hat\lambda) = \hat T \, S / \hat S$$

The variance of this estimator (Deville (1999)) is the variance of the total of $t_k - (T/S) s_k$ (with $t_k$ and $s_k$ the linearized variables for T and S). The variance is reduced when t and s are sufficiently correlated.

hypercalibration: variance and variance estimation

In general, the linearized variance of the hypercalibrated estimator $\hat T(\hat\lambda)$ is the variance of the initial estimator (generally Horvitz-Thompson) for the total of the linearized variable

$$t_k - \frac{\partial T}{\partial \lambda'} \Big(\frac{\partial S}{\partial \lambda'}\Big)^{-1} s_k$$

(derivatives taken at $\lambda = 0$).

A variance estimation procedure follows in a straightforward way.

hypercalibration: weighted estimator

It would be nice to get a weighted (linear) estimator.

For that we can use weights having the form:

$$w_k(\lambda) = w_k F_k(\lambda)$$

$T(\lambda)$ is the functional obtained by substituting the weights $F_k(\lambda)$ for the weights 1 in the definition of T; $\hat T(\lambda)$ is obtained with the weights $w_k F_k(\lambda)$ (and the same for S and $\hat S$).

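Hypercalibration: weighted estimator 2

$\hat\lambda$ is the solution of $\hat S(\hat\lambda) = S$.

As $z_k = \frac{\partial F_k}{\partial \lambda}(0)$, we have

$$\frac{\partial S}{\partial \lambda'}(0) = \sum_U s_k z_k', \qquad \frac{\partial T}{\partial \lambda'}(0) = \sum_U t_k z_k'$$

We get the linearized variable of $\hat\lambda$:

$$l_k = -\Big(\sum_U s_k z_k'\Big)^{-1} s_k$$

The linearized variable of $\hat T(\hat\lambda)$ is then:

$$t_k - B s_k \quad \text{where} \quad B = \Big(\sum_U t_k z_k'\Big) \Big(\sum_U s_k z_k'\Big)^{-1}$$

• B is the regression of t on s using z as instrument.

• In the case of totals, we get the previous results.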

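Hypercalibration: example of weighted estimator 2

An example: T = Y/X (ratio) is to be estimated, and $s_k = y_k / x_k$ is observed on the sample and available on the frame, so that $S = \sum_U s_k$ is known. One can build a weighted estimator with the calibration function:

$F_k(\lambda) = 1 + \lambda$ if $s_k > \bar S$ (sample $s_>$)

$F_k(\lambda) = 1 - \lambda$ if $s_k < \bar S$ (sample $s_<$)

The calibration equation is:

$$S = (1 + \lambda) \sum_{s_>} w_k s_k + (1 - \lambda) \sum_{s_<} w_k s_k$$

And finally:

$$\hat T(\hat\lambda) = \frac{\hat Y + \hat\lambda (\hat Y_> - \hat Y_<)}{\hat X + \hat\lambda (\hat X_> - \hat X_<)}$$

where $\hat Y_>$ and $\hat Y_<$ denote the weighted totals of y over $s_>$ and $s_<$, and likewise for X.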
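A numerical sketch of this example is given below. It assumes that the calibration function multiplies the weights by $1 + \lambda$ for units whose $s_k$ lies above a threshold and by $1 - \lambda$ otherwise, and that the threshold is the population mean of the $s_k$; both the sign convention and the threshold are assumptions, as are the function name `hypercalibrated_ratio` and the simulated data.

```python
import numpy as np

def hypercalibrated_ratio(w, y, x, S_total, threshold):
    """Ratio estimate with weights w_k (1 +/- lam) calibrated on the total of s_k = y_k/x_k."""
    w, y, x = (np.asarray(a, dtype=float) for a in (w, y, x))
    s = y / x
    sign = np.where(s > threshold, 1.0, -1.0)    # instrument z_k = +1 or -1
    # Calibration equation: sum w_k s_k + lam * sum sign_k w_k s_k = S_total
    lam = (S_total - w @ s) / (w @ (sign * s))
    w_cal = w * (1.0 + lam * sign)
    return (w_cal @ y) / (w_cal @ x), w_cal

# Simulated population and sample (assumed for illustration only)
rng = np.random.default_rng(3)
N, n = 200, 20
x_pop = rng.uniform(1.0, 5.0, N)
y_pop = x_pop * rng.uniform(0.5, 1.5, N)
s_pop = y_pop / x_pop
idx = rng.choice(N, size=n, replace=False)
w = np.full(n, N / n)

ratio, w_cal = hypercalibrated_ratio(
    w, y_pop[idx], x_pop[idx], s_pop.sum(), s_pop.mean()
)
```

The resulting estimator is a ratio of two weighted totals computed with the same calibrated weights. Nothing in this sketch prevents negative weights if the solved `lam` falls outside (-1, 1), so a practical version would need a bounded calibration function.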

Instruments!

Exportation towards classical statistics

Empirical likelihood? It seems to be nothing else than classical calibration using the Kullback-Leibler distance centered at the 'model' instead of the 'true value'. It was already present in my paper of 92. The likelihood argument was cut in the final version to make it short and to avoid pedantry. See for instance papers by Changbao Wu or J.N.K. Rao.

Calibration principle: it's what I called 'hyper'calibration.

Applicable to classical statistics in problems like estimating a median knowing the mean of the distribution. In parametric statistics, estimation by maximum likelihood using the known true value of an auxiliary parameter (e.g. log-normal law) is a particular case of the principle.

Variance estimation seems to be tackled by balanced bootstrap, a technique in progress which poses some intricate questions of balancing a sample WITH replacement!