METHODS OF DETECTING AND TREATING OUTLIERS USED IN REPUBLIKA SRPSKA INSTITUTE OF STATISTICS
Darko Marinković, Aleksandra Djonlaga
Republika Srpska Institute of Statistics
Introduction
• Total survey error (TSE): the difference between the “true value” of a parameter and its survey estimate [1]
• Sampling error is the part of TSE that is under the control of statisticians
• Adequate allocation of the sample and the use of auxiliary information at the design and estimation stages usually solve the problem
• Non-sampling errors are caused by all survey operations other than sampling
• Keeping non-sampling errors under control can be a real challenge for statisticians
[1] Biemer, P.P. and Lyberg, L.E. (2003)
Non-sampling errors (with respect to source)
• Specification error (concept implied by the survey question and the concept that should be measured in the survey differ)
• Frame error (construction of the sampling frame for the survey: possible over- or undercoverage, misclassifications, duplications)
• Non-response error (arises from incomplete or completely missing data)
• Measurement error (response differs from the true value because of the interviewer, respondent, questionnaire design, collection method, information system, ...)
• Processing error (editing of data, data entry, coding, the assignment of survey weights, and the tabulation of survey data)
Non-sampling errors (with respect to nature)
• Stochastic: occur due to accidental sources of error
  • They do not significantly bias the parameter estimates in any specific direction, because they tend to cancel out if a large enough sample is used
• Systematic: consistently affect data in one or more survey phases (data capturing, data entry, data coding, data processing, ...) and tend to accumulate over the entire sample
  • If not treated properly, they can cause serious bias in survey estimates
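The contrast between the two kinds of error can be illustrated with a small simulation. This is only a sketch: the function name `simulate_mean_error`, the normal noise model and the shift of 5 units are assumptions made for the illustration, not part of the presentation.

```python
import random
import statistics


def simulate_mean_error(n, systematic_shift=0.0, noise_sd=10.0, seed=1):
    """Mean reporting error over n units (hypothetical error model).

    Each unit's error is zero-mean random noise (stochastic part)
    plus a constant shift (systematic part).
    """
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, noise_sd) + systematic_shift for _ in range(n)]
    return statistics.fmean(errors)


# Stochastic errors average out as n grows; the systematic shift does not.
for n in (100, 10_000):
    stochastic_only = simulate_mean_error(n)
    with_systematic = simulate_mean_error(n, systematic_shift=5.0)
    print(n, round(stochastic_only, 3), round(with_systematic, 3))
```

With purely stochastic errors the mean error shrinks towards zero as the sample grows, while the systematic component stays in the estimate regardless of the sample size.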
Non-sampling errors (with respect to impact or influence)
• Impact or influence depends on the definition of the target parameter, the applied estimator and the domain of interest
• Influential values are values that significantly affect estimates of the target parameter if not treated properly (or if included in or excluded from estimation)
• They do not necessarily correspond to extreme values: a value can be influential because it is tied to a large sampling weight
• E.g. an extreme value tied to a large sampling weight: a classification problem?
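A minimal sketch of why influence depends on the weight as well as on the value: each unit's share of a weighted total is computed below. The data, the function names and the choice of a weighted total as the target estimator are assumptions for illustration only.

```python
def weighted_total(values, weights):
    """Weighted total: sum of weight * value over all sampled units."""
    return sum(w * y for y, w in zip(values, weights))


def shares_of_total(values, weights):
    """Share of the weighted total contributed by each unit."""
    total = weighted_total(values, weights)
    return [w * y / total for y, w in zip(values, weights)]


# Hypothetical sample: unit 3 has an extreme value but weight 1,
# unit 4 has a moderate value but a large sampling weight.
values = [100, 120, 90, 2000, 300]
weights = [10, 10, 10, 1, 50]

for y, w, share in zip(values, weights, shares_of_total(values, weights)):
    print(y, w, round(share, 3))
```

In this made-up sample the moderate value with the large weight contributes far more to the estimate than the extreme value with weight 1, so the extreme value is not the most influential one.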
Non-sampling errors (with respect to appearance)
• If we can detect them, they appear as:
  • Missing values (data not collected for some reason)
  • Outliers (values not coherent with the expected model)
  • Data inconsistencies (data not coherent with a prespecified set of mathematical and/or logical rules)
• Can we detect all non-sampling errors? (inliers)
Outliers
• Fall outside of an overall trend or do not follow a model that is assumed for a specific phenomenon that is of interest
• Can be defined with respect to one variable (univariate) or more (bivariate or multivariate)
• Can be defined with respect to the whole population or specific domains of interest
• An outlier does not have to be an erroneous observation: it can reflect a true change in the phenomenon of interest. This distinction is of crucial importance
Outliers
• From the sampling perspective:
  • Representative outliers: observations identified as outliers that are not erroneous; they potentially represent other elements in the population and contain information about the (higher) variability of the investigated phenomenon
  • Non-representative outliers:
    • observations identified as outliers that are unique correct cases and do not represent any other element in the population (they should not be extrapolated to the population)
    • observations identified as outliers that are erroneous, and are unique cases whose unknown true values are not outliers (they must be treated)
Detection and treatment of outliers
• How different does the value have to be from the rest of the data to be an outlier?
• No simple answer: it is a balance between the objectives of the survey (estimates, domains of interest, precision) on the one hand and the available resources on the other
• Representative outliers are treated at the estimation phase
• Non-representative outliers are treated at the editing and imputation phase
• Problem of overediting!!!
Bivariate method of Hidiroglou and Berthelot for outlier detection
• Originally designed for detection of outliers in periodic surveys (Hidiroglou, M.A. and Berthelot, J.M. (1986))
• If a phenomenon of interest is measured at two (or more) different time points, then it is possible to identify the observations with the “biggest change”
• Here the method is applied to compare two related variables (Y1 and Y2) in the same survey iteration
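Bivariate method of Hidiroglou and Berthelot for outlier detection
• Business population variables usually have skewed distributions, so the ratios R12 should be transformed:

$$
s_i =
\begin{cases}
1 - \dfrac{\operatorname{median}(R_{12,i})}{R_{12,i}}, & \text{if } 0 < R_{12,i} < \operatorname{median}(R_{12,i}) \\[1ex]
\dfrac{R_{12,i}}{\operatorname{median}(R_{12,i})} - 1, & \text{if } R_{12,i} \ge \operatorname{median}(R_{12,i})
\end{cases}
$$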
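The centering transformation of the ratios can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' R implementation: the function name `hb_centered_ratios` is hypothetical, and forming the ratio as Y1 divided by Y2 is an assumption.

```python
import statistics


def hb_centered_ratios(y1, y2):
    """Transform ratios R = y1 / y2 into values s centered on zero.

    s = 1 - median(R) / R   if 0 < R < median(R)
    s = R / median(R) - 1   if R >= median(R)
    Assumes all values are strictly positive (assumption of this sketch).
    """
    if any(a <= 0 or b <= 0 for a, b in zip(y1, y2)):
        raise ValueError("both variables must be strictly positive")
    ratios = [a / b for a, b in zip(y1, y2)]
    med = statistics.median(ratios)
    return [1.0 - med / r if r < med else r / med - 1.0 for r in ratios]


# Hypothetical data: ratios are 1, 2 and 3, so the median ratio is 2.
print(hb_centered_ratios([10, 20, 30], [10, 10, 10]))
```

The transformation treats a ratio that is half the median and a ratio that is twice the median symmetrically (s = -1 and s = 1), which is what makes it suitable for skewed ratios.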
Bivariate method of Hidiroglou and Berthelot for outlier detection
• To take into account the magnitude of the difference between the values of the two analyzed variables, a further transformation of the s_i values is performed:

$$
E_i = s_i \cdot \left[\max\left(y_{1,i},\, y_{2,i}\right)\right]^{U}
$$

• The parameter 0 ≤ U ≤ 1 controls the importance given to the magnitude of the difference between the two variables
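The size adjustment multiplies each centered ratio s_i by the larger of the two variable values raised to the power U. A minimal Python sketch follows; the function name `hb_effects` and the default U = 0.5 are illustrative assumptions, and this is not the authors' R code.

```python
def hb_effects(s, y1, y2, U=0.5):
    """Compute E_i = s_i * max(y1_i, y2_i) ** U for each unit.

    s  : centered ratios, one per unit
    y1 : values of the first variable
    y2 : values of the second variable
    U  : exponent in [0, 1]; U = 0 ignores unit size entirely
    """
    if not 0.0 <= U <= 1.0:
        raise ValueError("U must lie between 0 and 1")
    return [si * max(a, b) ** U for si, a, b in zip(s, y1, y2)]


# Two hypothetical units with the same centered ratio but different size.
print(hb_effects([0.5, 0.5], [100, 10_000], [4, 400], U=0.5))
print(hb_effects([0.5, 0.5], [100, 10_000], [4, 400], U=0.0))
```

With U = 0 the two units get the same E value, while a larger U gives more importance to the same relative deviation when it occurs in a larger unit.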
Bivariate method of Hidiroglou and Berthelot for outlier detection
• The values that fall outside the following interval are classified as outliers:

$$
\left(E_{median} - C \cdot d_{Q1},\; E_{median} + C \cdot d_{Q3}\right)
$$

• Where:

$$
d_{Q1} = E_{median} - E_{Q25}, \qquad d_{Q3} = E_{Q75} - E_{median}
$$
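The acceptance interval around the median of the E values, widened by the parameter C, can be sketched as below. The function name `hb_outlier_flags` is hypothetical, and the quartile definition (the "inclusive" method of Python's `statistics.quantiles`) is an assumption; other quartile conventions give slightly different bounds.

```python
import statistics


def hb_outlier_flags(E, C):
    """Flag values of E outside (E_med - C*d_Q1, E_med + C*d_Q3).

    d_Q1 = E_med - E_Q25 and d_Q3 = E_Q75 - E_med.
    Returns (flags, (lower, upper)).
    """
    q25, med, q75 = statistics.quantiles(E, n=4, method="inclusive")
    d_q1 = med - q25
    d_q3 = q75 - med
    lower = med - C * d_q1
    upper = med + C * d_q3
    flags = [e < lower or e > upper for e in E]
    return flags, (lower, upper)


# Hypothetical E values with one clearly deviating unit.
flags, bounds = hb_outlier_flags([-0.2, -0.1, 0.0, 0.1, 0.2, 5.0], C=4)
print(flags, bounds)
```

A larger C widens the interval and flags fewer units, so C directly controls how many cases go to manual review.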
Bivariate method of Hidiroglou and Berthelot for outlier detection
• There is no general recommendation for the choice of the parameters U and C
• The choice is based on examination of scatter plots, the influence on the target estimate and the number of identified outliers
• Available resources and the purpose of the survey (quality requirements) play a significant role
• Subject matter knowledge is crucial in decision making
• Once determined, the values of the parameters U and C can be reused (or minimally changed) in the next survey iteration, since the variables forming the ratio behave very similarly across survey iterations
Outliers in Labour Cost Survey
LCS 2014

Ratio | Parameter U | Parameter C | Number of identified outliers
Total gross salary by total paid hours | 0.25 | 3.7 | 57
Total gross salary by total hours actually worked | 0.29 | 4.0 | 52
Total gross salary by total number of employees | 0.29 | 4.5 | 51
Total hours actually worked by total number of employees | 0.38 | 6.5 | 39
Total hours paid but not worked by total number of employees | 0.30 | 6.0 | 108
Total paid hours by total number of employees | 0.17 | 11.5 | 88
Total labour cost by total paid hours | 0.55 | 4.0 | 97
Total labour cost by total hours actually worked | 0.55 | 4.0 | 96
Total labour cost by total number of employees | 0.50 | 3.4 | 142
Treatment of outliers
• Outliers should always be analyzed interactively (manual review of the printed or electronic questionnaires, contacts with the reporting unit, use of auxiliary information)
• The responsible subject matter methodologists decide how to treat the outliers
• If an outlier is confirmed to be an error, then it should be corrected or eliminated (either by recontacting the enterprise by telephone, or by imputation)
Implementation of the method
• The method is implemented in the R software environment for statistical computing (R Core Team (2016))
• Since the method is mathematically simple, it is implemented using only the built-in facilities of the standard libraries
• The inputs to the function are the variables of the survey dataset that form the ratio and the parameters U and C
• The output is a dataset containing the units with observations identified as outlying, and scatter plots with the outlying values marked (possibly together with values that are influential because of a large sampling weight)
Implementation of the method
• The core function can be repeated automatically for the desired variables forming the ratios, and also by domain of interest for the survey
• The output dataset is organized as a table whose rows contain the unit records that have at least one identified outlier among the analyzed ratios
• This means that all outliers for one specific unit are treated at once
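The layout of the output table, one row per unit with at least one flagged ratio, can be sketched as follows. The authors' implementation is in R; this Python sketch only illustrates the table layout, and the function name `outlier_table`, the unit identifiers and the ratio names are hypothetical.

```python
def outlier_table(unit_ids, flags_by_ratio):
    """Build one row per unit that is an outlier for at least one ratio.

    unit_ids       : list of unit identifiers
    flags_by_ratio : dict mapping a ratio name to a list of booleans,
                     aligned with unit_ids (True = identified as outlier)
    """
    rows = []
    for i, uid in enumerate(unit_ids):
        row = {name: flags[i] for name, flags in flags_by_ratio.items()}
        if any(row.values()):
            rows.append({"unit": uid, **row})
    return rows


# Hypothetical flags for three units and two ratios.
table = outlier_table(
    ["A", "B", "C"],
    {
        "gross_salary_per_paid_hour": [True, False, False],
        "paid_hours_per_employee": [True, False, True],
    },
)
for row in table:
    print(row)
```

Keeping all ratio flags of a unit in one row lets the reviewer see every suspicious ratio for that unit at once before contacting the reporting unit.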
Conclusion and future plans
• Before considering the possible treatment of identified outliers, we should try to understand why they occurred and whether it is likely that similar values will continue to appear
• Outliers are to be detected (together with influential errors) at the first stages of the editing and imputation process, and their treatment has to be particularly accurate
• Outliers should always be analyzed interactively
Conclusion and future plans
• RSIS applies the “Ratio method” of Hidiroglou and Berthelot in the context of long-term surveys
• The method was originally designed for detection of outliers in periodic surveys
• The plan is to extend its application to short-term surveys, for which the original idea is more appropriate
• Another idea is to use external data for forming the ratios