Page 1

Poznan 20 October 2010

Data integration: an overview on statistical methodologies and applications.

Mauro Scanu

Istat

Central Unit on User Needs, Integration and Territorial Statistics

[email protected]

Page 2

Summary

• In what sense are methods for integration “statistical”?
• Record linkage: definition, examples, methods, objectives and open problems
• Statistical matching: definition, examples, methods, objectives and open problems
• Micro integration processing: definition, examples, methods, objectives and open problems
• Other statistical integration methods?

Page 3

Methods for integration 1

Generally speaking, integration of two data sets is understood as integration at the level of the single unit: the objective is to detect the records in the different data sets that belong to the same statistical unit. This makes it possible to reconstruct a single record containing all the information collected on that unit across the different data sources.

In contrast, let’s distinguish two different objectives: micro and macro

Micro: the objective is the “development” of a complete data set

Macro: the objective is the “development” of an aggregate (for example, a contingency table)

Page 4

Methods for integration 2

Further, the methods of integration can be split into automatic and statistical methods.

The automatic methods apply a priori rules for linking the data records.

The statistical methods include a formal estimation or test procedure applied to the available data. This estimation or test procedure:

1. can be chosen according to optimality criteria,
2. is associated with an estimation error.

This talk restricts attention to the (micro and macro) statistical methods of integration

Page 5

Statistical methods

Classical inference

1) There exists a data generating model

2) The observed sample is an image of the data generating model

3) We estimate the model from the observed sample

Page 6

Statistical methods of integration

If a method of integration is used, it is necessary to include an intermediate phase.

The final data set is a blurred image of the data generating model

Page 7

Statistical methods of integration

Statistical methods for integration can be organized according to the available input

Input                                                              Output        Method
Two data sets that observe (partially) overlapping groups of units   Micro         Record linkage
Two independent samples                                               Macro/micro   Statistical matching
Sets of estimates from different surveys that are not coherent       Macro         Calibration methods, graphical methods

Page 8

Record linkage

Input: two data sets on overlapping sets of units.
Problem: lack of a unique and correct record identifier.
Alternative: sets of variables that (jointly) are able to identify units.
Attention: these variables can have “problems”!
Objective: the largest number of correct links, the lowest number of wrong links.

Page 9

Book of life

Dunn (1946)* describes record linkage in this way:

…each person in the world creates a book of life. The book starts with the birth and ends with the death. Its pages are made up of all the principal events of life. Record linkage is the name given to the process of assembling the pages of this book into one volume. The person retains the same identity throughout the book. Except for advancing age, he is the same person…

*Dunn (1946) "Record Linkage". American Journal of Public Health 36 (12): 1412–1416.

Page 10

When a unique identifier is lacking

If a record identifier is missing or cannot be used, it is necessary to use the common variables in the two files.

The problem is that these variables can be “unstable”:

1. Changes over time (age, address, educational level)

2. Errors in data entry and coding

3. Correct answers but different codification (e.g. address)

4. Missing items

Page 11

Main motivations for record linkage

According to Fellegi (1997)*, the development of tools for integration is due to the intersection of these facts:

• occasion: construction of big data bases
• tool: the computer
• need: new informative needs

*Fellegi (1997) “Record Linkage and Public Policy: A Dynamic Evolution”. In Alvey, Jamerson (eds) Record Linkage Techniques, Proceedings of an international workshop and exposition, Arlington (USA) 20-21 March 1997.

Page 12

Why record linkage? Some examples

1. To have joint information on two or more variables observed in distinct data sources

2. To “enumerate” a population

3. To substitute (parts of) surveys with archives

4. To create a “list” of a population

5. Other official statistics objectives (imputation and editing / to enhance micro data quality; to study the risk of identification of the released micro data)

Page 13

Example 1 – analysis of mortality

Problem: to analyze the “risk factors” jointly with the event “death”.

A) The risk factors are observed in ad hoc surveys (e.g. those on nutrition habits, work conditions, etc.)

B) The event “death” (some months after the survey is conducted) can be taken from administrative archives

These two sources (the survey on the risk factors and the death archive) should be “fused” so that each unit observed in the risk factor survey can be associated with a new dichotomous variable (equal to 1 if the person has died and 0 otherwise).

Page 14

Example 2 – to enumerate a population

Problem: what is the number of residents in Italy?

Often the number of residents is found in two steps, by means of a procedure known as “capture-recapture”. This method is usually applied to determine the size of animal populations.

A) Population census
B) Post enumeration survey (some months after the census) to evaluate census quality and give an accurate estimate of the population size

USA - in 1990 Post Enumeration Survey, in 2000 Accuracy and Coverage Evaluation

Italy - in 2001 “Indagine di Copertura del Censimento”

Page 15

Example 2 – to enumerate a population

The result of the comparison between the census and the post enumeration survey is a 2×2 table:

                        Obs. in PES    Not obs. in PES
  Obs. in census        n_oo           n_on
  Not obs. in census    n_no           ??

Page 16

Example 2 - to enumerate a population

In short, for each distinct unit it is necessary to determine whether it was observed

1) both in the census and in the PES

2) only in the census

3) only in the PES

These three counts allow the fourth value to be estimated (with an appropriate model).
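As a minimal illustration of the capture-recapture idea (not part of the original slides; the function and counts below are hypothetical), this sketch computes the simple dual-system (Lincoln-Petersen) estimate of the total population and of the cell missed by both sources, assuming independent captures and homogeneous capture probabilities.

```python
def dual_system_estimate(n_oo: int, n_on: int, n_no: int) -> dict:
    """Simple capture-recapture (dual-system) estimate for a census/PES 2x2 table.

    n_oo: units observed in both the census and the PES
    n_on: units observed only in the census
    n_no: units observed only in the PES
    """
    census_total = n_oo + n_on              # units counted by the census
    pes_total = n_oo + n_no                 # units counted by the PES
    # Lincoln-Petersen: N_hat = census_total * pes_total / n_oo
    n_hat = census_total * pes_total / n_oo
    missed_by_both = n_hat - (n_oo + n_on + n_no)   # estimate of the "??" cell
    return {"N_hat": n_hat, "missed_by_both": missed_by_both}

# Example with made-up counts:
print(dual_system_estimate(n_oo=900, n_on=80, n_no=50))
```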

Page 17

Example 3 – surveys and archives

Problem: is it possible to use administrative archives and sample surveys jointly?

At the micro level this means: modifying the questionnaire of a survey by dropping those questions whose answers are already available in some administrative archives (reducing the response burden)

E.g., for enterprises:

Social security archives, chambers of commerce, …

Page 18

Example 4 – Creation of a list

Problem: what is the set of the active enterprises in Italy?

In Istat, ASIA (Archivio Statistico delle Imprese Attive) is the most important example of the creation of a list of units (the enterprises active at a given time instant) by “fusing” different archives.

It is necessary to pay attention to:
• enterprises which are present in more than one archive (deduplication)
• non-active enterprises
• newly born enterprises
• transformations (that can lead to a new enterprise or to a continuation of the previous one)

Page 19

Example 5 – Imputation and editing

Problem: to enhance microdata quality

Micro Integration in the Netherlands (virtual census, social statistical data base)

This will be seen later, when dealing with micro integration processing.

Page 20

Example 6 - Privacy

Problem: does a “measure” of the degree of identification of the released microdata exist?

In order to evaluate whether a method for protection against data disclosure is good, it is possible to compare two datasets (the true one and the protected one) and detect how many modified records are “easily” linked to the true ones.

Page 21

(Slide credit: Tiziana Tuoto, FCSM 2007, Arlington, November 6 2007)

Record linkage steps

The record linkage techniques are a multidisciplinary set of methods and practices:

PRE-PROCESSING
• Conversion of upper/lower case
• Replacement of null strings
• Standardization
• Parsing
• …

SEARCH SPACE REDUCTION
• Sorted Neighbourhood Method
• Blocking
• Hierarchical Grouping
• …

COMPARISON FUNCTION CHOICE
• Edit distance
• Smith-Waterman
• Q-grams
• Jaro string comparator
• Soundex code
• TF-IDF
• …

DECISION MODEL CHOICE
• Fellegi & Sunter
• Exact
• Knowledge-based
• Mixed
• …
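As an illustration of one of the comparison functions listed above, here is a minimal sketch of the edit (Levenshtein) distance between two key-variable strings; it is a textbook implementation, not code from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))            # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                            # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# Names differing by one typo are "close", unrelated names are not:
print(edit_distance("ROSSI", "ROSI"))      # 1
print(edit_distance("ROSSI", "BIANCHI"))   # 6
```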

Page 22

Example (Fortini, 2008)*

The census is sometimes accompanied by a post enumeration survey, in order to assess the actual census coverage.

To this purpose, a “capture-recapture” approach is generally considered.

It is necessary to find out how many individuals have been observed:
• in both the census and the PES
• only in the census
• only in the PES

These figures allow the estimation of how many individuals have been observed neither in the census nor in the PES.

* In ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data “Report of WP2. Recommendations on the use of methodologies for the integration of surveys and administrative data”, 2008

Page 23

[Workflow diagram: record linkage workflow for Census - PES. Steps 1 and 2 link households (matched vs. unmatched households); steps 3.a/3.b and 4.a/4.b link people within the matched and unmatched households; step 5 collects the final sets of matched and unmatched people.]

Page 24

Problem: Lack of identifiers

The difference between step 1 and step 2 is the following:

Step 1 identifies all those households that coincide on all these variables:

• Name, surname and date of birth of the household head
• Address
• Number of male and female components

Step 2 uses the same keys, but admits the possibility that the values of the variables differ because of changes or errors.

Page 25

Probabilistic record linkage

For every pair of records from the two data sets, it is necessary to estimate:

• the probability that the differences between what is observed on the two records are due to chance, because the two records belong to the same unit

• the probability that the two records belong to different units

These probabilities are compared: the comparison is the basis for deciding whether a pair of records is a match or not.

Estimating these probabilities is the “statistical step” of the probabilistic record linkage method.

Page 26

Statistical step

Data set A with n_A units.

Data set B with n_B units.

K key variables (they jointly make an identifier).

Key variables in A: each record a = 1, …, n_A carries the vector X_a^A = (x_a1^A, x_a2^A, …, x_aK^A).

Key variables in B: each record b = 1, …, n_B carries the vector X_b^B = (x_b1^B, x_b2^B, …, x_bK^B).

Page 27

Statistical procedure

The key variables of the two records in a pair (a,b) are compared:

y_ab = f(X_a^A, X_b^B)

The function f(·) should register how much the key variables observed in the two units differ.

For instance, y can be a vector with K components, composed of 0s (inequalities) and 1s (equalities).

The final result is a data set of n_A × n_B comparisons:

  (a,b)          comparison
  (1,1)          f(X_1^A, X_1^B) = y_11
  (1,2)          f(X_1^A, X_2^B) = y_12
  …              …
  (n_A, n_B)     f(X_{n_A}^A, X_{n_B}^B) = y_{n_A n_B}
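To make the comparison step concrete, the following sketch (not from the slides; the records, field names and exact-agreement rule are illustrative assumptions) builds binary comparison vectors y_ab for all n_A × n_B pairs.

```python
from itertools import product

# Hypothetical records with K = 3 key variables (name, year of birth, municipality).
file_a = [{"name": "ROSSI", "yob": 1970, "mun": "ROMA"},
          {"name": "VERDI", "yob": 1985, "mun": "MILANO"}]
file_b = [{"name": "ROSSI", "yob": 1970, "mun": "ROMA"},
          {"name": "VERDE", "yob": 1985, "mun": "MILANO"}]
keys = ["name", "yob", "mun"]

def comparison_vector(rec_a: dict, rec_b: dict) -> tuple:
    """y_ab: 1 where the key variables agree exactly, 0 where they differ."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in keys)

# One comparison vector for each of the n_A x n_B pairs.
comparisons = {(a, b): comparison_vector(ra, rb)
               for (a, ra), (b, rb) in product(enumerate(file_a), enumerate(file_b))}

for pair, y in comparisons.items():
    print(pair, y)
```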

Page 28

Statistical procedure

The n_A × n_B pairs are split into two sets:

M: the pairs that are a match

U: the unmatched pairs

Plausibly, the comparisons y will behave as follows:
• low levels of diversity for the pairs that are a match, (a,b) ∈ M
• high levels of diversity for the pairs that are a non-match, (a,b) ∈ U

For instance: if y = (sum of the equalities over the K key variables), y tends to assume larger values for the pairs in M than for those in U.

Page 29

Statistical procedure

If y = (sum of the equalities), the distribution of y is a mixture of the distribution of y in M (right) and that in U (left).

Page 30

Statistical procedure

Inclusion of a pair (a,b) in M or U is a missing value (latent variable).

Let C denote the status of a pair (C=1 if (a,b) in M; C=0 if (a,b) in U)

The likelihood is the product over the n_A × n_B pairs of

P(Y=y, C=c) = [p m(y)]^c [(1-p) u(y)]^(1-c)

Estimation method: maximum likelihood on a partially observed data set (EM algorithm – Expectation Maximization)

Parameters:
p: fraction of matches among the n_A × n_B pairs
m(y): distribution of y in M
u(y): distribution of y in U

Data:
Y: observed
C: missing (latent)

Page 31

Statistical procedure

A pair is assigned to M or U in the following way

1) For every comparison y assign a “weight”:

t(y)=m(y)/u(y)

where m and u are estimated;

2) Assign the pairs with a large weight to M and the pairs with a small weight to U.

3) There can be a class of weights t where it is better to avoid definitive decisions (m and u are similar)
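A minimal sketch of how p, m(y) and u(y) can be estimated by EM and turned into the weight t(y) = m(y)/u(y). It assumes binary comparison vectors and conditional independence of the K components within M and within U, which is one common modelling choice rather than the only one; it is not code from the slides.

```python
import numpy as np

def em_fellegi_sunter(Y: np.ndarray, n_iter: int = 100):
    """EM for the mixture P(y) = p*m(y) + (1-p)*u(y) on binary comparison
    vectors Y (one row per pair, one column per key variable)."""
    n, K = Y.shape
    p = 0.1                          # initial fraction of matches
    m = np.full(K, 0.9)              # P(agreement | match)
    u = np.full(K, 0.1)              # P(agreement | non-match)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        pm = p * np.prod(m**Y * (1 - m)**(1 - Y), axis=1)
        pu = (1 - p) * np.prod(u**Y * (1 - u)**(1 - Y), axis=1)
        g = pm / (pm + pu)
        # M-step: update p, m, u from the expected match indicators
        p = g.mean()
        m = (g[:, None] * Y).sum(axis=0) / g.sum()
        u = ((1 - g)[:, None] * Y).sum(axis=0) / (1 - g).sum()
    return p, m, u

def weight(y: np.ndarray, m: np.ndarray, u: np.ndarray) -> float:
    """t(y) = m(y)/u(y): large values point to M, small values to U."""
    return np.prod(m**y * (1 - m)**(1 - y)) / np.prod(u**y * (1 - u)**(1 - y))

# Example (hypothetical): build Y from the comparison vectors of the previous sketch,
# Y = np.array(list(comparisons.values())), then p_hat, m_hat, u_hat = em_fellegi_sunter(Y)
```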

Page 32

Statistical procedure

The procedure is the following.

Note that, generally, probabilities of mismatching are still not considered

Page 33

Open problems

Several aspects of probabilistic record linkage still need to be investigated further. Two of them are related to record linkage quality:

a) What model should be considered
   – a1) for the relationship between the pairs (Copas and Hilton, 1990)
   – a2) for the relationship between the key variables (Thibaudeau, 1993)

b) How can probabilities of mismatching be used for a statistical analysis of a linked data file? (Scheuren and Winkler, 1993, 1997)

Copas J.R., Hilton F.J. (1990). “Record linkage: statistical models for matching computer records”. Journal of the Royal Statistical Society, Series A, 153, 287-320.

Thibaudeau Y. (1993). “The discrimination power of dependency structures in record linkage”. Survey Methodology, 19, 31-38.

Scheuren F., Winkler W.E. (1993). “Regression analysis of data files that are computer matched”. Survey Methodology, 19, 39-58

Scheuren F., Winkler W.E. (1997). “Regression analysis of data files that are computer matched - part II”. Survey Methodology, 23, 157-165.

Page 34

Statistical matching

What kind of integration should be considered if the analysis involves two variables observed in two independent sample surveys?

• Let A and B be two samples of size n_A and n_B respectively, drawn from the same population.
• Some variables X are observed in both samples.
• Variables Y are observed only in A.
• Variables Z are observed only in B.

Statistical matching aims at determining information on (X;Y;Z), or at least on the pairs of variables which are not observed jointly (Y;Z)

Page 35

Statistical matching

It is very unlikely that the two samples contain the same units, hence record linkage is useless.

Page 36

Some statistical matching applications 1

The objective of the integration of the Time Use Survey (TUS) and of the Labour Force Survey (LFS) is to create, at a micro level, a synthetic file of both surveys that allows the study of the relationships between variables measured in each specific survey.

By using together the data on the specific variables of both surveys, one would be able to analyse the characteristics of employment and the time balances at the same time:

• information on the labour force units and on the organisation of their life times will help enhance the analyses of the labour market;

• the analyses of the working condition characteristics that result from the labour force survey will complement the TUS’s more general analysis of the quality of life.

Page 37

Some statistical matching applications 1

The possibilities for a reciprocal enrichment have been largely recognised (see the 17th International Conference of Labour Statistics in 2003 and the 2003 and 2004 works of the Paris group). The emphasis was indeed put on how the integration of the two surveys could contribute to analysing the different participation modalities in the labour market determined by hour and contract flexibility.

Among the issues raised by researchers on time use, we list the following two:

• the usefulness and limitations involved in using and combining various sources, such as labour force and time-use surveys, for improving data quality;

• time-use surveys are useful, especially for measuring hours worked by workers in the informal economy, in home-based work, and by the hidden or undeclared workforce, as well as to measure absence from work.

Page 38

Some statistical matching applications 1

Specific variables in the TUS (Y): these enable the estimation of the time dedicated to daily work and the study of its level of "fragmentation" (number of intervals/interruptions), its flexibility (exact start and end of working hours) and its intra-relations with the other life times.

Specific variables in the LFS (Z): the vastness of the information gathered allows us to examine the peculiar aspects of the Italian participation in the labour market: professional condition, economic activity sector, type of working hours, job duration, profession carried out, etc. Moreover, it is also possible to investigate dimensions relative to the quality of the job.

Page 39

Some statistical matching applications 2

The Social Policy Simulation Database and Model (SPSD/M) is a micro computer-based product designed to assist those interested in analyzing the financial interactions of governments and individuals in Canada (see http://www.statcan.ca/english/spsd/spsdm.htm).

It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system.

The SPSD is a non-confidential, statistically representative database of individuals in their family context, with enough information on each individual to compute taxes paid to and cash transfers received from government.

Page 40

Some statistical matching applications 2

The SPSM is a static accounting model which processes each individual and family on the SPSD, calculates taxes and transfers using legislated or proposed programs and algorithms, and reports on the results.

It gives the user a high degree of control over the inputs and outputs to the model and can allow the user to modify existing tax/transfer programs or test proposals for entirely new programs. The model can be run using a visual interface and it comes with full documentation.

Page 41

Some statistical matching applications 2

In order to apply the algorithms for microsimulation of tax–transfer benefit policies, it is necessary to have a data set representative of the Canadian population. This data set should contain information on structural (age, sex, ...), economic (income, house ownership, car ownership, ...), health-related (permanent illnesses, child care, ...) and social (elder assistance, cultural–educational benefits, ...) variables, among others.

• No single data set exists that contains all the variables that can influence the fiscal policy of a state

• In Canada 4 samples are integrated (Survey of Consumer Finances, tax return data, unemployment insurance claim histories, Family Expenditure Survey)

• Common variables: some socio-demographic variables

• Interest is in the relation between the distinct variables in the different samples

Page 42

Example (Coli et al, 2006*)

The European System of Accounts (ESA95) is a detailed source of information on all the economic agents, such as households and enterprises. The social accounting matrix (SAM) plays a relevant role.

Module on households: it includes the amounts of expenditure and income, per household typology.

Coli A., Tartamella F., Sacco G., Faiella I., D’Orazio M., Di Zio M., Scanu M., Siciliani I., Colombini S., Masi A. (2006). “La costruzione di un Archivio di microdati sulle famiglie italiane ottenuto integrando l’indagine ISTAT sui consumi delle famiglie italiane e l’Indagine Banca d’Italia sui bilanci delle famiglie italiane”, Documenti ISTAT, n.12/2006.

Page 43

Example

Problem:

1) Incomes are observed in a Bank of Italy survey

2) Expenditures are observed in an Istat survey

3) The two samples are composed of different households, hence record linkage is useless

Page 44

Adopted solutions 1

The first statistical matching solution was imputation of missing data. Usually, “distance hot deck” was used.

In practice, this method “mimics” record linkage: instead of matching records of the same unit, this approach “matches” records of similar units, where similarity is in terms of the common variables in the two files.

The procedure is:

1) Compute the distances between the matching variables for every pair of records

2) Associate every record in A with the record in B at minimum distance
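A minimal sketch of distance hot deck following these two steps (not from the slides; the data, the distance choice and the variable names are hypothetical): each record in the recipient file A receives the Z value of the closest donor in B, with distance computed on the common variables X.

```python
import numpy as np

def distance_hot_deck(x_a: np.ndarray, x_b: np.ndarray, z_b: np.ndarray) -> np.ndarray:
    """Nearest-neighbour distance hot deck.

    x_a: common variables X in the recipient file A (n_A x q)
    x_b: common variables X in the donor file B (n_B x q)
    z_b: variable Z observed only in B (n_B,)
    Returns the Z values imputed to the records of A.
    """
    imputed = np.empty(len(x_a))
    for i, xa in enumerate(x_a):
        dist = np.abs(x_b - xa).sum(axis=1)   # Manhattan distance on X
        imputed[i] = z_b[dist.argmin()]       # take Z from the closest donor
    return imputed

# Hypothetical example: X = house surface (m^2), Z = monthly expenditure
x_a = np.array([[55.0], [120.0]])
x_b = np.array([[60.0], [90.0], [130.0]])
z_b = np.array([1100.0, 1600.0, 2300.0])
print(distance_hot_deck(x_a, x_b, z_b))       # [1100. 2300.]
```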

Page 45

Adopted solutions 1

The inferential path is the following

Page 46

Adopted solutions 2

An estimation procedure is applied under specific models that account for the presence of missing items. The easiest model is conditional independence of the never jointly observed variables (e.g., income and expenditures) given the matching variables.

Example:

Y = income, Z = expenditures, X = house surface

(X, Y, Z) is distributed as a multivariate normal with parameters:

Mean vector μ = (μ_X, μ_Y, μ_Z)

Variance matrix Σ = [[σ_XX, σ_XY, σ_XZ], [σ_XY, σ_YY, σ_YZ], [σ_XZ, σ_YZ, σ_ZZ]]

Page 47

Adopted solutions 2

1) Estimate the regression equation of Y on X in A: Y = α_Y + β_Y X

2) Impute Y in B: Ŷ_b = α̂_Y + β̂_Y X_b , b = 1, …, n_B

3) Estimate the regression equation of Z on X in B: Z = α_Z + β_Z X

4) Impute Z in A: Ẑ_a = α̂_Z + β̂_Z X_a , a = 1, …, n_A
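A minimal sketch of this four-step regression imputation under the conditional independence assumption; the data are hypothetical and ordinary least squares is used for the two regressions.

```python
import numpy as np

def ols_fit(x: np.ndarray, y: np.ndarray) -> tuple:
    """Return (alpha, beta) of the least-squares line y = alpha + beta*x."""
    beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta

# Hypothetical samples: A observes (X, Y), B observes (X, Z); X = house surface
x_a = np.array([50.0, 80.0, 100.0, 150.0]); y_a = np.array([18.0, 25.0, 30.0, 45.0])  # income in A
x_b = np.array([60.0, 90.0, 120.0]);        z_b = np.array([14.0, 19.0, 26.0])        # expenditure in B

# 1) regress Y on X in A, 2) impute Y in B
alpha_y, beta_y = ols_fit(x_a, y_a)
y_b_imputed = alpha_y + beta_y * x_b

# 3) regress Z on X in B, 4) impute Z in A
alpha_z, beta_z = ols_fit(x_b, z_b)
z_a_imputed = alpha_z + beta_z * x_a

print(y_b_imputed, z_a_imputed)
```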

Page 48

Adopted solutions 2

The inferential mechanism assumes that Y and Z are independent given X (the regression coefficient of Z on Y given X is absent).

Page 49

Adopted solutions 2

This method can also be applied with this inferential scheme: the problem is what hypotheses are made before the analysis phase.

Page 50

Adopted solutions 3

We do not hypothesize any model. A set of values is estimated, one for every model that is plausible given the observed data.

Example

When matching two sample surveys on farms (Rica-Rea - FADN and SPA - FSS), the following contingency table for farms was requested:

Y = presence of cattle (FSS)

Z = class of intermediate consumption (from FADN)

Using the common variables

X1 = Utilized Agricultural Area (UAA) ,

X2 = Livestock Size Unit (LSU)

X3 = geographical characteristics

Page 51

Example

We consider all the models that we can estimate from the observed data in the two surveys.

In practice, the available data allow us to say that the estimated fraction of farms with at least one cow (Y=1) in the lowest class of intermediate consumption (Z=1) is between 2.9% and 4.9%.
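One way such an interval can be obtained is via conditional Fréchet bounds for P(Y=1, Z=1) given the common variables X, which bracket every model compatible with the two surveys. The sketch below is not from the slides and uses hypothetical probabilities.

```python
def frechet_bounds(p_x: dict, p_y1_given_x: dict, p_z1_given_x: dict) -> tuple:
    """Bounds on P(Y=1, Z=1) when Y|X comes from one survey and Z|X from the other.

    For each x: max(0, P(Y=1|x) + P(Z=1|x) - 1) <= P(Y=1, Z=1 | x) <= min(P(Y=1|x), P(Z=1|x)).
    Averaging over P(X=x) gives bounds on the joint probability.
    """
    lower = sum(p_x[x] * max(0.0, p_y1_given_x[x] + p_z1_given_x[x] - 1.0) for x in p_x)
    upper = sum(p_x[x] * min(p_y1_given_x[x], p_z1_given_x[x]) for x in p_x)
    return lower, upper

# Hypothetical common variable X = livestock size class (low / high)
p_x = {"low": 0.7, "high": 0.3}
p_y1_given_x = {"low": 0.10, "high": 0.80}   # presence of cattle (from FSS)
p_z1_given_x = {"low": 0.40, "high": 0.05}   # lowest intermediate-consumption class (from FADN)
print(frechet_bounds(p_x, p_y1_given_x, p_z1_given_x))   # (0.0, 0.085)
```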

Page 52

Inferential machine

The inferential machine does not use any specific model

It is possible to simulate data including uncertainty on the data generation model (e.g. by multiple imputation)

Page 53

Quotation (Manski, 1995*)

…”The pressure to produce answers, without qualifications, seems particularly intense in the environs of Washington, D.C. A perhaps apocryphal, but quite believable, story circulates about an economist’s attempt to describe his uncertainty about a forecast to President Lyndon Johnson. The economist presented his forecast as a likely range of values for the quantity under discussion. Johnson is said to have replied, “Ranges are for cattle. Give me a number”

*Manski, C. F. (1995) Identification problems in the Social Sciences, Harvard University Press.

Manski and other authors show that in a wide range of applied areas (econometrics, sociology, psychometrics) there is a problem of identifiability of the models of interest, usually caused by the presence of missing data. The statistical matching problem is an example of this.

Page 54

Why statistical matching?

Applications in Istat

SAM

Joint analysis FADN / FSS

Joint use of Time Use / Labour force

Objectives

Estimates of parameters of variables not jointly observed

Creation of synthetic data (e.g. data set for microsimulation)

Page 55

Open problems

1) Uncertainty estimation (D’Orazio et al, 2006)
2) Variability of uncertainty (Imbens and Manski, 2004)
3) Use of samples drawn according to complex survey designs (Rubin, 1986; Renssen, 1998)
4) Use of nonparametric methods (Marella et al, 2008; Conti et al, 2008)

Conti P.L., Marella D., Scanu M. (2008). “Evaluation of matching noise for imputation techniques based on the local linear regression estimator”. Computational Statistics and Data Analysis, 53, 354-365.

D’Orazio M., Di Zio M., Scanu M. (2006). “Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints”, Journal of Official Statistics, 22, 137-157.

Imbens, G.W, Manski, C. F. (2004). "Confidence intervals for partially identified parameters". Econometrica, Vol. 72, No. 6 (November, 2004), 1845–1857 

Marella D., Scanu M., Conti P.L. (2008). “On the matching noise of some nonparametric imputation procedures”, Statistics and Probability Letters, 78, 1593-1600.

Renssen, R.H. (1998) Use of statistical matching techniques in calibration estimation. Survey Methodology 24, 171–183.

Rubin, D.B. (1986) Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4, 87–94.

Page 56

Micro integration processing

It can be applied every time a complete data set (micro level) is produced by any kind of method. Up to now, it has been applied after exact record linkage.

Micro integration processing consists of putting in place all the necessary actions aimed at ensuring better quality of the matched results, such as the quality and timeliness of the matched files. It includes:

• defining checks,
• editing procedures to get better estimates,
• imputation procedures to get better estimates.

Page 57

Micro integration processing

It should be kept in mind that some sources are more reliable than others.

Some sources have a better coverage than others, and there may even be conflicting information between sources.

So, it is important to recognize the strong and weak points of all the data sources used.

Page 58

Micro integration processing

Since there are differences between sources, a micro integration process is needed to check data and adjust incorrect data. It is believed that integrated data will provide far more reliable results, because they are based on an optimal amount of information. Also the coverage of (sub) populations will be better, because when data are missing in one source, another source can be used. Another advantage of integration is that users of statistical information will get one figure on each social phenomenon, instead of a confusing number of different figures depending on which source has been used.

Page 59

Micro integration processing

During the micro integration of the data sources the following steps have to be taken (Van der Laan, 2000):

a. harmonisation of statistical units;

b. harmonisation of reference periods;

c. completion of populations (coverage);

d. harmonisation of variables, in case of differences in definition;

e. harmonisation of classifications;

f. adjustment for measurement errors, when corresponding variables still do not have the same value after harmonisation for differences in definitions;

g. imputations in the case of item nonresponse;

h. derivation of (new) variables; creation of variables out of different data sources;

i. checks for overall consistency.

All steps are controlled by a set of integration rules and fully automated.

Page 60

Example: Micro integration processing

From Schulte Nordholt, Linder (2007), Statistical Journal of the IAOS, 24, 163–171

Suppose that someone becomes unemployed at the end of November and gets unemployment benefits from the beginning of December. The jobs register may indicate that this person has lost the job at the end of the year, perhaps due to administrative delay or because of payments after job termination. The registration of benefits is believed to be more accurate. When confronting these facts the ’integrator’ could decide to change the date of termination of the job to the end of November, because it is unlikely that the person simultaneously had a job and benefits in December. Such decisions are made with the utmost care. As soon as there are convincing counter indications of other jobs register variables, indicating that the job was still there in December, the termination date will, in general, not be adjusted.
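As a hedged illustration of how such a decision could be automated as an integration rule (this is not code from the cited paper; the record layout and the decision logic are assumptions), consider:

```python
from datetime import date, timedelta

def adjust_job_end(job_end: date, benefit_start: date,
                   counter_indications: bool = False) -> date:
    """Micro integration rule (sketch): the benefits register is considered more
    reliable than the jobs register. If benefits start while the job is still
    recorded as ongoing, and there is no convincing counter indication from other
    jobs-register variables, move the job end date to the day before the benefits
    start; otherwise keep the registered end date."""
    if benefit_start <= job_end and not counter_indications:
        return benefit_start - timedelta(days=1)
    return job_end

# Job recorded until 31 December, unemployment benefits from 1 December:
print(adjust_job_end(date(2006, 12, 31), date(2006, 12, 1)))   # 2006-11-30
```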

Page 61

Example: Micro integration processing

Method: definition of rules for the creation of a usable complete data set after the linkage process.

If these approaches are not applied, the integrated data set can contain conflicting information at the micro level.

These approaches are still strictly based on knowledge of the quality of the data sets.

Proposal for a possible next ESSnet on integration: study the links between imputation and editing activities and …

Page 62

Other supporting slides

Page 63

Macro integration: coherence of estimates

Sometimes it is useful to integrate aggregate data, where aggregates are computed from different sample surveys.

For instance: to include a set of tables in an information system

A problem is the coherence of information in different tables.

The adopted solution is at the estimation level: for instance, with calibration procedures (e.g. the Virtual Census in the Netherlands).
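As an illustration of this kind of macro-level adjustment, here is a minimal sketch (not from the slides) of iterative proportional fitting, which rescales a table estimated from one survey so that its margins agree with margins taken from a more reliable source; the numbers are hypothetical.

```python
import numpy as np

def ipf(table: np.ndarray, row_margins: np.ndarray, col_margins: np.ndarray,
        n_iter: int = 100) -> np.ndarray:
    """Iterative proportional fitting: adjust `table` so that its row and column
    totals match the target margins while preserving its interaction structure.
    The two sets of margins are assumed to have the same grand total."""
    t = table.astype(float).copy()
    for _ in range(n_iter):
        t *= (row_margins / t.sum(axis=1))[:, None]   # match row totals
        t *= (col_margins / t.sum(axis=0))[None, :]   # match column totals
    return t

# Table estimated from survey 1; margins known from survey 2 / a register
table = np.array([[30.0, 20.0], [10.0, 40.0]])
row_margins = np.array([60.0, 40.0])
col_margins = np.array([45.0, 55.0])
adjusted = ipf(table, row_margins, col_margins)
print(adjusted, adjusted.sum(axis=1), adjusted.sum(axis=0))
```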

Page 64

Project

The objective of the project is to bring together the developments in two distinct areas:

Probabilistic expert systems: these are graphical models, characterized by an easy mechanism for updating the joint distribution of a set of variables once one of them is updated. These models have been used for a class of estimators that includes poststratification estimators.

Statistical information systems: an SIS for the production of statistical output (Istar), with the objective of integrating and managing statistical data provided and validated by the Istat production areas, in order to produce purposeful output for the end users.

Page 65

Objectives and open problems

Objectives

To develop a statistical information system for agriculture data, managing tables from FADN, FSS, and the lists used for sampling (containing census and archive data)

To manage coherence between different tables

To update information on data from the most recent survey and to visualize what changes happen to the other tables

To allow simulations (for policy making)

Problems

Use of graphical models for complex survey data

To link the selection of tables to the updating algorithm

To update more than one table at the same time

Page 66

Some practical aspects for integration: Software

There exist different software tools for record linkage and statistical matching:

Relais: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/

R package for statistical matching:

http://cran.r-project.org/index.html

Look for the StatMatch package

Probabilistic expert systems: Hugin (it does not work with complex survey data)

Page 67

Bibliography

Batini C, Scannapieco M (2006) Data Quality, Springer Verlag, Heidelberg.

Scanu M (2003) Metodi statistici per il record linkage, collana Metodi e Norme n. 16, Istat.

D’Orazio M., Di Zio M., Scanu M. (2006) Statistical matching: theory and practice, J. Wiley & Sons, Chichester.

Ballin M., De Francisci S., Scanu M., Tininini L., Vicard P. (2009) Integrated statistical systems: an approach to preserve coherence between a set of surveys based on the use of probabilistic expert systems, NTTS 2009, Bruxelles.

Page 68

Is this conditional independence?

Page 69

And this?

Page 70

Statistical methods of integration

Sometimes a “shorter track” is used.

Note! The “automatic methods” correspond to specific data generating models

Page 71

Statistical methods of integration

Page 72

Statistical methods of integration

The last approach is very appealing:

1) Estimate a data generating model from the two data samples at hand

2) Use this estimate for the estimation of aggregate data (e.g. contingency tables on variables not jointly observed)

3) If necessary, develop a complete data set by simulation from the estimated model: the integrated data generating mechanism is the “nearest” to the data generating model, according to the optimality properties of the model estimator

Attention! Point 1 includes hypotheses that cannot be tested on the available data (this is true for record linkage and, more “dramatically”, for statistical matching)