The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the American Community Survey
Adela Luque (U.S. Census Bureau)
Brittany Bond (U.S. Department of Commerce)
J. David Brown & Amy O’Hara (U.S. Census Bureau)
FedCASIC March 2014
Disclaimer
Any opinions and conclusions expressed herein are those of the authors and do not necessarily reflect the views of the U.S. Census Bureau
All results have been reviewed to ensure that no confidential information on individual persons is disclosed
Overview
Motivation
Objectives
Data & Methodology
Background on Anonymous Identifier Assignment Process
Expected Effects
Results
Conclusions
Motivation
Record linkage can enrich data, improve its quality & lead to research not otherwise possible - while reducing respondent burden & operational costs
Linking data requires common identifiers unique to each record that protect confidentiality
Census Bureau assigns Protected Identification Keys (PIKs) via a probabilistic matching algorithm: PVS (Personal Identification Validation System)
Not possible to reliably assign a PIK to every record, which may introduce bias in data analysis
Objectives
What characteristics are associated with the probability of receiving a PIK? That is, what is the nature of the bias introduced by incomplete PIK assignment?
Help researchers understand the nature of the bias, interpret results more accurately, and adjust/reweight linked analytical datasets
Examine bias using regression analysis - before & after changes in PVS. Do alterations to PVS improve PIK assignment rates as well as reduce bias?
NORC (2011) described some demographic and socio-economic characteristics of those records not getting a PIK
Data & Methodology
2009 & 2010 American Community Survey (ACS) – processed through PVS
Ongoing representative survey of the U.S. population
Socioeconomic, demographic & housing characteristics
50 states & DC
Annual sample of approximately 4.5 million person records
Probit model for 2009 and 2010 separately
Dependent variable = 1 if person record received a PIK (0 otherwise)
Covariates:
Demographic characteristics: age, sex, race and Hispanic origin
Socio-economic characteristics: employment status, income, poverty status, marital status, level of education, public program participation, health insurance status, citizenship status, English proficiency, military status, mobility status, and household type
Housing and address-related characteristics: urban vs. rural, type of living quarter, age of living quarter
ACS replicate weights
Report marginal effects
2009 & 2010 results compared – before & after changes to PVS
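For a binary covariate, one common way to express a probit marginal effect is the change in predicted probability when the indicator switches from 0 to 1 with the other covariates held fixed. The sketch below illustrates that calculation using only the Python standard library. The function name, the baseline index, and the coefficient are hypothetical values chosen for illustration; they are not estimates from this study, and the slides do not state which marginal-effect convention was used.

```python
from statistics import NormalDist


def probit_marginal_effect(baseline_index, coef):
    """Discrete-change marginal effect of a 0/1 covariate in a probit model.

    baseline_index: linear index x'b with the indicator set to 0 (hypothetical).
    coef: probit coefficient on the indicator (hypothetical).
    """
    cdf = NormalDist().cdf  # standard normal CDF
    return cdf(baseline_index + coef) - cdf(baseline_index)


# Hypothetical values for illustration only (not results from the ACS).
effect = probit_marginal_effect(baseline_index=1.4, coef=-0.2)
print(round(effect, 4))
```

A negative value means that, under these assumed numbers, records in the indicated group are less likely to receive a PIK than records in the base group.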
Background on PVS
Probabilistic match of data from an incoming file (e.g., survey) to reference file containing data from the Social Security Administration enhanced with address data obtained from federal administrative records
If a match is found, person record receives a PIK or is “validated”
Background on PVS
Initial edit to clean & standardize linking fields (name, dob, sex & address)
Incoming data processed through cascading modules (or matching algorithms)
Only records failing a given module move on to the next
Impossible to compare all records in incoming file to all records in reference file → “blocking”
Data split into blocks/groups based on exact matches of certain fields or parts of fields – probabilistic matching within block
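The blocking idea can be sketched in a few lines: group the reference records by an exact-match key, then compare each incoming record only against the reference records in its own block. This is a minimal illustration, not the PVS implementation. The field names, the 3-digit ZIP blocking key, the 0.85 threshold, and the use of a string-similarity ratio in place of true probabilistic match weights are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    # Hypothetical blocking key: first three digits of the ZIP code.
    return record["zip"][:3]


def match_within_blocks(incoming, reference, threshold=0.85):
    """Return {incoming_id: reference_id} for the best within-block match."""
    blocks = defaultdict(list)
    for ref in reference:
        blocks[block_key(ref)].append(ref)

    matches = {}
    for rec in incoming:
        best_id, best_score = None, threshold
        # Compare only against reference records sharing the blocking key.
        for ref in blocks.get(block_key(rec), []):
            if rec["dob"] != ref["dob"]:
                continue
            # String similarity stands in for a probabilistic match score.
            score = SequenceMatcher(
                None, rec["name"].lower(), ref["name"].lower()
            ).ratio()
            if score >= best_score:
                best_id, best_score = ref["id"], score
        if best_id is not None:
            matches[rec["id"]] = best_id
    return matches


# Toy records for illustration only.
reference = [
    {"id": "R1", "name": "Maria Lopez", "dob": "1980-02-14", "zip": "20746"},
    {"id": "R2", "name": "John Smith", "dob": "1975-07-01", "zip": "10027"},
]
incoming = [
    {"id": "A", "name": "Maria Lopes", "dob": "1980-02-14", "zip": "20747"},
    {"id": "B", "name": "John Smith", "dob": "1975-07-01", "zip": "94110"},
]
result = match_within_blocks(incoming, reference)
```

In this toy data, record B is an exact name and date-of-birth match to R2 but is never compared with it, because the two fall in different blocks. That is the trade-off blocking makes, and it is why a linkage system may use several blocking strategies.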
Background on PVS
2009 PVS Modules
Verification – Only for incoming files w/ SSNs
Geosearch looks for name/dob/gender matches after blocking on an address or address part (within 3-digit ZIP area)
Namesearch looks for name & dob matches within a block based on parts of name/dob
Each module has several ‘passes’ – different blocking & matching strategies
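The cascade described above, in which only records failing a module move on to the next, can be sketched as follows. This is a minimal illustration: the lookup tables, field names, identifiers, and matching functions are hypothetical stand-ins and do not reproduce the actual PVS modules or their passes.

```python
def run_cascade(records, modules):
    """Send records through modules in order; only unmatched records move on.

    modules: list of (name, match_fn) pairs, where match_fn(record) returns an
    identifier or None. Both are hypothetical stand-ins for PVS modules.
    """
    assigned = {}
    remaining = list(records)
    for name, match_fn in modules:
        still_unmatched = []
        for rec in remaining:
            pik = match_fn(rec)
            if pik is None:
                still_unmatched.append(rec)
            else:
                assigned[rec["id"]] = (pik, name)
        remaining = still_unmatched
    return assigned, remaining


# Toy lookup tables standing in for the reference file (illustrative only).
by_ssn = {"000-00-0001": "PIK001"}
by_name_zip3 = {("ana diaz", "207"): "PIK002"}
by_name_dob = {("lee chan", "1990-05-05"): "PIK003"}

modules = [
    ("verification", lambda r: by_ssn.get(r.get("ssn"))),
    ("geosearch", lambda r: by_name_zip3.get((r["name"], r["zip"][:3]))),
    ("namesearch", lambda r: by_name_dob.get((r["name"], r["dob"]))),
]

records = [
    {"id": 1, "name": "sam roe", "dob": "1970-01-01", "zip": "10001",
     "ssn": "000-00-0001"},
    {"id": 2, "name": "ana diaz", "dob": "1985-03-03", "zip": "20746"},
    {"id": 3, "name": "lee chan", "dob": "1990-05-05", "zip": "94110"},
    {"id": 4, "name": "pat doe", "dob": "2000-09-09", "zip": "60601"},
]
assigned, unmatched = run_cascade(records, modules)
```

Each record is labeled with the module that matched it, so the contribution of each stage to the overall validation rate can be tallied afterward. Record 4 fails every module and stays unmatched.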
Background on PVS
2010 PVS Enhancements
ZIP3 Adjacency Module looks for name/dob/gender/address matches after blocking on address field parts in areas adjacent to 3-digit ZIP area
DOB Search Module looks for name/gender/dob matches after blocking on month & day of birth
Household Composition Search Module looks for name/dob matches for unmatched records that were seen in the past at the same address with a PIKed record
Inclusion of ITINs in reference file
Expected Effects
Less likely to obtain a PIK:
Insufficient or inaccurate person identifying info in incoming record
Issues w/ data collection or withholding due to language barriers, trust in govt., privacy preferences
Identifying info in incoming file & reference file more likely to differ
Address info differs/not updated
Movers, rent vs. own, certain types of housing
Record not in government reference files
Newborns, recent immigrants, very poor/unemployed/non-recipients of govt. programs
Results – Overall Validation Rates
PVS Validation Rate (weighted): 88.1% in 2009, 92.6% in 2010
Sources: 2009 & 2010 ACS
Probit Results
[Figure: Marginal Effect of Hispanic Origin on PVS Validation, 2009 and 2010. Categories: Hispanic; Non-Hispanic (base)]
Sources: 2009 & 2010 ACS
Probit Results
[Figure: Marginal Effect of Race on PVS Validation, 2009 and 2010. Categories: White alone (base); Black alone; AIAN alone; Asian alone; NHPI alone; Some Other Race]
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
[Figure: Marginal Effect of Citizenship Status on PVS Validation, 2009 and 2010. Categories: Non-U.S. Citizen; Foreign-Born U.S. Citizen; U.S. Citizen (base)]
Probit Results
[Figure: Marginal Effect of Home Spoken English Quality on PVS Validation, 2009 and 2010. Categories: Poor Spoken English; Not Poorly Spoken English (base)]
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
[Figure: Marginal Effect of Age on PVS Validation, 2009 and 2010. Categories: 0-2; 3-5; 6-9; 10-14; 15-18; 19-24; 25-34 (base); 35-44; 45-54; 55-64; 65-74; 75 and Older]
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
[Figure: Marginal Effect of Mobility Status on PVS Validation, 2009 and 2010. Categories: Non-Mover in 12 Months Before IM (base); Mover From Abroad in 12 Months Before IM; Domestic Mover in 12 Months Before IM; Moving status, missing]
Probit Results
[Figure: Marginal Effect of Rent vs. Own on PVS Validation, 2009 and 2010. Categories: Rented Housing Unit; Own Home (base)]
Probit Results
[Figure: Marginal Effect of Type of Living Quarter on PVS Validation, 2009 and 2010]
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
[Figure: Marginal Effect of Family Household Status on PVS Validation, 2009 and 2010. Categories: Non-Family Household (base); Living with Family]
Probit Results
[Figure: Marginal Effect of Rural vs Urban Area on PVS Validation, 2009 and 2010. Categories: Rural (base); Urban Area]
Probit Results
[Figure: Marginal Effect of Employment Status on PVS Validation, 2009 and 2010. Categories: Private Employment; Government Employment; Self-Employed; Family Employment; Not Employed - includes missing (base)]
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
[Figure: Marginal Effect of Health Insurance Status on PVS Validation, 2009 and 2010. Categories: Social Security Recipient; Private Health Insurance; Public Health Insurance; Both Private and Public Health Insurance; Uninsured (base)]
Probit Results
[Figure: Marginal Effect of Receiving Public Assistance on PVS Validation, 2009 and 2010. Categories: Public Assistance Recipient; Not Public Assistance Recipient (base)]
Note: Dotted bars indicate that change in marginal effect from 2009 to 2010 is not statistically significant.
Probit Results
[Figure: Marginal Effect of Poverty Status on PVS Validation, 2009 and 2010. Categories: Not In Poverty (base); In Poverty]
Conclusions
Mobile persons, those with lower income, the unemployed, those in the process of integrating into the economy/society, and non-participants in government programs are less likely to be validated
Renters, movers, mobile homes
Low income, non-employed, most minorities, non-U.S. citizens, poor English
Non-participants in govt. programs, uninsured, non-military
Researchers may wish to reweight observations based on validation propensity
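One way such a reweighting could work is an inverse-propensity adjustment: each linked record's survey weight is divided by its estimated probability of validation, so that linked records from hard-to-validate groups count for more. The sketch below is a simplified illustration under stated assumptions. It uses group-level validation rates as the propensity, whereas a model-based propensity (for example, from a probit) could be substituted, and the field names `weight`, `group`, and `has_pik` are hypothetical.

```python
def weighted_validation_rate(records):
    """Weighted share of records with a PIK, by group."""
    totals, piked = {}, {}
    for rec in records:
        g = rec["group"]
        totals[g] = totals.get(g, 0.0) + rec["weight"]
        if rec["has_pik"]:
            piked[g] = piked.get(g, 0.0) + rec["weight"]
    return {g: piked.get(g, 0.0) / totals[g] for g in totals}


def reweight_linked(records, propensity):
    """Inverse-propensity adjustment for records that received a PIK.

    propensity: dict mapping group -> estimated probability of validation.
    """
    adjusted = []
    for rec in records:
        if not rec["has_pik"]:
            continue  # unlinked records drop out of the linked analysis file
        new_weight = rec["weight"] / propensity[rec["group"]]
        adjusted.append({**rec, "weight": new_weight})
    return adjusted


# Toy data for illustration only: 3 of 4 renters and 2 of 2 owners have a PIK.
sample = (
    [{"group": "renter", "weight": 100.0, "has_pik": True}] * 3
    + [{"group": "renter", "weight": 100.0, "has_pik": False}]
    + [{"group": "owner", "weight": 100.0, "has_pik": True}] * 2
)
rates = weighted_validation_rate(sample)
linked = reweight_linked(sample, rates)
renter_total = sum(r["weight"] for r in linked if r["group"] == "renter")
```

With these toy numbers the three linked renters are reweighted so that together they again represent the full renter weight total of 400, offsetting the renter who could not be linked.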
Conclusions
Changes to PVS
Increased overall validation rate by 4.5 percentage points
Reduced validation differences across most groups from 2009 to 2010
Record linkage research can lead to higher PIK assignment rates and less bias