Managing data for social survey research: key issues and concerns

Paul Lambert, Dept. Applied Social Science, Univ. Stirling

27th January 2009

Presented to the workshop ‘The significance of data management for social survey research’, University of Essex, a workshop organised by the Economic and Social Data Service (www.esds.ac.uk) and the ‘Data Management through e-Social Science’ research Node of the National Centre for e-Social Science (www.dames.org.uk).

‘Data Management through e-Social Science’

DAMES – www.dames.org.uk

ESRC Node funded 2008-2011

Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources

Aim: To exploit/engage with existing DM resources
• In social science – e.g. CESSDA
• In e-Science – e.g. OGSA-DAI; OMII

‘Data management’ means…

‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]

Usually performed by social scientists themselves

Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables

Usually a substantial component of the work process

Here we differentiate from archiving / controlling data itself

Some components…

Manipulating data
• Recoding categories / ‘operationalising’ variables

Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)

Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions

Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’

Cleaning data
• ‘missing values’; implausible responses; extreme values
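
As a minimal sketch of the ‘cleaning data’ component, the Python fragment below flags missing-value codes and implausible or extreme responses rather than silently dropping them. The records, the variable names (`age`, `income`), the missing codes and the plausibility ranges are all hypothetical, chosen only for illustration.

```python
# Hypothetical convention: negative codes stand for missing values.
MISSING_CODES = {-9, -7}

# Hypothetical survey records.
records = [
    {"pid": 1, "age": 34, "income": 21000},
    {"pid": 2, "age": -9, "income": 18500},
    {"pid": 3, "age": 212, "income": 30500},   # implausible age
    {"pid": 4, "age": 51, "income": -7},
    {"pid": 5, "age": 45, "income": 9500000},  # extreme income
]

# Hypothetical plausibility range (low, high) for each variable.
BOUNDS = {"age": (0, 120), "income": (0, 1000000)}


def clean(record):
    """Return a cleaned copy of the record and a list of the problems found."""
    cleaned, problems = dict(record), []
    for var, (low, high) in BOUNDS.items():
        value = record[var]
        if value in MISSING_CODES:
            cleaned[var] = None
            problems.append(f"{var}: missing code {value}")
        elif not low <= value <= high:
            cleaned[var] = None
            problems.append(f"{var}: out of range {value}")
    return cleaned, problems


results = [clean(r) for r in records]
for cleaned, problems in results:
    print(cleaned["pid"], cleaned["age"], cleaned["income"], problems)
```

Keeping the list of problems alongside each cleaned record leaves a record of what was changed and why.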

Example – recoding data

Count: Highest educational qualification by educ4

educ4 categories: -9 = -9.00; 1 = 1.00 Degree; 2 = 2.00 Diploma; 3 = 3.00 Higher school or vocational; 4 = 4.00 School level or below

Highest educational qualification      -9      1      2      3      4   Total
-9 Missing or wild                    323      0      0      0      0     323
-7 Proxy respondent                   982      0      0      0      0     982
1 Higher Degree                         0    425      0      0      0     425
2 First Degree                          0   1597      0      0      0    1597
3 Teaching QF                           0      0    340      0      0     340
4 Other Higher QF                       0      0   3434      0      0    3434
5 Nursing QF                            0      0    161      0      0     161
6 GCE A Levels                          0      0      0   1811      0    1811
7 GCE O Levels or Equiv                 0      0      0      0   2518    2518
8 Commercial QF, No OLevels             0      0      0    331      0     331
9 CSE Grade 2-5, Scot Grade 4-5         0      0      0      0    421     421
10 Apprenticeship                       0      0      0    257      0     257
11 Other QF                           102      0      0      0      0     102
12 No QF                                0      0      0      0   2787    2787
13 Still At School No QF              138      0      0      0      0     138
Total                                1545   2022   3935   2399   5726   15627
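
The recode behind this cross-tabulation can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the syntax used for the slide: it assumes the count rows of the extracted table follow the same order as the category labels, and the names `QUALIFICATIONS`, `RECODE` and `educ4` are introduced here for the example.

```python
from collections import Counter

# Source category code -> (label, count), read from the cross-tabulation
# (assuming rows and labels appear in the same order).
QUALIFICATIONS = {
    -9: ("Missing or wild", 323),
    -7: ("Proxy respondent", 982),
    1: ("Higher Degree", 425),
    2: ("First Degree", 1597),
    3: ("Teaching QF", 340),
    4: ("Other Higher QF", 3434),
    5: ("Nursing QF", 161),
    6: ("GCE A Levels", 1811),
    7: ("GCE O Levels or Equiv", 2518),
    8: ("Commercial QF, No OLevels", 331),
    9: ("CSE Grade 2-5, Scot Grade 4-5", 421),
    10: ("Apprenticeship", 257),
    11: ("Other QF", 102),
    12: ("No QF", 2787),
    13: ("Still At School No QF", 138),
}

# educ4 value -> source codes that are recoded into it.
RECODE = {
    1: (1, 2),       # Degree
    2: (3, 4, 5),    # Diploma
    3: (6, 8, 10),   # Higher school or vocational
    4: (7, 9, 12),   # School level or below
}
EDUC4_MISSING = -9   # every other source code (-9, -7, 11, 13)


def educ4(code):
    """Recode a highest-qualification code into the four-category measure."""
    for target, sources in RECODE.items():
        if code in sources:
            return target
    return EDUC4_MISSING


# Check the recode against the column totals of the cross-tabulation.
totals = Counter()
for code, (_label, count) in QUALIFICATIONS.items():
    totals[educ4(code)] += count

print(dict(totals))
```

Writing the recode as an explicit lookup keeps the rule in one place, so it can be annotated, reused and checked against the published table.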

Example – Linking data

Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
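
A rough sketch of this kind of linkage, in Python: each survey record gains one variable derived from the data itself and one looked up from an external table by occupation code. Only the key name `ojbsoc00` comes from the slide. The occupation codes, the scores (placeholders, not real CAMSIS values) and the names `has_occupation` and `score` are hypothetical.

```python
# Hypothetical survey records keyed by the occupation code 'ojbsoc00'.
survey = [
    {"pid": 101, "ojbsoc00": 1111},
    {"pid": 102, "ojbsoc00": 2222},
    {"pid": 103, "ojbsoc00": 1111},
    {"pid": 104, "ojbsoc00": -9},   # occupation not recorded
]

# Hypothetical external lookup table: occupation code -> scale score.
# Placeholder numbers only, not real CAMSIS scores.
occupation_scores = {1111: 60.0, 2222: 45.5}

linked = []
for row in survey:
    new_row = dict(row)
    # Derived from the data itself (like column c6 on the slide).
    new_row["has_occupation"] = row["ojbsoc00"] > 0
    # Derived from the external resource (like column c7); None if no match.
    new_row["score"] = occupation_scores.get(row["ojbsoc00"])
    linked.append(new_row)

# Report the cases that found no match, so they are not lost silently.
unmatched = [r["pid"] for r in linked if r["score"] is None]
print(unmatched)
```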

‘The significance of data management for social survey research’

The data manipulations described above are a major component of the social survey research workload

Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories
• Dealing with missing records

Post-release manipulations performed by researchers
• Re-coding measures into simple categories

We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently

So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

Some provocative examples for the UK…

Social mobility is increasing, not decreasing
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables

Degrees, MScs and PhDs are getting easier
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread

‘Black-Caribbeans’ are not disappearing
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures

Our own motivation (in DAMES)

1. DM is a big part of the research process
..but receives limited methodological attention

2. Poor practice in soc. sci. DM is easily observed
• Not keeping adequate records
• Not linking relevant data
• Not trying out relevant variable operationalisations

3. Even though..
• There are plenty of existing resources and standards relevant to data management activities
• There are suitable software and internet facilities (Scott Long 2009)
• People are working on DM support (e.g. ESDS, DAMES)

A bit of focus…

Most of the DAMES applications aim to facilitate one of two data management activities:

1) Variable constructions
o Coding and re-coding values

2) Linking datasets
o Internal and external linkages

The relevance of e-Science

‘Data management through e-Social Science’

‘E-Science’ refers to adopting a number of particular approaches and standards from computing science, to applied research areas

These approaches include ‘the Grid’; distributed computing; data and computing standardisation; metadata; security; research infrastructures

DAMES (2008-11) – developing services / resources using e-Science approaches which will help social scientists in undertaking data management tasks

National Centre for e-Social Science, www.ncess.ac.uk

Major UK investment into UK oriented e-social science projects, typically:

Handling and displaying large volumes of complex data
• E.g. GeoVue; DReSS; Obesity e-Lab

Resources for computationally demanding analytical tasks
• CQeSS; MoSeS

Standards setting in preparing / supporting data and research

E-Science and Data Management

E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…

1) Concern with standards setting in communication and enhancement of data

2) Linking distributed/heterogeneous/dynamic data
o Coordinating disparate resources; interrogating live resources

3) Contribution of metadata tools/standards for variable harmonisation and standardisation

4) Linking data subject to different security levels

5) The workflow nature of many DM tasks

E.g. of GEODE: Organising and distributing specialist data resources (on occupations)

The contribution of DAMES

8 project themes

1.1) Grid Enabled Specialist Data Environments (‘GE*DE’)

1.2) Data resources for micro-simulation on social care data

1.3) Linking e-Health and social science databases

1.4) Training and interfaces for management of complex survey data

2.1) Description, discovery & service use through metadata and data abstraction

2.2) Techniques to handle data from multiple sources

2.3) Workflow modelling for social science

2.4) Security driven data management

DAMES research Node

Social researchers often spend more time on data management than any other part of the research process

Appendix 1 – other extant resources relevant to DM

[Diagram: research stages and related resources]
Stages: Data access / collection; Data Management; Data Analysis
Resources shown: UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS

Some Key issues and concerns for DAMES

4 good habits and principles

3 Challenges

(a) Good habit: Keep clear records of your DM activities

• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007

In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)

Syntax Examples: www.longitudinal.stir.ac.uk

Stata syntax example (‘do file’)

Software and handling variables – a personal view

Stata is the superior package for secondary survey data analysis:

o Advanced data management and data analysis functionality
o Supports easy evaluation of alternative measures (e.g. est store)
o Culture of transparency of programming/data manipulation
o Cf. Scott Long (2009)

Problems with Stata
o Not available to all users
o {Slow estimation times}

(b) Principle: Use existing standards and previous research

Variable operationalisations

Use recognised recodes / standard classifications
• ONS harmonisation standards
• [Shaw et al. 2007]
• Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003]
• Common vs. best practices (e.g. dichotomisations)

Use reproducible recodes / classifications (paper trail)

Other data file manipulations
• Missing data treatments
• Matching data files (finding the right data)

(c) Principle: Do something, not nothing

We currently put much more effort into data collection and data analysis, and neglect data manipulation

Survey research – the influence of ‘what was on the archive version’

…In my experience, a common reason why people didn’t do more DM was because they were frightened to…

(d) Principle: Learn how to match files (‘deterministic’)

Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...

One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge 1:1 pid using file2.dta

One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge m:1 pid using file2.dta

Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)

Many-to-Many matches

Related cases matching
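
For readers without SPSS or Stata, the logic of the one-to-one match and the aggregation can be sketched in plain Python. The files, variables and function names below are hypothetical. The `source` field is an addition for this sketch, similar in spirit to the `_merge` variable that Stata creates, recording which file each record came from.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical files with one record per person in each.
file1 = [{"pid": 1, "sex": "F"}, {"pid": 2, "sex": "M"}, {"pid": 3, "sex": "F"}]
file2 = [{"pid": 1, "region": "North"}, {"pid": 2, "region": "South"},
         {"pid": 4, "region": "East"}]


def match_one_to_one(master, using, key):
    """One-to-one match on `key`, recording where each record came from."""
    for name, rows in (("master", master), ("using", using)):
        keys = [r[key] for r in rows]
        if len(keys) != len(set(keys)):
            raise ValueError(f"{key} is not unique in the {name} file")
    left = {r[key]: r for r in master}
    right = {r[key]: r for r in using}
    merged = []
    for k in sorted(left.keys() | right.keys()):
        row = {**left.get(k, {}), **right.get(k, {})}
        if k in left and k in right:
            row["source"] = "both"
        elif k in left:
            row["source"] = "master only"
        else:
            row["source"] = "using only"
        merged.append(row)
    return merged


def aggregate_mean(rows, key, var):
    """Many-to-one 'aggregation': the mean of `var` within each value of `key`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[var])
    return {k: mean(values) for k, values in groups.items()}


# Hypothetical repeated income records for the same people.
incomes = [{"pid": 1, "income": 100}, {"pid": 1, "income": 200},
           {"pid": 2, "income": 300}]

merged = match_one_to_one(file1, file2, "pid")
meaninc = aggregate_mean(incomes, "pid", "income")
print([r["source"] for r in merged], meaninc)
```

Checking that the key is unique before matching, and inspecting the unmatched records afterwards, are the two steps most easily skipped in practice.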

Some challenges for data management..

(e) Agreeing about variable constructions

Unresolved debates about optimal measures and variables

Esp. in comparative research such as across time, between countries

In DAMES, we have particular interests in comparability for:
• Longitudinal comparability (http://www.longitudinal.stir.ac.uk/variables/)
• Scaling / scoring categories to achieve ‘meaning equivalence’ or ‘specific measures’
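
One possible way to score categories for ‘meaning equivalence’ (offered as an illustration, not as the approach adopted in DAMES) is to give each ordered category its midpoint position in the cumulative distribution of its own cohort. The cohort labels and counts below are hypothetical, as is the function name `midpoint_scores`.

```python
def midpoint_scores(counts):
    """Score ordered categories by their midpoint in the cumulative distribution.

    `counts` is a list of (category, n) pairs ordered from lowest to highest.
    Returns {category: score}, with every score between 0 and 1.
    """
    total = sum(n for _, n in counts)
    scores, below = {}, 0
    for category, n in counts:
        scores[category] = (below + n / 2) / total
        below += n
    return scores


# Hypothetical counts of highest qualification in two birth cohorts.
cohorts = {
    "born 1940s": [("School level or below", 70),
                   ("Higher school or vocational", 20),
                   ("Degree", 10)],
    "born 1970s": [("School level or below", 40),
                   ("Higher school or vocational", 30),
                   ("Degree", 30)],
}

scores = {cohort: midpoint_scores(counts) for cohort, counts in cohorts.items()}
for cohort, cohort_scores in scores.items():
    print(cohort, cohort_scores)
```

With these made-up counts a degree scores 0.95 in the older cohort and 0.85 in the younger one, reflecting that the same category marks a less exclusive position once more people hold it.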

Some challenges for data management..

(f) Worrying about data security

DM activities could challenge data security
• Inspecting individual cases
• Multiple copies of related data files
• Ability to link with other datasets
• ‘Hands-on’ model of data review

New and exciting data resources
• have more individual information
• are more likely to be released with stringent conditions
• may jeopardize traditional DM approaches

Some routes to secure data

• Secure ‘portals’ for direct access to remote data
• Secure settings (e.g. safe labs)
• Data anonymisation and attenuation
• Emphasis on users’ responsibility rather than the data provider
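
To make ‘anonymisation and attenuation’ concrete, the sketch below replaces an identifier with a salted hash, groups age into bands and top-codes income. It is illustrative only and not a description of any data provider's procedure: the salt, band width, income ceiling and function names are hypothetical, and steps like these reduce detail without guaranteeing that a record cannot be re-identified.

```python
import hashlib

SALT = "replace-with-a-secret-value"   # hypothetical; would be kept private


def pseudonymise(pid):
    """Replace an identifier with a salted hash so files can still be linked."""
    return hashlib.sha256(f"{SALT}:{pid}".encode()).hexdigest()[:12]


def age_band(age, width=10):
    """Attenuate exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def top_code(value, ceiling):
    """Cap extreme values, which are the easiest to recognise."""
    return min(value, ceiling)


# Hypothetical record before release.
record = {"pid": 1001, "age": 47, "income": 250000}

released = {
    "pid": pseudonymise(record["pid"]),
    "age_band": age_band(record["age"]),
    "income": top_code(record["income"], 100000),
}
print(released)
```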

Some challenges for data management..

(g) Incentivising documentation / replicability

There is little to press researchers to better document DM, but much to press them not to

• Make DM and its documentation easier?
• Reward documentation (e.g. citations)?

Appendices

Appendix 1: Existing resources (i): Data providers - a) Documentation and metadata files

Existing resources (i): Data providers

b) Resources for variables
• CESSDA PPP on key variables http://www.nsd.uib.no/cessda/project/
• UK Question Bank http://qb.soc.surrey.ac.uk/
• ONS Harmonisation http://www.statistics.gov.uk/about/data/

c) Resources for datasets
• UK Census data portal, http://census.ac.uk/
• IPUMS international census data facilities, www.ipums.org
• European Social Survey, www.europeansocialsurvey.org

d) Data manipulations prior to data release
• Missing data imputation / documentation
• Survey design / weighting information
• Influential – most analysts use ‘the archive version’

Existing resources (ii) Resource projects / infrastructures

- UK ESDS www.esds.ac.uk
- ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
- Helpdesks; online instructions; user support..

- UK ESRC NCRM / NCeSS / RDI initiatives
- Longitudinal data – www.longitudinal.stir.ac.uk
- Linking micro/macro - www.mimas.ac.uk/limmd/

- Other resources / projects / initiatives
- EDACwowe - http://recwowe.vitamib.com/datacentre
- ….

Existing resources (iii) Analytical and software support

Textbooks featuring data management
• [Levesque 2008]
• [Sarantakos 2007]
• [Scott Long 2009]

Software training covering DM
• Stata’s ‘data management’ manual
• SPSS user group course on syntax and data management, www.spssusers.co.uk

But generally, sustained marginalisation of DM as a topic
• Advanced methods texts use simplistic data
• Advanced software for analysis isn’t usually combined with extended DM requirements

Existing resources (iv) Data analysts’ contributions

Academic researchers often generate and publish their own DM resources, e.g.

Harry Ganzeboom on education and occupations, http://home.fsw.vu.nl/~ganzeboom/pisa/

Provision of whole or partial syntax programming examples

Analysts often drive wider resource provisions related to DM

CAMSIS project on occupational scales, www.camsis.stir.ac.uk

CASMIN project on education and social class

Existing resources (v) Literatures on harmonisation and standardisation

National Statistics Institutes’ principles and practices

E.g. ONS www.statistics.gov.uk/about/data/harmonisation/

Cross-national organisations

E.g. UNSTATS - http://unstats.un.org/unsd/class/

Academic studies

E.g. [Harkness et al 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]

Appendix 2: Some other selected NCeSS projects (concerned with accessing/handling complex data)

GENeSIS http://www.genesis.ucl.ac.uk/
Geographical data collection for visualisation and simulation analysis

DReSS http://web.mac.com/andy.crabtree/NCeSS_Digital_Records_Node/
Storing and processing high-volume Qualitative data (audio/visual)

LifeGuide http://www.ncess.ac.uk/research/lifeguide/
Collecting/coordinating health/lifestyle information resources for public health dissemination

Obesity e-Lab http://www.ncess.ac.uk/research/obesity/
Collecting, linking and accessing health/social data re diet/lifestyle/obesity

PolicyGrid http://www.ncess.ac.uk/research/semantic_web/policyGrid/
Organise/access evidence from mixed data types to assist social science policy making

CQeSS http://e-science.lancs.ac.uk/cqess/
Statistical analysis resources for specification of models for data on complex multi-process systems

References

Blanden, J., Goodman, A., Gregg, P., & Machin, S. (2004). Changes in generational mobility in Britain. In M. Corak (Ed.), Generational Income Mobility in North America and Europe. Cambridge: Cambridge University Press.

Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.

Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-172.

Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.

Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.

Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.

Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS users. Chicago, Il.: SPSS Inc.

Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave MacMillan.

Scott Long, J. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.

Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press.