Using synthetic data to improve the accessibility of the SLS

Susan Carsley, SLS Project Manager

Overview

What is the SLS?

How the SLS can currently be accessed

How the SLS hopes to use synthetic data

What is the SLS?

The SLS is a large-scale, anonymised linkage study designed to capture 5.5% of the Scottish population

The sample is based on 20 semi-random birthdays

It is a joint project between the University of Edinburgh, the University of St Andrews and National Records of Scotland (NRS)

It is built using data available from:

Census

Vital Events

NHSCR (migration into or out of Scotland)

Education (School Census, Absences and SQA qualifications)

NSS health data (linked on a project-by-project basis)
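The link between 20 birthdays and a sample of roughly 5.5% can be checked with a little arithmetic. The sketch below is illustrative only: it assumes births are spread evenly across the year, and the birthdays in `EXAMPLE_BIRTHDAYS` are made up for the example (the real SLS birthdays are not given here).

```python
from datetime import date

# Hypothetical (month, day) pairs for illustration only.
# These are NOT the real SLS birthdays.
EXAMPLE_BIRTHDAYS = {(1, 15), (3, 2), (7, 28), (11, 9)}


def expected_sampling_fraction(n_birthdays, days_in_year=365.25):
    """Expected share of the population born on one of n_birthdays days,
    assuming births are spread evenly over the year."""
    return n_birthdays / days_in_year


def in_sample(date_of_birth, birthdays=EXAMPLE_BIRTHDAYS):
    """True if the date of birth falls on one of the sample birthdays."""
    return (date_of_birth.month, date_of_birth.day) in birthdays


if __name__ == "__main__":
    print(f"{expected_sampling_fraction(20):.2%}")  # prints 5.48%
    print(in_sample(date(1980, 7, 28)))             # prints True
```

Under the even-births assumption, 20 birthdays out of 365.25 days gives about 5.48%, in line with the 5.5% quoted above.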

Aims and scope

Aims:

Continue building and developing the SLS;

Support researchers who wish to undertake projects with the SLS data;

Provide web-based resources that help make use of the SLS easier;

Provide training on the SLS and longitudinal data handling, analysis and modelling.

Scope:

Research into demographic, health and social questions in Scotland;

Support is primarily given to academic researchers, and secondly to non-academic researchers for non-commercial use.

Security & Confidentiality

Dataset is held in a secure environment at NRS (access to the building is controlled, passes are worn at all times and visitors are escorted)

Data are accessed in a keypad-secure environment

Computers are on a password-protected, stand-alone network

Abide by all relevant protocols on data sharing, access and security

Data access strictly controlled

All results of data analysis are disclosure checked before release

How the SLS can currently be accessed

Accessing the SLS

There are currently two ways to access the SLS:

Remote access

Safe Setting access

Types of data access: Remote Access

Analysis

Researchers can specify the analyses by writing syntax code in SPSS, SAS or Stata, and sending this to their SLS Support Officer.

Use the web-based Data Dictionary for looking up variable names and category names (http://sls.lscs.ac.uk/variables).

Alternatively, the Support Officer will email the researcher an ‘empty shell’ including variable labels and value labels to aid writing the syntax.

The Support Officer will then run the analysis on the real dataset.

Types of data access: Remote Access

Outputs

The Support Officer will review the output of the analyses for confidentiality issues.

If the output is disclosive, your Support Officer does one of the following two things:

alters the output slightly so that it no longer contains disclosive elements.

informs you that the analyses you wish cannot be carried out because they breach the confidentiality rules.

Cleared output is sent to researchers (by email in an encrypted attachment).

Researchers never receive the real dataset. Remote access only provides you with cleared analysis outputs, such as frequency tables, cross tabulations, or regression model parameters.
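To give a feel for what a disclosure check on a frequency table might involve, here is a minimal sketch of small-cell suppression. It is an assumption-laden illustration, not the SLS procedure: the threshold of 10, the marker text and the function names are all hypothetical, and real disclosure control covers more than this single rule.

```python
from collections import Counter


def frequency_table(values):
    """Count how often each category occurs."""
    return Counter(values)


def suppress_small_cells(table, threshold=10, marker="<suppressed>"):
    """Replace any count below the threshold with a marker.

    The threshold is a hypothetical value chosen for this example;
    the actual SLS confidentiality rules are not stated here.
    """
    return {
        category: (count if count >= threshold else marker)
        for category, count in table.items()
    }


if __name__ == "__main__":
    # Made-up data for illustration.
    records = ["married"] * 40 + ["single"] * 25 + ["widowed"] * 3
    cleared = suppress_small_cells(frequency_table(records))
    print(cleared)
```

With this made-up data, the "widowed" cell (3 cases) would be replaced by the marker, while the larger cells pass through unchanged.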

Types of data access: Remote Access

Pros:

Can work from the comfort of own home/office

Can access textbooks and internet whilst writing syntax

Don’t need to travel to the Safe Setting in Edinburgh

Cons:

Get no feel for the data

Can be a long process if models need tweaking and rerunning

Very reliant on Support Officer

Types of data access: Working in the safe setting room

If you wish to analyse the data yourself, as most users do, especially at the initial stages of recoding variables and exploratory analysis, you will need to visit NRS in Edinburgh to work within the safe setting (safe haven) room.

You will not have access to the entire SLS database (only the sub-set of data extracted for your project).

The computers for analysis are not connected to the outside world and are only equipped with a CD-ROM reader.

You cannot take your outputs home immediately, because they first have to be cleared by the SLS Team (the encrypted outputs will be sent to you afterwards).

Types of data access: Working in the safe setting room

Pros:

Work with the data hands on

Can tweak and rerun models

Support Officer on hand to provide advice

Cons:

Must travel to the Safe Setting in Edinburgh

No internet access within the SLS

Strict rules within Safe Setting

How the SLS hopes to use synthetic data

Why use synthetic data?

The sensitive nature of the information the SLS contains means that access to the microdata is highly restricted.

Consequently, compared to other census data products, the SLS is used by a small number of researchers, which limits its potential impact.

Using synthetic data will facilitate access to the SLS while protecting confidentiality.

Synthetic data for the SLS: SYLLS

Synthetic SLS data spine (1991 & 2001)

Age, sex, marital status, ethnicity, limiting long term illness and geography

Open access via CALLS Hub and SLS

Bespoke synthetic datasets

Synthetic versions of data extracts to match individual user data requests

Provided to approved researchers for preliminary analysis; final analysis will be run on the real data in safe settings

Synthetic SLS data spine

Aims

Provide web-based resources that help make use of the SLS easier;

Provide training on the SLS and longitudinal data handling, analysis and modelling.

Benefits

Will allow a small subset of longitudinal data to be made available online.

Uses

Will allow potential users to access a small subset of data online and allow them to consider and practice longitudinal analysis techniques

Used in SLS training courses

Freely available for others to use as a training dataset

Bespoke synthetic datasets

Aims

Support researchers who wish to undertake projects with the SLS data;

Benefits

A good compromise between the current access options. The synthetic dataset can be accessed at home and will look (structurally) and behave (statistically) like the original confidential data but will contain artificial units only.

Uses

Allow researchers to access a synthetic version of their dataset at home

Allow researchers to write syntax and develop models using synthetic data which should behave like the original data
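As a rough illustration of how a dataset can keep its structure and some of its statistical behaviour while containing only artificial units, the sketch below draws each variable in turn from the values observed alongside the previously drawn variable. This is a simplified toy approach chosen for the example, not a description of the SYLLS method; the function name `synthesise`, the column names and the records are all hypothetical.

```python
import random
from collections import defaultdict


def synthesise(records, columns, n, seed=0):
    """Draw n artificial records.

    The first column is sampled from its observed values; each later
    column is sampled from the values observed together with the value
    already drawn for the column before it.
    """
    rng = random.Random(seed)
    first_values = [r[columns[0]] for r in records]

    # For each adjacent pair of columns, map a value of the earlier
    # column to the list of values seen with it in the later column.
    pairs = list(zip(columns, columns[1:]))
    pools = []
    for prev, nxt in pairs:
        pool = defaultdict(list)
        for r in records:
            pool[r[prev]].append(r[nxt])
        pools.append(pool)

    synthetic = []
    for _ in range(n):
        row = {columns[0]: rng.choice(first_values)}
        for (prev, nxt), pool in zip(pairs, pools):
            row[nxt] = rng.choice(pool[row[prev]])
        synthetic.append(row)
    return synthetic


if __name__ == "__main__":
    # Made-up records for illustration only.
    real = [
        {"sex": "F", "marital_status": "married", "illness": "no"},
        {"sex": "F", "marital_status": "single", "illness": "no"},
        {"sex": "M", "marital_status": "married", "illness": "yes"},
        {"sex": "M", "marital_status": "single", "illness": "no"},
    ]
    for row in synthesise(real, ["sex", "marital_status", "illness"], n=3):
        print(row)
```

Syntax written against such a dataset uses the same variable names and categories as the real extract, so it can later be run on the real data. With a toy sample this small, synthetic rows can coincide with real ones, so a real synthesis would need its own disclosure safeguards.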

Coming soon…

Access to SLS-like data on own computer:

Spine datasets available soon via CALLS Hub and SLS website

Following formal approval bespoke synthetic data should be available for SLS users in 2015

For more information

SLS

Website – sls.lscs.ac.uk

Email – [email protected]

Twitter - @SLS_DSU

SYLLS

Website – http://www.lscs.ac.uk/projects/synthetic-data-estimation-for-uk-longitudinal-studies/