Experiences from extracting large data sets from Swedish public offices

Fredrik Liljeros

Outline

• Why use data sets from public offices?

• Three examples of available Swedish datasets
  • Workplace and household data
  • In-patient data
  • Data on suspected criminals

• Problems with Swedish public office data

Sociological data

• Expensive to collect

• Time-consuming (especially time series)

• Low response rate

• Network data are associated with special problems

Sampling of Network Data

We can’t use a random sample
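One way to see the problem is to count how many links survive when only a random share of the individuals is observed. The sketch below is an illustration on a hypothetical toy network (a ring of 10,000 individuals, not the Swedish data); the function name and parameters are assumptions made for the example. With a sample share of p, a link survives only if both of its endpoints are sampled, so roughly p × p of the links remain.

```python
import random

def surviving_link_fraction(nodes, links, sample_share, seed=0):
    """Share of links whose two endpoints both land in a random node sample."""
    rng = random.Random(seed)
    sampled = {n for n in nodes if rng.random() < sample_share}
    kept = sum(1 for a, b in links if a in sampled and b in sampled)
    return kept / len(links)

# Hypothetical toy network: 10,000 individuals on a ring, each linked to the next one.
nodes = list(range(10_000))
links = [(i, (i + 1) % len(nodes)) for i in nodes]

fraction = surviving_link_fraction(nodes, links, sample_share=0.1)
print(f"Share of links kept by a 10% sample: {fraction:.3f}")
```

A 10% sample of individuals therefore keeps only about 1% of the links, which is why the network structure cannot be recovered from an ordinary random sample.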

Extracting data from existing databases!

Sweden may be seen as an outlier when it comes to available public data

• 1686: All priests were ordered to keep track of all people living in their parishes (Sweden had a state church until 2000)

• 1749: First census

• 1756: Foundation of the governmental office ”Tabellkommissionen” (Sweden and Finland)

• 1858: Foundation of Statistics Sweden, SCB (www.SCB.SE)

All individuals officially living in Sweden have a unique identifier

”personnummer”

700209-0960

Example 1

The Sweden database

The network

• Individuals 8,861,392

• Families 4,641,829

• Workplaces 437,936

• Giant component: 5,942,389

• Average path distance: 8.5

• Diameter: 22
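The three quantities can be illustrated on a toy example. The sketch below assumes that two individuals are linked whenever they share a family or a workplace; the membership pairs and all names are hypothetical, and this is not necessarily how the figures on the slide were computed.

```python
from collections import defaultdict, deque

def build_graph(memberships):
    """Link every pair of people who share a family or a workplace."""
    members = defaultdict(set)
    for person, group in memberships:
        members[group].add(person)
    graph = {person: set() for person, _ in memberships}
    for people in members.values():
        for person in people:
            graph[person] |= people - {person}
    return graph

def connected_components(graph):
    """Return the connected components as a list of node sets."""
    seen, found = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        found.append(component)
    return found

def path_statistics(graph, component):
    """Average shortest path length and diameter, by BFS from every node."""
    total = pairs = diameter = 0
    for source in component:
        dist, queue = {source: 0}, deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1
        diameter = max(diameter, max(dist.values()))
    return total / pairs, diameter

# Hypothetical memberships: families F1..F3 and one workplace W1.
memberships = [("a", "F1"), ("b", "F1"), ("c", "F2"), ("d", "F2"),
               ("e", "F3"), ("b", "W1"), ("c", "W1")]
graph = build_graph(memberships)
giant = max(connected_components(graph), key=len)
average_distance, diameter = path_statistics(graph, giant)
print(len(giant), round(average_distance, 2), diameter)
```

Running a breadth-first search from every node is only feasible on small examples. For a component with millions of individuals, one would more likely estimate the average distance from a sample of source nodes.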

Send home (or vaccinate) everyone except those at workplaces up to a maximum size

[Figure: size of the interconnected cluster (0 to 6,000,000) against the maximum size of workplace (0 to 100)]

Send home people randomly

[Figure: size of the interconnected cluster (0 to 6,000,000) against the percentage of people working (0 to 100), comparing two strategies: large workplaces first and random]
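The comparison between closing large workplaces and sending people home at random can be sketched on a toy population. Everything below is an assumption made for illustration: the families, the workplaces, the rule that family links always remain while workplace links exist only between people who are at work, and all function names.

```python
import random

class DisjointSet:
    """Union-find structure that tracks the size of each cluster."""
    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.size = {x: 1 for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def largest_cluster(people, families, workplaces, at_work):
    """Size of the largest interconnected cluster given who is at work."""
    clusters = DisjointSet(people)
    for members in families:
        for m in members[1:]:
            clusters.union(members[0], m)
    for members in workplaces:
        present = [m for m in members if m in at_work]
        for m in present[1:]:
            clusters.union(present[0], m)
    return max(clusters.size[clusters.find(p)] for p in people)

def workers_in_small_workplaces(workplaces, max_size):
    """Strategy 1: only workplaces with at most max_size employees stay open."""
    return {m for members in workplaces if len(members) <= max_size for m in members}

def random_workers(workplaces, n_working, rng):
    """Strategy 2: a random selection of n_working employees keeps working."""
    workers = sorted({m for members in workplaces for m in members})
    return set(rng.sample(workers, n_working))

# Hypothetical toy population: 12 people, 6 families, 3 workplaces.
people = list(range(12))
families = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
workplaces = [[0, 2, 4, 6, 8], [1, 3], [9, 10]]

everyone = {m for members in workplaces for m in members}
baseline = largest_cluster(people, families, workplaces, everyone)
small_only = largest_cluster(people, families, workplaces,
                             workers_in_small_workplaces(workplaces, max_size=2))
random_four = largest_cluster(people, families, workplaces,
                              random_workers(workplaces, 4, random.Random(1)))
print(baseline, small_only, random_four)
```

In this toy case, closing the single large workplace splits the population into clusters of at most four people while four employees are still working. The random strategy with the same number of workers depends on who happens to be selected.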

Average path distance

[Figure: average path distance (0 to 60) against the percentage of people working (0 to 100), comparing two strategies: large workplaces first and random]

Example 2

Data about suspected criminals

The data

• All individuals registered as suspected of having committed a criminal act, for every year between 1997 and 2005

• Total number of suspected individuals: 348,402

• Types of crimes: 144

• Total number of reported individual crimes: 924,783

• Average number of suspected crime types per individual: 2.65

• Standard deviation of number of suspected crime types per individual: 3.3
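As a quick consistency check, the reported average matches the total number of reported individual crimes divided by the number of suspected individuals. This is only an observation about the figures on the slide, not a statement of how the average was actually computed; the variable names are illustrative.

```python
suspected_individuals = 348_402
reported_individual_crimes = 924_783

# Ratio of the two totals given on the slide.
average_per_individual = reported_individual_crimes / suspected_individuals
print(round(average_per_individual, 2))
```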

Purpose

• Can social network visualization tools give us a better sense of how different crimes are related to each other?

Basic concepts

• Node: A specific type of crime, for example:
  • “Assault, outdoors, against child 0-6 years of age, unacquainted with the victim”
  • “Trafficking for sexual purposes”

• Link: Exists between two types of crimes if at least one individual has been suspected of both crimes in different years

Example

• 2002: “Robbery, with firearm, (Bank)”

• 2005: “Robbery, with firearm, (Post)”
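The link definition can be sketched as follows, assuming the register is available as (individual, year, crime type) records. The record layout, the individual identifiers and the function name are hypothetical. A link is created only when the same individual is suspected of two different crime types in different years.

```python
from collections import defaultdict
from itertools import combinations

def crime_type_links(records):
    """Map each linked pair of crime types to the individuals creating the link."""
    events = defaultdict(set)
    for person, year, crime_type in records:
        events[person].add((year, crime_type))
    links = defaultdict(set)
    for person, person_events in events.items():
        for (year_1, type_1), (year_2, type_2) in combinations(sorted(person_events), 2):
            if type_1 != type_2 and year_1 != year_2:
                links[tuple(sorted((type_1, type_2)))].add(person)
    return dict(links)

# Hypothetical records; individual 1 mirrors the example with the two robberies.
records = [
    (1, 2002, "Robbery, with firearm, (Bank)"),
    (1, 2005, "Robbery, with firearm, (Post)"),
    (2, 2003, "Robbery, with firearm, (Bank)"),
    (2, 2003, "Robbery, with firearm, (Post)"),  # same year: no link under this reading
]
links = crime_type_links(records)
for pair, individuals in links.items():
    print(pair, sorted(individuals))
```

Keeping the set of individuals behind each link, rather than only a yes/no flag, makes it possible to weight the links later.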

The mess of all violent crimes

A minimum spanning tree

What is a minimum spanning tree?

[Figure: a small example graph illustrating a minimum spanning tree]
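A minimum spanning tree connects all nodes of a weighted graph using the set of links with the smallest possible total weight and without any cycles. The sketch below uses Kruskal's algorithm on a hypothetical four-node graph. The slides do not say which algorithm or software routine was used, so this is only one common way to compute such a tree.

```python
def minimum_spanning_tree(nodes, weighted_links):
    """Kruskal's algorithm: add the lightest links that do not close a cycle."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the representative of n's cluster.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, a, b in sorted(weighted_links):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the link joins two separate clusters
            parent[root_a] = root_b
            tree.append((weight, a, b))
    return tree

# Hypothetical graph: (weight, node, node)
nodes = ["A", "B", "C", "D"]
weighted_links = [(1, "A", "B"), (2, "B", "C"), (3, "A", "C"),
                  (4, "C", "D"), (5, "B", "D")]
tree = minimum_spanning_tree(nodes, weighted_links)
total_weight = sum(weight for weight, _, _ in tree)
print(tree, total_weight)
```

When the link weights measure similarity rather than distance, the weights would first have to be turned into distances (or a maximum spanning tree used instead) so that the strongest associations are the ones kept. Which convention applies to the crime graphs is not stated in the slides.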

Number of mutual links between crimes A and B

Number of mutual links may not be a good measure

Crimes A and B: highly correlated

Crimes A and B: weak correlation

A simple measure of correlation between crimes

$$
\text{crimecorr} = \frac{\text{Number of individuals suspected for both A and B}}{\text{Number of individuals suspected for A}} \times \frac{\text{Number of individuals suspected for both A and B}}{\text{Number of individuals suspected for B}}
$$

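A simple example

$$
\text{crimecorr} = \frac{\text{Number of individuals suspected for both A and B}}{\text{Number of individuals suspected for A}} \times \frac{\text{Number of individuals suspected for both A and B}}{\text{Number of individuals suspected for B}}
$$

$$
\text{crimecorr} = \frac{6}{7} \times \frac{6}{8} \approx 0.643
$$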
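The measure can be sketched in Python by representing each crime type as the set of individuals suspected of it. The worked example is read here as 6/7 × 6/8 ≈ 0.643, that is, the share of A-suspects also suspected of B multiplied by the share of B-suspects also suspected of A. The function name, the handling of empty sets and the individual identifiers are assumptions made for the illustration.

```python
def crimecorr(suspects_a, suspects_b):
    """Product of the two overlap shares between two sets of suspected individuals."""
    if not suspects_a or not suspects_b:
        return 0.0  # assumption: no suspects means no measurable association
    both = len(suspects_a & suspects_b)
    return (both / len(suspects_a)) * (both / len(suspects_b))

# Hypothetical identifiers chosen to reproduce the counts 7, 8 and 6.
suspects_a = set(range(1, 8))    # 7 individuals suspected for A
suspects_b = set(range(2, 10))   # 8 individuals suspected for B, 6 of them also in A
value = crimecorr(suspects_a, suspects_b)
print(round(value, 3))
```

The measure is symmetric in A and B, equals 1 only when exactly the same individuals are suspected of both crimes, and equals 0 when no individual is suspected of both.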

A minimum spanning tree based on crime correlation

A minimum spanning tree based on crime correlation with a lower threshold of 0.01

The “mess” of sex buyers

A minimum spanning tree of suspected crimes of suspected sex buyers based on crime correlation

Conclusion

• Playing with different graphs may give a good first picture of how different crimes are associated with each other

• We still need traditional statistical techniques to test hypotheses

• Existing software packages are not very user-friendly (three different programs were needed to produce these pictures: Microsoft SQL Server, Mathcad and Pajek)

Example 3

Data about inpatients in a hospital system

The hospital network

The network

• All hospitalizations of individuals in Stockholm 2001-2002

• 295,108 individuals

• 570,382 institutional healthcare occasions

• 702 wards located at different hospitals

• The mean number of patients admitted to the wards per day varied between 1 and 69 (mean 10.05, standard deviation 9.44)

Degree distributions
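A degree distribution for such a network can be sketched as follows. The slides do not define when two patients count as connected, so the code assumes that two patients are linked if their stays in the same ward overlap by at least one day. The record layout, names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def inpatient_degrees(stays):
    """Number of distinct co-patients per patient.

    stays: (patient, ward, first_day, last_day) tuples, days given as integers.
    """
    by_ward = defaultdict(list)
    for patient, ward, first_day, last_day in stays:
        by_ward[ward].append((patient, first_day, last_day))
    contacts = defaultdict(set)
    for ward_stays in by_ward.values():
        for (p, start_p, end_p), (q, start_q, end_q) in combinations(ward_stays, 2):
            # Two stays overlap if each starts before the other one ends.
            if p != q and start_p <= end_q and start_q <= end_p:
                contacts[p].add(q)
                contacts[q].add(p)
    patients = {stay[0] for stay in stays}
    return {p: len(contacts[p]) for p in patients}

# Hypothetical stays in two wards.
stays = [
    ("p1", "w1", 1, 5),
    ("p2", "w1", 4, 8),
    ("p3", "w1", 9, 10),
    ("p2", "w2", 9, 12),
    ("p3", "w2", 11, 11),
]
degrees = inpatient_degrees(stays)
degree_distribution = Counter(degrees.values())
print(sorted(degree_distribution.items()))
```

Comparing every pair of stays within a ward is quadratic in the number of stays per ward, which is acceptable for a sketch but would need a sweep over admission dates for hundreds of thousands of healthcare occasions.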

Duration of hospital stays

Problems with Swedish public office data

• You usually have to pay for the data

• You are only allowed to use the data for the purpose you bought it for

• You can’t share the data for free

• Swedish data may not be of general interest

A last animation

Relevant publications

• Liljeros, F., J. Giesecke, and P. Holme (2007). "The contact network of inpatients in a regional healthcare system. A longitudinal case study." Mathematical Population Studies 14: 269-284.

• Chen, Y., G. Paul, R. Cohen, S. Havlin, S. P. Borgatti, et al. (2007). "Percolation theory applied to measures of fragmentation in social networks." Phys Rev E Stat Nonlin Soft Matter Phys 75: 04610.

• Gallos, L. K., F. Liljeros, P. Argyrakis, A. Bunde, and S. Havlin (2007). "Improving immunization strategies." Phys Rev E Stat Nonlin Soft Matter Phys 75: 045104.

• Camitz, M. and F. Liljeros (2006). "The effect of travel restrictions on the spread of a moderately contagious disease." BMC Medicine 4(32): 1-10.

• Edling, C. R. and F. Liljeros (2003). "Spatial Diffusion of Social Organizing: Modelling Trade Union Growth in Sweden, 1890-1940." Geography and Strategy vol. 20: 267-192.