BIG data
for Official Statistics
Peter Struijs
Department for Methodology and Process Development
Statistics Netherlands
E-mail: [email protected]
PRESENTATION
Year after year, we look forward enthusiastically to the International Statistics Seminar, this being the XXIX edition since its inception. This year we presented the topic of "Big Data" in official statistics at the Seminar held at the Europe Conference and Exhibition Centre in Vitoria-Gasteiz on November 21, 2016.
Since 1983, it has been an honour to have been able to attract innovative and recognized
researchers in statistics on a global level to speak at our International Statistics Seminar.
This time the main guest was Peter Struijs, coordinator of the Statistics Netherlands (SN) Big Data programme. Also participating along with him in the afternoon session were Pedro Alberto
González (Basque Data Protection Agency), Jerónimo Hernández and Iñaki Inza (Faculty of
Information Technology of Donostia-San Sebastián-EHU-UPV-) and Javier San Vicente and
Jorge Aramendi (EUSTAT).
The main objective this year was to address all areas of society: private companies and public organisations, the university community, workers in the statistics sector, etc. We have to keep in mind that "Big Data" is a current issue of great importance for the future, so it is our responsibility to prepare and train ourselves now.
In order for this news to reach as many interested people and institutions as possible, you
have at your disposal information about the International Statistics Seminar on the Eustat
website, www.eustat.eus.
Available within this section of the website are both this book and all the papers and technical
notes made by previous speakers since 1983. We want to contribute to the expansion of
statistical knowledge on a global level through the advantages of technology.
Vitoria-Gasteiz, November 2016
JOSU IRADI ARRIETA Director General of EUSTAT
IV
PRESENTACIÓN Año tras año, recibimos el Seminario Internacional de Estadística con entusiasmo, siendo ya
la XXIXª edición desde su creación. En esta ocasión hemos presentado el tema “Big Data” en
la estadística oficial, celebrado en el Palacio de Congreso Europa de Vitoria-Gasteiz, el día
21 de noviembre de 2016.
Desde 1983, es un honor haber logrado traer investigadores pioneros y reconocidos en
materia estadística a nivel mundial, para ser ponentes de nuestro Seminario Internacional de
Estadística.
En este caso, el invitado principal ha sido Peter Struijs , coordinador del programa de Big
Data de Statistics Netherlands (SN) (Países Bajos). Junto a él, en la sesión de tarde, también
participaron Pedro Alberto González (Agencia Vasca de Protección de Datos), Jerónimo
Hernández e Iñaki Inza (Facultad de Informática de Donostia-San Sebastián- EHU-UPV-), y
Javier San Vicente y Jorge Aramendi (EUSTAT).
El principal objetivo de este año ha sido dirigirnos a todos las ámbitos de la sociedad en
general, tanto a la empresa privada como a los organismos públicos, al campo Universitario,
a trabajadores del área estadística…etc. Tenemos que tener en cuenta que el “Big Data” es un
tema de actualidad y de gran importancia en un futuro, por lo que es nuestra responsabilidad
prepararnos y formarnos previamente.
Para que esta difusión llegue al mayor número posible de personas e instituciones
interesadas, tenéis a vuestra disposición información sobre el Seminario Internacional de
Estadística en la página web de Eustat, www.eustat.eus.
Desde esta sección de la web están disponibles on-line tanto este libro como todos los trabajos
y cuadernos técnicos realizados por los anteriores ponentes desde 1983. A través de las
ventajas de la tecnología, queremos contribuir a la expansión del conocimiento de estadística
a todo el mundo.
Vitoria-Gasteiz, noviembre 2016
JOSU IRADI ARRIETA Director General de EUSTAT
http://www.eustat.eus/
V
BIOGRAPHICAL SKETCH
Peter Struijs is coordinator of the Big Data programme of Statistics Netherlands (SN),
coordinates the ESSnet Big Data of the EU and is a member of the UN Global Working
Group on Big Data for Official Statistics.
Before being engaged in Big Data, Peter was responsible for open data at SN. For many
years, he held the position of Head of Unit for process development and quality management.
Earlier, he worked at Eurostat, the Statistical Office of the EU. He started work at SN as a
methodologist and he is an elected member of the International Statistical Institute.
Index
1. Introduction
1.1. The notion of big data
1.2. Types of big data sources
1.3. The use of big data
2. Examples of big data for official statistics
2.1. Traffic loop data
2.2. Mobile phone data
2.3. Social media data
3. Big data and the statistical process
3.1. From well-known to new processes and methods
3.2. Methodological issues
3.3. Process issues
4. Getting ready for big data
4.1. Organising for big data
4.2. Towards a data ecosystem
Annex
References
1. Introduction
Big data seems to be surrounded by hype. According to Google Trends, in August 2012 it overtook "open data" as a search term (Struijs and Daas, 2013). Hype or not, big data is highly relevant to official statistics, since it has to do with the exponential increase of data registered through networks of sensors, cameras, public administrations, banks, enterprises, mobile networks, satellites, drones, social networks, internet sites, etc. This not only creates many opportunities for improving official statistics, such as reporting on phenomena whose measurement used to be out of reach, but also profoundly influences the context in which statistics are produced, for better or for worse. And even if big data is hyped, this does not mean that attention to it will diminish after a peak. The term "big data" may fade after some time, but the underlying phenomenon will most probably last.
Big data has the potential to become a game changer for National Statistical Institutes (NSIs).
There are many issues with big data that may have an impact on NSIs, such as on the required
statistical methodology, the way data is obtained, privacy considerations, the need for an
appropriate IT infrastructure, the skills needed to deal with big data, the quality of statistics
based on big data, and the positioning of NSIs in the emerging data society. The possible
strategic impact of big data for official statistics was recognised by several NSIs some years
ago, and in 2013 the Directors-General of the NSIs of the European Statistical System (ESS) adopted the so-called Scheveningen Memorandum on Big Data and Official Statistics
(DGINS, 2013), in which a course of action was set out, including the drafting of an ESS
action plan and roadmap.
The resulting momentum led to the development of new approaches to deal with big data.
However, this subject is far from being settled. In that sense the subject of big data is different
from other areas of statistics, which benefit from established, validated approaches. This
document provides an overview of the evolving field of big data for official statistics. It aims
at showing the main issues when dealing with big data and provides access to the literature
and guidelines that are being developed by various national and international organisations. It
is not meant to give definitive answers of the kind available for more traditional areas of statistics. Although the document is intended to be balanced, it does reflect the
specific experience of the author in international big data initiatives and in the use of big data
by Statistics Netherlands. Parts of the text are based on earlier papers by the author.
The remainder of this chapter comprises an introduction to the notion of big data, a typology
of such data sources, and an overview of potential uses. Chapter 2 discusses three examples of
the use of big data. Building on these examples, the third chapter looks into methodological
and other issues related to the statistical process, including data access and privacy issues,
which are proving to be a significant bottleneck for realising the potential of the use of big
data for official statistics. Chapter 4 is concerned with the question of what has to be done in order to prepare for a future in which big data becomes an important source for official statistics. The international statistical community has been very active in supporting the use of big data, and throughout this document references are given to what has been achieved so far.
1.1. The notion of big data
The concept of big data is not clear-cut. Many attempts have been made to define big data, but
no single definition is generally accepted. Most experts agree that big data is characterised by
volume, velocity and variety, the three V’s, and some add a V for veracity, but these
characteristics may not apply all at the same time (Mayer-Schönberger and Cukier, 2013).
Volume in itself is not enough to consider data “big”. Moore’s Law stems from 1965, and the
volume of data has been increasing for many decades. What threshold was passed a couple of
years ago to start talking about big data? Apparently, no specific one. The emergence of the
concept of big data appears to result from qualitative changes induced by changes in data
quantity and public availability. We seem to have reached a point where the traditional way of
using data does not provide the answers to the new questions that arise – or not fast enough. It
may be noted that what is seen as “high volume” at one moment may not be considered very
voluminous several years later, because of advancing technological possibilities to deal with
large data quantities. In that sense big data is also a relative notion.
In the context of official statistics, big data is generally considered as a data source. An
attempt was made by UNECE, the UN Economic Commission for Europe, to define big data
for statistical purposes. Building on a definition by Gartner (Laney, 2012) it defined big data
as follows (Glasson et al., 2013):
Big data are data sources that can be –generally– described as: “high volume, velocity
and variety of data that demand cost-effective, innovative forms of processing for
enhanced insight and decision making.”
However, this definition is not precise enough to decide in concrete cases whether the data
source belongs to big data or not. Among statisticians there is some discussion on whether
high-volume data from administrative sources is included in the notion of big data, and
scanner data is considered big data by some, but not by all. Since government may make use
of sensors, e.g. road sensors, which are considered part of the Internet of Things, the
governmental origin of the data does not preclude that it should be considered big data.
In any case, rather than trying – possibly in vain – to give a more precise definition, it may
help to mention aspects of big data sources that are regarded as characteristic for such sources
by many statisticians, and to supplement this by mentioning examples of data sources that
many statisticians consider big data sources. In this way, a picture of big data can be obtained that is clear enough to allow progress to be made without getting stuck in discussions on definition; proposed definitions can be found in abundance on the internet.
Also, in statistics, high volume is not a sufficient condition for data to be considered big data.
In fact, there exist pretty high-volume traditional data sources, such as comprehensive tax
registers, that are not necessarily considered to be big data. Other characteristics often
mentioned are the novelty of the data source, the dynamics of its population, the need to use
new methodological approaches, the essentially new character of the resulting information,
the possible need to process the data at the source, the unstructured nature of the data, the
reference of the data to events, the circumstance that the data is often a by-product of the
principal activity of an organization, and their physical distribution over several databases or
points of measurement. These characteristics do support the assumption that the emergence of
the concept of big data has to do with the qualitative changes that come with quantitative ones
(Struijs and Daas, 2013).
1.2. Types of big data sources
Especially in the situation where there is not a generally accepted, unambiguous definition of
big data, it helps to have a list of concrete big data sources. For UNECE, an international task
team developed a typology of big data sources in 2013, comprising three main categories. The
first is (human-sourced) social networks, which refers to digitized information, which is
loosely structured. The second category is process-mediated data from traditional business
systems, such as data on the registration of customers, product manufacturing, taking of
orders, etc. The data tend to be highly structured, including reference tables, relationships and
metadata, making the use of relational database systems possible. The third category is the
machine-generated data of the Internet of Things. Sensors and machines record events and
situations in the physical world, and the data can be simple or complex, but is often well-structured. Its size and speed are beyond traditional approaches. This is the full typology¹:
1. Social Networks (human-sourced information):
1100. Social Networks: Facebook, Twitter, Tumblr etc.
1200. Blogs and comments
1300. Personal documents
1400. Pictures: Instagram, Flickr, Picasa etc.
1500. Videos: YouTube etc.
1600. Internet searches
1700. Mobile data content: text messages
1800. User-generated maps
1900. E-Mail
2. Traditional Business systems (process-mediated data):
21. Data produced by Public Agencies
2110. Medical records
22. Data produced by businesses
2210. Commercial transactions
2220. Banking/stock records
¹ http://www1.unece.org/stat/platform/display/bigdata/Classification+of+Types+of+Big+Data
2230. E-commerce
2240. Credit cards
3. Internet of Things (machine-generated data):
31. Data from sensors
311. Fixed sensors
3111. Home automation
3112. Weather/pollution sensors
3113. Traffic sensors/webcam
3114. Scientific sensors
3115. Security/surveillance videos/images
312. Mobile sensors (tracking)
3121. Mobile phone location
3122. Cars
3123. Satellite images
32. Data from computer systems
3210. Logs
3220. Web logs
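The hierarchical codes above can be held in a simple flat mapping; as a sketch, the parent of a code can be recovered from the numbering convention itself (trailing zeros pad a code, so stripping them and dropping the last digit climbs one level). The mapping shows only a subset of the codes, and the helper is purely illustrative, not part of any official tooling.

```python
# Subset of the UNECE typology as a flat code-to-label mapping.
UNECE_TYPOLOGY = {
    "1":    "Social Networks (human-sourced information)",
    "1100": "Social Networks: Facebook, Twitter, Tumblr etc.",
    "2":    "Traditional Business systems (process-mediated data)",
    "21":   "Data produced by Public Agencies",
    "2110": "Medical records",
    "3":    "Internet of Things (machine-generated data)",
    "31":   "Data from sensors",
    "311":  "Fixed sensors",
    "3113": "Traffic sensors/webcam",
}

def parent_code(code: str):
    """Return the parent code, or None for a top-level category.

    Trailing zeros only pad a code to its publication width, so the
    parent is found by stripping them and dropping the last digit,
    e.g. "2110" -> "211" -> "21", "1100" -> "11" -> "1".
    """
    trimmed = code.rstrip("0") or code[:1]
    parent = trimmed[:-1]
    return parent or None

print(parent_code("3113"))  # 311
print(parent_code("1100"))  # 1
```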
In 2015, the Global Working Group on Big Data for Official Statistics (GWG), created by
UNSD, the statistical department of the UN, also looked at the question of how to classify big data, taking the UNECE results as a starting point. This followed a 2015 UNSD survey among NSIs on the use of big data for official statistics, which included the question: "On which topics do you see an urgent need for statistical guidance for your office or national statistical system?"; one of the topics listed was "classification of big data". The question was answered by 89 respondents. Of these, 73% indicated that guidance on the classification of big data had a "high" (37%) or "medium" (36%) urgency².
The GWG approached the question of grouping big data sources as a classification problem.
Classifications usually have a subject, a scope (or universe), and one or more levels of (sub)
classes describing possible characteristics of the subjects, based on explicit or implicit
classification criteria. Classifications are designed on the basis of their intended uses.
Concerning the intended uses, the first use of a big data classification is providing a so-called
extensive definition of big data, i.e., an enumeration of types of big data. Any guidelines, for
instance on methods to be used when dealing with big data, could refer to the classification. It
can also be used for policy issues, such as having a well-defined scope for projects. For
instance, in February 2016 an ESSnet on Big Data (a research project) was launched, for
which pilot projects were selected by assessing the categories of the UNECE typology. It was
also used as a reference for the UNSD survey just mentioned.
² http://unstats.un.org/unsd/trade/events/2015/abudhabi/presentations/day1/04/UNSD%20-%20Global%20Survey%20on%20Big%20Data.pdf
One possibly important future use of the classification is as a reference in the discussion on
the possible use of big data for compiling SDG indicators³. Only very few countries have
started looking at the usability of big data for deriving indicators to measure progress on the
SDGs, as was shown in the UNSD survey. Therefore, it may be too early to know how it will
be used, but it is clear that the usability of big data for such indicators will be a relevant
factor, possibly with a further decomposition such as the SDG goals, targets or indicators that
could be measured using each big data source. This is also being explored by the GWG.
The intended uses inform the classification criteria to be used. The GWG identified fifteen
potential classification criteria (GWG, 2015):
1. characteristics of the data itself
2. local versus global sources
3. regulatory framework applicable
4. main product versus by-product
5. purpose and subject of the data
6. original versus derived data
7. relationship data source with organisation (e.g. data platforms)
8. public versus private organisation providing the data
9. data sourced by humans versus machines
10. degree of stability of the source
11. degree of accessibility
12. real-time versus accumulated data
13. statistical methodology required for using the data
14. domains of usability
15. usability for SDG indicators
The first criterion, characteristics of the data itself, includes eight possible characteristics:
high volume, high velocity, high variety, high veracity, selectivity, (lack of) structure, high
population dynamics, and event-based data.
The GWG is currently engaged in developing the classification, which should be flexible and
should be able to evolve over time. Initially, this would probably mean relatively short
periods between revisions. Flexibility may also be obtained by constructing a system for
classifying big data sources on demand rather than a fixed classification. In that case, methods
and rules would be needed, and possibly a larger number of criteria could be accommodated.
The work is ongoing.
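The "classification on demand" idea could be sketched as follows: instead of fixing one hierarchy, a source is described along some of the fifteen criteria and grouped by explicit rules. The field names and the rules themselves are invented here for illustration only.

```python
from dataclasses import dataclass

@dataclass
class BigDataSource:
    name: str
    human_sourced: bool  # criterion 9: sourced by humans versus machines
    by_product: bool     # criterion 4: main product versus by-product
    real_time: bool      # criterion 12: real-time versus accumulated data
    accessible: bool     # criterion 11: degree of accessibility (simplified)

def classify(src: BigDataSource) -> str:
    """Group a source by rule, rather than by a fixed hierarchy.
    These example rules roughly echo the UNECE main categories."""
    if src.human_sourced:
        return "human-sourced"
    if src.by_product and not src.real_time:
        return "process-mediated"
    return "machine-generated"

loops = BigDataSource("traffic loops", human_sourced=False,
                      by_product=True, real_time=True, accessible=True)
print(classify(loops))  # machine-generated
```

A rule system like this makes the flexibility requirement concrete: adding a criterion means adding a field and a rule, rather than revising a published hierarchy.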
Other lists of big data sources that are not clearly linked to the UNECE classification are used
elsewhere. Many big data overview papers contain lists of big data sources. A recent example
is a paper by Kitchin (2015), which contains a table linking big data sources to data types and
statistical domains, but there are many more cases of ad hoc classifications of big data.
³ SDG = Sustainable Development Goals. These have been agreed on at UN level.
Companies that offer services related to big data may use their own classifications, such as
IBM (2014).
1.3. The use of big data
There is a gap between actual use and potential use of big data. The potential use comprises
the following categories:
1. production of new products
2. provision of more detail in statistics
3. more timely statistics
4. addition of nowcasts or early indicators to statistics
5. quality improvement
6. response burden reduction
7. cost reduction and higher efficiency
New products may be statistics on phenomena about which no official statistics were
previously available. An example would be a general sentiment index on the basis of public
social media messages. Where there is new demand, such as for the SDG indicators, big data
can also be considered. New products may also be new visualisations of data (Tennekes,
2014). In some cases big data can be used as a single source, but combining big data and traditional sources for new products is in many cases a more promising approach. For new
products one needs benchmarks, based on established, validated methods, in order to assess
the quality of the new products.
More detail in statistics may be provided along several dimensions, for instance higher
regional detail on the basis of big data sources, or more temporal detail such as monthly
estimates where previously there were only quarterly data. Usually higher detail requires
regular statistics that are produced using existing sources and methods, the detail being
derived from an additional big data source. For instance, if a survey has only limited regional
detail because of the sample size, one may explore whether Google Trends at a lower regional
level can be used to provide a picture of the lower level. However, this may be more difficult
than one might think (Reep and Buelens, 2015).
Making statistics more timely is a traditional goal of official statisticians, which has its limits
if surveys are used, or if data from administrative sources lag reality, as is for example
generally the case with fiscal data. However, big data sources may be much faster, for instance when manual price collection is compared with web scraping by means of internet robots. And one may make use of correlations between big data sources and other
sources to generate more timely outcomes by making use of a model.
One step further is to produce early indicators or nowcasts for more traditional statistics. They
supplement these statistics rather than replace them. The early indicators or nowcasts often
heavily depend on correlations and model assumptions, but these quality issues may be
accepted because the final figures are produced later, so there is still a benchmark. If the
assumptions behind early indicators and nowcasts are clearly communicated to the users, and
the quality drawbacks are dealt with in a transparent way, big data may play an important role
in fulfilling the important demand of users for early information on phenomena, however
provisional the information may be.
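The model-based nowcast described above can be sketched in a few lines: regress the official series, which arrives with a lag, on a timely big data proxy, then apply the fitted model to the latest proxy value. All numbers below are invented for illustration; the proxy could be, say, a social media sentiment index.

```python
import numpy as np

# Past periods: the proxy (x) and the official figures (y) are both known.
x_hist = np.array([1.2, 1.5, 1.1, 1.8, 2.0, 1.7])  # big data proxy
y_hist = np.array([0.8, 1.0, 0.7, 1.3, 1.4, 1.2])  # official series

# Fit y = a + b*x by ordinary least squares
# (polyfit returns the highest-degree coefficient first).
b, a = np.polyfit(x_hist, y_hist, 1)

# Current period: the proxy is already observed, the official figure is
# not, so the fitted model supplies a provisional nowcast.
x_now = 1.9
y_nowcast = a + b * x_now
print(f"nowcast: {y_nowcast:.2f}")
```

The quality caveats in the text apply directly: the nowcast is only as good as the stability of the correlation, and the final official figure later serves as the benchmark against which it can be assessed.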
Big data can also be used to improve the quality of statistics in the sense of improving
accuracy and reliability. (In fact, relevance, timeliness and clarity, to which the first four uses
mentioned above contribute, can also be seen as quality aspects, as is done in the European
Statistics Code of Practice⁴.) For improving accuracy and reliability, big data sources are
generally used as complementary sources to existing sources. This includes using big data for
checking the plausibility of statistical outcomes.
Response burden reduction is an important aim of many NSIs, some of which apply specific
reduction targets. Of course, the response burden can be especially reduced if data collected
by means of questionnaires can be replaced by data from other sources. Replacing surveys
with big data, however, is not easy. In some cases, such as internet data for prices, big data is already being used successfully (Ten Bosch and Windmeijer, 2014), but usually big data
sources have more potential if they are used as additional sources, thereby not eliminating
surveys but reducing their sample size, frequency or level of detail.
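Price collection by web scraping, as mentioned above, might look roughly like the sketch below. The URL and the HTML structure are invented for illustration; a production robot would also respect robots.txt, throttle its requests and monitor for page-layout changes.

```python
import re
import urllib.request

def extract_price(html: str):
    """Pull the first price out of a page, assuming a pattern like
    <span class="price">12.99</span> (an invented structure)."""
    m = re.search(r'class="price"[^>]*>\s*([0-9]+\.[0-9]{2})', html)
    return float(m.group(1)) if m else None

def scrape_price(url: str):
    """Fetch a page and extract its price."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_price(resp.read().decode("utf-8"))

# The parsing step can be checked without touching the network:
print(extract_price('<span class="price">12.99</span> per kg'))  # 12.99
```

Separating fetching from parsing, as here, also makes it easier to re-test the robot whenever the target site changes its layout.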
Cost reduction and higher efficiency naturally go together with response burden reduction, but
NSIs may have separate targets for them. In fact, the trend seems to be that NSIs try, where
possible, to have a so-called zero footprint, by which is meant that NSIs make use of all
information they can get without causing any cost or burden. In some countries the population
census has already been replaced by making estimates from administrative sources, and big
data has the potential to reduce cost and burden also in other areas.
The seven categories are not mutually exclusive. On the contrary, making use of big data may
serve several purposes at the same time. If continuation of the precise current statistical
programme is a requirement, it may appear that the possibilities of using big data are limited.
If there is some flexibility in the programme, the potential of big data to increase total user
satisfaction is much higher, given the increased availability of data sources. Then a new
optimum may be found. In fact, the availability of big data sources requires a new
optimisation effort, aimed at getting the best set of statistical services, given the potential data
sources, the demand for statistical information and budgetary constraints and possibilities
(Struijs and Daas, 2013).
The 2015 UNSD survey gives insight into the actual use of big data for official statistics. The
main reasons for considering the use of big data given by the 89 respondents were the
production of more timely statistics and the reduction of response burden. However, big data
⁴ http://ec.europa.eu/eurostat/web/quality/european-statistics-code-of-practice
is not yet used, or considered for use, very often; the use of scanner data, at 22%, is the highest. Satellite data is used by 19% of the respondents, and web scraping by 16.5%. Big data is used much more often by OECD respondents than by non-OECD respondents, with the exception of satellite data, which is used by 22% of OECD and 17% of non-OECD countries.
Concerning the statistical domains in which big data is used, the top three consist of price
statistics (30%), population statistics (15%) and labour statistics (14%). It should be noted,
however, that the use of big data is in most cases only for explorative purposes, in pilot projects, and in many cases these pilot projects are in an initial phase.
2. Examples of big data for official statistics
In order to understand the issues that arise from using big data, or from the intention to use it, it helps to look at examples first. In official statistics there are not yet many examples of actual
use of big data in regular statistics outside price statistics (use of scanner data and web
scraping). There are more examples of research into the potential use of big data for official
statistics, such as on the use of mobile phone data or satellite data, and outside official
statistics there are many more examples, such as the well-known Billion Prices Project of
MIT⁵. There are also many examples outside official statistics of the use of data such as social media messages, for research or for commercial purposes.
The examples presented here are from official statistics, mainly in the Netherlands. The first
concerns the use of road sensor data, which is already being used for regular statistics. This is
a big data source without major data access issues, since the data is available from an
administrative source for statistical purposes for free. The second example is about the use of
mobile phone data, where data access is a big issue indeed, but where the potential uses are
many. Several countries are currently trying to get access to and use this data, and the data is
also exploited commercially. The third example concerns the use of public social media
messages, which poses particular methodological challenges. Together these examples show
many of the issues and possibilities of big data, and they have been documented in the
literature. The discussion in this section is mainly based on a paper written for the UNECE
(Struijs and Daas, 2013), some of the text of which is reused and updated.
2.1. Traffic loop data
In the Netherlands, approximately 230 million traffic loop detection records are generated a
day. This data can be used as a source of information for traffic and transport statistics and
potentially also for statistics on other economic phenomena. The data is provided at a very
detailed level. More specifically, for more than 20,000 detection loops on Dutch roads, the
number of passing cars in various length classes is available on a minute-by-minute basis.
The downside of this source is that it seriously suffers from undercoverage and selectivity.
The number of vehicles detected is not available for every minute at every location, and not all
Dutch roads have detection loops yet, although all main roads have. Fortunately, the first
problem can be corrected by imputing the absent data with data reported by the same location
during a five-minute interval before or after that minute (Daas et al., 2015). Coverage is improving
over time. Gradually more and more roads have detection loops, enabling a more complete
coverage of the most important Dutch roads. In one year more than 2000 loops were added.
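The imputation idea described above can be sketched in a few lines of Python. The counts and the loop below are invented for illustration, and the nearest-value fill is only one simple choice; Daas et al. (2015) describe the imputation method actually used.

```python
import numpy as np
import pandas as pd

# Toy minute-by-minute counts for a single (hypothetical) detection loop;
# NaN marks minutes for which no count was reported.
counts = pd.Series(
    [52.0, 48.0, np.nan, 50.0, 47.0, np.nan, np.nan, 45.0],
    index=pd.date_range("2011-12-01 08:00", periods=8, freq="min"),
)

# Fill each missing minute from the same loop's reported values within
# five minutes before (forward fill) or, failing that, after (backward fill).
imputed = counts.ffill(limit=5).fillna(counts.bfill(limit=5))
```

In this sketch the minute at 08:02 takes the value 48 reported one minute earlier, and the two missing minutes after 08:04 take the value 47.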
A considerable part of the loops are able to discern vehicles in various length classes,
enabling the differentiation between cars and trucks. This is illustrated in Figure 1. In this
figure, for the whole of the Netherlands, normalized profiles are shown for 3 classes of
5 http://bpp.mit.edu/
vehicles. The vehicles were differentiated in three length categories: small (< 5.6 meter),
medium-sized (between 5.6 and 12.2 meter) and large (> 12.2 meter). The results after correction
for missing data were used. Because the small vehicle category comprised around 75% of all
vehicles detected, compared to 12% for the medium-sized and 13% for the large vehicles, the
normalized results for each category are shown.
Figure 1. Normalized number of vehicles detected in three length categories on December
1st, 2011, after correcting for missing data. Small (< 5.6 meter), medium-sized (between 5.6
and 12.2 meter) and large (> 12.2 meter) vehicles are shown in black, dark grey and grey,
respectively. Profiles are normalized to more clearly reveal the differences in driving
behaviour.
The profiles clearly reveal differences in the driving behaviour of the vehicle classes. The
small vehicles have clear morning and evening rush-hour peaks at 8 am and 5 pm,
respectively. The medium-sized vehicles have both an earlier morning and evening rush hour
peak, at 7 am and 4 pm, respectively. The large vehicle category has a clear morning rush
hour peak around 7 am and displays a more distributed driving behaviour during the
remainder of the day. After 3 pm the number of large vehicles gradually declines. Most
remarkable is the decrease in the relative number of medium-sized and large vehicles detected
at 8 am, during the morning rush hour peak of the small vehicles. This may be caused by a
deliberate action of the drivers of the medium-sized and large vehicles, wanting to avoid the
morning rush hour peak of the small vehicles.
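The normalization used for Figure 1 can be sketched as follows. The hourly counts are invented, with volumes roughly matching the 75% / 12% / 13% split mentioned above; only the idea of putting very differently sized classes on a common scale is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy daily profiles (24 hourly values) for the three length classes,
# with invented volumes of roughly the right relative size.
counts = {
    "small": rng.poisson(7500, 24).astype(float),
    "medium": rng.poisson(1200, 24).astype(float),
    "large": rng.poisson(1300, 24).astype(float),
}

# Normalizing each profile to sum to one makes differences in driving
# behaviour visible even though the absolute volumes differ by an
# order of magnitude.
profiles = {name: c / c.sum() for name, c in counts.items()}
```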
At the most detailed level, that of individual loops, the number of vehicles detected
demonstrates (highly) volatile behaviour, indicating the need for a more statistical approach
(Daas et al., 2015). Harvesting the vast amount of information from the data is a major
challenge for statistics. For this, visualisation techniques can be very useful (Tennekes and
Puts, 2015). Making full use of this information would result in speedier and more robust
statistics on traffic in general and would provide more detailed information on the traffic of
large vehicles, which is very likely indicative of changes in economic development.
Since 2015, Statistics Netherlands has published regular statistics on the traffic intensity of the
main roads, based on this source6. Interestingly, the potential of big data was demonstrated at
the beginning of 2016, when the first three working days of the year were extremely frosty,
with icy roads in the north of the country. With the process already in place, it was possible to
publish a press release on the eighth of January reporting on the use of the main roads in the
north of the country, in which a comparison was made with the first three working days of
previous years. Road use was shown to have been halved7.
2.2. Mobile phone data
The use of mobile phones nowadays is ubiquitous. People often carry phones with them and
use them throughout the day. Instrumental for the infrastructure enabling mobile phone
coverage are the mobile phone masts/towers, called ‘sites’ in the industry. These sites are
located at strategic points, covering as wide an area as possible.
Much of the activity associated with handling the phone traffic, that is, handling the
localisation of mobile phones and optimizing the capacity of a site, is stored by the mobile
phone company. Mobile phone companies thus record data that are very closely associated
with the behaviour of people; behaviour that is of interest to NSIs. Obvious examples are
behaviour regarding tourism, mobility, commuting and transport. The destinations and
residences of people during daytime are also topics of various surveys. Data from mobile
phone companies could provide additional and more detailed insight into the whereabouts and
activity of their users, which may be indicative of the behaviour of people in general.
Several NSIs have tried to get access to mobile phone location data and explore the
possibilities for statistics. In the Netherlands, research on this has been going on for some
time now. A dataset from a mobile telecommunication provider containing records of all call-
events (speech-calls and text messages) on their network in the Netherlands for a time period
of two weeks was studied. There are about 35 million records a day. Each record contains
information about the time and serving antenna of a call-event and a (scrambled version of
the) identification number of the phone. Getting the data proved to be very complex. This
study revealed several uses for official statistics, such as statistics on economic activity,
tourism, population density, mobility and road use (De Jonge et al., 2012). In particular, the place
where people are at any time during the day can be compared to the place where people are
registered at municipalities. For these purposes, good visualisation is essential (Tennekes and
Offermans, 2014).
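A toy sketch of the kind of processing involved is given below. The records, antenna names and scrambled identifiers are invented, and the daytime-presence measure is deliberately crude; real studies of the kind cited above map antennas to areas and correct for market share and multiple devices per person.

```python
import pandas as pd

# Invented call-detail records of the kind described above: time of the
# call event, serving antenna, and a scrambled phone identifier.
cdr = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-05-01 09:10", "2012-05-01 09:40", "2012-05-01 14:05",
        "2012-05-01 09:20", "2012-05-01 23:30",
    ]),
    "antenna": ["A1", "A1", "A2", "A3", "A3"],
    "phone": ["p1", "p1", "p1", "p2", "p2"],
})

# A crude daytime-presence indicator: distinct devices observed per
# antenna during working hours (9:00-17:59 here).
daytime = cdr[cdr["timestamp"].dt.hour.between(9, 17)]
presence = daytime.groupby("antenna")["phone"].nunique()
```

Such an antenna-level count is the kind of quantity that can then be compared with the municipal population register mentioned above.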
6 http://www.cbs.nl/NR/rdonlyres/25CE3592-A756-42B7-BABF-C3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf
7 http://www.cbs.nl/nl-NL/menu/themas/verkeer-vervoer/publicaties/artikelen/archief/2016/helft-minder-verkeer-in-noord-nederland-door-ijzel-januari-2016.htm (in Dutch).
Recent research by Statistics Belgium also showed the potential of using mobile phone data
for official statistics (De Meersman et al., 2016). The Belgian NSI is one of the more
successful NSIs in obtaining access to mobile phone data, and promotes the formation of
mutually beneficial partnerships (Debusschere, 2016).
At the level of the ESS, the importance of securing access to data from mobile network
operators has been recognised. The use of mobile phone data is one of the research areas of
the ESSnet on Big Data, mentioned earlier. Part of this research is aimed at solving access
issues. In September 2016, a workshop was organised by this ESSnet with mobile network
operators8, to see what can be done to exploit access and partnership opportunities. This will
be further discussed in chapter 4.
2.3. Social media data
So far social media messages have not been used for regular official statistics, but their
potential use is increasingly being researched. In the Netherlands, more than one million
public social media messages are produced on a daily basis. These messages are available to
anyone with internet access. Social media is a data source where people voluntarily share
information, discuss topics of interest, and contact family and friends. To find out whether
social media is an interesting data source for statistics, Dutch social media messages were
studied from two perspectives: content and sentiment.
Studies of the content of Dutch Twitter messages (the predominant public social media
platform in the Netherlands at the time of the study) revealed that nearly 50% of those
messages were composed of 'pointless babble'. The remainder predominantly discussed spare-
time activities (10%), work (7%), media (TV & radio; 5%) and politics (3%). Use of the
more serious messages was hampered by the less serious 'babble' messages, which also
negatively affected text mining studies.
Determination of the sentiment in social media messages revealed a very interesting potential
use of this data for statistics. The sentiment in Dutch social media messages was found to be
highly correlated with Dutch consumer confidence, in particular with the sentiment towards
the economic situation. The latter relation was stable on a monthly and on a weekly basis.
Daily figures, however, displayed highly volatile behaviour (Daas et al., 2015). This suggests
that it is possible to produce weekly indicators for consumer confidence. It also revealed that
such an indicator could be produced on the first working day following the week studied,
demonstrating the ability to deliver quick results. Moreover, since consumer confidence
statistics are survey-based, cost and response burden reduction may be feasible, if quality
8 https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/WP5_2016_09_2223_Luxembourg_Workshop
issues can be solved in a satisfactory way. The survey may remain necessary for benchmark
purposes, but its sample size or frequency may be reduced. It is conceivable to use both
survey data and social media data in a model in order to get earlier results, lower cost and
response burden, and still maintain quality standards.
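The smoothing effect of weekly aggregation, which makes the weekly indicator feasible while daily figures remain too volatile, can be illustrated with invented daily sentiment scores; the real series is described in Daas et al. (2015).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Invented daily sentiment scores (e.g. the share of positive minus
# negative messages), standing in for the real Dutch series.
daily = pd.Series(
    rng.normal(0.0, 0.15, 90),
    index=pd.date_range("2015-01-01", periods=90, freq="D"),
)

# Weekly averages are much less volatile than the daily figures, and a
# weekly indicator can be compiled on the first working day after the
# week ends.
weekly = daily.resample("W-SUN").mean()
```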
The analysis of social media messages is an area where text mining, learning algorithms and
other artificial intelligence approaches are applied to big data. Apart from official statistics,
much research is done in academia (an example is Schwartz et al, 2014). This is another area
where partnerships may be beneficial to both sides. The potential of such data for commercial
purposes was also realised early on (Bollen et al., 2011), but this generally does not result
in publicly available methods and outcomes. It does show, however, that official statistics are
far from unique in analysing social media data.
3. Big data and the statistical process
In order to see how the use of big data for official statistics may affect statistical production
processes and methods, and what issues arise, the current way of producing statistics is the
background reference. Therefore the next section starts with characterising the more familiar
way of making statistics, before looking at possible changes caused by or required for using
big data. Methodological issues are the subject of the subsequent section, which also covers
quality issues, since the quality of statistics depends on the methods applied, and methods are
usually chosen to fulfil quality objectives. There is interaction between methods on the one
hand, and their implementation in processes on the other. This is the subject of the third
section of this chapter. The chapter builds, among other sources, on a paper on quality
approaches to big data (Struijs and Daas, 2014).
3.1. From well-known to new processes and methods
With only very few exceptions, the statistical programmes of NSIs are based on inputs from
statistical surveys and administrative data sources. For such statistics there exists an elaborate
body of validated statistical methods. Many of these methods are survey oriented, but in fact,
most survey based statistics make use of population frames that are taken or derived from
administrative data sources. Methods for surveys without such frames do exist, for instance
area sampling methods, but nowadays even censuses (of persons and households as well as
businesses and institutions) tend to make use of administrative data. And administrative data
sources are, of course, themselves also used as the main data source for statistical outputs.
Statistical surveys are increasingly used for supplementing and enhancing administrative
source information rather than the other way round. This is the consequence of the widely
pursued objectives of response burden reduction and cost efficiency.
Surveys may be run in parallel, in so-called stovepipes. These may be well co-ordinated or
may run more or less independently from each other. A large part of the body of established
methods for surveys is connected to sampling theory, the core of which refers to a target
population of units and variables, to which sampling, data collection, data processing and
estimation are tuned and optimised, considering cost and quality aspects.
Next to stovepipe statistics, there also exist integrative statistics, based on a multitude of
sources. The prime example of such statistics is National Accounts (NA). Statistical methods
for NA focus on the way different sources for various domains and variables of interest can be
combined. Since these sources may be based on different concepts and populations, frames
and models have been developed for integration of sources. These frames and models include,
for instance, macroeconomic equations. Interestingly, NA outputs generally do not include
estimations of business populations. This may reflect the fact that the production of NA
involves quite a few expert assumptions as well as modelling, rather than population based
estimation.
This characterisation of statistics in relation to methods is, of course, incomplete. There are a
number of methods aimed at specific types of statistics, for instance occupancy models for
estimating the development of wild animal populations, or time series modelling.
For big data the question is to what extent current methods and processes can be reused when
applying new types of data sources. Many big data sources do not have a deliberate design.
Traditional administrative registers have a well-defined target population, variables, structure
and (administrative) quality. They also have an explicit legal basis. But what design is behind
Twitter messages, commercial websites or mobile phone traffic? For big data sources,
populations can often not be specified, let alone related to other sources. How can NSIs then
ensure quality?
The implication is that methods derived from sampling theory may have their limitations
when big data are going to be used. However, although current methods are predominantly
based on sampling theory, this is not exclusively so. Methods outside traditional sampling
theory, especially those involving modelling, may be relevant when dealing with big data. And
modelling is already being applied in some statistical domains, such as NA and seasonal
adjustment.
3.2. Methodological issues
Starting with a very fundamental issue, what exactly is the meaning and relevance of the data
found in big data sources, from a user’s perspective? What does the number of searches on an
internet search engine reveal, or the sentiment observed in social media, or the number of
mobile phones connected to a site? The interpretation of big data can be a big methodological
problem (Daas and Puts, 2014). Moreover, meaning and relevance are user and use
dependent.
This issue is not unique to big data, to be sure, as for instance certain administrative data
sources may have a similar issue. In fact, this sometimes results in statistics about what can be
found in an administrative register rather than about the phenomenon of interest, such as when
reported rather than actual crime is measured, or the population with unemployment benefit
rather than unemployment itself. If the meaning of the data of a big data source cannot be
pinpointed, but obviously has some relevance, an option may be to produce stand-alone
statistics, such as a general sentiment indicator based on social media. The interpretation is
then up to the user, and changes in the index (rather than the level itself) may be interesting
anyway. In a way, the example of road use statistics based on sensor data can be seen as
stand-alone, since the number of vehicles passing a certain road segment can hardly be linked
to surveys on mobility or other statistics.
Another issue concerns the population about which a big data source reports. Most statistics
aim at giving information about populations of persons or businesses, or other relevant sets,
such as goods imported or sold. However, the population covered by big data may be unclear.
Mobile phones may be carried by others than the owner, some persons have multiple phones,
vehicles passing a detection loop may be private or company vehicles, and what do we know
about the population using social media? And how do these populations change over time?
How stable are they? In some cases it may be possible to obtain background variables, such as
for credit card data, while in other cases background variables may be estimated. For instance,
the choice of wording is correlated with age and sex of the user of social media (Daas and
Burger, 2015). This is another reason text mining may become more important in the age of
big data.
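A toy sketch of such a word-usage feature is given below. The marker words and messages are invented for the sketch; the actual features correlating word choice with age and sex are described by Daas and Burger (2015).

```python
import re

# Hypothetical marker words assumed, for illustration only, to be more
# common among younger users.
YOUNG_MARKERS = {"lol", "omg", "school"}

def marker_share(text: str) -> float:
    """Share of tokens in a message that belong to the marker list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in YOUNG_MARKERS for t in tokens) / len(tokens)

messages = [
    "lol that was so funny omg",
    "the quarterly report on employment figures is out",
]
shares = [marker_share(m) for m in messages]
```

Features of this kind could serve as inputs to a classifier estimating background variables of social media users, which is one reason text mining may become more important.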
This issue has to do with the question of selectivity and representativity (Buelens et al, 2014),
and with the sometimes unstructured nature of the data, which makes it even more difficult to
extract meaningful statistical information. Selectivity is a characteristic of many data sources,
including big data sources. For some of such sources, the selectivity mechanism is known,
such as for road sensors if the target population consists of road segments, or for financial
transaction data. For other sources this is partly known, as is the case for mobile phone data,
where the grid of antennae may be known, but the population of mobile phone users perhaps
not. In the case of the population behind social media the mechanism is even less known.
Not knowing the composition of the populations included in big data leads to the question of
what to do when sampling theory cannot be applied (Struijs et al., 2014). What can be done if
one does not know for which part of the target population the dataset is representative? More
fundamentally, one may wonder whether sampling theory deserves to be the default approach
to statistics in the age of big data. Maybe more model-based approaches need to be applied.
Examples are probabilistic modelling, Bayesian methods, multilevel approaches, statistical-
learning methods and occupancy models, such as those used in measuring wild animal
populations. Econometric models can also be considered. Then the measured phenomena are
leading, and research may be aimed at relating them to information already known.
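As a minimal sketch of relating a measured phenomenon to information already known, the following fits a simple linear relation between an invented big data signal and an invented survey series, and uses it to produce an early estimate. All numbers and the linear form are assumptions for illustration, not the method of any NSI.

```python
import numpy as np

# Invented monthly values: a big data indicator (e.g. a sentiment index)
# and the corresponding surveyed figure, for months where both exist.
signal = np.array([0.10, 0.15, 0.05, 0.20, 0.12])
survey = np.array([1.2, 1.5, 0.9, 1.8, 1.3])

# Fit survey ~ a + b * signal by least squares, then apply the fitted
# relation to a month where only the big data signal is available yet.
b, a = np.polyfit(signal, survey, 1)
early_estimate = a + b * 0.18
```

The sketch also makes the danger visible: the early estimate is only as good as the assumption that the fitted relation continues to hold.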
However, it is not clear whether this would really be desirable. This approach to official
statistics is not generally accepted, at least not yet, as this would increase the use of
assumptions in statistics, and the compilation of statistics by making use of observed
correlations between variables. For instance, the correlation between the sentiment as
observed in social media messages and the surveyed consumer confidence may be high and
remain high for a considerable time, but if that correlation is not well understood, there are
certain risks, especially if the relationship between the population writing public messages on
social media and the population at large is not known (Daas and Puts, 2014a). Such risks have
become well-known since quality issues appeared with the Google Flu data (Lazer et al,
2014).
For NSIs, a key question is how the quality of official statistics can be guaranteed if they are
based on big data and new methods such as modelling are applied (Puts et al, 2015). The
question of modelling is not new (Breiman, 2001), but the concerns with big data are (Tam
and Clarke, 2015). When reading articles by proponents of new approaches (e.g., Varian,
2014), one may wonder whether a paradigm shift is taking place.
Methodological issues also have the attention of the international statistical community. In
2014 the UNECE established a task team to advise on how to ensure good quality of statistics
when using big data. The report of the task team (UNECE, 2014) did not come up with a list
of methods, because there was too little experience with big data at the time and methods
would be source dependent, but it proposed an approach similar to assessing the potential use
of administrative data sources. Another initiative is currently taking place in the ESSnet on
Big Data mentioned earlier, where a work package has been defined to systematically assess
the methods that can be used for big data statistics, coming from the ESSnet itself as well as
from the literature. The work package will be carried out in 2017. Other organisations have
also been looking into such issues (e.g., Baker et al, 2013, and AAPOR, 2015).
3.3. Process issues
Making use of big data for official statistics may have consequences for all aspects of the
statistical process, from the input to the output process. At the input side, there may be access
issues. Throughout the process, there may be issues of privacy and security, and of
infrastructure needed for processing the volume of the data. There may also be issues
concerning dissemination.
When using big data, the design of the statistical process needs special attention. It may be
difficult to receive and process really high volume datasets, especially if the second “V” of
big data, velocity, applies. In the example of the use of traffic sensor data, the first analysis
was done on a huge set of data covering all records for several years. This yielded techniques
for reduction of the volume of data without information loss, by looking at what was actually
needed, including metadata, and removing noise. For instance, records may contain a lot of
metadata that is the same for a large set of records. The process that resulted was very
efficient and used parallel processing, but in this case it is also possible to use the streaming
data. That requires a different type of process. In any case, the processing, storage and transfer
of large data sets may pose a challenge. However, given technological advances like increases
in computing power, parallel processing techniques, larger storage facilities and high
bandwidth data channels, for most situations this does not need to become a bottleneck.
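The idea of reducing volume without information loss by splitting off repeated metadata can be sketched as follows; the column names and values are invented for the sketch.

```python
import pandas as pd

# Invented raw records in which the same loop metadata is repeated on
# every row, as described for the traffic sensor data above.
raw = pd.DataFrame({
    "loop_id": ["L1", "L1", "L2", "L2"],
    "road":    ["A13", "A13", "A2", "A2"],
    "minute":  ["08:00", "08:01", "08:00", "08:01"],
    "count":   [52, 49, 30, 28],
})

# Split the constant metadata into a small reference table and keep only
# the varying measurements: a simple normalisation that shrinks the data
# without losing information.
meta = raw[["loop_id", "road"]].drop_duplicates()
obs = raw[["loop_id", "minute", "count"]]

# The original table can be reconstructed exactly when needed.
restored = obs.merge(meta, on="loop_id")
```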
There may be technical solutions for dealing with high volume and high velocity data, albeit
possibly at additional costs, but another solution is conceivable: perhaps the data can remain
at the source. Maybe it is possible to arrange that queries are done by the source holder in the
source, for instance that data is first aggregated or sampled prior to sending it to the NSI. But
such solutions themselves entail other issues. Are the results reproducible, can they still be
linked to data available at the NSI? In the case of mobile phone data, the operators often
prefer delivering aggregated data rather than individual records. This has to do with privacy
considerations, among other things (Struijs et al, 2014).
In general, as is the case with more traditional statistical processing, the data may be
processed physically at the NSI offices or elsewhere, in various arrangements. Although it
entails its own issues, cloud computing may be considered if some conditions are met (see
below). Another possibility is to collaborate with another party, for instance a research
institute with facilities for big data processing. This may be beneficial to all parties involved,
since knowledge and experience can be shared. Whatever arrangement is entered into, it is
important to be aware of possible risks, for instance in respect of the continuity of the
partnership and public trust. There are many platform possibilities (Singh and Reddy, 2014)
and many tools for analysis, including visualisation tools for big data sets (Tennekes et al,
2013).
Privacy and security considerations are especially important when dealing with big data,
because in comparison with more traditional processes, the problems can be compounded and
the legal situation may not be entirely clear or in flux, as existing legislation and rules were
not designed for dealing with big data. There are several issues, real or perceived, that may
impede using big data. Data ownership and copyright may be an issue, and the purpose for
which data are registered. Even if data is publicly accessible, for instance on websites or as
social media messages that do not have access restrictions, questions of ownership and
purpose of publication can be raised. Internet robots cause a burden on the providers of the
sites, and in some cases site owners prefer sending data directly to the NSI. And even if data
may legally be used, this does not imply that it is wise or appropriate to do so. Of critical
importance is the implication of any use of big data for the public perception of an NSI as this
has a direct impact on trust in official statistics.
A complicating factor is the circumstance that public opinion on privacy and confidentiality
seems to be in flux. On the one hand, privacy seems to be ever more under pressure when
public safety or commercial interests are perceived to be at stake, and young people who have
grown up using social networks tend to consider privacy less important than the elderly. On
the other hand, there seems to be a growing general awareness of possible privacy
implications of the ubiquity of data, resulting in a more critical attitude towards the
unquestioned processing of data by anyone. Anyway, the understanding for the need for
statistical data collection by organisations is decreasing, especially if such data are already
registered elsewhere.
Fortunately, there are measures NSIs can take to overcome at least some of the obstacles. In
some cases the use of informed consent may be a solution. If the NSI can offer a reduction of
the response burden, this can be very helpful, also in getting the support of the general public.
Transparency about which big data sources are used, and how, is crucial. In the long run,
changes in legislation may be considered, to ensure continuous data access. But it remains
important to stay in line with public opinion, because credibility and public trust are important
assets of NSIs.
As to cloud computing, a few sensible rules may be used. Cloud services include the
provision of computing resources, platforms and IT applications via the internet. At the
present legal and technical state of the art, it is advisable in general not to host sensitive and
critical data and processes in the cloud. The user of the cloud service remains responsible for
the security and privacy of the data, and public trust depends on this. It is also advisable to use
data encryption where possible.
Even at the output side of the statistical process, there may be issues. The prevention of the
disclosure of the identity of individuals is an imperative, but this is difficult to guarantee when
dealing with big data, although a number of techniques are available that have proven to be
reliable. Another issue may be the dissemination policy. Some statistical outputs based on
big data may be innovative or provisional and may entail some quality risks. Instead of not
disseminating results for which there is high demand, even if their quality does not meet
traditional standards, a possibility may be to release such results on a beta site, where all
outputs are qualified as provisional by default. In fact, this is common practice with many large internet
businesses, so that they get early feedback on their products. Partly for this reason, Statistics
Netherlands launched an innovation site9 in October 2016.
International organisations have tried to provide help and guidelines to deal with process
issues. In Ireland a so-called Sandbox for practicing with big data was created in 2014 by the
Central Statistics Office (CSO) of Ireland and the Irish Centre for High-End Computing
(ICHEC)10. NSIs may use these facilities for a small annual fee, and assistance is provided.
International guidelines have been developed on privacy and security by a task team of
UNECE in 2014, resulting in three documents on good practices. These documents do not
have any formal status, but facilitate working with big data; they are available on the big data
site of the 2014 project of UNECE11.
At the level of the UN, the Global Working Group (GWG) mentioned earlier has also
produced recommendations. In particular, there are recommendations on access and
partnerships (GWG, 2015a), and a template for a Memorandum of Understanding with global
data providers (GWG, 2015b). Furthermore, the GWG has drafted a number of principles for
data access, trying to find a fair balance between the interest of getting free access to data for
the public good on the one hand, and legitimate interests of private organisations on the other.
The main elements of this balance are the creation of a level playing field, equal treatment,
safeguards for confidentiality and security, transparency, and proportionality. These draft
principles, which are annexed to this document, are based on the Fundamental Principles of
Official Statistics of the UN12. They have been presented and discussed with stakeholders at
several fora and have received broad support.
9 https://www.cbs.nl/en-gb/our-services/innovation
10 http://www1.unece.org/stat/platform/display/bigdata/Sandbox
11 http://www1.unece.org/stat/platform/display/bigdata/2014+Project
12 http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
4. Getting ready for big data
4.1. Organising for big data
An NSI that wants to make serious use of big data will have to organise itself in order to cope
with the challenge. Several factors are important. A factor not discussed so far is human
capital. In order to work with big data, specific technical skills are needed, such as advanced
computing skills, a fair command of math and statistics, modelling skills and data engineering
skills. But equally important are the mental orientation and behavioural skills of the staff.
Working with big data requires an open mind-set and the ability not to see all problems a
priori in terms of sampling theory. For this type of staff the term data scientist has been
coined. However, it is not evident that the culture of NSIs can smoothly absorb this type of
professional. A way to deal with this cultural issue is to create one or more kernels of data
scientists working with big data, and let these kernels grow, which will be a natural process if
they are successful (Struijs and Daas, 2013).
Another factor is the way processes are organised. The continuity and possible volatility of
big data sources deserve consideration. Social media, for instance, seem to have an ever
shorter lifecycle. As a consequence, the use of big data requires a more flexible set-up of
production processes, with a short time-to-market. Not only data collection but also data processing further down the production chain has to be flexible. More generally, NSIs that start using
big data may have to adapt or even reconsider their enterprise architecture.
For NSIs that want to make big data a serious part of their business, governance may become
an issue. Because of the important strategic aspects of big data, this subject should get
attention at the highest management level of the NSI. Setting priorities, creating favourable
conditions for using big data and taking related budget decisions would be tasks for the
strategic level, as would be the making of policy choices. An increased use of big data
requires a number of policy decisions that influence various parts of the organisation. The
organisation’s CIO (Chief Information Officer) would likely have an important say in the way
the NSI deals with big data.
There are more relevant factors, of course. The required IT infrastructure must be in place,
and the same goes for an appropriate research capability. Policy support must be organised,
for instance concerning privacy issues. All this requires a conscious effort and co-ordination.
However, there is no blueprint for getting an organisation ready for big data.
To give an example, this is the way Statistics Netherlands organised itself when the awareness
grew that big data was a strategic issue. Once the Board of Directors of Statistics Netherlands
identified the need to have a big data strategy, they had a staff member write a position paper
for discussion by the Board. The paper suggested that the NSI should work out a big data
roadmap, and this was done. The roadmap was validated by IBM (IBM, 2014), updated twice
a year, and monitored. The roadmap not only identified big data research projects and
statistics, together with a time plan and ownership, but also arranged for creating the right
conditions, such as IT, methodological and policy support. The Deputy Director-General was
made responsible for big data at the strategic level. Statistics Netherlands already had an R&D
programme, which was then also tuned to the demand for big data research. At a more tactical
level, a big data co-ordination group was created. This group prepared updates of the
roadmap, and also arranged for internal training to be given.
In practice, this approach did yield results, but it was not satisfactory. In particular, the
transition from research to regular statistics production took much longer than desired. In
2016, the decision was taken to centralise all big data activities in a programme with its own
physical facilities and permanent staff, to which other staff is added on a project basis. This
centre, which is called the Center for Big Data Statistics (CBDS), was launched in September
2016. It already has many external partners, and makes use of the innovation site mentioned
in section 3.3.
4.2. Towards a data ecosystem13
The environment in which NSIs operate is changing. This may have consequences for the
position an NSI occupies or wants to occupy in the data society. NSIs are faced with more and
more potential data sources, whereas the modalities for their use are changing. Most actions
by persons or businesses – transactions, movements, communication, social and business
activities – nowadays leave digital traces in one way or another; ever increasing amounts of
data are becoming available. And, contrary to survey data, these data are not available exclusively to NSIs, which are thus becoming less unique as users of data.
The position of NSIs in the information society is becoming less evident, even though their
institutional setting is stable for the time being. Other providers of information on relevant
phenomena of society pop up everywhere. They are often very quick and perceived as
knowledgeable. Because of this, society becomes less dependent on information from NSIs.
There are alternatives, for instance, to official price indices. Even if there are quality issues
attached to these alternatives, there is demand for them. Apart from the many practical
questions, big data is bound to have an impact on NSIs at the strategic level.
One of the questions with which an NSI may be confronted is what to do if there is a market alternative to one or more of its statistics. But it may also ask whether it can assume new roles, based on its institutional position and the knowledge it has accumulated. Should it, for instance, consider shifting its role from producing statistical information towards validating information produced by others? Or pooling resources?
A possible approach is to assess the strengths and weaknesses of the NSI, and take them into
account when positioning itself in the information society. For instance, NSIs have a unique
ability to relate data from different sources and to assess the quality of information produced
13 This section makes extensive use of (Struijs and Daas, 2014) and (Struijs et al, 2014).
by others. They may try to exploit this by forming networks and forging partnerships with
other organisations. NSIs have come to recognise the necessity of not working in isolation but
collaborating with each other and others outside the community of official statistics. This
collaboration is often exploratory and may be aimed at sharing knowledge and experiences,
but there are already examples of collaboration that go further.
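The ability to relate data from different sources, mentioned below as a distinctive strength of NSIs, can be illustrated with a minimal sketch. The field names, data and coverage measure here are illustrative assumptions, not taken from the paper: a business register is linked to an external big data source on a shared unit identifier, and the match rate gives a first, crude indication of the external source's coverage.

```python
# Hypothetical sketch: linking a statistical register to an external
# (big) data source on a shared key, and measuring coverage -- one way
# an NSI can relate sources and assess the quality of external data.
# All names and values are illustrative assumptions.

def link_sources(register, external):
    """Join two lists of dicts on 'unit_id'; report the match rate."""
    ext_by_id = {rec["unit_id"]: rec for rec in external}
    linked, unmatched = [], []
    for rec in register:
        match = ext_by_id.get(rec["unit_id"])
        if match is not None:
            linked.append({**rec, **match})   # merge register + external fields
        else:
            unmatched.append(rec["unit_id"])  # register unit missing in source
    coverage = len(linked) / len(register) if register else 0.0
    return linked, unmatched, coverage

register = [{"unit_id": 1, "sector": "retail"},
            {"unit_id": 2, "sector": "transport"},
            {"unit_id": 3, "sector": "retail"}]
external = [{"unit_id": 1, "turnover": 120},
            {"unit_id": 3, "turnover": 80}]

linked, unmatched, coverage = link_sources(register, external)
print(unmatched, round(coverage, 2))  # two of three register units are covered
```

In practice such linkage involves far harder problems (no common identifiers, deduplication, probabilistic matching), but the coverage measure sketched here is the kind of quality signal only an organisation with a reliable frame can compute.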
From the perspective of NSIs, several types of partners are of interest. First of all, the
potential providers of big data are essential partners: if they do not grant access to their data,
the story is over before it starts. Data owners have their own concerns and, like NSIs, they are
subject to privacy rules. This may complicate collaboration even if they have a positive
outlook and approach. But since big data sources are not designed for statistical use, such
collaboration is also essential in order to obtain good knowledge of the provenance of such
sources. Additionally, for statistical production, it may be more efficient to have data
processed at the site of collection and storage.
On the other hand, statisticians also have much to offer such as providing analytic insights
that may help data owners understand their data better. Doing complex statistical analyses is
core business for NSIs, but not for, say, a mobile phone company. In these and other ways,
the relationship with data providers could potentially become true partnerships. For example,
one specific role that NSIs could play is that of a trusted third party. In a competitive market,
competitors will be reluctant to share sensitive data among each other. But they might be
willing to share it with an NSI that compiles statistical information that is beneficial to all.
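The trusted-third-party role can be sketched as a simple aggregation with a disclosure rule: the NSI pools sensitive values from competing firms but publishes a cell only when enough firms contribute to it, so no single contributor's data can be inferred. The threshold of three contributors and the data are illustrative assumptions; real statistical disclosure control is considerably more elaborate.

```python
# Hypothetical sketch of the trusted-third-party role: sum sensitive
# values per publication cell, and suppress any cell backed by too few
# contributing firms. Threshold and data are illustrative assumptions.

from collections import defaultdict

def safe_aggregates(records, min_contributors=3):
    """Sum values per cell; return None for cells with too few contributors."""
    cells = defaultdict(list)
    for firm, cell, value in records:
        cells[cell].append(value)
    return {cell: sum(values) if len(values) >= min_contributors else None
            for cell, values in cells.items()}

# One record per firm per cell: (firm, publication cell, sensitive value)
records = [("A", "region_1", 10), ("B", "region_1", 20), ("C", "region_1", 15),
           ("A", "region_2", 5),  ("B", "region_2", 7)]

print(safe_aggregates(records))
# region_1 is published (3 contributors); region_2 is suppressed (only 2)
```

The point of the design is that competitors never see each other's inputs: only the NSI holds the microdata, and the published aggregate is useful to all parties.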
Collaboration between NSIs and academia may grow as well. Universities have historically
been natural partners for NSIs. It stands to reason that such collaboration will extend to the
field of big data, for instance, in solving methodological problems, developing technical
solutions and training future data scientists. Such collaboration is also being supported by
public funders who are facilitating research and innovation partnerships through targeted
grants. By working in partnership, researchers in universities and NSIs could better leverage
such opportunities.
Furthermore, there are many commercial partners with which NSIs could collaborate. Google
and Facebook are two examples for which big data forms the core of their business model.
Their knowledge and the data to which they have access may be very relevant to NSIs. IT
companies also possess relevant knowledge on big data processing and storage, security,
cloud processing, etc. Apart from the provision of paid services, collaboration may be of
interest to them with a view to obtaining statistical expertise and for benchmarking or
validating their information products.
The relationship between the various stakeholders will involve each partner building on and
contributing different strengths and will likely result in flexible networks. Such networks are
flexible in the sense that membership of the network and the contribution of partners depend
on actual needs instead of being fixed in advance for a long time. The emerging data
ecosystem will also allow for forming ad hoc consortia to compete for research funds and
other subsidies, such as funds from the Horizon 2020 programme (European Commission,
2013). The ESSnet on Big Data mentioned earlier can also be seen as part of the data
ecosystem.
Partnerships are the core of the data ecosystem, and international organisations have tried to
document good practices. The GWG work on access and partnerships, such as the recommendations for access to data from private organisations mentioned in section 3.3, carried forward earlier work of UNECE on partnerships.14 Guidelines have also been drafted by PARIS21 in collaboration with the OECD (Robin et al, 2016).
14 http://www1.unece.org/stat/platform/display/bigdata/Guidelines+for+the+establishment+and+use+of+partnerships+in+Big+Data+Projects+for+Official+Statistics
Annex
Global Working Group on Big Data for Official Statistics
Recommendations for Access to Data from Private Organisations
for Official Statistics
Draft 14 July 2016
Preamble
The Global Working Group on Big Data for Official Statistics,
(1) Taking notice of the high and urgent need for access to data kept by private organizations
for the production of official statistics, such as indicators for the Sustainable Development
Goals and statistics on phenomena related to modern society, and the social responsibility
already shown by private organizations to provide access to new data sources, free of
charge, for purposes such as disaster relief and the fight against epidemics,
(2) Bearing in mind that in using such data the Fundamental Principles of Official Statistics, as endorsed by the UN General Assembly15, unconditionally apply, and that the statistical community has pledged to adhere to the professional ethics stated in the Declaration on Professional Ethics, as adopted by the International Statistical Institute16, thereby creating the foundation for sharing data for official statistics,
(3) Recognizing the legitimate interests of private organizations, including respect for their
business model and value proposition, and the need to guarantee a level playing field for
private organizations considering the burden created by providing data for official
statistics, as well as the legitimate interest of organizations in charge of compiling official
statistics to have equal access,
(4) Stressing that the burden to private organizations resulting from data requests for official
statistics must be fair in proportion to their envisaged public benefits and that the data
should be adequate and relevant in relation to the purposes for which they are requested,
(5) Considering that legislation aimed at accessing and using data kept by private organizations unavoidably lags the emergence of new types of data sources, that existing national and international legal frameworks fully apply but need interpretation in view of new data sources, especially concerning privacy, data ownership, reuse of data by third parties, and liability in case of breaches of confidentiality, and that there is thus a need for guidance,

15 Resolution 68/261, adopted by the General Assembly on 29 January 2014.
16 This declaration was adopted by the Council of the International Statistical Institute in its session of 22 and 23 July 2010, in Reykjavik, Iceland.
(6) Highlighting the need to create public trust by applying full transparency in the use of
data from private organizations for official statistics, in particular in view of privacy
concerns, given a number of well-publicized cases of likely abuse outside the realm of
official statistics, and the need to provide clarity concerning the possible use for statistical
purposes of personal data in customer contracts with private organizations, for instance by
referring to the Recommendations set out below,
(7) Acknowledging that private data sources are diverse in many respects, such as data
ownership, provenance of the data, purpose of collecting the data, and characteristics of
the data itself, and that providing access to the data can take a variety of shapes, such as
sending micro data to statistical agencies, providing aggregates compiled according to
specifications from statistical agencies, or providing on-site data access for analysis,
(8) Admitting that source and branch specific operational rules and guidelines may be needed
for dealing with access to data kept by private organizations, that such rules and
guidelines should be consistent with the Recommendations set out below, that before
access is requested for the purpose of producing official statistics data exploration may be
necessary in collaboration with the private data source, and that this requires the
development of partnerships between the private organizations providing the data and the statistical agencies using it,
Endorses the following recommendations for access to data from private organizations for
the production of official statistics:
Recommendations
Recommendation 1. The role of national and international systems of official statistics is to
provide relevant, high-quality information to society in an impartial way. This role is
indispensable to the well-functioning of societies. To this end, data is needed from private
organizations as inputs to these systems. In view of the emergence of new types of data
sources and the social responsibility of private organizations, these members of society are
called upon to make the data that is needed available to the statistical agency concerned, free
of charge, on a voluntary basis.
Recommendation 2. The data needed for official statistics may only be collected and
processed if the statistical agency concerned acts in full accordance with the Fundamental
Principles of Official Statistics.17 These principles guarantee, among other things, the
professional independence and accountability of the statistical agency, and the strictly
confidential use of the data, exclusively for statistical purposes.
17 http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
Recommendation 3. When data is collected from private organizations for the purpose of
producing official statistics, the fairness of the distribution of the burden across those
organizations has to be considered, in order to guarantee a level playing field.
Recommendation 4. Data requests for official statistics must acknowledge and take into
account the role of data in the business model and value proposition of private organizations,
in particular if their data has market value. There must be a fair balance between public and
business interests when data is requested and possible harm to business interests has to be
kept as low as possible.
Recommendation 5. The data must be adequate and relevant in relation to the purposes for
which it is requested from the private organization. No more data should be requested than
needed for these purposes. Operational arrangements have to be agreed on between the
private organization and the statistical agency concerned, taking into account business
concerns and data adequacy for official statistics. The metadata must also be adequate.
Recommendation 6. The cost and effort of providing data access, including possible pre-
processing, must be reasonable compared to the expected public benefit of the official
statistics envisaged.
Recommendation 7. When private organizations operate internationally, they are expected to
treat requests for data from national statistical systems in a non-discriminatory way, unless
different treatment is justified by differences in the national legislative frameworks
concerned, and provided that adherence to the Fundamental Principles of Official Statistics is
guaranteed in theory as well as in practice.