

    BIG data

    for Official Statistics

    Peter Struijs

    Department for Methodology and Process Development

    Statistics Netherlands

    E-mail: [email protected]



    AURKEZPENA

    Urtez urte, Nazioarteko Estatistika Mintegia gogotsu dator; aurten XXIX edizioa izanda,

    “Big Data” estatistika ofizialean aurkeztu dugu, Vitoria-Gasteizko Europa Biltzar Jauregian,

    2016ko azaroaren 21ean.

    1983. urtetik hona, estatistika alorrean mundu mailan ari diren ikertzaile aitzindari eta

    ospetsuak gure Estatistika Mintegira irakasle etorri izana, ohore handia da. Oraingo honetan,

    gure gonbidatu nagusia Peter Struijs izan da, Statistics Netherlands (SN) Big Data

    programaren koordinatzailea (Herbehereetatik etorria).

    Berarekin batera, arratsaldeko saioan Pedro Alberto González (Datuak Babesteko Euskal

    Bulegoa), Jerónimo Hernández eta Iñaki Inza (Donostiako EHU-ko Informatika Fakultatea)

    eta Javier San Vicente eta Jorge Aramendi (EUSTAT) izan ditugu.

    Aurtengo helburu nagusia, arlo guztietako gizarte orokorrera zuzentzea izan da, bai erakunde

    publikoei eta enpresa pribatuei, bai unibertsitate eta estatistika arloko lankideei eta abar.

    Aintzat hartuta Big Data gaur egungo eta etorkizuneko gai garrantzitsua izango dela, aurretik

    prestakuntza jaso beharreko ardura dugu.

    Honen zabalkundea ahalik eta pertsona eta erakunde gehienetara iritsi ahal izateko, Eustat-eko

    web orrira jo dezakezue, www.eustat.eus, Nazioarteko Estatistika Mintegiari buruzko

    informazioa izan dezazuen.

Bertan liburu honen eta 1983. urtetik aurrerako hizlarien txostenak eta lanak on-line eskura

    dituzue. Teknologiaren abantailarekin batera estatistika ezagutza mundu osora ahalik gehiena

    zabaltzea nahi dugu.

Vitoria-Gasteiz, 2016ko azaroa

JOSU IRADI ARRIETA
EUSTAT-eko Zuzendari Nagusia



    PRESENTATION

Year after year, we look forward to the International Statistics Seminar with enthusiasm, this being the XXIX edition since its inception. On this occasion we presented the topic of "Big Data" in official statistics at the Seminar held at the Europe Conference and Exhibition Centre in Vitoria-Gasteiz on November 21, 2016.

    Since 1983, it has been an honour to have been able to attract innovative and recognized

    researchers in statistics on a global level to speak at our International Statistics Seminar.

This time the main guest was Peter Struijs, coordinator of the Statistics Netherlands (SN) Big Data programme. Also participating along with him in the afternoon session were Pedro Alberto

González (Basque Data Protection Agency), Jerónimo Hernández and Iñaki Inza (Faculty of Informatics of Donostia-San Sebastián, UPV/EHU) and Javier San Vicente and

    Jorge Aramendi (EUSTAT).

The main objective this year was to address all areas of society: private companies and public organisations, the university field, workers in the statistics sector, and so on. We have to

    keep in mind that "Big Data" is a current issue and of great importance in the future, so it is

    our responsibility to prepare and train ourselves before then.

    In order for this news to reach as many interested people and institutions as possible, you

    have at your disposal information about the International Statistics Seminar on the Eustat

    website, www.eustat.eus.

    Available within this section of the website are both this book and all the papers and technical

    notes made by previous speakers since 1983. We want to contribute to the expansion of

    statistical knowledge on a global level through the advantages of technology.

    Vitoria-Gasteiz, November 2016

JOSU IRADI ARRIETA
Director General of EUSTAT



PRESENTACIÓN

Año tras año, recibimos el Seminario Internacional de Estadística con entusiasmo, siendo ya

    la XXIXª edición desde su creación. En esta ocasión hemos presentado el tema “Big Data” en

    la estadística oficial, celebrado en el Palacio de Congreso Europa de Vitoria-Gasteiz, el día

    21 de noviembre de 2016.

    Desde 1983, es un honor haber logrado traer investigadores pioneros y reconocidos en

    materia estadística a nivel mundial, para ser ponentes de nuestro Seminario Internacional de

    Estadística.

En este caso, el invitado principal ha sido Peter Struijs, coordinador del programa de Big

    Data de Statistics Netherlands (SN) (Países Bajos). Junto a él, en la sesión de tarde, también

    participaron Pedro Alberto González (Agencia Vasca de Protección de Datos), Jerónimo

    Hernández e Iñaki Inza (Facultad de Informática de Donostia-San Sebastián- EHU-UPV-), y

    Javier San Vicente y Jorge Aramendi (EUSTAT).

El principal objetivo de este año ha sido dirigirnos a todos los ámbitos de la sociedad en

    general, tanto a la empresa privada como a los organismos públicos, al campo Universitario,

a trabajadores del área estadística, etc. Tenemos que tener en cuenta que el “Big Data” es un

    tema de actualidad y de gran importancia en un futuro, por lo que es nuestra responsabilidad

    prepararnos y formarnos previamente.

    Para que esta difusión llegue al mayor número posible de personas e instituciones

    interesadas, tenéis a vuestra disposición información sobre el Seminario Internacional de

    Estadística en la página web de Eustat, www.eustat.eus.

    Desde esta sección de la web están disponibles on-line tanto este libro como todos los trabajos

    y cuadernos técnicos realizados por los anteriores ponentes desde 1983. A través de las

    ventajas de la tecnología, queremos contribuir a la expansión del conocimiento de estadística

    a todo el mundo.

    Vitoria-Gasteiz, noviembre 2016

JOSU IRADI ARRIETA
Director General de EUSTAT



    BIOGRAFI OHARRAK

    Peter Struijs Statistics Netherlands (SN) Big Data programaren koordinatzailea da; Europar

    Batasuneko ESSnet (European Statistical System network) Big Data taldea koordinatzen du

    eta Nazio Batasuneko Global Working Group on Big Data for Official Statistics-eko kide da.

    Big Data-rekin ekin aurretik, Peter Statistics Netherlands-eko Open Dataren arduraduna izan

    zen. Urte askotan, prozesuak garatzeko eta kalitatea kudeatzeko unitate arloko burua izan zen.

    Lehenago EUROSTAT-en, Europar Batasuneko Estatistika Bulegoan, lan egin zuen.

    Statistics Netherlandsen metodologian aditu gisa hasi zen. Horrez gain, ISI-ko (International

    Statistical Institute) kide hautatua da.

BIOGRAPHICAL SKETCH

Peter Struijs is coordinator of the Big Data programme of Statistics Netherlands (SN),

    coordinates the ESSnet Big Data of the EU and is a member of the UN Global Working

    Group on Big Data for Official Statistics.

    Before being engaged in Big Data, Peter was responsible for open data at SN. For many

    years, he held the position of Head of Unit for process development and quality management.

    Earlier, he worked at Eurostat, the Statistical Office of the EU. He started work at SN as a

    methodologist and he is an elected member of the International Statistical Institute.

NOTAS BIOGRÁFICAS

Peter Struijs es coordinador del programa de Big Data de Statistics Netherlands (SN),

coordina el grupo de Big Data de ESSnet (European Statistical System network) de la Unión

    Europea y es miembro de Global Working Group on Big Data for Official Statistics de las

    Naciones Unidas.

    Antes de dedicarse al Big Data, Peter fue responsable de Open Data en Statistics

Netherlands. Ocupó durante muchos años el cargo de Jefe de Unidad de desarrollo de procesos

    y gestión de la calidad.

    Previamente, trabajó en Eurostat, Oficina de Estadística de la Unión Europea. Comenzó a

    trabajar en Statistics Netherlands como especialista en metodología. Además, es miembro

    electo de ISI (International Statistical Institute).


Index

1. Introduction
1.1. The notion of big data
1.2. Types of big data sources
1.3. The use of big data
2. Examples of big data for official statistics
2.1. Traffic loop data
2.2. Mobile phone data
2.3. Social media data
3. Big data and the statistical process
3.1. From well-known to new processes and methods
3.2. Methodological issues
3.3. Process issues
4. Getting ready for big data
4.1. Organising for big data
4.2. Towards a data ecosystem
Annex
References


    1. Introduction

Big data seems to be a hype. According to Google Trends, in August 2012 it overtook “open data” as a search term (Struijs and Daas, 2013). Hype or not, big data is highly relevant to official statistics, since it has to do with the exponential increase of data registered through networks of sensors, cameras, public administrations, banks, enterprises, mobile networks, satellites, drones, social networks, internet sites, etc. This not only creates many opportunities for improving official statistics, such as reporting on phenomena whose measurement used to be out of reach, but also profoundly influences the context in which statistics are produced, for better or for worse. And even if big data is a hype, that does not mean attention to it will diminish after a peak: the term “big data” may fade over time, but as an important phenomenon it will most probably last.

    Big data has the potential to become a game changer for National Statistical Institutes (NSIs).

There are many issues with big data that may have an impact on NSIs, such as the required statistical methodology, the way data is obtained, privacy considerations, the need for an appropriate IT infrastructure, the skills needed to deal with big data, the quality of statistics based on big data, and the positioning of NSIs in the emerging data society. The possible strategic impact of big data for official statistics was recognised by several NSIs some years ago, and in 2013 the Directors-General of the NSIs of the European Statistical System (ESS) adopted the so-called Scheveningen Memorandum on Big Data and Official Statistics (DGINS, 2013), in which a course of action was set out, including the drafting of an ESS action plan and roadmap.

    The resulting momentum led to the development of new approaches to deal with big data.

    However, this subject is far from being settled. In that sense the subject of big data is different

    from other areas of statistics, which benefit from established, validated approaches. This

    document provides an overview of the evolving field of big data for official statistics. It aims

    at showing the main issues when dealing with big data and provides access to the literature

    and guidelines that are being developed by various national and international organisations. It

is not meant to give the kind of definitive answers that are available for more traditional areas of statistics. Although the document is intended to be balanced, it does reflect the specific experience of the author in international big data initiatives and in the use of big data by Statistics Netherlands. Parts of the text are based on earlier papers by the author.

    The remainder of this chapter comprises an introduction to the notion of big data, a typology

    of such data sources, and an overview of potential uses. Chapter 2 discusses three examples of

    the use of big data. Building on these examples, the third chapter looks into methodological

    and other issues related to the statistical process, including data access and privacy issues,

    which are proving to be a significant bottleneck for realising the potential of the use of big

data for official statistics. Chapter 4 is concerned with what has to be done in order to prepare for a future in which big data becomes an important source for official statistics. The international statistical community has been very active in supporting the use of big data, and throughout this document references are given to what has been achieved so far.

    1.1. The notion of big data

    The concept of big data is not clear-cut. Many attempts have been made to define big data, but

    no single definition is generally accepted. Most experts agree that big data is characterised by

    volume, velocity and variety, the three V’s, and some add a V for veracity, but these

    characteristics may not apply all at the same time (Mayer-Schönberger and Cukier, 2013).

    Volume in itself is not enough to consider data “big”. Moore’s Law stems from 1965, and the

    volume of data has been increasing for many decades. What threshold was passed a couple of

    years ago to start talking about big data? Apparently, no specific one. The emergence of the

    concept of big data appears to result from qualitative changes induced by changes in data

    quantity and public availability. We seem to have reached a point where the traditional way of

    using data does not provide the answers to the new questions that arise – or not fast enough. It

    may be noted that what is seen as “high volume” at one moment may not be considered very

    voluminous several years later, because of advancing technological possibilities to deal with

    large data quantities. In that sense big data is also a relative notion.

    In the context of official statistics, big data is generally considered as a data source. An

    attempt was made by UNECE, the UN Economic Commission for Europe, to define big data

for statistical purposes. Building on a definition by Gartner (Laney, 2012), it defined big data as follows (Glasson et al., 2013):

    Big data are data sources that can be –generally– described as: “high volume, velocity

    and variety of data that demand cost-effective, innovative forms of processing for

    enhanced insight and decision making.”

    However, this definition is not precise enough to decide in concrete cases whether the data

    source belongs to big data or not. Among statisticians there is some discussion on whether

    high-volume data from administrative sources is included in the notion of big data, and

    scanner data is considered big data by some, but not by all. Since government may make use

    of sensors, e.g. road sensors, which are considered part of the Internet of Things, the

    governmental origin of the data does not preclude that it should be considered big data.

In any case, rather than trying – possibly in vain – to give a more precise definition, it may help to mention aspects of big data sources that many statisticians regard as characteristic of such sources, and to supplement this with examples of data sources that many statisticians consider big data sources. In this way a picture of big data can be obtained that is clear enough to allow progress without getting stuck in discussions on definitions, which can be found in abundance on the internet.


In statistics, too, high volume is not a sufficient condition for data to be considered big data. In fact, there are very high-volume traditional data sources, such as comprehensive tax registers, that are not necessarily considered to be big data. Other characteristics often

    mentioned are the novelty of the data source, the dynamics of its population, the need to use

    new methodological approaches, the essentially new character of the resulting information,

    the possible need to process the data at the source, the unstructured nature of the data, the

    reference of the data to events, the circumstance that the data is often a by-product of the

    principal activity of an organization, and their physical distribution over several databases or

    points of measurement. These characteristics do support the assumption that the emergence of

    the concept of big data has to do with the qualitative changes that come with quantitative ones

    (Struijs and Daas, 2013).

    1.2. Types of big data sources

    Especially in the situation where there is not a generally accepted, unambiguous definition of

    big data, it helps to have a list of concrete big data sources. For UNECE, an international task

    team developed a typology of big data sources in 2013, comprising three main categories. The

    first is (human-sourced) social networks, which refers to digitized information, which is

    loosely structured. The second category is process-mediated data from traditional business

    systems, such as data on the registration of customers, product manufacturing, taking of

    orders, etc. The data tend to be highly structured, including reference tables, relationships and

    metadata, making the use of relational database systems possible. The third category is the

    machine-generated data of the Internet of Things. Sensors and machines record events and

    situations in the physical world, and the data can be simple or complex, but is often well-

    structured. Its size and speed is beyond traditional approaches. This is the full typology1:

    1. Social Networks (human-sourced information):

    1100. Social Networks: Facebook, Twitter, Tumblr etc.

    1200. Blogs and comments

    1300. Personal documents

    1400. Pictures: Instagram, Flickr, Picasa etc.

    1500. Videos: YouTube etc.

    1600. Internet searches

    1700. Mobile data content: text messages

    1800. User-generated maps

    1900. E-Mail

    2. Traditional Business systems (process-mediated data):

    21. Data produced by Public Agencies

    2110. Medical records

    22. Data produced by businesses

    2210. Commercial transactions

    2220. Banking/stock records

1 http://www1.unece.org/stat/platform/display/bigdata/Classification+of+Types+of+Big+Data


    2230. E-commerce

    2240. Credit cards

    3. Internet of Things (machine-generated data):

    31. Data from sensors

    311. Fixed sensors

    3111. Home automation

    3112. Weather/pollution sensors

    3113. Traffic sensors/webcam

    3114. Scientific sensors

    3115. Security/surveillance videos/images

    312. Mobile sensors (tracking)

    3121. Mobile phone location

    3122. Cars

    3123. Satellite images

    32. Data from computer systems

    3210. Logs

    3220. Web logs
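For readers who want to work with the typology programmatically, it can be held as a simple code-to-label table. The Python sketch below is illustrative only: it uses a subset of the codes listed above, and the lookup helper is a hypothetical convenience, not part of the UNECE standard.

```python
# Sketch only: a subset of the UNECE typology as a code-to-label table.
# Codes and labels are taken from the list above; the helper is hypothetical.
UNECE_TYPOLOGY = {
    "1": "Social Networks (human-sourced information)",
    "1100": "Social Networks: Facebook, Twitter, Tumblr etc.",
    "1600": "Internet searches",
    "2": "Traditional Business systems (process-mediated data)",
    "2110": "Medical records",
    "2210": "Commercial transactions",
    "3": "Internet of Things (machine-generated data)",
    "3113": "Traffic sensors/webcam",
    "3121": "Mobile phone location",
}

def main_category(code: str) -> str:
    """The first digit of any typology code identifies its main category."""
    return UNECE_TYPOLOGY[code[0]]
```

The hierarchical numbering does the work here: for instance, `main_category("3121")` resolves mobile phone location data to the Internet of Things category.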

In 2015, the Global Working Group on Big Data for Official Statistics (GWG), created by UNSD, the Statistics Division of the UN, also looked at the question of how to classify big data, taking the UNECE results as a starting point. This followed a 2015 UNSD survey among NSIs on the use of big data for official statistics, which included the question:

    “On which topics do you see an urgent need for statistical guidance for your office or national

    statistical system?”; one of the topics listed was “classification of big data”. The question was

    answered by 89 respondents. Of these, 73% indicated that guidance on the classification of

    big data had a “high” (37%) or “medium” (36%) urgency2.

    The GWG approached the question of grouping big data sources as a classification problem.

    Classifications usually have a subject, a scope (or universe), and one or more levels of (sub)

    classes describing possible characteristics of the subjects, based on explicit or implicit

    classification criteria. Classifications are designed on the basis of their intended uses.

    Concerning the intended uses, the first use of a big data classification is providing a so-called

    extensive definition of big data, i.e., an enumeration of types of big data. Any guidelines, for

    instance on methods to be used when dealing with big data, could refer to the classification. It

    can also be used for policy issues, such as having a well-defined scope for projects. For

    instance, in February 2016 an ESSnet on Big Data (a research project) was launched, for

    which pilot projects were selected by assessing the categories of the UNECE typology. It was

    also used as a reference for the UNSD survey just mentioned.

2 http://unstats.un.org/unsd/trade/events/2015/abudhabi/presentations/day1/04/UNSD%20-%20Global%20Survey%20on%20Big%20Data.pdf


    One possibly important future use of the classification is as a reference in the discussion on

    the possible use of big data for compiling SDG indicators3. Only very few countries have

    started looking at the usability of big data for deriving indicators to measure progress on the

    SDGs, as was shown in the UNSD survey. Therefore, it may be too early to know how it will

    be used, but it is clear that the usability of big data for such indicators will be a relevant

    factor, possibly with a further decomposition such as the SDG goals, targets or indicators that

    could be measured using each big data source. This is also being explored by the GWG.

    The intended uses inform the classification criteria to be used. The GWG identified fifteen

    potential classification criteria (GWG, 2015):

    1. characteristics of the data itself

    2. local versus global sources

    3. regulatory framework applicable

    4. main product versus by-product

    5. purpose and subject of the data

    6. original versus derived data

    7. relationship data source with organisation (e.g. data platforms)

    8. public versus private organisation providing the data

    9. data sourced by humans versus machines

    10. degree of stability of the source

    11. degree of accessibility

    12. real-time versus accumulated data

    13. statistical methodology required for using the data

    14. domains of usability

    15. usability for SDG indicators

    The first criterion, characteristics of the data itself, includes eight possible characteristics:

    high volume, high velocity, high variety, high veracity, selectivity, (lack of) structure, high

    population dynamics, and event-based data.

The GWG is currently engaged in developing the classification, which should be flexible and able to evolve over time. Initially, this would probably mean relatively short periods between revisions. Flexibility may also be obtained by constructing a system for classifying big data sources on demand rather than a fixed classification. In that case, methods and rules would be needed, and possibly a larger number of criteria could be accommodated. This work is ongoing.
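A "system for classifying big data sources on demand" could look roughly like the following Python sketch: each source carries values for whichever criteria are known, and a group is formed by querying on a criterion rather than by a fixed hierarchy. The source names, criterion names and values here are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of classification on demand: sources are described by
# criterion values, and groups are formed by query rather than fixed classes.
@dataclass
class BigDataSource:
    name: str
    criteria: dict = field(default_factory=dict)  # criterion -> value

def select(sources, criterion, value):
    """All sources whose recorded value for `criterion` equals `value`."""
    return [s.name for s in sources if s.criteria.get(criterion) == value]

sources = [
    BigDataSource("road sensors", {"sourced_by": "machines", "by_product": True}),
    BigDataSource("social media", {"sourced_by": "humans", "by_product": True}),
    BigDataSource("scanner data", {"sourced_by": "machines", "by_product": True}),
]
```

Under this design, adding a sixteenth criterion is just a new key, which is one way the larger number of criteria mentioned above could be accommodated.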

    Other lists of big data sources that are not clearly linked to the UNECE classification are used

    elsewhere. Many big data overview papers contain lists of big data sources. A recent example

    is a paper by Kitchin (2015), which contains a table linking big data sources to data types and

    statistical domains, but there are many more cases of ad hoc classifications of big data.

    3 SDG = Sustainable Development Goals. These have been agreed on at UN level.


    Companies that offer services related to big data may use their own classifications, such as

    IBM (2014).

    1.3. The use of big data

There is a gap between the actual and the potential use of big data. The potential use comprises the following categories:

1. production of new products
2. provision of more detail in statistics
3. more timely statistics
4. addition of nowcasts or early indicators to statistics
5. quality improvement
6. response burden reduction
7. cost reduction and higher efficiency

    New products may be statistics on phenomena about which no official statistics were

    previously available. An example would be a general sentiment index on the basis of public

    social media messages. Where there is new demand, such as for the SDG indicators, big data

    can also be considered. New products may also be new visualisations of data (Tennekes,

2014). In some cases big data can be used as a single source, but combining big data and traditional sources is in many cases a more promising approach for new products. For new products one needs benchmarks, based on established, validated methods, in order to assess the quality of the new products.

    More detail in statistics may be provided along several dimensions, for instance higher

    regional detail on the basis of big data sources, or more temporal detail such as monthly

    estimates where previously there were only quarterly data. Usually higher detail requires

    regular statistics that are produced using existing sources and methods, the detail being

    derived from an additional big data source. For instance, if a survey has only limited regional

    detail because of the sample size, one may explore whether Google Trends at a lower regional

    level can be used to provide a picture of the lower level. However, this may be more difficult

    than one might think (Reep and Buelens, 2015).

    Making statistics more timely is a traditional goal of official statisticians, which has its limits

    if surveys are used, or if data from administrative sources lag reality, as is for example

generally the case with fiscal data. However, big data sources may be much faster, for instance when manual price collection is compared with web scraping by internet robots. One may also make use of correlations between big data sources and other sources to generate more timely outcomes by means of a model.

One step further is to produce early indicators or nowcasts for more traditional statistics. They supplement these statistics rather than replace them. Early indicators and nowcasts often depend heavily on correlations and model assumptions, but these quality issues may be accepted because the final figures are produced later, so there is still a benchmark. If the assumptions behind early indicators and nowcasts are clearly communicated to users, and the quality drawbacks are dealt with in a transparent way, big data may play an important role in fulfilling the strong user demand for early information on phenomena, however provisional that information may be.
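The model-based route can be illustrated with a minimal least-squares sketch in Python, with all figures invented: an official series y is regressed on a timely big data indicator x, and the fitted line is applied to the newest x before y is observed.

```python
# Minimal nowcasting sketch with invented data: fit y = a + b*x on history,
# then apply the line to the latest indicator value for a provisional figure.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b              # intercept, slope

x_hist = [1.0, 2.0, 3.0, 4.0]          # timely big data indicator (e.g. scraped)
y_hist = [2.1, 3.9, 6.0, 8.0]          # official figures, published with a lag
a, b = fit_line(x_hist, y_hist)
nowcast = a + b * 5.0                  # latest x observed, y not yet available
```

The nowcast stands until the official figure arrives and serves as its benchmark, which is exactly the quality safeguard described above.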

    Big data can also be used to improve the quality of statistics in the sense of improving

    accuracy and reliability. (In fact, relevance, timeliness and clarity, to which the first four uses

    mentioned above contribute, can also be seen as quality aspects, as is done in the European

    Statistics Code of Practice4.) For improving accuracy and reliability, big data sources are

    generally used as complementary sources to existing sources. This includes using big data for

    checking the plausibility of statistical outcomes.

    Response burden reduction is an important aim of many NSIs, some of which apply specific

    reduction targets. Of course, the response burden can be especially reduced if data collected

    by means of questionnaires can be replaced by data from other sources. Replacing surveys

with big data, however, is not easy. In some cases, such as internet data for prices, big data is already being used successfully (Ten Bosch and Windmeijer, 2014), but usually big data sources have more potential if they are used as additional sources, thereby not eliminating surveys but reducing their sample size, frequency or level of detail.

    Cost reduction and higher efficiency naturally go together with response burden reduction, but

    NSIs may have separate targets for them. In fact, the trend seems to be that NSIs try, where

possible, to have a so-called zero footprint, by which is meant that NSIs make use of all information they can get without causing any cost or burden. In some countries the population census has already been replaced by estimates made from administrative sources, and big data has the potential to reduce cost and burden in other areas as well.

    The seven categories are not mutually exclusive. On the contrary, making use of big data may

    serve several purposes at the same time. If continuation of the precise current statistical

    programme is a requirement, it may appear that the possibilities of using big data are limited.

    If there is some flexibility in the programme, the potential of big data to increase total user

    satisfaction is much higher, given the increased availability of data sources. Then a new

    optimum may be found. In fact, the availability of big data sources requires a new

    optimisation effort, aimed at getting the best set of statistical services, given the potential data

    sources, the demand for statistical information and budgetary constraints and possibilities

    (Struijs and Daas, 2013).

    The 2015 UNSD survey gives insight in the actual use of big data for official statistics. The

    main reasons for considering the use of big data given by the 89 respondents were the

    production of more timely statistics and the reduction of response burden. However, big data

    4 http://ec.europa.eu/eurostat/web/quality/european-statistics-code-of-practice



    is not yet widely used or considered for use; scanner data, used by 22% of respondents, scores

    highest. Satellite data is used by 19% of the respondents, and web scraping by 16.5%. Big

    data is used much more often by OECD respondents than by non-OECD respondents, with

    the exception of satellite data, which is used by 22% of OECD and 17% of non-OECD

    countries.

    Concerning the statistical domains in which big data is used, the top three consist of price

    statistics (30%), population statistics (15%) and labour statistics (14%). It should be noted,

    however, that the use of big data is in most cases only for exploratory purposes, in pilot

    projects, and in many cases these pilot projects are in an initial phase.


    2. Examples of big data for official statistics

    In order to understand the issues that arise from using big data or the intention to use it, it

    helps to look at examples first. In official statistics there are not yet many examples of actual

    use of big data in regular statistics outside price statistics (use of scanner data and web

    scraping). There are more examples of research into the potential use of big data for official

    statistics, such as on the use of mobile phone data or satellite data. Outside official

    statistics there are many more examples, such as the well-known Billion Prices Project of

    MIT5, or the use of social media messages for research or for commercial purposes.

    The examples presented here are from official statistics, mainly in the Netherlands. The first

    concerns the use of road sensor data, which is already being used for regular statistics. This is

    a big data source without major data access issues, since the data is available from an

    administrative source for statistical purposes for free. The second example is about the use of

    mobile phone data, where data access is a big issue indeed, but where the potential uses are

    many. Several countries are currently trying to get access to and use this data, and the data is

    also exploited commercially. The third example concerns the use of public social media

    messages, which poses particular methodological challenges. Together these examples show

    many of the issues and possibilities of big data, and they have been documented in the

    literature. The discussion in this section is mainly based on a paper written for the UNECE

    (Struijs and Daas, 2013), some of the text of which is reused and updated.

    2.1. Traffic loop data

    In the Netherlands, approximately 230 million traffic loop detection records are generated a

    day. This data can be used as a source of information for traffic and transport statistics and

    potentially also for statistics on other economic phenomena. The data is provided at a very

    detailed level. More specifically, for more than 20,000 detection loops on Dutch roads, the

    number of passing cars in various length classes is available on a minute-by-minute basis.

    The downside of this source is that it seriously suffers from undercoverage and selectivity.

    The number of vehicles detected is not available for every minute in all cases, and not all

    Dutch roads have detection loops yet, although all main roads do. Fortunately, the first

    problem can be corrected by imputing a missing minute with data reported at the same

    location during the 5-minute interval before or after it (Daas et al., 2015). Coverage is improving

    over time. Gradually more and more roads have detection loops, enabling a more complete

    coverage of the most important Dutch roads. In one year more than 2000 loops were added.
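The imputation rule described above can be sketched as follows. This is a minimal illustration, not the actual procedure of Daas et al. (2015): the function name and the averaging rule over the surrounding 5-minute window are assumptions made here for clarity.

```python
def impute_minute_counts(counts):
    """Fill missing minute counts for one detection loop.

    `counts` maps minute-of-day to a vehicle count, with None for
    minutes the loop did not report.  A missing minute is imputed
    with the average of the counts reported by the same loop within
    the 5-minute window around it (illustrative rule).
    """
    imputed = dict(counts)
    for minute, value in counts.items():
        if value is not None:
            continue
        window = [counts.get(m) for m in range(minute - 5, minute + 6)]
        observed = [v for v in window if v is not None]
        if observed:
            imputed[minute] = sum(observed) / len(observed)
    return imputed

# Minute 3 was not reported; it is filled from the observed neighbours.
series = {0: 40, 1: 44, 2: 42, 3: None, 4: 46, 5: 41}
print(impute_minute_counts(series)[3])
```

In the real source the correction also has to cope with loops that fail for longer stretches, which is why coverage improvements over time matter.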

    A considerable part of the loops are able to discern vehicles in various length classes,

    enabling the differentiation between cars and trucks. This is illustrated in Figure 1. In this

    figure, for the whole of the Netherlands, normalized profiles are shown for 3 classes of

    5 http://bpp.mit.edu/



    vehicles. The vehicles were differentiated in three length categories: small (<5.6 meter), medium-sized (between 5.6 and 12.2 meter) and large (>12.2 meter). The results after correction

    for missing data were used. Because the small vehicle category comprised around 75% of all

    vehicles detected, compared to 12% for the medium-sized and 13% for the large vehicles, the

    normalized results for each category are shown.

    Figure 1. Normalized number of vehicles detected in three length categories on December

    1st, 2011, after correcting for missing data. Small (<5.6 meter), medium-sized (between 5.6

    and 12.2 meter) and large (>12.2 meter) vehicles are shown in black, dark grey and grey,

    respectively. Profiles are normalized to more clearly reveal the differences in driving

    behaviour.
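Because the small vehicle category dominates the raw counts, the profiles are normalized before comparison. A minimal sketch of such a normalisation (dividing each class profile by its own total, which is one possible choice; the actual scaling used for Figure 1 is not specified in detail):

```python
def normalize_profile(counts):
    """Scale a daily count profile so that vehicle classes are comparable.

    Dividing each count by the class total turns absolute volumes
    (small vehicles are ~75% of all traffic) into shares of the day,
    so the rush-hour shapes can be compared directly.
    """
    total = sum(counts)
    return [c / total for c in counts]

small = [10, 80, 30, 20]   # toy hourly counts for small vehicles
large = [2, 8, 6, 4]       # large vehicles: far fewer, different scale
print(normalize_profile(small))
print(normalize_profile(large))
```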

    The profiles clearly reveal differences in the driving behaviour of the vehicle classes. The

    small vehicles have clear morning and evening rush-hour peaks at 8 am and 5 pm,

    respectively. The medium-sized vehicles have both an earlier morning and evening rush hour

    peak, at 7 am and 4 pm, respectively. The large vehicle category has a clear morning rush

    hour peak around 7 am and displays a more distributed driving behaviour during the

    remainder of the day. After 3 pm the number of large vehicles gradually declines. Most

    remarkable is the decrease in the relative number of medium-sized and large vehicles detected

    at 8 am, during the morning rush hour peak of the small vehicles. This may be caused by a

    deliberate action of the drivers of the medium-sized and large vehicles, wanting to avoid the

    morning rush hour peak of the small vehicles.

    At the most detailed level, that of individual loops, the number of vehicles detected

    demonstrates (highly) volatile behaviour, indicating the need for a more statistical approach

    (Daas et al., 2015). Harvesting the vast amount of information from the data is a major

    challenge for statistics. For this, visualisation techniques can be very useful (Tennekes and

    Puts, 2015). Making full use of this information would result in speedier and more robust


    statistics on traffic in general and would provide more detailed information on the traffic of

    large vehicles, which is very likely indicative of changes in economic development.

    Since 2015, Statistics Netherlands has published regular statistics on the traffic intensity of the

    main roads, based on this source6. Interestingly, the potential of big data was demonstrated at

    the beginning of 2016, when the first three working days of the year were extremely frosty,

    with icy roads in the north of the country. With the process already in place, it was possible to

    publish a press release on the eighth of January reporting on the use of the main roads in the

    north of the country, in which a comparison was made with the first three working days of

    previous years. Road use was shown to have been halved7.

    2.2. Mobile phone data

    The use of mobile phones nowadays is ubiquitous. People often carry phones with them and

    use their phones throughout the day. The infrastructure enabling mobile phone coverage

    rests on masts or towers, called ‘sites’ in the industry. Those sites

    are located at strategic points, covering as wide an area as possible.

    Much of the activity that is associated with handling the phone traffic, that is, handling the

    localisation of mobile phones and optimising the capacity of a site, is stored by the mobile

    phone company. So mobile phone companies record data that are very closely associated with

    behaviour of people; behaviour that is of interest to NSIs. Obvious examples are behaviour

    regarding tourism, mobility, commuting and transport. The destinations and residences of

    people during daytime are also topics of various surveys. Data from mobile phone companies

    could provide additional and more detailed insight into the whereabouts and activity of their

    users, which may be indicative of the behaviour of people in general.

    Several NSIs have tried to get access to mobile phone location data and explore the

    possibilities for statistics. In the Netherlands, research on this has been going on for some

    time now. A dataset from a mobile telecommunication provider was studied, containing

    records of all call-events (speech calls and text messages) on its network in the Netherlands

    for a period of two weeks, about 35 million records a day. Each record contains

    information about the time and serving antenna of a call-event and a (scrambled version of

    the) identification number of the phone. Getting the data proved to be very complex. This

    study revealed several uses for official statistics, such as economic activity, tourism,

    population density, mobility and road use (De Jonge et al., 2012). In particular, the place

    where people are at any time during the day can be compared to the place where people are

    registered at municipalities. For these purposes, good visualisation is essential (Tennekes and

    Offermans, 2014).
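As an illustration of the kind of derivation involved, the records described above (time, serving antenna, scrambled phone identifier) can be aggregated into distinct-phone counts per antenna and hour, a rough proxy for where people are during the day. The record layout and function below are assumptions for illustration, not the actual processing of De Jonge et al. (2012).

```python
from collections import defaultdict

def phones_per_antenna_hour(call_events):
    """Count distinct scrambled phone IDs per (antenna, hour).

    Each call event is a (scrambled_id, antenna, hour) tuple, mirroring
    the record layout described in the text.  Counting distinct phones
    per antenna and hour gives a rough daytime-population proxy.
    """
    seen = defaultdict(set)
    for scrambled_id, antenna, hour in call_events:
        seen[(antenna, hour)].add(scrambled_id)
    return {key: len(ids) for key, ids in seen.items()}

events = [("a1", "site-12", 9), ("a2", "site-12", 9),
          ("a1", "site-12", 9),            # same phone again: not double-counted
          ("a3", "site-40", 9)]
print(phones_per_antenna_hour(events))
```

Counting distinct identifiers rather than events avoids over-weighting heavy phone users, one of the selectivity issues discussed later.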

    6 http://www.cbs.nl/NR/rdonlyres/25CE3592-A756-42B7-BABF-C3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf

    7 http://www.cbs.nl/nl-NL/menu/themas/verkeer-vervoer/publicaties/artikelen/archief/2016/helft-minder-verkeer-in-noord-nederland-door-ijzel-januari-2016.htm (in Dutch).


    Recent research by Statistics Belgium on the use of mobile phone data also showed the

    potential of using mobile phone data for official statistics (De Meersman et al, 2016). The

    Belgian NSI is one of the more successful NSIs in obtaining access to mobile phone data, and

    promotes the formation of mutually beneficial partnerships (Debusschere, 2016).

    At the level of the ESS, the importance of securing access to data from mobile network

    operators has been recognised. The use of mobile phone data is one of the research areas of

    the ESSnet on Big Data, mentioned earlier. Part of this research is aimed at solving access

    issues. In September 2016, a workshop was organised by this ESSnet with mobile network

    operators8, to see what can be done to exploit access and partnership opportunities. This will

    be further discussed in chapter 4.

    2.3. Social media data

    So far social media messages have not been used for regular official statistics, but their

    potential use is increasingly being researched. In the Netherlands, more than one million

    public social media messages are produced on a daily basis. These messages are available to

    anyone with internet access. Social media is a data source where people voluntarily share

    information, discuss topics of interest, and contact family and friends. To find out whether

    social media is an interesting data source for statistics, Dutch social media messages were

    studied from two perspectives: content and sentiment.

    Studies of the content of Dutch Twitter messages (Twitter being the predominant public social

    media platform in the Netherlands at the time of the study) revealed that nearly 50% of those

    messages were composed of 'pointless babble'. The remainder predominantly discussed spare

    time activities (10%), work (7%), media (TV & radio; 5%) and politics (3%). Use of these,

    more serious, messages was hampered by the less serious 'babble' messages. The latter also

    negatively affected text mining studies.
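To give an idea of the filtering step involved, the sketch below keeps only messages that mention a topical keyword. This is a deliberately naive stand-in: the studies referred to used proper text-mining classifiers, and the keyword list here is invented.

```python
def filter_babble(messages, topic_keywords):
    """Keep only messages that mention at least one topic keyword.

    A simple illustration of separating 'babble' from topical
    messages before further analysis; real studies train classifiers
    rather than relying on a fixed keyword list.
    """
    keep = []
    for msg in messages:
        words = msg.lower().split()
        if any(k in words for k in topic_keywords):
            keep.append(msg)
    return keep

topics = {"work", "economy", "election"}
msgs = ["so bored lol", "Back to work tomorrow", "the economy is improving"]
print(filter_babble(msgs, topics))
```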

    Determination of the sentiment in social media messages revealed a very interesting potential

    use of this data for statistics. The sentiment in Dutch social media messages was found to be

    highly correlated with Dutch consumer confidence; in particular with the sentiment towards

    the economic situation. The latter relation was stable on a monthly and on a weekly basis.

    Daily figures,

    however, displayed highly volatile behaviour (Daas et al., 2015). This highlights that it is

    possible to produce weekly indicators for consumer confidence. It also revealed that such an

    indicator could be produced on the first working day following the week studied,

    demonstrating the ability to deliver quick results. Moreover, since consumer confidence

    statistics are survey-based, cost and response burden reduction may be feasible, if quality

    8 https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/WP5_2016_09_2223_Luxembourg_Workshop


    issues can be solved in a satisfactory way. The survey may remain necessary for benchmark

    purposes, but its sample size or frequency may be reduced. It is conceivable to use both

    survey data and social media data in a model in order to get earlier results, lower cost and

    response burden, and still maintain quality standards.
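The core of the weekly comparison is an ordinary correlation between the aggregated sentiment series and the surveyed confidence series. A self-contained sketch, with invented numbers (the real series are those analysed by Daas et al., 2015):

```python
def pearson_r(x, y):
    """Plain Pearson correlation between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy weekly series: mean social-media sentiment vs. surveyed
# consumer confidence (values invented for illustration).
sentiment  = [-0.10, -0.05, 0.02, 0.08, 0.12, 0.09]
confidence = [-14, -11, -6, -2, 3, 1]
print(round(pearson_r(sentiment, confidence), 2))
```

A high correlation alone does not validate the indicator, as the discussion of the Google Flu case later in the text makes clear; understanding why the series move together is part of the quality question.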

    The analysis of social media messages is an area where text mining, learning algorithms and

    other artificial intelligence approaches are applied to big data. Apart from official statistics,

    much research is done in academia (an example is Schwartz et al, 2014). This is another area

    where partnerships may be beneficial to both sides. The use of such data for commercial

    purposes was also realised early on (Bollen et al, 2011), but this generally does not result

    in publicly available methods and outcomes. However, it shows that official statistics are far

    from unique in analysing social media data.


    3. Big data and the statistical process

    In order to see how the use of big data for official statistics may affect statistical production

    processes and methods, and what issues arise, the current way of producing statistics is the

    background reference. Therefore the next section starts with characterising the more familiar

    way of making statistics, before looking at possible changes caused by or required for using

    big data. Methodological issues are the subject of the subsequent section, which also covers

    quality issues, since the quality of statistics depends on the methods applied, and methods are

    usually chosen to fulfil quality objectives. There is interaction between methods on the one

    hand, and their implementation in processes on the other. This is the subject of the third

    section of this chapter. The chapter builds, among other sources, on a paper on quality

    approaches to big data (Struijs and Daas, 2014).

    3.1. From well-known to new processes and methods

    With only very few exceptions, the statistical programmes of NSIs are based on inputs from

    statistical surveys and administrative data sources. For such statistics there exists an elaborate

    body of validated statistical methods. Many of these methods are survey oriented, but in fact,

    most survey based statistics make use of population frames that are taken or derived from

    administrative data sources. Methods for surveys without such frames do exist, for instance

    area sampling methods, but nowadays even censuses (of persons and households as well as

    businesses and institutions) tend to make use of administrative data. And administrative data

    sources are, of course, themselves also used as the main data source for statistical outputs.

    Statistical surveys are increasingly used for supplementing and enhancing administrative

    source information rather than the other way round. This is the consequence of the widely

    pursued objectives of response burden reduction and cost efficiency.

    Surveys may be run in parallel, in so-called stovepipes. These may be well co-ordinated or

    may run more or less independently from each other. A large part of the body of established

    methods for surveys is connected to sampling theory, the core of which refers to a target

    population of units and variables, to which sampling, data collection, data processing and

    estimation are tuned and optimised, considering cost and quality aspects.

    Next to stovepipe statistics, there also exist integrative statistics, based on a multitude of

    sources. The prime example of such statistics is National Accounts (NA). Statistical methods

    for NA focus on the way different sources for various domains and variables of interest can be

    combined. Since these sources may be based on different concepts and populations, frames

    and models have been developed for integration of sources. These frames and models include,

    for instance, macroeconomic equations. Interestingly, NA outputs generally do not include

    estimations of business populations. This may reflect the fact that the production of NA

    involves quite a few expert assumptions as well as modelling, rather than population based

    estimation.


    This characterisation of statistics in relation to methods is, of course, incomplete. There are a

    number of methods aimed at specific types of statistics, for instance occupancy models for

    estimating the evolution of wild animal populations, or time series modelling.

    For big data the question is to what extent current methods and processes can be reused when

    applying new types of data sources. Many big data sources do not have a deliberate design.

    Traditional administrative registers have a well-defined target population, variables, structure

    and (administrative) quality. They also have an explicit legal basis. But what design is behind

    Twitter messages, commercial websites or mobile phone traffic? For big data sources,

    populations can often not be specified, let alone related to other sources. How can NSIs then

    ensure quality?

    The implication is that methods derived from sampling theory may have their limitations

    when big data are going to be used. However, although current methods are predominantly

    based on sampling theory, this is not exclusively so. Methods outside traditional sampling

    theory, especially those involving modelling, may be relevant when dealing with big data. And

    modelling is already being applied in some statistical domains, such as NA and seasonal

    adjustment.

    3.2. Methodological issues

    Starting with a very fundamental issue, what exactly is the meaning and relevance of the data

    found in big data sources, from a user’s perspective? What does the number of searches on an

    internet search engine reveal, or the sentiment observed in social media, or the number of

    mobile phones connected to a site? The interpretation of big data can be a big methodological

    problem (Daas and Puts, 2014). Moreover, meaning and relevance are user and use

    dependent.

    This issue is not unique to big data, to be sure, as for instance certain administrative data

    sources may have a similar issue. In fact, this sometimes results in statistics about what can be

    found in an administrative register rather than about the phenomenon of interest, such as when

    reported rather than actual crime is measured, or the population with unemployment benefit

    rather than unemployment itself. If the meaning of the data of a big data source cannot be

    pinpointed, but obviously has some relevance, an option may be to produce stand-alone

    statistics, such as a general sentiment indicator based on social media. The interpretation is

    then up to the user, and changes in the index (rather than the level itself) may be interesting

    anyway. In a way, the example of road use statistics based on sensor data can be seen as

    stand-alone, since the number of vehicles passing a certain road segment can hardly be linked

    to surveys on mobility or other statistics.

    Another issue concerns the population about which a big data source reports. Most statistics

    aim at giving information about populations of persons or businesses, or other relevant sets,

    such as goods imported or sold. However, the population covered by big data may be unclear.


    Mobile phones may be carried by others than the owner, some persons have multiple phones,

    vehicles passing a detection loop may be private or company vehicles, and what do we know

    about the population using social media? And how do these populations change over time?

    How stable are they? In some cases it may be possible to obtain background variables, such as

    for credit card data, while in other cases background variables may be estimated. For instance,

    the choice of wording is correlated with age and sex of the user of social media (Daas and

    Burger, 2015). This is another reason text mining may become more important in the age of

    big data.

    This issue has to do with the question of selectivity and representativity (Buelens et al, 2014),

    and with the sometimes unstructured nature of the data, which makes it even more difficult to

    extract meaningful statistical information. Selectivity is a characteristic of many data sources,

    including big data sources. For some of such sources, the selectivity mechanism is known,

    such as for road sensors if the target population consists of road segments, or for financial

    transaction data. For other sources this is partly known, as is the case for mobile phone data,

    where the grid of antennae may be known, but the population of mobile phone users perhaps

    not. In the case of the population behind social media the mechanism is even less known.

    Not knowing the composition of the populations included in big data leads to the question of

    what to do when sampling theory cannot be applied (Struijs et al, 2014). What to do if one

    does not know for what part of the target population the dataset is representative? More

    fundamentally, one may wonder whether sampling theory deserves being the default approach

    to statistics in the age of big data. Maybe more model-based approaches need to be applied.

    Examples are probabilistic modelling, Bayesian methods, multilevel approaches, statistical-

    learning methods and occupancy models, such as those used in measuring wild animal

    populations. Econometric models can also be considered. Then the measured phenomena are

    leading, and research may be aimed at relating them to information already known.

    However, it is not clear whether this would really be desirable. This approach to official

    statistics is not generally accepted, at least not yet, as this would increase the use of

    assumptions in statistics, and the compilation of statistics by making use of observed

    correlations between variables. For instance, the correlation between the sentiment as

    observed in social media messages and the surveyed consumer confidence may be high and

    remain high for a considerable time, but if that correlation is not well understood, there are

    certain risks, especially if the relationship between the population writing public messages on

    social media and the population at large is not known (Daas and Puts, 2014a). Such risks have

    become well-known since quality issues appeared with the Google Flu data (Lazer et al,

    2014).

    For NSIs, a key question is how the quality of official statistics can be guaranteed if they are

    based on big data and new methods such as modelling are applied (Puts et al, 2015). The

    question of modelling is not new (Breiman, 2001), but the concerns with big data are (Tam


    and Clarke, 2015). When reading articles by proponents of new approaches (e.g., Varian,

    2014), one may wonder whether a paradigm shift is taking place.

    Methodological issues also have the attention of the international statistical community. In

    2014 the UNECE established a task team to advise on how to ensure good quality of statistics

    when using big data. The report of the task team (UNECE, 2014) did not come up with a list

    of methods, because there was too little experience with big data at the time and methods

    would be source dependent, but it proposed an approach similar to assessing the potential use

    of administrative data sources. Another initiative is currently taking place in the ESSnet on

    Big Data mentioned earlier, where a work package has been defined to systematically assess

    the methods that can be used for big data statistics, coming from the ESSnet itself as well as

    from the literature. The work package will be carried out in 2017. Other organisations have

    also been looking into such issues (e.g., Baker et al, 2013, and AAPOR, 2015).

    3.3. Process issues

    Making use of big data for official statistics may have consequences for all aspects of the

    statistical process, from the input to the output process. At the input side, there may be access

    issues. Throughout the process, there may be issues of privacy and security, and of

    infrastructure needed for processing the volume of the data. There may also be issues

    concerning dissemination.

    When using big data, the design of the statistical process needs special attention. It may be

    difficult to receive and process really high volume datasets, especially if the second “V” of

    big data, velocity, applies. In the example of the use of traffic sensor data, the first analysis

    was done on a huge set of data covering all records for several years. This yielded techniques

    for reduction of the volume of data without information loss, by looking at what was actually

    needed, including metadata, and removing noise. For instance, records may contain a lot of

    metadata that is the same for a large set of records. The process that resulted was very

    efficient and used parallel processing, but in this case it is also possible to use the streaming

    data. That requires a different type of process. In any case, the processing, storage and transfer

    of large data sets may pose a challenge. However, given technological advances like increases

    in computing power, parallel processing techniques, larger storage facilities and high

    bandwidth data channels, for most situations this does not need to become a bottleneck.
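The volume-reduction idea of factoring out metadata that is identical across large sets of records can be sketched as follows. The record layout is invented here, and the actual techniques referred to in the text are more involved.

```python
def split_shared_metadata(records):
    """Separate fields that are identical in every record.

    Sensor records often repeat the same metadata (road id, loop
    location, units) millions of times; storing those fields once and
    keeping only the varying measurements cuts the volume without
    losing information.
    """
    if not records:
        return {}, records
    shared = {k: v for k, v in records[0].items()
              if all(r.get(k) == v for r in records)}
    slimmed = [{k: v for k, v in r.items() if k not in shared}
               for r in records]
    return shared, slimmed

recs = [{"loop": "L042", "road": "A13", "minute": 1, "count": 40},
        {"loop": "L042", "road": "A13", "minute": 2, "count": 44}]
shared, slim = split_shared_metadata(recs)
print(shared)  # metadata stored once
print(slim)    # only the varying measurements remain
```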

    There may be technical solutions for dealing with high volume and high velocity data, albeit

    possibly at additional costs, but another solution is conceivable: perhaps the data can remain

    at the source. Maybe it is possible to arrange that queries are done by the source holder in the

    source, for instance that data is first aggregated or sampled prior to sending it to the NSI. But

    such solutions themselves entail other issues. Are the results reproducible, can they still be

    linked to data available at the NSI? In the case of mobile phone data, the operators often

    prefer delivering aggregated data rather than individual records. This has to do with privacy

    considerations, among other things (Struijs et al, 2014).
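The aggregate-before-transfer arrangement can be illustrated with a small sketch: the data holder returns only group counts, suppressing small groups, which is one common privacy safeguard. The threshold used here is an arbitrary example, not a prescribed value.

```python
def aggregate_at_source(records, key, minimum=15):
    """Aggregate microdata at the data holder before transfer.

    Instead of shipping individual records to the NSI, the holder
    returns counts per group, suppressing groups below a minimum size
    (an illustrative privacy safeguard).
    """
    counts = {}
    for rec in records:
        counts[rec[key]] = counts.get(rec[key], 0) + 1
    return {group: n for group, n in counts.items() if n >= minimum}

# 20 events at one antenna, 3 at another: the small group is suppressed.
events = [{"antenna": "site-12"}] * 20 + [{"antenna": "site-40"}] * 3
print(aggregate_at_source(events, "antenna"))
```

The price of such an arrangement is visible in the sketch: the suppressed records cannot be recovered or linked to other NSI data afterwards, which is exactly the reproducibility and linkage concern raised above.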


    In general, as is the case with more traditional statistical processing, the data may be

    processed physically at the NSI offices or elsewhere, in various arrangements. Although it

    entails its own issues, cloud computing may be considered if some conditions are met (see

    below). Another possibility is to collaborate with another party, for instance a research

    institute with facilities for big data processing. This may be beneficial to all parties involved,

    since knowledge and experience can be shared. Whatever arrangement is entered, it is

    important to be aware of possible risks, for instance in respect of the continuity of the

    partnership and public trust. There are many platform possibilities (Singh and Reddy, 2014)

    and many tools for analysis, including visualisation tools for big data sets (Tennekes et al,

    2013).

    Privacy and security considerations are especially important when dealing with big data,

    because in comparison with more traditional processes, the problems can be compounded and

    the legal situation may not be entirely clear or in flux, as existing legislation and rules were

    not designed for dealing with big data. There are several issues, real or perceived, that may

    impede using big data. Data ownership and copyright may be an issue, and the purpose for

    which data are registered. Even if data is publicly accessible, for instance on websites or as

    social media messages that do not have access restrictions, questions of ownership and

    purpose of publication can be raised. Internet robots place a burden on the providers of the

    sites, and in some cases site owners prefer sending data directly to the NSI. And even if data

    may legally be used, this does not imply that it is wise or appropriate to do so. Of critical

    importance is the implication of any use of big data for the public perception of an NSI as this

    has a direct impact on trust in official statistics.

    A complicating factor is the circumstance that public opinion on privacy and confidentiality

    seems to be in flux. On the one hand, privacy seems to be ever more under pressure when

    public safety or commercial interests are perceived to be at stake, and young people who have

    grown up using social networks tend to consider privacy less important than the elderly. On

    the other hand, there seems to be a growing general awareness of possible privacy

    implications of the ubiquity of data, resulting in a more critical attitude towards the

    unquestioned processing of data by anyone. In any case, understanding of the need for

    statistical data collection by organisations is decreasing, especially if such data are already

    registered elsewhere.

    Fortunately, there are measures NSIs can take to overcome at least some of the obstacles. In

    some cases the use of informed consent may be a solution. If the NSI can offer a reduction of

    the response burden, this can be very helpful, also in getting the support of the general public.

    Transparency about which big data sources are used, and how, is crucial. For the long run

    changes in legislation may be considered, to ensure continuous data access. But it remains

    important to stay in line with public opinion, because credibility and public trust are important

    assets of NSIs.


    As to cloud computing, a few sensible rules may be used. Cloud services include the

    provision of computing resources, platforms and IT applications via the internet. At the

    present legal and technical state of the art, it is advisable in general not to host sensitive and

    critical data and processes in the cloud. The user of the cloud service remains responsible for

    the security and privacy of the data, and public trust depends on this. It is also advisable to use

    data encryption where possible.
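    The advice that the NSI remains responsible for data it places in the cloud can be illustrated
    with a minimal sketch. The key name and record fields below are invented for illustration, and
    a real deployment would additionally encrypt the payload itself with a vetted cryptographic
    library; the sketch only shows the related principle that direct identifiers can be replaced by
    keyed pseudonyms before any data leave the NSI's premises:

```python
import hmac
import hashlib

# Hypothetical secret key; it stays on-premise, so the cloud provider
# cannot reverse the pseudonymisation.
SECRET_KEY = b"on-premise-secret-key"

def pseudonymise(identifier):
    """Replace a direct identifier by a keyed HMAC-SHA256 pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative microdata record; only the identifier is transformed.
record = {"person_id": "NL-123456", "municipality": "Heerlen", "age": 42}
safe_record = {**record, "person_id": pseudonymise(record["person_id"])}

# The pseudonym is deterministic, so records can still be linked across
# datasets, but it cannot be mapped back without the on-premise key.
assert safe_record["person_id"] != record["person_id"]
```

    The design choice here is linkability without reversibility: the same identifier always maps
    to the same pseudonym, which preserves the NSI's ability to relate data from different sources
    while keeping the re-identification key outside the cloud.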

    Even at the output side of the statistical process, there may be issues. The prevention of the

    disclosure of the identity of individuals is an imperative, but this is difficult to guarantee
    when dealing with big data, although a number of techniques are available that have proven to
    be reliable. Another issue may be the dissemination policy. Some statistical outputs based on
    big data may be innovative or provisional and may entail some quality risks. Rather than not
    disseminating results for which there is high demand, even if their quality does not meet
    traditional standards, a possibility is to release such results on a beta site, where all
    outputs are qualified by default. In fact, this is common practice among many large internet
    businesses, as it gives them early feedback on their products. Partly for this reason,
    Statistics Netherlands launched an innovation site9 in October 2016.
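    One of the proven disclosure-control techniques alluded to above can be sketched in a few
    lines: primary cell suppression in a frequency table. The threshold of 3 and the region names
    are illustrative only; actual thresholds and suppression markers are set by agency policy:

```python
# Primary cell suppression: counts below a publication threshold are
# replaced by a suppression marker, so that very small groups of
# individuals cannot be singled out in published tables.

THRESHOLD = 3  # illustrative; real thresholds are an agency policy choice

def suppress_small_cells(table, threshold=THRESHOLD):
    """Replace counts below the threshold by the marker 'x'."""
    return {cell: (count if count >= threshold else "x")
            for cell, count in table.items()}

counts = {"region A": 125, "region B": 2, "region C": 47, "region D": 1}
published = suppress_small_cells(counts)
# → {'region A': 125, 'region B': 'x', 'region C': 47, 'region D': 'x'}
```

    In practice primary suppression is usually followed by secondary suppression, so that the
    hidden cells cannot be reconstructed from row and column totals; that step is omitted here
    for brevity.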

    International organisations have tried to provide help and guidelines to deal with process

    issues. In Ireland, a so-called Sandbox for practicing with big data was created in 2014 by
    the Central Statistics Office (CSO) and the Irish Centre for High-End Computing (ICHEC)10.
    NSIs may use these facilities for a small annual fee, and assistance is provided.

    International guidelines have been developed on privacy and security by a task team of

    UNECE in 2014, resulting in three documents on good practices. These documents do not

    have any formal status, but facilitate working with big data; they are available on the big data

    site of the 2014 project of UNECE11.

    At the level of the UN, the Global Working Group (GWG) mentioned earlier has also

    produced recommendations. In particular, there are recommendations on access and

    partnerships (GWG, 2015a), and a template for a Memorandum of Understanding with global

    data providers (GWG, 2015b). Furthermore, the GWG has drafted a number of principles for

    data access, trying to find a fair balance between the interest of getting free access to data for

    the public good on the one hand, and legitimate interests of private organisations on the other.

    The main elements of this balance are the creation of a level playing field, equal treatment,

    safeguards for confidentiality and security, transparency, and proportionality. These draft

    principles, which are annexed to this document, are based on the Fundamental Principles of

    Official Statistics of the UN12. They have been presented and discussed with stakeholders at
    several fora and have received broad support.

    9 https://www.cbs.nl/en-gb/our-services/innovation

    10 http://www1.unece.org/stat/platform/display/bigdata/Sandbox

    11 http://www1.unece.org/stat/platform/display/bigdata/2014+Project

    12 http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx



    4. Getting ready for big data

    4.1. Organising for big data

    An NSI that wants to make serious use of big data will have to organise itself in order to cope

    with the challenge. Several factors are important. A factor not discussed so far is human

    capital. In order to work with big data, specific technical skills are needed, such as advanced

    computing skills, a fair command of math and statistics, modelling skills and data engineering

    skills. But equally important are the mental orientation and behavioural skills of the staff.

    Working with big data requires an open mind-set and the ability not to see all problems a

    priori in terms of sampling theory. For this type of staff the term data scientist has been

    coined. However, it is not evident that the culture of NSIs can smoothly absorb this type of

    professional. A way to deal with this cultural issue is to create one or more kernels of data

    scientists working with big data, and let these kernels grow, which will be a natural process if

    they are successful (Struijs and Daas, 2013).

    Another factor is the way processes are organised. The continuity and possible volatility of

    big data sources deserve consideration. Social media, for instance, seem to have an ever

    shorter lifecycle. As a consequence, the use of big data requires a more flexible set-up of

    production processes, with a short time-to-market. Not only does data collection have to be
    flexible, but also data processing further down the production chain. More generally, NSIs
    that start using

    big data may have to adapt or even reconsider their enterprise architecture.

    For NSIs that want to make big data a serious part of their business, governance may become

    an issue. Because of the important strategic aspects of big data, this subject should get

    attention at the highest management level of the NSI. Setting priorities, creating favourable

    conditions for using big data and taking related budget decisions would be tasks for the

    strategic level, as would be the making of policy choices. An increased use of big data

    requires a number of policy decisions that influence various parts of the organisation. The

    organisation’s CIO (Chief Information Officer) would likely have an important say in the way

    the NSI deals with big data.

    There are more relevant factors, of course. The required IT infrastructure must be in place,

    and the same goes for an appropriate research capability. Policy support must be organised,

    for instance concerning privacy issues. All this requires a conscious effort and co-ordination.

    However, there is no blueprint for getting an organisation ready for big data.

    To give an example, this is how Statistics Netherlands organised itself when awareness
    grew that big data was a strategic issue. Once the Board of Directors of Statistics Netherlands

    identified the need to have a big data strategy, they had a staff member write a position paper

    for discussion by the Board. The paper suggested that the NSI should work out a big data

    roadmap, and this was done. The roadmap was validated by IBM (IBM, 2014), updated twice

    a year, and monitored. The roadmap not only identified big data research projects and


    statistics, together with a time plan and ownership, but also arranged for creating the right

    conditions, such as IT, methodological and policy support. The Deputy Director-General was

    made responsible for big data at the strategic level. Statistics Netherlands already had an R&D

    programme, which was then also tuned to the demand for big data research. At a more tactical

    level, a big data co-ordination group was created. This group prepared updates of the

    roadmap, and also arranged for internal training to be given.

    In practice, this approach did yield results, but it was not satisfactory. In particular, the

    transition from research to regular statistics production took much longer than desired. In

    2016, the decision was taken to centralise all big data activities in a programme with its own

    physical facilities and permanent staff, to which other staff is added on a project basis. This
    centre, called the Center for Big Data Statistics (CBDS), was launched in September
    2016. It already has many external partners, and makes use of the innovation site mentioned

    in section 3.3.

    4.2. Towards a data ecosystem13

    The environment in which NSIs operate is changing. This may have consequences for the

    position an NSI occupies or wants to occupy in the data society. NSIs are faced with more and

    more potential data sources, whereas the modalities for their use are changing. Most actions

    by persons or businesses – transactions, movements, communication, social and business

    activities – nowadays leave digital traces in one way or another; ever increasing amounts of

    data are becoming available. And, contrary to survey data, these data are not available
    exclusively to NSIs, which are thus becoming less unique as users of data.

    The position of NSIs in the information society is becoming less evident, even though their

    institutional setting is stable for the time being. Other providers of information on relevant

    phenomena of society pop up everywhere. They are often very quick and perceived as

    knowledgeable. Because of this, society becomes less dependent on information from NSIs.

    There are alternatives, for instance, to official price indices. Even if there are quality issues

    attached to these alternatives, there is demand for them. Apart from the many practical

    questions, big data is bound to have an impact on NSIs at the strategic level.

    One of the questions with which an NSI may be confronted is what to do if there is a market
    alternative to one or more of its statistics. But it may also ask whether it can

    assume new roles, based on its institutional position and the knowledge it has accumulated.

    Should one, for instance, consider shifting the role of the NSI from producing statistical
    information towards validating information produced by others? Or pooling resources?

    A possible approach is to assess the strengths and weaknesses of the NSI, and take them into

    account when positioning itself in the information society. For instance, NSIs have a unique

    ability to relate data from different sources and to assess the quality of information produced

    13 This section makes extensive use of (Struijs and Daas, 2014) and (Struijs et al, 2014).


    by others. They may try to exploit this by forming networks and forging partnerships with

    other organisations. NSIs have come to recognise the necessity of not working in isolation but

    collaborating with each other and others outside the community of official statistics. This

    collaboration is often exploratory and may be aimed at sharing knowledge and experiences,

    but there are already examples of collaboration that go further.

    From the perspective of NSIs, several types of partners are of interest. First of all, the

    potential providers of big data are essential partners: if they do not grant access to their data,

    the story is over before it starts. Data owners have their own concerns and, like NSIs, they are

    subject to privacy rules. This may complicate collaboration even if they have a positive

    outlook and approach. But since big data sources are not designed for statistical use, such

    collaboration is also essential in order to obtain good knowledge of the provenance of such

    sources. Additionally, for statistical production, it may be more efficient to have data

    processed at the site of collection and storage.

    On the other hand, statisticians also have much to offer such as providing analytic insights

    that may help data owners understand their data better. Doing complex statistical analyses is

    core business for NSIs, but not for, say, a mobile phone company. In these and other ways,

    the relationship with data providers could potentially become true partnerships. For example,

    one specific role that NSIs could play is that of a trusted third party. In a competitive
    market, competitors will be reluctant to share sensitive data with each other. But they might
    be willing to share it with an NSI that compiles statistical information that is beneficial to
    all.
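    The trusted-third-party role depends on the NSI publishing only aggregates from which no
    contributor's data can be deduced. A classical safeguard for this is the (n, k) dominance rule
    from statistical disclosure control; the parameter values and contribution figures below are
    illustrative, not prescribed:

```python
# (n, k) dominance rule: an aggregate of confidential contributions is
# only published if the largest n contributors account for no more than
# k percent of the total, so no competitor can closely estimate another
# firm's value from the published figure.

def publishable(contributions, n=1, k=75.0):
    """Apply the (n, k) dominance rule; also require at least 3 contributors."""
    total = sum(contributions)
    if total == 0 or len(contributions) < 3:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total <= k

print(publishable([40.0, 35.0, 25.0]))  # balanced market: True
print(publishable([90.0, 5.0, 5.0]))    # one dominant firm: False
```

    The minimum of three contributors prevents the trivial attack in which one of two firms
    subtracts its own value from the published total to recover the other's.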

    Collaboration between NSIs and academia may grow as well. Universities have historically

    been natural partners for NSIs. It stands to reason that such collaboration will extend to the

    field of big data, for instance, in solving methodological problems, developing technical

    solutions and training future data scientists. Such collaboration is also being supported by

    public funders who are facilitating research and innovation partnerships through targeted

    grants. By working in partnership, researchers in universities and NSIs could better leverage

    such opportunities.

    Furthermore, there are many commercial partners with which NSIs could collaborate. Google

    and Facebook are two examples for which big data forms the core of their business model.

    Their knowledge and the data to which they have access may be very relevant to NSIs. IT

    companies also possess relevant knowledge on big data processing and storage, security,

    cloud processing, etc. Apart from the provision of paid services, collaboration may be of

    interest to them with a view to obtaining statistical expertise and for benchmarking or

    validating their information products.

    The relationship between the various stakeholders will involve each partner building on and

    contributing different strengths and will likely result in flexible networks. Such networks are

    flexible in the sense that membership of the network and the contribution of partners depend

    on actual needs instead of being fixed in advance for a long time. The emerging data


    ecosystem will also allow for forming ad hoc consortia to compete for research funds and

    other subsidies, such as funds from the Horizon 2020 programme (European Commission,

    2013). The ESSnet on Big Data mentioned earlier can also be seen as part of the data

    ecosystem.

    Partnerships are the core of the data ecosystem, and international organisations have tried to

    document good practices. The GWG work on access and partnerships, such as the
    recommendations for access to data from private organisations mentioned in section 3.3,
    actually carried forward earlier work of UNECE14 on partnerships. Guidelines have also been
    drafted by PARIS21 in collaboration with OECD (Robin et al, 2016).

    14 http://www1.unece.org/stat/platform/display/bigdata/Guidelines+for+the+establishment+and+use+of+partnerships+in+Big+Data+Projects+for+Official+Statistics



    Annex

    Global Working Group on Big Data for Official Statistics

    Recommendations for Access to Data from Private Organisations

    for Official Statistics

    Draft 14 July 2016

    Preamble

    The Global Working Group on Big Data for Official Statistics,

    (1) Taking notice of the high and urgent need for access to data kept by private organizations

    for the production of official statistics, such as indicators for the Sustainable Development

    Goals and statistics on phenomena related to modern society, and the social responsibility

    already shown by private organizations to provide access to new data sources, free of

    charge, for purposes such as disaster relief and the fight against epidemics,

    (2) Bearing in mind that in using such data the Fundamental Principles of Official Statistics,

    as endorsed by the UN General Assembly15, unconditionally apply, and that the statistical
    community has pledged to adhere to the professional ethics, as stated in the Declaration on
    Professional Ethics, as adopted by the International Statistical Institute16, thereby creating

    the foundation for sharing data for official statistics,

    (3) Recognizing the legitimate interests of private organizations, including respect for their

    business model and value proposition, and the need to guarantee a level playing field for

    private organizations considering the burden created by providing data for official

    statistics, as well as the legitimate interest of organizations in charge of compiling official

    statistics to have equal access,

    (4) Stressing that the burden to private organizations resulting from data requests for official

    statistics must be fair in proportion to their envisaged public benefits and that the data

    should be adequate and relevant in relation to the purposes for which they are requested,

    (5) Considering that legislation aimed at accessing and using data kept by private

    organizations unavoidably lags the emergence of new types of data sources, that existing

    national and international legal frameworks fully apply but need interpretation in view of

    new data sources, especially concerning privacy, data ownership, reuse of data by third

    15 Resolution 68/261, adopted by the General Assembly on 29 January 2014.

    16 This declaration was adopted by the Council of the International Statistical Institute in its session of 22 and 23 July 2010, in Reykjavik, Iceland.


    parties, and liability in case of breaches of confidentiality, and that there is thus a need for

    guidance,

    (6) Highlighting the need to create public trust by applying full transparency in the use of

    data from private organizations for official statistics, in particular in view of privacy

    concerns, given a number of well-publicized cases of likely abuse outside the realm of

    official statistics, and the need to provide clarity concerning the possible use for statistical

    purposes of personal data in customer contracts with private organizations, for instance by

    referring to the Recommendations set out below,

    (7) Acknowledging that private data sources are diverse in many respects, such as data

    ownership, provenance of the data, purpose of collecting the data, and characteristics of

    the data itself, and that providing access to the data can take a variety of shapes, such as

    sending micro data to statistical agencies, providing aggregates compiled according to

    specifications from statistical agencies, or providing on-site data access for analysis,

    (8) Admitting that source and branch specific operational rules and guidelines may be needed

    for dealing with access to data kept by private organizations, that such rules and

    guidelines should be consistent with the Recommendations set out below, that before

    access is requested for the purpose of producing official statistics data exploration may be

    necessary in collaboration with the private data source, and that this requires the

    development of partnerships between private organizations providing and statistical

    agencies using data,

    Endorses the following recommendations for access to data from private organizations for

    the production of official statistics:

    Recommendations

    Recommendation 1. The role of national and international systems of official statistics is to

    provide relevant, high-quality information to society in an impartial way. This role is

    indispensable to the well-functioning of societies. To this end, data is needed from private

    organizations as inputs to these systems. In view of the emergence of new types of data

    sources and the social responsibility of private organizations, these members of society are

    called upon to make the data that is needed available to the statistical agency concerned, free

    of charge, on a voluntary basis.

    Recommendation 2. The data needed for official statistics may only be collected and

    processed if the statistical agency concerned acts in full accordance with the Fundamental

    Principles of Official Statistics17. These principles guarantee, among other things, the

    professional independence and accountability of the statistical agency, and the strictly

    confidential use of the data, exclusively for statistical purposes.

    17 http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx


    Recommendation 3. When data is collected from private organizations for the purpose of

    producing official statistics, the fairness of the distribution of the burden across those

    organizations has to be considered, in order to guarantee a level playing field.

    Recommendation 4. Data requests for official statistics must acknowledge and take into

    account the role of data in the business model and value proposition of private organizations,

    in particular if their data has market value. There must be a fair balance between public and

    business interests when data is requested and possible harm to business interests has to be

    kept as low as possible.

    Recommendation 5. The data must be adequate and relevant in relation to the purposes for

    which it is requested from the private organization. No more data should be requested than

    needed for these purposes. Operational arrangements have to be agreed on between the

    private organization and the statistical agency concerned, taking into account business

    concerns and data adequacy for official statistics. The metadata must also be adequate.

    Recommendation 6. The cost and effort of providing data access, including possible
    pre-processing, must be reasonable compared to the expected public benefit of the official

    statistics envisaged.

    Recommendation 7. When private organizations operate internationally, they are expected to

    treat requests for data from national statistical systems in a non-discriminatory way, unless

    different treatment is justified by differences in the national legislative frameworks

    concerned, and provided that adherence to the Fundamental Principles of Official Statistics is

    guaranteed in theory as well as p