BIG data
for Official Statistics
Peter Struijs
Department for Methodology and Process Development
Statistics Netherlands
E-mail: [email protected]
PRESENTATION
Year after year, we look forward enthusiastically to the International Statistics Seminar, this being the XXIX edition since its inception. This year we presented the topic of "Big Data" in official statistics at the Seminar held at the Europe Conference and Exhibition Centre in Vitoria-Gasteiz on November 21, 2016.
Since 1983, it has been an honour to have been able to attract innovative and recognized
researchers in statistics on a global level to speak at our International Statistics Seminar.
This time the main guest was Peter Struijs, coordinator of the Statistics Netherlands (SN) Big Data programme. Also participating along with him in the afternoon session were Pedro Alberto
González (Basque Data Protection Agency), Jerónimo Hernández and Iñaki Inza (Faculty of
Information Technology of Donostia-San Sebastián-EHU-UPV-) and Javier San Vicente and
Jorge Aramendi (EUSTAT).
The main objective this year was to address all areas of society: private companies and public organisations, the university community, workers in the statistics sector, etc. We have to keep in mind that "Big Data" is a current issue of great importance for the future, so it is our responsibility to prepare and train ourselves now.
In order for this news to reach as many interested people and institutions as possible, you
have at your disposal information about the International Statistics Seminar on the Eustat
website, www.eustat.eus.
Available within this section of the website are both this book and all the papers and technical
notes made by previous speakers since 1983. We want to contribute to the expansion of
statistical knowledge on a global level through the advantages of technology.
Vitoria-Gasteiz, November 2016
JOSU IRADI ARRIETA Director General of EUSTAT
IV
PRESENTACIÓN Año tras año, recibimos el Seminario Internacional de Estadística con entusiasmo, siendo ya
la XXIXª edición desde su creación. En esta ocasión hemos presentado el tema “Big Data” en
la estadística oficial, celebrado en el Palacio de Congreso Europa de Vitoria-Gasteiz, el día
21 de noviembre de 2016.
Desde 1983, es un honor haber logrado traer investigadores pioneros y reconocidos en
materia estadística a nivel mundial, para ser ponentes de nuestro Seminario Internacional de
Estadística.
En este caso, el invitado principal ha sido Peter Struijs , coordinador del programa de Big
Data de Statistics Netherlands (SN) (Países Bajos). Junto a él, en la sesión de tarde, también
participaron Pedro Alberto González (Agencia Vasca de Protección de Datos), Jerónimo
Hernández e Iñaki Inza (Facultad de Informática de Donostia-San Sebastián- EHU-UPV-), y
Javier San Vicente y Jorge Aramendi (EUSTAT).
El principal objetivo de este año ha sido dirigirnos a todos las ámbitos de la sociedad en
general, tanto a la empresa privada como a los organismos públicos, al campo Universitario,
a trabajadores del área estadística…etc. Tenemos que tener en cuenta que el “Big Data” es un
tema de actualidad y de gran importancia en un futuro, por lo que es nuestra responsabilidad
prepararnos y formarnos previamente.
Para que esta difusión llegue al mayor número posible de personas e instituciones
interesadas, tenéis a vuestra disposición información sobre el Seminario Internacional de
Estadística en la página web de Eustat, www.eustat.eus.
Desde esta sección de la web están disponibles on-line tanto este libro como todos los trabajos
y cuadernos técnicos realizados por los anteriores ponentes desde 1983. A través de las
ventajas de la tecnología, queremos contribuir a la expansión del conocimiento de estadística
a todo el mundo.
Vitoria-Gasteiz, noviembre 2016
JOSU IRADI ARRIETA Director General de EUSTAT
http://www.eustat.eus/
V
BIOGRAPHICAL SKETCH
Peter Struijs is coordinator of the Big Data programme of Statistics Netherlands (SN),
coordinates the ESSnet Big Data of the EU and is a member of the UN Global Working
Group on Big Data for Official Statistics.
Before being engaged in Big Data, Peter was responsible for open data at SN. For many
years, he held the position of Head of Unit for process development and quality management.
Earlier, he worked at Eurostat, the Statistical Office of the EU. He started work at SN as a
methodologist and he is an elected member of the International Statistical Institute.
Index
1. Introduction
1.1. The notion of big data
1.2. Types of big data sources
1.3. The use of big data
2. Examples of big data for official statistics
2.1. Traffic loop data
2.2. Mobile phone data
2.3. Social media data
3. Big data and the statistical process
3.1. From well-known to new processes and methods
3.2. Methodological issues
3.3. Process issues
4. Getting ready for big data
4.1. Organising for big data
4.2. Towards a data ecosystem
Annex
References
1. Introduction
Big data seems to be surrounded by hype. According to Google Trends, in August 2012 it overtook "open data" as a search term (Struijs and Daas, 2013). Hype or not, big data is highly relevant to official statistics, since it has to do with the exponential increase of data registered through networks of sensors, cameras, public administrations, banks, enterprises, mobile networks, satellites, drones, social networks, internet sites, etc. This not only creates many opportunities for improving official statistics, such as reporting on phenomena whose measurement used to be out of reach, but also profoundly influences the context in which statistics are produced, for better or for worse. And even if big data is hyped, this does not mean that attention to it will diminish after a peak. The term "big data" may fade after some time, but the underlying phenomenon will most probably last.
Big data has the potential to become a game changer for National Statistical Institutes (NSIs).
There are many issues with big data that may have an impact on NSIs, such as on the required
statistical methodology, the way data is obtained, privacy considerations, the need for an
appropriate IT infrastructure, the skills needed to deal with big data, the quality of statistics
based on big data, and the positioning of NSIs in the emerging data society. The possible
strategic impact of big data for official statistics was recognised by several NSIs some years
ago, and in 2013 the Directors-General of the NSIs of the European Statistical System (ESS) adopted the so-called Scheveningen Memorandum on Big Data and Official Statistics
(DGINS, 2013), in which a course of action was set out, including the drafting of an ESS
action plan and roadmap.
The resulting momentum led to the development of new approaches to deal with big data.
However, this subject is far from being settled. In that sense the subject of big data is different
from other areas of statistics, which benefit from established, validated approaches. This
document provides an overview of the evolving field of big data for official statistics. It aims
at showing the main issues when dealing with big data and provides access to the literature
and guidelines that are being developed by various national and international organisations. It
is not meant to give definitive answers of the kind available for more traditional areas of statistics. Although the document is intended to be balanced, it does reflect the
specific experience of the author in international big data initiatives and in the use of big data
by Statistics Netherlands. Parts of the text are based on earlier papers by the author.
The remainder of this chapter comprises an introduction to the notion of big data, a typology
of such data sources, and an overview of potential uses. Chapter 2 discusses three examples of
the use of big data. Building on these examples, the third chapter looks into methodological
and other issues related to the statistical process, including data access and privacy issues,
which are proving to be a significant bottleneck for realising the potential of the use of big
data for official statistics. Chapter 4 is concerned with the question of what has to be done in order to prepare for a future in which big data becomes an important source for official statistics. The international statistical community has been very active in supporting the use of big data, and throughout this document references are given to what has been achieved so far.
1.1. The notion of big data
The concept of big data is not clear-cut. Many attempts have been made to define big data, but
no single definition is generally accepted. Most experts agree that big data is characterised by
volume, velocity and variety, the three V’s, and some add a V for veracity, but these
characteristics may not apply all at the same time (Mayer-Schönberger and Cukier, 2013).
Volume in itself is not enough to consider data “big”. Moore’s Law stems from 1965, and the
volume of data has been increasing for many decades. What threshold was passed a couple of
years ago to start talking about big data? Apparently, no specific one. The emergence of the
concept of big data appears to result from qualitative changes induced by changes in data
quantity and public availability. We seem to have reached a point where the traditional way of
using data does not provide the answers to the new questions that arise – or not fast enough. It
may be noted that what is seen as “high volume” at one moment may not be considered very
voluminous several years later, because of advancing technological possibilities to deal with
large data quantities. In that sense big data is also a relative notion.
In the context of official statistics, big data is generally considered as a data source. An
attempt was made by UNECE, the UN Economic Commission for Europe, to define big data
for statistical purposes. Building on a definition by Gartner (Laney, 2012) it defined big data
as follows (Glasson et al., 2013):
Big data are data sources that can be –generally– described as: “high volume, velocity
and variety of data that demand cost-effective, innovative forms of processing for
enhanced insight and decision making.”
However, this definition is not precise enough to decide in concrete cases whether the data
source belongs to big data or not. Among statisticians there is some discussion on whether
high-volume data from administrative sources is included in the notion of big data, and
scanner data is considered big data by some, but not by all. Since government may make use
of sensors, e.g. road sensors, which are considered part of the Internet of Things, the
governmental origin of the data does not preclude that it should be considered big data.
In any case, rather than trying – possibly in vain – to give a more precise definition, it may
help to mention aspects of big data sources that are regarded as characteristic for such sources
by many statisticians, and to supplement this by mentioning examples of data sources that
many statisticians consider big data sources. In this way, a picture of big data can be obtained that is clear enough to allow progress to be made without getting stuck in discussions on definition; proposed definitions can be found in abundance on the internet.
Also, in statistics, high volume is not a sufficient condition for data to be considered big data.
In fact, there exist pretty high-volume traditional data sources, such as comprehensive tax
registers, that are not necessarily considered to be big data. Other characteristics often
mentioned are the novelty of the data source, the dynamics of its population, the need to use
new methodological approaches, the essentially new character of the resulting information,
the possible need to process the data at the source, the unstructured nature of the data, the
reference of the data to events, the circumstance that the data is often a by-product of the
principal activity of an organization, and their physical distribution over several databases or
points of measurement. These characteristics do support the assumption that the emergence of
the concept of big data has to do with the qualitative changes that come with quantitative ones
(Struijs and Daas, 2013).
1.2. Types of big data sources
Especially in the situation where there is not a generally accepted, unambiguous definition of
big data, it helps to have a list of concrete big data sources. For UNECE, an international task
team developed a typology of big data sources in 2013, comprising three main categories. The
first is (human-sourced) social networks, which refers to digitized information, which is
loosely structured. The second category is process-mediated data from traditional business
systems, such as data on the registration of customers, product manufacturing, taking of
orders, etc. The data tend to be highly structured, including reference tables, relationships and
metadata, making the use of relational database systems possible. The third category is the
machine-generated data of the Internet of Things. Sensors and machines record events and
situations in the physical world, and the data can be simple or complex, but is often well-structured. Its size and speed are beyond traditional approaches. This is the full typology¹:
1. Social Networks (human-sourced information):
1100. Social Networks: Facebook, Twitter, Tumblr etc.
1200. Blogs and comments
1300. Personal documents
1400. Pictures: Instagram, Flickr, Picasa etc.
1500. Videos: YouTube etc.
1600. Internet searches
1700. Mobile data content: text messages
1800. User-generated maps
1900. E-Mail
2. Traditional Business systems (process-mediated data):
21. Data produced by Public Agencies
2110. Medical records
22. Data produced by businesses
2210. Commercial transactions
2220. Banking/stock records
¹ http://www1.unece.org/stat/platform/display/bigdata/Classification+of+Types+of+Big+Data
2230. E-commerce
2240. Credit cards
3. Internet of Things (machine-generated data):
31. Data from sensors
311. Fixed sensors
3111. Home automation
3112. Weather/pollution sensors
3113. Traffic sensors/webcam
3114. Scientific sensors
3115. Security/surveillance videos/images
312. Mobile sensors (tracking)
3121. Mobile phone location
3122. Cars
3123. Satellite images
32. Data from computer systems
3210. Logs
3220. Web logs
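The hierarchical codes above can be held in a simple flat mapping; as a sketch, the parent of a code can be recovered from the numbering convention itself (trailing zeros pad a code, so stripping them and dropping the last digit climbs one level). The mapping shows only a subset of the codes, and the helper is purely illustrative, not part of any official tooling.

```python
# Subset of the UNECE typology as a flat code-to-label mapping.
UNECE_TYPOLOGY = {
    "1":    "Social Networks (human-sourced information)",
    "1100": "Social Networks: Facebook, Twitter, Tumblr etc.",
    "2":    "Traditional Business systems (process-mediated data)",
    "21":   "Data produced by Public Agencies",
    "2110": "Medical records",
    "3":    "Internet of Things (machine-generated data)",
    "31":   "Data from sensors",
    "311":  "Fixed sensors",
    "3113": "Traffic sensors/webcam",
}

def parent_code(code: str):
    """Return the parent code, or None for a top-level category.

    Trailing zeros only pad a code to its publication width, so the
    parent is found by stripping them and dropping the last digit,
    e.g. "2110" -> "211" -> "21", "1100" -> "11" -> "1".
    """
    trimmed = code.rstrip("0") or code[:1]
    parent = trimmed[:-1]
    return parent or None

print(parent_code("3113"))  # 311
print(parent_code("1100"))  # 1
```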
In 2015, the Global Working Group on Big Data for Official Statistics (GWG), created by
UNSD, the statistical department of the UN, also looked at the question of how to classify big data, taking the UNECE results as a starting point. This followed a 2015 UNSD survey among NSIs on the use of big data for official statistics, which included the question: "On which topics do you see an urgent need for statistical guidance for your office or national statistical system?"; one of the topics listed was "classification of big data". The question was answered by 89 respondents. Of these, 73% indicated that guidance on the classification of big data had a "high" (37%) or "medium" (36%) urgency².
The GWG approached the question of grouping big data sources as a classification problem.
Classifications usually have a subject, a scope (or universe), and one or more levels of (sub)
classes describing possible characteristics of the subjects, based on explicit or implicit
classification criteria. Classifications are designed on the basis of their intended uses.
Concerning the intended uses, the first use of a big data classification is providing a so-called
extensive definition of big data, i.e., an enumeration of types of big data. Any guidelines, for
instance on methods to be used when dealing with big data, could refer to the classification. It
can also be used for policy issues, such as having a well-defined scope for projects. For
instance, in February 2016 an ESSnet on Big Data (a research project) was launched, for
which pilot projects were selected by assessing the categories of the UNECE typology. It was
also used as a reference for the UNSD survey just mentioned.
² http://unstats.un.org/unsd/trade/events/2015/abudhabi/presentations/day1/04/UNSD%20-%20Global%20Survey%20on%20Big%20Data.pdf
One possibly important future use of the classification is as a reference in the discussion on
the possible use of big data for compiling SDG indicators³. Only very few countries have
started looking at the usability of big data for deriving indicators to measure progress on the
SDGs, as was shown in the UNSD survey. Therefore, it may be too early to know how it will
be used, but it is clear that the usability of big data for such indicators will be a relevant
factor, possibly with a further decomposition such as the SDG goals, targets or indicators that
could be measured using each big data source. This is also being explored by the GWG.
The intended uses inform the classification criteria to be used. The GWG identified fifteen
potential classification criteria (GWG, 2015):
1. characteristics of the data itself
2. local versus global sources
3. regulatory framework applicable
4. main product versus by-product
5. purpose and subject of the data
6. original versus derived data
7. relationship data source with organisation (e.g. data platforms)
8. public versus private organisation providing the data
9. data sourced by humans versus machines
10. degree of stability of the source
11. degree of accessibility
12. real-time versus accumulated data
13. statistical methodology required for using the data
14. domains of usability
15. usability for SDG indicators
The first criterion, characteristics of the data itself, includes eight possible characteristics:
high volume, high velocity, high variety, high veracity, selectivity, (lack of) structure, high
population dynamics, and event-based data.
The GWG is currently engaged in developing the classification, which should be flexible and
should be able to evolve over time. Initially, this would probably mean relatively short
periods between revisions. Flexibility may also be obtained by constructing a system for
classifying big data sources on demand rather than a fixed classification. In that case, methods
and rules would be needed, and possibly a larger number of criteria could be accommodated.
The work is ongoing.
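The "classification on demand" idea could be sketched as follows: instead of fixing one hierarchy, a source is described along some of the fifteen criteria and grouped by explicit rules. The field names and the rules themselves are invented here for illustration only.

```python
from dataclasses import dataclass

@dataclass
class BigDataSource:
    name: str
    human_sourced: bool  # criterion 9: sourced by humans versus machines
    by_product: bool     # criterion 4: main product versus by-product
    real_time: bool      # criterion 12: real-time versus accumulated data
    accessible: bool     # criterion 11: degree of accessibility (simplified)

def classify(src: BigDataSource) -> str:
    """Group a source by rule, rather than by a fixed hierarchy.
    These example rules roughly echo the UNECE main categories."""
    if src.human_sourced:
        return "human-sourced"
    if src.by_product and not src.real_time:
        return "process-mediated"
    return "machine-generated"

loops = BigDataSource("traffic loops", human_sourced=False,
                      by_product=True, real_time=True, accessible=True)
print(classify(loops))  # machine-generated
```

A rule system like this makes the flexibility requirement concrete: adding a criterion means adding a field and a rule, rather than revising a published hierarchy.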
Other lists of big data sources that are not clearly linked to the UNECE classification are used
elsewhere. Many big data overview papers contain lists of big data sources. A recent example
is a paper by Kitchin (2015), which contains a table linking big data sources to data types and
statistical domains, but there are many more cases of ad hoc classifications of big data.
³ SDG = Sustainable Development Goals. These have been agreed on at UN level.
Companies that offer services related to big data may use their own classifications, such as
IBM (2014).
1.3. The use of big data
There is a gap between actual use and potential use of big data. The potential use comprises
the following categories:
1. production of new products
2. provision of more detail in statistics
3. more timely statistics
4. addition of nowcasts or early indicators to statistics
5. quality improvement
6. response burden reduction
7. cost reduction and higher efficiency
New products may be statistics on phenomena about which no official statistics were
previously available. An example would be a general sentiment index on the basis of public
social media messages. Where there is new demand, such as for the SDG indicators, big data
can also be considered. New products may also be new visualisations of data (Tennekes,
2014). In some cases big data can be used as a single source, but combining big data and traditional sources for new products is in many cases a more promising approach. For new
products one needs benchmarks, based on established, validated methods, in order to assess
the quality of the new products.
More detail in statistics may be provided along several dimensions, for instance higher
regional detail on the basis of big data sources, or more temporal detail such as monthly
estimates where previously there were only quarterly data. Usually higher detail requires
regular statistics that are produced using existing sources and methods, the detail being
derived from an additional big data source. For instance, if a survey has only limited regional
detail because of the sample size, one may explore whether Google Trends at a lower regional
level can be used to provide a picture of the lower level. However, this may be more difficult
than one might think (Reep and Buelens, 2015).
Making statistics more timely is a traditional goal of official statisticians, which has its limits
if surveys are used, or if data from administrative sources lag reality, as is for example
generally the case with fiscal data. However, big data sources may be much faster, for instance when manual price collection is compared with web scraping by means of internet robots. And one may make use of correlations between big data sources and other
sources to generate more timely outcomes by making use of a model.
One step further is to produce early indicators or nowcasts for more traditional statistics. They
supplement these statistics rather than replace them. The early indicators or nowcasts often
heavily depend on correlations and model assumptions, but these quality issues may be
accepted because the final figures are produced later, so there is still a benchmark. If the
assumptions behind early indicators and nowcasts are clearly communicated to the users, and
the quality drawbacks are dealt with in a transparent way, big data may play an important role
in fulfilling the important demand of users for early information on phenomena, however
provisional the information may be.
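The model-based nowcast described above can be sketched in a few lines: regress the official series, which arrives with a lag, on a timely big data proxy, then apply the fitted model to the latest proxy value. All numbers below are invented for illustration; the proxy could be, say, a social media sentiment index.

```python
import numpy as np

# Past periods: the proxy (x) and the official figures (y) are both known.
x_hist = np.array([1.2, 1.5, 1.1, 1.8, 2.0, 1.7])  # big data proxy
y_hist = np.array([0.8, 1.0, 0.7, 1.3, 1.4, 1.2])  # official series

# Fit y = a + b*x by ordinary least squares
# (polyfit returns the highest-degree coefficient first).
b, a = np.polyfit(x_hist, y_hist, 1)

# Current period: the proxy is already observed, the official figure is
# not, so the fitted model supplies a provisional nowcast.
x_now = 1.9
y_nowcast = a + b * x_now
print(f"nowcast: {y_nowcast:.2f}")
```

The quality caveats in the text apply directly: the nowcast is only as good as the stability of the correlation, and the final official figure later serves as the benchmark against which it can be assessed.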
Big data can also be used to improve the quality of statistics in the sense of improving
accuracy and reliability. (In fact, relevance, timeliness and clarity, to which the first four uses
mentioned above contribute, can also be seen as quality aspects, as is done in the European
Statistics Code of Practice⁴.) For improving accuracy and reliability, big data sources are
generally used as complementary sources to existing sources. This includes using big data for
checking the plausibility of statistical outcomes.
Response burden reduction is an important aim of many NSIs, some of which apply specific
reduction targets. Of course, the response burden can be especially reduced if data collected
by means of questionnaires can be replaced by data from other sources. Replacing surveys
with big data, however, is not easy. In some cases, such as internet data for prices, big data is already being used successfully (Ten Bosch and Windmeijer, 2014), but usually big data
sources have more potential if they are used as additional sources, thereby not eliminating
surveys but reducing their sample size, frequency or level of detail.
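Price collection by web scraping, as mentioned above, might look roughly like the sketch below. The URL and the HTML structure are invented for illustration; a production robot would also respect robots.txt, throttle its requests and monitor for page-layout changes.

```python
import re
import urllib.request

def extract_price(html: str):
    """Pull the first price out of a page, assuming a pattern like
    <span class="price">12.99</span> (an invented structure)."""
    m = re.search(r'class="price"[^>]*>\s*([0-9]+\.[0-9]{2})', html)
    return float(m.group(1)) if m else None

def scrape_price(url: str):
    """Fetch a page and extract its price."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_price(resp.read().decode("utf-8"))

# The parsing step can be checked without touching the network:
print(extract_price('<span class="price">12.99</span> per kg'))  # 12.99
```

Separating fetching from parsing, as here, also makes it easier to re-test the robot whenever the target site changes its layout.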
Cost reduction and higher efficiency naturally go together with response burden reduction, but
NSIs may have separate targets for them. In fact, the trend seems to be that NSIs try, where
possible, to have a so-called zero footprint, by which is meant that NSIs make use of all
information they can get without causing any cost or burden. In some countries the population
census has already been replaced by making estimates from administrative sources, and big
data has the potential to reduce cost and burden also in other areas.
The seven categories are not mutually exclusive. On the contrary, making use of big data may
serve several purposes at the same time. If continuation of the precise current statistical
programme is a requirement, it may appear that the possibilities of using big data are limited.
If there is some flexibility in the programme, the potential of big data to increase total user
satisfaction is much higher, given the increased availability of data sources. Then a new
optimum may be found. In fact, the availability of big data sources requires a new
optimisation effort, aimed at getting the best set of statistical services, given the potential data
sources, the demand for statistical information and budgetary constraints and possibilities
(Struijs and Daas, 2013).
The 2015 UNSD survey gives insight into the actual use of big data for official statistics. The
main reasons for considering the use of big data given by the 89 respondents were the
production of more timely statistics and the reduction of response burden. However, big data
⁴ http://ec.europa.eu/eurostat/web/quality/european-statistics-code-of-practice
is not yet used, or considered for use, very often; the use of scanner data, at 22%, is the highest. Satellite data is used by 19% of the respondents, and web scraping by 16.5%. Big data is used much more often by OECD respondents than by non-OECD respondents, with the exception of satellite data, which is used by 22% of OECD and 17% of non-OECD countries.
Concerning the statistical domains in which big data is used, the top three consist of price
statistics (30%), population statistics (15%) and labour statistics (14%). It should be noted,
however, that the use of big data is in most cases only for explorative purposes, in pilot projects, and in many cases these pilot projects are in an initial phase.
2. Examples of big data for official statistics
In order to understand the issues that arise from using big data, or from the intention to use it, it helps to look at examples first. In official statistics there are not yet many examples of actual
use of big data in regular statistics outside price statistics (use of scanner data and web
scraping). There are more examples of research into the potential use of big data for official
statistics, such as on the use of mobile phone data or satellite data, and outside official
statistics there are many more examples, such as the well-known Billion Prices Project of
MIT⁵. There are also many examples outside official statistics of the use of data such as social media messages, for research or for commercial purposes.
The examples presented here are from official statistics, mainly in the Netherlands. The first
concerns the use of road sensor data, which is already being used for regular statistics. This is
a big data source without major data access issues, since the data is available from an
administrative source for statistical purposes for free. The second example is about the use of
mobile phone data, where data access is a big issue indeed, but where the potential uses are
many. Several countries are currently trying to get access to and use this data, and the data is
also exploited commercially. The third example concerns the use of public social media
messages, which poses particular methodological challenges. Together these examples show
many of the issues and possibilities of big data, and they have been documented in the
literature. The discussion in this section is mainly based on a paper written for the UNECE
(Struijs and Daas, 2013), some of the text of which is reused and updated.
2.1. Traffic loop data
In the Netherlands, approximately 230 million traffic loop detection records are generated a
day. This data can be used as a source of information for traffic and transport statistics and
potentially also for statistics on other economic phenomena. The data is provided at a very
detailed level. More specifically, for more than 20,000 detection loops on Dutch roads, the
number of passing cars in various length classes is available on a minute-by-minute basis.
The downside of this source is that it seriously suffers from undercoverage and selectivity.
The number of vehicles detected is not available for every minute at every location, and not all
Dutch roads have detection loops yet, although all main roads have. Fortunately, the first
problem can be corrected by imputing the absent data with data reported by the same location
during a five-minute interval before or after that minute (Daas et al., 2015). Coverage is improving
over time. Gradually more and more roads have detection loops, enabling a more complete
coverage of the most important Dutch roads. In one year more than 2000 loops were added.
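The imputation idea described above can be sketched in a few lines of Python. The counts and the loop below are invented for illustration, and the nearest-value fill is only one simple choice; Daas et al. (2015) describe the imputation method actually used.

```python
import numpy as np
import pandas as pd

# Toy minute-by-minute counts for a single (hypothetical) detection loop;
# NaN marks minutes for which no count was reported.
counts = pd.Series(
    [52.0, 48.0, np.nan, 50.0, 47.0, np.nan, np.nan, 45.0],
    index=pd.date_range("2011-12-01 08:00", periods=8, freq="min"),
)

# Fill each missing minute from the same loop's reported values within
# five minutes before (forward fill) or, failing that, after (backward fill).
imputed = counts.ffill(limit=5).fillna(counts.bfill(limit=5))
```

In this sketch the minute at 08:02 takes the value 48 reported one minute earlier, and the two missing minutes after 08:04 take the value 47.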
A considerable part of the loops are able to discern vehicles in various length classes,
enabling the differentiation between cars and trucks. This is illustrated in Figure 1. In this
figure, for the whole of the Netherlands, normalized profiles are shown for 3 classes of
5 http://bpp.mit.edu/
vehicles. The vehicles were differentiated in three length categories: small (< 5.6 meter),
medium-sized (between 5.6 and 12.2 meter) and large (> 12.2 meter). The results after correction
for missing data were used. Because the small vehicle category comprised around 75% of all
vehicles detected, compared to 12% for the medium-sized and 13% for the large vehicles, the
normalized results for each category are shown.
Figure 1. Normalized number of vehicles detected in three length categories on December
1st, 2011, after correcting for missing data. Small (< 5.6 meter), medium-sized (between 5.6
and 12.2 meter) and large (> 12.2 meter) vehicles are shown in black, dark grey and grey,
respectively. Profiles are normalized to more clearly reveal the differences in driving
behaviour.
The profiles clearly reveal differences in the driving behaviour of the vehicle classes. The
small vehicles have clear morning and evening rush-hour peaks at 8 am and 5 pm,
respectively. The medium-sized vehicles have both an earlier morning and evening rush hour
peak, at 7 am and 4 pm, respectively. The large vehicle category has a clear morning rush
hour peak around 7 am and displays a more distributed driving behaviour during the
remainder of the day. After 3 pm the number of large vehicles gradually declines. Most
remarkable is the decrease in the relative number of medium-sized and large vehicles detected
at 8 am, during the morning rush hour peak of the small vehicles. This may be caused by a
deliberate action of the drivers of the medium-sized and large vehicles, wanting to avoid the
morning rush hour peak of the small vehicles.
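The normalization used for Figure 1 can be sketched as follows. The hourly counts are invented, with volumes roughly matching the 75% / 12% / 13% split mentioned above; only the idea of putting very differently sized classes on a common scale is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy daily profiles (24 hourly values) for the three length classes,
# with invented volumes of roughly the right relative size.
counts = {
    "small": rng.poisson(7500, 24).astype(float),
    "medium": rng.poisson(1200, 24).astype(float),
    "large": rng.poisson(1300, 24).astype(float),
}

# Normalizing each profile to sum to one makes differences in driving
# behaviour visible even though the absolute volumes differ by an
# order of magnitude.
profiles = {name: c / c.sum() for name, c in counts.items()}
```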
At the most detailed level, that of individual loops, the number of vehicles detected
demonstrates (highly) volatile behaviour, indicating the need for a more statistical approach
(Daas et al., 2015). Harvesting the vast amount of information from the data is a major
challenge for statistics. For this, visualisation techniques can be very useful (Tennekes and
Puts, 2015). Making full use of this information would result in speedier and more robust
statistics on traffic in general and would provide more detailed information on the traffic of
large vehicles, which is very likely indicative of changes in economic development.
Since 2015, Statistics Netherlands has published regular statistics on the traffic intensity of the
main roads, based on this source6. Interestingly, the potential of big data was demonstrated at
the beginning of 2016, when the first three working days of the year were extremely frosty,
with icy roads in the north of the country. With the process already in place, it was possible to
publish a press release on the eighth of January reporting on the use of the main roads in the
north of the country, in which a comparison was made with the first three working days of
previous years. Road use was shown to have been halved7.
2.2. Mobile phone data
The use of mobile phones nowadays is ubiquitous. People often carry phones with them and
use them throughout the day. Instrumental for the infrastructure enabling mobile phone
coverage are the mobile phone masts/towers, called ‘sites’ in the industry. These sites are
located at strategic points, covering as wide an area as possible.
Much of the activity associated with handling the phone traffic, that is, handling the
localisation of mobile phones and optimizing the capacity of a site, is stored by the mobile
phone company. Mobile phone companies thus record data that are very closely associated
with the behaviour of people; behaviour that is of interest to NSIs. Obvious examples are
behaviour regarding tourism, mobility, commuting and transport. The destinations and
residences of people during daytime are also topics of various surveys. Data from mobile
phone companies could provide additional and more detailed insight into the whereabouts and
activity of their users, which may be indicative of the behaviour of people in general.
Several NSIs have tried to get access to mobile phone location data and explore the
possibilities for statistics. In the Netherlands, research on this has been going on for some
time now. A dataset from a mobile telecommunication provider containing records of all call-
events (speech-calls and text messages) on their network in the Netherlands for a time period
of two weeks was studied. There are about 35 million records a day. Each record contains
information about the time and serving antenna of a call-event and a (scrambled version of
the) identification number of the phone. Getting the data proved to be very complex. This
study revealed several uses for official statistics, such as statistics on economic activity,
tourism, population density, mobility and road use (De Jonge et al., 2012). In particular, the place
where people are at any time during the day can be compared to the place where people are
registered at municipalities. For these purposes, good visualisation is essential (Tennekes and
Offermans, 2014).
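A toy sketch of the kind of processing involved is given below. The records, antenna names and scrambled identifiers are invented, and the daytime-presence measure is deliberately crude; real studies of the kind cited above map antennas to areas and correct for market share and multiple devices per person.

```python
import pandas as pd

# Invented call-detail records of the kind described above: time of the
# call event, serving antenna, and a scrambled phone identifier.
cdr = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-05-01 09:10", "2012-05-01 09:40", "2012-05-01 14:05",
        "2012-05-01 09:20", "2012-05-01 23:30",
    ]),
    "antenna": ["A1", "A1", "A2", "A3", "A3"],
    "phone": ["p1", "p1", "p1", "p2", "p2"],
})

# A crude daytime-presence indicator: distinct devices observed per
# antenna during working hours (9:00-17:59 here).
daytime = cdr[cdr["timestamp"].dt.hour.between(9, 17)]
presence = daytime.groupby("antenna")["phone"].nunique()
```

Such an antenna-level count is the kind of quantity that can then be compared with the municipal population register mentioned above.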
6 http://www.cbs.nl/NR/rdonlyres/25CE3592-A756-42B7-BABF-C3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf
7 http://www.cbs.nl/nl-NL/menu/themas/verkeer-vervoer/publicaties/artikelen/archief/2016/helft-minder-verkeer-in-noord-nederland-door-ijzel-januari-2016.htm (in Dutch).
Recent research by Statistics Belgium also showed the potential of using mobile phone data
for official statistics (De Meersman et al., 2016). The Belgian NSI is one of the more
successful NSIs in obtaining access to mobile phone data, and promotes the formation of
mutually beneficial partnerships (Debusschere, 2016).
At the level of the ESS, the importance of securing access to data from mobile network
operators has been recognised. The use of mobile phone data is one of the research areas of
the ESSnet on Big Data, mentioned earlier. Part of this research is aimed at solving access
issues. In September 2016, a workshop was organised by this ESSnet with mobile network
operators8, to see what can be done to exploit access and partnership opportunities. This will
be further discussed in chapter 4.
2.3. Social media data
So far social media messages have not been used for regular official statistics, but their
potential use is increasingly being researched. In the Netherlands, more than one million
public social media messages are produced on a daily basis. These messages are available to
anyone with internet access. Social media is a data source where people voluntarily share
information, discuss topics of interest, and contact family and friends. To find out whether
social media is an interesting data source for statistics, Dutch social media messages were
studied from two perspectives: content and sentiment.
Studies of the content of Dutch Twitter messages (the predominant public social media
platform in the Netherlands at the time of the study) revealed that nearly 50% of those
messages were composed of 'pointless babble'. The remainder predominantly discussed spare-
time activities (10%), work (7%), media (TV & radio; 5%) and politics (3%). Use of the
more serious messages was hampered by the less serious 'babble' messages, which also
negatively affected text mining studies.
Determination of the sentiment in social media messages revealed a very interesting potential
use of this data for statistics. The sentiment in Dutch social media messages was found to be
highly correlated with Dutch consumer confidence, in particular with the sentiment towards
the economic situation. The latter relation was stable on a monthly and on a weekly basis.
Daily figures, however, displayed highly volatile behaviour (Daas et al., 2015). This suggests
that it is possible to produce weekly indicators for consumer confidence. It also revealed that
such an indicator could be produced on the first working day following the week studied,
demonstrating the ability to deliver quick results. Moreover, since consumer confidence
statistics are survey-based, cost and response burden reduction may be feasible, if quality
8 https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/WP5_2016_09_2223_Luxembourg_Workshop
issues can be solved in a satisfactory way. The survey may remain necessary for benchmark
purposes, but its sample size or frequency may be reduced. It is conceivable to use both
survey data and social media data in a model in order to get earlier results, lower cost and
response burden, and still maintain quality standards.
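The smoothing effect of weekly aggregation, which makes the weekly indicator feasible while daily figures remain too volatile, can be illustrated with invented daily sentiment scores; the real series is described in Daas et al. (2015).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Invented daily sentiment scores (e.g. the share of positive minus
# negative messages), standing in for the real Dutch series.
daily = pd.Series(
    rng.normal(0.0, 0.15, 90),
    index=pd.date_range("2015-01-01", periods=90, freq="D"),
)

# Weekly averages are much less volatile than the daily figures, and a
# weekly indicator can be compiled on the first working day after the
# week ends.
weekly = daily.resample("W-SUN").mean()
```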
The analysis of social media messages is an area where text mining, learning algorithms and
other artificial intelligence approaches are applied to big data. Apart from official statistics,
much research is done in academia (an example is Schwartz et al, 2014). This is another area
where partnerships may be beneficial to both sides. The potential of such data for commercial
purposes was also realised early on (Bollen et al., 2011), but this generally does not result
in publicly available methods and outcomes. It does show, however, that official statistics are
far from unique in analysing social media data.
3. Big data and the statistical process
In order to see how the use of big data for official statistics may affect statistical production
processes and methods, and what issues arise, the current way of producing statistics is the
background reference. Therefore the next section starts with characterising the more familiar
way of making statistics, before looking at possible changes caused by or required for using
big data. Methodological issues are the subject of the subsequent section, which also covers
quality issues, since the quality of statistics depends on the methods applied, and methods are
usually chosen to fulfil quality objectives. There is interaction between methods on the one
hand, and their implementation in processes on the other. This is the subject of the third
section of this chapter. The chapter builds, among other sources, on a paper on quality
approaches to big data (Struijs and Daas, 2014).
3.1. From well-known to new processes and methods
With only very few exceptions, the statistical programmes of NSIs are based on inputs from
statistical surveys and administrative data sources. For such statistics there exists an elaborate
body of validated statistical methods. Many of these methods are survey oriented, but in fact,
most survey based statistics make use of population frames that are taken or derived from
administrative data sources. Methods for surveys without such frames do exist, for instance
area sampling methods, but nowadays even censuses (of persons and households as well as
businesses and institutions) tend to make use of administrative data. And administrative data
sources are, of course, themselves also used as the main data source for statistical outputs.
Statistical surveys are increasingly used for supplementing and enhancing administrative
source information rather than the other way round. This is the consequence of the widely
pursued objectives of response burden reduction and cost efficiency.
Surveys may be run in parallel, in so-called stovepipes. These may be well co-ordinated or
may run more or less independently from each other. A large part of the body of established
methods for surveys is connected to sampling theory, the core of which refers to a target
population of units and variables, to which sampling, data collection, data processing and
estimation are tuned and optimised, considering cost and quality aspects.
Next to stovepipe statistics, there also exist integrative statistics, based on a multitude of
sources. The prime example of such statistics is National Accounts (NA). Statistical methods
for NA focus on the way different sources for various domains and variables of interest can be
combined. Since these sources may be based on different concepts and populations, frames
and models have been developed for integration of sources. These frames and models include,
for instance, macroeconomic equations. Interestingly, NA outputs generally do not include
estimations of business populations. This may reflect the fact that the production of NA
involves quite a few expert assumptions as well as modelling, rather than population based
estimation.
This characterisation of statistics in relation to methods is, of course, incomplete. There are a
number of methods aimed at specific types of statistics, for instance occupancy models for
estimating the development of wild animal populations, or time series modelling.
For big data the question is to what extent current methods and processes can be reused when
applying new types of data sources. Many big data sources do not have a deliberate design.
Traditional administrative registers have a well-defined target population, variables, structure
and (administrative) quality. They also have an explicit legal basis. But what design is behind
Twitter messages, commercial websites or mobile phone traffic? For big data sources,
populations can often not be specified, let alone related to other sources. How can NSIs then
ensure quality?
The implication is that methods derived from sampling theory may have their limitations
when big data are going to be used. However, although current methods are predominantly
based on sampling theory, this is not exclusively so. Methods outside traditional sampling
theory, especially those involving modelling, may be relevant when dealing with big data. And
modelling is already being applied in some statistical domains, such as NA and seasonal
adjustment.
3.2. Methodological issues
Starting with a very fundamental issue, what exactly is the meaning and relevance of the data
found in big data sources, from a user’s perspective? What does the number of searches on an
internet search engine reveal, or the sentiment observed in social media, or the number of
mobile phones connected to a site? The interpretation of big data can be a big methodological
problem (Daas and Puts, 2014). Moreover, meaning and relevance are user and use
dependent.
This issue is not unique to big data, to be sure, as for instance certain administrative data
sources may have a similar issue. In fact, this sometimes results in statistics about what can be
found in an administrative register rather than about the phenomenon of interest, such as when
reported rather than actual crime is measured, or the population with unemployment benefit
rather than unemployment itself. If the meaning of the data of a big data source cannot be
pinpointed, but obviously has some relevance, an option may be to produce stand-alone
statistics, such as a general sentiment indicator based on social media. The interpretation is
then up to the user, and changes in the index (rather than the level itself) may be interesting
anyway. In a way, the example of road use statistics based on sensor data can be seen as
stand-alone, since the number of vehicles passing a certain road segment can hardly be linked
to surveys on mobility or other statistics.
Another issue concerns the population about which a big data source reports. Most statistics
aim at giving information about populations of persons or businesses, or other relevant sets,
such as goods imported or sold. However, the population covered by big data may be unclear.
Mobile phones may be carried by others than the owner, some persons have multiple phones,
vehicles passing a detection loop may be private or company vehicles, and what do we know
about the population using social media? And how do these populations change over time?
How stable are they? In some cases it may be possible to obtain background variables, such as
for credit card data, while in other cases background variables may be estimated. For instance,
the choice of wording is correlated with age and sex of the user of social media (Daas and
Burger, 2015). This is another reason text mining may become more important in the age of
big data.
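A toy sketch of such a word-usage feature is given below. The marker words and messages are invented for the sketch; the actual features correlating word choice with age and sex are described by Daas and Burger (2015).

```python
import re

# Hypothetical marker words assumed, for illustration only, to be more
# common among younger users.
YOUNG_MARKERS = {"lol", "omg", "school"}

def marker_share(text: str) -> float:
    """Share of tokens in a message that belong to the marker list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in YOUNG_MARKERS for t in tokens) / len(tokens)

messages = [
    "lol that was so funny omg",
    "the quarterly report on employment figures is out",
]
shares = [marker_share(m) for m in messages]
```

Features of this kind could serve as inputs to a classifier estimating background variables of social media users, which is one reason text mining may become more important.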
This issue has to do with the question of selectivity and representativity (Buelens et al, 2014),
and with the sometimes unstructured nature of the data, which makes it even more difficult to
extract meaningful statistical information. Selectivity is a characteristic of many data sources,
including big data sources. For some of such sources, the selectivity mechanism is known,
such as for road sensors if the target population consists of road segments, or for financial
transaction data. For other sources this is partly known, as is the case for mobile phone data,
where the grid of antennae may be known, but the population of mobile phone users perhaps
not. In the case of the population behind social media the mechanism is even less known.
Not knowing the composition of the populations included in big data leads to the question of
what to do when sampling theory cannot be applied (Struijs et al., 2014). What can be done if
one does not know for which part of the target population the dataset is representative? More
fundamentally, one may wonder whether sampling theory deserves to be the default approach
to statistics in the age of big data. Maybe more model-based approaches need to be applied.
Examples are probabilistic modelling, Bayesian methods, multilevel approaches, statistical-
learning methods and occupancy models, such as those used in measuring wild animal
populations. Econometric models can also be considered. Then the measured phenomena are
leading, and research may be aimed at relating them to information already known.
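As a minimal sketch of relating a measured phenomenon to information already known, the following fits a simple linear relation between an invented big data signal and an invented survey series, and uses it to produce an early estimate. All numbers and the linear form are assumptions for illustration, not the method of any NSI.

```python
import numpy as np

# Invented monthly values: a big data indicator (e.g. a sentiment index)
# and the corresponding surveyed figure, for months where both exist.
signal = np.array([0.10, 0.15, 0.05, 0.20, 0.12])
survey = np.array([1.2, 1.5, 0.9, 1.8, 1.3])

# Fit survey ~ a + b * signal by least squares, then apply the fitted
# relation to a month where only the big data signal is available yet.
b, a = np.polyfit(signal, survey, 1)
early_estimate = a + b * 0.18
```

The sketch also makes the danger visible: the early estimate is only as good as the assumption that the fitted relation continues to hold.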
However, it is not clear whether this would really be desirable. This approach to official
statistics is not generally accepted, at least not yet, as this would increase the use of
assumptions in statistics, and the compilation of statistics by making use of observed
correlations between variables. For instance, the correlation between the sentiment as
observed in social media messages and the surveyed consumer confidence may be high and
remain high for a considerable time, but if that correlation is not well understood, there are
certain risks, especially if the relationship between the population writing public messages on
social media and the population at large is not known (Daas and Puts, 2014a). Such risks have
become well-known since quality issues appeared with the Google Flu data (Lazer et al,
2014).
For NSIs, a key question is how the quality of official statistics can be guaranteed if they are
based on big data and new methods such as modelling are applied (Puts et al, 2015). The
question of modelling is not new (Breiman, 2001), but the concerns with big data are (Tam
and Clarke, 2015). When reading articles by proponents of new approaches (e.g., Varian,
2014), one may wonder whether a paradigm shift is taking place.
Methodological issues also have the attention of the international statistical community. In
2014 the UNECE established a task team to advise on how to ensure good quality of statistics
when using big data. The report of the task team (UNECE, 2014) did not come up with a list
of methods, because there was too little experience with big data at the time and methods
would be source dependent, but it proposed an approach similar to assessing the potential use
of administrative data sources. Another initiative is currently taking place in the ESSnet on
Big Data mentioned earlier, where a work package has been defined to systematically assess
the methods that can be used for big data statistics, coming from the ESSnet itself as well as
from the literature. The work package will be carried out in 2017. Other organisations have
also been looking into such issues (e.g., Baker et al, 2013, and AAPOR, 2015).
3.3. Process issues
Making use of big data for official statistics may have consequences for all aspects of the
statistical process, from the input to the output process. At the input side, there may be access
issues. Throughout the process, there may be issues of privacy and security, and of
infrastructure needed for processing the volume of the data. There may also be issues
concerning dissemination.
When using big data, the design of the statistical process needs special attention. It may be
difficult to receive and process really high volume datasets, especially if the second “V” of
big data, velocity, applies. In the example of the use of traffic sensor data, the first analysis
was done on a huge set of data covering all records for several years. This yielded techniques
for reduction of the volume of data without information loss, by looking at what was actually
needed, including metadata, and removing noise. For instance, records may contain a lot of
metadata that is the same for a large set of records. The process that resulted was very
efficient and used parallel processing, but in this case it is also possible to use the streaming
data. That requires a different type of process. In any case, the processing, storage and transfer
of large data sets may pose a challenge. However, given technological advances like increases
in computing power, parallel processing techniques, larger storage facilities and high
bandwidth data channels, for most situations this does not need to become a bottleneck.
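The idea of reducing volume without information loss by splitting off repeated metadata can be sketched as follows; the column names and values are invented for the sketch.

```python
import pandas as pd

# Invented raw records in which the same loop metadata is repeated on
# every row, as described for the traffic sensor data above.
raw = pd.DataFrame({
    "loop_id": ["L1", "L1", "L2", "L2"],
    "road":    ["A13", "A13", "A2", "A2"],
    "minute":  ["08:00", "08:01", "08:00", "08:01"],
    "count":   [52, 49, 30, 28],
})

# Split the constant metadata into a small reference table and keep only
# the varying measurements: a simple normalisation that shrinks the data
# without losing information.
meta = raw[["loop_id", "road"]].drop_duplicates()
obs = raw[["loop_id", "minute", "count"]]

# The original table can be reconstructed exactly when needed.
restored = obs.merge(meta, on="loop_id")
```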
There may be technical solutions for dealing with high volume and high velocity data, albeit
possibly at additional costs, but another solution is conceivable: perhaps the data can remain
at the source. Maybe it is possible to arrange that queries are done by the source holder in the
source, for instance that data is first aggregated or sampled prior to sending it to the NSI. But
such solutions themselves entail other issues. Are the results reproducible, can they still be
linked to data available at the NSI? In the case of mobile phone data, the operators often
prefer delivering aggregated data rather than individual records. This has to do with privacy
considerations, among other things (Struijs et al, 2014).
In general, as is the case with more traditional statistical processing, the data may be
processed physically at the NSI offices or elsewhere, in various arrangements. Although it
entails its own issues, cloud computing may be considered if some conditions are met (see
below). Another possibility is to collaborate with another party, for instance a research
institute with facilities for big data processing. This may be beneficial to all parties involved,
since knowledge and experience can be shared. Whatever arrangement is entered into, it is
important to be aware of possible risks, for instance in respect of the continuity of the
partnership and public trust. There are many platform possibilities (Singh and Reddy, 2014)
and many tools for analysis, including visualisation tools for big data sets (Tennekes et al,
2013).
Privacy and security considerations are especially important when dealing with big data,
because in comparison with more traditional processes, the problems can be compounded and
the legal situation may not be entirely clear or in flux, as existing legislation and rules were
not designed for dealing with big data. There are several issues, real or perceived, that may
impede using big data. Data ownership and copyright may be an issue, and the purpose for
which data are registered. Even if data is publicly accessible, for instance on websites or as
social media messages that do not have access restrictions, questions of ownership and
purpose of publication can be raised. Internet robots cause a burden on the providers of the
sites, and in some cases site owners prefer sending data directly to the NSI. And even if data
may legally be used, this does not imply that it is wise or appropriate to do so. Of critical
importance is the implication of any use of big data for the public perception of an NSI as this
has a direct impact on trust in official statistics.
A complicating factor is the circumstance that public opinion on privacy and confidentiality
seems to be in flux. On the one hand, privacy seems to be ever more under pressure when
public safety or commercial interests are perceived to be at stake, and young people who have
grown up using social networks tend to consider privacy less important than the elderly. On
the other hand, there seems to be a growing general awareness of possible privacy
implications of the ubiquity of data, resulting in a more critical attitude towards the
unquestioned processing of data by anyone. Anyway, the understanding for the need for
statistical data collection by organisations is decreasing, especially if such data are already
registered elsewhere.
Fortunately, there are measures NSIs can take to overcome at least some of the obstacles. In
some cases the use of informed consent may be a solution. If the NSI can offer a reduction of
the response burden, this can be very helpful, also in getting the support of the general public.
Transparency about which big data sources are used, and how, is crucial. In the long run,
changes in legislation may be considered, to ensure continuous data access. But it remains
important to stay in line with public opinion, because credibility and public trust are important
assets of NSIs.
As to cloud computing, a few sensible rules may be used. Cloud services include the
provision of computing resources, platforms and IT applications via the internet. At the
present legal and technical state of the art, it is advisable in general not to host sensitive and
critical data and processes in the cloud. The user of the cloud service remains responsible for
the security and privacy of the data, and public trust depends on this. It is also advisable to use
data encryption where possible.
Even at the output side of the statistical process, there may be issues. The prevention of the
disclosure of the identity of individuals is an imperative, but this is difficult to guarantee when
dealing with big data, although a number of techniques are available that have proven to be
reliable. Another issue may be the dissemination policy. Some statistical outputs based on
big data may be innovative or provisional and may entail some quality risks. Instead of not
disseminating results for which there is high demand, even if their quality does not meet
traditional standards, a possibility may be to release such results on a beta site, where all
outputs are qualified as provisional by default. In fact, this is common practice with many large internet
businesses, so that they get early feedback on their products. Partly for this reason, Statistics
Netherlands launched an innovation site9 in October 2016.
International organisations have tried to provide help and guidelines to deal with process
issues. In Ireland a so-called Sandbox for practicing with big data was created in 2014 by the
Central Statistics Office (CSO) of Ireland and the Irish Centre for High-End Computing
(ICHEC)10. NSIs may use these facilities for a small annual fee, and assistance is provided.
International guidelines have been developed on privacy and security by a task team of
UNECE in 2014, resulting in three documents on good practices. These documents do not
have any formal status, but facilitate working with big data; they are available on the big data
site of the 2014 project of UNECE11.
At the level of the UN, the Global Working Group (GWG) mentioned earlier has also
produced recommendations. In particular, there are recommendations on access and
partnerships (GWG, 2015a), and a template for a Memorandum of Understanding with global
data providers (GWG, 2015b). Furthermore, the GWG has drafted a number of principles for
data access, trying to find a fair balance between the interest of getting free access to data for
the public good on the one hand, and legitimate interests of private organisations on the other.
The main elements of this balance are the creation of a level playing field, equal treatment,
safeguards for confidentiality and security, transparency, and proportionality. These draft
principles, which are annexed to this document, are based on the Fundamental Principles of
Official Statistics of the UN12. They have been presented and discussed with stakeholders at
several fora and have received broad support.
9 https://www.cbs.nl/en-gb/our-services/innovation
10 http://www1.unece.org/stat/platform/display/bigdata/Sandbox
11 http://www1.unece.org/stat/platform/display/bigdata/2014+Project
12 http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
4. Getting ready for big data
4.1. Organising for big data
An NSI that wants to make serious use of big data will have to organise itself in order to cope
with the challenge. Several factors are important. A factor not discussed so far is human
capital. In order to work with big data, specific technical skills are needed, such as advanced
computing skills, a fair command of math and statistics, modelling skills and data engineering
skills. But equally important are the mental orientation and behavioural skills of the staff.
Working with big data requires an open mind-set and the ability not to see all problems a
priori in terms of sampling theory. For this type of staff the term data scientist has been
coined. However, it is not evident that the culture of NSIs can smoothly absorb this type of
professional. A way to deal with this cultural issue is to create one or more kernels of data
scientists working with big data, and let these kernels grow, which will be a natural process if
they are successful (Struijs and Daas, 2013).
Another factor is the way processes are organised. The continuity and possible volatility of
big data sources deserve consideration. Social media, for instance, seem to have an ever
shorter lifecycle. As a consequence, the use of big data requires a more flexible set-up of
production processes, with a short time-to-market. Not only data collection but also data processing further down the production chain has to be flexible. More generally, NSIs that start using
big data may have to adapt or even reconsider their enterprise architecture.
For NSIs that want to make big data a serious part of their business, governance may become
an issue. Because of the important strategic aspects of big data, this subject should get
attention at the highest management level of the NSI. Setting priorities, creating favourable
conditions for using big data and taking related budget decisions would be tasks for the
strategic level, as would be the making of policy choices. An increased use of big data
requires a number of policy decisions that influence various parts of the organisation. The
organisation’s CIO (Chief Information Officer) would likely have an important say in the way
the NSI deals with big data.
There are more relevant factors, of course. The required IT infrastructure must be in place,
and the same goes for an appropriate research capability. Policy support must be organised,
for instance concerning privacy issues. All this requires a conscious effort and co-ordination.
However, there is no blueprint for getting an organisation ready for big data.
To give an example, this is the way Statistics Netherlands organised itself when the awareness
grew that big data was a strategic issue. Once the Board of Directors of Statistics Netherlands
identified the need to have a big data strategy, they had a staff member write a position paper
for discussion by the Board. The paper suggested that the NSI should work out a big data
roadmap, and this was done. The roadmap was validated by IBM (IBM, 2014), updated twice
a year, and monitored. The roadmap not only identified big data research projects and
statistics, together with a time plan and ownership, but also arranged for creating the right
conditions, such as IT, methodological and policy support. The Deputy Director-General was
made responsible for big data at the strategic level. Statistics Netherlands already had an R&D
programme, which was then also tuned to the demand for big data research. At a more tactical
level, a big data co-ordination group was created. This group prepared updates of the
roadmap, and also arranged for internal training to be given.
In practice, this approach did yield results, but it was not satisfactory. In particular, the
transition from research to regular statistics production took much longer than desired. In
2016, the decision was taken to centralise all big data activities in a programme with its own
physical facilities and permanent staff, to which other staff is added on a project basis. This
centre, which is called the Center for Big Data Statistics (CBDS), was launched in September
2016. It already has many external partners, and makes use of the innovation site mentioned
in section 3.3.
4.2. Towards a data ecosystem13
The environment in which NSIs operate is changing. This may have consequences for the
position an NSI occupies or wants to occupy in the data society. NSIs are faced with more and
more potential data sources, whereas the modalities for their use are changing. Most actions
by persons or businesses – transactions, movements, communication, social and business
activities – nowadays leave digital traces in one way or another; ever increasing amounts of
data are becoming available. And, contrary to survey data, these data are not available exclusively to NSIs, which are thus becoming less unique as users of data.
The position of NSIs in the information society is becoming less evident, even though their
institutional setting is stable for the time being. Other providers of information on relevant
phenomena of society pop up everywhere. They are often very quick and perceived as
knowledgeable. Because of this, society becomes less dependent on information from NSIs.
There are alternatives, for instance, to official price indices. Even if there are quality issues
attached to these alternatives, there is demand for them. Apart from the many practical
questions, big data is bound to have an impact on NSIs at the strategic level.
One of the questions with which an NSI may be confronted is what to do if there is a market alternative to one or more of its statistics. But it may also ask whether it can assume new roles, based on its institutional position and the knowledge it has accumulated. Should it, for instance, consider shifting its role from producing statistical information towards validating information produced by others? Or pooling resources?
A possible approach is to assess the strengths and weaknesses of the NSI, and take them into
account when positioning itself in the information society. For instance, NSIs have a unique
ability to relate data from different sources and to assess the quality of information produced
13 This section makes extensive use of (Struijs and Daas, 2014) and (Struijs et al, 2014).
by others. They may try to exploit this by forming networks and forging partnerships with
other organisations. NSIs have come to recognise the necessity of not working in isolation but
collaborating with each other and others outside the community of official statistics. This
collaboration is often exploratory and may be aimed at sharing knowledge and experiences,
but there are already examples of collaboration that go further.
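The ability to relate data from different sources, mentioned below as a distinctive strength of NSIs, can be illustrated with a minimal sketch. The field names, data and coverage measure here are illustrative assumptions, not taken from the paper: a business register is linked to an external big data source on a shared unit identifier, and the match rate gives a first, crude indication of the external source's coverage.

```python
# Hypothetical sketch: linking a statistical register to an external
# (big) data source on a shared key, and measuring coverage -- one way
# an NSI can relate sources and assess the quality of external data.
# All names and values are illustrative assumptions.

def link_sources(register, external):
    """Join two lists of dicts on 'unit_id'; report the match rate."""
    ext_by_id = {rec["unit_id"]: rec for rec in external}
    linked, unmatched = [], []
    for rec in register:
        match = ext_by_id.get(rec["unit_id"])
        if match is not None:
            linked.append({**rec, **match})   # merge register + external fields
        else:
            unmatched.append(rec["unit_id"])  # register unit missing in source
    coverage = len(linked) / len(register) if register else 0.0
    return linked, unmatched, coverage

register = [{"unit_id": 1, "sector": "retail"},
            {"unit_id": 2, "sector": "transport"},
            {"unit_id": 3, "sector": "retail"}]
external = [{"unit_id": 1, "turnover": 120},
            {"unit_id": 3, "turnover": 80}]

linked, unmatched, coverage = link_sources(register, external)
print(unmatched, round(coverage, 2))  # two of three register units are covered
```

In practice such linkage involves far harder problems (no common identifiers, deduplication, probabilistic matching), but the coverage measure sketched here is the kind of quality signal only an organisation with a reliable frame can compute.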
From the perspective of NSIs, several types of partners are of interest. First of all, the
potential providers of big data are essential partners: if they do not grant access to their data,
the story is over before it starts. Data owners have their own concerns and, like NSIs, they are
subject to privacy rules. This may complicate collaboration even if they have a positive
outlook and approach. But since big data sources are not designed for statistical use, such
collaboration is also essential in order to obtain good knowledge of the provenance of such
sources. Additionally, for statistical production, it may be more efficient to have data
processed at the site of collection and storage.
On the other hand, statisticians also have much to offer such as providing analytic insights
that may help data owners understand their data better. Doing complex statistical analyses is
core business for NSIs, but not for, say, a mobile phone company. In these and other ways,
the relationship with data providers could potentially become true partnerships. For example,
one specific role that NSIs could play is that of a trusted third party. In a competitive market,
competitors will be reluctant to share sensitive data among each other. But they might be
willing to share it with an NSI that compiles statistical information that is beneficial to all.
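The trusted-third-party role can be sketched as a simple aggregation with a disclosure rule: the NSI pools sensitive values from competing firms but publishes a cell only when enough firms contribute to it, so no single contributor's data can be inferred. The threshold of three contributors and the data are illustrative assumptions; real statistical disclosure control is considerably more elaborate.

```python
# Hypothetical sketch of the trusted-third-party role: sum sensitive
# values per publication cell, and suppress any cell backed by too few
# contributing firms. Threshold and data are illustrative assumptions.

from collections import defaultdict

def safe_aggregates(records, min_contributors=3):
    """Sum values per cell; return None for cells with too few contributors."""
    cells = defaultdict(list)
    for firm, cell, value in records:
        cells[cell].append(value)
    return {cell: sum(values) if len(values) >= min_contributors else None
            for cell, values in cells.items()}

# One record per firm per cell: (firm, publication cell, sensitive value)
records = [("A", "region_1", 10), ("B", "region_1", 20), ("C", "region_1", 15),
           ("A", "region_2", 5),  ("B", "region_2", 7)]

print(safe_aggregates(records))
# region_1 is published (3 contributors); region_2 is suppressed (only 2)
```

The point of the design is that competitors never see each other's inputs: only the NSI holds the microdata, and the published aggregate is useful to all parties.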
Collaboration between NSIs and academia may grow as well. Universities have historically
been natural partners for NSIs. It stands to reason that such collaboration will extend to the
field of big data, for instance, in solving methodological problems, developing technical
solutions and training future data scientists. Such collaboration is also being supported by
public funders who are facilitating research and innovation partnerships through targeted
grants. By working in partnership, researchers in universities and NSIs could better leverage
such opportunities.
Furthermore, there are many commercial partners with which NSIs could collaborate. Google
and Facebook are two examples for which big data forms the core of their business model.
Their knowledge and the data to which they have access may be very relevant to NSIs. IT
companies also possess relevant knowledge on big data processing and storage, security,
cloud processing, etc. Apart from the provision of paid services, collaboration may be of
interest to them with a view to obtaining statistical expertise and for benchmarking or
validating their information products.
The relationship between the various stakeholders will involve each partner building on and
contributing different strengths and will likely result in flexible networks. Such networks are
flexible in the sense that membership of the network and the contribution of partners depend
on actual needs instead of being fixed in advance for a long time. The emerging data
ecosystem will also allow for forming ad hoc consortia to compete for research funds and
other subsidies, such as funds from the Horizon 2020 programme (European Commission,
2013). The ESSnet on Big Data mentioned earlier can also be seen as part of the data
ecosystem.
Partnerships are the core of the data ecosystem, and international organisations have tried to
document good practices. The GWG work on access and partnerships, such as the recommendations for access to data from private organisations mentioned in section 3.3, carried forward earlier work of UNECE on partnerships.14 Guidelines have also been drafted by PARIS21 in collaboration with the OECD (Robin et al, 2016).
14 http://www1.unece.org/stat/platform/display/bigdata/Guidelines+for+the+establishment+and+use+of+partnerships+in+Big+Data+Projects+for+Official+Statistics
Annex
Global Working Group on Big Data for Official Statistics
Recommendations for Access to Data from Private Organisations
for Official Statistics
Draft 14 July 2016
Preamble
The Global Working Group on Big Data for Official Statistics,
(1) Taking notice of the high and urgent need for access to data kept by private organizations
for the production of official statistics, such as indicators for the Sustainable Development
Goals and statistics on phenomena related to modern society, and the social responsibility
already shown by private organizations to provide access to new data sources, free of
charge, for purposes such as disaster relief and the fight against epidemics,
(2) Bearing in mind that in using such data the Fundamental Principles of Official Statistics, as endorsed by the UN General Assembly15, unconditionally apply, and that the statistical community has pledged to adhere to the professional ethics stated in the Declaration on Professional Ethics, as adopted by the International Statistical Institute16, thereby creating the foundation for sharing data for official statistics,
(3) Recognizing the legitimate interests of private organizations, including respect for their
business model and value proposition, and the need to guarantee a level playing field for
private organizations considering the burden created by providing data for official
statistics, as well as the legitimate interest of organizations in charge of compiling official
statistics to have equal access,
(4) Stressing that the burden to private organizations resulting from data requests for official
statistics must be fair in proportion to their envisaged public benefits and that the data
should be adequate and relevant in relation to the purposes for which they are requested,
(5) Considering that legislation aimed at accessing and using data kept by private organizations unavoidably lags the emergence of new types of data sources, that existing national and international legal frameworks fully apply but need interpretation in view of new data sources, especially concerning privacy, data ownership, reuse of data by third parties, and liability in case of breaches of confidentiality, and that there is thus a need for guidance,

15 Resolution 68/261, adopted by the General Assembly on 29 January 2014.
16 This declaration was adopted by the Council of the International Statistical Institute in its session of 22 and 23 July 2010, in Reykjavik, Iceland.
(6) Highlighting the need to create public trust by applying full transparency in the use of
data from private organizations for official statistics, in particular in view of privacy
concerns, given a number of well-publicized cases of likely abuse outside the realm of
official statistics, and the need to provide clarity concerning the possible use for statistical
purposes of personal data in customer contracts with private organizations, for instance by
referring to the Recommendations set out below,
(7) Acknowledging that private data sources are diverse in many respects, such as data
ownership, provenance of the data, purpose of collecting the data, and characteristics of
the data itself, and that providing access to the data can take a variety of shapes, such as
sending micro data to statistical agencies, providing aggregates compiled according to
specifications from statistical agencies, or providing on-site data access for analysis,
(8) Admitting that source and branch specific operational rules and guidelines may be needed
for dealing with access to data kept by private organizations, that such rules and
guidelines should be consistent with the Recommendations set out below, that before
access is requested for the purpose of producing official statistics data exploration may be
necessary in collaboration with the private data source, and that this requires the
development of partnerships between the private organizations providing the data and the statistical agencies using it,
Endorses the following recommendations for access to data from private organizations for
the production of official statistics:
Recommendations
Recommendation 1. The role of national and international systems of official statistics is to
provide relevant, high-quality information to society in an impartial way. This role is
indispensable to the well-functioning of societies. To this end, data is needed from private
organizations as inputs to these systems. In view of the emergence of new types of data
sources and the social responsibility of private organizations, these members of society are
called upon to make the data that is needed available to the statistical agency concerned, free
of charge, on a voluntary basis.
Recommendation 2. The data needed for official statistics may only be collected and
processed if the statistical agency concerned acts in full accordance with the Fundamental
Principles of Official Statistics.17 These principles guarantee, among other things, the
professional independence and accountability of the statistical agency, and the strictly
confidential use of the data, exclusively for statistical purposes.
17 http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
Recommendation 3. When data is collected from private organizations for the purpose of
producing official statistics, the fairness of the distribution of the burden across those
organizations has to be considered, in order to guarantee a level playing field.
Recommendation 4. Data requests for official statistics must acknowledge and take into
account the role of data in the business model and value proposition of private organizations,
in particular if their data has market value. There must be a fair balance between public and
business interests when data is requested and possible harm to business interests has to be
kept as low as possible.
Recommendation 5. The data must be adequate and relevant in relation to the purposes for
which it is requested from the private organization. No more data should be requested than
needed for these purposes. Operational arrangements have to be agreed on between the
private organization and the statistical agency concerned, taking into account business
concerns and data adequacy for official statistics. The metadata must also be adequate.
Recommendation 6. The cost and effort of providing data access, including possible pre-
processing, must be reasonable compared to the expected public benefit of the official
statistics envisaged.
Recommendation 7. When private organizations operate internationally, they are expected to
treat requests for data from national statistical systems in a non-discriminatory way, unless
different treatment is justified by differences in the national legislative frameworks
concerned, and provided that adherence to the Fundamental Principles of Official Statistics is
guaranteed in theory as well as in practice.