Services concerning ethical, communicational, skills issues and methodological cooperation related to

the use of Big Data in European statistics

(Contract number 11104.2015.005-2015.799)

TASK 1: Ethical review

Deliverable 1.3 Report on ethical guidelines

Version 2

Date: 3 June 2017

Drafted by: SOGETI Luxembourg: Alma RUTKAUSKIENE

Disseminated: EUROSTAT: Albrecht WIRTHMANN

Table of contents

INTRODUCTION

GENERAL ETHICAL CONSIDERATIONS

POSSIBLE ETHICAL APPROACH BY BIG DATA SOURCES

1. Mobile positioning data (mobile phone data)

2. Data from smart electricity consumption meters (smart meters)

3. Road traffic loops data

4. Remote sensing data including satellite image data, data from unmanned aerial vehicles (UAV)

5. Social media data

6. Web-scraped data from company websites, job vacancy websites or real estate agencies' websites

7. Query and ClickOut data from internet searches

8. Cash register data, e.g. from supermarkets

CONCLUSIONS

REFERENCES

INTRODUCTION

The strategic importance of big data for the European Statistical System has been recognised by the

European Statistical System Committee (ESSC) by adopting the Scheveningen Memorandum [A] in

September 2013. The Directors General of the National Statistical Institutes conference considered,

“Official statistics should incorporate as much as possible all potential data sources, including Big Data,

into their conceptual design”. It was acknowledged that “Big Data represent new opportunities and

challenges for Official Statistics, and therefore encourage the European Statistical System and its partners

to effectively examine the potential of Big Data sources in that regard”.

The National Statistical Institutes (NSIs) are exploring possibilities of integrating big data sources in

production of official statistics. Pilot projects were initiated by the UN SD, UN ECE, and Eurostat. The

projects identified the opportunities and pointed out issues linked to the access to the data, assurance of

privacy of the data subjects, quality of the data in terms of suitability for official statistics, etc. In

order to provide the public with independent, high-quality information, statistical offices adhere to

statistical principles that are defined by the European Statistics Code of Practice and the UN

Fundamental Principles of official statistics that constitute the ethical framework of official statistics.

The ethical framework of official statistics does not preclude the use of any types of data sources if they

ensure the quality of statistical output, are cost-efficient and minimise the reporting burden for the data

providers (UN Fundamental Principles of Official Statistics, Principle 5) [B]. Even so, specific

characteristics of big data sources may require additional efforts from the NSIs to comply with other

statistical principles such as professional independence, mandate for data collection, adequacy of

resources, impartiality and objectivity, clarity of the methods used to obtain statistical output, that are

embedded in the European Statistics Code of Practice [C].

The aim of these Guidelines is to draw the attention of statistical authorities to possible issues related to

professional ethics once big data is used for the production of official statistics and to recommend an

approach that would be compliant with the statistical code of conduct. The guidelines are based on the

results of the projects carried out by statistical institutes and research organisations investigating

different aspects of the possible use of different types of big data in official statistics. The guidelines

provide general recommendations that are common to the majority of types of big data and

recommendations that may concern particular types of big data.

As the exploration of the potential of big data continues, it is obvious that these guidelines will need to

be updated once more experience is collected. The recommendations have a rather general character

and will need to be adapted to the national conditions and specific situation.

GENERAL ETHICAL CONSIDERATIONS

The professional ethics of official statistics is based on shared professional values of statisticians: respect,

professionalism, truthfulness and integrity [D]. These values are reflected in the UN Fundamental

Principles of Official Statistics [B] and European Statistics Code of Practice [C].

Statistical authorities of the EU committed themselves to adhere to the European Statistics Code of

Practice (CoP) that consists of 15 principles covering the institutional environment, the statistical

production process and the output of statistics. The quality assurance framework of the European

Statistical System facilitates the implementation of the CoP by describing activities, methods and tools to

operationalize the indicators of the CoP. The adhesion to these principles makes official statistics a

trusted source of information for all users.

Due to specific characteristics of big data (high volume, high velocity, high variety) statistical authorities

need to be ready to meet new challenges in order to harness its potential for official statistics. Some of

these challenges would require an answer to ethical questions that may be raised at different stages of

the statistical production process. Below we provide examples of possible ethical considerations that

may be needed at the main stages of the statistical production process: data acquisition, data processing

and dissemination of statistical output.

Big data acquisition. Big data in most of the cases are collected by private companies that, in

principle, are not compelled by the law to provide the data to statistical authorities. Therefore the

provision of the data will depend on benefits realised or perceived by the data holders. The

experience of the NSIs shows that special arrangements with the private companies are needed in

order to get access to the data they possess.

In order to make these arrangements according to professional ethics, NSIs would need, first of all,

to respect the principle of professional independence, meaning that statistics must be developed,

produced and disseminated in an independent manner. The agreements with private companies

need to avoid any pressure from businesses to put their interests above the public interest. The

selection of data providers should be done in a transparent way without favouring one company

over another.

Statistical institutes need to assure big data providers that the big data obtained from them will be

used exclusively for the purposes of official statistics and there is no risk of harm to their business.

In case big data contain personal information the companies that collect these data are obliged by

the law to protect the privacy of the data subjects. In order to work in ethical terms and show

respect to the data subjects, statistical institutes should be informed by the data providers whether

their customers are aware that the data about them can be delivered to statistical authorities.

Big data processing. In general big data are not designed for statistical purposes and thus do not

meet statistical standards on concepts and definitions as such. A number of data quality issues

might challenge the use of big data for official statistics, e.g., the selectivity of the data, no

guarantee in continuity and stability of the data structure, or the risk of data manipulation. The

compliance with professional ethics may be questioned if statistical models, imputation techniques

etc. used for the data processing and output were not scientifically validated, or if the quality of the

output that is going to be published as official statistics could not be assured. The professionalism

of statistical offices is one of their major assets that makes official statistics a trusted source of

information.

The processing of big data (big data analytics) can involve methods of data linking that might reveal

personal information. Improper use of the personal data within the statistical system can cause harm

to individuals or businesses (intentionally or unintentionally) and damage the reputation of official

statistics. Existing internal rules of access and measures to ensure the confidentiality of information

have to be applied to big data. Additional research might be necessary to ensure privacy and

confidentiality of the disseminated data.

Dissemination of statistical output. Big data requires complex techniques to produce statistical

output. In some cases, only inputs and outputs might be observable while the transformation may

not be transparent. The professional ethics of statisticians requires that “information on

methods and procedures used to produce official statistics is publicly available” [3] and scientifically

sound. Therefore, statistical agencies using big data analytics should not only describe the data

sources but should also document the applied methods and models to enable independent

assessment of data processing and results.

Different types of big data may raise different ethical questions that depend on the characteristics of

these data, e.g. availability of the access to the data, content of personal information, quality issues in

terms of suitability for the purpose of official statistics, the clarity of the methods to be applied in order

to get statistical output etc. These guidelines provide some examples of possible issues linked to

different types of big data and recommendations on how to handle them in compliance with the

statistical principles.

POSSIBLE ETHICAL APPROACH BY BIG DATA SOURCES

1. Mobile positioning data (mobile phone data)

The data source

Mobile positioning is tracking the location of mobile telephones. Generally, it can be divided into active

and passive mobile positioning. Active mobile positioning is used for tracking the location of mobile

phones in real time using a mobile positioning system (MPS). There are many technical solutions for active

real time tracking of telephones. The cell identity method determines the network cell where the

telephone is located. Location data from passive mobile positioning is automatically stored in memory or

log files of Mobile Network Operators (MNO). Operators’ systems generate a very large amount of data

on the use of mobile communication including location information. These data are mostly used

internally by the network carriers for business and marketing purposes, e.g. charging clients for services,

providing usage statistics, analysing network performance, or developing new marketing products. The

location data can be used for generating statistics about space-time movement of phones (phone users)

cost-effectively [1].

Potential use for official statistics

The mobile positioning data can be used to complement European tourism statistics (to collect the data

on short trips or same-day visits as well as during a longer period of stay). Such a method could be an

alternative to the 'bookkeeping system' or 'diary' currently used or to the traditional ex-post

questionnaires in which respondents report on trips made during a specified reference period. It can

provide information previously not available (new indicators) and calibration opportunities for existing data.

A follow-up sample survey may still be needed to collect additional, qualitative information on the trips.

However, the sample surveys could be based on much smaller samples [1].

The data from MNOs across different countries could make it possible to produce a pan-European view of the

population density. Furthermore, the proper fusion of multi-MNO data from the same country bears the

potential of improving the accuracy of the estimation within the same country along different directions,

namely: (i) increase the population coverage; (ii) mitigate the potential bias caused by MNO specific

network configurations and (iii) improve the spatial accuracy [2]. The integration of existing population

and flow statistics with the continuously up-to-date estimates obtained from GSM data could provide

more accurate results [3].

Possible concerns

Acquisition of the data. Access to mobile positioning data might be a problem for the NSIs due to

regulatory limitations that vary across EU countries. The main concerns are related to the privacy of the

data subjects. This data source has sensitive personal information and can create a perception of people

being tracked. Apart from the regulatory limitation, MNOs have business and financial concerns.

Providing data to third parties may have a negative impact on their business.

In order to get access to mobile positioning data statistical authorities would need special arrangements

with MNOs. This may need ethical considerations to preserve the professional independence of

statistical agencies and guarantee an equal treatment of data providers.

Data processing. The major issues that may arise during the data processing are linked to the quality of

mobile positioning data in terms of its suitability to the purpose of official statistics. The data has some

limitations. For example, in the case of tourism statistics there is a lack of information on the purpose of

the trip, expenditure, type of accommodation and means of transport used. It is difficult to estimate the

over-coverage of the same-day trips due to the misclassification of overnight trips as well as over and

under-coverage related to the usage of mobile phones (tourists who do not appear in mobile positioning

data, tourists who use several mobile devices or the roaming service of several MNOs) [4].

Dissemination of statistical output. Due to the differences in the concepts and definitions, the

methodology of data processing for the purposes of official statistics might be rather complex. The

statistical principle of transparency requires that official statistics and corresponding metadata are

presented in clear and understandable form.

Recommended ethical approach

1. Any arrangements with the data providers in order to get access to the mobile positioning data

should not compromise professional independence of statistical authorities. The cooperation

should be based on mutual benefit balancing the interest of statistical authorities and private

companies.

2. The selection of the data and of data providers for the purposes of official statistics should follow

statistical considerations and should not favour or disadvantage particular companies.

3. The privacy of the data subjects has to be respected. The data subjects should be informed

that their data (mobile phone data records) are used by the national statistical authorities for the

purposes of official statistics.

4. Upon access to the data, the protection of personal data must be ensured by statistical offices.

Measures to protect personal data used by statistical authorities and private data providers

should be communicated to the general public.

5. The data obtained from MNOs must be used exclusively for the purposes of official statistics. The

possibility to harm MNOs or their subscribers directly or indirectly should be excluded. Measures

to ensure exclusive use for statistical purposes should be communicated to MNOs, their clients

and to the general public.

6. The rules of data retention need to be defined and communicated to the general public.

7. Statistical offices should minimize use of individual data and should adopt a privacy by design

approach.

8. It would be recommended to collaborate with academia and research organizations in order to

develop the methods of data analysis and processing that would ensure the quality of statistical

output in accordance with the requirements of official statistics.

9. The methods applied during the processing of the mobile communication data to obtain the

statistical indicators that are disseminated as official statistics need to be explained to the users

in a clear and understandable form.

10. Statistical offices should perform prior impact assessment to evaluate possible risks and define

mitigation activities.
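Recommendation 7 (minimising the use of individual data) can be illustrated with a minimal sketch. It assumes a hypothetical record layout of (device identifier, cell identifier, hour) and a hypothetical suppression threshold `min_count`; neither comes from the projects cited above, and an actual disclosure control rule would have to be defined by the statistical office.

```python
from collections import defaultdict


def aggregate_with_suppression(records, min_count=10):
    """Count distinct devices per (cell, hour) and drop small groups.

    records: iterable of (device_id, cell_id, hour) tuples (assumed layout).
    min_count: assumed disclosure threshold; groups below it are suppressed.
    """
    devices = defaultdict(set)
    for device_id, cell_id, hour in records:
        # A set ensures a device seen several times is counted once.
        devices[(cell_id, hour)].add(device_id)
    # Only aggregated counts are returned; device identifiers are not.
    return {
        key: len(ids) for key, ids in devices.items() if len(ids) >= min_count
    }


# Toy records, invented purely for illustration.
records = [
    ("a", "cell-1", 8),
    ("b", "cell-1", 8),
    ("a", "cell-1", 8),
    ("c", "cell-2", 8),
]
counts = aggregate_with_suppression(records, min_count=2)
print(counts)
```

In this sketch the group for "cell-2" contains a single device and is therefore suppressed, so only aggregated counts above the threshold are passed on to later processing steps.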

2. Data from smart electricity consumption meters (smart meters)

The data source

The European Smart Metering Alliance defines smart metering as having the following features:

• Automatic processing, transfer, management and utilisation of metering data

• Automatic management of meters

• 2-way data communication with meters

• Provides meaningful and timely consumption information to the relevant actors and their systems, including the energy consumer

• Supports services that improve the energy efficiency of the energy consumption and the energy system (generation, transmission, distribution and especially end-use)

Wikipedia describes a smart meter as an electronic device that records the consumption of electric

energy, water or gas in intervals of an hour or less and communicates that information at least daily back

to the utility for monitoring and billing. Smart meters infrastructures should provide system operators

with real-time data on consumption and allow customers to make informed choices about energy usage

based on the price at the time of use1. Smart meter data is of interest to statistical organisations as it

provides detailed information on energy consumption at high frequency and in real time [5].

Potential use in official statistics

The data from electricity smart meters can be beneficial for several domains of official statistics, e.g.

statistics on energy consumption, household consumption expenditure, consumer price index, or

environmental statistics. Use of smart meter data would substantially reduce reporting burden on

households and companies and could allow statistical indicators to be produced at a more detailed

level of aggregation and with improved timeliness.

In addition to energy consumption statistics, there have been experiments to use smart meters’ data to

estimate household occupancy by time of day, to identify household structure and size, to

estimate the probability of occupancy of dwellings, and to produce statistics on long-term vacant properties [6].

Possible concerns

Data acquisition

Smart meters provide distribution network operators and energy suppliers with data about energy

consumption. The granularity of the data creates the possibility to identify households, household

1 https://en.wikipedia.org/wiki/Smart_meter

characteristics and real-time occupancy. Personal information about energy consumers is protected by

the EU and national rules that define who can access personal data and under what circumstances. A

survey among NSIs of the EU on the possibility to have access to smart meter data carried out by the

ESSnet Big Data showed that currently only a few countries have access to smart meter data [7]. In case

access to the data is not enforced by law, statistical authorities will need specific agreements with the

data holders that should take into account the interest of all the parties involved.

Data processing

One of the major issues that may affect the quality of statistical output and has to be handled by

statistical authorities is the representativeness of the smart meters’ data. For the time being the

deployment of smart meters does not cover all households and businesses, although it is expected

that by 2020 almost 72% of European consumers will have a smart meter for electricity and 40% will

have one for gas2. Moreover, the metering point does not correspond to the statistical observation unit, which creates

some challenges for the NSIs in evaluating the coverage of the data source.

The smart meters’ data are not designed for statistical purposes and might be changed (in terms of

structure and the content) by the organisations that collect the data. In such cases, NSIs would need to

cope with the problem of discontinuity of the data source.

When data analysis is conducted on the systems of the companies collecting the data, NSIs have to be

informed about the methods used by these companies in order to be sure to what extent they

correspond to the requirements of official statistics.

Dissemination of statistical output

Professional ethics requires that official statistics is published together with corresponding metadata. In

case the data analysis is done by other organisations than NSIs it can be assumed that there might be a

risk that not all necessary details of the data processing are disclosed to statistical authorities. The

accuracy of statistical output based on smart meters’ data as well as its comparability over a reasonable

period of time and across countries need to be ensured.

Recommended ethical approach:

2 https://ec.europa.eu/energy/en/topics/markets-and-consumers/smart-grids-and-meters

1. Statistical offices should inform clients of utility providers that the data from their smart meters

are used for the purposes of official statistics.

2. The methods to ensure smart meters’ data suitability for official statistics (e.g. representativity)

should be scientifically based and tested, and should ensure that the statistical output corresponds to the

requirements of official statistics.

3. Statistical offices should collaborate within the statistical system as well as with academia to

ensure scientific standards.

4. Suppliers of smart meter data should be transparent and provide statistical offices with

information necessary to achieve desired quality of statistical outputs.

5. The methods and techniques used to analyse the big data need to be described and presented in

understandable form together with statistical output published as official statistics.

3. Road traffic loops data

The data source

Vehicle detection loops, called inductive-loop traffic detectors, can detect vehicles passing or arriving at

a certain point, for instance approaching a traffic light or in motorway traffic. An insulated, electrically

conducting loop is installed in the pavement3. Vehicle detection loops (road sensors) are used to collect

data on the speed, direction, length and class of a passing vehicle. Normally, the data is stored in a central

data warehouse of the responsible authority, e.g. the national transport agency.

Potential use for official statistics

Road sensor data can be used, for example, to derive the number of vehicles and the average speed at the

sensor location for a given time interval to produce statistics on traffic intensities. Traffic Indices can be

calculated to provide a picture of the road traffic at national and regional level, as well as by vehicle type

[8]. Traffic sensor data can also be used to support official statistics in the context of the Harmonised

European Time Use Surveys (HETUS), for example, to estimate commuting time of employees [9].
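The basic traffic intensity calculation described above can be sketched as follows. This is a minimal illustration only: it assumes that each detection is available as a hypothetical (timestamp in seconds, speed in km/h) pair, which is not necessarily the layout used by a transport agency, and it is not the method applied in [8].

```python
from collections import defaultdict


def traffic_intensity(detections, interval_s=60):
    """Summarise detections at one sensor per time interval.

    detections: iterable of (timestamp_s, speed_kmh) pairs (assumed layout).
    interval_s: length of the aggregation interval in seconds.
    Returns {interval_index: {"vehicles": n, "mean_speed_kmh": v}}.
    """
    buckets = defaultdict(list)
    for timestamp_s, speed_kmh in detections:
        # Integer division assigns each detection to its interval.
        buckets[int(timestamp_s // interval_s)].append(speed_kmh)
    return {
        index: {
            "vehicles": len(speeds),
            "mean_speed_kmh": sum(speeds) / len(speeds),
        }
        for index, speeds in sorted(buckets.items())
    }


# Toy detections, invented purely for illustration.
detections = [(5, 100.0), (42, 80.0), (61, 90.0)]
summary = traffic_intensity(detections)
print(summary)
```

Intervals without any detection are simply absent from the result, so a real pipeline would still have to distinguish "no vehicles passed" from "the sensor delivered no data".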

3 https://en.wikipedia.org/wiki/Induction_loop

Possible concerns

Data acquisition

The experience collected so far shows that the data acquisition should not raise any ethical concerns for

statistical authorities because the data does not contain personal information.

Data processing

The quality of the traffic loops data in terms of suitability to the purpose of official statistics is considered

as one of the major challenges for NSIs. First of all, the data has a lot of noise that needs to be removed. It

might happen that the data have not been collected for several minutes and, because of the stochastic

nature of the arrival times of vehicles at a road sensor, it might be hard to directly derive the number of

vehicles that passed during those minutes. It might be difficult to find good imputation methods, and

thus to clean the traffic loop data in such a way that the estimation of the number of vehicles could be

precise and accurate [8].
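To make the imputation problem concrete, the sketch below fills gaps in a per-minute vehicle count series by linear interpolation between the nearest observed minutes. This is a deliberately naive technique chosen for illustration and is not the method used in [8]; because vehicle arrivals are stochastic, a production method would need a proper statistical model and quality assessment.

```python
def impute_linear(counts):
    """Fill None gaps in a per-minute count series by linear interpolation.

    Gaps at the start or end of the series are filled with the nearest
    observed value. Raises ValueError if nothing was observed at all.
    """
    observed = [i for i, value in enumerate(counts) if value is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    filled = list(counts)
    for i, value in enumerate(counts):
        if value is not None:
            continue
        # Nearest observed minute on each side of the gap, if any.
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = counts[right]
        elif right is None:
            filled[i] = counts[left]
        else:
            step = (counts[right] - counts[left]) * (i - left) / (right - left)
            filled[i] = counts[left] + step
    return filled


# Toy series, invented for illustration; None marks a missing minute.
minute_counts = [12, None, 18, None]
imputed = impute_linear(minute_counts)
print(imputed)
```

Any imputed values should be flagged as such, so that the share of imputed data can be reported together with the statistical output.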

Dissemination of the output

Low quality of the raw data usually requires complex methods for the data cleaning, imputation and

estimation of statistical indicators. Professional ethics requires that official statistics are disseminated in

a transparent manner providing clear description of not only the data sources but also the methods used

to produce statistical indicators.

Recommended ethical approach

1. Communication with the institutions collecting traffic loops data could help to better understand

the provenance of the data and therefore to develop the methods of the data treatment that

could ensure the quality of the output required for official statistics.

2. Collaboration with academia and the research community is indispensable for developing methods of

handling road sensor data.

3. The research work done by other institutions and private companies in the

area of utilization of traffic sensor data for different kinds of statistics should be studied continuously.

4. When releasing the official statistics based on big data, the methods used to obtain the statistical

output need to be presented in a clear and understandable manner that would be convenient for

the users.

4. Remote sensing data including satellite image data, data from

unmanned aerial vehicles (UAV)

The data source

Remote sensing is defined “as the measurement of object properties on the earth's surface using data

acquired from sensors mounted on aircrafts or satellites based on propagated signals (e.g.

electromagnetic radiation). The measurements are done at distance in contrast to measurements in

situ”4. It may be split into "active" remote sensing (i.e., when a signal is emitted by a satellite or aircraft

and its reflection by the object is detected by the sensor) and "passive" remote sensing (i.e., when the

reflection of sunlight is detected by the sensor)5.

The output of a remote sensing system is usually an image representing the scene being observed. A

further step of image analysis and interpretation is required in order to extract useful information from

the image. Remote sensing images are normally in the form of digital images6. Remote sensing has a

wide range of applications in many different fields, e.g. agriculture, the environment, business activity

and transport.

Potential use for official statistics

Statistical agencies around the world are investigating the possibility of using satellite imagery data in

official statistics. It is expected that satellite imagery would have the potential to improve timeliness,

reduce the reporting burden on respondents, reduce the costs of surveys and provide data at a more

disaggregated level.

Experience collected so far shows that satellite imagery data can complement official statistics on

agriculture (e.g. calculating crop acreage estimates or aiding in modelling crop yield) [10].

4 Schowengerdt, Robert A.: Remote Sensing, 3rd edition, 2007, p. 2

5 https://en.wikipedia.org/wiki/Remote_sensing, https://en.wikipedia.org/wiki/Satellite_imagery

6 http://www.crisp.nus.edu.sg/~research/tutorial/intro.htm

Possible concerns

Data acquisition

Most of the data obtained from satellite or aerial imagery providers is publicly available and therefore

the data acquisition should not be a problem for national statistical authorities. For the time being

privacy issues are not of major concern, although they can be raised by people who do not wish to have their

property shown from above. With the growing spatial resolution of satellite imagery and especially aerial

photography, which may enable the identification of individuals, issues of privacy may become more prominent.

Data processing

The major issue is the quality of the imagery that may affect quality (in particular accuracy) of the

estimates. Research work done by the Australian Bureau of Statistics pointed out some limitations of

satellite imagery data that present challenges for its utilisation in official statistics. These are, for

example, missing and contaminated data due to cloud cover, missing or poor-quality data due to

on-board satellite equipment failures, and insufficient image resolution [10].

Data dissemination

In order to obtain statistical estimates on the basis of satellite or aerial imagery, techniques for classifying

the imagery data need to be applied. The methods, together with information on quality, will need to be

clearly explained to the users once statistical output is disseminated as official statistics.

Recommended ethical approach

1. A deep analysis of viability of using satellite/aerial imagery data to produce statistical estimates

needs to be performed.

2. Research work on the methods used to guarantee the quality of statistical estimates based on

satellite/aerial imagery data needs to be carried out before the statistics can be published as

official statistics. Joint efforts of NSIs and research community in exploration of these new data

sources could benefit from synergies and save public resources.

3. Collaboration with organisations that possess knowledge in interpreting satellite/aerial imagery

data could help to develop methods suitable for producing the indicators of official statistics.

4. An assessment of the risk of recognition or dissemination of confidential information or breach of privacy

should be performed.

5. Social media data

The data source

Social media data has surfaced in recent years as one of the big data sources with a lot of promise.

Whereas satellite imagery or mobile phone data are relatively well-defined as data sources, social

media is more of a mixed basket, which will need to be further clarified. A common denominator

of such data may be that they are disseminated through the Internet; further, most data are text messages,

images, video or searches voluntarily submitted by persons7.

In this chapter the data generated on the basis of text messages posted on Twitter and Facebook are

analysed from the point of view of official statistics.

Potential use for official statistics

The experience of Statistics Netherlands shows that social media messages can be used, for example, to

produce a consumer confidence indicator (European Harmonised Consumer Opinion Survey). This can be

done by building a model based on fitting characteristics derived from Facebook and Twitter messages.

The production time can be reduced from several weeks to a few days [11].

Besides the traditional indicators of official statistics, some research work has been done to examine the

viability of using social media messages to measure the level of well-being in different countries. Most of

the happiness research in social media has focused on the emotion component (i.e., sentiment analysis),

for example, classifying the emotional affinity of sentences and characterizing happiness as a specific

emotion [12].
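As a rough illustration of the sentiment classification mentioned above, the following Python sketch scores messages against a tiny word list and aggregates them into an index. The lexicon, the example messages and the function names are illustrative assumptions; they are not taken from [11] or [12], which rely on far richer models.

```python
import re

# Hypothetical toy lexicon; real studies use validated sentiment dictionaries.
POSITIVE = {"happy", "great", "good", "love"}
NEGATIVE = {"sad", "bad", "terrible", "hate"}


def message_sentiment(text):
    """Return +1, -1 or 0 depending on which word class dominates."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)


def sentiment_index(messages):
    """Share of positive minus share of negative messages, in percent."""
    if not messages:
        return 0.0
    labels = [message_sentiment(m) for m in messages]
    return 100.0 * sum(labels) / len(labels)


# Made-up messages, for illustration only.
messages = [
    "Great day, love this weather",
    "Terrible service, very sad",
    "Prices are good",
    "Meeting at noon",
]
print(sentiment_index(messages))
```

A word-list approach of this kind is easy to explain in metadata, but it ignores negation, irony and language differences, which is one reason the report calls for further methodological research.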

Other studies showed that automated monitoring of public sentiment on social media, combined with contextual knowledge, has the potential to be a valuable real-time proxy for food-related economic indicators. The correlation between the volume of food-related Twitter conversations and official food inflation statistics, and between food-related and fuel-related tweet volumes, can be quantified using simple time series analysis [13].

7 https://unstats.un.org/bigdata/
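The kind of simple time series analysis referred to here can be sketched as a lagged Pearson correlation between a message-volume series and an official indicator. The figures below are invented for illustration and the function names are assumptions; the actual method used in [13] may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_correlation(volume, indicator, lag):
    """Correlate volume[t] with indicator[t + lag] for a lag >= 0."""
    if lag:
        return pearson(volume[:-lag], indicator[lag:])
    return pearson(volume, indicator)


# Made-up monthly figures, for illustration only.
tweet_volume = [120, 135, 150, 180, 240, 310, 290, 260]
food_inflation = [4.1, 4.3, 4.6, 5.0, 6.2, 7.5, 7.1, 6.6]

for lag in range(3):
    print(lag, round(lagged_correlation(tweet_volume, food_inflation, lag), 3))
```

A high correlation on its own does not establish that the relation is stable, which is why the recommendations below ask for tested and robust models and for more than one source.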

Possible concerns

Data acquisition.

Social media messages are in general public, so obtaining access to the data should not cause problems. Nevertheless, even though users of the social networks mark their messages as public, it might be considered unethical to use them for purposes that were not known to the users in advance. The professional ethics of statisticians requires showing respect to the data subjects, so the question of whether users of social networks need to be informed that their public information will be used by statistical authorities might need some consideration.

For some social media sources statistical offices might have privileged access to the data. In this case

data subjects should be informed that the data is used, how the data is processed and what kind of

statistical product is generated.

In some cases, access to the collection of gathered messages has to be purchased from private companies. That might raise controversy over whether NSIs should pay for data sources that are going to be used for producing official statistics.

Data processing

The experience collected so far shows that the way in which the millions of messages are interpreted, and the way in which models are built on this interpretation to produce statistical indicators, is essential if the use of this data source is to guarantee the quality of the statistical output. This may require additional research work to better understand the relation between the behaviour of individuals and the phenomenon being predicted.

Apart from that, it has to be taken into account that social media messages can be subject to manipulation. Some additional investigation might be needed to estimate the probability that artificially created messages emerge. The risk of data manipulation might increase if users of statistics generated from social media messages knew what data were used and how these data were processed. In this case, statistical offices should define risk avoidance and mitigation strategies under ethical considerations.


If the social media data (text messages, images, video etc.) obtained by statistical offices can be used to identify individuals, the confidentiality of personal information must be ensured and the possibility of misuse should be eliminated.

Dissemination of the output

The methods and the data sources used to produce statistical indicators based on social media data and

disseminated as official statistics should be a part of their metadata and need to be described in an

understandable way.


1. In order to avoid controversy over the ownership of public messages, it is recommended to make it known to social network users that their public data is used for the purposes of official statistics.

2. Thorough studies of the relation between people’s web activity and the phenomena to be predicted are necessary in order to develop methods that generate statistical indicators that can be published as official statistics. The prediction models need to be sufficiently tested and robust.

3. The transparency of the methods used to produce statistical indicators needs to be ensured. This concerns both the statistical output disseminated as official statistics and the additional studies used to support the prediction models that produce this output.

4. Production of data should rely on more than one source in order to mitigate risks of data

manipulation or false correlations.

6. Web-scraped data from company websites, job vacancy websites or real estate agencies' websites

The data source

Web scraping (web harvesting or web data extraction) is a technique used for extracting data from websites. The term usually refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.8

With web scrapers it is possible to collect the data from different websites that might be useful for

official statistics.

Potential use for official statistics

NSIs have used web scraping techniques to collect prices from the Internet (air tickets, clothes, electronic devices, housing prices etc.) and analysed the possibilities of using these data for the compilation of the Consumer Price Index (CPI). Online prices could replace prices collected by price collectors. This can enlarge the sample of products and reduce the costs of the statistical production process. Research done by Statistics Netherlands in this area explains, for example, how “unmatched (new and disappearing) items in the market could be treated and how the time-product dummy index compares two matched-model price indexes” [14].

Ongoing projects carried out by the ESSnet Big Data are examining job websites, job adverts on the websites of enterprises, and job vacancy data from third-party sources to demonstrate that valuable information can be obtained to complement the production of Job Vacancy Statistics, and are examining how information from enterprises’ websites can be used to update the statistical Business Register [23].

Possible concerns

Data acquisition

The legality of web scraping varies across countries. Web scraping may be against the terms of use of some websites and, although the legal enforceability of these terms is not clear, the ethical norms of statisticians would recommend respecting the wishes of the website owners.

Database rights9 might be considered as property rights and it might be the case that the act of creating

a database from web scraped data could breach database rights, at least if essential parts of the

database were scraped.

8 https://en.wikipedia.org/wiki/Data_scraping

9 Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases.


Once the legal issues are resolved, the NSIs should take from the websites only the information that is necessary for the purposes of official statistics.

Data processing

The characteristics targeted by web scraping do not always correspond directly to the characteristics used in official statistics, and the same characteristics may be duplicated on several websites. That might create challenges for data processing. For example, a job vacancy is different from a job advertisement, and the same job vacancy may be posted on different websites [15]. This shows that some resources need to be invested in research work before statistical indicators based on information collected from websites can be produced as official statistics.
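The duplication problem described above can be illustrated with a minimal Python sketch that groups adverts by a normalised key. The field names, the matching rule (employer, title and location) and the sample records are hypothetical assumptions; real deduplication of job adverts usually needs fuzzier matching.

```python
import re


def advert_key(advert):
    """Build a normalised key from employer, job title and location."""
    parts = (advert["employer"], advert["title"], advert["location"])
    return tuple(re.sub(r"[^a-z0-9]+", " ", p.lower()).strip() for p in parts)


def deduplicate(adverts):
    """Keep the first advert seen for each key and record all its sources."""
    unique = {}
    for ad in adverts:
        entry = unique.setdefault(advert_key(ad), {"advert": ad, "sources": set()})
        entry["sources"].add(ad["source"])
    return list(unique.values())


# Invented records: the first two describe the same vacancy on two portals.
adverts = [
    {"employer": "ACME Ltd.", "title": "Data Analyst", "location": "Riga", "source": "portal-a"},
    {"employer": "Acme Ltd", "title": "data analyst", "location": "Riga", "source": "portal-b"},
    {"employer": "Beta SA", "title": "Driver", "location": "Lyon", "source": "portal-a"},
]
result = deduplicate(adverts)
print(len(result))
```

Note that even a perfectly deduplicated set of adverts still counts advertisements, not vacancies: one advert may cover several posts, and some vacancies are never advertised online.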

Data dissemination

The replacement of the traditional data collection by the collection of information from websites may

have an impact on the time series of statistical indicators. The changes in the production process of

statistical indicators will need to be explained to the users of official statistics once these indicators are

published.


1. Web scraping for the purposes of obtaining information for official statistics needs to be performed in a transparent way, e.g. information about web scraping activities could be provided on the websites of the NSIs.

2. Web scraping must respect the law (e.g. intellectual property, database rights). The NSIs should seek legal advice to ensure that planned web scraping activities are not prohibited by European or national legislation.

3. Web scraping needs to be performed with respect for the rights of website owners, excluding the possibility of harming their business.

4. It is recommended to identify yourself when extracting the data from the websites and to

provide contact details to the website administrator.


5. Protected areas of websites should be respected. The robots.txt files should be consulted to find out what is allowed to be scraped.

6. Web scraping at times of intensive internet traffic should be avoided. It should not compromise the functioning of the web services.

7. When the number of target websites is limited, it is recommended to contact the website owners directly, explaining the purpose of the research and the nature of the web scraping, and asking for permission before applying web scraping on a large scale.

8. Contact with the website owners could help to better understand how the data is compiled and whether the continuity of information can be guaranteed.

9. If possible, it would be recommended to inform the data subjects (enterprises) that the data

about them is being collected from their websites and to give them the possibility to opt-out.

10. The collection of personal information should be avoided to the extent possible. Appropriate organisational and technical measures have to be put in place to guarantee the confidentiality of individual data while the data are collected, processed and stored.
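Several of the recommendations above (identifying the scraper, consulting robots.txt, not overloading the site) can be sketched with the Python standard library. The user agent string, the contact address and the robots.txt content are hypothetical; in practice the robots.txt file would be fetched from the target website, and the check shown here does not replace legal advice or the website's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identification; a real bot would point to an NSI contact page.
USER_AGENT = "nsi-stat-bot"

# Hypothetical robots.txt content, embedded so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Return only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def request_headers(contact_email):
    """Headers that identify the scraper and give a contact address."""
    return {"User-Agent": USER_AGENT, "From": contact_email}


parser = build_parser(ROBOTS_TXT)
urls = ["https://example.org/jobs/1", "https://example.org/private/staff"]
print(allowed_urls(parser, urls))
print(parser.crawl_delay(USER_AGENT))  # seconds to wait between requests
print(request_headers("webscraping@nsi.example"))
```

Waiting at least the advertised crawl delay between requests, and scheduling runs outside peak hours, is one way to follow the recommendation not to compromise the functioning of the web services.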

7. Query and ClickOut data from internet searches

The data source

A search query is a request for information that is made using an internet search engine. One of the most popular search engines, Google Search, was used by Google Inc. as the basis for a public web facility, Google Trends10. Google Trends provides a query index that is calculated by dividing the total query volume for the search term in question within a particular geographic region by the total number of queries in that region during the time period being examined. The maximum query share in the time period specified is normalised to be 100, and the query share at the initial date being examined is normalised to be zero [21].
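One possible reading of this description is sketched below in Python: the query share is the term volume divided by the total volume, and the shares are then rescaled so that the first period maps to 0 and the maximum to 100. The rescaling formula and the input figures are assumptions made for illustration; the exact computation used by Google Trends is not disclosed in the text.

```python
def query_index(term_volume, total_volume):
    """Rescale query shares so the first period is 0 and the peak is 100."""
    shares = [t / n for t, n in zip(term_volume, total_volume)]
    base, peak = shares[0], max(shares)
    if peak == base:
        # No period exceeds the initial share; the index is flat.
        return [0.0 for _ in shares]
    return [100.0 * (s - base) / (peak - base) for s in shares]


# Invented weekly volumes for one search term and for all queries in a region.
term_volume = [10, 30, 20, 50]
total_volume = [1000, 1000, 1000, 1000]

index = query_index(term_volume, total_volume)
print([round(v, 1) for v in index])  # [0.0, 50.0, 25.0, 100.0]
```

Because only the rescaled index is published, the underlying volumes cannot be recovered from it, which is part of the transparency concern discussed later in this chapter.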

Apart from Google Trends, another source in this group of big data could be the number of Wikipedia page views. “Wikipedia is ideally suited as a platform that could potentially be of use for legitimate scientific investigation in many different areas. Not only is the information held within Wikipedia articles very useful on its own, but statistics and trends surrounding the amount of usage of particular articles, frequency of article edits, region specific statistics, and countless other factors make the Wikipedia environment an area of interest for researchers” [22].

10 https://trends.google.com/trends/

Potential use for official statistics

Google Trends indices can be of interest to official statistics. Some studies have shown that Google Trends query indices may be correlated with economic indicators and can be used for short-term prediction (nowcasting) of economic and social statistics indicators, for example to predict unemployment figures [15][16], trends in household consumption or the development of retail sales [17]. Google Trends indices can also be used to improve standard forecast models in different domains of statistics, taking advantage of the quick access to information that this tool provides.
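The nowcasting idea can be illustrated with a deliberately simple Python sketch: a straight line is fitted to past pairs of query index and official indicator, and the most recent query index is then used to estimate the indicator before the official figure is available. The one-regressor model, the function names and the figures are assumptions for illustration; the studies cited in [15] to [17] use richer models.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def nowcast(index_history, official_history, current_index):
    """Estimate the current official figure from the current query index."""
    slope, intercept = fit_line(index_history, official_history)
    return intercept + slope * current_index


# Invented data: past query index values and the matching official rates.
index_history = [10, 20, 30]
official_history = [5.0, 7.0, 9.0]

print(nowcast(index_history, official_history, current_index=40))
```

A model of this kind is only as reliable as the stability of the relation it assumes, so it would need to be re-estimated and checked against the official figures as they are released.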

Possible concerns

Data acquisition

Google Trends is a publicly available web service and therefore accessing the data should not be an issue.

Data processing

The pilots conducted so far identified possible issues linked to the data processing that could influence

the quality of the statistical output. It might happen that information about the scope and coverage of

the searches as well as the stability of the algorithm used to obtain the index are not sufficiently

disclosed. That may lead to mistakes in prediction models and therefore in statistical estimates.

The accuracy of statistical estimates based on Google searches (as on any other data based on web searches) may depend on the percentage of internet usage in the country and on the habits of people, for example whether they search for information on the internet before making purchases. That might require additional studies on this subject [18].

Data dissemination

It might be difficult to guarantee full transparency of the methods applied to produce statistical estimates if information about how the web services collect data on web searches is not available. The use of a data source that is a “black box” might be considered non-compliant with the professional ethics of statisticians. The publication of data based on web searches might trigger an increase in these searches.


1. Collaboration with academia and research communities would be recommended in order to

develop prediction models that could handle possible deficiencies of the data based on internet

searches and make this data suitable for official statistics.

2. NSIs need to be aware of the rules and algorithms used by web services to collect the data on web searches. Contacts with the web service operators would be necessary to get information on the provenance of the data.

3. The relation between the model for the prediction of estimates based on web searches and the

phenomenon being predicted should be proven by scientific evidence.

4. In order to guarantee the transparency of the methods used to predict statistical indicators disseminated as official statistics, the data sources and methods used should be disclosed to the users.

5. NSIs should include more data sources in the prediction model or use additional sources for verifying the results.

8. Cash register data, e.g. from supermarkets

The data source

A cash register is a mechanical or electronic device for registering and calculating transactions at a point of sale. Cash registers are usually connected to a handheld or stationary barcode reader so that a customer's purchases can be more rapidly scanned.11 At the moment a product is bought, the scanner creates a record containing the European Article Number (EAN), which is unique for each item, the quantity sold and the price. These transaction records, collected from a large number of retailers, can be considered big data. This data is a valuable source of information that can be used by NSIs for official statistics.

11 https://en.wikipedia.org/wiki/Cash_register, https://en.wikipedia.org/wiki/Point_of_sale

Possible use in official statistics

Scanner data is used by several NSIs for producing the Consumer Price Index (CPI). Compared with traditional data collection, the use of scanner data has several advantages: lower costs of data collection, no sampling errors as the data contains all the transactions of the retailers, better reflection of price changes as the data includes transactions for the entire observation period, and the opportunity to observe arrivals of new products in the market [19] [20].
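To make the link between transaction records and a price index concrete, the Python sketch below computes a unit value (turnover divided by quantity) per item and period, and then an unweighted geometric mean of price ratios over items present in both periods. The choice of a Jevons-type index, the record layout and the item codes are assumptions for illustration; they are not the methods of [19] or [20], and the choice of index and weighting is one of the open methodological questions.

```python
from collections import defaultdict
from math import exp, log


def unit_values(records):
    """Quantity-weighted average price per (period, EAN)."""
    turnover = defaultdict(float)
    quantity = defaultdict(float)
    for period, ean, qty, price in records:
        turnover[(period, ean)] += qty * price
        quantity[(period, ean)] += qty
    return {key: turnover[key] / quantity[key] for key in turnover}


def jevons_index(records, base, current):
    """Geometric mean of unit value ratios over matched items, base = 100."""
    uv = unit_values(records)
    matched = [ean for (period, ean) in uv if period == base and (current, ean) in uv]
    if not matched:
        raise ValueError("no items are present in both periods")
    mean_log = sum(log(uv[(current, e)] / uv[(base, e)]) for e in matched) / len(matched)
    return 100.0 * exp(mean_log)


# Invented records: (period, hypothetical item code, quantity sold, price).
records = [
    ("2017-01", "item-A", 5, 1.90),
    ("2017-01", "item-A", 5, 2.10),
    ("2017-01", "item-B", 3, 5.00),
    ("2017-02", "item-A", 8, 2.20),
    ("2017-02", "item-B", 2, 5.50),
    ("2017-02", "item-C", 4, 9.99),  # new item, not matched in the base period
]
print(round(jevons_index(records, "2017-01", "2017-02"), 2))
```

The new item in the second period is simply ignored by this matched-model calculation, which illustrates why the treatment of new and disappearing products needs methodological attention.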

The possibility of using scanner data to complement other domains of official statistics (price statistics,

household expenditure, and business statistics) is under exploration.

Possible concerns

Data acquisition

Difficulties in obtaining the data from retailers may raise some ethical concerns linked to the establishment of cooperation with private companies. The experience of the NSIs shows that the data holders are not always willing to cooperate. It can take a long time to develop the relationship and to gain access to the data [20]. The NSIs need to find mutually beneficial arrangements that do not jeopardise the professional independence of statisticians. The agreements with the retailers (data holders) need to guarantee the provision of scanner data to NSIs on a regular basis and at agreed dates. The NSIs should be informed in advance about any changes in the data structure. All that may create an extra burden on retailers.

The data can also be purchased from market research companies, but that may raise questions related to costs, i.e. the price charged for providing the data might exceed the cost of preparing them.

Data processing

Scanner data provide opportunities for official statistics but also create serious methodological

challenges (to classify the products correctly, to choose the appropriate index and a weighting system, to

make sure that the products for which prices are measured were not replaced by the products that do

not correspond to the technical specification of the observed ones etc.). Besides that, due to the high

importance of the CPI it is necessary to ensure its compliance with the EU Regulation and comparability

over time and across countries.

Dissemination of output


The change of the data sources (from price collectors’ data to scanners’ data) may have an impact on the

time series of the statistical output. The effects should be minimized. In addition, users of official

statistics need to be informed on possible discrepancies between the old and new methodology.


1. Cooperation with the data providers (retailers) should be based on partnerships with mutual

benefits. Excessive burden on the data providers (efforts for preparation of the data to be

transmitted to NSIs) should be avoided. Professional independence of statistical authorities needs to

be ensured.

2. Scanner data might be of commercial value for the retailers. The data received by statistical authorities should only be used for the purposes of official statistics. Rules and measures to protect data from misuse should be communicated by the statistical offices.

3. Regular information exchange between retailers and statistical offices could help to understand

possible changes in the market and replacements of the products and consequently to ensure

required quality of the price indices.

4. Collaboration with the research community is necessary to overcome the methodological challenges posed by scanner data. Learning from the experience of other statistical offices could help to accelerate progress.

5. The dissemination of the statistical output needs to be done in a transparent way. The changes in the

data sources and methods used should be communicated to the users of official statistics.

CONCLUSIONS

Different types of big data have specific characteristics linked to their suitability for official statistics. They may raise different ethical questions during the statistical business production process. For example, issues linked to access to the data are more relevant to data sources that contain sensitive personal data, such as mobile phone data. Social media data are more susceptible to manipulation than other sources; road sensor data are probably more “noisy” than others; the relation between the data and the phenomenon to be measured might not always be straightforward, etc. The integration of big data sources into the production of official statistics may require statistical offices to follow different approaches to resolve the issues linked to professional ethics, which may depend on the institutional, legal and national context.

In addition, ethical principles of official statistics may be in conflict with each other. For example, the requirement to ensure the privacy of the data subjects may conflict with the increasing demand for more detailed information for evidence-based policy making; the requirement to reduce response burden may coincide with an increasing burden on big data holders, etc.

These developments may trigger the need to create a dedicated structure that addresses and advises on ethical questions, helps to find the right balance between ethical norms of conduct, and is empowered to take decisions on issues linked to professional ethics.

The consistent application of quality principles in statistical production and dissemination is one of the main principles that guarantee trust in official statistics. As the research and exploratory phase for new data sources may take some time before they can be incorporated into the regular production process, it would be advisable to publish preliminary results as experimental data and not as official statistics. Publication as experimental statistics could be used to improve the quality of the final product and could lead to a discussion on the pertinence of data sources and methods for generating statistical information.


REFERENCES

[A] Scheveningen Memorandum Big Data and Official Statistics;

http://ec.europa.eu/eurostat/documents/42577/43315/Scheveningen-memorandum-27-09-13

[B] FUNDAMENTAL PRINCIPLES OF OFFICIAL STATISTICS (A/RES/68/261 from 29 January 2014)

http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx

[C] EUROPEAN STATISTICS CODE OF PRACTICE, for the national and community statistical authorities; adopted by the European Statistical System Committee 28th September 2011;

http://ec.europa.eu/eurostat/documents/3859598/5921861/KS-32-11-955-EN.PDF/5fa1ebc6-90bb-43fa-888f-dde032471e15

[D] DECLARATION ON PROFESSIONAL ETHICS, adopted by the ISI council 22 & 23 July 2010, Reykjavik, Iceland;

https://www.isi-web.org/index.php/news-from-isi/34-professional-ethics/296-eclarationprofessionalethics-2010uk?showall

[1] Mobile telephones and mobile positioning data as source for statistics: Estonian experience;

University of Tartu and Eurostat;

https://ec.europa.eu/eurostat/cros/system/files/S19P4.pdf_en

[2] Estimating population density distribution from network-based mobile phone data; Fabio Ricciato, Peter Widhalm, Massimo Craglia and Francesco Pantisano; European Commission, Joint Research Centre – JRC, 2015;

https://ec.europa.eu/jrc/sites/jrcsh/files/Final-%20jrc-AIT-MNO-study-compressed.pdf

[3] Mobile phone data for mobility statistics, Emanuele Baldacci, ISTAT;

https://unstats.un.org/unsd/trade/events/2014/beijing/presentations/day1/afternoon/3.%20Mobile%20phone%20data%20for%20mobility%20statistics--Emanuele%20Balda.pdf

[4] Feasibility study on the use of mobile positioning data for tourism statistics, Eurostat, 2014


http://ec.europa.eu/eurostat/documents/747990/6225717/MP-Consolidated-report.pdf/530307ec-0684-4052-87dd-0c02b0b63b73

[5] A Big Data Pilot Project with Smart Meter Data, Lily Ma, Statistics Canada;

http://www.statcan.gc.ca/sites/default/files/media/14274-eng.pdf

[6] Modelling sample data from smart-type meter electricity usage, Susan Williams, ONS UK;

https://ec.europa.eu/eurostat/cros/system/files/Williams_Smart%20meters%20abstract%20final.pdf

[7] ESSnet Big Data. WP3. Smart meters. Report on data access and data handling. 2016-07-29;

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php?title=WP3_Report_1&oldid=2942

[8] High frequency road sensor data for official statistics; Marco Puts, Piet Daas, Martijn Tennekes;

Statistics Netherlands 2014;

http://www.von-tijn.nl/tijn/research/publications/road_sensors1_NTTS2015.pdf

[9] Traffic sensor data for commuting statistics, Pasi Piela, Statistics Finland;

http://www1.unece.org/stat/platform/display/BDI/Statistics+Finland+-+Traffic+sensor+data+for+commuting+statistics

[10] Methodological Approaches for Utilising Satellite Imagery to Estimate Official Crop Area Statistics, Jennifer Marley, Daniel Elazar and Kate Traeger; Analytical Services Branch; Australian Bureau of Statistics; Research paper, 2014;

http://www.ausstats.abs.gov.au/ausstats/subscriber.nsf/0/EEF6FB75844F0AD3CA257D57001D7662/$File/1352055144_sep%202014.pdf

[11] Social media sentiment and consumer confidence, Piet J.H. Daas and Marco J.H. Puts, Statistics

Netherlands, 2014;

https://www.ecb.europa.eu/pub/pdf/scpsps/ecbsp5.en.pdf


[12] Characterizing Geographic Variation in Well-Being using Tweets, University of Pennsylvania,

Michigan State University, 2013;

http://wwbp.org/papers/icwsm2013_cnty-wb.pdf

[13] Mining Indonesian Tweets to Understand Food Price Crises, UN Global Pulse, February 2014;

http://www.unglobalpulse.org/sites/default/files/Global-Pulse-Mining-Indonesian-Tweets-Food-Price-Crises%20copy.pdf

[14] Online Data, Fixed Effects and the Construction of High-Frequency Price Indexes; Jan de Haan and Rens Hendriks; Statistics Netherlands, 2013;

https://www.business.unsw.edu.au/research-site/centreforappliedeconomicresearch-site/Documents/Jan-de-Haan-Online-Price-Indexes.pdf

[15] Improving prediction of unemployment statistics with Google trends: preliminary experiments,

Vittorio Perduca, Eurostat;

https://ec.europa.eu/eurostat/cros/system/files/UnemploymentFrance_bigdata.pd

[16] European Commission, Eurostat (Pedro Ferreira): Improving prediction of unemployment statistics with Google trends: part 2;

https://ec.europa.eu/eurostat/cros/content/improving-prediction-unemployment-statistics-google-trends-part-2_en

[17] Google as a tool for nowcasting household consumption: estimations on Hungarian data, Istvan

Janos Toth, Miklós Hajdu, Institute for Economic and Enterprise Research – HCCI;

http://old.gvi.hu/data/papers/ciret_2012_tij_hm_paper_120415.pdf

[18] The use of web activity evidence to increase the timeliness of official statistics indicators, Fernando Reis, Pedro Ferreira, Vittorio Perduca, Eurostat, 2014;

https://ec.europa.eu/eurostat/cros/content/use-web-activity-evidence-increase-timeliness-iaos2014_en


https://ec.europa.eu/eurostat/cros/system/files/%5BReis%2CFerreira%2CPerduca%5D%282014%29The%20use%20of%20web%20activity%20evidence%20to%20increase%20the%20timeliness%20of%20official%20statistics%20indicators_IAOS_conference_paper.pdf_en


[19] Initial report on experiences with scanner data in ONS, Derek Bird, Robert Breton, Chris Payne and Ainslie Restieaux, ONS UK, 2014;

https://www.ons.gov.uk/ons/guide-method/user-guidance/prices/cpi-and-rpi/initial-report-on-experiences-with-scanner-data-in-ons.pdf

[20] Issues on the use of scanner data in the CPI, Muhanad Sammar, Anders Norberg and Can Tongur, Statistics Sweden, 2013;

http://www.ottawagroup.org/Ottawa/ottawagroup.nsf/4a256353001af3ed4b2562bb00121564/8bdac0e73d96c891ca257bb00002fdb4/$FILE/Muhanad%20Sammar%202%20ISSUES%20ON%20THE%20USE%20OF%20SCANNER%20DATA%20IN%20THE%20CPI.pdf

[21] Predicting the Present with Google Trends, Hyunyoung Choi, Hal Varian, December 18, 2011, page 3, chapter 2 Google Trends;

http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf

[22] Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time, David J. McIver, John S. Brownstein;

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003581

[23] ESSnet Big Data. WP1. Web scraping/ Job vacancies, Deliverable 1.2, 2016-11-11;

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/6/64/WP1_Deliverable_1_2_final.pdf
