Services concerning ethical, communicational, skills issues and methodological cooperation related to
the use of Big Data in European statistics
(Contract number 11104.2015.005-2015.799)
TASK 1: Ethical review
Deliverable 1.3 Report on ethical guidelines
Version 2
Date: 3 June 2017
Drafted by: SOGETI Luxembourg: Alma RUTKAUSKIENE
Disseminated: EUROSTAT: Albrecht WIRTHMANN
Table of contents
INTRODUCTION
GENERAL ETHICAL CONSIDERATIONS
POSSIBLE ETHICAL APPROACH BY BIG DATA SOURCES
1. Mobile positioning data (mobile phone data)
2. Data from smart electricity consumption meters (smart meters)
3. Road traffic loops data
4. Remote sensing data including satellite image data, data from unmanned aerial vehicles (UAV)
5. Social media data
6. Web-scraped data from company websites, job vacancy websites or real estate agencies' websites
7. Query and ClickOut data from internet searches
8. Cash register data, e.g. from supermarkets
CONCLUSIONS
REFERENCES
INTRODUCTION
The strategic importance of big data for the European Statistical System has been recognised by the
European Statistical System Committee (ESSC) by adopting the Scheveningen Memorandum [A] in
September 2013. The Directors General of the National Statistical Institutes conference considered,
“Official statistics should incorporate as much as possible all potential data sources, including Big Data,
into their conceptual design”. It was acknowledged that “Big Data represent new opportunities and
challenges for Official Statistics, and therefore encourage the European Statistical System and its partners
to effectively examine the potential of Big Data sources in that regard”.
The National Statistical Institutes (NSIs) are exploring possibilities of integrating big data sources in
the production of official statistics. Pilot projects were initiated by the UNSD, UNECE, and Eurostat. The
projects identified the opportunities and pointed out issues linked to access to the data, assurance of
the privacy of the data subjects, quality of the data in terms of suitability for official statistics, etc. In
order to provide the public with independent, high-quality information, statistical offices adhere to
statistical principles that are defined by the European Statistics Code of Practice and the UN
Fundamental Principles of Official Statistics, which constitute the ethical framework of official statistics.
The ethical framework of official statistics does not preclude the use of any type of data source if it
ensures the quality of statistical output, is cost-efficient and minimises the reporting burden for the data
providers (UN Fundamental Principles of Official Statistics, Principle 5) [B]. Even so, specific
characteristics of big data sources may require additional efforts from the NSIs to comply with other
statistical principles embedded in the European Statistics Code of Practice [C], such as professional
independence, mandate for data collection, adequacy of resources, impartiality and objectivity, and
clarity of the methods used to obtain statistical output.
The aim of these Guidelines is to draw the attention of statistical authorities to possible issues related to
professional ethics once big data is used for the production of official statistics and to recommend an
approach that would be compliant with the statistical code of conduct. The guidelines are based on the
results of the projects carried out by statistical institutes and research organisations investigating
different aspects of the possible use of different types of big data in official statistics. The guidelines
provide general recommendations that are common to the majority of types of big data and
recommendations that may concern particular types of big data.
As the exploration of the potential of big data continues, it is obvious that these guidelines will need to
be updated once more experience is collected. The recommendations have a rather general character
and will need to be adapted to the national conditions and specific situation.
GENERAL ETHICAL CONSIDERATIONS
The professional ethics of official statistics is based on shared professional values of statisticians: respect,
professionalism, truthfulness and integrity [D]. These values are reflected in the UN Fundamental
Principles of Official Statistics [B] and European Statistics Code of Practice [C].
Statistical authorities of the EU committed themselves to adhere to the European Statistics Code of
Practice (CoP) that consists of 15 principles covering the institutional environment, the statistical
production process and the output of statistics. The quality assurance framework of the European
Statistical System facilitates the implementation of the CoP by describing activities, methods and tools to
operationalize the indicators of the CoP. Adherence to these principles makes official statistics a
trusted source of information for all users.
Due to specific characteristics of big data (high volume, high velocity, high variety) statistical authorities
need to be ready to meet new challenges in order to harness its potential for official statistics. Some of
these challenges would require an answer to ethical questions that may be raised at different stages of
the statistical production process. Below we provide examples of ethical considerations that
may be needed at the main stages of the statistical production process: data acquisition, data processing
and dissemination of statistical output.
Big data acquisition. In most cases, big data are collected by private companies that, in
principle, are not compelled by the law to provide the data to statistical authorities. Therefore the
provision of the data will depend on benefits realised or perceived by the data holders. The
experience of the NSIs shows that special arrangements with the private companies are needed in
order to get access to the data they possess.
In order to make these arrangements according to professional ethics, NSIs would need, first of all,
to respect the principle of professional independence, meaning that statistics must be developed,
produced and disseminated in an independent manner. The agreements with private companies
need to avoid any pressure from businesses to put their interests above the public interest. The
selection of data providers should be done in a transparent way without favouring one company
over another.
Statistical institutes need to assure big data providers that the big data obtained from them will be
used exclusively for the purposes of official statistics and there is no risk of harm to their business.
If big data contain personal information, the companies that collect these data are obliged by
law to protect the privacy of the data subjects. In order to act ethically and show
respect to the data subjects, statistical institutes should be informed by the data providers whether
their customers are aware that the data about them can be delivered to statistical authorities.
Big data processing. In general big data are not designed for statistical purposes and thus do not
meet statistical standards on concepts and definitions as such. A number of data quality issues
might challenge the use of big data for official statistics, e.g., the selectivity of the data, no
guarantee of continuity and stability of the data structure, or the risk of data manipulation. The
compliance with professional ethics may be questioned if statistical models, imputation techniques
etc. used for the data processing and output were not scientifically validated, or if the quality of the
output that is going to be published as official statistics could not be assured. The professionalism
of statistical offices is one of their major assets that makes official statistics a trusted source of
information.
The processing of big data (big data analytics) can involve methods of data linking that might reveal
personal information. Improper use of personal data within the statistical system can cause harm
to individuals or businesses (intentionally or unintentionally) and damage the reputation of official
statistics. Existing internal rules of access and measures to ensure the confidentiality of information
have to be applied to big data. Additional research might be necessary to ensure privacy and
confidentiality of the disseminated data.
Dissemination of statistical output. Big data requires complex techniques to produce statistical
output. In some cases, only inputs and outputs might be observable while the transformation may
not be transparent. The professional ethics of statisticians requires “that information on
methods and procedures used to produce official statistics is publicly available” [3] and scientifically
sound. Therefore, statistical agencies using big data analytics should not only describe the data
sources but should also document the applied methods and models to enable independent
assessment of data processing and results.
Different types of big data may raise different ethical questions that depend on the characteristics of
these data, e.g. availability of the access to the data, content of personal information, quality issues in
terms of suitability for the purpose of official statistics, the clarity of the methods to be applied in order
to get statistical output etc. These guidelines provide some examples of possible issues linked to
different types of big data and recommendations on how to handle them in compliance with the
statistical principles.
POSSIBLE ETHICAL APPROACH BY BIG DATA SOURCES
1. Mobile positioning data (mobile phone data)
The data source
Mobile positioning is tracking the location of mobile telephones. Generally, it can be divided into active
and passive mobile positioning. Active mobile positioning is used for tracking the location of mobile
phones in real time using a mobile positioning system (MPS). There are many technical solutions for active
real time tracking of telephones. The cell identity method determines the network cell where the
telephone is located. Location data from passive mobile positioning is automatically stored in memory or
log files of Mobile Network Operators (MNO). Operators’ systems generate a very large amount of data
on the use of mobile communication including location information. These data are mostly used
internally by the network carriers for business and marketing purposes, e.g. charging clients for services,
providing usage statistics, analysing network performance, or developing new marketing products. The
location data can be used for generating statistics about space-time movement of phones (phone users)
cost-effectively [1].
Potential use for official statistics
The mobile positioning data can be used to complement European tourism statistics (to collect data
on short trips or same-day visits as well as on longer periods of stay). Such a method could be an
alternative to the 'bookkeeping system' or 'diary' currently used or to the traditional ex-post
questionnaires in which respondents report on trips made during a specified reference period. It can
provide information previously not available (new indicators) and calibration opportunities for existing data.
A follow-up sample survey may still be needed to collect additional, qualitative information on the trips.
However, the sample surveys could be based on much smaller samples [1].
The data from MNOs across different countries could make it possible to produce a pan-European view of
population density. Furthermore, the proper fusion of multi-MNO data from the same country has the
potential to improve the accuracy of the estimation within that country in several respects,
namely: (i) increasing the population coverage; (ii) mitigating the potential bias caused by MNO-specific
network configurations; and (iii) improving the spatial accuracy [2]. The integration of existing population
and flow statistics with the continuously up-to-date estimates obtained from GSM data could provide
more accurate results [3].
Possible concerns
Acquisition of the data. Access to mobile positioning data might be a problem for the NSIs due to
regulatory limitations that vary across EU countries. The main concerns are related to the privacy of the
data subjects. This data source contains sensitive personal information and can create a perception of people
being tracked. Apart from the regulatory limitations, MNOs have business and financial concerns.
Providing data to third parties may have a negative impact on their business.
In order to get access to mobile positioning data statistical authorities would need special arrangements
with MNOs. This may need ethical considerations to preserve the professional independence of
statistical agencies and guarantee an equal treatment of data providers.
Data processing. The major issues that may arise during the data processing are linked to the quality of
mobile positioning data in terms of its suitability for the purpose of official statistics. The data has some
limitations. For example, in the case of tourism statistics there is a lack of information on the purpose of
the trip, expenditure, type of accommodation and means of transport used. It is difficult to estimate the
over-coverage of same-day trips due to the misclassification of overnight trips, as well as the over- and
under-coverage related to the usage of mobile phones (tourists who do not appear in mobile positioning
data, tourists who use several mobile devices or the roaming service of several MNOs) [4].
Dissemination of statistical output. Due to the differences in the concepts and definitions, the
methodology of data processing for the purposes of official statistics might be rather complex. The
statistical principle of transparency requires that official statistics and corresponding metadata are
presented in clear and understandable form.
Recommended ethical approach
1. Any arrangements with the data providers in order to get access to the mobile positioning data
should not compromise professional independence of statistical authorities. The cooperation
should be based on mutual benefit balancing the interest of statistical authorities and private
companies.
2. The selection of the data and of data providers for the purposes of official statistics should follow
statistical considerations and should not favour or disadvantage particular companies.
3. The privacy of the data subjects has to be respected. The data subjects should be informed
that their data (mobile phone data records) are used by the national statistical authorities for the
purposes of official statistics.
4. Upon access to the data, the protection of personal data must be ensured by statistical offices.
Measures to protect personal data used by statistical authorities and private data providers
should be communicated to the general public.
5. The data obtained from MNOs must be used exclusively for the purposes of official statistics. The
possibility to harm MNOs or their subscribers directly or indirectly should be excluded. Measures
to ensure exclusive use for statistical purposes should be communicated to MNOs, their clients
and to the general public.
6. The rules of data retention need to be defined and communicated to the general public.
7. Statistical offices should minimise the use of individual data and should adopt a privacy-by-design
approach.
8. It is recommended to collaborate with academia and research organisations in order to
develop methods of data analysis and processing that would ensure the quality of statistical
output in accordance with the requirements of official statistics.
9. The methods applied during the processing of the mobile communication data to obtain the
statistical indicators that are disseminated as official statistics need to be explained to the users
in a clear and understandable form.
10. Statistical offices should perform a prior impact assessment to evaluate possible risks and define
mitigation activities.
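Recommendations 4 and 7 can be illustrated with a small sketch. One possible way to limit the use of individual data is to aggregate records to area-level counts as early as possible and to suppress cells that contain only a few devices. The function name, the record layout (device identifier, area code, hour) and the threshold of 10 are hypothetical choices made for illustration; they are not prescribed by these guidelines or by any of the cited projects.

```python
from collections import defaultdict


def aggregate_device_counts(records, min_count=10):
    """Count distinct devices per (area, hour) cell and drop small cells.

    records   -- iterable of (device_id, area_code, hour) tuples
                 (hypothetical layout, assumed for this sketch)
    min_count -- illustrative suppression threshold (an assumption)
    """
    devices_per_cell = defaultdict(set)
    for device_id, area_code, hour in records:
        # A set keeps each device once per cell, even with repeated events.
        devices_per_cell[(area_code, hour)].add(device_id)

    # Only aggregated counts leave this function; device identifiers do not.
    return {
        cell: len(devices)
        for cell, devices in devices_per_cell.items()
        if len(devices) >= min_count
    }
```

In such a design the individual identifiers are needed only inside the aggregation step, so they could be discarded immediately afterwards in line with the retention rules of recommendation 6.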
2. Data from smart electricity consumption meters (smart meters)
The data source
The European Smart Metering Alliance defines smart metering as having the following features:
• Automatic processing, transfer, management and utilisation of metering data
• Automatic management of meters
• 2-way data communication with meters
• Provides meaningful and timely consumption information to the relevant actors and their
systems, including the energy consumer
• Supports services that improve the energy efficiency of the energy consumption and the energy
system (generation, transmission, distribution and especially end-use)
Wikipedia describes a smart meter as an electronic device that records the consumption of electric
energy, water or gas in intervals of an hour or less and communicates that information at least daily back
to the utility for monitoring and billing. Smart meter infrastructures should provide system operators
with real-time data on consumption and allow customers to make informed choices about energy usage
based on the price at the time of use¹. Smart meter data is of interest to statistical organisations as it
provides detailed information on energy consumption at high frequency and in real time [5].
Potential use in official statistics
The data from electricity smart meters can be beneficial for several domains of official statistics, e.g.
statistics on energy consumption, household consumption expenditure, consumer price index, or
environmental statistics. Use of smart meter data would substantially reduce the reporting burden on
households and companies and could make it possible to produce statistical indicators at a more detailed
level of aggregation and to improve timeliness.
In addition to energy consumption statistics, there have been experiments using smart meter data to
estimate household occupancy by time of day, to identify household structure and size, to
estimate the probability of occupancy of dwellings, and to produce statistics on long-term vacant properties [6].
Possible concerns
Data acquisition
Smart meters provide distribution network operators and energy suppliers with data about energy
consumption. The granularity of the data creates the possibility to identify households, household
characteristics and real-time occupancy. Personal information about energy consumers is protected by
EU and national rules that define who can access personal data and under what circumstances. A
survey among the NSIs of the EU on the possibility of accessing smart meter data, carried out by the
ESSnet Big Data, showed that currently only a few countries have access to smart meter data [7]. Where
access to the data is not enforced by law, statistical authorities will need specific agreements with the
data holders that take into account the interests of all the parties involved.
1 https://en.wikipedia.org/wiki/Smart_meter
Data processing
One of the major issues that may affect the quality of statistical output and has to be handled by
statistical authorities is the representativeness of the smart meter data. For the time being the
deployment of smart meters does not cover all households and businesses. Although it is expected
that by 2020 almost 72% of European consumers will have a smart meter for electricity while 40% will
have one for gas², the metering point does not correspond to the statistical observation unit, which creates
challenges for the NSIs in evaluating the coverage of the data source.
The smart meters’ data are not designed for statistical purposes and might be changed (in terms of
structure and the content) by the organisations that collect the data. In such cases, NSIs would need to
cope with the problem of discontinuity of the data source.
When data analysis is conducted on the systems of the companies collecting the data, NSIs have to be
informed about the methods used by these companies in order to judge to what extent they
meet the requirements of official statistics.
Dissemination of statistical output
Professional ethics requires that official statistics are published together with the corresponding metadata. If
the data analysis is done by organisations other than NSIs, there is a
risk that not all necessary details of the data processing are disclosed to statistical authorities. The
accuracy of statistical output based on smart meter data, as well as its comparability over a reasonable
period of time and across countries, needs to be ensured.
2 https://ec.europa.eu/energy/en/topics/markets-and-consumers/smart-grids-and-meters
Recommended ethical approach
1. Statistical offices should inform clients of utility providers that the data from their smart meters
are used for the purposes of official statistics.
2. The methods to ensure the suitability of smart meter data for official statistics (e.g. representativeness)
should be scientifically based and tested, and should ensure that the statistical output corresponds to the
requirements of official statistics.
3. Statistical offices should collaborate within the statistical system as well as with academia to
ensure scientific standards.
4. Suppliers of smart meter data should be transparent and provide statistical offices with
information necessary to achieve desired quality of statistical outputs.
5. The methods and techniques used to analyse the big data need to be described and presented in
understandable form together with statistical output published as official statistics.
3. Road traffic loops data
The data source
Vehicle detection loops, called inductive-loop traffic detectors, can detect vehicles passing or arriving at
a certain point, for instance approaching a traffic light or in motorway traffic. An insulated, electrically
conducting loop is installed in the pavement³. Vehicle detection loops (road sensors) are used to collect
data on the speed, direction, length and class of a passing vehicle. Normally, the data is stored in a central
data warehouse of the responsible authority, e.g. the national transport agency.
Potential use in official statistics
Road sensor data can be used, for example, to derive the number of vehicles and the average speed at the
sensor location for a given time interval in order to produce statistics on traffic intensities. Traffic indices can be
calculated to provide a picture of the road traffic at national and regional level, as well as by vehicle type
[8]. Traffic sensor data can also be used to support official statistics in the context of the Harmonised
European Time Use Surveys (HETUS), for example, to estimate commuting time of employees [9].
3 https://en.wikipedia.org/wiki/Induction_loop
Possible concerns
Data acquisition
The experience collected so far shows that the data acquisition should not raise any ethical concerns for
statistical authorities because the data does not contain personal information.
Data processing
The quality of the traffic loops data in terms of suitability for the purpose of official statistics is considered
one of the major challenges for NSIs. First of all, the data has a lot of noise that needs to be removed. It
might happen that no data have been collected for many minutes and, because of the stochastic
nature of the arrival times of vehicles at a road sensor, it might be hard to directly derive the number of
vehicles that passed during those minutes. It might be difficult to find a good imputation method, and
thus to clean the traffic loop data in such a way that the estimate of the number of vehicles is
precise and accurate [8].
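To make the imputation problem concrete, the following minimal sketch fills missing per-minute vehicle counts by linear interpolation between the nearest observed minutes. It is only a naive baseline under stated assumptions: the function name and the convention of marking a missing minute with None are hypothetical, and this is not the method used in [8]. Because vehicle arrivals are stochastic, such a simple rule may well be too crude for official statistics.

```python
def impute_missing_counts(counts):
    """Fill missing per-minute vehicle counts by linear interpolation.

    counts -- list of numbers, with None marking a minute without data
              (hypothetical convention, assumed for this sketch).
    Gaps at the start or end take the nearest observed value.
    """
    observed = [i for i, value in enumerate(counts) if value is not None]
    if not observed:
        raise ValueError("no observed minutes to impute from")

    filled = list(counts)
    for i, value in enumerate(counts):
        if value is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = float(counts[right])
        elif right is None:
            filled[i] = float(counts[left])
        else:
            # Weight the two neighbours by their distance to minute i.
            weight = (i - left) / (right - left)
            filled[i] = counts[left] + weight * (counts[right] - counts[left])
    return filled
```

Whichever method is eventually chosen, publishing it alongside the statistics would support the transparency requirement discussed under dissemination.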
Dissemination of the output
Low quality of the raw data usually requires complex methods for the data cleaning, imputation and
estimation of statistical indicators. Professional ethics requires that official statistics are disseminated in
a transparent manner providing clear description of not only the data sources but also the methods used
to produce statistical indicators.
Recommended ethical approach
1. Communication with the institutions collecting traffic loops data could help to better understand
the provenance of the data and therefore to develop methods of data treatment that
could ensure the quality of the output required for official statistics.
2. Collaboration with academia and the research community is indispensable for developing methods of
handling road sensor data.
3. The research work done by other institutions and private companies in the
area of utilisation of traffic sensor data for different kinds of statistics should be studied continuously.
4. When releasing official statistics based on big data, the methods used to obtain the statistical
output need to be presented in a clear and understandable manner that is convenient for
the users.
4. Remote sensing data including satellite image data, data from
unmanned aerial vehicles (UAV)
The data source
Remote sensing is defined “as the measurement of object properties on the earth's surface using data
acquired from sensors mounted on aircrafts or satellites based on propagated signals (e.g.
electromagnetic radiation). The measurements are done at distance in contrast to measurements in
situ”⁴. It may be split into "active" remote sensing (i.e., when a signal is emitted by a satellite or aircraft
and its reflection by the object is detected by the sensor) and "passive" remote sensing (i.e., when the
reflection of sunlight is detected by the sensor)⁵.
The output of a remote sensing system is usually an image representing the scene being observed. A
further step of image analysis and interpretation is required in order to extract useful information from
the image. Remote sensing images are normally in the form of digital images⁶. Remote sensing has a
wide range of applications in many different fields, e.g. agriculture, the environment, business activity
and transport.
Potential use in official statistics
Statistical agencies around the world are investigating the possibility of using satellite imagery data in
official statistics. It is expected that satellite imagery has the potential to improve timeliness,
reduce the reporting burden on respondents, reduce the costs of surveys and provide data at a more
disaggregated level.
Experience collected so far shows that satellite imagery data can complement official statistics on
agriculture (e.g. by calculating crop acreage estimates or aiding in modelling crop yield) [10].
4 Schowengerdt, Robert A.: Remote Sensing, 3rd edition, 2007, p. 2
5 https://en.wikipedia.org/wiki/Remote_sensing
https://en.wikipedia.org/wiki/Satellite_imagery
6 http://www.crisp.nus.edu.sg/~research/tutorial/intro.htm
Possible concerns
Data acquisition
Most of the data obtained from satellite or aerial imagery providers is publicly available and therefore
the data acquisition should not be a problem for national statistical authorities. For the time being
privacy issues are not of major concern, although they can be raised by people who do not wish to have their
property shown from above. With the growing spatial resolution of satellite imagery and especially aerial
photography, which may enable the identification of individuals, issues of privacy may be raised.
Data processing
The major issue is the quality of the imagery, which may affect the quality (in particular the accuracy) of the
estimates. Research work done by the Australian Bureau of Statistics pointed out some limitations of
satellite imagery data that present challenges for its utilisation in official statistics. These are, for
example, missing and contaminated data due to cloud cover, missing or poor-quality data due to on-board
satellite equipment failures, and insufficient image resolution [10].
Data dissemination
In order to obtain statistical estimates on the basis of satellite or aerial imagery, techniques for classifying
the imagery data need to be applied. The methods, together with information on quality, will need to be
clearly explained to the users once statistical output is disseminated as official statistics.
Recommended ethical approach
1. A thorough analysis of the viability of using satellite/aerial imagery data to produce statistical estimates
needs to be performed.
2. Research work on the methods used to guarantee the quality of statistical estimates based on
satellite/aerial imagery data needs to be carried out before the statistics can be published as
official statistics. Joint efforts of NSIs and the research community in the exploration of these new data
sources could benefit from synergies and save public resources.
3. Collaboration with organisations that possess knowledge in interpreting satellite/aerial imagery
data could help to develop methods suitable for producing the indicators of official statistics.
4. An assessment of the risk of recognition or dissemination of confidential information or breach of privacy
should be performed.
5. Social media data
The data source
Social media data has surfaced in recent years as one of the big data sources with a lot of promise.
Whereas satellite imagery or mobile phone data are relatively well-defined as data sources, social
media is more of a mixed basket, which will need to be further clarified. A common denominator
of such data may be that they are disseminated through the Internet; further, most data are text messages,
images, video or searches voluntarily submitted by persons⁷.
In this chapter the data generated on the basis of text messages posted on Twitter and Facebook are
analysed from the point of view of official statistics.
Potential use in official statistics
The experience of Statistics Netherlands shows that social media messages can be used, for example, to
produce a consumer confidence indicator (European Harmonised Consumer Opinion Survey). This can be
done by building a model based on fitting characteristics derived from Facebook and Twitter messages.
The production time can be reduced from several weeks to a few days [11].
Beyond the traditional indicators of official statistics, some research work has been done to examine the
viability of using social media messages to measure the level of well-being in different countries. Most of
the happiness research in social media has focused on the emotion component (i.e., sentiment analysis),
for example, classifying the emotional affinity of sentences and characterising happiness as a specific
emotion [12].
7 https://unstats.un.org/bigdata/
Other studies showed that automated monitoring of public sentiment on social media, combined with
contextual knowledge, has the potential to be a valuable real-time proxy for food-related economic
indicators. The correlation between the volume of food-related Twitter conversations and official food
inflation statistics, and between food and fuel-related tweet volumes, can be quantified using simple
time series analysis [13].
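As an illustration of the kind of simple time series analysis mentioned above, the following minimal
sketch computes the Pearson correlation between a monthly tweet-volume series and an official inflation
series, optionally with a lag. The function names and the series values are hypothetical and serve only
to show the calculation; this is not necessarily the method applied in [13].

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(tweets, official, lag=0):
    """Correlate tweet volume in month t with the official figure in month t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = len(tweets) - lag
    return pearson(tweets[:n], official[lag:lag + n])


# Hypothetical monthly series (illustrative values only, not real data).
tweet_volume = [120, 135, 150, 170, 160, 180, 210, 230]
food_inflation = [4.1, 4.3, 4.6, 5.0, 4.9, 5.2, 5.8, 6.1]

same_month = lagged_correlation(tweet_volume, food_inflation, lag=0)
one_month_lead = lagged_correlation(tweet_volume, food_inflation, lag=1)
```

A high correlation in such a calculation would only indicate co-movement of the two series; it would not
by itself establish that tweet volumes are a reliable proxy for the official indicator.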
Possible concerns
Data acquisition
Social media messages are in general public, so obtaining access to the data should not cause
problems. Nevertheless, even though users of social networks mark their messages as public,
it might be considered unethical to use the messages for purposes that were not known
to the users in advance. The professional ethics of statisticians requires showing respect for data subjects,
and the question of whether users of social networks need to be informed that their public information will
be used by statistical authorities might need some consideration.
For some social media sources statistical offices might have privileged access to the data. In this case
data subjects should be informed that the data is used, how the data is processed and what kind of
statistical product is generated.
In some cases, access to the collection of gathered messages has to be purchased from private
companies. That might raise controversy over whether NSIs should pay for data sources that are going to
be used for producing official statistics.
Data processing
The experience collected so far shows that the way in which millions of messages are interpreted, and
in which models are built on this interpretation to produce statistical indicators, is essential for using
this data source in a way that guarantees the quality of the statistical output. This may require additional
research to better understand the relation between the behaviour of individuals and the phenomenon
being predicted.
Apart from that, it has to be taken into account that social media messages can be subject to
manipulation. Some additional investigation might be needed to estimate the probability that
artificially created messages emerge. The risk of data manipulation might increase if users of statistics
generated from social media messages knew what data were used and how these data were processed. In this
case, statistical offices should define risk avoidance and mitigation strategies under ethical considerations.
If the social media data (text messages, images, video etc.) obtained by statistical offices can
identify individuals, the confidentiality of personal information must be ensured and the possibility of
misuse should be eliminated.
Dissemination of the output
The methods and the data sources used to produce statistical indicators based on social media data and
disseminated as official statistics should be a part of their metadata and need to be described in an
understandable way.
Recommended ethical approach
1. In order to avoid controversy over the ownership of public messages, it is
recommended to make it known to the users of social networks that their public data are used for
the purposes of official statistics.
2. Thorough studies of the relation between people’s web activity and the phenomena to be
predicted are necessary in order to develop methods that generate statistical indicators that can
be published as official statistics. The prediction models need to be sufficiently tested and
robust.
3. The transparency of the methods used to produce statistical indicators needs to be ensured. It
concerns both statistical output disseminated as official statistics and additional studies that
were used to support the prediction models to obtain this statistical output.
4. Production of data should rely on more than one source in order to mitigate risks of data
manipulation or false correlations.
6. Web-scraped data from company websites, job vacancy websites or real estate agencies' websites
The data source
Web scraping (web harvesting or web data extraction) is a technique used for extracting data from
websites. The term usually refers to automated processes implemented using a bot or web crawler. It is
a form of copying, in which specific data is gathered and copied from the web, typically into a central
local database or spreadsheet, for later retrieval or analysis. 8
With web scrapers it is possible to collect the data from different websites that might be useful for
official statistics.
Potential use for official statistics
NSIs have used web scraping techniques to collect prices from the Internet (air tickets, clothes, electronic
devices, housing prices etc.) and analysed the possibilities of using these data for the compilation
of the Consumer Price Index (CPI). Online prices could replace prices collected by price collectors. This
can enlarge the sample of products and reduce the costs of the statistical production process. Research
by Statistics Netherlands in this area explains, for example, how “unmatched (new and
disappearing) items in the market could be treated and how the time-product dummy index compares
two matched-model price indexes” [14].
Ongoing projects carried out by the ESSnet Big Data are examining job websites, job adverts on the
websites of enterprises, and job vacancy data from third-party sources to show that valuable
information can be obtained to complement the production of Job Vacancy Statistics, and are exploring
the use of information from enterprises’ websites to update the statistical Business Register [23].
Possible concerns
Data acquisition
The legality of web scraping varies across countries. Web scraping may be against the terms of use of
some websites and, although the legal enforceability of these requirements is not clear, the ethical norms
of statisticians would recommend respecting the wishes of website owners.
Database rights9 might be considered as property rights and it might be the case that the act of creating
a database from web scraped data could breach database rights, at least if essential parts of the
database were scraped.
8 https://en.wikipedia.org/wiki/Data_scraping
9 Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases.
Once the legal issues are resolved, the NSIs should take from the websites only the information that is
necessary for the purposes of official statistics.
Data processing
The characteristics targeted by web scraping do not always correspond directly to the characteristics
used in official statistics, and the same characteristics may be duplicated on several websites. That might
create challenges for data processing. For example, a job vacancy is different from a job advertisement,
and the same job vacancy may be posted on different websites [15]. This shows that some resources need to
be invested in research work in order to be able to produce statistical indicators on the basis of
information collected from websites as official statistics.
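The duplication issue described above can be illustrated with a minimal sketch that collapses job
advertisements scraped from several websites into distinct vacancies. The record fields (`site`,
`employer`, `title`, `location`), the site names and the matching rule are assumptions made for this
illustration; in practice the matching would need to be considerably more robust.

```python
import re


def vacancy_key(ad):
    """Build a crude matching key from employer, job title and location."""
    parts = (ad["employer"], ad["title"], ad["location"])
    # Lower-case and collapse punctuation so that "ACME Ltd." matches "Acme Ltd".
    return tuple(re.sub(r"[^a-z0-9]+", " ", p.lower()).strip() for p in parts)


def deduplicate(ads):
    """Keep the first advertisement seen for each matching key."""
    unique = {}
    for ad in ads:
        unique.setdefault(vacancy_key(ad), ad)
    return list(unique.values())


# Hypothetical scraped advertisements (illustrative only).
ads = [
    {"site": "portal-a.example", "employer": "ACME Ltd.",
     "title": "Data Analyst", "location": "Luxembourg"},
    {"site": "portal-b.example", "employer": "Acme Ltd",
     "title": "data analyst", "location": "Luxembourg"},
    {"site": "portal-a.example", "employer": "ACME Ltd.",
     "title": "Statistician", "location": "Luxembourg"},
]

unique_ads = deduplicate(ads)
```

Note that such a key treats one advertisement as one vacancy; an advertisement covering several
posts, or one post advertised under different titles, would still be miscounted.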
Data dissemination
The replacement of the traditional data collection by the collection of information from websites may
have an impact on the time series of statistical indicators. The changes in the production process of
statistical indicators will need to be explained to the users of official statistics once these indicators are
published.
Recommended ethical approach
1. Web scraping for the purposes of obtaining information for official statistics needs to be
performed in a transparent way, e.g. information about web scraping activities could be
provided on the websites of the NSIs.
2. Web scraping must respect the law (e.g. intellectual property, database rights). The NSIs
should seek legal advice to ensure that planned web scraping activities are not prohibited
by European or national legislation.
3. Web scraping needs to be performed with respect for the rights of website owners, excluding
the possibility of harming their business.
4. It is recommended to identify yourself when extracting the data from the websites and to
provide contact details to the website administrator.
5. Protected areas of websites should be respected. The robots.txt files should be consulted in
order to be informed what is allowed to be scraped.
6. Web scraping at the time of intensive internet traffic should be avoided. It should not
compromise the functioning of the web services.
7. When the number of target websites is limited, it is recommended to contact website
owners directly, explaining the purpose of the research and the nature of the web scraping, and
asking for permission for the web scraping before applying it on a large scale.
8. Contact with website owners could help to better understand how the data is compiled
and whether the continuity of information can be guaranteed.
9. If possible, it is recommended to inform the data subjects (enterprises) that data
about them is being collected from their websites and to give them the possibility to opt out.
10. The collection of personal information should be avoided to the extent possible. Appropriate
organisational and technical measures to guarantee the confidentiality of individual data while the
data are collected, processed and stored have to be put in place.
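Some of the recommendations above (identifying the scraper, consulting robots.txt, limiting the load on
the website) can be sketched with the Python standard library. The user agent name, the contact
address, the sample robots.txt content, the URLs and the off-peak window are all hypothetical; the sketch
parses a robots.txt text supplied as a string and performs no network requests.

```python
import urllib.robotparser

# Hypothetical identification of the scraper and a contact point for administrators.
USER_AGENT = "nsi-statistics-bot"
REQUEST_HEADERS = {"User-Agent": USER_AGENT, "From": "webscraping@nsi.example"}

# Hypothetical robots.txt content of a target website.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt_text):
    """Parse the text of a robots.txt file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def polite_delay(parser, user_agent=USER_AGENT, default_seconds=5.0):
    """Seconds to wait between requests: Crawl-delay if given, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


def is_off_peak(hour, start=1, end=5):
    """True if the hour (0-23) falls in an assumed low-traffic window."""
    return start <= hour < end


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://shop.example/products/1",
    "https://shop.example/private/account",
]
to_fetch = allowed_urls(parser, candidates)
wait_seconds = polite_delay(parser)
```

In an actual scraper, `REQUEST_HEADERS` would be sent with each request and the scraper would pause for
`wait_seconds` between requests. Respecting robots.txt does not replace the legal assessment or the
direct contact with website owners recommended above.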
7. Query and ClickOut data from internet searches
The data source
A search query is a request for information that is made using internet search engines. One of the most
popular search engines, Google Search, was used by Google Inc. as the basis for a public web
facility, Google Trends10. Google Trends provides a query index that is calculated by dividing the total
query volume for the search term in question within a particular geographic region by the total number
of queries in that region during the time period being examined. The maximum query share in the time
period specified is normalised to be 100, and the query share at the initial date being examined is
normalised to be zero. [21]
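The calculation described above can be sketched as follows. The sketch assumes one possible reading of
the description in [21]: the query share is rescaled linearly so that the share at the initial date
maps to 0 and the maximum share maps to 100. The function name and the counts are hypothetical, and the
actual computation performed by Google Trends may differ.

```python
def query_index(term_counts, total_counts):
    """Query share per period, rescaled so that the first period is 0 and the peak is 100.

    Assumption: a linear rescaling between the initial share and the maximum share.
    """
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    base, peak = shares[0], max(shares)
    if peak == base:
        raise ValueError("index undefined: the initial share is also the maximum")
    return [100 * (s - base) / (peak - base) for s in shares]


# Hypothetical weekly counts for one region (illustrative only).
term_counts = [200, 300, 500, 400]             # queries for the search term
total_counts = [10_000, 10_000, 10_000, 10_000]  # all queries in the region

index = query_index(term_counts, total_counts)
```

Under this reading, periods with a query share below that of the initial date would receive a negative
index value. Because the index is relative to the period and region selected, values obtained for
different selections are not directly comparable.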
Apart from Google Trends, another source in this group of big data could be the number of Wikipedia
page views. “Wikipedia is ideally suited as a platform that could potentially be of use for legitimate
scientific investigation in many different areas. Not only is the information held within Wikipedia articles
very useful on its own, but statistics and trends surrounding the amount of usage of particular articles,
frequency of article edits, region specific statistics, and countless other factors make the Wikipedia
environment an area of interest for researchers” [22].
10 https://trends.google.com/trends/
Potential use for official statistics
Google Trends indices can be of interest to official statistics. Some studies have shown that Google Trends
query indices may be correlated with economic indicators and can be used for short-term prediction
(nowcasting) of economic and social statistics indicators, for example to predict unemployment figures
[15][16], trends in household consumption and the development of retail sales [17]. Google Trends indices
can also be used to improve standard forecast models in different domains of statistics using the
advantages of quick access to information that this tool provides.
Possible concerns
Data acquisition
Google Trends is a publicly available web service and therefore accessing the data should not be an
issue.
Data processing
The pilots conducted so far identified possible issues linked to the data processing that could influence
the quality of the statistical output. It might happen that information about the scope and coverage of
the searches as well as the stability of the algorithm used to obtain the index is not sufficiently
disclosed. That may lead to mistakes in prediction models and therefore in statistical estimates.
The accuracy of statistical estimates based on Google search (as any other data based on web searches)
may depend on the percentage of internet usage in the country and habits of people, for example,
searching information on the internet before making purchases. That might require additional studies on
this subject [18].
Data dissemination
It might be difficult to guarantee full transparency of the methods applied to produce statistical
estimates if information about the collection of data on web searches by the web services is not
available. The use of a data source that is a “black box” might be considered as non-compliance with
the professional ethics of statisticians. The publication of data based on web searches might trigger an
increase of these searches.
Recommended ethical approach
1. Collaboration with academia and research communities would be recommended in order to
develop prediction models that could handle possible deficiencies of the data based on internet
searches and make this data suitable for official statistics.
2. NSIs need to be aware of the rules and algorithms used by web services to collect the data
on web searches. Contact with the operators of the web services would be necessary to get
information on the provenance of the data.
3. The relation between the model for the prediction of estimates based on web searches and the
phenomenon being predicted should be proven by scientific evidence.
4. In order to guarantee transparency of the methods used to predict statistical indicators
disseminated as official statistics the data sources and methods used should be disclosed to the
users.
5. NSIs should include more data sources in the prediction model or use additional sources for
verifying the results.
8. Cash register data, e.g. from supermarkets
The data source
A cash register is a mechanical or electronic device for registering and calculating transactions at a point
of sale. Cash registers are usually connected to a handheld or stationary barcode reader so that a
customer's purchases can be more rapidly scanned.11 At the moment a product is bought, a scanner
creates a record that contains the European Article Number (EAN), unique for each item, the quantity sold
and the price.
These transaction records collected from a large number of retailers can be considered as big data. This
data is a valuable source of information that can be used by NSIs for official statistics.
11 https://en.wikipedia.org/wiki/Cash_register
Possible use in official statistics
Scanner data is used by several NSIs for producing the Consumer Price Index (CPI). The use of scanner
data has advantages over traditional data collection: lower costs of data collection, no sampling
errors as the data contains all the transactions of the retailers, better reflection of price changes as the
data includes transactions for the entire observation period, and the opportunity to observe arrivals of new
products in the market [19] [20].
The possibility of using scanner data to complement other domains of official statistics (price statistics,
household expenditure, and business statistics) is under exploration.
Possible concerns
Data acquisition
Difficulties in obtaining the data from retailers may raise some ethical concerns linked to the
establishment of cooperation with private companies. The experience of the NSIs shows that data
holders are not always willing to cooperate. It can take a long time to develop the relationship and to gain
access to the data [20]. The NSIs need to find mutually beneficial arrangements that do not jeopardise the
professional independence of statisticians. The agreements with the retailers (data holders) need to
guarantee the provision of scanner data to NSIs on a regular basis and at agreed dates. The NSIs should be
informed in advance about any changes in the data structure. All of this may create an extra burden on
retailers.
The data can also be purchased from the market research companies but that may raise questions
related to costs, i.e. the price for the data provision might exceed the cost for preparing them.
Data processing
Scanner data provide opportunities for official statistics but also create serious methodological
challenges (to classify the products correctly, to choose the appropriate index and a weighting system, to
make sure that the products for which prices are measured were not replaced by the products that do
not correspond to the technical specification of the observed ones etc.). Besides that, due to the high
importance of the CPI it is necessary to ensure its compliance with the EU Regulation and comparability
over time and across countries.
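To make the index choice mentioned above more concrete, the following minimal sketch derives a unit
value per EAN from transaction records (EAN, quantity sold, price) and computes an unweighted Jevons
index over the items present in both periods. The Jevons formula is used here only as one simple
example of an elementary index; the text does not prescribe a specific index, and the item codes and
prices are hypothetical.

```python
from collections import defaultdict
from math import exp, log


def unit_values(transactions):
    """Return {ean: turnover / quantity sold} for the transactions of one period."""
    turnover = defaultdict(float)
    quantity = defaultdict(float)
    for ean, qty, price in transactions:
        turnover[ean] += qty * price
        quantity[ean] += qty
    return {ean: turnover[ean] / quantity[ean] for ean in turnover}


def jevons_index(base_prices, current_prices):
    """Unweighted geometric mean of price relatives over matched items (base = 100)."""
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no matched items between the two periods")
    log_sum = sum(log(current_prices[e] / base_prices[e]) for e in matched)
    return 100 * exp(log_sum / len(matched))


# Hypothetical records: (item code standing in for an EAN, quantity sold, price).
january = [
    ("ean-001", 10, 2.00),
    ("ean-001", 10, 2.20),
    ("ean-002", 4, 5.00),
    ("ean-003", 2, 9.00),   # disappears in February
]
february = [
    ("ean-001", 12, 2.31),
    ("ean-002", 3, 5.50),
    ("ean-004", 6, 1.20),   # new item, unmatched in January
]

base = unit_values(january)
current = unit_values(february)
index = jevons_index(base, current)
```

In this sketch the disappearing and the new item are simply ignored, and no weights are applied. How to
treat such unmatched items and how to weight the products are among the methodological choices
referred to above.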
Dissemination of output
The change of the data sources (from price collectors’ data to scanners’ data) may have an impact on the
time series of the statistical output. The effects should be minimized. In addition, users of official
statistics need to be informed on possible discrepancies between the old and new methodology.
Recommended ethical approach
1. Cooperation with the data providers (retailers) should be based on partnerships with mutual
benefits. Excessive burden on the data providers (efforts for preparation of the data to be
transmitted to NSIs) should be avoided. Professional independence of statistical authorities needs to
be ensured.
2. Scanner data might be of commercial value for the retailers. The data received by statistical
authorities should only be used for the purposes of official statistics. Rules and measures to protect
data from misuse should be communicated by the statistical offices.
3. Regular information exchange between retailers and statistical offices could help to understand
possible changes in the market and replacements of the products and consequently to ensure
required quality of the price indices.
4. Collaboration with the research community is necessary to overcome the methodological challenges
posed by the scanner data. Learning from the experience of other statistical offices could help
to accelerate the progress.
5. The dissemination of the statistical output needs to be done in a transparent way. The changes in the
data sources and methods used should be communicated to the users of official statistics.
CONCLUSIONS
Different types of big data have specific characteristics linked to their suitability for official statistics.
They may raise different ethical questions during the statistical business production process. For
example, the issues linked to access to the data are more relevant to data sources that contain
sensitive personal data, such as mobile phone data. Social media data are more susceptible to
manipulation than other sources; road sensor data are probably more “noisy” than others; the relation
between data and the phenomenon to be measured might not always be straightforward, etc. The integration
of big data sources into production of official statistics may require following different approaches by
statistical offices to resolve the issues linked to professional ethics, which may depend on institutional,
legal and national context.
In addition, ethical principles of official statistics may be in conflict with each other. For example, the
requirement to ensure the privacy of the data subjects may conflict with the increasing demand for more
detailed information for evidence-based policy making, and the requirement to reduce response burden may
coincide with an increasing burden on big data holders, etc.
These developments may trigger the need to create a dedicated structure that addresses and provides
advice on ethical questions, helps to find the right balance between ethical norms of conduct, and is
empowered to take decisions on issues linked to professional ethics.
The consistent application of quality principles in statistical production and dissemination is one of the
main principles that guarantee trust in official statistics. As the research and exploratory phase for new
data sources may take some time before they can be incorporated into the regular production process, it is
advisable to publish preliminary results as experimental data and not as official statistics.
Publication as experimental statistics could be used to improve the quality of the final product and could
lead to a discussion on the pertinence of data sources and methods for generating statistical
information.
REFERENCES
[A] Scheveningen Memorandum Big Data and Official Statistics;
http://ec.europa.eu/eurostat/documents/42577/43315/Scheveningen-memorandum-27-09-13
[B] FUNDAMENTAL PRINCIPLES OF OFFICIAL STATISTICS (A/RES/68/261 from 29 January 2014)
http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
[C] EUROPEAN STATISTICS CODE OF PRACTICE, for the national and community statistical authorities;
adopted by the European Statistical System Committee 28th September 2011;
http://ec.europa.eu/eurostat/documents/3859598/5921861/KS-32-11-955-EN.PDF/5fa1ebc6-90bb-43fa-888f-dde032471e15
[D] DECLARATION ON PROFESSIONAL ETHICS, adopted by the ISI council 22 & 23 July 2010, Reykjavik,
Iceland;
https://www.isi-web.org/index.php/news-from-isi/34-professional-ethics/296-eclarationprofessionalethics-2010uk?showall
[1] Mobile telephones and mobile positioning data as source for statistics: Estonian experience;
University of Tartu and Eurostat;
https://ec.europa.eu/eurostat/cros/system/files/S19P4.pdf_en
[2] Estimating population density distribution from network-based mobile phone data; Fabio Ricciato,
Peter Widhalm, Massimo Craglia and Francesco Pantisano; European Commission, Joint Research Centre
–JRC, 2015;
https://ec.europa.eu/jrc/sites/jrcsh/files/Final-%20jrc-AIT-MNO-study-compressed.pdf
[3] Mobile phone data for mobility statistics, Emanuele Baldacci, ISTAT;
https://unstats.un.org/unsd/trade/events/2014/beijing/presentations/day1/afternoon/3.%20Mobile%20phone%20data%20for%20mobility%20statistics--Emanuele%20Balda.pdf
[4] Feasibility study on the use of mobile positioning data for tourism statistics, Eurostat, 2014;
http://ec.europa.eu/eurostat/documents/747990/6225717/MP-Consolidated-report.pdf/530307ec-0684-4052-87dd-0c02b0b63b73
[5] A Big Data Pilot Project with Smart Meter Data, Lily Ma, Statistics Canada;
http://www.statcan.gc.ca/sites/default/files/media/14274-eng.pdf#
[6] Modelling sample data from smart-type meter electricity usage, Susan Williams, ONS UK;
https://ec.europa.eu/eurostat/cros/system/files/Williams_Smart%20meters%20abstract%20final.pdf
[7] ESSnet Big Data. WP3. Smart meters. Report on data access and data handling. 2016-07-29;
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php?title=WP3_Report_1&oldid=2942
[8] High frequency road sensor data for official statistics; Marco Puts, Piet Daas, Martijn Tennekes;
Statistics Netherlands 2014;
http://www.von-tijn.nl/tijn/research/publications/road_sensors1_NTTS2015.pdf
[9] Traffic sensor data for commuting statistics, Pasi Piela, Statistics Finland;
http://www1.unece.org/stat/platform/display/BDI/Statistics+Finland+-+Traffic+sensor+data+for+commuting+statistics
[10] Methodological Approaches for Utilising Satellite Imagery to Estimate Official Crop Area Statistics,
Jennifer Marley, Daniel Elazar and Kate Traeger; Analytical Services Branch; Australian Bureau of
Statistics; Research paper, 2014;
http://www.ausstats.abs.gov.au/ausstats/subscriber.nsf/0/EEF6FB75844F0AD3CA257D57001D7662/$File/1352055144_sep%202014.pdf
[11] Social media sentiment and consumer confidence, Piet J.H. Daas and Marco J.H. Puts, Statistics
Netherlands, 2014;
https://www.ecb.europa.eu/pub/pdf/scpsps/ecbsp5.en.pdf
[12] Characterizing Geographic Variation in Well-Being using Tweets, University of Pennsylvania,
Michigan State University, 2013;
http://wwbp.org/papers/icwsm2013_cnty-wb.pdf
[13] Mining Indonesian Tweets to Understand Food Price Crises, UN Global Pulse, February 2014;
http://www.unglobalpulse.org/sites/default/files/Global-Pulse-Mining-Indonesian-Tweets-Food-Price-Crises%20copy.pdf
[14] Online Data, Fixed Effects and the Construction of High-Frequency Price Indexes; Jan de Haan and
Rens Hendriks; Statistics Netherlands, 2013;
https://www.business.unsw.edu.au/research-site/centreforappliedeconomicresearch-site/Documents/Jan-de-Haan-Online-Price-Indexes.pdf
[15] Improving prediction of unemployment statistics with Google trends: preliminary experiments,
Vittorio Perduca, Eurostat;
https://ec.europa.eu/eurostat/cros/system/files/UnemploymentFrance_bigdata.pd
[16] European Commission, Eurostat (Pedro Ferreira): Improving prediction of unemployment statistics
with Google trends: part 2;
https://ec.europa.eu/eurostat/cros/content/improving-prediction-unemployment-statistics-google-trends-part-2_en
[17] Google as a tool for nowcasting household consumption: estimations on Hungarian data, Istvan
Janos Toth, Miklós Hajdu, Institute for Economic and Enterprise Research – HCCI;
http://old.gvi.hu/data/papers/ciret_2012_tij_hm_paper_120415.pdf
[18] The use of web activity evidence to increase the timeliness of official statistics indicators Fernando
Reis, Pedro Ferreira, Vittorio Perduca, Eurostat, 2014;
https://ec.europa.eu/eurostat/cros/content/use-web-activity-evidence-increase-timeliness-iaos2014_en
[19] Initial report on experiences with scanner data in ONS, Derek Bird, Robert Breton, Chris Payne and
Ainslie Restieaux, ONS UK, 2014;
https://www.ons.gov.uk/ons/guide-method/user-guidance/prices/cpi-and-rpi/initial-report-on-experiences-with-scanner-data-in-ons.pdf
[20] Issues on the use of scanner data in the CPI, Muhanad Sammar, Anders Norberg and Can Tongur,
Statistics Sweden, 2013;
http://www.ottawagroup.org/Ottawa/ottawagroup.nsf/4a256353001af3ed4b2562bb00121564/8bdac0e73d96c891ca257bb00002fdb4/$FILE/Muhanad%20Sammar%202%20ISSUES%20ON%20THE%20USE%20OF%20SCANNER%20DATA%20IN%20THE%20CPI.pdf
[21] Predicting the Present with Google Trends, Hyunyoung Choi, Hal Varian, December 18, 2011, page 3,
chapter 2 Google Trends;
http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf
[22] Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-
Time, David J. McIver, John S. Brownstein;
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003581
[23] ESSnet Big Data. WP1. Web scraping/ Job vacancies, Deliverable 1.2, 2016-11-11;
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/6/64/WP1_Deliverable_1_2_final.pdf