Olav ten Bosch 23 March 2016, ESSnet big data WP2, Rome Webscraping at Statistics Netherlands

Content

– Internet as a datasource (IAD): motivation

– Some IAD projects over past years

– Technologies used

– Summary / trends

– Observations / thoughts

– Legal

– The Dutch Business Register

The why

Administrative sources

– Tax, social security services

– Municipalities/ Provinces

– Supermarkets

– …

– …

– Surveys

Internet sources

Fuel prices (2009)

‐ Daily fuel prices from the website of unmanned petrol stations (tinq.nl)

‐ Regional prices (per station) every day

Now (2016):

‐ A direct data feed from the Travelcard company, weekly

‐ Fuel prices per day and all transactions of that week

‐ Publication on the website: prices per month

Airline tickets (2010)

– Pilot: 3 robots on 6 airline companies

– 2 robots by external companies, 1 by Statistics Netherlands (SN)

– Robot prices match the manually collected prices

– Quite expensive; negative business case

– 2016: still manual price collection of airline tickets

[Figure: ticket price Amsterdam - Milano, 11 Feb to 10 Aug, robot versus manual collection; price axis from 0 to 250]

Housing market

– Housing market (from 2011):

‐ Discussions with an external company (iWoz) for more than 1 year

‐ We scraped 5 sites, about 250,000 observations per week, for 2 years

From 2013:

‐ Direct feed from one of the sites (Jaap.nl)

‐ Statline tables: Bestaande woningen in verkoop (existing dwellings for sale)

‐ “based on 80-90 percent of the market”

Bulk price collection for CPI (1)

– Bulk price collection for CPI (from 2012):

‐ Mainly clothing

‐ Software scrapes all prices and product data (id, name, description, category, colour, size, …)

2016:

‐ About 500,000 price observations daily from 10 sites

‐ Data from 3 sites used in production of Dutch CPI

‐ Price collection process embedded in organisation

‐ Plans to extend to > 20 sites; other domains

Bulk price collection for CPI (2)

[Diagram: processing bulk data from the Internet. Labels: Data collection & Feature extraction; Structured data; Big Data Index methods; Index based on internet data]

Features: Fine-knit Jumper Dark blue Striped Cotton edges

Robot-assisted price collection

– Robot tool for detecting price changes on (parts of) websites

– Traffic light indicates status:

‐ Green: nothing changed, the price is saved in the database

‐ Red: something changed, needs the attention of a statistician

‐ Two clicks to keep the old price or store a new one

‐ In production since 2014
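
The traffic-light check described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Statistics Netherlands tool: it assumes the monitored part of the page is already available as a string, and the function names `fingerprint` and `traffic_light` are hypothetical.

```python
import hashlib


def fingerprint(fragment: str) -> str:
    """Hash a page fragment after collapsing whitespace, so that pure
    layout changes (extra spaces, line breaks) do not raise an alert."""
    normalized = " ".join(fragment.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def traffic_light(stored_fingerprint: str, current_fragment: str) -> str:
    """Return 'green' if the fragment is unchanged, otherwise 'red'."""
    if fingerprint(current_fragment) == stored_fingerprint:
        return "green"
    return "red"


# Hypothetical example: the fragment of a product page that holds the price.
previous = "<span class='price'>EUR 19.95</span>"
stored = fingerprint(previous)

# Same content with different whitespace, then a changed price.
print(traffic_light(stored, "<span class='price'>EUR  19.95</span>\n"))
print(traffic_light(stored, "<span class='price'>EUR 17.95</span>"))
```

On green the stored price can simply be carried forward; on red the statistician decides whether to keep the old price or store a new one.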

Collect data on enterprises for EGR (2013)

– Pilot: find data about EGR enterprises on the web

‐ We scraped semi-structured data from Wikipedia

‐ Multiple Wikipedia languages (NL, EN, DE, FR)

‐ 2016: something similar in ESSnet BD WP2?

Search product descriptions for classifying business activities

– Search product descriptions on web (from 2014)

‐ First time we used automated search with the Google Search API for statistics

‐ Pilot, no production

‐ Some doubts about the Google results

Twitter-LinkedIn (1)

– LinkedIn-Twitter for profiling (2015)

‐ Automated search on LinkedIn based on a sample of Twitter users

‐ Very specific and experimental

‐ “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch

Scraping websites of enterprises

– Identify family businesses (search and/or crawling) (2016)

– Identify businesses with Corporate Social Responsibility (CSR) (search and/or crawling) (2016)

– Research program:

‐ “Extracting information from websites to improve economic figures”

– This ESSnet BD WP2!

Crawling for Statistics

[Diagram labels:]

– Internet

– Focused Crawler (Roboto)

– Datastore

– Search & Match (ElasticSearch)

– URL base

– Incomplete statistical data

– More complete statistical data

– Search terms

– Navigation terms

– Item identifier terms: “year report, family business”
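
As a rough illustration of how navigation terms and item identifier terms might steer a focused crawl, here is a minimal sketch. It does not use the Roboto API: it assumes the HTML has already been fetched, and `LinkAndTextParser` and `analyse_page` are hypothetical names. Links whose anchor text contains a navigation term are queued for a later visit; item identifier terms found in the page text are reported as hits.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect (href, anchor text) pairs and all text fragments of a page."""

    def __init__(self):
        super().__init__()
        self.links = []        # list of (href, anchor text)
        self.text_parts = []   # every text fragment seen on the page
        self._href = None      # href of the <a> that is currently open
        self._anchor = []      # text fragments inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor).strip()))
            self._href = None


def analyse_page(base_url, html, navigation_terms, item_terms):
    """Return (item terms found on the page, absolute URLs worth following)."""
    parser = LinkAndTextParser()
    parser.feed(html)
    text = " ".join(parser.text_parts).lower()
    found = [term for term in item_terms if term.lower() in text]
    to_follow = [
        urljoin(base_url, href)
        for href, anchor in parser.links
        if any(term.lower() in anchor.lower() for term in navigation_terms)
    ]
    return found, to_follow


# Hypothetical page and term lists, for illustration only.
page = (
    "<p>We are a family business since 1950.</p>"
    '<a href="/about">About us</a><a href="/shop">Shop</a>'
)
found, to_follow = analyse_page(
    "http://example.com/", page,
    navigation_terms=["about"],
    item_terms=["year report", "family business"],
)
print(found, to_follow)
```

Plain substring matching keeps the sketch short; a production crawler would need language-specific term lists and more robust text extraction.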

Technologies used

– Perl (2009), Djuggler (2010)

– Python, Scrapy (2010)

– R (2011-2015)

– Node.js (JavaScript on the server) (2014-)

– Google Search API (2014-)

– ElasticSearch (2016)

– Roboto (Node.js package, 2015-2016)

– Nutch: tested, not used

– Generic Framework (robot framework) for bulk scraping of prices

Summary / trends

Production Scrape Search Crawl External company

Tinq x (x) Travelcard

Airlines x 2 robots

Housing x (x) Jaap.nl

BulkCPI x x

Robottool x x (x)

EGR x x

RGS x

Twitter/LinkedIn x x

Enterprises x x Dataprovider?

Observations / thoughts …

‐ If it is there, we can get it

‐ Technology is (usually) not the problem!

‐ The internet is a living thing!

‐ It’s too simple to think we can just buy the internet somewhere and then make statistics!

‐ It’s powerful to combine something we know with something we observe!

‐ External companies can help, but be careful …

Legal

– Dutch Statistics Law:

‐ Enterprises have to provide data to Statistics Netherlands on request

‐ Scraping information from websites reduces response burden

‐ Statistics Netherlands uses data for official statistics only

– Dutch database legislation:

‐ Commercial re-use of intellectual property is forbidden

‐ This may also apply to internet sources

– Privacy:

‐ Dutch (statistical) legislation on protection of personal information

‐ Statistics Netherlands only scrapes public sources and processes the data within its own safe environment, just as with other (privacy-sensitive) data internally

– Netiquette:

‐ Respect robots.txt

‐ Identify yourself (user-agent)

‐ Do not overload servers, use some idle time between requests
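
The three netiquette rules can be sketched with Python's standard `urllib.robotparser`. This is a minimal sketch under stated assumptions: the robots.txt content, the user-agent name `example-stats-bot`, and the helpers `make_parser`, `polite_plan` and `crawl` are all hypothetical, and the actual download is left to a `fetch` function supplied by the caller.

```python
import time
from urllib import robotparser

# Hypothetical user-agent name; a real scraper would also send contact
# details in its User-Agent header so site owners can identify it.
USER_AGENT = "example-stats-bot"

# Hypothetical robots.txt content, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_txt):
    """Build a robots.txt parser from text that was already downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, urls, default_delay=1.0):
    """Return the URLs we may fetch and the pause to keep between requests."""
    allowed = [url for url in urls if parser.can_fetch(USER_AGENT, url)]
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    return allowed, delay


def crawl(parser, urls, fetch, sleep=time.sleep):
    """Fetch the allowed URLs one by one, idling between requests."""
    allowed, delay = polite_plan(parser, urls)
    results = []
    for url in allowed:
        results.append(fetch(url))
        sleep(delay)  # idle time so the server is not overloaded
    return results


parser = make_parser(ROBOTS_TXT)
urls = ["http://example.org/prices.html", "http://example.org/private/data.html"]
print(polite_plan(parser, urls))
```

Passing `sleep` as a parameter makes the pause easy to replace in tests; in real use the default `time.sleep` applies.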

Dutch Business Register (simplified)

[Diagram labels: Legal units; relationships; Cluster of control; Enterprise groups; Enterprises; Local units]

– Sources:

‐ Trade Register

‐ Tax Register

‐ Social security register (employees)

‐ Profilers

– From administrative units to statistical units:

‐ About 1.5 million administrative entities

‐ About 0.5 million have a URL

‐ Quality of the URL field not known, but seems usable