Big Data ESSNet WP1: Web Scraping for Job Vacancy Statistics Nigel Swier

Big Data ESSNet WP1: Web Scraping for Job Vacancy Statistics · 2018-05-22



Big Data ESSNet WP1:

Web Scraping for Job Vacancy Statistics

Nigel Swier


Today’s talk is just the tip of the iceberg…


Potential of On-line Job Vacancy (OJV) Data

Comparison of current official estimates (survey) and online data:

• Frequency: Quarterly (survey); Real-time? (online)

• Industry Sector

• Enterprise Size

• Job type / skills

• Geography: National Totals

Potential benefits of online data:

• More frequent

• More timely

• More granular

• Less burden

• Cheaper???


The Partners

SGA-1 partners (from Feb 2016):

• UK (lead)

• Germany

• Slovenia

• Greece

• Italy

• Sweden

SGA-2 partners (from Aug 2017):

• Belgium

• France

• Portugal


The People

Wiesbaden, April 2016

Rome, November 2016

Thessaloniki, Sept 2017

Milan, March 2018


Six challenges with using On-line Job Vacancy (OJV) data for statistical purposes


Challenge 1:

Not all jobs are advertised on-line. Coverage is incomplete and not representative.

Recruitment by Channels, Germany 2016 (Source: JVS)


Challenge 2:

There is no definitive source of OJV data

• National Employment Agencies

• Job portals:

– Job Boards

– Job Search Engines

– Hybrid Portals

• Enterprise websites

• Data aggregators:

– Commercial providers

– CEDEFOP

Duplication



Challenge 3:

Much OJV data is unstructured. Text processing and analysis is required to extract useful information.


Challenge 4:

Some job ads are not within the scope of official statistics definitions of a job vacancy

• International Jobs

• Ghost Vacancies

• Unpaid Student Internships



Challenge 5:

The official definition of a job vacancy does not correspond directly to the concept of a live job ad


One ad, multiple vacancies


Challenge 6:

The specific job vacancy data landscape varies between countries:

• Size of country and number of job portals

• Digital penetration

• Characteristics of the economy and the labour market

• The role of National Employment Agencies

• Differences in the Job Vacancy Survey

• Language(s)

• Legal Issues



Summary of Challenges

OJV data is not representative of the labour market and there are definitional issues that make it difficult to compare directly with official statistics



Data Access


OJV Data Landscape

[Diagram of the OJV data landscape. Elements shown: Job Boards; Private Employment Agencies; Employers; Job Search Engines; National Employment Agency; Enterprise Websites; Data Aggregators; Public Policy; Cedefop; Official Job Vacancy Statistics]


Approaches to Data Access

• Direct web scraping

• Point and click

• Programmatic (e.g. Python Scrapy)

• Web-scraping enterprise websites

• Agreed Access

• National employment agency

• Private job portals

• Commercial providers

• CEDEFOP
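
As an illustration of the programmatic route, the following is a minimal sketch that extracts job title, link and location from a listing page. It uses only the Python standard library rather than Scrapy, and the HTML snippet, the `job` and `loc` class names and the `JobListParser` helper are all hypothetical; real portals use different markup and may restrict scraping in their terms of use.

```python
from html.parser import HTMLParser

# Hypothetical listing-page markup; real portals differ and must be inspected individually.
SAMPLE_HTML = """
<ul>
  <li class="job"><a href="/jobs/101">Data Analyst</a><span class="loc">Leeds</span></li>
  <li class="job"><a href="/jobs/102">Staff Nurse</a><span class="loc">Cardiff</span></li>
</ul>
"""


class JobListParser(HTMLParser):
    """Collects one record (url, title, location) per <li class="job"> element."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._current = None  # record being filled, or None outside a job item
        self._field = None    # text field that the next data chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "job":
            self._current = {"url": None, "title": "", "location": ""}
        elif self._current is not None and tag == "a":
            self._current["url"] = attrs.get("href")
            self._field = "title"
        elif self._current is not None and tag == "span" and attrs.get("class") == "loc":
            self._field = "location"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._current is not None:
            self.jobs.append(self._current)
            self._current = None


def parse_jobs(html):
    parser = JobListParser()
    parser.feed(html)
    return parser.jobs


jobs = parse_jobs(SAMPLE_HTML)
for job in jobs:
    print(job)
```

In practice the HTML would come from an HTTP request to the portal, and a framework such as Scrapy would also handle crawling, throttling and retries.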



Data Access by Country


Data Handling

• Data cleaning and deduplication

• Text analysis and classification

• Flow to stock transformation
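
The flow to stock transformation can be pictured with a small sketch. It assumes, hypothetically, that each scraped ad carries a first-seen and a last-seen date, and that the stock on a reference date is the number of ads live on that date. The records and the `stock_on` helper are illustrative only; the slides do not specify the rule the partners actually applied.

```python
from datetime import date

# Hypothetical scraped records: (ad_id, first_seen, last_seen).
ads = [
    ("a1", date(2018, 1, 3), date(2018, 1, 20)),
    ("a2", date(2018, 1, 15), date(2018, 2, 10)),
    ("a3", date(2018, 2, 1), date(2018, 2, 5)),
]


def stock_on(ads, reference_date):
    """Count ads that were live (first_seen <= reference_date <= last_seen)."""
    return sum(1 for _, first, last in ads if first <= reference_date <= last)


stock_jan = stock_on(ads, date(2018, 1, 18))  # a1 and a2 are live
stock_feb = stock_on(ads, date(2018, 2, 8))   # only a2 is live
print(stock_jan, stock_feb)
```

A count of live ads is still not a count of vacancies, since one ad can cover several vacancies.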


Classifying textual data with machine learning

Can industry and occupation be classified from a job ad?

Occupation is fairly straightforward in this case

Industry is more difficult. This company is an employment agency, not the employer. But there are clues….


Text pre-processing and feature extraction

• Text Standardisation

• Stop word removal

• White/blacklists

• Stemming (e.g. “making” => “mak”)

• Lemmatization:

– Standard (e.g. “making” => “make”)

– Sophisticated:

• Feature Extraction:

– Bag of words / n-grams

– Term frequency
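
The steps above can be sketched in a few lines of standard-library Python. The stop-word list and the `naive_stem` suffix stripper are toy assumptions for illustration, not the tools used in the project; a real pipeline would rely on an established stemmer or lemmatizer for each language.

```python
import re
from collections import Counter

# Illustrative subset of English stop words (assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "of", "the", "to", "we"}


def naive_stem(token):
    """Toy suffix stripper, not a real stemming algorithm such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # standardise case, drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


def ngrams(tokens, n):
    """Contiguous n-grams joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


tokens = preprocess("We are making and selling cakes in the making.")
bigrams = ngrams(tokens, 2)
tf = term_frequency(tokens)
print(tokens, bigrams, tf)
```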



Machine Learning

• Training data

• Libraries:

– scikit-learn

– RTextTools

• Best performing algorithms/approaches

– SVM with Linear Kernel (Portugal)

– Logistic Regression (France)

– Multinomial Naïve Bayes (Germany)

– Ensemble (Belgium)
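
To make one of the listed approaches concrete, here is a minimal from-scratch sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. The training ads, the labels and the `train_nb` and `predict_nb` helpers are hypothetical; the country studies used library implementations and far larger training sets.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labelled training ads: (tokens, occupation label).
TRAIN = [
    (["nurse", "ward", "patient", "care"], "health"),
    (["care", "assistant", "patient"], "health"),
    (["python", "developer", "software"], "it"),
    (["software", "engineer", "java"], "it"),
]


def train_nb(examples):
    """Collect document counts per class, word counts per class, and the vocabulary."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in examples for t in tokens}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAIN)
prediction = predict_nb(model, ["senior", "software", "developer"])
print(prediction)
```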



Results: Classifying Occupation

Occupation Coding Confusion Matrix, Portugal Study


Results: Classifying Industry

NACE Coding Confusion Matrix, Belgium Study
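
For readers unfamiliar with confusion matrices, the sketch below shows how one is tabulated from paired true and predicted codes, and how overall accuracy follows from its diagonal. The codes are made-up NACE section letters, not results from the Portugal or Belgium studies.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return {(true, predicted): count} for paired label sequences."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Share of cases on the diagonal, i.e. where true and predicted codes agree."""
    correct = sum(n for (t, p), n in matrix.items() if t == p)
    return correct / sum(matrix.values())


# Hypothetical NACE section codes; not results from any study.
true_codes = ["C", "C", "G", "N", "N", "N"]
pred_codes = ["C", "G", "G", "N", "N", "C"]

matrix = confusion_matrix(true_codes, pred_codes)
overall = accuracy(matrix)
print(matrix, overall)
```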


Other approaches to classifying data

• String matching

• Levenshtein distance

• Jaccard Similarity

• Phrase-based classification (PBC)

• Controlled vocabularies

• More precision

• Greater transparency

• Less Scalable
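
The two string-matching measures named above can be sketched as follows. Both functions are plain textbook versions written for illustration; the example strings are hypothetical and the slides do not say which thresholds or tokenisation the partners used.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        previous = current
    return previous[-1]


def jaccard(a_tokens, b_tokens):
    """Size of the intersection over size of the union of two token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 1.0


distance = levenshtein("kitten", "sitting")
similarity = jaccard("senior data analyst".split(), "data analyst london".split())
print(distance, similarity)
```

Levenshtein distance works on characters and suits near-identical strings such as misspelt company names, while Jaccard similarity works on token sets and tolerates reordered words.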


Methodology

• Quality Assessment Frameworks

• Assessing Coverage

• Matching and Linking

• Time series analysis / Nowcasting


Assessment against aggregates


Assessment against statistical units

Also illustrates an LSTM neural network nowcasting model that uses multiple OJV sources

JV count comparison for a selected company, UK Study


Time Series Analysis


Statistical Outputs


Experimental Outputs For Slovenia


Job Vacancy Flash Estimates


Job Vacancies by Local Areas


Key Conclusions (and Questions)

• Agreed access arrangements are generally better than direct web scraping

• OJV data cannot replace the Job Vacancy Survey

• OJV data does not correspond to target concepts and only measures part of the labour market. How useful are these measures?

• If useful, how should these measures be presented alongside the official estimates?

• A successful collaboration with CEDEFOP is essential. How do we get the best possible quality data for official statistics purposes?


Future Perspectives


Disruptive technologies


Drivers of Cedefop RLMI work

Complement skills intelligence toolkit

Better labour market information for better policies

Lack of comparable data and systematic analysis


Key characteristics of the project

• Based on previous feasibility study

– Interesting and unique set of results

– Data used for Eurostat hackathon

– Data used for various activities of WP 1

• Key features

– Preselected well analysed sources

– All 28 EU MS / all EU official languages

– Skills in ESCO v.1 + other attributes

• Time horizon

– Early release (Dec. 2018)

– CZ, DE, ES, FR, IT, IE, UK

– Final version (Dec 2020)


Connection to ESSNet and Eurostat

• Valuable two-way cooperation

– Big Data Task Force

– EU hackathon

– Data4policy Sherpa Meeting

– ESSNet WP1

• What next?

– Validation

– Production