WP1 - Web Scraping for Job Vacancy Statistics
Big Data ESSNet Workshop, Sofia, 24-25 February 2017
Nigel Swier


Page 2

Rationale

Comparison of current official estimates (survey) with online data:

• Frequency: quarterly (survey) versus real-time? (online data)
• Breakdowns compared: industry sector, enterprise size, job type / skills, sub-national, national totals

Potential advantages of online data: more frequent, more timely, more granular, less burden, cheaper???

Page 3

Participants (SGA-1)

• United Kingdom (lead)

• Germany

• Sweden

• Slovenia

• Italy

• Greece

Page 4

Data Access

1. Web scraping job portals
2. Job portal APIs
3. Web scraping enterprise websites
4. Public sector agencies
5. Commercial suppliers

Page 5

Broad Approach

• Understand the landscape of web-based job vacancy data in each country
• Focus first on job portals, later explore enterprise websites
• Try to replicate existing outputs, then investigate opportunities to produce new types of output
• Develop specific approaches that are appropriate to the circumstances in each country
• Develop common approaches where possible

Page 6

Key Concepts

Target Measure: Job Ad

Target Concept: Job Vacancy

Page 9

Key Concepts

[Diagram relating three concepts: Job Ad, Job Vacancy and "Ghost" Vacancy]

Page 10

Coverage Issues

Target population: all job vacancies

[Diagram: groups within the target population]
• Advertised on enterprise website
• Advertised on a job portal
• Advertised through an agency
• Employing business is identifiable
• 'Ghost' vacancies

Page 11

Outline Approach to Data Integration

Inputs:
• Business Register: Enterprises A to J
• Counts from online sources: Enterprises A, B, C, D, E
• Survey estimates: Enterprises A, B, C, F, G

Matching the three sources gives an integrated data set covering Enterprises A to J. Scaling factors are derived (by NACE?).

Source availability | Treatment
1. Survey and online | Scale online data to survey estimates
2. Online only | Apply scaling factors to online data
3. Survey only | Use survey estimates
4. Neither survey nor online | Modelled estimates

Total = Survey Estimate
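
The four cases above can be sketched in Python. This is only an illustration under stated assumptions, not the project's method: the dictionary structures, the `integrate` function and the `model` fallback are hypothetical, and a scaling factor is assumed to be the ratio of surveyed vacancies to online ads among enterprises that appear in both sources, grouped by NACE code.

```python
from collections import defaultdict


def integrate(register, online, survey, model=lambda enterprise: None):
    """Combine online counts and survey estimates over a business register.

    register: dict enterprise -> NACE code (hypothetical structure)
    online:   dict enterprise -> number of online job ads
    survey:   dict enterprise -> surveyed number of vacancies
    model:    fallback for enterprises without a usable source
    """
    # Case 1 (survey and online): derive one scaling factor per NACE code
    # as total surveyed vacancies divided by total online ads.
    totals = defaultdict(lambda: [0.0, 0.0])
    for ent, nace in register.items():
        if ent in online and ent in survey:
            totals[nace][0] += survey[ent]
            totals[nace][1] += online[ent]
    factors = {n: s / o for n, (s, o) in totals.items() if o > 0}

    estimates = {}
    for ent, nace in register.items():
        if ent in survey:
            # Cases 1 and 3: the survey value is kept as the benchmark.
            estimates[ent] = survey[ent]
        elif ent in online and nace in factors:
            # Case 2 (online only): scale the online count.
            estimates[ent] = online[ent] * factors[nace]
        else:
            # Case 4 (neither source, or no factor available): modelled.
            estimates[ent] = model(ent)
    return factors, estimates


# Made-up numbers that mirror the Enterprise A to J layout of the slide.
register = {e: "C" for e in "ABCDEFGHIJ"}
online = {"A": 2, "B": 4, "C": 4, "D": 5, "E": 1}
survey = {"A": 4, "B": 8, "C": 8, "F": 3, "G": 1}
factors, estimates = integrate(register, online, survey)
```

With these invented inputs, Enterprises D and E (online only) are scaled by the factor learned from A, B and C, while H, I and J are left to the model.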

Page 12

Conclusion

• Data from online job ads are very rich, but complex and unstructured
• Difficult to align to established statistical concepts
• Need to understand coverage issues and how to tackle them
• Surveys will still be needed, so the challenges are around integrating different sources

Page 22

Dissemination Workshop, Sofia, 22-23 February 2017

Hellenic Statistical Authority

Christina Pierrakou – Eleni Bisioti

Page 23

ELSTAT

• Web Scraped Data Structure
• Tools and Environment
• Web scraping experiment
• Matching Results

Page 27

ELSTAT

[Pie chart: shares by economic activity; chart title not recovered]

Activities of head offices; management consultancy activities | 20%
Employment activities | 14%
Manufacture of food products | 10%
Telecommunications | 10%
Education | 9%
Wholesale trade, except of motor vehicles and motorcycles | 6%
Accommodation | 5%
Human health activities | 4%
Advertising and market research | 3%
Office administrative, office support and other business support activities | 3%
Others | 16%

Page 28

ELSTAT

[Chart: shares by occupation major group; chart title not recovered]

1 Managers | 5%
2 Professionals | 16%
3 Technicians and Associate Professionals | 7%
4 Clerical Support Workers | 12%
5 Services and Sales Workers | 49%
6 Skilled Agricultural, Forestry and Fishery Workers | 0.1%
7 Craft and Related Trades Workers | 5%
8 Plant and Machine Operators and Assemblers | 1%
9 Elementary Occupations | 5%

Page 29

[email protected]

[email protected]

Page 30

WP1 - Web scraping job vacancies: SURS experiment

Boro Nikic

ESSnet Big Data Dissemination Workshop, Sofia, 23-24 February 2017

Page 31

Current Survey on JV (1)

• EU regulation: number of JV ads broken down by activity (B-S) and size (10+ employees)
• Population: legal units with at least 1 employee
– 61,544 legal units (without the public sector)
• Sample
– 8,942 legal units (probability sample)
– plus approx. 3,300 legal units from the public sector
– 12,200 units, about 20% of the population

Stratum | Size class | Number of units | Rate (%)
0 | 1-2 persons employed | 2,095 | 23.4
1 | 3-9 persons employed | 3,570 | 39.9
2 | 10-49 persons employed | 2,065 | 23.1
3 | 50-249 persons employed | 1,033 | 11.6
4 | 250 or more persons employed | 179 | 2.0
Total | | 8,942 | 100.0

Page 32

Slovenian Job Portals

There are around 30 job portals in Slovenia. The two most important ones cover more than 95% of JV ads.

Data have been collected weekly from those two portals since May 2016.

Page 33

Structure of the scraped data

Position | Enterprise | Location | Date
Pizzopek m/ž | Trummer osebni servis d.o.o. | Maribor | Objavljeno: 15.04.2016
Vodja kuhinje m/ž | Trummer osebni servis d.o.o. | Maribor | Objavljeno: 15.04.2016
Knjigovodja m/ž | SPORTINA Bled d.o.o. | Lesce | Objavljeno: 15.04.2016
Asistent vodji produktov m/ž v Mariboru | Trenkwalder kadrovske storitve d.o.o. | Maribor | Objavljeno: 15.04.2016

(Values are shown as scraped. "Objavljeno" means "Published"; "m/ž" means "m/f".)
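
A minimal sketch of how a scraped row with this structure could be turned into a typed record. It assumes the date field always has the form "Objavljeno: DD.MM.YYYY" as in the rows shown; the function names and the dictionary keys are hypothetical.

```python
from datetime import date, datetime


def parse_published(raw):
    """Turn a scraped value such as 'Objavljeno: 15.04.2016' into a date."""
    # Drop the label before the colon, then parse day.month.year.
    value = raw.split(":", 1)[-1].strip()
    return datetime.strptime(value, "%d.%m.%Y").date()


def parse_row(position, enterprise, location, published):
    """Build one cleaned record from the four scraped fields."""
    return {
        "position": position.strip(),
        "enterprise": enterprise.strip(),
        "location": location.strip(),
        "published": parse_published(published),
    }


row = parse_row(
    "Knjigovodja m/ž", "SPORTINA Bled d.o.o.", "Lesce", "Objavljeno: 15.04.2016"
)
```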

Page 34

Record linkage with BR

Linkage methods (codes):
1 Merging by unique (short, complete, abbreviated) name of enterprise
2 Merging by unique (short, complete) name of enterprise, with the location of the enterprise removed from the name
4-6 Merging by non-unique (short, complete, abbreviated) name of enterprise and location of the work/enterprise
7-8 Record linkage using a distance function (short, complete name of enterprise)
10 Manual (agencies, bigger enterprises)
11 Record linkage using a distance function (complete name of enterprise)
0 Unmerged

Code | 31 May: N | 31 May: % | 31 Aug: N | 31 Aug: %
1.1 | 1401 | 72.03 | 1271 | 75.61
1.2 | 365 | 18.77 | 250 | 14.87
1.3 | 3 | 0.15 | 2 | 0.12
2.1 | 22 | 1.13 | 24 | 1.43
2.2 | 9 | 0.46 | 13 | 0.77
4 | 11 | 0.57 | 15 | 0.89
7 | 16 | 0.82 | 27 | 1.61
8 | 11 | 0.57 | 9 | 0.54
10 | 82 | 4.22 | 49 | 2.91
11 | 17 | 0.87 | 2 | 0.12
0 | 8 | 0.41 | 19 | 1.13
TOTAL | 1945 | 100 | 1681 | 100
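
The cascade of exact name matching followed by a distance function can be illustrated as follows. This is a sketch under assumptions, not the SURS implementation: the legal-form list, the register structure, the enterprise ids and the 0.8 cutoff are hypothetical, and the standard library's `difflib` similarity stands in for whichever distance function was actually used.

```python
import difflib

# Assumed list of legal-form suffixes to ignore when comparing names.
LEGAL_FORMS = {"d.o.o.", "d.d.", "s.p."}


def normalise(name):
    """Lower-case a name and drop legal-form tokens."""
    tokens = name.lower().replace(",", " ").split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def link(scraped_name, register, cutoff=0.8):
    """Return (enterprise id or None, method).

    register: dict normalised name -> list of business register ids.
    """
    key = normalise(scraped_name)
    ids = register.get(key, [])
    if len(ids) == 1:
        return ids[0], "exact"          # unique name
    if len(ids) > 1:
        return None, "ambiguous"        # non-unique name: location needed
    close = difflib.get_close_matches(key, list(register), n=1, cutoff=cutoff)
    if close and len(register[close[0]]) == 1:
        return register[close[0]][0], "distance"
    return None, "unmerged"


# Hypothetical register entries keyed by normalised name.
register = {
    normalise("Trummer osebni servis d.o.o."): ["BR001"],
    normalise("SPORTINA Bled d.o.o."): ["BR002"],
}
```

For example, `link("TRUMMER OSEBNI SERVIS D.O.O.", register)` resolves by exact name, while a variant spelling such as "Sportina Bled doo" falls through to the similarity step.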

Page 35

Duplicates

Key for merging: name of enterprise, job title, location

Counter | 31 May: N | 31 May: % | 31 Aug: N | 31 Aug: %
0 | 817 | | 818 |
0.1 | 518 | 63.40 | 538 | 65.77
0.2 | 241 | 29.50 | 221 | 27.02
0.3 | 63 | 7.71 | 59 | 7.21
1 | 563 | | 356 |
2 | 285 | | 254 |
3 | 42 | | 47 |
4 | 5 | | 2 |

0 - number of distinct enterprises
0.1 - number of enterprises which advertise only on MojeDelo
0.2 - number of enterprises which advertise only on MojaZaposlitev
0.3 - number of enterprises which advertise on both job portals
1 - number of enterprises with unique ads on MojeDelo
2 - number of enterprises with unique ads on MojaZaposlitev
3 - number of enterprises with more than one ad on MojaZaposlitev and MojeDelo
4 - number of enterprises with more than one ad on MojaZaposlitev and MojeDelo, where the number of ads on the two portals doesn't match
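
A short sketch of de-duplication on the merging key named above (name of enterprise, job title, location). The record layout and function names are hypothetical, and keeping the first occurrence of each key is an assumption.

```python
from collections import defaultdict


def merge_key(ad):
    """Key for merging: name of enterprise, job title, location."""
    return tuple(ad[f].casefold().strip() for f in ("enterprise", "title", "location"))


def deduplicate(ads):
    """Keep the first ad seen for each merging key."""
    seen = {}
    for ad in ads:
        seen.setdefault(merge_key(ad), ad)
    return list(seen.values())


def portals_by_enterprise(ads):
    """Map each enterprise to the set of portals it advertises on."""
    portals = defaultdict(set)
    for ad in ads:
        portals[ad["enterprise"]].add(ad["portal"])
    return dict(portals)


ads = [
    {"enterprise": "Trummer osebni servis d.o.o.", "title": "Pizzopek m/ž",
     "location": "Maribor", "portal": "MojeDelo"},
    {"enterprise": "Trummer osebni servis d.o.o.", "title": "PIZZOPEK M/Ž",
     "location": "Maribor", "portal": "MojaZaposlitev"},
    {"enterprise": "SPORTINA Bled d.o.o.", "title": "Knjigovodja m/ž",
     "location": "Lesce", "portal": "MojeDelo"},
]
unique_ads = deduplicate(ads)
```

Counts such as "enterprises which advertise on both job portals" then follow from the sizes of the sets returned by `portals_by_enterprise`.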

Page 36

Weekly movement of number of JV ads

[Line chart; vertical axis from 0 to 3000. Series: Total; Total (cleaned); Moje delo; Moje delo (cleaned); Moja zaposlitev; Moja zaposlitev (cleaned)]

Page 37

IT tools involved in scraping job portals

[Diagram] Stages shown: scraping, output file, storage, statistical production

Tools shown: SAS Contextual Analytics, Data Scraping Studio

Page 38

Methodology of scraping of enterprise websites (1)

• Identify URL links of enterprises
• Identify sub-links which potentially contain JV ads
• Employing machine learning techniques, detect the JV ads from the list of contents of sub-links
• Detect variables (location, job title, skills...)

Not implemented yet
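
The sub-link identification step could look roughly like the sketch below. It is an illustration only: the keyword list is an assumption, simple keyword matching replaces any machine learning, and the class and function names are hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keywords hinting at a vacancies page (Slovenian and English).
KEYWORDS = ("zaposlitev", "kariera", "career", "jobs")


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_sublinks(base_url, html):
    """Return absolute URLs whose href or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    hits = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in KEYWORDS):
            hits.append(urljoin(base_url, href))
    return hits
```

The pages behind the returned URLs would then be passed to the later steps (detecting the ads themselves and their variables).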

Page 39

Coverage: Sample vs. scraped & admin data

 | Reported data | Scraped & admin data | Job portals | Enterprise websites
Number of JV ads | 4312 | 2321 | 1073 | 262
Percentage | 100% | 54% | 25% | 6%

Strata | Questionnaire | BD Sources | Percentage
1 employee | 67 | 16 | 24%
1-9 employees | 470 | 173 | 37%
10-49 employees | 923 | 362 | 39%
50-249 employees | 1681 | 744 | 44%
250 employees | 1119 | 782 | 70%
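
The percentage column is the number of JV ads found in the BD sources divided by the number reported in the questionnaire. A small check in Python, using the figures from the table (the variable and function names are illustrative):

```python
# (reported in questionnaire, found in BD sources) per size stratum,
# copied from the table above.
strata = {
    "1 employee": (67, 16),
    "1-9 employees": (470, 173),
    "10-49 employees": (923, 362),
    "50-249 employees": (1681, 744),
    "250 employees": (1119, 782),
}


def coverage_percent(reported, found):
    """Share of reported JV ads that were also found, in whole percent."""
    return round(100 * found / reported)


coverage = {name: coverage_percent(r, f) for name, (r, f) in strata.items()}
```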

Page 40

Planned activities (1)

Additional question in the questionnaire for the regular JV survey (2017)

Main goal: collect information about the mode of advertising of JV ads

Side goal: collect the URLs of enterprises

Modes of advertising:
• Job portals
• Enterprise websites
• Employment agencies
• Newspapers
• Social networks (LinkedIn, Facebook...)

Page 41

Planned activities (2)

December 2016: Meeting with the Employment Service of Slovenia (ESS)

Aim: deeper knowledge about cooperation between enterprises and ESS

March 2017: Meeting with the employment agencies

Aim: cooperation between SURS and the agencies

No. | Agency | Number of employees
1 | AC d.o.o. | 79
2 | ADECCO H.R. d.o.o. | 3255
3 | KARIERA D.O.O. | 1296
4 | KI INTERIM D.O.O. | 167
5 | KOROTAJ D.O.O. | 330
6 | MANPOWER D.O.O. | 467
7 | PAPIR SERVIS D.O.O. | 518
8 | TRENKWALDER D.O.O. | 1242

Page 42

Planned activities (3)

Processing of job portal data in 2017:

1. Weekly movement of the number of JV ads broken down by main economic activities (and by size groups)
2. Testing the models for grossing up job portal (and other) JV data to the target population level (auxiliary information from the Statistical Register of Employees)
3. Record linkage with the Standard Occupational Classification System
4. Integration of data from job portals, enterprise websites and administrative sources (internal pilot at SURS)

Page 43

Thank you for your attention!

[email protected]

Page 44

Internet job portals as a source for job vacancy statistics

facebook.com/statistiskacentralbyranscb · @SCB_nyheter · statistiska_centralbyran_scb · www.linkedin.com/company/scb

Page 45

Data sources

• Swedish Employment Agency
– 2 236 663 advertisements (January 2012 - June 2016)
• Statistics Sweden Job Vacancy Survey
– 410 393 business records (January 2012 - June 2016)
– legal units (public sector)
– local units (private sector)
• Statistics Sweden Business Register
• In progress: contacts with three private job portals

Page 46

Swedish Employment Agency

• Job portal Platsbanken (PB) covers about 40% of the market
• Information is entered manually at the Agency, on the web by employers, or submitted as files
• Several variables are required for advertising on the web (e.g. company name and ID, address, occupation title, description and requirements of the job, posting date, etc.)
• Rules to avoid invalid values, duplicate advertisements, old advertisements, etc.
• Number of days an advertisement is on the web: mean 25 days, median 21 days

Page 47

Quality of the PB data

Checking invalid values: few invalid records on important identifying variables, dates, and important variables like occupation and type of employment

Coverage:
• Recruiting/outsourcing companies: the top three companies are behind 3% of the advertisements
• Big cities appear frequently (Stockholm, Gothenburg, Malmö, Uppsala)
• High-skilled jobs are frequent (> 40%)

Idea: use the text of the advertisements in PB and the high quality of the structured variables to find a good method for text analysis, then use the method on other portals with lower quality

Page 48

Matching PB with Business Register

94-99% match on organization id, municipality, occupation code, NACE

So far: very difficult to match on company name

For the matched counts:

Number of work places | %
1-10 | 61
11-250 | 30
251-1000 | 6
>1000 | 2
0/Null | 2

Number of employees | %
0-9 | 23
10-49 | 17
50-99 | 6
100-200 | 6
>200 | 46
0/Null | 1

Page 49

Matching PB with Job Vacancy Survey

PB data are first aggregated and grouped according to the variables organization id, municipality code, year, and month.

• PB: 951 195 rows
• Job Vacancy: 410 393 rows
• 20% of data can be matched
• Public sector: 70% match
• Private sector: 16% match

Work in progress
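
The aggregation and matching step described above can be sketched as follows. The field names and sample keys are hypothetical, and the match rate is computed here as the share of survey rows that find a PB counterpart, which is an assumption since the slide does not state the denominator.

```python
from collections import Counter


def aggregate_ads(ads):
    """Count ads per (organization id, municipality code, year, month)."""
    return Counter(
        (ad["org_id"], ad["municipality"], ad["year"], ad["month"]) for ad in ads
    )


def match_rate(pb_counts, survey_keys):
    """Share of survey rows whose key also appears in the aggregated PB data."""
    survey_keys = set(survey_keys)
    if not survey_keys:
        return 0.0
    return len(survey_keys & set(pb_counts)) / len(survey_keys)


# Made-up advertisement records for illustration.
ads = [
    {"org_id": "556000-0001", "municipality": "0180", "year": 2016, "month": 5},
    {"org_id": "556000-0001", "municipality": "0180", "year": 2016, "month": 5},
    {"org_id": "556000-0002", "municipality": "1480", "year": 2016, "month": 5},
]
pb_counts = aggregate_ads(ads)
```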

Page 50

Employers by sector, %

Sector | PB | Survey | Business Register: businesses | Business Register: employees
Private | 70 | 92 | 90 | 68
Non-profit organizations | 1.5 | 1 | 10 | 0.02
Public | 17 | 7 | 0.05 | 30
Missing | 12 | - | - | -

Page 51

Three other job portals

Metrojobb
• Data sources: Employment Agency, manually, files, web scraping
• First data through API

CareerBuilder
• Data sources: manually, files, through customer systems
• Textkernel: “semantic search” (web scraping) Jobfeed (not in Sweden)

Jobbsafari
• Planned meeting in Copenhagen in March
• Web scraping

Issues: validation, linking, duplicates, coverage, etc.