Using web scraping for Applied Economics

Morgan Raux

Empirical and Econometric Methods Sessions, December 4, 2018

Outline Introduction Issues Description

1 Why use web-scraped data?

2 Web scraping: contributions and issues

3 The technology: how does it work? What are the main challenges?

Context

Web scraping: a programming method to collect data online. It automates the copy/paste of websites’ source code.

Advantages:

It gives access to new sources of data,

At a very large scale,

At low (monetary) cost (compared to other data collection techniques)

Drawbacks:

Information is limited, and taking advantage of these data is difficult.

Collecting data is time-consuming.

Data sources on the web

| Sources of data | Application examples | References |
| --- | --- | --- |
| Social media | Studying migrations through Twitter and Facebook data. | Zagheni et al. (2015, 2016) |
| Job boards | Studying job search using data from Indeed and CareerBuilder. | Marinescu et al. (2017, 2018) |
| Sharing platforms | Assessing discrimination with data from Airbnb. | Laouenan & Rathelot (2017) |
| Reviewing platforms | Measuring the business cycle via restaurant openings on Yelp. | Glaeser et al. (2017) |

Assessing the data contribution

Most important question:

Do scraped data contribute something to the research, compared with any other data source I could use?

Compared to:

Usual data sources (surveys, administrative data, etc.).

Other internet data (especially research projects that have direct access to the website’s database).

Assessing the data contribution

Are the scraped data self-sufficient for the analysis? If not, the next question depends on the case:

Case 1: I focus on the website I am scraping.

Can I obtain direct access to these data (and more) by contacting the website?

→ Would the website be interested in my research?

Case 2: I am scraping this website to get data on another phenomenon.

Can I match them to other data sources?

→ What is the right unit of analysis?

What is web scraping?

Web scraping: automating the copy/paste of websites’ source code.

1) Collect the data,

1. Connect to a webpage gathering the information of interest,

2. Copy / paste the source code in a .txt file,

3. Loop over all webpages.

2) Parse the data,

1. Open the first .txt file,

2. Identify the information of interest,

3. Transfer this information into a dataframe,

4. Loop over all .txt files.
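The two stages above can be sketched in Python as follows. This is a minimal illustration and not the author's code: the function names, the file naming scheme, and the `<h2 class="title">` pattern are assumptions made for the example, and the parsed rows are kept as a plain list of dictionaries that could later be handed to a dataframe library.

```python
import re
import urllib.request
from pathlib import Path


def fetch_source(url):
    """Download the source code of one webpage as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def collect(urls, out_dir, fetch=fetch_source):
    """Stage 1: save the source code of every page to its own .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        (out_dir / f"page_{i:05d}.txt").write_text(fetch(url), encoding="utf-8")


# Hypothetical pattern: assumes each item of interest sits in <h2 class="title">...</h2>.
TITLE_PATTERN = re.compile(r'<h2 class="title">(.*?)</h2>', re.DOTALL)


def parse(in_dir):
    """Stage 2: loop over the .txt files and build one row per item found."""
    rows = []
    for path in sorted(Path(in_dir).glob("*.txt")):
        source = path.read_text(encoding="utf-8")
        for title in TITLE_PATTERN.findall(source):
            rows.append({"file": path.name, "title": title.strip()})
    return rows
```

Keeping the two stages separate means the parsing rules can be revised and rerun on the saved files without requesting the pages again.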

The database I want to obtain

Indeed Job listing database

Collect the data: (1) Connect to the webpage

Collect the data: (2) Copy/Paste the source code

Visualizing the source code

Collect the data: (3) Loop over all webpages

URLs are constructed in a uniform way:

https://www.indeed.com/jobs?q=Economist&l=Boston%2C+MA&start=0

https://www.indeed.com/ : domain name
q=Economist : job
l=Boston%2C+MA : location
start=0 : page number

To collect Indeed’s whole job listing for the US:

1. Loop over jobs

2. Loop over locations

3. Loop over pages

Remarks:

Websites prevent you from accessing all of the information

The way you organize your loops helps to address this challenge
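The three nested loops can be sketched as below. The parameter names `q`, `l` and `start` come from the URL on the slide; the function names and the step of 10 results per page are assumptions for illustration. `urlencode` takes care of the percent-encoding, so the comma in "Boston, MA" becomes `%2C` and the space becomes `+`.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_url(job, location, start):
    """Assemble one search URL from its three varying parts."""
    return BASE_URL + "?" + urlencode({"q": job, "l": location, "start": start})


def all_urls(jobs, locations, n_pages, page_size=10):
    """Yield every URL: loop over jobs, then locations, then pages."""
    for job in jobs:
        for location in locations:
            for page in range(n_pages):
                # page_size is an assumed increment of the `start` parameter
                yield build_url(job, location, page * page_size)
```

One possible reading of the remark on loop organization is that many narrow queries (one job in one location) each stay within whatever the site is willing to display, whereas a single broad query would not.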

Parse the data: Identify the information of interest

Correspondence between the webpage and its source code

Two necessary conditions

1) Information must be accessible in the source code:

Collecting and parsing HTML code is easy.

Collecting and parsing JavaScript code is feasible but much more complicated.

2) Information must be organized in a uniform way:

Depending on the coding language of the website, tags can identify the information of interest.

If there are no precise tags, the information has to follow some pattern.
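The second condition can be illustrated with two extraction strategies: one relying on tags, one relying on a textual pattern. This is a sketch using only the Python standard library; the tag names, class names and the sample strings in the usage note are invented for the example and do not reproduce any real website.

```python
import re
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collect the text inside every <tag class="css_class"> element.

    Simplification: nested elements with the same tag name are not handled.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.css_class:
            self.inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.values[-1] += data


def extract_by_tag(source, tag, css_class):
    """Case with precise tags: the tag and class identify the information."""
    parser = TagExtractor(tag, css_class)
    parser.feed(source)
    parser.close()
    return [value.strip() for value in parser.values]


def extract_by_pattern(source, pattern):
    """Case without precise tags: rely on a regular pattern in the text."""
    return [match.strip() for match in re.findall(pattern, source)]
```

For instance, `extract_by_tag('<span class="company">Acme Corp</span>', "span", "company")` picks out the company name directly, while a page that only contains `Deadline: 2018-12-04` inside an unlabelled cell needs something like `extract_by_pattern(source, r"Deadline:\s*(\d{4}-\d{2}-\d{2})")`.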

Information must be organized in a uniform way:

Figure: clean source code (Indeed) versus dirty source code (EJM)

Technical issues with web-scraping

1. Matching data with other sources (no common identifier)

2. Avoiding being blocked or banned by websites.

3. Storing the (big) data.

4. Legal issues.

Technical issues with web-scraping (1)

1) Matching data with other sources

Information obtained from a single website is usually very partial.

There is never a common identifier across websites.

This requires thinking about the optimal unit of analysis for merging different data sources.
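As an illustration of choosing a unit of analysis, the sketch below aggregates scraped postings to the city level and joins them with a second source on a harmonized city name. The choice of the city as the unit, the field names, and the `population` variable are all hypothetical; the point is only that the merge key has to be constructed, since no shared identifier exists.

```python
from collections import Counter


def normalize_city(name):
    """Harmonize spelling so the same city matches across sources."""
    return " ".join(name.lower().replace(",", " ").split())


def count_by_city(postings):
    """Aggregate scraped postings to the city level (the chosen unit)."""
    return Counter(normalize_city(p["city"]) for p in postings)


def merge_on_city(posting_counts, external):
    """Join scraped counts with an external source keyed by city name."""
    external_by_city = {normalize_city(r["city"]): r for r in external}
    merged = []
    for city, n_postings in sorted(posting_counts.items()):
        match = external_by_city.get(city)
        if match is not None:  # cities missing from the other source are dropped
            merged.append({"city": city, "postings": n_postings,
                           "population": match["population"]})
    return merged
```

Unmatched cities are silently dropped here; in practice it is worth counting them to see how much of the scraped sample survives the merge.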

Technical issues with web-scraping (2)

2) Escaping the blacklist

Web scraping implies a large number of requests to the website.

Most websites defend themselves against such behavior (by temporarily blocking the IP address, or banning it permanently...).

To limit these risks, your code must mimic human behavior:

Pause the scraping process for a few minutes.

Loop over different IP addresses (proxies, TOR, remote servers...).
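The two ideas above, pausing and rotating IP addresses, might look like this in Python. It is a sketch under assumptions: the function names, the default pause bounds (a few seconds here, to be lengthened as needed) and the proxy list are illustrative, and the proxies themselves would have to be supplied by the user.

```python
import itertools
import random
import time
import urllib.request


def fetch_via_proxy(url, proxy):
    """Fetch one page through the given proxy (None means a direct connection)."""
    handlers = [urllib.request.ProxyHandler({"http": proxy, "https": proxy})] if proxy else []
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_politely(urls, proxies=(None,), min_pause=2.0, max_pause=10.0,
                    fetch=fetch_via_proxy, sleep=time.sleep):
    """Fetch pages one by one, pausing a random time and rotating proxies."""
    proxy_cycle = itertools.cycle(proxies)
    pages = []
    for url in urls:
        pages.append(fetch(url, next(proxy_cycle)))
        # An irregular delay looks less mechanical than a fixed one
        # and also limits the load placed on the website.
        sleep(random.uniform(min_pause, max_pause))
    return pages
```

The `fetch` and `sleep` arguments are injectable so that the loop can be tried out without sending any request.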

Technical issues with web-scraping (3)

3) Storing the (big) data

Because web scraping allows you to collect data at a very large scale, you often end up with a large amount of data.

Storing these data requires a lot of storage space. Don’t forget to save a backup copy!

One technical solution is to rent servers online.
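On the storage side, one simple option (an assumption of this sketch, not something the slides prescribe) is to compress each saved page, since source code is repetitive text that compresses well, and to copy the whole data folder to a second location. The function names are illustrative, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import gzip
import shutil
from pathlib import Path


def save_compressed(source, path):
    """Write one page's source code to a gzip-compressed text file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(source)


def load_compressed(path):
    """Read a page back from its gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()


def backup(data_dir, backup_dir):
    """Copy the whole data folder to a second location (ideally another disk)."""
    shutil.copytree(data_dir, backup_dir, dirs_exist_ok=True)
```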

Technical issues with web-scraping (4)

4) Legal issues

Most of the time, websites’ Terms of Use explicitly forbid you from web-scraping their data.

In France, the Loi Lemaire (2016) allows researchers to exploit proprietary data. Elsewhere, the legal framework is not clear.

There are two basic requirements to limit legal issues:

Do not harm the website’s functioning.

Do not use these data for a commercial activity.

Do not scrape LinkedIn!

Learning resources

Programming languages:

Python

R

Resources:

MOOC: Using Python to Access Web Data (Coursera)

Book: Web Scraping with Python, by Ryan Mitchell
