18
Intro to Data Scraping PRESENTED BY DAVID SELASSIE OPOKU @sdopoku 13 July 2015

Skillshare - Introduction to Data Scraping

Embed Size (px)

Citation preview

Page 1: Skillshare - Introduction to Data Scraping

Intro to Data ScrapingPRESENTED BY

DAVID SELASSIE OPOKU

@sdopoku

13 July 2015

Page 2: Skillshare - Introduction to Data Scraping

Outline1. Target audience

2. What is and Why Data Scraping?

3. Use cases

4. Basic steps & Best practices

5. Tools

6. Reference Resources

Page 3: Skillshare - Introduction to Data Scraping

Target Audience

Page 4: Skillshare - Introduction to Data Scraping

This should be useful to ...

● Non-tech-savvy data journalists

● Advanced data journalists

● Web developers & data publishers

● School of Data fellows

● Open Data enthusiasts

Page 5: Skillshare - Introduction to Data Scraping

What is & Why Data Scraping ?

Page 6: Skillshare - Introduction to Data Scraping

Data Scraping: what is it ?scrape [ verb \ˈskrāp\ ]

: to remove from a surface by usually repeated strokes of an edged instrument

: to collect by or as if by scraping —often used with up or together <scrape up the

price of a ticket>

- Merriam Webster

“The transformation of unstructured data on the web, typically in HTML format, into

structured data that can be stored and analyzed in a central local database or

spreadsheet.”

- Wikipedia (web scraping)

Page 7: Skillshare - Introduction to Data Scraping

When should you scrape data ?

● PDF Data

● HTML data

Machine-readable data

Page 8: Skillshare - Introduction to Data Scraping

Example Use Cases

Page 9: Skillshare - Introduction to Data Scraping

Cases when you can scrape● Create a dataset for a data workshop

● Create a database for a data -driven app

● Create a data visualisation for a story

Page 10: Skillshare - Introduction to Data Scraping

Best Practices

Page 11: Skillshare - Introduction to Data Scraping

Best Practices For Scrapers

1. Scraping is not scary!

a. Use existing tools

2. Use a modern and friendly browser

a. Chrome, Firefox, Opera, Safari

b. Avoid Internet Explorer

3. Map out the process

a. Where does scraping fit in?

Page 12: Skillshare - Introduction to Data Scraping

Best Practices For Data Publishers

1. Have a consistent structure

a. Websites

b. PDFs

2. Always think about your data end users

a. Before, during & after publishing

Page 13: Skillshare - Introduction to Data Scraping

Steps

1. Map out the process/pipeline for your data project

2. Identify your data source (website, PDF, API?)

3. Decide on storage format for your scraped dataa. CSV file, Spreadsheet, Google docsb. Database

4. Select scraping tool

5. Verify and Clean data

Page 14: Skillshare - Introduction to Data Scraping

Tools

Page 15: Skillshare - Introduction to Data Scraping

Tools: Web Browsers

Page 16: Skillshare - Introduction to Data Scraping

Tools: Scraping Apps1. Point and click

a. Scraper Google Chrome extension

b. ScraperWiki (Classic version)

c. Import.io, Kimono Labs, Webscraper.io

d. Tabula (PDF)

2. Programming (Python libraries)

a. Beautiful Soup

b. Pattern (PDF and HTML)

c. Scrapy

Page 17: Skillshare - Introduction to Data Scraping

Tools: Storage & Sharing1. Google Spreadsheets

2. Github

3. Datahub.io