Python, Web Scraping and Content Management: Scrapy and Django Sammy Fung http://sammy.hk OpenSource.HK Workshop 2014.07.05


Python, Web Scraping and Content Management:

Scrapy and Django

Sammy Fung

http://sammy.hk

OpenSource.HK Workshop 2014.07.05


Sammy Fung

● Perl → PHP → Python
● Linux → Open Source → Open Data
● Freelance → Startup
● http://sammy.hk
● [email protected]


Open Data


Can a computer program read this?


Is this UI easy to understand?


Five Star Open Data

1. Make your stuff available on the Web (whatever format) under an open license.
2. Make it available as structured data (e.g., Excel instead of image scan of a table).
3. Use non-proprietary formats (e.g., CSV instead of Excel).
4. Use URIs to denote things, so that people can point at your stuff.
5. Link your data to other data to provide context.

Source: 5stardata.info, describing the 5-star scheme proposed by Tim Berners-Lee, the inventor of the Web.
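To make the structured-data and non-proprietary-format steps concrete, here is a minimal sketch that serializes a small table to CSV with Python's standard library. The column names and sample rows are invented for illustration; they are not real observations.

```python
import csv
import io

# Hypothetical sample rows; the values are invented for illustration.
rows = [
    {"station": "HKO", "temperature": 29.5, "humidity": 78},
    {"station": "SHA", "temperature": 30.1, "humidity": 74},
]

def to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["station", "temperature", "humidity"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv(rows)
print(csv_text)
```

Any spreadsheet or script can read the result without proprietary software, which is the point of the third star.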


Open Data

● Data.One
– Led by the OGCIO of the Hong Kong Government.
– Uses the term “public sector information” (PSI) instead of “open data”.
– Much of the data is not available in a machine-readable format with a useful data structure.
– A lot of data still requires web scraping with customized data extraction to collect useful machine-readable data.


Web Scraping with Scrapy


Web Scraping

A computer software technique for extracting information from websites. (Wikipedia)


Scrapy

● Python.
● Open source web scraping framework.
● Scrapes websites and extracts structured data.
● Used for everything from data mining to monitoring and automated testing.


Scrapy

● Define your own data structures.
● Write spiders to extract data.
● Built-in XPath selectors for extracting data.
● Built-in JSON, CSV, XML output.
● Interactive shell console, telnet console, logging...


scrapyd

● Scrapy web service daemon.
● pip install scrapyd
● Web API with simple Web UI:
– http://localhost:6800
● Web API documentation:
– http://scrapyd.readthedocs.org/en/latest/api.html


scrapyd

● Examples:
– curl http://localhost:6800/listprojects.json
– curl http://localhost:6800/listspiders.json?project=default
● e.g. {"status": "ok", "spiders": ["pollutant24", "aqhi24"]}
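The same call can be made from Python. The sketch below uses only the standard library and assumes scrapyd is listening on the default address shown above. The helper names `parse_spiders` and `list_spiders` are made up for this example. Only the parsing step is exercised here, so no running daemon is needed.

```python
import json
from urllib.request import urlopen

SCRAPYD_URL = "http://localhost:6800"  # default address from the slide above

def parse_spiders(response_text):
    """Return the spider names from a listspiders.json response body."""
    payload = json.loads(response_text)
    if payload.get("status") != "ok":
        raise RuntimeError("scrapyd reported an error: %r" % payload)
    return payload["spiders"]

def list_spiders(project="default"):
    """Fetch and parse the spider list; needs a running scrapyd instance."""
    url = "%s/listspiders.json?project=%s" % (SCRAPYD_URL, project)
    with urlopen(url) as response:
        return parse_spiders(response.read().decode("utf-8"))

# Parse the sample response shown above, without touching the network.
sample = '{"status": "ok", "spiders": ["pollutant24", "aqhi24"]}'
print(parse_spiders(sample))
```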


Scrapy Installation

$ apt-get install python python-virtualenv python-pip

$ virtualenv env

$ source env/bin/activate

$ pip install scrapy


Creating Scrapy Project

$ scrapy startproject <new project name>

newproject
|-- newproject
|   |-- __init__.py
|   |-- items.py
|   |-- pipelines.py
|   |-- settings.py
|   |-- spiders
|       |-- __init__.py
|-- scrapy.cfg


Creating Scrapy Project

● Define your data structure
● Write your first spider
– Test with scrapy shell console
● Output / Store collected data
– Output with built-in supported formats
– Store to database / object store.


Define your data structure

items.py

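from scrapy.item import Item, Field

# One Field per value the spider extracts from a weather report.
class Hk0WeatherItem(Item):
    reporttime = Field()
    station = Field()
    temperture = Field()
    humidity = Field()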


Write your first spider

● Import the class of your own data structure.
– $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN>
– $ scrapy list
● Import any Scrapy classes you require.
– e.g. Spider, XPath Selector
● Extend the parse() function of a Spider class.
● Test with the scrapy shell console
– $ scrapy shell <URL>
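The extraction step inside parse() can be sketched as follows. To keep the example runnable without Scrapy installed, it swaps in the limited XPath support of the standard library's `xml.etree.ElementTree`. In a real spider the XPath expression would be evaluated through Scrapy's selectors on the response instead. The page fragment, the table id and the values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; real pages are usually messier.
SAMPLE_PAGE = """
<html><body><table id="weather">
  <tr><td>HKO</td><td>29.5</td><td>78</td></tr>
  <tr><td>SHA</td><td>30.1</td><td>74</td></tr>
</table></body></html>
"""

def parse(page_text):
    """Yield one dict per table row, similar to what a spider's parse() yields."""
    root = ET.fromstring(page_text.strip())
    for row in root.findall(".//table[@id='weather']/tr"):
        cells = [cell.text for cell in row.findall("td")]
        yield {
            "station": cells[0],
            "temperture": float(cells[1]),  # field name spelled as in items.py
            "humidity": int(cells[2]),
        }

items = list(parse(SAMPLE_PAGE))
print(items)
```

ElementTree only accepts well-formed XML, which is one reason a scraping framework with a tolerant HTML parser is preferable for real websites.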


Output / Store collected data

● Use built-in JSON, CSV, XML output at the command line.
– $ scrapy crawl <Spider Name> -t json -o <Output File>
● pipelines.py
– Import the class of your own data structure.
– Extend the process_item() function.
– Add it to ITEM_PIPELINES in settings.
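A pipeline is a plain Python class with a process_item() method, so a minimal one can be sketched and exercised without running a crawl. The class name `SqlitePipeline`, the table layout and the sample item below are assumptions for illustration. SQLite is used only because it ships with Python.

```python
import sqlite3

class SqlitePipeline:
    """Item pipeline that stores each scraped item in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS weather "
            "(reporttime TEXT, station TEXT, temperture REAL, humidity INTEGER)"
        )

    def process_item(self, item, spider):
        # Scrapy calls this once per item; returning the item passes it on
        # to the next pipeline listed in ITEM_PIPELINES.
        self.connection.execute(
            "INSERT INTO weather VALUES (?, ?, ?, ?)",
            (item["reporttime"], item["station"], item["temperture"], item["humidity"]),
        )
        self.connection.commit()
        return item

# settings.py would then enable it; "newproject" is the example project
# name from the directory listing, and 300 is an arbitrary order value.
ITEM_PIPELINES = {"newproject.pipelines.SqlitePipeline": 300}

# Exercise the pipeline by hand with a made-up item (no crawl involved).
pipeline = SqlitePipeline()
pipeline.process_item(
    {"reporttime": "2014-07-05 10:00", "station": "HKO", "temperture": 29.5, "humidity": 78},
    spider=None,
)
stored = pipeline.connection.execute("SELECT station, humidity FROM weather").fetchall()
print(stored)
```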


Django web framework


Creating django project

$ pip install django

$ django-admin.py startproject <Project name>

myproject
|-- manage.py
|-- myproject
    |-- __init__.py
    |-- settings.py
    |-- urls.py
    |-- wsgi.py


Creating django project

● Define django settings.
– Create database, tables and first django user.
● Create your own django app.
– or add existing django apps.
– Create database tables.
● Activate django admin UI.
– Add URL router to access admin UI.


Creating django project

● settings.py
– Define your database connection.
– Add your own app to INSTALLED_APPS.
– Define your own settings.
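As a rough illustration, the three points above might translate into a settings.py fragment like the one below. The app name `myapp`, the SQLite database and the custom setting `WEATHER_SOURCE_NAME` are assumptions made for this sketch. A generated settings.py contains more entries than shown.

```python
# Fragment of settings.py; "myapp" is a hypothetical app name and SQLite
# is chosen only for simplicity.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "db.sqlite3",
    }
}

INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "myapp",  # your own app
]

# A custom, project-specific setting (name invented for illustration).
WEATHER_SOURCE_NAME = "HKO"
```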


Create django app

$ cd <Project Name>

$ python manage.py startapp <App Name>

myproject
|-- manage.py
|-- myproject
|   |-- __init__.py
|   |-- settings.py
|   |-- urls.py
|   |-- wsgi.py
|-- myapp
    |-- admin.py
    |-- __init__.py
    |-- models.py
    |-- tests.py
    |-- views.py


Create django app

● Define your own data model.
● Define and activate your admin UI.
● Furthermore:
– Define your data views.
– Add URL routers to connect with data views.


Define django data model

● Define at models.py.
● Import django data model base class.
● Define your own data model class.
● Create database table(s).
– $ python manage.py syncdb


Define django data model

from django.db import models

# One row per weather report from a station; readings may be missing.
class WeatherData(models.Model):
    reporttime = models.DateTimeField()
    station = models.CharField(max_length=3)
    temperture = models.FloatField(null=True, blank=True)
    humidity = models.IntegerField(null=True, blank=True)


Define django data model

● admin.py
– Import admin class
– Import your own data model class.
– Extend admin class for your data model.
– Register admin class with admin.site.register() function.


Define django data model

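from django.contrib import admin
from .models import WeatherData

class WeatherDataAdmin(admin.ModelAdmin):
    # 'windspeed' requires a matching field on the model; the WeatherData
    # class on the earlier slide does not define one.
    list_display = ('reporttime', 'station', 'temperture', 'humidity', 'windspeed')
    list_filter = ['station']

admin.site.register(WeatherData, WeatherDataAdmin)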


Enable django admin ui

● Add to INSTALLED_APPS in settings.py
– django.contrib.admin
● Add URL router at urls.py
– $ python manage.py runserver
● Access admin UI
– http://127.0.0.1:8000/admin


Scrapy + Django


Scrapy + Django

● Define django environment at scrapy settings.
– Load django configuration.
● Use Scrapy DjangoItem class
– Instead of Item and Field class
– Define which django data model it should be linked with.
● Query and insert data at scrapy pipelines.
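The first step, pointing Scrapy at the Django configuration, can be sketched with the standard library alone. The helper `configure_django`, the relative path and the settings module name `myproject.settings` are assumptions for illustration. A real project would put the equivalent lines near the top of the Scrapy settings.py.

```python
import os
import sys

def configure_django(project_dir, settings_module="myproject.settings"):
    """Make a Django project importable and select its settings module."""
    project_dir = os.path.abspath(project_dir)
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    os.environ["DJANGO_SETTINGS_MODULE"] = settings_module
    return project_dir

# Hypothetical location of the Django project relative to the current directory.
configured_dir = configure_django(os.path.join("..", "myproject"))
print(os.environ["DJANGO_SETTINGS_MODULE"])
```

On Django 1.7 and later, django.setup() also has to be called after this and before any model is imported.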


hk0weather


hk0weather

● Weather Data Project.
– https://github.com/sammyfung/hk0weather
– Converts weather information from HKO webpages to JSON data.
– python + scrapy + django


hk0weather

● Hong Kong Weather Data.
– 20+ HKO weather stations in Hong Kong.

– Regional weather data.

– Rainfall data.

– Weather forecast report.


hk0weather

● Set up and activate a Python virtual environment, and install scrapy and django with pip.
● Clone hk0weather from GitHub
– $ git clone https://github.com/sammyfung/hk0weather.git
● Set up the database connection in Django and create the database, tables and first django user.
● Scrape regional weather data
– $ scrapy crawl regionalwx -t json -o regional.json
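Scrapy's JSON export writes a single JSON array of item objects, so the resulting file can be read back with the standard json module. The sketch below uses a made-up stand-in for the contents of regional.json. The field names follow the item definition from the earlier slide and may differ from what the regionalwx spider actually emits.

```python
import json

# Made-up stand-in for the contents of regional.json; real field names
# and values depend on the spider.
sample_output = """
[
  {"station": "HKO", "temperture": 29.5, "humidity": 78},
  {"station": "SHA", "temperture": 30.1, "humidity": 74}
]
"""

def warmest_station(json_text):
    """Return the record with the highest temperature reading."""
    records = json.loads(json_text)
    return max(records, key=lambda record: record["temperture"])

warmest = warmest_station(sample_output)
print(warmest["station"], warmest["temperture"])
```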


DEMO


Thank you!
http://sammy.hk