CIS 192: Lecture 8 HTML Parsingcis192/fall2014/files/lec8.pdf · CIS 192: Lecture 8 HTML Parsing Author: Lili Dworkin Created Date: 10/27/2014 1:04:50 PM

CIS 192: Lecture 8 HTML Parsing

Lili Dworkin

University of Pennsylvania

HTTP Requests

Use the requests library to make HTTP requests:

>>> import requests

>>> url = "http://www.cis.upenn.edu/~cis192/spring2014/"

>>> req = requests.get(url)

>>> req

<Response [200]>

HTTP Requests

The response object has many useful attributes and methods:

- url

- text

- headers

- cookies

- status_code

- json()

HTTP Requests

Get the HTML source:

>>> source = req.text

>>> source[:99]

u'\n\n \n <meta charset=''utf-8''>\n CIS192'

>>> print source[:99]

CIS192

Unicode

The object returned has type unicode, not str:

>>> type(req.text)

<type 'unicode'>

>>> unicode('hello')

u'hello'

>>> str(unicode('hello'))

'hello'

Status Codes

>>> url = "http://www.cis.upenn.edu/~cis192/spring2014/"

>>> req = requests.get(url)

>>> req.status_code

200

Status Codes

- 2xx – success

- 3xx – redirection

- 4xx – client error

- 5xx – server error
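The classes above depend only on the leading digit of the code. A minimal sketch of that mapping follows; the helper name status_class is hypothetical and not part of requests:

```python
def status_class(code):
    """Map an HTTP status code to the broad class listed above."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit (404 -> 4).
    # Codes outside 2xx-5xx are not covered by the list, so label them "other".
    return classes.get(code // 100, "other")


for code in (200, 301, 404, 503):
    print("%d %s" % (code, status_class(code)))
```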

Status Codes

A “bad” status code won’t raise an error in your code, so you run the risk of thinking things worked when they actually didn’t:

>>> url = 'http://httpbin.org/hidden-basic-auth'

>>> req = requests.get(url)

>>> # try to do stuff with req.text

>>> # doesn't work! because:

>>> req.status_code

404

Status Codes

Either check for the error directly, or use raise_for_status():


>>> if req.status_code != 200:

...     raise Exception()

OR


>>> req.raise_for_status()

Traceback (most recent call last):

...

requests.exceptions.HTTPError: 404 Client Error

Requests with Parameters

- requests.get() issues a GET request

- GET requests can take parameters

- The query string is sent in the URL, e.g.:

>>> req = requests.get('.../test?name1=value1&name2=value2')

- But you shouldn’t do the formatting yourself:

>>> params = {'name1': 'value1', 'name2': 'value2'}

>>> req = requests.get('.../test', params=params)
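One reason not to build the query string by hand is that values containing spaces or "&" must be escaped. The sketch below uses the standard library's urlencode (Python 3 import path) to show the kind of escaping involved; it is not a claim about exactly how requests builds its URLs, and the httpbin.org/get endpoint and the parameter values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

base = "http://httpbin.org/get"  # assumed endpoint, not from the slides
params = {"name1": "value1", "name2": "a b&c"}

# urlencode escapes reserved characters and joins the pairs with '&',
# so the '&' inside the second value cannot be mistaken for a separator.
query = urlencode(params)
url = base + "?" + query
print(url)
```

Naive concatenation would have produced name2=a b&c, which a server would read as two separate parameters.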

httpbin

Let’s practice a bit on httpbin.org, an HTTP request and response test service.

HTML Structure

The Dormouse Story

The Dormouse Story

Elsie

Lacie

BeautifulSoup

Use BeautifulSoup to parse HTML:

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(req.text) # html source

>>> soup = BeautifulSoup(html_doc) # html string

>>> type(soup)

<class 'bs4.BeautifulSoup'>
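For experimenting offline, here is a self-contained sketch that parses a small document shaped like the Dormouse example. The markup is a hypothetical reconstruction: the tag structure follows the outputs shown in this lecture, but the href URLs, the "story" class, and the "link1" id are assumptions. The parser is named explicitly ("html.parser", from the standard library) so the result does not depend on which parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the example document; the href values,
# the "story" class, and the "link1" id are assumptions.
html_doc = (
    '<html><head><title>The Dormouse Story</title></head>'
    '<body><p class="title"><b>The Dormouse Story</b></p>'
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p></body></html>'
)

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string                        # text inside <title>
first_p_classes = soup.p["class"]                     # class is list-valued
link_ids = [a["id"] for a in soup.find_all("a")]      # ids of all <a> tags

print(title_text)
print(first_p_classes)
print(link_ids)
```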

Tag Objects

A Tag object corresponds to an HTML tag:

>>> soup.p

<p class="title"><b>The Dormouse Story</b></p>

>>> type(soup.p)

<class 'bs4.element.Tag'>

Tags have names and attributes:

>>> tag = soup.p

>>> tag.name

u'p'

>>> tag.attrs

{u'class': [u'title']}

>>> tag['class']

[u'title']

Pretty Printing

Use prettify() to pretty print a tag (you can also call this on the entire soup object, but the output can get long):

>>> print soup.p.prettify()

<p class="title">
 <b>
  The Dormouse Story
 </b>
</p>

Navigating

Say the name of the tag you want:

>>> soup.head

<head><title>The Dormouse Story</title></head>

Zooming in:

>>> soup.head.title

<title>The Dormouse Story</title>

Using a tag name as an attribute gets the first tag by that name:

>>> soup.a

Elsie

Navigating: Going Down

Getting a tag’s children:

>>> soup.p

<p class="title"><b>The Dormouse Story</b></p>

>>> soup.p.contents

[<b>The Dormouse Story</b>]

>>> soup.p.children

<listiterator object at 0x...>

>>> [i for i in soup.p.children]

[<b>The Dormouse Story</b>]


Note that the child might also have a child ...

>>> soup.p

<p class="title"><b>The Dormouse Story</b></p>

>>> soup.p.contents

[<b>The Dormouse Story</b>]
>>> child = soup.p.contents[0]

>>> child.contents

[u'The Dormouse Story']


To search recursively and get children of children of children, etc., use descendants:

>>> for i in soup.p.descendants:

...     print i

<b>The Dormouse Story</b>

The Dormouse Story

Navigating: Going Up

Getting a tag’s parent:

>>> soup.title

<title>The Dormouse Story</title>

>>> soup.title.parent

<head><title>The Dormouse Story</title></head>

Navigating: Going Sideways

Getting a tag’s siblings:

>>> soup.a

Elsie

>>> soup.a.next_sibling

Lacie
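A small sketch of moving up and sideways, using a cut-down fragment whose ids are assumptions. It also shows a common surprise: when the source has whitespace between tags, next_sibling returns that whitespace string rather than the next tag, and find_next_sibling can be used to skip over it.

```python
from bs4 import BeautifulSoup

# Fragment with no whitespace between the two links (ids are assumptions).
soup = BeautifulSoup(
    '<p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
    "html.parser",
)

first = soup.a
parent_name = first.parent.name                 # going up: the enclosing <p>
sibling_text = first.next_sibling.get_text()    # going sideways: second <a>
after_last = first.next_sibling.next_sibling    # nothing follows the last <a>

# Same fragment, but with a newline between the links.
spaced = BeautifulSoup('<p><a>Elsie</a>\n<a>Lacie</a></p>', "html.parser")
whitespace = spaced.a.next_sibling              # the newline, not a tag
next_link = spaced.a.find_next_sibling("a")     # skips the text node

print(parent_name)
print(sibling_text)
print(repr(whitespace))
print(next_link.get_text())
```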

NavigableString Objects

A NavigableString object corresponds to a bit of text within a tag:

>>> soup.p

<p class="title"><b>The Dormouse Story</b></p>

>>> soup.p.string

u'The Dormouse Story'

>>> type(soup.p.string)

<class 'bs4.element.NavigableString'>

It does a lot more than a regular string:

>>> soup.p.string.parent

<b>The Dormouse Story</b>

So if all you need is the text, you should convert:

>>> str(soup.p.string)

'The Dormouse Story'


Sometimes it’s helpful to check whether what you’re dealing with is a tag or a NavigableString:

>>> from bs4 import BeautifulSoup, NavigableString

>>> [i for i in soup.descendants

...  if isinstance(i, NavigableString)]

[u'The Dormouse Story', u'The Dormouse Story', u'Elsie', u'Lacie']


Another option: use .text:

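>>> soup.p

<p class="title"><b>The Dormouse Story</b></p>

>>> soup.p.text

u'The Dormouse Story'

>>> type(soup.p.text)

<type 'unicode'>

But .text behaves differently from .string: on an object with several children, .string is None, while .text concatenates all the text beneath it: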

>>> soup.string

>>>

>>> soup.text

u'The Dormouse StoryThe Dormouse StoryElsieLacie'
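The difference is easiest to see on a tag with more than one child. A minimal sketch on a made-up fragment (the markup is an assumption, not from the slides): .string is only defined when there is a single text path down from the tag, whereas .text (equivalently get_text()) always joins every piece of text it finds.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

single = soup.b.string    # exactly one text child, so that string is returned
multi = soup.p.string     # two children, so .string gives None
joined = soup.p.text      # all text beneath <p>, concatenated

# get_text() takes a separator and a strip flag, which helps when the
# pieces would otherwise run together.
spaced = soup.p.get_text(" ", strip=True)

print(single)
print(multi)
print(joined)
print(spaced)
```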

Searching via Strings

Find all <p> tags:

>>> soup.find_all('p')

[The Dormouse Story, ElsieLacie]

Searching via Regular Expressions

Find all tags whose names start with “b”:

>>> import re

>>> soup.find_all(re.compile("^b"))

[The Dormouse StoryElsieLacie, The Dormouse Story]

Searching via Lists

Find all <a> or <b> tags:

>>> soup.find_all(['a', 'b'])

[The Dormouse Story, Elsie, Lacie]

Searching via Functions

Find all tags that have “class” but not “id” attributes:

>>> def has_class_but_no_id(tag):

...     return 'class' in tag.attrs and \

...         'id' not in tag.attrs

...

>>> soup.find_all(has_class_but_no_id)

[The Dormouse Story, ElsieLacie]

Searching via Keyword Argument

>>> soup.find_all(id="link2")

[Lacie]

>>> soup.find_all(href=re.compile("^http(.)*elsie"))

[Elsie]

- You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

- You can filter multiple attributes at once by passing in more than one keyword argument.
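The filter types listed above can be sketched on a two-link fragment. The href URLs and the "link1" id are assumptions made for illustration; only the keyword-argument forms themselves come from the slides.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; attribute values are assumptions.
html_doc = (
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Collect the text of each matched tag for easy comparison."""
    return [t.get_text() for t in tags]


by_regex = names(soup.find_all(href=re.compile(r"lacie$")))
by_list = names(soup.find_all(id=["link1", "link3"]))   # any id in the list
by_true = names(soup.find_all(id=True))                 # any tag with an id
# A function filter receives the attribute value; guard against None in
# case it is called for tags that lack the attribute.
by_func = names(soup.find_all(
    href=lambda v: v is not None and v.endswith("elsie")))
# Several keyword arguments must all match.
combined = names(soup.find_all(href=re.compile("example"), id="link2"))

print(by_regex, by_list, by_true, by_func, combined)
```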

Searching via CSS Class

“class” is a reserved word in Python, so it cannot be used as a keyword argument; use class_ instead:

>>> soup.find_all(class="sister")

SyntaxError

>>> soup.find_all(class_="sister")

[Elsie, Lacie]

Amazon Reviews

Let’s get a list of Amazon Reviews for Dive into Python.

Quiz

1. Consider the code l = (i for i in range(10))

   - What happens if I type l[0]?

   - What about l.next()?

   - What about sum(l)?

   - What about len(l)?

2. Write code to return a list containing all pairs of items from the list ['a', 'b', 'c', 'd'].

3. Write a regular expression to match a date of the form MM-DD-YYYY or MM-DD-YY. Don’t worry about making sure that the numbers are “valid.”

4. If I call a python program with python test.py --name Lili www.google.com -n 10 and parse the command-line arguments with options, args = optparser.parse_args(), what will the list args contain?

