CIS 192: Lecture 8 HTML Parsing Lili Dworkin University of Pennsylvania

Source: cis192/fall2014/files/lec8.pdf · Author: Lili Dworkin · Created Date: 10/27/2014 1:04:50 PM

  • CIS 192: Lecture 8: HTML Parsing

    Lili Dworkin

    University of Pennsylvania

  • HTTP Requests

    Use the requests library to make HTTP requests:

    >>> import requests

    >>> url = "http://www.cis.upenn.edu/~cis192/spring2014/"

    >>> req = requests.get(url)

    >>> req

    <Response [200]>

  • HTTP Requests

    The response object has many useful attributes and methods:

    - url

    - text

    - headers

    - cookies

    - status_code

    - json()

  • HTTP Requests

    Get the HTML source:

    >>> source = req.text

    >>> source[:99]

    u'\n\n \n <meta charset=''utf-8''>\n CIS192'

    >>> print source[:99]

    CIS192

  • Unicode

    The object returned has type unicode, not str:

    >>> type(req.text)

    <type 'unicode'>

    >>> unicode('hello')

    u'hello'

    >>> str(unicode('hello'))

    'hello'

  • Status Codes

    >>> url = "http://www.cis.upenn.edu/~cis192/spring2014/"

    >>> req = requests.get(url)

    >>> req.status_code

    200

  • Status Codes

    - 2xx – success

    - 3xx – redirection

    - 4xx – client error

    - 5xx – server error
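
The four classes above can be told apart by the leading digit of the code. A minimal sketch in Python 3 (the helper name `status_class` is made up for illustration and is not part of requests):

```python
def status_class(code):
    """Return a short label for the class of an HTTP status code."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit (404 -> 4).
    return classes.get(code // 100, "other")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```

Codes outside the four listed classes fall through to the label "other" in this sketch.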

  • Status Codes

    A “bad” status code won’t throw an error in your code, so you run the risk of thinking things worked when they actually didn’t:

    >>> url = 'http://httpbin.org/hidden-basic-auth'

    >>> req = requests.get(url)

    >>> # try to do stuff with req.text

    >>> # doesn't work! because:

    >>> req.status_code

    404

  • Status Codes

    Either check the status code directly, or use raise_for_status():

    >>> url = 'http://httpbin.org/hidden-basic-auth'

    >>> req = requests.get(url)

    >>> if req.status_code != 200:

    ...     raise Exception()

    OR

    >>> url = 'http://httpbin.org/hidden-basic-auth'

    >>> req = requests.get(url)

    >>> req.raise_for_status()

    Traceback (most recent call last):

    ...

    requests.exceptions.HTTPError: 404 Client Error
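
In a script you would usually catch the exception rather than let the traceback end the program. A minimal Python 3 sketch; to avoid needing network access it builds a bare `Response` by hand and sets its status code, which is an assumption made purely for illustration (normally the object comes from `requests.get(url)`):

```python
import requests

# Hand-built response standing in for the result of requests.get(url).
resp = requests.Response()
resp.status_code = 404

try:
    resp.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    outcome = "ok"
except requests.exceptions.HTTPError as err:
    outcome = "failed: %s" % err

print(outcome)
```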

  • Requests with Parameters

    - requests.get() issues a GET request

    - GET requests can take parameters

    - The query string is sent in the URL, e.g.:

    >>> req = requests.get('.../test?name1=value1&name2=value2')

    - But you shouldn’t do the formatting yourself:

    >>> params = {'name1':'value1', 'name2':'value2'}

    >>> req = requests.get('.../test', params=params)
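
One reason not to format the query string by hand is that values may contain characters that need escaping. The Python 3 sketch below uses the standard library's `urlencode` (not requests itself) to show the kind of escaping that passing `params=` spares you from; the parameter values are made up for illustration:

```python
from urllib.parse import urlencode

params = {"name1": "value 1", "name2": "a&b"}

# Naive formatting leaves the space and the '&' unescaped, so a server
# would see an extra, bogus parameter named 'b'.
naive = "name1=%s&name2=%s" % (params["name1"], params["name2"])

# urlencode escapes reserved characters in each value.
encoded = urlencode(params)

print(naive)
print(encoded)
```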

  • httpbin

    Let’s practice a bit on httpbin.org, an HTTP request and response test service.

  • HTML Structure

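    <html>

      <head><title>The Dormouse Story</title></head>

      <body>

        <p class="title"><b>The Dormouse Story</b></p>

        <p ...><a ...>Elsie</a><a ...>Lacie</a></p>

      </body>

    </html>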

  • BeautifulSoup

    Use BeautifulSoup to parse HTML:

    >>> from bs4 import BeautifulSoup

    >>> soup = BeautifulSoup(req.text) # html source

    >>> soup = BeautifulSoup(html_doc) # html string

    >>> type(soup)

    <class 'bs4.BeautifulSoup'>

  • Tag Objects

    A Tag object corresponds to an HTML tag:

    >>> soup.p

    <p class="title"><b>The Dormouse Story</b></p>

    >>> type(soup.p)

    <class 'bs4.element.Tag'>

    Tags have names and attributes:

    >>> tag = soup.p

    >>> tag.name

    u'p'

    >>> tag.attrs

    {u'class': [u'title']}

    >>> tag['class']

    [u'title']
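
The same lookups as a self-contained Python 3 script. This is a sketch: the one-line document is a cut-down stand-in for the lecture's example page, and "html.parser" is chosen here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the example page.
html_doc = '<p class="title"><b>The Dormouse Story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.p                 # first <p> tag in the document
name = tag.name              # the tag's name as a string
attrs = tag.attrs            # dict of attributes; class is a list
classes = tag["class"]       # index a tag like a dict to read one attribute
missing = tag.get("id")      # .get() returns None instead of raising KeyError

print(name, attrs, classes, missing)
```

Note that `class` comes back as a list because an HTML element can carry several classes at once.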

  • Pretty Printing

    Use prettify() to pretty print a tag (you can also call this on the entire soup object, but the output can get long):

    >>> print soup.p.prettify()

    <p class="title">

     <b>

      The Dormouse Story

     </b>

    </p>

  • Navigating

    Say the name of the tag you want:

    >>> soup.head

    <head><title>The Dormouse Story</title></head>

    Zooming in:

    >>> soup.head.title

    <title>The Dormouse Story</title>

    Using a tag name as an attribute gets the first tag by that name:

    >>> soup.a

    <a ...>Elsie</a>

  • Navigating: Going Down

    Getting a tag’s children:

    >>> soup.p

    <p class="title"><b>The Dormouse Story</b></p>

    >>> soup.p.contents

    [<b>The Dormouse Story</b>]

    >>> soup.p.children

    <listiterator object at 0x...>

    >>> [i for i in soup.p.children]

    [<b>The Dormouse Story</b>]

  • Navigating: Going Down

    Note that the child might also have a child ...

    >>> soup.p

    <p class="title"><b>The Dormouse Story</b></p>

    >>> soup.p.contents

    [<b>The Dormouse Story</b>]

    >>> child = soup.p.contents[0]

    >>> child.contents

    [u'The Dormouse Story']

  • Navigating: Going Down

    To search recursively and get children of children of children, etc., use descendants:

    >>> for i in soup.p.descendants:

    ...     print i

    <b>The Dormouse Story</b>

    The Dormouse Story
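
The difference between direct children and all descendants can be checked in a short Python 3 sketch (the one-line document is a cut-down stand-in for the example page):

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse Story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# .children yields only the direct children: here, the single <b> tag.
direct = list(soup.p.children)

# .descendants walks the whole subtree: the <b> tag, then the text inside it.
nested = list(soup.p.descendants)

print(len(direct), len(nested))
```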

  • Navigating: Going Up

    Getting a tag’s parent:

    >>> soup.title

    <title>The Dormouse Story</title>

    >>> soup.title.parent

    <head><title>The Dormouse Story</title></head>

  • Navigating: Going Sideways

    Getting a tag’s siblings:

    >>> soup.a

    <a ...>Elsie</a>

    >>> soup.a.next_sibling

    <a ...>Lacie</a>
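
A Python 3 sketch of moving up and sideways. The markup and the `id` values are assumptions for illustration. The second document shows a common surprise: whitespace between two tags is itself a sibling, so `find_next_sibling()` is often the safer way to reach the next tag.

```python
from bs4 import BeautifulSoup

# Two links with nothing between them.
tight = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>', "html.parser"
)
first = tight.a
parent_name = first.parent.name          # going up
next_id = first.next_sibling["id"]       # going sideways
before_first = first.previous_sibling    # None: nothing precedes the first link

# The same links separated by a newline.
spaced = BeautifulSoup("<p><a>Elsie</a>\n<a>Lacie</a></p>", "html.parser")
raw_sibling = spaced.a.next_sibling              # the newline text node
tag_sibling = spaced.a.find_next_sibling("a")    # skips over the text node

print(parent_name, next_id, repr(raw_sibling), tag_sibling.get_text())
```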

  • NavigableString Objects

    A NavigableString object corresponds to a bit of text within a tag:

    >>> soup.p

    <p class="title"><b>The Dormouse Story</b></p>

    >>> soup.p.string

    u'The Dormouse Story'

    >>> type(soup.p.string)

    <class 'bs4.element.NavigableString'>

  • NavigableString Objects

    A NavigableString does a lot more than a regular string:

    >>> soup.p.string.parent

    <b>The Dormouse Story</b>

    So if all you need is the text, you should convert:

    >>> str(soup.p.string)

    'The Dormouse Story'

  • NavigableString Objects

    Sometimes it’s helpful to check whether what you’re dealing with is a Tag or a NavigableString:

    >>> from bs4 import BeautifulSoup, NavigableString

    >>> [i for i in soup.descendants

    ...     if isinstance(i, NavigableString)]

    [u'The Dormouse Story', u'The Dormouse Story', u'Elsie', u'Lacie']

  • NavigableString Objects

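    Another option: use .text:

    >>> soup.p

    <p class="title"><b>The Dormouse Story</b></p>

    >>> soup.p.text

    u'The Dormouse Story'

    >>> type(soup.p.text)

    <type 'unicode'>

    .text behaves differently from .string: when a tag has more than one child, .string is None, while .text joins the strings of all descendants:

    >>> soup.string  # None, so nothing is printed

    >>> soup.text

    u'The Dormouse StoryThe Dormouse StoryElsieLacie'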
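
A Python 3 sketch contrasting the two on a tag with one child and a tag with two. The markup is a cut-down stand-in for the example page; `get_text()` is the method form of `.text` and accepts a separator.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="title"><b>The Dormouse Story</b></p>'
    "<p><a>Elsie</a><a>Lacie</a></p>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("p")

one_child_string = first.string      # single child chain, so the text is found
two_child_string = second.string     # two children, so .string is None
two_child_text = second.text         # all descendant strings joined together
spaced_text = second.get_text(" ")   # same, with a separator between pieces

print(one_child_string, two_child_string, two_child_text, spaced_text)
```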

  • Searching via Strings

    Find all <p> tags:

    >>> soup.find_all('p')

    [<p class="title"><b>The Dormouse Story</b></p>, <p ...><a ...>Elsie</a><a ...>Lacie</a></p>]

  • Searching via Regular Expressions

    Find all tags whose names start with “b”:

    >>> import re

    >>> soup.find_all(re.compile("^b"))

    [<body><p class="title"><b>The Dormouse Story</b></p><p ...><a ...>Elsie</a><a ...>Lacie</a></p></body>, <b>The Dormouse Story</b>]

  • Searching via Lists

    Find all <a> or <b> tags:

    >>> soup.find_all(['a', 'b'])

    [<b>The Dormouse Story</b>, <a ...>Elsie</a>, <a ...>Lacie</a>]

  • Searching via Functions

    Find all tags that have “class” but not “id” attributes:

    >>> def has_class_but_no_id(tag):

    ...     return 'class' in tag.attrs and \

    ...         'id' not in tag.attrs

    ...

    >>> soup.find_all(has_class_but_no_id)

    [<p class="title"><b>The Dormouse Story</b></p>, <p ...><a ...>Elsie</a><a ...>Lacie</a></p>]
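
The four kinds of filter (string, regular expression, list, function) side by side, as a Python 3 sketch. The document below is an assumed reconstruction of the example page: the "story" class and the href values are illustrative guesses, not taken from the lecture.

```python
import re
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse Story</title></head>"
    '<body><p class="title"><b>The Dormouse Story</b></p>'
    '<p class="story">'  # "story" is an assumed class name
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Reduce a list of tags to their tag names for compact printing."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    """Filter function: find_all calls this once per tag in the tree."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("p"))
by_regex = names(soup.find_all(re.compile("^b")))
by_list = names(soup.find_all(["a", "b"]))
by_function = names(soup.find_all(has_class_but_no_id))

print(by_string, by_regex, by_list, by_function)
```

Results come back in document order, which is why `<b>` precedes the two `<a>` tags in the list search.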

  • Searching via Keyword Argument

    >>> soup.find_all(id="link2")

    [<a ...>Lacie</a>]

    >>> soup.find_all(href=re.compile("^http(.)*elsie"))

    [<a ...>Elsie</a>]

    - You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

    - You can filter multiple attributes at once by passing in more than one keyword argument.

  • Searching via CSS Class

    class is a reserved word in Python, so use class_ instead:

    >>> soup.find_all(class="sister")

    SyntaxError

    >>> soup.find_all(class_="sister")

    [<a ...>Elsie</a>, <a ...>Lacie</a>]
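
Keyword filters, including `class_`, combined in one Python 3 sketch. The markup is an assumed stand-in for the example page; the href values and the `link1` id are illustrative guesses.

```python
import re
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'  # "story" is an assumed class name
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")


def texts(tags):
    """Reduce a list of tags to their text for compact printing."""
    return [tag.get_text() for tag in tags]


by_id = texts(soup.find_all(id="link2"))
by_href = texts(soup.find_all(href=re.compile("elsie")))
by_class = texts(soup.find_all(class_="sister"))
# Several keyword arguments must all match.
by_both = texts(soup.find_all(class_="sister", id="link1"))
# True matches any tag that has the attribute at all.
with_id = texts(soup.find_all(id=True))

print(by_id, by_href, by_class, by_both, with_id)
```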

  • Amazon Reviews

    Let’s get a list of Amazon Reviews for Dive into Python.

  • Quiz

    1. Consider the code l = (i for i in range(10))

       - What happens if I type l[0]?

       - What about l.next()?

       - What about sum(l)?

       - What about len(l)?

    2. Write code to return a list containing all pairs of items from the list ['a', 'b', 'c', 'd'].

    3. Write a regular expression to match a date of the form MM-DD-YYYY or MM-DD-YY. Don’t worry about making sure that the numbers are “valid.”

    4. If I call a Python program with python test.py --name Lili www.google.com -n 10 and parse the command-line arguments with options, args = optparser.parse_args(), what will the list args contain?

    5. **