Web text
Day 34 - 11/14/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
14-Nov-2014 NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Review
Finding text on the web
http://sethgodin.typepad.com/
Firefox: Tools > Web Developer > Page Source
Safari: Prefs > Advanced > Show Develop >> Show Page Source
<div class="entry-body">
<p>If someone asked you how to do something …. By all means, you still need pictures, even video. But there's nothing to replace the specificity that comes from the alphabet. Use labels. Use words.</p>
</div><!-- .entry-body -->
We need
requests
% pip install feedparser
% pip install BeautifulSoup4
Get the text
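# Python 2, as used in this course
import requests
from bs4 import BeautifulSoup

url = 'http://sethgodin.typepad.com/'
# download the page as a Unicode string
html = requests.get(url).text
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# text of the first <div class="entry-body">, encoded as UTF-8 for printing
print soup.find("div", {"class":"entry-body"}).text.encode('utf8')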
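The lookup can also be tried offline. This is a minimal sketch, assuming Python 3 and BeautifulSoup4 are installed; the HTML string is a made-up stand-in for a downloaded page, not the real blog. It uses find_all to collect every entry-body div, where find returns only the first.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two posts, standing in for the downloaded HTML.
html = """
<html><body>
<div class="entry-body"><p>Use labels. Use words.</p></div>
<div class="entry-body"><p>Second post.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; find returns only the first one.
posts = [div.get_text().strip()
         for div in soup.find_all("div", {"class": "entry-body"})]

for post in posts:
    print(post)
```

In Python 3 the strings are already Unicode, so the sketch prints them without the .encode('utf8') step.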
Install feedparser by hand
https://pypi.python.org/pypi/feedparser
click on Downloads button
choose .zip file
$ cd /Users/harryhow/Downloads/feedparser-5.1.3
$ python setup.py install
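After installing, it may help to check that the packages can be found before running the examples. This is a minimal sketch, assuming Python 3; missing_modules is a hypothetical helper name, not part of any of the libraries.

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that cannot be found for import."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# Import names used in these slides; BeautifulSoup4 is imported as bs4.
needed = ["requests", "feedparser", "bs4"]
print("Missing:", missing_modules(needed))
```

An empty list means all three packages are visible to the Python that ran the check, which may not be the same Python that Spyder uses.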
Get the RSS feed
# Python 2, as used in this course
from bs4 import BeautifulSoup
import feedparser

url = 'feed://feeds.feedblitz.com/sethsblog'
fp = feedparser.parse(url)
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)
blog_posts = []
for e in fp.entries:
    # strip the HTML markup from each post and encode the text as UTF-8
    blog_posts.append({'title': e.title,
                       'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text().encode('utf8'),
                       'link': e.links[0].href})
print blog_posts[0]['content']
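To see what the entries correspond to, here is a rough sketch that reads a small feed with the standard library's xml.etree.ElementTree instead of feedparser. That swap is deliberate, so the sketch runs without extra installs. It assumes Python 3 and a made-up RSS 2.0 string; the real feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
rss = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
channel = root.find("channel")

# One dictionary per <item>, mirroring the blog_posts list in the slide.
blog_posts = [{"title": item.findtext("title"),
               "link": item.findtext("link")}
              for item in channel.findall("item")]

print("Fetched %d entries from '%s'"
      % (len(blog_posts), channel.findtext("title")))
```

feedparser does this work for both RSS and Atom and normalizes the field names, which is why the slide uses it.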
Next time
something else
maybe a quiz