Don’t Scrape, Glean. Tom Morris

DESCRIPTION

Lacks the demo part, alas, but it's the slides I used

Page 1: Don't scrape, Glean!

Don’t Scrape, Glean.

Tom Morris

Page 2: Don't scrape, Glean!

Scraping sucks.

Page 3: Don't scrape, Glean!

def lastlogin
  date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end

Page 4: Don't scrape, Glean!

Hpricot for ‘Last login’ date on MySpace.

Page 5: Don't scrape, Glean!

try: lastlogin = self.soup.findAll(True, {"width": "193"})[0].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r"[0-9]/[0-9]+/[0-9]*") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None: self.lastlogin = loginregex_inst.group() except: pass

Page 6: Don't scrape, Glean!

Taken from a Python/BeautifulSoup library.

Page 7: Don't scrape, Glean!

(The Ruby is prettier, but who’s counting?)

Page 8: Don't scrape, Glean!

getElementsByClassName("foo")[0].children

Page 9: Don't scrape, Glean!

It’s an edge case. MySpace’s HTML is worse than average.

Page 10: Don't scrape, Glean!

But it is an ugly recipe for mental turmoil.

Page 11: Don't scrape, Glean!

The alternative?

Page 12: Don't scrape, Glean!

flickr.getPhotos()

Page 13: Don't scrape, Glean!

And you get back nice XML or JSON (or even SOAP!)
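To make the contrast with the scraping code concrete, here is a minimal sketch of consuming a JSON response. The payload and its field names (`photos`, `id`, `title`) are made up for illustration and are not Flickr's actual response format.

```python
import json

# Hypothetical response body; the field names are invented for this sketch.
RESPONSE_BODY = (
    '{"photos": [{"id": "1", "title": "Sunset"},'
    ' {"id": "2", "title": "Harbour"}]}'
)


def photo_titles(body):
    """Return the title of every photo in a JSON response body."""
    data = json.loads(body)
    return [photo["title"] for photo in data["photos"]]


for title in photo_titles(RESPONSE_BODY):
    print(title)
```

No sibling-walking and no regular expressions: the structure of the data is explicit, so the lookup is a dictionary access.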

Page 14: Don't scrape, Glean!

But ‘D.R.Y.’! APIs break that principle.

Page 15: Don't scrape, Glean!

This is the data equivalent of the ‘accessible version’.

Page 16: Don't scrape, Glean!

Enter GRDDL.

Page 17: Don't scrape, Glean!

GRDDL defines a transformation process for XHTML » RDF.

Page 18: Don't scrape, Glean!

XHTML? That’s what the spec says.

Page 19: Don't scrape, Glean!

HTML 4 works too. Tidy!

Page 20: Don't scrape, Glean!

RDF? Yes. Trust me. It’s not evil.

Page 21: Don't scrape, Glean!

GRDDL can work like a data stylesheet on top of your HTML.

Page 22: Don't scrape, Glean!

You simply use HTML (or XML) in the normal way...

Page 23: Don't scrape, Glean!

...and define how the data is transformed.
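As a rough illustration of how a page declares its transformation, the sketch below looks for the GRDDL profile on `<head>` and collects any `<link rel="transformation">` targets. The sample page and the stylesheet URL are placeholders, and the sketch only covers `<link>` elements in the head of well-formed XHTML, not every discovery route in the spec.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Made-up sample page; the stylesheet URL is a placeholder.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def grddl_transformations(xhtml_text):
    """Return the transformation URLs a well-formed XHTML page declares."""
    root = ET.fromstring(xhtml_text)
    head = root.find(XHTML + "head")
    if head is None:
        return []
    # The profile attribute may hold several space-separated URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    return [
        link.get("href")
        for link in head.iter(XHTML + "link")
        if "transformation" in link.get("rel", "").split() and link.get("href")
    ]


print(grddl_transformations(PAGE))
```

A GRDDL-aware agent would then fetch each listed transformation (typically XSLT) and apply it to the page to obtain RDF.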

Page 24: Don't scrape, Glean!

You can even use it as a bridge for existing APIs and services.

Page 25: Don't scrape, Glean!

Could even be used for other formats than RDF. Atom?

Page 26: Don't scrape, Glean!

Simple example: ‘Not Safe For Work’

Page 27: Don't scrape, Glean!

<a href="http://tubgirl.com" class="nsfw">
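A minimal sketch of what gleaning this markup could produce. A real GRDDL transformation would normally be written in XSLT; Python's standard `html.parser` stands in for it here, and the predicate URI and the sample links are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical predicate URI, invented for this sketch.
NSFW_PREDICATE = "http://example.org/vocab#nsfw"


class NsfwLinkGleaner(HTMLParser):
    """Collect the href of every <a> whose class list contains 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and "nsfw" in classes:
            self.links.append(href)


def glean_nsfw_triples(html_text):
    """Return one N-Triples-style statement per nsfw-marked link."""
    gleaner = NsfwLinkGleaner()
    gleaner.feed(html_text)
    gleaner.close()
    return ['<%s> <%s> "true" .' % (href, NSFW_PREDICATE)
            for href in gleaner.links]


# Made-up sample markup.
SAMPLE = ('<p><a href="http://example.org/a" class="nsfw">A</a> '
          '<a href="http://example.org/b">B</a></p>')

for statement in glean_nsfw_triples(SAMPLE):
    print(statement)
```

The author of the page only writes the class name; the mapping from that class to a statement about the link lives in the transformation.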

Page 28: Don't scrape, Glean!

I can write that. I can’t write xFolk by hand.

Page 29: Don't scrape, Glean!

Is ‘nsfw’ a good class name? No.

Page 30: Don't scrape, Glean!

Do I care? No.

Page 31: Don't scrape, Glean!

The data layer becomes separated like CSS is from HTML.

Page 32: Don't scrape, Glean!

That’s the theory. Now for the demo.

Page 33: Don't scrape, Glean!

irc.freenode.net: #swig, #swhack

Page 34: Don't scrape, Glean!

getsemantic.com

semantic-[email protected]

Page 35: Don't scrape, Glean!

[email protected]

http://tommorris.org