Don't scrape, Glean!

Don’t Scrape, Glean.

Tom Morris

Scraping sucks.

def lastlogin
  date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end

Hpricot for ‘Last login’ date on MySpace.

try: lastlogin = self.soup.findAll(True, {"width": "193"})[0].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r"[0-9]/[0-9]+/[0-9]*") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None: self.lastlogin = loginregex_inst.group() except: pass

Taken from a Python/BeautifulSoup library.

(The Ruby is prettier, but who’s counting?)

getElementsByClassName("foo")[0].children

It’s an edge case. MySpace’s HTML is worse than average.

But it is an ugly recipe for mental turmoil.

The alternative?

flickr.getPhotos()

And you get back nice XML or JSON (or even SOAP!)
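For contrast with the scraping code, here is a minimal sketch of what consuming a JSON response looks like in Python. The payload shape and field names are invented for illustration; they are not Flickr's actual response format.

```python
import json

# Hypothetical payload: the shape and field names are made up for this
# sketch and are not Flickr's real response format.
payload = '{"photos": [{"id": "1", "title": "Sunset"}, {"id": "2", "title": "Harbour"}]}'


def photo_titles(raw):
    """Return the title of every photo in a JSON response string."""
    return [photo["title"] for photo in json.loads(raw)["photos"]]


titles = photo_titles(payload)
```

No sibling-hopping, no regular expressions: the structure of the data is the structure of the code.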

But ‘D.R.Y.’! APIs break that principle.

This is the data equivalent of the ‘accessible version’.

Enter GRDDL.

GRDDL defines a transformation process for XHTML » RDF.

XHTML? That’s what the spec says.

HTML 4 works too. Tidy!

RDF? Yes. Trust me. It’s not evil.

GRDDL can work like a data stylesheet on top of your HTML.

You simply use HTML (or XML) in the normal way...

...and define how the data transformation works.
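As a rough sketch of how a GRDDL-aware client might find that definition: a page opts in by listing the GRDDL profile URI on its head element, then points at its transformation with rel="transformation" on a link or a element. The Python below only does the discovery step and does not run the transformation. The sample page and the example.org stylesheet URL are made up for illustration.

```python
from html.parser import HTMLParser

# Profile URI that marks a page as GRDDL-enabled.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


class TransformationFinder(HTMLParser):
    """Collect GRDDL transformation links from an (X)HTML page."""

    def __init__(self):
        super().__init__()
        self.has_grddl_profile = False
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # The profile attribute is a space-separated list of URIs.
            profiles = (attrs.get("profile") or "").split()
            self.has_grddl_profile = GRDDL_PROFILE in profiles
        elif tag in ("link", "a"):
            # rel is also a space-separated list of tokens.
            rels = (attrs.get("rel") or "").split()
            if "transformation" in rels and attrs.get("href"):
                self.transformations.append(attrs["href"])


def find_transformations(html):
    """Return transformation URLs, or [] if the page has no GRDDL profile."""
    finder = TransformationFinder()
    finder.feed(html)
    return finder.transformations if finder.has_grddl_profile else []


# Hypothetical page; the stylesheet URL is invented for this sketch.
page = """<html>
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Links</title>
  <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
</head>
<body><a href="http://rotten.com/" class="nsfw">rotten</a></body>
</html>"""

found = find_transformations(page)
```

A real client would go on to fetch each transformation (typically XSLT) and apply it to the page to get RDF.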

You can even use it as a bridge for existing APIs and services.

Could even be used for other formats than RDF. Atom?

Simple example: ‘Not Safe For Work’

<a href="http://tubgirl.com" class="nsfw">

http://rotten.com/

I can write that. I can’t write xFolk by hand.

Is ‘nsfw’ a good class name? No.

Do I care? No.

The data layer becomes separated like CSS is from HTML.
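To make the ‘nsfw’ example concrete, here is a minimal Python sketch of the kind of mapping a transformation for that class could perform: find every link carrying the class and emit one N-Triples statement about its target. A GRDDL transformation would normally be written in XSLT; Python stands in here for readability. The predicate URI is hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical predicate URI, invented for this sketch.
RATING = "http://example.org/vocab#rating"


class NsfwLinkExtractor(HTMLParser):
    """Collect the href of every <a> element whose class list has 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "nsfw" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def nsfw_triples(html):
    """Return one N-Triples line per link marked class="nsfw"."""
    extractor = NsfwLinkExtractor()
    extractor.feed(html)
    return ['<%s> <%s> "nsfw" .' % (href, RATING) for href in extractor.links]


page = (
    '<p><a href="http://rotten.com/" class="nsfw">rotten</a> '
    '<a href="http://tommorris.org">home</a></p>'
)
triples = nsfw_triples(page)
```

The HTML author only ever writes class="nsfw"; the vocabulary choice lives entirely in the transformation, which is the separation the slide describes.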

That’s the theory. Now for the demo.

irc.freenode.net: #swig, #swhack

getsemantic.com

semantic-[email protected]

[email protected]

http://tommorris.org
