Mechanize: shameless content grabbing
Friday, 19 March 2010

RubyBarcamp #3 - Mechanize shameless content grabbing

Content grabbing

Profit Morale

This is not my damn business.

Deprecations

0.9.x             1.0
WWW::Mechanize    Mechanize

Examples

    require 'mechanize'

    @browser = Mechanize.new

    page = @browser.get('http://mega.genn.org')

    # One Post per entry title link on the listing page.
    page.search('h2 a.entry-title').each do |a|
      post = Post.new
      post.title = a.text
      post.orig_url = a[:href]
      post.save
    end

    require 'mechanize'

    @browser = Mechanize.new

    page = @browser.get('http://mega.genn.org')

    page.search('h2 a.entry-title').each do |a|
      post = Post.new
      post.title = a.text
      post.orig_url = a[:href]

      # Follow the link and keep the post body as an HTML string.
      post_page = @browser.click(a)
      post.content = post_page.at('.content .entry-content').to_s

      post.save
    end
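The loop above stores `a[:href]` exactly as it appears in the markup, which may be a relative path. Here is a minimal sketch of one way to normalise it, assuming a plain `Struct` in place of the talk's `Post` model and a hypothetical helper named `posts_from`; neither comes from the slides. It relies only on the page object answering `search`, as a Mechanize page does.

```ruby
require 'uri'

# Hypothetical stand-in for the Post model used on the slides.
Post = Struct.new(:title, :orig_url)

# Hypothetical helper: turns the entry-title links of a listing page into
# Post objects. `page` only needs to respond to #search and return nodes
# that respond to #text and #[] (a Mechanize page does both).
def posts_from(page, base_url)
  page.search('h2 a.entry-title').map do |a|
    # URI.join resolves relative hrefs against the listing page's URL and
    # leaves absolute ones untouched.
    Post.new(a.text.strip, URI.join(base_url, a[:href]).to_s)
  end
end

# Usage with a real agent (assumes the mechanize gem is installed):
#   page  = Mechanize.new.get('http://mega.genn.org')
#   posts = posts_from(page, page.uri.to_s)
```

Keeping the extraction separate from the fetching also means it can be exercised with a simple fake page object instead of a live site.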

Friday, 19March, 2010

Page 10: RubyBarcamp #3 - Mechanize shameless content grabbing

Pulling it all together

In the beginning...

    def initialize
      @browser = Mechanize.new
    end

Dealing with <meta>s...

    def initialize
      @browser = Mechanize.new { |a| a.follow_meta_refresh = true }
    end

Conserving memory...

    def initialize
      @browser = Mechanize.new { |a| a.follow_meta_refresh = true }
      @browser.max_history = 2
    end

Simulating user agent...

    def initialize
      @browser = Mechanize.new { |a| a.follow_meta_refresh = true }
      @browser.max_history = 2
      @browser.user_agent_alias = 'Linux Mozilla'
    end

Using the Tor anonymizer...

    def initialize
      @browser = Mechanize.new { |a| a.follow_meta_refresh = true }
      @browser.max_history = 2
      @browser.user_agent_alias = 'Linux Mozilla'
      # Local HTTP proxy in front of Tor (8118 is Privoxy's default port).
      @browser.set_proxy '127.0.0.1', '8118'
    end

Setting cookies...

    def initialize
      @browser = Mechanize.new { |a| a.follow_meta_refresh = true }
      @browser.max_history = 2
      @browser.user_agent_alias = 'Linux Mozilla'
      @browser.set_proxy '127.0.0.1', '8118'
      set_some_cookies
    end

    def set_some_cookies
      cookie = Mechanize::Cookie.new 'is_bot', "advanced one"
      cookie.path = '/'
      cookie.domain = 'mega.genn.org'

      uri = URI::HTTP.build(:host => 'mega.genn.org', :path => '/')
      @browser.cookie_jar.add uri, cookie
    end
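The cookie jar is given the URI the cookie belongs to as its first argument. As a small aside, this is what the `URI::HTTP.build` call produces, using only Ruby's standard library. The helper name `site_root` is made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: builds the root URI of a host the same way the
# slide does before handing it to the cookie jar.
def site_root(host)
  URI::HTTP.build(:host => host, :path => '/')
end

uri = site_root('mega.genn.org')
puts uri       # http://mega.genn.org/
puts uri.host  # mega.genn.org
puts uri.path  # /
```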

What to do with JS?

• V8 (Google)

• SpiderMonkey/TraceMonkey (Mozilla)

Example

I want that dotted pattern.

Don’t ask me why.

Example

    require 'harmony'

    page = Harmony::Page.fetch('http://mega.genn.org/2010/ipad/')
    page.load('http://code.jquery.com/jquery-1.4.2.min.js')

    dots = page.execute_js("$('.mclrs').html()")

    dots = asciify(dots)

    . . .... . . .. . . . . .. . . . . ..

    .. . . . .. . .. . .... ... ..... . ...... .

    ... . .... .. .. . .

Long time execution: delayed_job

Simple worker

    class DelayedParser
      def perform
        Parser.new.extract_posts
      end
    end

Simple rake task to run it

    namespace :parser do
      desc "Parse mega.genn.org using delayed_job"
      task :dj => :environment do
        Delayed::Job.enqueue(DelayedParser.new)
      end
    end
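delayed_job only asks a job object to respond to `perform`, which is why the worker needs no base class. The contract can be sketched with a toy in-memory queue. `ToyQueue` and `EchoJob` are invented for illustration and are not the delayed_job API, which additionally serialises jobs to the database and runs them in a separate worker process.

```ruby
# Toy stand-in for a job queue. NOT the delayed_job API: jobs stay in
# memory and run in the calling process.
class ToyQueue
  def initialize
    @jobs = []
  end

  # Accepts any object that responds to #perform, mirroring the duck-typed
  # contract delayed_job expects from its payload objects.
  def enqueue(job)
    raise ArgumentError, 'job must respond to #perform' unless job.respond_to?(:perform)
    @jobs << job
    self
  end

  # Runs the queued jobs in FIFO order and returns their results.
  def work_off
    results = []
    results << @jobs.shift.perform until @jobs.empty?
    results
  end
end

# Hypothetical job used only for the demonstration below.
class EchoJob
  def initialize(message)
    @message = message
  end

  def perform
    "done: #{@message}"
  end
end

queue = ToyQueue.new
queue.enqueue(EchoJob.new('page 1')).enqueue(EchoJob.new('page 2'))
puts queue.work_off.inspect
```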

Long time execution: delayed_job

But sometimes it takes too long

Long time execution: loops + RabbitMQ

Doing small tasks. But at the same time.

Loops

[Diagram: a ProcessManager running three Loop processes]

Loops

PostFinderLoop PostExtractorLoop

RabbitMQ

[Diagram: topic routing through the broker]

Messages (routing keys): usa.news, europe.news, europe.weather, usa.weather
Broker: Exchange -> Bindings -> Queues
Queues (binding keys): usa.#, #.news, #.weather, europe.#
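How a routing key ends up in a queue can be sketched in plain Ruby, assuming the standard AMQP topic-exchange rules: keys are dot-separated words, `*` stands for exactly one word and `#` for zero or more. The method names are hypothetical, and this only illustrates the matching rule, not how RabbitMQ implements it.

```ruby
# Returns true when a routing key satisfies a topic binding key.
def topic_match?(binding_key, routing_key)
  match_words(binding_key.split('.'), routing_key.split('.'))
end

# Recursive word-by-word comparison of pattern and key.
def match_words(pattern, words)
  return words.empty? if pattern.empty?

  head, *rest = pattern
  case head
  when '#'
    # '#' may absorb zero or more words, so try every split point.
    (0..words.length).any? { |n| match_words(rest, words.drop(n)) }
  when '*'
    # '*' must consume exactly one word.
    !words.empty? && match_words(rest, words.drop(1))
  else
    words.first == head && match_words(rest, words.drop(1))
  end
end

bindings = ['usa.#', '#.news', '#.weather', 'europe.#']
['usa.news', 'europe.news', 'europe.weather', 'usa.weather'].each do |key|
  queues = bindings.select { |b| topic_match?(b, key) }
  puts "#{key} -> #{queues.join(', ')}"
end
```

With the keys from the diagram, every message lands in exactly two queues: one bound by region and one bound by topic.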

RabbitMQ

genn.pages

    http://mega.genn.org/
    http://mega.genn.org/page/2/
    http://mega.genn.org/page/3/
    ...

genn.posts

    http://mega.genn.org/2010/mini-cooper-i-will-buy/
    http://mega.genn.org/2010/ipad/
    http://mega.genn.org/2010/prawn-the-queen-salad/
    ...
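The two queues form a pipeline: page URLs go into `genn.pages`, and each page yields post URLs for `genn.posts`. Here is a minimal in-process sketch of that flow, assuming Ruby's thread-safe `Queue` as a stand-in for the RabbitMQ queues and caller-supplied lambdas in place of the real parser. `run_pipeline` and the `STOP` marker are hypothetical names.

```ruby
require 'thread'

# Marker pushed onto a queue to tell its consumer to finish.
STOP = :stop

# Feeds page URLs through a finder stage and an extractor stage, each in
# its own thread, and returns the extractor's results in arrival order.
def run_pipeline(page_urls, find_posts:, extract_post:)
  pages   = Queue.new # plays the role of genn.pages
  posts   = Queue.new # plays the role of genn.posts
  results = Queue.new

  finder = Thread.new do
    while (page_url = pages.pop) != STOP
      find_posts.call(page_url).each { |post_url| posts << post_url }
    end
    posts << STOP # no more pages, so the next stage can stop too
  end

  extractor = Thread.new do
    while (post_url = posts.pop) != STOP
      results << extract_post.call(post_url)
    end
  end

  page_urls.each { |url| pages << url }
  pages << STOP
  [finder, extractor].each(&:join)

  Array.new(results.size) { results.pop }
end
```

With a real broker the stages run as separate processes and several consumers can share one queue, which is what lets the small tasks run at the same time.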

RabbitMQ: PostFinderLoop

    class PostFinderLoop < Loops::AMQP::Bunny
      def process_message(page_url)
        info "Received page url: #{page_url}"
        @parser ||= LoopsParser.new
        @parser.find_posts(page_url) do |post_url|
          @exchange.publish post_url, :key => 'genn.posts', :persistent => true
        end
      end
    end

RabbitMQ: PostExtractorLoop

    class PostExtractorLoop < Loops::AMQP::Bunny
      def process_message(post_url)
        info "Received post url: #{post_url}"
        @parser ||= LoopsParser.new

        begin
          @parser.extract_post(post_url)
        rescue StandardError => e
          error "Exception #{e} on page (#{post_url})."
        end
      end
    end
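The `rescue` in `process_message` keeps one bad page from killing the loop. The same idea in isolation, as a standard-library-only sketch; `process_all` is a hypothetical helper, not part of loops or Bunny.

```ruby
# Yields every URL to the block, recording failures instead of letting a
# single exception abort the whole run. Returns [url, message] pairs.
def process_all(urls)
  failures = []
  urls.each do |url|
    begin
      yield url
    rescue StandardError => e
      failures << [url, e.message]
    end
  end
  failures
end

done = []
failed = process_all(%w[/ok-1 /broken /ok-2]) do |url|
  raise 'parse error' if url == '/broken'
  done << url
end
puts "processed: #{done.inspect}, failed: #{failed.inspect}"
```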

How to test it?

I have no idea.

http://github.com/daemon/ruby_barcamp_kiev_03_2010

Links, slides and other
