Photo: http://cliparts.co/clipart/3666251

Building Efficient and Reliable Crawler System With Sidekiq Enterprise


Page 1

Page 2

Has anyone ever written crawlers?

Page 3

Has anyone ever used cron?

Page 4

Has anyone ever used Sidekiq?

Page 5

Gary (Chien-Wei Chu) @icarus4 / @icarus4.chu

Was a C programmer. Fell in love with Ruby in 2013.

CTO of Statementdog

Page 6
Page 7

I play badminton

Photo: https://static01.nyt.com/images/2016/08/19/sports/19BADMINTONweb3/19BADMINTONweb3-master675.jpg

Page 8
Page 9

Photo: http://classic.battle.net/images/battle/scc/protoss/pix/units/screenshots/d05.jpg

Page 10

Photo: http://resources.workable.com/wp-content/uploads/2015/08/ruby-560x224.jpg

Page 15

• Introduction to Statementdog

• Data behind Statementdog

• Past practice of Statementdog

• Problems of the past practice

• How we design our system to solve the problems.

Page 16

Focus on:

• More reliable job scheduling

• Dealing with throttling issues

Pages 18-25

Page 34

Revenue

EPS

Gross Margin

Net Income

Assets

Liabilities

Operating Cash Flow

Free Cash Flow

Investing Cash Flow

ROE

ROA

Accounts Receivable

Accounts Payable

PMI

GDP

Page 35
Page 36

Taiwan Market Observation Post System

Taiwan Stock Exchange

Taiwan Depository & Clearing Corporation

Yahoo Stock Feed

Page 37

Yearly - dividend, remuneration of directors and supervisors

Quarterly - quarterly financial statements

Monthly - revenue

Weekly -

Daily - closing price

Hourly - stock news from Yahoo stock feed

Minutely - important news from Taiwan Market Observation Post System

Page 38
Page 39

Something like this, but written in PHP

A super long-running process (1 hour+) that loops from the first stock to the last one

Stock.find_each do |stock|
  # download xml financial report data
  # ...
  # extract xml data
  # ...
  # calculate advanced data
  # ...
end

Page 43

A super long-running process for quarterly reports

A super long-running process for monthly revenue

A super long-running process for daily prices

A super long-running process for news

...

Page 46

• Really slow

• Inefficient - unable to retry only the failed jobs

• Unpredictable server loading

Page 47

[Diagram: server loading over time when loading is low - jobs 1-5 each run on schedule]

Pages 48-50

[Diagram: server loading over time when loading is HIGH - jobs 1-5 pile up on top of other tasks]

Too many crawler processes executed at the same time

Page 51

• Really slow

• Inefficient - unable to retry only the failed jobs

• Unpredictable server loading

• Scaling out is not easy

Page 56

• Inherent problems of Unix Cron:

• Unreliable scheduling

• High availability is not easy

• Hard to prioritize jobs by popularity

• Not easy to deal with bandwidth throttling

Page 57
Page 58

Created by Mike Perham

Page 59
Page 60

[Diagram, built up across Pages 60-65: a web server process receives many requests and, acting as a producer, pushes jobs to a job queue (pushing to the queue is very fast). Worker processes on one or more worker servers act as consumers and pull jobs from the queue; extra servers can be added when needed.]

Pages 66-68

[Diagram: single process vs. multi-threading - one worker process runs 25 threads (thread 1 … thread 25), so a single process handles 25 jobs concurrently (1:25) with the same degree of memory consumption as one process]
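That 1:25 ratio can be illustrated with plain Ruby threads. This is a minimal sketch, not Sidekiq itself; the fetch_stock stub is a hypothetical stand-in for a crawl job, and 25 happens to be Sidekiq's default concurrency:

```ruby
# One process, 25 threads, each handling a "job" concurrently -
# the model a Sidekiq worker process uses.
def fetch_stock(id)
  # stand-in for a crawl; real work would be network I/O
  "stock-#{id}"
end

threads = (1..25).map do |id|
  Thread.new { fetch_stock(id) }
end
results = threads.map(&:value)  # waits for each thread, one result per thread

results.size  # => 25, all inside a single process
```

Because crawling is dominated by network I/O, threads spend most of their time waiting, which is exactly where Ruby threads pay off despite the GVL.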

Page 69

Sidekiq (OSS) Sidekiq Pro

Sidekiq Enterprise

Page 70

Sidekiq Pro: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs

Sidekiq Enterprise: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption

Page 71
Page 72

Parallelism Makes Things Faster

Page 73

• Really slow

• Inefficient - unable to retry only the failed jobs

• Unpredictable server loading

• Scaling out is not easy

Page 74

• Efficient - retry only the failed jobs

• Predictable server loading

• Easy to scale out


Page 76

• Inherent problems of Unix Cron:

• Unreliable scheduling

• High availability is not easy

• Hard to prioritize jobs by popularity

• Not easy to deal with bandwidth throttling

Page 77
Page 78
Page 79

–Mike Perham, CEO, Contributed Systems, Creator of Sidekiq

Page 80

Keep the state of cron executions in the most robust part of our system - the database

All scheduled jobs are invoked by a single job that runs every minute


Pages 82-85

Create a table for storing cron settings (table name: cron_jobs):

create_table :cron_jobs do |t|
  t.string    :klass,           null: false               # worker class name
  t.string    :cron_expression, null: false               # something like "0 */2 * * *"
  t.timestamp :next_run_at,     null: false, index: true  # when the job should next be executed
end

Page 86

klass | cron_expression | next_run_at
Push2000NewsJobs | "0 */2 * * *" | …
Push2000DailyPriceJobs | "0 2 * * 1-5" | …
Push2000MonthlyRevenueJobs | "0 0 10 * *" | …

Page 87

# Add to your cron schedule (e.g. config/schedule.rb with the whenever gem)
every :minute do
  runner 'CronJobWorker.perform_async'
end

Cron only schedules one job every minute

Pages 88-91

CronJobWorker invokes all of your crawlers:

class CronJobWorker
  include Sidekiq::Worker

  def perform
    # Find jobs that are due for execution
    CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
      # Push the job to the job queue
      Sidekiq::Client.push(
        'class' => job.klass.constantize,
        'args'  => ['foo', 'bar']
      )
      # Set up the next execution time
      x = Sidekiq::CronParser.new(job.cron_expression)
      job.update!(next_run_at: x.next.to_time)
    end
  end
end

Page 92

Missed job executions will be picked up in the next minute
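The catch-up behavior can be sketched without Rails or Sidekiq: as long as next_run_at lives in the database, any job whose time passed while the scheduler was down is still due on the next tick. A minimal, framework-free illustration (CronJob, tick, and the fixed 120-second interval are hypothetical stand-ins; a real implementation would use a cron expression parser):

```ruby
# Hypothetical stand-in for one row of the cron_jobs table.
CronJob = Struct.new(:klass, :interval_seconds, :next_run_at)

# One scheduler tick: collect everything that is due, then advance
# next_run_at past "now". Instead of pushing to Sidekiq, we just
# return the class names of the jobs that would have been pushed.
def tick(jobs, now)
  due = jobs.select { |j| j.next_run_at <= now }
  due.each do |job|
    # A cron parser would compute the next time here; a fixed
    # interval keeps the sketch dependency-free.
    job.next_run_at += job.interval_seconds while job.next_run_at <= now
  end
  due.map(&:klass)
end

t0 = Time.at(0)
jobs = [CronJob.new('Push2000NewsJobs', 120, t0)]

# The scheduler was down for 5 minutes; the job is still found
# as due on the very next tick, and next_run_at moves past "now".
due = tick(jobs, t0 + 300)  # => ["Push2000NewsJobs"]
```

The same property is what makes the DB-backed design reliable: a missed minute loses nothing, because "due" is a comparison against stored state rather than a fired-and-forgotten cron event.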


Page 94

Drawbacks solved:

• Inherent problems of Unix Cron:

• Unreliable scheduling

• Hard to prioritize jobs by popularity

• High availability is not easy

• Not easy to deal with bandwidth throttling


Page 96

table: cron_jobs

klass | cron_expression | args | next_run_at
Push2000NewsJobs | "0 */2 * * *" | [] | …
NewsWorker | "*/30 * * * *" | [popular_stock_id_1] | …
NewsWorker | "*/30 * * * *" | [popular_stock_id_2] | …


Page 98

• Inherent problems of Unix Cron:

• Unreliable scheduling

• Hard to prioritize jobs by popularity

• High availability is not easy

• Not easy to deal with bandwidth throttling

Page 99

With Sidekiq Enterprise periodic jobs, CronJobWorker can be registered to run every minute without depending on cron at all:

Sidekiq.configure_server do |config|
  config.periodic do |mgr|
    mgr.register("* * * * * *", CronJobWorker)
  end
end


Page 101

• Inherent problems of Unix Cron:

• Unreliable scheduling

• Hard to prioritize jobs by popularity

• High availability is not easy

• Not easy to deal with bandwidth throttling

Page 102
Page 103

You always want your crawlers to be as fast as possible

Page 104

However, your target server doesn't always allow you to crawl at an unlimited rate

Page 105

If you want to crawl data for your 2000 stocks:

Stock.pluck(:id).each do |stock_id|
  SomeWorker.perform_async(stock_id)
end

This inserts 2000 jobs into the queue at the same time

Page 106

Assume the target server accepts requests at a maximum rate of 1 request per second

Page 107

[Diagram: timeline in seconds - all 2000 jobs inserted at once try to run in second 1]

Insert 2000 jobs into the queue at the same time and all of your jobs may be blocked (except the first one)

Page 108

Improvement 1: schedule jobs with incremental delays

Stock.pluck(:id).each_with_index do |stock_id, index|
  SomeWorker.perform_in(index, stock_id)
end

Page 109

[Diagram: timeline in seconds - job1 runs at second 1, job2 at second 2, job3 at second 3, …, job2000 at second 2000]

Pages 110-111

Workable, but… if the target server becomes unreachable, the scheduled delays still expire, so jobs 3~2000 will still execute at the same time

Page 112

• Limit your worker threads to perform specific jobs at a bounded rate

• Sidekiq Enterprise provides two types of rate-limiting APIs

Pages 113-114

CONCURRENT_LIMITER = Sidekiq::Limiter.concurrent('price', 10)

def perform(*args)
  CONCURRENT_LIMITER.within_limit do
    # crawl stock data
  end
end

Only 10 concurrent operations inside the block can happen at any given moment

Page 115

BUCKET_LIMITER = Sidekiq::Limiter.bucket('price', 10, :second)

def perform(*args)
  BUCKET_LIMITER.within_limit do
    # crawl stock data
  end
end

Every second, you can perform up to 10 operations
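Conceptually, a bucket limiter grants up to N operations per time bucket and rejects (or makes callers wait) once the current bucket is used up. A minimal single-threaded sketch of the idea - this BucketLimiter and its fake clock are illustrative only, not the Sidekiq implementation or API:

```ruby
# Simplified illustration of bucket-style rate limiting (not the Sidekiq API).
class BucketLimiter
  def initialize(limit_per_bucket, bucket_seconds, clock = -> { Time.now.to_f })
    @limit = limit_per_bucket
    @bucket_seconds = bucket_seconds
    @clock = clock
    @counts = Hash.new(0)  # bucket index => operations performed in that bucket
  end

  # Runs the block if the current bucket still has capacity and returns its
  # value; returns :over_limit when the bucket is exhausted. (Sidekiq instead
  # blocks or raises OverLimit so the job is retried later.)
  def within_limit
    bucket = (@clock.call / @bucket_seconds).floor
    return :over_limit if @counts[bucket] >= @limit
    @counts[bucket] += 1
    yield
  end
end

# Fake clock so the example is deterministic.
now = [100.0]
limiter = BucketLimiter.new(3, 1, -> { now[0] })

results = 5.times.map { limiter.within_limit { :crawled } }
# results == [:crawled, :crawled, :crawled, :over_limit, :over_limit]

now[0] += 1.0                      # next second: a fresh bucket
limiter.within_limit { :crawled }  # runs again
```

The key property for a crawler is that capacity resets every interval, so a burst of 2000 queued jobs drains at a steady, server-friendly pace instead of all at once.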

Page 116

You must fine-tune your limiter parameters for each data source to get the best performance
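For example, both limiter types accept tuning options - a sketch, with assumed values that you would adjust per data source (option names follow the Sidekiq Enterprise rate-limiting documentation: wait_timeout controls how long a thread blocks waiting for capacity before the job is rescheduled, lock_timeout how long one concurrent slot may be held):

# Hypothetical tuning values - adjust per data source.
PRICE_LIMITER = Sidekiq::Limiter.concurrent('price', 10, wait_timeout: 5, lock_timeout: 60)
NEWS_LIMITER  = Sidekiq::Limiter.bucket('news', 10, :second, wait_timeout: 5)

A slow, flaky source wants a short wait_timeout (fail fast, retry later); a fast one can afford to block briefly and keep throughput high.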

Page 117

By now you have much better performance.

However, the throttling control of your target server may not be static.

Many websites throttle dynamically.

Page 118

If throttling is detected, pause your workers for a while

Pages 119-126

[Diagram, built up across Pages 119-126: Redis (the job queue) holds several queues - default, critical, low, and yahoo - and worker threads pull jobs from them. When throttling is detected, the yahoo queue is paused so worker threads stop pulling from it; a job scheduled a few seconds later in another queue "unpauses" it, and the yahoo queue resumes after that unpause job executes.]

Pages 127-129

class SomeWorker
  include Sidekiq::Worker

  def perform
    # try to crawl something
    # ...
    if throttled
      queue_name = self.class.get_sidekiq_options['queue']
      queue = Sidekiq::Queue.new(queue_name)
      queue.pause!
      ResumeJobQueueWorker.perform_in(30.seconds, queue_name)
    end
  end
end

class ResumeJobQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :queue_control, unique: :until_executed

  def perform(queue_name)
    queue = Sidekiq::Queue.new(queue_name)
    queue.unpause! if queue.paused?
  end
end

Page 130

The queue for ResumeJobQueueWorker MUST NOT be the paused queue itself, or the unpause job would never run.

We have a dedicated queue (queue_control) for ResumeJobQueueWorker.

Page 131

Decrease Sidekiq server poll interval for more precise timing control
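A sketch of that setting (the value of 2 is an assumption to tune for your load): Sidekiq's scheduled-job poller only checks for due jobs periodically, so lowering the average interval shortens the delay before a scheduled "unpause" job actually runs.

Sidekiq.configure_server do |config|
  # Poll for scheduled jobs roughly every 2 seconds instead of the default.
  config.average_scheduled_poll_interval = 2
end

Note the trade-off: a lower interval means more Redis polling from every Sidekiq process.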

Page 132

Queue pausing alleviates throttling issues. Can we do even better?

Page 133

Most throttling controls aim to block requests from the same IP address

Page 134

We can change our IP address via a proxy service

Pages 135-142

[Diagram, built up across Pages 135-142: without a proxy, every request from the Sidekiq server (a.b.c.d) reaches the target server from the same IP. With a proxy service, each request goes to a single endpoint that fans out across proxy servers (e.f.g.h, i.j.k.l, m.n.o.p, q.r.s.t), so each request reaches the target server from a different IP.]
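Routing a request through such a proxy endpoint can be sketched with Ruby's standard library; proxy.example.com, port 8080, and target.example.com are placeholder values for whatever your proxy service and target actually are:

```ruby
require 'net/http'

# Placeholder proxy endpoint - substitute your proxy service's host/port.
PROXY_HOST = 'proxy.example.com'
PROXY_PORT = 8080

# Net::HTTP accepts a proxy host/port after the target address; every
# request made through this client is sent via the proxy. No connection
# is opened until a request is actually made.
http = Net::HTTP.new('target.example.com', 80, PROXY_HOST, PROXY_PORT)

http.proxy?        # => true
http.proxy_address # => "proxy.example.com"
# http.get('/') would now reach the target from the proxy's IP.
```

From the crawler's side nothing else changes: the same workers and limiters apply, only the source IP the target server sees is different per request.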

Page 143

• Inherent problems of Unix Cron:

• Unreliable scheduling

• Hard to prioritize jobs by popularity

• High availability is not easy

• Not easy to deal with bandwidth throttling

Page 144
Page 145

• With Sidekiq (Enterprise) and a proper design, the following problems are solved:

• Slow crawling

• Inefficient - unable to retry only the failed jobs

• Unpredictable server loading

• Scaling out is not easy

• Inherent problems of Unix Cron

• Not easy to deal with bandwidth throttling

Pages 146-147