
Storm at spider.io - London Storm Meetup 2013-06-18


DESCRIPTION

Slides from my talk at the London Storm Meetup on 2013-06-18. Charting our journey from being a Storm early adopter, to our freeze of Storm releases and switch to batch processing only, to us coming full circle and implementing new fraudulent traffic algorithms with Trident. You might like our blog: http://www.spider.io/blog


Page 1: Storm at spider.io - London Storm Meetup 2013-06-18

Storm at spider.io

Cleaning up fraudulent traffic on the internet

http://xkcd.com/570/

Page 2: Storm at spider.io - London Storm Meetup 2013-06-18

Ashley Brown, Chief Architect

Using Storm since September 2011

Based in the West End

Founded in early 2010

Focused on (fighting) advertising fraud since 2011

Page 3: Storm at spider.io - London Storm Meetup 2013-06-18

What I'll cover

This is a case study: how Storm fits into our architecture and business

NO CODE

NO WORKER-QUEUE DIAGRAMS¹

I assume you've seen them all before

Page 4: Storm at spider.io - London Storm Meetup 2013-06-18

Our Rules

Why did we pick Storm in the first place?

Page 5: Storm at spider.io - London Storm Meetup 2013-06-18

Use the right tool for the job

If a piece of software isn't helping the business, chuck it out

Only jump on a bandwagon if it's going in your direction

Douglas de Jager, CEO

Page 6: Storm at spider.io - London Storm Meetup 2013-06-18

Don't write it if you can download it

If there's an open source project that does what's needed, use that

Sometimes this means throwing out home-grown code when a new project is released (e.g. Storm)

Ben Hodgson, Software Engineer

Page 7: Storm at spider.io - London Storm Meetup 2013-06-18

Our goals

Why do we turn up in the morning?

Page 8: Storm at spider.io - London Storm Meetup 2013-06-18

Find fraudulent website traffic

Collect billions of server and client-side signals per month

Sift, sort, analyse and condense

Identify automated, fraudulent and suspicious traffic

Joe Tallett, Software Engineer

Page 9: Storm at spider.io - London Storm Meetup 2013-06-18

Protect against it

Give our customers the information they need to protect their businesses from bad traffic

This means: give them clean data to make business decisions

Simon Overell, Chief Scientist

Page 10: Storm at spider.io - London Storm Meetup 2013-06-18

Expose it

Work with partners to reveal the scale of the problem

Drive people to solutions which can eliminate fraud

Page 11: Storm at spider.io - London Storm Meetup 2013-06-18

Cut it off

Eliminate bad traffic sources by cutting off their revenue

Build reactive solutions that stop fraud being profitable

Vegard Johnsen, CCO

Page 12: Storm at spider.io - London Storm Meetup 2013-06-18

Storm & our timeline

Page 13: Storm at spider.io - London Storm Meetup 2013-06-18

Storm solved a scaling problem

[Timeline diagram: monthly traffic volume plotted against our architecture, from pre-history (Summer 2010), through Storm's release, to the present day. Markers 1, 2 and 3 are the key moments covered on the following slides.]

● Pre-history (Summer 2010): 60 million impressions/month, 60 million signals/month. RabbitMQ queues; Python workers; Hypertable as the API datastore; Hadoop for batch analysis.

● Around Storm's release: 60 million impressions/month, 240 million signals/month. RabbitMQ queues; Python workers; VoltDB as the API datastore with real-time joins; no batch analysis (this setup was pretty reliable!); custom cluster management + worker scaling.

● After key moment 1: 1.5 billion impressions/month, 6 billion signals/month. RabbitMQ queues; Storm topologies for in-memory joins; HBase as the API datastore; Cascading for post-failure restores (too much data to do without!).

● Present day: billions and billions and billions of impressions and signals per month. Logging via CDN; Cascading for data analysis; Hive for aggregations; high-level aggregates in MySQL.

Page 14: Storm at spider.io - London Storm Meetup 2013-06-18

Key moments

● Enter the advertising anti-fraud market
○ 30x increase in impression volume
○ Bursty traffic
○ Existing queue + worker system not robust enough
○ I can do without the 2am wake up calls

● Enter Storm.

(This is key moment 1 on the timeline.)

Page 15: Storm at spider.io - London Storm Meetup 2013-06-18

[Architecture diagram: server- and client-side signals feed a RabbitMQ cluster (scalable), which feeds a Python worker cluster² (scalable), which writes to VoltDB (scalable at launch-time only).]

1) I lied about the queue/worker diagrams.
2) We have a bar in the office; our workers are happy.

Page 16: Storm at spider.io - London Storm Meetup 2013-06-18

What's wrong with that?

● Queue/worker scaling system relied on our code; it worked, but:
○ Only 1, maybe 2 pairs of eyes had ever looked at it
○ Code becomes a maintenance liability as soon as it is written
○ Writing infrastructure software is not one of our goals

● Maintaining in-memory database for the volume we wanted was not cost-effective (mainly due to AWS memory costs)

● Couldn't scale dynamically: full DB cluster restart required

Page 17: Storm at spider.io - London Storm Meetup 2013-06-18

The Solution

● Migrate our internal event-stream-based workers to Storm
○ Whole community to check and maintain code

● Move to HBase for long-term API datastore
○ Keep data for longer - better trend decisions

● VoltDB joins → in-memory joins & HBase
○ Small in-memory join window, then flushed
○ Full 15-minute join achieved by reading from HBase (see the sketch below)
○ Trident solves this now - wasn't around then
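To make the last bullet concrete, here is a minimal sketch of that kind of windowed in-memory join. It is not spider.io's code: the field names (impressionId, side, payload), the one-minute window and the separate "unjoined" hand-off stream are illustrative assumptions, and the actual HBase write is left to a downstream bolt.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Holds signals for the same impression in a small in-memory window. When both the
// server-side and client-side signal arrive, the joined pair is emitted; anything still
// unmatched when the window expires is emitted on a separate stream so it can be flushed
// (e.g. to HBase) and joined later.
public class WindowedJoinBolt extends BaseRichBolt {

    private static final long WINDOW_MS = 60 * 1000L;  // illustrative window length

    private OutputCollector collector;
    private Map<String, Pending> pending;

    private static class Pending {
        long firstSeen;
        Object serverSignal;
        Object clientSignal;
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.pending = new HashMap<String, Pending>();
    }

    @Override
    public void execute(Tuple tuple) {
        String id = tuple.getStringByField("impressionId");
        Pending p = pending.get(id);
        if (p == null) {
            p = new Pending();
            p.firstSeen = System.currentTimeMillis();
            pending.put(id, p);
        }
        if ("server".equals(tuple.getStringByField("side"))) {
            p.serverSignal = tuple.getValueByField("payload");
        } else {
            p.clientSignal = tuple.getValueByField("payload");
        }
        if (p.serverSignal != null && p.clientSignal != null) {
            collector.emit("joined", tuple, new Values(id, p.serverSignal, p.clientSignal));
            pending.remove(id);
        }
        flushExpired(tuple);
        collector.ack(tuple);
    }

    // Entries older than the window are handed off instead of being held in memory;
    // the full 15-minute join is completed later by reading them back from the datastore.
    private void flushExpired(Tuple anchor) {
        long cutoff = System.currentTimeMillis() - WINDOW_MS;
        Iterator<Map.Entry<String, Pending>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Pending> entry = it.next();
            if (entry.getValue().firstSeen < cutoff) {
                Pending old = entry.getValue();
                collector.emit("unjoined", anchor, new Values(entry.getKey(), old.serverSignal, old.clientSignal));
                it.remove();
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("joined", new Fields("impressionId", "serverSignal", "clientSignal"));
        declarer.declareStream("unjoined", new Fields("impressionId", "serverSignal", "clientSignal"));
    }
}
```

A downstream bolt would write the "unjoined" stream to HBase, where the later full-window join picks it up.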

Page 18: Storm at spider.io - London Storm Meetup 2013-06-18

[Architecture diagram: server- and client-side signals feed a RabbitMQ cluster (scalable), which feeds the Storm cluster (scalable), which writes to HBase (scalable). Cascading on Amazon Elastic MapReduce rebuilds HBase from logs as an EMERGENCY RESTORE path.]

Page 19: Storm at spider.io - London Storm Meetup 2013-06-18

How long (Storm migration)?

● 17 Sept 2011: Storm released
● 21 Sept 2011: Test cluster processing
● 29 Sept 2011: Substantial implementation of core workers
● 30 Sept 2011: Python workers running under Storm control
● Total engineers: 1

Page 20: Storm at spider.io - London Storm Meetup 2013-06-18

The Results (redacted)

● Classifications available within 15 minutes

● Dashboard provides overview of 'legitimate' vs other traffic

● Better data on which to make business decisions

Page 21: Storm at spider.io - London Storm Meetup 2013-06-18

Lessons

● Storm is easy to install & run

● First iteration: use Storm for control and scaling of existing queue+worker systems (see the sketch after this list)

● Second iteration: use Storm to provide redundancy via acking/replays

● Third iteration: remove intermediate queues to realise performance benefits
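As a concrete illustration of that first iteration (not spider.io's actual setup): Storm's multilang support lets an existing Python worker run under Storm control via a ShellBolt wrapper. The script name worker.py, the output fields and the queue spout in the trailing comment are assumptions.

```java
import java.util.Map;

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

// Wraps an existing Python worker so Storm supervises, schedules and scales it.
// "worker.py" is a hypothetical script that speaks Storm's multilang protocol
// (e.g. via the storm.py helper shipped with Storm).
public class PythonWorkerBolt extends ShellBolt implements IRichBolt {

    public PythonWorkerBolt() {
        super("python", "worker.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("impressionId", "result"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}

// Wiring it in behind a queue spout (RabbitMQSpout here is a hypothetical stand-in
// for whatever queue-consuming spout is in use):
//   TopologyBuilder builder = new TopologyBuilder();
//   builder.setSpout("signals", new RabbitMQSpout(), 4);
//   builder.setBolt("python-worker", new PythonWorkerBolt(), 16).shuffleGrouping("signals");
```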

Page 22: Storm at spider.io - London Storm Meetup 2013-06-18

A Quick Aside on DRPC

● Our initial API implementation in HBase was slow

● Large number of partial aggregates to consume, all handled by a single process

● Storm's DRPC provided a 10x speedup: machines across the cluster pulled partials from HBase and generated 'mega-partials', with a final reducer step producing the final totals (see the sketch below)
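For the shape of such a DRPC topology, here is a hedged sketch using Storm's LinearDRPCTopologyBuilder and DRPCClient (the 0.8-era API). The bolt classes, parallelism figures, and the function and argument names are hypothetical placeholders, not spider.io's implementation.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.tuple.Fields;
import backtype.storm.utils.DRPCClient;

public class AggregateTotalsTopology {
    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("aggregate-totals");

        // Hypothetical bolts: fan the request out so many tasks each pull a slice of the
        // partial aggregates from HBase...
        builder.addBolt(new FetchPartialsBolt(), 16);
        // ...roll the slices up into a handful of "mega-partials" across the cluster...
        builder.addBolt(new CombinePartialsBolt(), 4).shuffleGrouping();
        // ...and route everything for one request to a single task, which acts as the
        // reducer and emits the final totals.
        builder.addBolt(new ReduceTotalsBolt(), 1).fieldsGrouping(new Fields("id"));

        StormSubmitter.submitTopology("aggregate-totals", new Config(), builder.createRemoteTopology());

        // Callers (normally a separate process) then ask the DRPC server synchronously:
        DRPCClient client = new DRPCClient("drpc-host", 3772);
        String totals = client.execute("aggregate-totals", "2013-06-17");  // argument format is illustrative
        System.out.println(totals);
    }
}
```

One constraint worth noting: each intermediate bolt must pass the DRPC request "id" through as the first field of its output tuples so the builder can route the result back to the right caller.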

Page 23: Storm at spider.io - London Storm Meetup 2013-06-18

Storm solved a scaling problem

[The timeline diagram from Page 13, shown again; we are now at key moment 2.]

Page 24: Storm at spider.io - London Storm Meetup 2013-06-18

Key moments

● Enabled across substantial internet ad inventory
○ 10x increase in impression volume
○ Low-latency, always-up requirements
○ Competitive marketplace

● Exit Storm.

(This is key moment 2 on the timeline.)

Page 25: Storm at spider.io - London Storm Meetup 2013-06-18

What happened?

● Stopped upgrading at 0.6.2
○ Big customers unable to use real-time data at the time
○ An unnecessary cost
○ Batch options provided a better resiliency and cost profile

● Too expensive to provide very low-latency data collection in a compatible way

● Legacy systems continue to run...

Page 26: Storm at spider.io - London Storm Meetup 2013-06-18

How reliable?

● Legacy topologies still running:

Page 27: Storm at spider.io - London Storm Meetup 2013-06-18

[Architecture diagram: server- and client-side signals, logged via the CDN. The LEGACY path still runs: RabbitMQ cluster (scalable) → Storm cluster (scalable) → HBase (scalable), with Cascading on Amazon Elastic MapReduce restoring from logs. The batch path works from the logs: Cascading on EMR, Hive on EMR for aggregate generation, then a bulk export.]

Page 28: Storm at spider.io - London Storm Meetup 2013-06-18

The Results

● Identification of a botnet cluster attracting international press

● Many other sources of fraud under active investigation

● Using Amazon EC2 spot instances for batch analysis when cheapest - not paying for always-up

Page 29: Storm at spider.io - London Storm Meetup 2013-06-18

The Results

Page 30: Storm at spider.io - London Storm Meetup 2013-06-18

Lessons

● Benefit of real-time processing is a business decision - batch may be more cost effective

● Storm is easy and reliable to use, but you need supporting infrastructure around it (e.g. queue servers)

● It may be the supporting infrastructure that gives you problems...

Page 31: Storm at spider.io - London Storm Meetup 2013-06-18

Storm solved a scaling problem

[The timeline diagram from Page 13, shown again; we are now at key moment 3.]

Page 32: Storm at spider.io - London Storm Meetup 2013-06-18

Key moments

● Arms race begins
○ Fraudsters in control of large botnets are able to respond quickly
○ Sources and signatures of fraud will change faster and faster in the future, as we close off more avenues
○ Growing demand for more immediate classifications than batch-only processing can provide

● Welcome back, Storm.

(This is key moment 3 on the timeline.)

Page 33: Storm at spider.io - London Storm Meetup 2013-06-18

What now?

● Returning to Storm, paired with Mahout

● Crunching billions and billions of impressions using Cascading + Mahout

● Real-time response using Trident + Mahout (see the sketch after this list)

○ Known-bad signatures identify new botnet IPs, suspect publishers

○ Online learning adapts models to emerging threats
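A minimal sketch of what that Trident piece could look like, assuming a hypothetical ClassifyImpression function standing in for a Mahout-backed classifier, an impression spout supplied by the caller, and running counts per label; it shows the shape of "real-time response using Trident", not spider.io's actual topology.

```java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.spout.IBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class RealTimeClassification {

    // Stand-in for a Mahout-backed classifier: labels each impression so the topology
    // can keep running counts per label (e.g. suspect publishers, botnet IPs).
    public static class ClassifyImpression extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            Object impression = tuple.getValue(0);
            String label = looksAutomated(impression) ? "suspect" : "legitimate";
            collector.emit(new Values(label));
        }

        private boolean looksAutomated(Object impression) {
            return false;  // placeholder for the real model's decision
        }
    }

    public static TridentTopology build(IBatchSpout impressionSpout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("impressions", impressionSpout)
                .each(new Fields("impression"), new ClassifyImpression(), new Fields("label"))
                .groupBy(new Fields("label"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}
```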

Page 34: Storm at spider.io - London Storm Meetup 2013-06-18

Lessons

● As your business changes, your architecture must change

● Choose Storm if:
○ you have existing ad-hoc event-streaming systems that could use more resiliency
○ your business needs a new real-time analysis component that fits an event-streaming model
○ you're happy to run appropriate infrastructure around it

● Don't choose Storm if:
○ you have no use for real-time data
○ you only want to use it because it's cool

Page 35: Storm at spider.io - London Storm Meetup 2013-06-18

More Lessons

● Using Cascading for Hadoop jobs and Storm for real-time is REALLY handy
○ Retains the event-streaming paradigm
○ No need to completely re-think the implementation when switching between them
○ In some circumstances can share code
○ We have a library which provides common analysis components for both implementations (see the sketch below)

● A reasonably managed Storm cluster will stay up for ages.
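A hedged sketch of the shared-library idea, assuming the common analysis logic lives in plain framework-free classes with thin adapters on each side; the names (SignalScorer, ScoreBolt, ScoreFunction) and the toy scoring heuristic are illustrative, not the real library.

```java
import java.io.Serializable;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.TupleEntry;

// The shared analysis component: no Storm or Cascading types anywhere.
class SignalScorer implements Serializable {
    double score(String userAgent, int requestsPerMinute) {
        // toy heuristic standing in for the real analysis components
        return (userAgent == null || userAgent.isEmpty() || requestsPerMinute > 600) ? 1.0 : 0.0;
    }
}

// Storm adapter: streams one tuple at a time through the shared scorer.
class ScoreBolt extends BaseBasicBolt {
    private final SignalScorer scorer = new SignalScorer();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        double s = scorer.score(tuple.getStringByField("userAgent"),
                                tuple.getIntegerByField("requestsPerMinute"));
        collector.emit(new Values(tuple.getStringByField("impressionId"), s));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new backtype.storm.tuple.Fields("impressionId", "score"));
    }
}

// Cascading adapter: runs the same scorer over log tuples in a batch flow on EMR.
class ScoreFunction extends BaseOperation implements Function {
    private final SignalScorer scorer = new SignalScorer();

    ScoreFunction() {
        super(new cascading.tuple.Fields("impressionId", "score"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall call) {
        TupleEntry args = call.getArguments();
        double s = scorer.score(args.getString("userAgent"), args.getInteger("requestsPerMinute"));
        call.getOutputCollector().add(new cascading.tuple.Tuple(args.getString("impressionId"), s));
    }
}
```

Because the scorer has no framework dependencies, the same jar can be deployed to the Storm cluster and shipped with the Cascading job on EMR.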

Page 36: Storm at spider.io - London Storm Meetup 2013-06-18

http://xkcd.com/749/