Storm: Distributed and fault-tolerant realtime computation Ferran Galí i Reniu @ferrangali 19/06/2014


Page 1: Storm: Distributed and fault tolerant realtime computation

Storm: Distributed and fault-tolerant realtime computation

Ferran Galí i Reniu

@ferrangali

19/06/2014

Page 2: Storm: Distributed and fault tolerant realtime computation

Ferran Galí i Reniu

● UPC - FIB
● Trovit
  ○ Hadoop
  ○ Lucene/Solr
  ○ Storm

Page 3: Storm: Distributed and fault tolerant realtime computation

Big Data

● Too much data
  ○ Store
  ○ Compute
  ○ Analyse
● Distributed systems
  ○ Provide horizontal scalability

Page 4: Storm: Distributed and fault tolerant realtime computation

Distributed Systems

● Hadoop

[Diagram: a file stored across three HDFS nodes]

Page 5: Storm: Distributed and fault tolerant realtime computation

Distributed Systems

● Hadoop

[Diagram: a file stored across three nodes, each running HDFS and MapReduce]

Page 6: Storm: Distributed and fault tolerant realtime computation

Distributed Systems

● Hadoop
  ○ Huge files
  ○ Useful for batch
  ○ High latency
  ○ No real time

Page 7: Storm: Distributed and fault tolerant realtime computation
Page 8: Storm: Distributed and fault tolerant realtime computation

Storm

“Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!”

http://storm.incubator.apache.org/

Page 9: Storm: Distributed and fault tolerant realtime computation

Storm

● Who’s using it?

Page 10: Storm: Distributed and fault tolerant realtime computation

Storm

● Tuple
  ○ Ordered list of elements
  ○ Any type

[Diagram: a tuple holding String, Integer, SerializedObject, ...]

Page 11: Storm: Distributed and fault tolerant realtime computation

Storm

● Stream
  ○ Unbounded sequence of tuples

[Diagram: a stream drawn as a row of tuples]

Page 12: Storm: Distributed and fault tolerant realtime computation

Storm

● Spout
  ○ Source of streams
  ○ From data sources: queues, APIs...

[Diagram: a spout emitting a stream of tuples]

Page 13: Storm: Distributed and fault tolerant realtime computation

Storm

● Bolt
  ○ Consumes streams
  ○ Does some processing (transform, join, ...)
  ○ Emits streams

[Diagram: a bolt consuming input streams of tuples and emitting an output stream]

Page 14: Storm: Distributed and fault tolerant realtime computation

Storm

● Topology
  ○ Graph of spouts & bolts
  ○ Runs forever

Page 15: Storm: Distributed and fault tolerant realtime computation

Architecture

[Diagram: Nimbus (master), a three-node Zookeeper cluster (coordinator), and two Supervisors (workers) with four slots each]

Page 16: Storm: Distributed and fault tolerant realtime computation

Architecture

[Diagram: a Supervisor with four slots]

● Slot: worker process
  ○ Single JVM
  ○ Tasks run as threads

Page 17: Storm: Distributed and fault tolerant realtime computation

[Diagram: a topology whose six components have parallelism hints 4, 1, 2, 2, 3 and 4, deployed on two Supervisors with four slots each]

Worker processes = 8

Page 18: Storm: Distributed and fault tolerant realtime computation

[Diagram: a topology whose six components have parallelism hints 4, 1, 2, 2, 3 and 4, deployed on two Supervisors]

Worker processes = 8

Combined parallelism = 4 + 1 + 2 + 2 + 3 + 4 = 16

Tasks per worker = 16 / 8 = 2
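The arithmetic on this slide can be reproduced in a few lines. This is only a sketch of the calculation shown above, not Storm code; the function name `tasks_per_worker` is made up for illustration.

```python
def tasks_per_worker(parallelism_hints, worker_processes):
    """Return (combined parallelism, tasks per worker), as computed on the slide.

    Hypothetical helper: it only mirrors the slide's arithmetic and does not
    call any Storm API.
    """
    combined = sum(parallelism_hints)
    return combined, combined / worker_processes


# Parallelism hints of the six components in the diagram.
hints = [4, 1, 2, 2, 3, 4]
combined, per_worker = tasks_per_worker(hints, worker_processes=8)
print(combined, per_worker)  # 16 2.0
```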

Page 19: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: File → FileSpout (parallelism hint = 2) → lines → SplitterBolt (parallelism hint = 3) → words → CounterBolt (parallelism hint = 2)]
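As a rough illustration of the data flow, the three stages can be imitated with plain Python generators. This is a single-process sketch, not a Storm topology: the function names only echo the component names on the slide, and stripping punctuation in the splitter is an assumption.

```python
from collections import Counter


def file_spout(lines):
    """Source of the stream: emits one line at a time."""
    for line in lines:
        yield line


def splitter_bolt(line_stream):
    """Consumes lines and emits one word per tuple (punctuation stripped)."""
    for line in line_stream:
        for word in line.split():
            yield word.strip(".,!")


def counter_bolt(word_stream):
    """Consumes words and keeps a running count per word."""
    counts = Counter()
    for word in word_stream:
        if word:
            counts[word] += 1
    return counts


lines = [
    "Storm is a distributed",
    "realtime computation system. Storm provides a",
]
counts = counter_bolt(splitter_bolt(file_spout(lines)))
print(counts["Storm"], counts["a"], counts["system"])  # 2 2 1
```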

Page 20: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt. The input file contains the text:]

"Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!"

Page 21: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt. The input file holds the Storm description quoted on page 8. FileSpout emits a first tuple: "Storm is a distributed"]

Page 22: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt. The input file holds the Storm description quoted on page 8. FileSpout emits two tuples: "Storm is a distributed" and "realtime computation system. Storm provides a"]

Page 23: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt. The input file holds the Storm description quoted on page 8. FileSpout emits two tuples, "Storm is a distributed" and "realtime computation system. Storm provides a", which reach SplitterBolt through a shuffle grouping]

Page 24: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt. The two line tuples, "Storm is a distributed" and "realtime computation system. Storm provides a", reach SplitterBolt through a shuffle grouping. SplitterBolt emits one tuple per word: Storm, is, a, distributed, realtime, computation, system, provides, Storm, a]

Page 25: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt, with a shuffle grouping between FileSpout and SplitterBolt. SplitterBolt emits the words Storm, is, a, distributed, realtime, computation, system, provides, Storm, a. CounterBolt shows ten separate entries, each counted x1, including two entries for Storm and two for a]

Page 26: Storm: Distributed and fault tolerant realtime computation

Example: Word Count

[Diagram: FileSpout → SplitterBolt → CounterBolt, with a shuffle grouping between FileSpout and SplitterBolt and a fields grouping between SplitterBolt and CounterBolt. With the fields grouping, the eight distinct words are counted as Storm x2 and a x2, and x1 each for is, distributed, realtime, computation, system and provides]

Page 27: Storm: Distributed and fault tolerant realtime computation

Groupings

● Shuffle grouping
● Fields grouping
● All grouping
● Global grouping
● Direct grouping
● Local or shuffle grouping
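The difference between the first two groupings can be sketched outside Storm. In this illustration, shuffle grouping picks a random task for each tuple, while fields grouping picks the task from a hash of the word, so equal words always meet in the same task. The function names and the use of CRC32 as the hash are assumptions made for the example, not Storm's implementation.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(words, n_tasks, seed=0):
    """Send each word to a randomly chosen task (seeded for repeatability)."""
    rng = random.Random(seed)
    tasks = [Counter() for _ in range(n_tasks)]
    for word in words:
        tasks[rng.randrange(n_tasks)][word] += 1
    return tasks


def fields_grouping(words, n_tasks):
    """Send each word to the task selected by a stable hash of the word."""
    tasks = [Counter() for _ in range(n_tasks)]
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % n_tasks
        tasks[index][word] += 1
    return tasks


words = ["Storm", "is", "a", "distributed", "realtime",
         "computation", "system", "provides", "Storm", "a"]

# With shuffle grouping the two "Storm" tuples may land on different tasks,
# leaving partial counts; with fields grouping they always share one task.
for name, grouping in (("shuffle", shuffle_grouping), ("fields", fields_grouping)):
    per_task = [dict(task) for task in grouping(words, n_tasks=2)]
    print(name, per_task)
```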

Page 28: Storm: Distributed and fault tolerant realtime computation

Fault-tolerance

[Diagram: Nimbus, a three-node Zookeeper cluster, and two Supervisors]

Page 29: Storm: Distributed and fault tolerant realtime computation

Fault-tolerance

● Worker dies
  ○ Supervisor will restart it
● Worker dies too many times
  ○ Nimbus will reassign it to another node
● Node dies
  ○ Nimbus will reassign its tasks to another node
● Nimbus is not a SPOF
● Nimbus & Supervisors are fail-fast

Page 30: Storm: Distributed and fault tolerant realtime computation

Guaranteeing message processing

● Through API
  ○ ack
  ○ fail
● Manual tuple replay
  ○ e.g. the spout emits the message again, identified by a specific id
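A minimal sketch of manual replay, assuming a spout that remembers every emitted message by id until it is acked and re-queues it when it fails. The class `ReplayingSpout`, the `run` driver and the `flaky` processor are hypothetical names for this illustration; only the `ack` and `fail` callbacks come from the slide.

```python
from collections import deque


class ReplayingSpout:
    """Toy spout: tracks pending messages by id and replays failed ones."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (msg_id, payload) pairs
        self.pending = {}                        # emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Fully processed: forget the message.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: queue the same id and payload again.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, process):
    """Drive the spout until every message has been acked."""
    processed = []
    while (emitted := spout.next_tuple()) is not None:
        msg_id, payload = emitted
        if process(msg_id, payload):
            spout.ack(msg_id)
            processed.append(payload)
        else:
            spout.fail(msg_id)
    return processed


attempts = {}

def flaky(msg_id, payload):
    """Hypothetical processor that fails the first attempt on "is"."""
    attempts[msg_id] = attempts.get(msg_id, 0) + 1
    return not (payload == "is" and attempts[msg_id] == 1)


spout = ReplayingSpout(["Storm", "is", "a", "distributed"])
result = run(spout, flaky)
print(result)  # ['Storm', 'a', 'distributed', 'is']
```

The failed message is processed last because the sketch appends replays to the end of the queue; every message is still handled exactly once successfully.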

Page 31: Storm: Distributed and fault tolerant realtime computation

Guaranteeing message processing

● When is a message “fully processed”?

[Diagram: the tuple "Storm is a distributed" is split into the tuples Storm, is, distributed and a; three are marked Ok and one is marked Fail]

● Solutions
  ○ Transactional Topologies
  ○ Trident framework

Page 32: Storm: Distributed and fault tolerant realtime computation

Yet another example

[Diagram: TwitterSpout sends tweets to SplitterBolt (shuffle grouping); SplitterBolt sends words to CounterBolt (fields grouping); signals are delivered with an all grouping; the topology also includes a CommitBolt and a DB]

https://github.com/ferrangali/betabeers-storm

Page 33: Storm: Distributed and fault tolerant realtime computation

Batch + Real time

● Lambda architecture

[Diagram: new data feeds the batch layer, which feeds the serving layer]

● Batch layer
  ○ High latency
  ○ Reprocesses all data

Page 34: Storm: Distributed and fault tolerant realtime computation

Batch + Real time

● Lambda architecture

[Diagram: new data feeds both the batch layer and the speed layer, which feed the serving layer]

● Speed layer
  ○ Low latency
  ○ Fast & incremental algorithms
  ○ Eventually overridden by batch layer
● Batch layer
  ○ High latency
  ○ Reprocesses all data
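The interplay of the two layers can be sketched with a simple counter. This toy model is an assumption-laden illustration, not the architecture of any real system: the class `LambdaCounter` and its methods are invented for the example. New data updates an incremental speed view at once, and a later batch run recomputes the view from all data and discards the speed view it overrides.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts occurrences per key."""

    def __init__(self):
        self.master = []   # all data ever received (input of the batch layer)
        self.batch = {}    # view produced by the last batch run
        self.speed = {}    # incremental counts since the last batch run

    def new_data(self, key):
        # Speed layer: low latency, incremental update.
        self.master.append(key)
        self.speed[key] = self.speed.get(key, 0) + 1

    def run_batch(self):
        # Batch layer: high latency, reprocesses all data.
        self.batch = dict(Counter(self.master))
        self.speed = {}    # the speed view is overridden by the batch view

    def query(self, key):
        # Serving: merge both views.
        return self.batch.get(key, 0) + self.speed.get(key, 0)


counter = LambdaCounter()
counter.new_data("ad-1")
counter.new_data("ad-1")
before = counter.query("ad-1")  # served from the speed view only
counter.run_batch()
counter.new_data("ad-1")
after = counter.query("ad-1")   # batch view plus one fresh event
print(before, after)  # 2 3
```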

Page 35: Storm: Distributed and fault tolerant realtime computation

Storm

● Who’s using it?

Page 36: Storm: Distributed and fault tolerant realtime computation

Trovit

● 40 countries
● 5 verticals
● Hundreds of millions of ads

Page 37: Storm: Distributed and fault tolerant realtime computation

Trovit

● Batch layer
  ○ MapReduce pipeline over HDFS

[Diagram: xml and kafka inputs feed a MapReduce pipeline on HDFS: Filter → Enrich → Dedup → Index]

Page 38: Storm: Distributed and fault tolerant realtime computation

Trovit

● Speed layer
  ○ Storm topology

[Diagram: Feeds Spout (xml) and Kafka Spout (kafka) emit ads to Processor Bolt, which emits rich ads to Indexer Bolt. Annotations: group by index; commit in batch every 5 minutes]

Page 39: Storm: Distributed and fault tolerant realtime computation

Trovit

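[Diagram: the batch layer (xml and kafka inputs, then Filter → Enrich → Dedup → Index on HDFS) shown alongside the speed layer (ads processed into rich ads), together with HBase and Zookeeper]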

Page 40: Storm: Distributed and fault tolerant realtime computation

Questions?

Ferran Galí i Reniu

@ferrangali

19/06/2014