Short introduction to Storm


Presentation given in class for Cloud Computing at Universitat Politècnica de Catalunya

STORM

DISTRIBUTED AND FAULT-TOLERANT REALTIME COMPUTATION

Jimmy Zöger

CLC < FIB < UPC

2013-06-03

INTRODUCTION

• Like Hadoop for realtime processing instead of batch

• Open source

• Developed by BackType, which was later acquired by Twitter

• Developed for analyzing Twitter data

• Similar to S4

STORM TOPOLOGY

SPOUTS

• The component responsible for feeding messages into the topology

• Emits tuples

• Can be reliable or unreliable (ack() and fail())
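The difference between a reliable and an unreliable spout can be sketched with a toy in-process model. This is not Storm's API: the class `ReliableSpout`, the method `next_tuple`, and the integer message ids are assumptions made for illustration. Only the `ack()` / `fail()` idea comes from the slide. A reliable spout remembers what it has emitted until it is acknowledged and replays it on failure.

```python
from collections import deque
from itertools import count


class ReliableSpout:
    """Toy spout (hypothetical): emits messages and replays any that fail downstream."""

    def __init__(self, messages):
        self.queue = deque(messages)   # messages waiting to be emitted
        self.pending = {}              # msg_id -> message, emitted but not yet acked
        self._ids = count(1)

    def next_tuple(self):
        """Emit the next (msg_id, message) pair, or None if nothing is waiting."""
        if not self.queue:
            return None
        msg_id = next(self._ids)
        message = self.queue.popleft()
        self.pending[msg_id] = message
        return msg_id, message

    def ack(self, msg_id):
        """The tuple was fully processed downstream: forget the message."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Processing failed: put the message back on the queue for replay."""
        message = self.pending.pop(msg_id, None)
        if message is not None:
            self.queue.append(message)


spout = ReliableSpout(["a", "b"])
first_id, first = spout.next_tuple()
spout.fail(first_id)                         # "a" is re-queued behind "b"
second_id, second = spout.next_tuple()
spout.ack(second_id)                         # "b" is done
replayed_id, replayed = spout.next_tuple()   # "a" comes out again under a new id
```

An unreliable spout in this model would simply skip the `pending` bookkeeping and ignore `ack` and `fail`, trading the replay guarantee for less state.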

INTEGRATION

• Kestrel

• RabbitMQ

• Kafka

• JMS

• Integration is easy with the simple Spout abstraction

BOLTS

• A component that takes tuples as input and produces tuples as output

• Can do filtering, joining, functions, aggregations etc.

• Does not have to process a tuple immediately; it may hold on to tuples and process them later

• Comparison with Hadoop: A bolt can be a mapper or a reducer (or anything)
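As a rough illustration of the points above, here are two toy bolts in plain Python. The class names `FilterBolt` and `BatchSumBolt`, the `execute` signature that returns a list of output tuples, and the batch size are all assumptions for this sketch and not Storm's interface. The first bolt filters; the second holds on to tuples and only emits once it has collected a batch, which is the "process later" behaviour.

```python
class FilterBolt:
    """Toy bolt (hypothetical): passes through only tuples that satisfy a predicate."""

    def __init__(self, predicate):
        self.predicate = predicate

    def execute(self, tup):
        return [tup] if self.predicate(tup) else []


class BatchSumBolt:
    """Toy bolt (hypothetical): holds tuples and emits one sum per `batch_size` inputs."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.held = []

    def execute(self, tup):
        self.held.append(tup)
        if len(self.held) < self.batch_size:
            return []                          # nothing emitted yet; the tuple is held
        total = sum(value for (value,) in self.held)
        self.held = []
        return [(total,)]


evens = FilterBolt(lambda tup: tup[0] % 2 == 0)
sums = BatchSumBolt(batch_size=2)

emitted = []
for n in range(1, 7):                          # input tuples (1,), (2,), ..., (6,)
    for kept in evens.execute((n,)):
        emitted.extend(sums.execute(kept))
```

After the loop, the tuples 2 and 4 have been summed and emitted, while 6 is still held inside `sums` waiting for a partner.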

STORM TOPOLOGY

• Spouts, bolts and streams

• Distributed

• Runs indefinitely until it is stopped

• Arbitrary complexity

• Streams that require multiple steps also require multiple bolts

• No intermediate queues for streams
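The shape of a topology (one spout feeding a chain of bolts, with tuples handed directly from step to step) can be sketched in a single process. Everything here is a simplified assumption: `sentence_spout`, `split_bolt`, `CountBolt`, and `run_topology` are made-up names, the input is finite so that the example terminates, and nothing is distributed.

```python
from collections import Counter


def sentence_spout():
    """Source of the stream; a real spout would keep producing indefinitely."""
    yield from ["the cow jumped", "the moon"]


def split_bolt(sentence):
    """First step: one sentence in, one tuple per word out."""
    return sentence.split()


class CountBolt:
    """Second step: keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1
        return (word, self.counts[word])


def run_topology(spout, counter):
    """Hand each tuple straight from one step to the next, with no queue in between."""
    output = []
    for sentence in spout:
        for word in split_bolt(sentence):
            output.append(counter.execute(word))
    return output


counter = CountBolt()
stream = run_topology(sentence_spout(), counter)
```

Two processing steps (splitting, then counting) need two bolts here, which mirrors the "multiple steps, multiple bolts" point above.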

FAULT-TOLERANCE

• Nimbus daemon and Supervisor daemons are fail-fast and stateless

• Each worker sends heartbeats to Nimbus

• Transactional topologies → Guaranteed processing

[Diagram: Nimbus, ZooKeeper nodes, and four Supervisors]
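One way to picture heartbeat-based failure detection is the small model below. It is only a sketch under stated assumptions: `HeartbeatMonitor`, the explicit `now` timestamps, and the 30-second timeout are invented for the example and say nothing about how Storm stores or checks heartbeats internally.

```python
class HeartbeatMonitor:
    """Toy monitor (hypothetical): flags workers whose last heartbeat is too old."""

    def __init__(self, timeout):
        self.timeout = timeout     # seconds of silence tolerated (assumed value)
        self.last_seen = {}        # worker_id -> time of the latest heartbeat

    def heartbeat(self, worker_id, now):
        self.last_seen[worker_id] = now

    def dead_workers(self, now):
        """Workers that have been silent for longer than the timeout."""
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=30)
monitor.heartbeat("worker-1", now=0)
monitor.heartbeat("worker-2", now=0)
monitor.heartbeat("worker-1", now=25)      # worker-2 stays silent
suspects = monitor.dead_workers(now=40)
```

In this model a worker flagged as dead would be the cue to reassign its work elsewhere; the monitor itself holds no state that cannot be rebuilt from fresh heartbeats.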

USE CASES

• Counting words!

• Realtime analytics - trending topics on Twitter

• Online machine learning

• Continuous computation

• Distributed RPC

• Extract, Transform and Load (ETL)
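The trending-topics use case boils down to counting over a moving window of recent events. The sketch below shows that idea in isolation; `TrendingTopics`, the count-based window, and the sample topics are assumptions for illustration, and a real deployment would more likely use a time-based window spread over several bolts.

```python
from collections import Counter, deque


class TrendingTopics:
    """Toy rolling counter (hypothetical): counts topics over the last `window` events."""

    def __init__(self, window):
        self.window = window
        self.recent = deque()      # the topics currently inside the window
        self.counts = Counter()

    def observe(self, topic):
        self.recent.append(topic)
        self.counts[topic] += 1
        if len(self.recent) > self.window:
            oldest = self.recent.popleft()     # slide the window forward
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]

    def top(self, n):
        return self.counts.most_common(n)


trends = TrendingTopics(window=4)
for topic in ["storm", "hadoop", "storm", "s4", "storm", "storm"]:
    trends.observe(topic)
leader = trends.top(1)
```

Because old events fall out of the window, the ranking reflects what is popular now and not what was popular over the whole history of the stream.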

FAST

One benchmark clocked it at over a million tuples processed per second per node

{x,y,z} ↠ {x,y,z} ↠ {x,y,z} ↠ {x,y,z} ↠ {x,y,z} ↠
