Storm Berkeley

Storm

Distributed and fault-tolerant realtime computation

Nathan Marz, Twitter

    Basic info

• Open sourced September 19th, 2011

• Implementation is 15,000 lines of code

• Used by over 25 companies

• >2400 watchers on GitHub (most watched JVM project)

• Very active mailing list

• >1800 messages

• >560 members

    Before Storm

Queues and workers

    Example

    (simplified)

    Example

    Workers schemify tweets

    and append to Hadoop

    Example

    Workers update statistics on URLs by

    incrementing counters in Cassandra

Scaling

Deploy, then reconfigure/redeploy

    Problems

    • Scaling is painful

    • Poor fault-tolerance

    • Coding is tedious

    What we want

    • Guaranteed data processing

    • Horizontal scalability

    • Fault-tolerance

    • No intermediate message brokers!

    • Higher level abstraction than message passing

    • “Just works”

Storm

Storm provides all of these:

• Guaranteed data processing

• Horizontal scalability

• Fault-tolerance

• No intermediate message brokers!

• Higher level abstraction than message passing

• “Just works”

Use cases

• Stream processing

• Continuous computation

• Distributed RPC

    Storm Cluster

    Storm Cluster

Master node, called Nimbus (similar to the Hadoop JobTracker)

    Storm Cluster

A ZooKeeper cluster is used for cluster coordination

    Storm Cluster

Supervisor nodes run the worker processes

    Starting a topology
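A topology jar is typically launched from the command line with the storm client; the jar name, main class, and topology name below are placeholders:

    storm jar wordcount.jar com.example.WordCountTopology word-count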

    Killing a topology
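A running topology can be killed by name with the storm client; by default Storm waits for the topology's message timeout before tearing it down:

    storm kill word-count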

    Concepts

    • Streams

    • Spouts

    • Bolts

    • Topologies

    Streams

    Unbounded sequence of tuples

    Spouts

    Source of streams

    Spout examples

    • Read from Kestrel queue

    • Read from Twitter streaming API

    Bolts

Process input streams and produce new streams

    Bolts

    • Functions

    • Filters

    • Aggregation

    •  Joins

    • Talk to databases

    Topology

    Network of spouts and bolts

    Tasks

    Spouts and bolts execute as

    many tasks across the cluster

    Task execution

    Tasks are spread across the cluster


    Stream grouping

    When a tuple is emitted, which task does it go to?

    Stream grouping

    • Shuffle grouping: pick a random task 

    • Fields grouping: mod hashing on a

    subset of tuple fields

    • All grouping: send to all tasks

    • Global grouping: pick task with lowest id
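As a rough sketch of how these groupings are declared when wiring components together (component names and bolt classes here are made up; the string-id TopologyBuilder API shown is from later Storm releases):

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("urls", new UrlSpout(), 2);

    // shuffle grouping: each tuple goes to a random "normalize" task
    builder.setBolt("normalize", new NormalizeBolt(), 4)
           .shuffleGrouping("urls");

    // fields grouping: tuples with the same "url" always reach the same task
    builder.setBolt("count", new CountBolt(), 4)
           .fieldsGrouping("normalize", new Fields("url"));

    // all grouping: every "metrics" task sees every tuple
    builder.setBolt("metrics", new MetricsBolt())
           .allGrouping("normalize");

    // global grouping: everything funnels into the single lowest-id task
    builder.setBolt("rollup", new RollupBolt())
           .globalGrouping("count");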

Topology

(topology diagram with edges labeled: shuffle, [“url”], shuffle, shuffle, [“id1”, “id2”], all)

    Streaming word count

    TopologyBuilder is used to construct topologies in Java

    Streaming word count

    Define a spout in the topology with parallelism of 5 tasks

    Streaming word count

    Split sentences into words with parallelism of 8 tasks

Streaming word count

The consumer decides what data it receives and how it gets grouped

    Streaming word count

    Create a word count stream
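Putting the preceding steps together, the topology might be wired up roughly like this (class names follow the public storm-starter example rather than the exact code on the slides; the count bolt's parallelism of 12 is an arbitrary choice for illustration):

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();

    // sentence spout with a parallelism of 5 tasks
    builder.setSpout("sentences", new RandomSentenceSpout(), 5);

    // split sentences into words, 8 tasks; shuffle grouping spreads sentences randomly
    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("sentences");

    // word count stream; fields grouping on "word" sends the same word to the same task
    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));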

    Streaming word count

splitsentence.py: the split-sentence bolt implemented in Python
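On the Java side, the Python script is typically hooked in through a ShellBolt subclass; a sketch in the storm-starter style (the getComponentConfiguration override is only needed on Storm versions whose IRichBolt interface requires it):

    import java.util.Map;
    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Fields;

    public class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
            // run the bolt logic in an external Python process
            super("python", "splitsentence.py");
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;   // no per-component configuration
        }
    }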

    Streaming word count

    Submitting topology to a cluster
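A sketch of cluster submission, assuming the builder from the word-count example above; the config values are illustrative, not from the talk:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setNumWorkers(20);          // worker JVMs to spread the tasks across
        conf.setMaxSpoutPending(5000);   // cap on un-acked spout tuples in flight

        // packaged into a jar and launched with the storm client
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }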

    Streaming word count

    Running topology in local mode
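Local mode runs the whole topology in-process, which is handy for development and testing; a sketch along the same lines:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;

    Config conf = new Config();
    conf.setDebug(true);   // log every emitted tuple

    LocalCluster cluster = new LocalCluster();   // simulated cluster in the current JVM
    cluster.submitTopology("word-count", conf, builder.createTopology());

    Thread.sleep(10000);   // let the topology run for a while
    cluster.shutdown();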

    Demo

    Distributed RPC

    Data flow for Distributed RPC

    DRPC Example

    Computing “reach” of a URL on the fly
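From the caller's side, a DRPC invocation looks like an ordinary blocking call even though the work runs on the Storm cluster; a sketch (host name and URL are placeholders, 3772 is Storm's default DRPC port):

    import backtype.storm.utils.DRPCClient;

    DRPCClient client = new DRPCClient("drpc-host", 3772);
    // "reach" is the DRPC function implemented by the reach topology
    String reach = client.execute("reach", "http://example.com/some-page");
    System.out.println("reach: " + reach);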

    Reach

    Reach is the number of unique people

    exposed to a URL on Twitter

    Computing reach

URL → tweeters → followers (per tweeter) → distinct followers → count → reach

    Reach topology

    Reach topology

    Keep set of followers for

    each request id in memory

    Reach topology

    Update followers set when

    receive a new follower

    Reach topology

Emit partial count after receiving all followers for a request id
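The core idea of that partial-uniquing step, sketched as plain Java with hypothetical names (the real version is a Storm batch bolt keyed by the DRPC request id):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class PartialUniquer {
        // one follower set per in-flight request id
        private final Map<Object, Set<String>> followersByRequest = new HashMap<>();

        // update the follower set when a new follower arrives for a request id
        void addFollower(Object requestId, String follower) {
            followersByRequest
                .computeIfAbsent(requestId, id -> new HashSet<>())
                .add(follower);
        }

        // once all followers for the request id have been received, emit the partial count
        int partialCount(Object requestId) {
            Set<String> followers = followersByRequest.remove(requestId);
            return followers == null ? 0 : followers.size();
        }
    }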

    Demo

Guaranteeing message processing

    “Tuple tree”

Guaranteeing message processing

    • A spout tuple is not fully processed until all

    tuples in the tree have been completed

Guaranteeing message processing

    • If the tuple tree is not completed within a

    specified timeout, the spout tuple is replayed

Guaranteeing message processing

Reliability API

Guaranteeing message processing

“Anchoring” creates a new edge in the tuple tree

Guaranteeing message processing

Acking marks a single node in the tree as complete
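A minimal sketch of a bolt using the reliability API (a sentence-splitting bolt is shown for illustration; collector is the OutputCollector handed to the bolt in prepare()):

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            // anchoring: passing the input tuple as the first argument adds an
            // edge from it to the emitted tuple in the tuple tree
            collector.emit(tuple, new Values(word));
        }
        // acking marks this node of the tuple tree as complete
        collector.ack(tuple);
    }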

    Transactional topologies

How do you do idempotent counting with an at-least-once delivery guarantee?

Won’t you overcount?

Transactional topologies solve this problem

Built completely on top of Storm’s primitives of streams, spouts, and bolts

Process small batches of tuples (batch 1, batch 2, batch 3, ...)

If a batch fails, replay the whole batch

Once a batch is completed, commit the batch

Bolts can optionally be “committers”

Commits are ordered. If there’s a failure during commit, the whole batch + commit is retried

    Example

New instance of this object for every transaction attempt

    Example

    Aggregate the count for

    this batch

    Example

    Only update database if

    transaction ids differ

    Example

    This enables idempotency since

    commits are ordered
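A sketch of the trick in plain Java (the Database interface and key are made up for illustration; the stored value carries the transaction id of the commit that last wrote it):

    class CountWithTxid {
        long count;
        long txid;
    }

    interface Database {
        CountWithTxid get(String key);
        void put(String key, CountWithTxid value);
    }

    void commitBatch(long batchCount, long currentTxid, Database db) {
        CountWithTxid stored = db.get("global-count");
        if (stored == null || stored.txid != currentTxid) {
            // first commit of this transaction: apply the batch on top of the stored count
            CountWithTxid updated = new CountWithTxid();
            updated.count = (stored == null ? 0 : stored.count) + batchCount;
            updated.txid = currentTxid;
            db.put("global-count", updated);
        }
        // otherwise this exact batch was already committed before a retry,
        // so skipping the write keeps the count correct (idempotent)
    }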

    Example

    (Credit goes to Kafka devs

     for this trick)

    Transactional topologies

    Multiple batches can be processed in parallel,

    but commits are guaranteed to be ordered

Transactional topologies

• Will be available in the next version of Storm (0.7.0)

• Requires a source queue that can replay identical batches of messages

• storm-kafka has a transactional spout implementation for Kafka

    Storm UI

    Storm on EC2

    https://github.com/nathanmarz/storm-deploy

    One-click deploy tool

    Starter code

    https://github.com/nathanmarz/storm-starter

    Example topologies

    Documentation

    Ecosystem

• Scala, JRuby, and Clojure DSLs

    • Kestrel, AMQP, JMS, and other spout adapters

    • Serializers

    • Multilang adapters

    • Cassandra, MongoDB integration

    Questions?

    http://github.com/nathanmarz/storm

    Future work 

    • State spout

    • Storm on Mesos

    • “Swapping”

    • Auto-scaling

    • Higher level abstractions

    Implementation

    KafkaTransactionalSpout

TransactionalSpout is a subtopology consisting of a spout and a bolt (in the topology diagram, its pieces are wired together with “all” groupings)

The spout consists of one task that coordinates the transactions

The bolt emits the batches of tuples

The coordinator emits a “batch” stream and a “commit” stream

Coordinator reuses tuple tree framework to detect