Apache Storm: Hands-on Session A.A. 2020/21 Fabiana Rossi Laurea Magistrale in Ingegneria Informatica - II anno Macroarea di Ingegneria Dipartimento di Ingegneria Civile e Ingegneria Informatica


Page 1: Apache Storm: Hands-on Session

Apache Storm: Hands-on Session
A.A. 2020/21

Fabiana Rossi

Laurea Magistrale in

Ingegneria Informatica - II anno

Macroarea di Ingegneria
Dipartimento di Ingegneria Civile e Ingegneria Informatica

Page 2: Apache Storm: Hands-on Session

The reference Big Data stack

Fabiana Rossi - SABD 2020/21

Resource Management

Data Storage

Data Processing

High-level Interfaces

Support / Integration

Page 3: Apache Storm: Hands-on Session

Apache Storm

• Apache Storm
  • Open-source, real-time, scalable streaming system
  • Provides an abstraction layer to execute DSP applications
  • Initially developed by Twitter
• Topology
  • DAG of spouts (sources of streams) and bolts (operators and data sinks)
  • Stream: sequence of key-value pairs


Page 4: Apache Storm: Hands-on Session

Stream grouping in Storm

• Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)?
• Shuffle grouping
  • Randomly partitions the tuples
• Fields grouping
  • Hashes on a subset of the tuple attributes

Page 5: Apache Storm: Hands-on Session

Stream grouping in Storm

• All grouping (i.e., broadcast)
  • Replicates the entire stream to all the consumer tasks
• Global grouping
  • Sends the entire stream to a single task of the consumer bolt
• Direct grouping
  • The producer of a tuple directly decides which task of the consumer will receive it
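The routing decision behind these groupings can be sketched in plain Java, independently of the Storm API (class and method names are ours): fields grouping hashes the grouping attributes modulo the number of consumer tasks, while shuffle grouping picks a task at random.

```java
import java.util.Random;

public class GroupingSketch {
    // Fields grouping: hash the grouping field(s), so tuples with the same
    // field value are always routed to the same consumer task.
    static int fieldsGroupingTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Shuffle grouping: pick a consumer task (pseudo-)randomly.
    static int shuffleGroupingTask(Random rnd, int numTasks) {
        return rnd.nextInt(numTasks);
    }

    public static void main(String[] args) {
        // The same word always lands on the same task...
        System.out.println(fieldsGroupingTask("storm", 4));
        // ...while shuffled tuples spread over all tasks.
        System.out.println(shuffleGroupingTask(new Random(), 4));
    }
}
```

This determinism is what makes fields grouping suitable for stateful operators such as per-word counters.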

Page 6: Apache Storm: Hands-on Session

Storm architecture


• Master-worker architecture


Page 7: Apache Storm: Hands-on Session

Storm components: Nimbus and Zookeeper

• Nimbus
  – The master node
  – Clients submit topologies to it
  – Responsible for distributing and coordinating the topology execution
• Zookeeper
  – Nimbus uses a combination of the local disk(s) and Zookeeper to store state about the topology

Page 8: Apache Storm: Hands-on Session

Storm components: worker

• Task: operator instance
  – The actual work for a bolt or a spout is done in the task
• Executor: smallest schedulable entity
  – Executes one or more tasks of the same operator
• Worker process: Java process running one or more executors
• Worker node: computing resource, a container for one or more worker processes

Page 9: Apache Storm: Hands-on Session

Storm components: supervisor

• Each worker node runs a supervisor

The supervisor:
• receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
• sends a periodic heartbeat to Nimbus (through ZooKeeper)
• advertises the topologies that it is currently running, and any vacancies available to run more topologies

Page 10: Apache Storm: Hands-on Session

Example of a running topology


Page 11: Apache Storm: Hands-on Session

What makes a running topology


Page 12: Apache Storm: Hands-on Session

Configuring the parallelism of a topology

Number of worker processes
• How many worker processes to create for the topology across machines in the cluster.
• Configuration option: TOPOLOGY_WORKERS

Number of executors (threads)
• How many executors to spawn per component.
• Configuration option: none (pass the parallelism_hint parameter to setSpout or setBolt)

Number of tasks
• How many tasks to create per component.
• Configuration option: TOPOLOGY_TASKS
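To make the relation between the three knobs concrete, here is a back-of-the-envelope sketch in plain Java (the numbers and method names are illustrative, not from the slides): tasks of a component are divided evenly among its executors, and the topology's executors are spread across the worker processes.

```java
public class ParallelismMath {
    // Tasks of a component are divided evenly among its executors.
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    // Executors of the whole topology are spread round-robin across
    // worker processes, so each worker runs about this many of them.
    static int executorsPerWorker(int totalExecutors, int numWorkers) {
        return (int) Math.ceil((double) totalExecutors / numWorkers);
    }

    public static void main(String[] args) {
        // A bolt with parallelism_hint 2 and TOPOLOGY_TASKS 4:
        System.out.println(tasksPerExecutor(4, 2));    // 2 tasks per executor
        // 10 executors spread over 3 worker processes:
        System.out.println(executorsPerWorker(10, 3)); // at most 4 per worker
    }
}
```

By default the number of tasks equals the number of executors, i.e. one task per executor.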

Page 13: Apache Storm: Hands-on Session

Example of a running topology


Page 14: Apache Storm: Hands-on Session

Running a Topology in Storm

Storm supports two running modes: local and cluster.
• Local mode: the topology is executed on a single node
  • local mode is usually used for testing purposes
  • we can check whether our application runs as expected
• Cluster mode: the topology is distributed by Storm on multiple workers
  • cluster mode should be used to run our application on the real dataset
  • better exploits parallelism
  • the application code is transparently distributed
  • the topology is managed and monitored at run-time

Page 15: Apache Storm: Hands-on Session

Running a Topology in Storm

To run a topology in local mode, we just need to create an in-process cluster:
• it is a simplification of a cluster
• lightweight Storm functions wrap our code
• it can be instantiated using the LocalCluster class. For example:

...
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("myTopology", conf, topology);
Utils.sleep(10000); // wait [param] ms
cluster.killTopology("myTopology");
cluster.shutdown();
...

conf.setMaxTaskParallelism(...)
• This caps the maximum parallelism (number of threads) allowed for a single component; it is typically used to keep local-mode runs small.

Page 16: Apache Storm: Hands-on Session

Running a Topology in Storm

To run a topology in cluster mode, we need to perform the following steps:

1. Configure the application for the submission, using the StormSubmitter class. For example:

...
Config conf = new Config();
conf.setNumWorkers(NUM_WORKERS);
StormSubmitter.submitTopology("mytopology", conf, topology);
...

NUM_WORKERS
• number of worker processes to be used for running the topology

Page 17: Apache Storm: Hands-on Session

Running a Topology in Storm

2. Create a jar containing your code and all the dependencies of your code
• do not include the Storm library
• this can be easily done using Maven: use the Maven Assembly Plugin and configure your pom.xml:

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.path.to.main.Class</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>

Page 18: Apache Storm: Hands-on Session

Running a Topology in Storm

3. Submit the topology to the cluster using the storm client, as follows:

$ $STORM_HOME/bin/storm jar path/to/allmycode.jar full.classname.Topology arg1 arg2 arg3

Page 19: Apache Storm: Hands-on Session

Running a Topology in Storm

(figure: flow of application code and control messages during topology submission)

Page 20: Apache Storm: Hands-on Session

A container-based Storm cluster

Page 21: Apache Storm: Hands-on Session

Running a Topology in Storm

We are going to create a (local) Storm cluster using Docker. We need to run several containers, each of which will manage a service of our system:
• Zookeeper
• Nimbus
• Worker1, Worker2, Worker3
• Storm client (storm-cli): we use storm-cli to run topologies or scripts that feed our DSP application

Auxiliary services, which will be useful to interact with our Storm topologies:
• Redis
• RabbitMQ: a message queue service

Page 22: Apache Storm: Hands-on Session

Docker Compose

To easily coordinate the execution of these multiple services, we use Docker Compose
• Read more at https://docs.docker.com/compose/

Docker Compose:
• is not bundled with the installation of Docker
• can be installed following the official Docker documentation: https://docs.docker.com/compose/install/
• allows us to easily express the containers to be instantiated at once, and the relations among them
• by itself, Docker Compose runs the composition on a single machine; in combination with Docker Swarm, however, containers can be deployed on multiple nodes

Page 23: Apache Storm: Hands-on Session

Docker Compose

• We specify how to compose containers in an easy-to-read file, by default named docker-compose.yml
• To start the docker composition (in background with -d):

$ docker-compose up -d

• To stop the docker composition:

$ docker-compose down

• By default, docker-compose looks for the docker-compose.yml file in the current working directory; we can point it to a different configuration file using the -f flag

Page 24: Apache Storm: Hands-on Session

Docker Compose

• There are different versions of the Docker Compose file format
• We will use version 3, supported since Docker Compose 1.13

On the Docker Compose file format: https://docs.docker.com/compose/compose-file/
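To give an idea of what such a composition looks like, here is a minimal docker-compose.yml sketch, under the assumption that the official zookeeper and storm images from Docker Hub are used (the hands-on repository may use different images, service names, and options):

```yaml
version: "3"
services:
  zookeeper:
    image: zookeeper
  nimbus:
    image: storm
    command: storm nimbus
    depends_on:
      - zookeeper
  worker1:
    image: storm
    command: storm supervisor
    depends_on:
      - nimbus
      - zookeeper
```

Adding worker2 and worker3 just means repeating the supervisor service; service names double as hostnames on the default Compose network, which is how nimbus reaches zookeeper.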

Page 25: Apache Storm: Hands-on Session

Storm UI

Page 26: Apache Storm: Hands-on Session

Storm UI

In addition to the bolts defined in your topology, Storm uses its own bolts to perform background work when a topology component acknowledges that it either succeeded or failed to process a tuple. By default, Storm sets the number of acker executors equal to the number of workers configured for the topology.

Page 27: Apache Storm: Hands-on Session

Storm Examples

Page 28: Apache Storm: Hands-on Session

Example: Exclamation

• Problem: suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.

Page 29: Apache Storm: Hands-on Session

Example: Exclamation

• Problem: suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.
• Solution (1):

Page 30: Apache Storm: Hands-on Session

A simple topology: ExclamationTopology

...
TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("word", new RandomNamesSpout(), 1);
builder.setBolt("exclaim1", new ExclamationBolt(), 1)
       .shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 1)
       .shuffleGrouping("exclaim1");

Config conf = new Config();
conf.setNumWorkers(3);

StormSubmitter.submitTopologyWithProgressBar(
    "ExclamationTopology", conf, builder.createTopology()
);
...
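The slide shows only the wiring; the bolt logic itself is simple. A plain-Java sketch of the transformation each ExclamationBolt applies (method name is ours, independent of the Storm API):

```java
public class ExclamationSketch {
    // What each ExclamationBolt does to an incoming word.
    static String exclaim(String word) {
        return word + "!!";
    }

    public static void main(String[] args) {
        // A word flows through exclaim1 and then exclaim2,
        // gaining two exclamation marks at each hop.
        System.out.println(exclaim(exclaim("nathan"))); // nathan!!!!
    }
}
```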

Page 31: Apache Storm: Hands-on Session

Example: Exclamation

• Problem: suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.
• Solution (2):

Page 32: Apache Storm: Hands-on Session

Example: WordCount

• Problem: suppose we have a random source of sentences. Create a DSP application that counts the number of occurrences of each word.

Page 33: Apache Storm: Hands-on Session

Example: WordCount

• Problem: suppose we have a random source of sentences. Create a DSP application that counts the number of occurrences of each word.
• Solution:

Page 34: Apache Storm: Hands-on Session

WordCount

...
TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("spout", new RandomSentenceSpout(), 5);

builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");

builder.setBolt("count", new WordCountBolt(), 12)
       .fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
...
StormSubmitter.submitTopologyWithProgressBar(
    "WordCount", conf, builder.createTopology()
);
...
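Conceptually, the splitter and counter implement the logic below; this is our plain-Java sketch, not the actual bolt code. Because the topology uses fields grouping on "word", all occurrences of a given word reach the same counting task, so a per-task map yields correct totals.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountLogic {
    // What the splitter does per sentence: one token per emitted tuple.
    static String[] split(String sentence) {
        return sentence.toLowerCase().split("\\s+");
    }

    // What the counter does: keep a running count per word.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(split("the quick fox jumps over the dog")));
    }
}
```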

Page 35: Apache Storm: Hands-on Session

Example: Rolling Count

• Problem: suppose we have a random source of words. Create a DSP application that determines the top-N rank of words within a sliding window of X secs and a sliding interval of Y secs.

Page 36: Apache Storm: Hands-on Session

Example: Rolling Count

• Problem: suppose we have a random source of words. Create a DSP application that determines the top-N rank of words within a sliding window of X secs and a sliding interval of Y secs.
• Solution:

Page 37: Apache Storm: Hands-on Session

Rolling Count

...
TopologyBuilder builder = new TopologyBuilder();

builder.setSpout(spoutId, new RandomNamesSpout(), 5);

builder.setBolt(counterId, new RollingCountBolt(), 4)
       .fieldsGrouping(spoutId, new Fields("word"));

builder.setBolt(intermediateRankerId, new IntermediateRankingBolt(TOP_N), 4)
       .fieldsGrouping(counterId, new Fields("obj"));

builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N), 1)
       .globalGrouping(intermediateRankerId);

StormSubmitter.submitTopologyWithProgressBar(...);
...
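The ranking step (what the intermediate and total rankers do conceptually: sort the counts and keep the N largest) can be sketched as follows; this is our illustration, not the code of those bolts.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopN {
    // Returns the n keys with the highest counts, best first.
    static List<String> topN(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topN(Map.of("a", 3, "b", 1, "c", 2), 2)); // [a, c]
    }
}
```

Splitting the ranking into parallel partial rankers followed by a single global ranker (as the topology above does) keeps the final merge cheap: the global bolt only merges N-element lists.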

Page 38: Apache Storm: Hands-on Session

Word Count on a Window (1)

• Storm 1.0 explicitly introduced the concept of Window.
• We revise a simplified version of the previous Word Count application, relying on the window primitives offered by Storm.
• The idea is to compute the word count over a sliding window.

Page 39: Apache Storm: Hands-on Session

Word Count on a Window (2)

• We create a data stream processing application which comprises the following operators:
  • a data source, which emits sentences
  • a splitter
  • a word count operator with a sliding window; the length of the sliding window is 9 secs and it slides every 3 secs
• To better visualize the results, we include an auxiliary operator that exports results on a message queue, implemented with RabbitMQ.

Page 40: Apache Storm: Hands-on Session

Word Count on a Window (3)

...
TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");

builder.setBolt("count", new WordCountWindowBasedBolt()
        .withWindow(
            BaseWindowedBolt.Duration.seconds(9),  // window length
            BaseWindowedBolt.Duration.seconds(3)   // sliding interval
        ), 12)
       .fieldsGrouping("split", new Fields("word"));

StormSubmitter.submitTopologyWithProgressBar(...);
...

Page 41: Apache Storm: Hands-on Session

Word Count on a Window (4)

Implementation of the windowed operator:

public class WordCountWindowBasedBolt extends BaseWindowedBolt {
    ...
    public void execute(TupleWindow tuples) {
        List<Tuple> incoming = tuples.getNew();
        for (Tuple tuple : incoming) { ... }

        List<Tuple> expired = tuples.getExpired();
        for (Tuple tuple : expired) { ... }
    }
    ...
}
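The counts inside execute() can be maintained incrementally: increment on tuples from getNew(), decrement on tuples from getExpired(). A plain-Java sketch of that bookkeeping (class and method names are ours):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowedCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    // Words that entered the window since the last trigger (tuples.getNew()).
    void addNew(List<String> words) {
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
    }

    // Words that fell out of the window (tuples.getExpired()):
    // decrement and drop entries that reach zero.
    void removeExpired(List<String> words) {
        for (String w : words) {
            counts.computeIfPresent(w, (k, v) -> v == 1 ? null : v - 1);
        }
    }

    Map<String, Integer> snapshot() {
        return counts;
    }
}
```

Processing only the deltas keeps each trigger O(new + expired) instead of recounting the whole 9-second window every 3 seconds.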

Page 42: Apache Storm: Hands-on Session

DEBS Grand Challenge 2015 (1)

• Analysis of taxi trips based on data streams originating from New York City taxis
• Input data streams include the starting point, drop-off point, timestamps, and information related to the payment
• Query 1: identify the top 10 most frequent routes during the last 30 minutes (sliding window)
• Use geo-spatial grids to define the events of interest
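The slides do not show how events are mapped onto the grid; a common approach (assumed here, with illustrative origin and cell-size values) divides the bounding box into fixed-size cells and identifies a route by its pickup/drop-off cell pair:

```java
public class CellGrid {
    // Index of the cell containing a coordinate, for a regular grid
    // starting at `origin` with cells of width `cellSize`.
    static int cellIndex(double coord, double origin, double cellSize) {
        return (int) Math.floor((coord - origin) / cellSize);
    }

    // A route is identified by its (pickup cell, drop-off cell) pair.
    static String routeId(int pickupCell, int dropoffCell) {
        return pickupCell + "->" + dropoffCell;
    }

    public static void main(String[] args) {
        System.out.println(cellIndex(41.0, 40.5, 0.25)); // 2
        System.out.println(routeId(2, 3));               // 2->3
    }
}
```

A route id built this way is what the topology below can use as the fields-grouping key, so all trips on the same route are counted by the same task.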

Page 43: Apache Storm: Hands-on Session

DEBS Grand Challenge 2015 (2)

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("datasource",
    new RedisSpout(redisUrl, redisPort));

builder.setBolt("parser", new ParseLine())
       .setNumTasks(numTasks)
       .shuffleGrouping("datasource");

builder.setBolt("filterByCoordinates", new FilterByCoordinates())
       .setNumTasks(numTasks)
       .shuffleGrouping("parser");

builder.setBolt("metronome", new Metronome())
       .setNumTasks(numTasksMetronome)
       .shuffleGrouping("filterByCoordinates");

builder.setBolt("computeCellID", new ComputeCellID())
       .setNumTasks(numTasks)
       .shuffleGrouping("filterByCoordinates");

Page 44: Apache Storm: Hands-on Session

DEBS Grand Challenge 2015 (3)

builder.setBolt("countByWindow", new CountByWindow())
       .setNumTasks(numTasks)
       .fieldsGrouping("computeCellID", new Fields(ComputeCellID.F_ROUTE))
       .allGrouping("metronome", Metronome.S_METRONOME);

builder.setBolt("partialRank", new PartialRank(10))
       .setNumTasks(numTasks)
       .fieldsGrouping("countByWindow", new Fields(ComputeCellID.F_ROUTE));

builder.setBolt("globalRank", new GlobalRank(...), 1)
       .setNumTasks(numTasksGlobalRank)
       .shuffleGrouping("partialRank");

StormTopology stormTopology = builder.createTopology();