Storm and Cassandra


Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how it integrates with Apache Cassandra. There is also a segue into an example project that uses Storm and Cassandra to implement a scalable reactive web crawler: http://github.com/tjake/stormscraper


Cassandra NYC Meetup 11/5/2013

Jake Luciani (@tjake)

What is Storm?

• Distributed event processor

• Provides constructs to reliably process all events

• Simple conceptual model

• New to Apache Incubator: http://wiki.apache.org/incubator/StormProposal

Storm Concepts

Spout - Collects work and submits it to be processed. Tracks success or failure of each tuple.

Bolt - Processes tuples and optionally emits more tuples.

Tuple - A collection of data that is passed within Storm.

Stream - Identifies outputs from a Spout/Bolt. Forces tuples to have a declared structure.
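The four concepts above can be sketched as a toy model in plain Python. This is not Storm's API: the class names (`Stream`, `SentenceSpout`, `SplitBolt`) and the field names are made up for illustration, and everything runs in one process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tuple:
    """A named collection of values travelling on one stream."""
    stream: str
    values: dict


class Stream:
    """A stream declares the field names every tuple on it must carry."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = tuple(fields)

    def make(self, **values):
        # Reject tuples that do not match the declared structure.
        if set(values) != set(self.fields):
            raise ValueError(f"stream {self.name!r} expects fields {self.fields}")
        return Tuple(self.name, values)


class SentenceSpout:
    """Spout: the source of work. Emits one tuple per sentence."""

    out = Stream("sentences", ["sentence"])

    def __init__(self, sentences):
        self.sentences = list(sentences)

    def emit_all(self):
        for sentence in self.sentences:
            yield self.out.make(sentence=sentence)


class SplitBolt:
    """Bolt: consumes a tuple and optionally emits more tuples."""

    out = Stream("words", ["word"])

    def process(self, tup):
        for word in tup.values["sentence"].split():
            yield self.out.make(word=word)


spout = SentenceSpout(["storm and cassandra", "hello storm"])
bolt = SplitBolt()
words = [w.values["word"] for t in spout.emit_all() for w in bolt.process(t)]
print(words)
```

The `Stream.make` check is the part that corresponds to "declared structure": a tuple with a misspelled or missing field is rejected at emit time instead of surfacing later in a downstream bolt.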

Storm Topologies

A directed graph of spouts and bolts connected via streams.

[Figure: topology diagram. Labels shown: Host A, Host B, Host C, Zookeeper, A-F, G-P, Q-Z, Firehose, Cassandra (optional).]
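As a rough illustration of "a directed graph of spouts and bolts", the sketch below stores a topology as an adjacency map and walks it from a spout. The component names are hypothetical. The `route_by_range` helper is a guess at what the A-F / G-P / Q-Z labels in the diagram depict (tuples partitioned by key range); Storm's own fields grouping partitions by hashing the chosen fields, so treat this purely as an illustration.

```python
from collections import deque

# Edges of a toy topology: component -> downstream components.
# All component names here are hypothetical.
topology = {
    "firehose_spout": ["parse_bolt"],
    "parse_bolt": ["count_bolt"],
    "count_bolt": [],
}


def reachable(graph, start):
    """Return the components a tuple can reach from `start`, breadth first."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(graph[node])
    return seen


def route_by_range(key, ranges=(("a", "f"), ("g", "p"), ("q", "z"))):
    """Pick a task index from the first letter of the key (range partitioning)."""
    first = key[0].lower()
    for index, (low, high) in enumerate(ranges):
        if low <= first <= high:
            return index
    raise ValueError(f"no range covers key {key!r}")


path = reachable(topology, "firehose_spout")
tasks = {key: route_by_range(key) for key in ["cassandra", "kafka", "storm"]}
print(path)
print(tasks)
```

The point of any such routing function is that the same key always lands on the same task, so per-key state (for example a counter) can live on one worker.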

Example Topologies

• Track the top 10 most popular links being shared in the last N minutes.

Where does data end up?

• Storm supports built-in RPC, so client requests can effectively become a spout.


• Put the data into a database…

• Why Cassandra though?

Why Cassandra?

• Cassandra’s data model allows incremental modifications to rows.

• Different bolts can update different parts of a Cassandra row asynchronously.
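To see why independent writers can safely touch the same row, here is a toy model of column-level upserts in which each column keeps the value with the newest write timestamp. The `Row` class is invented for illustration and ignores real-world details such as how ties between equal timestamps are broken; the example uses distinct timestamps.

```python
from itertools import permutations


class Row:
    """Toy row: every column keeps the value with the newest write timestamp."""

    def __init__(self):
        self.cells = {}  # column name -> (write_timestamp, value)

    def upsert(self, timestamp, **columns):
        # Only the named columns are touched; the rest of the row is untouched.
        for name, value in columns.items():
            current = self.cells.get(name)
            if current is None or timestamp > current[0]:
                self.cells[name] = (timestamp, value)

    def read(self):
        return {name: value for name, (_, value) in self.cells.items()}


# Four writes, as if issued by different bolts at different times.
writes = [
    (1, {"html": "<html>...</html>"}),
    (2, {"text": "plain text"}),
    (3, {"title": "Example"}),
    (4, {"text": "plain text, re-extracted"}),
]

results = []
for ordering in permutations(writes):
    row = Row()
    for timestamp, columns in ordering:
        row.upsert(timestamp, **columns)
    results.append(row.read())

print(all(result == results[0] for result in results))
print(results[0])
```

Because each write names only its own columns and conflicts are settled per column, the final row is the same whatever order the writes arrive in, which is what lets the bolts run asynchronously.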

Example

StormScraper

A web crawling system built on Storm + Cassandra

http://github.com/tjake/stormscraper

StormScraper C* Data Model

CREATE TABLE pages (
    url text,
    scrape_date timestamp,
    title text,
    html text,
    text text,
    inbound_links set<text>,
    outbound_links set<text>,
    PRIMARY KEY (url, scrape_date)
);

CREATE TABLE scrape_list (
    url text PRIMARY KEY,
    last_update timestamp,
    depth int
);
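One way separate writers could each update their own part of a `pages` row is sketched below as plain functions that build a CQL statement plus its parameters. Which writer owns which column is an assumption, the function names are hypothetical, and the `%s` placeholder style should be adjusted to whatever driver is used; nothing here is taken from the stormscraper code.

```python
from datetime import datetime, timezone

KEY_CLAUSE = "WHERE url = %s AND scrape_date = %s"


def html_write(url, scrape_date, title, html):
    """Statement that sets only the title and raw HTML columns."""
    cql = f"UPDATE pages SET title = %s, html = %s {KEY_CLAUSE}"
    return cql, (title, html, url, scrape_date)


def text_write(url, scrape_date, text):
    """Statement that sets only the extracted text column."""
    cql = f"UPDATE pages SET text = %s {KEY_CLAUSE}"
    return cql, (text, url, scrape_date)


def link_write(url, scrape_date, outbound):
    """Statement that adds links to the outbound_links set without reading it."""
    cql = f"UPDATE pages SET outbound_links = outbound_links + %s {KEY_CLAUSE}"
    return cql, (set(outbound), url, scrape_date)


scrape_date = datetime(2013, 11, 5, tzinfo=timezone.utc)
statements = [
    html_write("http://example.com/", scrape_date, "Example", "<html>...</html>"),
    text_write("http://example.com/", scrape_date, "Example text"),
    link_write("http://example.com/", scrape_date, ["http://example.com/about"]),
]
for cql, params in statements:
    print(cql)
```

Each statement addresses the same primary key but a disjoint set of columns, so the three writes can be issued in any order. The set append (`outbound_links + ...`) adds elements without a read-modify-write cycle.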

StormScraper Topology

[Figure: topology diagram built up over successive slides. Components shown: Url Spout, Scraper Bolt, Text Extraction Bolt, Html Writer, Link Writer, Text Writer, and Cassandra. The final frames add a "Fail" label.]
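The "Fail" frames presumably show a failed tuple being reported back to the spout. A minimal sketch of that bookkeeping follows: the spout remembers every in-flight URL and re-queues it on failure. The method names loosely mirror Storm's spout callbacks (`nextTuple`, `ack`, `fail`), but the class, the flaky `scrape` function, and the URLs are all invented for illustration.

```python
from collections import deque


class ReplayingUrlSpout:
    """Toy spout: remembers in-flight URLs and replays the ones that fail."""

    def __init__(self, urls):
        self.queue = deque(urls)
        self.pending = {}  # message id -> url still being processed
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None
        url = self.queue.popleft()
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = url
        return msg_id, url

    def ack(self, msg_id):
        # Fully processed downstream: forget it.
        self.pending.pop(msg_id)

    def fail(self, msg_id):
        # Something downstream failed: put the URL back for another attempt.
        self.queue.append(self.pending.pop(msg_id))


# Hypothetical downstream step that fails once for one URL.
flaky = {"http://example.com/b"}


def scrape(url):
    if url in flaky:
        flaky.discard(url)
        return False
    return True


spout = ReplayingUrlSpout(["http://example.com/a", "http://example.com/b"])
attempts, done = [], []
while True:
    emitted = spout.next_tuple()
    if emitted is None:
        break
    msg_id, url = emitted
    attempts.append(url)
    if scrape(url):
        spout.ack(msg_id)
        done.append(url)
    else:
        spout.fail(msg_id)

print(attempts)
print(done)
```

A real spout would usually cap the number of retries or add a delay before replaying, which this sketch leaves out.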

Code Walkthrough

http://github.com/tjake/stormscraper

Storm Summary

• Powerful

• But easy to make mistakes

• Wrong tuple expectation, names, types

• Bad topology wiring
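Mistakes of this kind (misnamed fields, edges pointing at the wrong component) can be caught before anything runs with a small wiring check. The sketch below is a hypothetical pre-flight check, not a Storm feature: each component lists the fields it emits and the fields it needs, and every edge is verified against those lists. The component and field names are made up.

```python
def check_wiring(components, edges):
    """Return a list of human-readable wiring problems (empty if none)."""
    problems = []
    for source, target in edges:
        unknown = [name for name in (source, target) if name not in components]
        for name in unknown:
            problems.append(f"{source} -> {target}: unknown component {name!r}")
        if unknown:
            continue
        # Every field the target needs must be emitted by the source.
        missing = sorted(
            set(components[target]["needs"]) - set(components[source]["emits"])
        )
        if missing:
            problems.append(f"{source} -> {target}: missing fields {missing}")
    return problems


components = {
    "url_spout": {"emits": ["url"], "needs": []},
    "scraper": {"emits": ["url", "html"], "needs": ["url"]},
    "text_extractor": {"emits": ["url", "text"], "needs": ["url", "html"]},
    "link_writer": {"emits": [], "needs": ["url", "links"]},
}
edges = [
    ("url_spout", "scraper"),
    ("scraper", "text_extractor"),
    ("scraper", "link_writer"),  # scraper never emits "links"
    ("scraper", "html_writr"),   # typo in the component name
]

problems = check_wiring(components, edges)
for problem in problems:
    print(problem)
```

Running a check like this in a unit test turns a wiring error into a failed build rather than a tuple that silently fails at runtime.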

Thank You! Q&A?
