
Apache Storm and Kafka
Boston Storm User Group

September 25, 2014

P. Taylor Goetz, Hortonworks
@ptgoetz

What is Apache Kafka?

A pub/sub messaging system.

Re-imagined as a distributed commit log.

Apache Kafka

Fast

“A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.”

http://kafka.apache.org

Apache Kafka

Scalable

“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.”

http://kafka.apache.org

Apache Kafka

Durable

“Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.”

http://kafka.apache.org

Apache Kafka

Distributed

“Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.”

http://kafka.apache.org

Apache Kafka: Use Cases

• Stream Processing

• Messaging

• Click Streams

• Metrics Collection and Monitoring

• Log Aggregation

Apache Kafka: Use Cases

• “Greek letter” architectures (e.g., Lambda, Kappa)

• …which are really just streaming design patterns

Apache Kafka: Under the Hood

Producers/Consumers (Publish-Subscribe)

Apache Kafka: Under the Hood

Producers write data to Brokers

Consumers read data from Brokers

This work is distributed across the cluster
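
A minimal producer-side sketch, using the newer Java client API (which post-dates this 2014 talk); the broker address, topic, and class name are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Producers write to brokers; the cluster assigns each record to a partition.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello, kafka"));
        }
    }
}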

Apache Kafka: Under the Hood

Data is stored in topics.

Topics are divided into partitions.

Partitions are replicated.
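
A sketch that makes those three ideas concrete by creating a topic with explicit partition and replication settings, using the modern AdminClient API (added to Kafka well after this talk; names and counts are assumptions):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Topic "events": 4 partitions, each replicated to 3 brokers (illustrative values).
            admin.createTopics(Collections.singletonList(new NewTopic("events", 4, (short) 3)))
                 .all().get(); // block until the topic is actually created
        }
    }
}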

Apache Kafka: Under the Hood

Topics are named feeds to which messages are published.

http://kafka.apache.org/documentation.html

Apache Kafka: Under the Hood

Topics consist of partitions.

http://kafka.apache.org/documentation.html

Apache Kafka: Under the Hood

A partition is an ordered, immutable sequence of messages that is continually appended to.

http://kafka.apache.org/documentation.html

Apache Kafka: Under the Hood

Sequential disk access can be faster than random access to RAM!

http://kafka.apache.org/documentation.html

Apache Kafka: Under the Hood

Within a partition, each message is assigned a unique ID, called an offset, that identifies it.

http://kafka.apache.org/documentation.html

Apache Kafka: Under the Hood

ZooKeeper is used to store cluster state information and consumer offsets.

http://kafka.apache.org/documentation.html
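
A consumer-side sketch, again using the newer Java client (where committed offsets live in an internal Kafka topic rather than in ZooKeeper as described above; addresses and names are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "sketch-group");            // consumer group (assumed)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Each record carries the partition it came from and its offset,
                    // the unique ID described above.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}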

Storm and Kafka

A match made in heaven.

Data Source Reliability

• A data source is considered unreliable if there is no means to replay a previously-received message.

• A data source is considered reliable if it can somehow replay a message if processing fails at any point.

• A data source is considered durable if it can replay any message or set of messages given the necessary selection criteria.

Kafka is a durable data source.

Reliability in Storm

• Exactly once processing requires a durable data source.

• At least once processing requires a reliable data source.

• An unreliable data source can be wrapped to provide additional guarantees.

• With durable and reliable sources, Storm will not drop data.

• Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
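
A sketch of what at-least-once processing looks like in core Storm bolt code (the classic backtype.storm API of the 0.9 era; the class and field names are illustrative):

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchoring the emitted tuple to the input ties their fates together:
        // if processing fails anywhere downstream, a reliable spout such as
        // KafkaSpout replays the original message.
        collector.emit(input, new Values(input.getString(0)));
        collector.ack(input); // ack only after the tuple has been handled
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
    }
}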

Storm and Kafka

Apache Kafka is an ideal source for Storm topologies. It provides everything necessary for:

• At most once processing

• At least once processing

• Exactly once processing

Apache Storm includes Kafka spout implementations for all levels of reliability.

Kafka supports a wide variety of languages and integration points for both producers and consumers.

Storm-Kafka Integration

• Included in Storm distribution since 0.9.2

• Core Storm Spout

• Trident Spouts (Transactional and Opaque-Transactional)
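
A minimal core-Storm wiring sketch for the spout described above (ZooKeeper address, topic, and IDs are assumptions):

import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutSketch {
    public static void main(String[] args) {
        ZkHosts hosts = new ZkHosts("zkhost:2181");   // assumed ZooKeeper address
        SpoutConfig config = new SpoutConfig(hosts,
                "events",                             // topic to read (assumed)
                "/kafkastorm",                        // ZK root where offsets are stored
                "event-spout");                       // unique consumer ID
        config.scheme = new SchemeAsMultiScheme(new StringScheme()); // deserialize as strings

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(config), 4); // 4 executors
        // ... attach bolts and submit builder.createTopology() as usual
    }
}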

Storm-Kafka Integration

Features:

• Ingest from Kafka

• Configurable start time (offset position):

• Earliest, Latest, Last, Point-in-Time (see the sketch after this list)

• Write to Kafka (next release)
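
The start position maps onto two fields of SpoutConfig, continuing the earlier sketch (0.9.x-era storm-kafka field names; values shown are illustrative):

// "Last" (resume from the offset stored in ZooKeeper) is the default when
// forceFromStart is false; setting it true starts from startOffsetTime instead.
config.forceFromStart = true;
config.startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // Earliest
// ... or kafka.api.OffsetRequest.LatestTime() for Latest,
// ... or an epoch timestamp in milliseconds for Point-in-Time.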

Use Cases

Core Storm Use Case

Cisco Open Security Operations Center (OpenSOC)

Analyzing 1.2 Million Network Packets Per Second in Real Time

OpenSOC: Intrusion Detection

Breaches occur in seconds/minutes/hours, but take days/weeks/months to discover.

The three Vs of data (volume, velocity, variety) are not getting any smaller…

"Traditional Security analytics tools scale up, not out.”

"OpenSOC is a software application that turns a conventional big data platform into a security analytics platform.”

- James Sirota, Cisco Security Solutions

https://www.youtube.com/watch?v=bQTZ8OgDayA

OpenSOC Conceptual Model

OpenSOC Architecture

PCAP Topology

Telemetry Enrichment Topology

Enrichment

Analytics Dashboards

OpenSOC Deployment @ Cisco

Trident Use Case

Health Market Science

Master Data Management

Health Market Science

• “Master File” database of every healthcare practitioner in the U.S.

• Kept up-to-date in near-real-time

• Represents the “truth” at any point in time (“Golden Record”)

Health Market Science

• Build products and services around the Master File

• Leverage those services to gather new data and updates

Master Data Management

Data In

Data Out

MDM Pipeline

Polyglot Persistence

Choose the right tool for the job.

Data Pipeline

Why Trident?

• Aggregations and Joins

• Bulk update of persistence layer (Micro-batches)

• Throughput vs. Latency
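
A sketch of how these pieces combine in Trident code, pairing the opaque-transactional Kafka spout with a micro-batched aggregation (names and the in-memory state are illustrative; HMS's real state is the Cassandra one shown next):

import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.OpaqueTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

public class TridentSketch {
    public static void main(String[] args) {
        TridentKafkaConfig config =
                new TridentKafkaConfig(new ZkHosts("zkhost:2181"), "events"); // assumed host/topic
        config.scheme = new SchemeAsMultiScheme(new StringScheme()); // emits a "str" field

        TridentTopology topology = new TridentTopology();
        topology.newStream("kafka", new OpaqueTridentKafkaSpout(config))
                // Trident processes micro-batches: the grouping, aggregation, and
                // state write below happen once per batch, trading a little
                // latency for much higher throughput.
                .groupBy(new Fields("str"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));
        // ... submit topology.build() as usual
    }
}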

CassandraCqlState

public void commit(Long txid) {
    // Flush every statement queued during this Trident batch as a single
    // atomic (LOGGED) CQL batch.
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

public void addStatement(Statement statement) {
    this.statements.add(statement);
}

public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}

CassandraCqlStateUpdater

public void updateState(CassandraCqlState state, List<TridentTuple> tuples,
                        TridentCollector collector) {
    // Map each tuple in the micro-batch to a CQL statement and queue it
    // on the state; the statements execute together at commit time.
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}

Mapper Implementation

public Statement map(List<String> keys, Number value) {
    // Build an INSERT for one key/value pair.
    Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
    statement.value(KEY_NAME, keys.get(0));
    statement.value(VALUE_NAME, value);
    return statement;
}

public Statement retrieve(List<String> keys) {
    // Build a SELECT that fetches the stored value for a key.
    Select statement = QueryBuilder.select()
            .column(KEY_NAME)
            .column(VALUE_NAME)
            .from(KEYSPACE_NAME, TABLE_NAME)
            .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
    return statement;
}

Storm Cassandra CQL

git@github.com:hmsonline/storm-cassandra-cql.git

{tuple} <— <mapper> —> CQL Statement

Trident Batch == CQL Batch
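
Tying the pieces together, a hypothetical wiring of the mapper, state, and updater into a Trident stream (the factory class name is an assumption; check the repository above for the actual one):

// stream and mapper are defined elsewhere; CassandraCqlStateFactory is assumed.
stream.partitionPersist(
        new CassandraCqlStateFactory(),          // assumed factory producing CassandraCqlState
        new Fields("key", "value"),              // tuple fields the mapper reads
        new CassandraCqlStateUpdater(mapper));   // updater from the earlier slide
// When the micro-batch completes, Trident calls commit(txid), flushing the
// queued statements as one LOGGED CQL batch: Trident Batch == CQL Batch.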

Customer Dashboard

Thanks!

P. Taylor Goetz, Hortonworks
@ptgoetz