Apache Kafka Stream processing made easy

Page 1: Apache kafka

Apache Kafka

Stream processing made easy

Page 2: Apache kafka

Biased Content!

More Scotty, less Data

Page 3: Apache kafka

Streams 101

An introduction to stream processing

Page 4: Apache kafka

Transformation of a stream of data fragments into a continuous flow of information

Page 5: Apache kafka

Stream Processing

+
Get real-time insights
Lower processing latency
Easier to test
Easier to maintain
Easier to scale

-
Different way of thinking
At-least-once vs exactly-once
Time
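
The difference between at-least-once and exactly-once can be made concrete with a small sketch. It assumes each message carries a unique id (the `message_id` and `amount` fields are hypothetical) and shows idempotent processing as one way to get exactly-once results on top of at-least-once delivery. It does not model Kafka's own delivery guarantees.

```python
from typing import Iterable, Set, Tuple

Message = Tuple[int, int]  # (message_id, amount); both fields are hypothetical


def total_naive(deliveries: Iterable[Message]) -> int:
    """Sum every delivery; a redelivered message is counted twice."""
    return sum(amount for _, amount in deliveries)


def total_deduplicated(deliveries: Iterable[Message]) -> int:
    """Sum each message id once, so redeliveries do not change the result."""
    seen: Set[int] = set()
    total = 0
    for message_id, amount in deliveries:
        if message_id in seen:
            continue  # duplicate caused by a retry; skip it
        seen.add(message_id)
        total += amount
    return total


# Message 2 is delivered twice, as can happen under at-least-once delivery.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 7)]
print(total_naive(deliveries))         # 27
print(total_deduplicated(deliveries))  # 22
```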

Page 6: Apache kafka

Every company is already doing stream processing (more or less ...)

Page 7: Apache kafka

A Stream

Key 1 -> value 1
Key 2 -> value 2
Key 1 -> value 3

...

Page 8: Apache kafka

A Table

+-------+---------+
| Key 1 | value 3 |
| Key 2 | value 2 |
+-------+---------+

Page 9: Apache kafka

A Table through time

Timestamp 1:

+-------+---------+
| Key 1 | value 1 |
+-------+---------+

Timestamp 2:

+-------+---------+
| Key 1 | value 1 |
| Key 2 | value 2 |
+-------+---------+

Timestamp 3:

+-------+---------+
| Key 1 | value 3 |
| Key 2 | value 2 |
+-------+---------+

Timestamp ...

Page 10: Apache kafka

Let’s remove the redundancy

Page 11: Apache kafka

A Table through time as SETs

Timestamp 1: SET(key1 -> value1)
Timestamp 2: SET(key2 -> value2)
Timestamp 3: SET(key1 -> value3)
Timestamp ...

Page 12: Apache kafka

Changelog

SET(key1 -> value1)
SET(key2 -> value2)
SET(key1 -> value3)

Stream

key1 -> value1
key2 -> value2
key1 -> value3

Page 13: Apache kafka

Tables are materialized views of streams
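
The stream/table relationship from the previous slides can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (an in-memory list stands in for the stream, a dict for the table; the function names are made up for this example): replaying the stream rebuilds the table, and recording the table after each record gives the "table through time".

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (key, value)


def materialize(stream: Iterable[Record]) -> Dict[str, str]:
    """Replay a stream of key/value records into a table (latest value per key)."""
    table: Dict[str, str] = {}
    for key, value in stream:
        table[key] = value  # each record acts as SET(key -> value)
    return table


def snapshots(stream: Iterable[Record]) -> List[Dict[str, str]]:
    """Return the table as it looked after each record (the table through time)."""
    table: Dict[str, str] = {}
    history: List[Dict[str, str]] = []
    for key, value in stream:
        table[key] = value
        history.append(dict(table))  # copy, so later updates do not alter it
    return history


stream = [("key1", "value1"), ("key2", "value2"), ("key1", "value3")]
print(materialize(stream))  # {'key1': 'value3', 'key2': 'value2'}
```

Calling `snapshots(stream)` returns the three tables shown for Timestamp 1, 2 and 3.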

Page 14: Apache kafka

Why is this important?

Page 15: Apache kafka

Events used to manipulate core data.
Today events are our core data.

Daan Gerits, 2012

Page 16: Apache kafka

Every stream processing app is a combination of state and streams

Page 17: Apache kafka

Streaming vs batch is like agile vs waterfall, but for data.

Page 18: Apache kafka

Kafka

A Stream Processing Platform

Page 19: Apache kafka

Kafka Proxy

Kafka Streams

Kafka Connect

Kafka Security

Schema Repo

Kafka

Page 20: Apache kafka

Kafka Platform

Streams and Connect apps are just (Java) apps
Streams and Connect are libraries
Can be deployed like any other (Java) app
Multiple instances of the same app can be launched
Use tools like Mesos, Kubernetes, Docker Swarm, ...

Page 21: Apache kafka

[Diagram: Flink, Kafka, Spark and Storm / Heron placed on a spectrum running from Batch through Microbatch to Event]

Page 22: Apache kafka

Build apps, not Jobs

Page 23: Apache kafka

Kafka Proxy

Kafka Streams

Kafka Connect

Kafka Security

Schema Repo

Kafka

Page 24: Apache kafka

Kafka Engine

A message broker with a twist

Page 25: Apache kafka

Kafka Engine

[Diagram: three producers send messages to a topic; three consumers receive messages from it]

Page 27: Apache kafka

Messages

Contain byte arrays
Have a
  Timestamp
  Key
  Value
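
As a rough illustration of this shape, the sketch below models a message as a timestamp plus a key and value that are both raw bytes. The `Message` class, the `encode_sale` and `decode_quantity` helpers, and the JSON encoding are assumptions made for this example, not Kafka's actual record format or API; the point is that serialization is up to the application.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical in-memory stand-in for a message: timestamp, key, value."""
    timestamp_ms: int
    key: bytes    # opaque bytes as far as the broker is concerned
    value: bytes  # opaque bytes as far as the broker is concerned


def encode_sale(product_id: str, quantity: int, timestamp_ms: int) -> Message:
    """Serialize an application-level sale into a byte-array message."""
    value = json.dumps({"quantity": quantity}).encode("utf-8")
    return Message(timestamp_ms, product_id.encode("utf-8"), value)


def decode_quantity(message: Message) -> int:
    """Deserialize the value bytes back into application data."""
    return json.loads(message.value.decode("utf-8"))["quantity"]


message = encode_sale("apple", 3, timestamp_ms=1_000)
print(message.key)               # b'apple'
print(decode_quantity(message))  # 3
```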

Page 28: Apache kafka

Topics

Are more like datastores
Use disk instead of memory
Retain the messages
Are partitioned and replicated

Page 29: Apache kafka

Wait?? … Disk??

Page 30: Apache kafka

Sequential disk access is fast*

* Don’t believe me? Read http://kafka.apache.org/documentation#persistence

Page 31: Apache kafka

Producer

Puts messages onto Kafka
Determines the partition to write to
Can be implemented in many, many languages
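
One common way for a producer to determine the partition is to hash the message key, so that all messages with the same key end up in the same partition and keep their relative order. The sketch below assumes that approach and uses `zlib.crc32` purely as a stand-in hash; Kafka's own default partitioner uses a different hash function, and the `choose_partition` and `distribute` names are made up for this example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always map to the same partition."""
    return zlib.crc32(key) % num_partitions  # stand-in hash, not Kafka's


def distribute(records: List[Tuple[bytes, bytes]],
               num_partitions: int) -> Dict[int, List[bytes]]:
    """Group record values by the partition their key maps to, keeping order."""
    partitions: Dict[int, List[bytes]] = defaultdict(list)
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append(value)
    return dict(partitions)


records = [(b"key1", b"value1"), (b"key2", b"value2"), (b"key1", b"value3")]
layout = distribute(records, num_partitions=2)
print(layout)
```

Whatever the concrete layout turns out to be, `value1` and `value3` share a partition because they share a key.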

Page 32: Apache kafka

Consumer

Gets messages from Kafka
Can be grouped into Consumer Groups
  Allows for round robin message delivery
  Enables scaling of consumers
Has a persisted offset per Consumer Group
  Stored in ZooKeeper
  Or in Kafka
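
The idea of a persisted offset per consumer group can be sketched with an in-memory log. This is a simplified model, not the Kafka client API: the `TopicPartition` class and its `poll` method are hypothetical, and the offset is advanced as part of reading, whereas a real consumer commits offsets as a separate step. It shows that reading does not remove messages and that each group keeps its own position.

```python
from typing import Dict, List


class TopicPartition:
    """An append-only log; reading does not remove messages."""

    def __init__(self) -> None:
        self.log: List[bytes] = []
        self.offsets: Dict[str, int] = {}  # consumer group -> next offset to read

    def append(self, message: bytes) -> int:
        """Append a message and return its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[bytes]:
        """Return the next batch for a group and advance that group's offset."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


partition = TopicPartition()
for payload in (b"a", b"b", b"c"):
    partition.append(payload)

print(partition.poll("emailer", max_messages=2))  # [b'a', b'b']
print(partition.poll("emailer"))                  # [b'c']
print(partition.poll("ranker"))                   # [b'a', b'b', b'c']
```

The group names `emailer` and `ranker` are arbitrary; the second group starts from offset 0 because the log still holds every message.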

Page 33: Apache kafka

Kafka Engine

[Diagram: a topic split into Partition A and Partition B, each shown with its own producer and consumer]

Page 34: Apache kafka

100 000 msg/sec

On a barely tweaked, 3-node cluster

Page 35: Apache kafka

2 000 000 msg/sec

On a heavily tweaked cluster

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Page 36: Apache kafka

Kafka Connect

Getting data in and out

Page 37: Apache kafka

A simple and scalable way to get data in and out of topics

Page 38: Apache kafka

Kafka Connect

Datasource -> Kafka Connect -> Topic

Or

Topic -> Kafka Connect -> Datasink

Page 39: Apache kafka

Kafka Connect

[Diagram: several Kafka Connect instances between one datasource and one topic]

Page 40: Apache kafka

Kafka Connect

MySQL ⬢ Salesforce ⬢ Redis ⬢ MQTT ⬢ InfluxDB ⬢ RethinkDB ⬢ HBase ⬢ Solr ⬢ Couchbase ⬢ Elasticsearch ⬢ Hazelcast ⬢ Google PubSub ⬢ HDFS ⬢ S3 ⬢ Splunk ⬢ Spooldir ⬢ JDBC ⬢ Syslog ⬢ Cassandra ⬢ Vertica ⬢ DB2 ⬢ Goldengate ⬢ Jenkins ⬢ PredictionIO ⬢ JMS ⬢ Twitter ⬢ Attunity ⬢ MSSQL ⬢ Postgres ⬢ DynamoDB ⬢ IRC ⬢ Kudu ⬢ Ignite ⬢ MongoDB ⬢ Bloomberg Ticker ⬢ FTP

Page 41: Apache kafka

Kafka Streams

Processing streaming data

Page 42: Apache kafka

Kafka Streams

[Diagram: Kafka Streams sitting between topics, reading from topics and writing to topics]

Page 43: Apache kafka

Kafka Streams

KStream for a stream of data
KTable to keep the latest value for each key
KTable state is distributed across app instances
Transform from streams to tables and tables to streams
Choose which field to use as “timestamp”
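
To show why the choice of timestamp field matters, here is a small sketch in plain Python (not the Kafka Streams API) that turns a stream of events into a table of counts per key and time window. The event fields `product`, `sold_at` and `received_at`, and the `windowed_counts` helper, are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

Event = Dict[str, object]


def windowed_counts(events: Iterable[Event],
                    key_of: Callable[[Event], str],
                    timestamp_of: Callable[[Event], int],
                    window_ms: int) -> Dict[Tuple[str, int], int]:
    """Count events per (key, window start), using the chosen timestamp field."""
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for event in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = (timestamp_of(event) // window_ms) * window_ms
        counts[(key_of(event), window_start)] += 1
    return dict(counts)


events = [
    {"product": "apple", "sold_at": 1_000, "received_at": 61_000},
    {"product": "apple", "sold_at": 59_000, "received_at": 61_500},
    {"product": "pear", "sold_at": 61_000, "received_at": 62_000},
]

by_sale_time = windowed_counts(
    events, lambda e: e["product"], lambda e: e["sold_at"], window_ms=60_000)
print(by_sale_time)  # {('apple', 0): 2, ('pear', 60000): 1}
```

Switching `timestamp_of` to `received_at` puts all three events in the window starting at 60000, so the same stream produces a different table.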

Page 44: Apache kafka

[Diagram: Kafka Connect apps and Kafka Streams apps connected through TOPIC A, TOPIC B and TOPIC C]

Page 45: Apache kafka

So how do you build solutions with this?

Page 46: Apache kafka

[Diagram: a solution assembled from Kafka, Kafka Connect and Kafka Streams building blocks]

Page 47: Apache kafka

[Diagram: example solution built around TOPIC A, TOPIC B and TOPIC C. Labels: Sales, JDBC Kafka Connect, Top Products Ranker, Emailer, Low Stock Notifier, Kafka Connect App, Slack Poster]
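
As a hedged illustration of what one of these apps might do, the sketch below guesses at the core of a "Top Products Ranker": fold a stream of sales into per-product totals (state) and rank them. The record shape `(product, quantity)` and the `top_products` function are assumptions for this example; the slide does not describe the actual implementation, and a real app would read from and write to topics.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sale = Tuple[str, int]  # (product, quantity); hypothetical record shape


def top_products(sales: Iterable[Sale], n: int) -> List[Tuple[str, int]]:
    """Aggregate quantities per product and return the n best sellers."""
    totals: Counter = Counter()
    for product, quantity in sales:
        totals[product] += quantity  # the running totals are the app's state
    return totals.most_common(n)


sales = [("apple", 3), ("pear", 1), ("apple", 2), ("plum", 4)]
print(top_products(sales, 2))  # [('apple', 5), ('plum', 4)]
```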

Page 48: Apache kafka

Proposal