Apache Kafka
Stream processing made easy
Biased Content!
More Scotty, less Data
Streams 101
An introduction to stream processing
Transformation of a stream of data fragments into a
continuous flow of information
Stream Processing
+ Get real-time insights
+ Lower processing latency
+ Easier to test
+ Easier to maintain
+ Easier to scale

- Different way of thinking
- At-least-once vs exactly-once
- Time
Every company is already doing stream processing
(more or less ... )
A Stream
Key 1 -> value 1
Key 2 -> value 2
Key 1 -> value 3
...
A Table
+-------+---------+
| Key 1 | value 3 |
| Key 2 | value 2 |
+-------+---------+
A Table through time

Timestamp 1
+-------+---------+
| Key 1 | value 1 |
+-------+---------+

Timestamp 2
+-------+---------+
| Key 1 | value 1 |
| Key 2 | value 2 |
+-------+---------+

Timestamp 3
+-------+---------+
| Key 1 | value 3 |
| Key 2 | value 2 |
+-------+---------+

Timestamp ...
Let’s remove the redundancy
A Table through time as SETs

Timestamp 1: SET(key1 -> value1)
Timestamp 2: SET(key2 -> value2)
Timestamp 3: SET(key1 -> value3)
Timestamp ...

SET(key1 -> value1)
SET(key2 -> value2)
SET(key1 -> value3)
Changelog
key1 -> value1
key2 -> value2
key1 -> value3
Stream
Tables are materialized views
of streams
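The duality can be sketched in plain Java: replaying a changelog stream of key/value updates yields the table, and the table is just the latest state of that stream. This is a conceptual sketch, not the Kafka API; the Update record type is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StreamTableDuality {
    // A single changelog entry: a new value for a key at some point in time.
    public record Update(String key, String value) {}

    // Materialize a table from a changelog stream:
    // later updates for a key overwrite earlier ones.
    public static Map<String, String> materialize(List<Update> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Update u : changelog) {
            table.put(u.key(), u.value());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Update> changelog = List.of(
                new Update("key1", "value1"),
                new Update("key2", "value2"),
                new Update("key1", "value3"));
        // The table holds only the latest value per key:
        // key1 -> value3, key2 -> value2
        System.out.println(materialize(changelog));
    }
}
```

Replaying the same changelog always rebuilds the same table, which is why the stream can serve as the system of record.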
Why is this important?
Events used to manipulate core data.
Today events are our core data
Daan Gerits, 2012
Every stream processing app is a combination of state and streams
Streaming vs batch is like
agile vs waterfall, but for data.
Kafka
A Stream Processing Platform
Kafka Proxy
Kafka Streams
Kafka Connect
Kafka Security
Schema Repo
Kafka
Kafka Platform
Streams and Connect apps are just (Java) apps
Streams and Connect are libraries
Can be deployed like any other (Java) app
Multiple instances of the same app can be launched
Use tools like Mesos, Kubernetes, Docker Swarm, ...
(Diagram: processing models on a spectrum from Batch over Microbatch to Event, with Spark, Flink, Storm / Heron, and Kafka positioned along it)
Build apps, not Jobs
Kafka Engine
A message broker with a twist
Kafka Engine

(Diagram: multiple Producers put messages onto a Topic; multiple Consumers take messages from it)
Messages
Contain byte arrays
Have a Timestamp, a Key and a Value
Topics
Are more like datastores
Use disk instead of memory
Retain the messages
Are partitioned and replicated
Wait?? … Disk??
Sequential disk access is fast*
* Don’t believe me? Read http://kafka.apache.org/documentation#persistence
Producer
Puts messages onto Kafka
Determines the partition to write to
Can be implemented in many, many languages
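How a producer picks a partition from the message key can be sketched in a few lines. Kafka's default partitioner actually uses a murmur2 hash of the key bytes; plain hashCode is used here only to keep the sketch self-contained.

```java
public class KeyPartitioner {
    // Map a message key to one of numPartitions partitions.
    // Messages with the same key always land on the same partition,
    // which preserves per-key ordering.
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("key1", 3);
        int p2 = partitionFor("key1", 3);
        System.out.println(p1 == p2); // same key -> same partition
    }
}
```

Because the mapping is deterministic, all updates for one key form an ordered sub-stream on a single partition.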
Consumer
Gets messages from Kafka
Can be grouped into Consumer Groups
Allows for round-robin message delivery
Enables scaling of consumers
Has a persisted offset per Consumer Group
Stored in ZooKeeper
Or in Kafka
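How a Consumer Group scales can be sketched by spreading a topic's partitions round-robin over the group's members. This is a conceptual sketch; the real assignment is negotiated by Kafka's group coordinator and rebalanced when members join or leave.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupAssignment {
    // Assign partitions 0..numPartitions-1 round-robin over the group members.
    // Each partition is consumed by exactly one member, so adding members
    // (up to the partition count) scales consumption.
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers in one group, a topic with 4 partitions:
        // consumer-a gets partitions [0, 2], consumer-b gets [1, 3]
        System.out.println(assign(List.of("consumer-a", "consumer-b"), 4));
    }
}
```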
Kafka Engine

(Diagram: Producers write to a Topic's Partitions A and B; Consumers read the partitions in parallel)
100 000 msg/sec
On a barely tweaked, 3-node cluster

2 000 000 msg/sec
On a heavily tweaked cluster
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Kafka Connect
Getting data in and out
A simple and scalable way to get data in and out of topics
Kafka Connect

(Diagram: a Kafka Connect connector moves data from a Datasource into a Topic, or from a Topic into a Datasink; multiple Kafka Connect instances can run in parallel)
MySQL ⬢ Salesforce ⬢ Redis ⬢ MQTT ⬢ InfluxDB ⬢ RethinkDB ⬢ HBase ⬢ Solr ⬢ Couchbase ⬢ Elasticsearch ⬢ Hazelcast ⬢ Google PubSub ⬢ HDFS ⬢ S3 ⬢ Splunk ⬢ Spooldir ⬢ JDBC ⬢ Syslog ⬢ Cassandra ⬢ Vertica ⬢ DB2 ⬢ Goldengate ⬢ Jenkins ⬢ PredictionIO ⬢ JMS ⬢ Twitter ⬢ Attunity ⬢ MSSQL ⬢ Postgres ⬢ DynamoDB ⬢ IRC ⬢ Kudu ⬢ Ignite ⬢ MongoDB ⬢ Bloomberg Ticker ⬢ FTP
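As an illustration, a source connector from that list is configured with a small properties file. The sketch below uses Confluent's JDBC source connector; the connection URL, table, and topic prefix are made-up example values.

```properties
name=sales-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
# Example values - adjust to your own database
connection.url=jdbc:mysql://localhost:3306/shop
table.whitelist=sales
mode=incrementing
incrementing.column.name=id
topic.prefix=sales-
```

With this config, new rows in the sales table (detected via the incrementing id column) are streamed into a sales-sales topic without writing any code.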
Kafka Streams
Processing streaming data
Kafka Streams

(Diagram: a Kafka Streams app reads from one or more Topics and writes its results to other Topics)
KStream for a stream of data
KTable to keep the latest value for each key
KTable state is distributed across app instances
Transform from streams to tables and tables to streams
Choose which field to use as “timestamp”
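The stream-to-table transform can be mimicked in plain Java: folding a keyed stream into a per-key aggregate is what stream.groupByKey().count() does in the Kafka Streams DSL. This is a conceptual sketch only, not the Kafka Streams API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountByKey {
    // Fold a stream of keyed records into a table of counts per key:
    // the plain-Java equivalent of a KStream grouped and counted into a KTable.
    public static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A stream of page-view events keyed by user (made-up example data).
        List<String> stream = List.of("alice", "bob", "alice", "alice");
        // Resulting table: alice -> 3, bob -> 1
        System.out.println(countByKey(stream));
    }
}
```

Each change to the resulting table could in turn be emitted as a record, turning the table back into a stream.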
(Diagram: a Kafka Connect app feeds Topic A; Kafka Streams apps process Topic A into Topic B and Topic B into Topic C; another Kafka Connect app delivers Topic C onward)
So how do you build solutions with this?
(Diagram: Kafka Connect → Kafka → Kafka Streams → Kafka → Kafka Connect)
(Example pipeline: a Sales JDBC Kafka Connect source loads sales data into a topic; Kafka Streams apps such as a Top Products Ranker and a Low Stock Notifier process the topics; an Emailer and a Slack Poster Kafka Connect app act on the results)
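One piece of the example pipeline, the Low Stock Notifier, boils down to filtering a stream of stock events below a threshold. A plain-Java sketch; the StockEvent type and the threshold are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LowStockNotifier {
    // A stock-level event for a product (hypothetical record type).
    public record StockEvent(String product, int quantity) {}

    // Keep only the events that should trigger a notification.
    public static List<StockEvent> lowStock(List<StockEvent> events, int threshold) {
        return events.stream()
                .filter(e -> e.quantity() < threshold)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<StockEvent> events = List.of(
                new StockEvent("widget", 2),
                new StockEvent("gadget", 50));
        // With a threshold of 5, only the widget event passes the filter.
        System.out.println(lowStock(events, 5));
    }
}
```

In the real pipeline this filter would run as a Kafka Streams app reading from and writing to topics, with a Connect sink posting the surviving events to Slack.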
Proposal