Apache Kafka


Emre Akış


Outline

• Why do we use Apache Kafka?

• What is it?

• How does it work?

• Demo

• Ecosystem


Big Data

• Data doesn’t fit on one computer

• Welcome to distributed systems


(Near) Real-time Big Data & Analytics

• Events (e.g. clickstreams)

• Sensors

• Internet of Things (IoT)

• Data streams


Messaging Queues

FIFO


Distributed Messaging Queues

• Scalable

• Reliable

• High throughput (read & write)


Why Apache Kafka?

• Clean and simple architecture

• Easy to use

• Easy to deploy

• High throughput

• Scalability

• High availability

• Persistence (for a configurable retention period)


Apache Kafka 101

• Distributed, partitioned, replicated commit log service.

• Provides the functionality of a messaging system.


Cluster

• Language-agnostic TCP protocol

• Cluster => group of servers (brokers)


Topic

• Category or feed name to which messages are published.

• Partitioned log

• Each partition

– Ordered

– Immutable sequence of messages

– Continually appended to

• Offset => sequential ID number of a message within its partition
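The idea of a partition as an append-only log with sequential offsets can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's storage code: the class name `PartitionLog` and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy in-memory stand-in for one topic partition (hypothetical, not Kafka code)."""
    records: list = field(default_factory=list)

    def append(self, message: bytes) -> int:
        """Append a message to the end of the log and return its offset."""
        offset = len(self.records)  # the next sequential id
        self.records.append(message)
        return offset

    def read(self, offset: int, max_messages: int = 10) -> list:
        """Return (offset, message) pairs starting at `offset`, in log order."""
        chunk = self.records[offset:offset + max_messages]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
first = log.append(b"page_view:/home")
second = log.append(b"page_view:/pricing")
print(first, second)  # 0 1
print(log.read(1))    # [(1, b'page_view:/pricing')]
```

There is deliberately no update or delete method: existing entries are never changed, and readers simply ask for messages starting at an offset.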


Partition Distribution

• Partitions are distributed over the servers in the cluster

• Replicated for fault tolerance (replication factor is configurable)

• Each partition has a leader server (handles reads & writes)

• The other replicas act as followers (they replicate the leader)

• If the leader fails, one of the followers becomes the new leader
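The leader/follower failover described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: `PartitionReplicas` and `fail_broker` are hypothetical names, and promoting the first remaining follower is a stand-in for Kafka's actual election logic, which is more involved.

```python
from dataclasses import dataclass


@dataclass
class PartitionReplicas:
    """Hypothetical bookkeeping for one partition: a leader plus its followers."""
    leader: int      # broker id serving reads and writes
    followers: list  # broker ids replicating the leader

    def fail_broker(self, broker_id: int) -> None:
        """Drop a failed broker; promote a follower if the leader was lost."""
        if broker_id == self.leader:
            if not self.followers:
                raise RuntimeError("no replica left to take over")
            # Simplification: promote the first remaining follower.
            self.leader = self.followers.pop(0)
        elif broker_id in self.followers:
            self.followers.remove(broker_id)


partition = PartitionReplicas(leader=1, followers=[2, 3])
partition.fail_broker(1)
print(partition.leader, partition.followers)  # 2 [3]
```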


Producer

• Decides which message goes to which partition

– Round-robin (to balance load)

– Semantic partitioning (e.g. based on a key in the message)
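The two partitioning strategies can be sketched as plain functions. This is an illustration only: the function names are hypothetical, and `zlib.crc32` is used as a convenient stable hash, not as the hash function that Kafka clients use.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions: int):
    """Return a function that cycles through partitions, ignoring the message."""
    counter = itertools.cycle(range(num_partitions))
    return lambda _message: next(counter)


def key_partitioner(key: bytes, num_partitions: int) -> int:
    """Semantic partitioning: the same key always maps to the same partition."""
    # crc32 is an illustrative stable hash, not the one Kafka clients use.
    return zlib.crc32(key) % num_partitions


pick = round_robin_partitioner(3)
print([pick(m) for m in ["a", "b", "c", "d"]])  # [0, 1, 2, 0]
print(key_partitioner(b"user-42", 3) == key_partitioner(b"user-42", 3))  # True
```

Round-robin spreads load evenly, while keyed partitioning keeps all messages for one key (for example one user) in a single partition, so they stay in order relative to each other.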


Consumer

• Queue vs. publish/subscribe

• Traditional queue ordering vs. per-partition ordering
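Kafka covers both models through consumer groups: consumers in the same group divide a topic's partitions among themselves (queue-like), while each separate group receives all partitions (publish/subscribe-like). A rough sketch of that idea, assuming a simple round-robin assignment; `assign_partitions` and the consumer names are hypothetical, and real assignment strategies are configurable.

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Spread partitions over the consumers of ONE group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Queue-like: two consumers in the same group split the partitions.
group_a = assign_partitions(partitions, ["a1", "a2"])
# Publish/subscribe-like: a second group independently gets every partition.
group_b = assign_partitions(partitions, ["b1"])

print(group_a)  # {'a1': [0, 2], 'a2': [1, 3]}
print(group_b)  # {'b1': [0, 1, 2, 3]}
```

Because each partition has exactly one owner within a group, messages from that partition are consumed in order, which is where per-partition ordering comes from.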


Guarantees

• Messages sent by a producer to a particular partition are appended in the order they are sent.

• Consumers see messages in the order they are stored in the log.
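A small simulation shows what these guarantees do and do not cover: order is preserved within a partition, not across partitions. The fixed `partition_of` map below is a hypothetical stand-in for a real partitioner, and the events are made-up sample data.

```python
from collections import defaultdict

# Messages in the order one producer sends them: (key, event).
sent = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]

# Hypothetical routing: a fixed key-to-partition map stands in for a partitioner.
partition_of = {"user-1": 0, "user-2": 1}

partitions = defaultdict(list)
for key, event in sent:
    partitions[partition_of[key]].append((key, event))  # append-only, in send order

print(partitions[0])
# [('user-1', 'login'), ('user-1', 'click'), ('user-1', 'logout')]
```

A consumer reading partition 0 sees the events for `user-1` exactly in send order. How reads of partition 0 and partition 1 interleave is not defined by these guarantees.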


Demo

• Basic command-line tools

– Start a server

– Create a topic

– Send a message

– Start a consumer

– Multi-broker cluster

• Running a tool with no arguments displays usage information


Clients

• Java

• Python

• Ruby

• Go

• C/C++

• .NET

• Clojure

• Node.js

• Scala

• JRuby

• Perl

• Erlang

• PHP

• Rust

• HTTP REST

https://cwiki.apache.org/confluence/display/KAFKA/Clients


Administrative Tools

• Kafka Manager (developed at Yahoo)

• Kafkat: Command-line administration for Kafka brokers.

• Kafka Web Console: Displays information about your Kafka cluster, including which nodes are up and what topics they host data for.

• Kafka Offset Monitor: Displays the state of all consumers and how far behind the head of the stream they are.


Ecosystem

• Samza

• Spark Streaming

• Storm

https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem





Use Cases

• Messaging

• Website activity tracking (at LinkedIn)

• Metrics

• Log aggregation

• Stream processing (with Storm or Samza)

• Event sourcing (state changes are logged in time order)

• Commit log (like a database transaction log; supported by log compaction)


Who uses it?

• LinkedIn

• Yahoo

• Twitter

• Netflix

• Spotify

• Pinterest

• Uber

• Goldman Sachs

• Tumblr

• PayPal

• Box

• Airbnb

• Mozilla

• Cisco

• Etsy

• Foursquare

• StumbleUpon

• Coursera

• …

https://cwiki.apache.org/confluence/display/KAFKA/Powered+By




Resources

• http://kafka.apache.org/

• https://cwiki.apache.org/confluence/display/KAFKA/Index

• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

• http://www.confluent.io/blog


Q & A


About Me

• Twitter: @akisemre

• LinkedIn: https://tr.linkedin.com/in/emreakis