Kafka Streams: The Stream Processing Engine of Apache Kafka


1Confidential

Kafka Streams: The New Smart Kid On The Block
The Stream Processing Engine of Apache Kafka

Eno Thereska (eno@confluent.io)

enothereska

Big Data London 2016. Slide contributions: Michael Noll.


Apache Kafka and Kafka Streams API


What is Kafka Streams: Unix analogy

$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt

(Analogy: Kafka Connect plays the role of the file input/output, Kafka Streams plays the role of the processing commands such as grep and tr, and Kafka Core is the pipe that connects them.)
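For intuition, the shape of that Unix pipeline can be written with plain Java 8 streams (the class and method names here are illustrative, not part of Kafka Streams): filter plays the role of grep, map plays the role of tr.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain-Java analogue of: cat < in.txt | grep "apache" | tr a-z A-Z
public class PipelineAnalogy {

    // Each stage maps onto one Unix command in the pipeline.
    public static List<String> process(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("apache"))    // grep "apache"
                .map(line -> line.toUpperCase(Locale.ROOT)) // tr a-z A-Z
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "cat in.txt" would supply the lines; here they are inlined.
        System.out.println(process(List.of("apache kafka", "hello world", "apache storm")));
        // prints [APACHE KAFKA, APACHE STORM]
    }
}
```

Kafka Streams applies the same composable, per-record style to data flowing through Kafka topics instead of files.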


When to use Kafka Streams

• Mainstream application development
• When running a cluster would suck
• Microservices
• Fast Data apps for small and big data
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• The "T" in ETL
• <and more>

Use case examples:
• Real-time monitoring and intelligence
• Customer 360-degree view
• Fraud detection
• Location-based marketing
• Fleet management
• <and more>


Some use cases in the wild & external articles
• Applying Kafka Streams for the internal message delivery pipeline at LINE Corp.
  • http://developers.linecorp.com/blog/?p=3960
  • Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users
• Microservices and reactive applications
  • https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
• User behavior analysis
  • https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
• Containerized Kafka Streams applications in Scala
  • https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
• Geo-spatial data analysis
  • http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
• Language classification with machine learning
  • https://dzone.com/articles/machine-learning-with-kafka-streams


Architecture comparison: use case example
Real-time dashboard for security monitoring

“Which of my data centers are under attack?”


Architecture comparison: use case example

Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities. The processing logic is "your job" on someone else's cluster:
1. Capture business events in Kafka.
2. Must process events with a separate cluster (e.g. Spark).
3. Must share the latest results through separate systems (e.g. MySQL).
4. Other apps access the latest results by querying these DBs.
• Conflicting priorities: infrastructure teams vs. product teams.
• Complexity: a lot of moving pieces that are also complex individually.
• Is all this a part of the solution or part of your problem?

With Kafka Streams: a simplified, app-centric architecture that puts app owners in control. The processing logic is "your app":
1. Capture business events in Kafka.
2. Process events with standard Java apps that use Kafka Streams.
3. Now other apps can directly query the latest results.


How do I install Kafka Streams?
• There is, and there should be, no "installation". Build apps, not clusters!
• It's a library. Add it to your app like any other library.

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.1</version>
</dependency>


How do I package and deploy my apps? How do I …?
• Whatever works for you. Stick to what you or your company think is the best way.
• Kafka Streams integrates well with what you already have.
• Why? Because an app that uses Kafka Streams is… a normal Java app.


Available APIs


API option 1: Kafka Streams DSL (declarative)

KStream<Integer, Integer> input = builder.stream("numbers-topic");

// Stateless computation
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

The preferred API for most use cases.

The DSL particularly appeals to users:
• when familiar with Spark or Flink
• when fans of Scala or functional programming


API option 2: Processor API (imperative)

class PrintToConsoleProcessor<K, V> implements Processor<K, V> {

  @Override
  public void init(ProcessorContext context) {}

  @Override
  public void process(K key, V value) {
    System.out.println("Received record with key=" + key + " and value=" + value);
  }

  @Override
  public void punctuate(long timestamp) {}

  @Override
  public void close() {}
}

Full flexibility but more manual work

The Processor API appeals to users:
• when familiar with Storm or Samza (still, check out the DSL!)
• when requiring functionality that is not yet available in the DSL


"My WordCount is better than your WordCount" (?)

(Kafka Streams and Spark WordCount snippets were shown side by side as images.) These isolated code snippets are nice (and actually quite similar), but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, etc. – but we can see none of this here.


WordCount in Kafka


Compared to: WordCount in Spark 2.0

(The Spark WordCount code was shown as an image.) Runtime model leaks into processing logic (here: interfacing from Spark with Kafka).

Compared to: WordCount in Spark 2.0

(More of the Spark WordCount code was shown as an image.) Runtime model leaks into processing logic (driver vs. executors).

Kafka Streams key concepts


Key concepts


Streams meet Tables


http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables


Motivating example: continuously compute current users per geo-region

Real-time dashboard: "How many users younger than 30y, per region?" (per-region counts shown on a map, e.g. 4, 7, 5, 3, 2, 8)

Inputs: the user-locations topic (owned by the mobile team) and the user-prefs topic (owned by the web team), with records such as:
alice | Asia, 25y, …
bob   | Europe, 46y, …
…     | …

Motivating example (cont.): a new record arrives in the user-locations topic: alice → Europe.

Motivating example (cont.): the table of user records is updated accordingly: alice → Europe, 25y, …; bob → Europe, 46y, …

Motivating example (cont.): the per-region counts on the dashboard update as well: the count for Alice's old region is decremented (-1) and the count for her new region is incremented (+1), e.g. 8 → 7 and 5 → 6.
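The -1/+1 bookkeeping above can be pictured with a small plain-Java sketch (class and method names are illustrative, not Kafka Streams API): when a user's location record changes, the old region's count is decremented and the new region's count is incremented.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the dashboard's per-region count maintenance.
public class RegionCounts {
    private final Map<String, String> userRegion = new HashMap<>();
    private final Map<String, Long> countsPerRegion = new HashMap<>();

    public void onLocationUpdate(String user, String newRegion) {
        String oldRegion = userRegion.put(user, newRegion);
        if (oldRegion != null) {
            countsPerRegion.merge(oldRegion, -1L, Long::sum); // -1 for the old region
        }
        countsPerRegion.merge(newRegion, 1L, Long::sum);      // +1 for the new region
    }

    public long count(String region) {
        return countsPerRegion.getOrDefault(region, 0L);
    }

    public static void main(String[] args) {
        RegionCounts counts = new RegionCounts();
        counts.onLocationUpdate("alice", "Asia");
        counts.onLocationUpdate("alice", "Europe"); // alice moves: Asia -1, Europe +1
        System.out.println(counts.count("Asia") + " " + counts.count("Europe")); // prints "0 1"
    }
}
```

In Kafka Streams the same effect falls out of KTable semantics: an update to a key first retracts its old contribution to the aggregate, then adds the new one.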


Same data, but different use cases require different interpretations

alice San Francisco

alice New York City

alice Rio de Janeiro

alice Sydney

alice Beijing

alice Paris

alice Berlin



Use case 1: Frequent traveler status?

Use case 2: Current location?


Use case 1: Frequent traveler status? Interpreted as a stream of all values: "Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin."
Use case 2: Current location? Interpreted as a table, where the latest value per key wins: "Alice is in Berlin right now."


Streams meet Tables

When you need…          | …then you'd read the Kafka topic into a | …so that the topic is interpreted as a | …with messages interpreted as | Example
All the values of a key | KStream                                 | record stream                          | INSERT (append)               | All the places Alice has ever been to
Latest value of a key   | KTable                                  | changelog stream                       | UPDATE (overwrite existing)   | Where Alice is right now
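The two interpretations in this table can be sketched in plain Java (illustrative names, not the Kafka Streams API): the same sequence of (key, value) records is either kept in full, like a KStream with INSERT semantics, or collapsed to the latest value per key, like a KTable with UPDATE semantics.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the stream/table duality on (user, place) records.
public class StreamTableDuality {

    // KStream-like view: all the places a user has ever been to (INSERT/append).
    public static List<String> asStream(List<Map.Entry<String, String>> records, String key) {
        return records.stream()
                .filter(r -> r.getKey().equals(key))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    // KTable-like view: where each user is right now (UPDATE/latest value wins).
    public static Map<String, String> asTable(List<Map.Entry<String, String>> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> r : records) {
            table.put(r.getKey(), r.getValue()); // overwrite = UPDATE semantics
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> topic = List.of(
                Map.entry("alice", "Paris"),
                Map.entry("alice", "Berlin"));
        System.out.println(asStream(topic, "alice"));    // prints [Paris, Berlin]
        System.out.println(asTable(topic).get("alice")); // prints Berlin
    }
}
```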


Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");


Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

(Diagram: a new user-locations record, alice → Europe, updates the joined userProfiles KTable, e.g. alice → Europe, 25y, …)


Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

// Compute per-region statistics (continuously updated)
KTable<Location, Long> usersPerRegion = userProfiles
    .filter((userId, profile) -> profile.age < 30)
    .groupBy((userId, profile) -> profile.location)
    .count();

(Diagram: when alice's record changes from Asia to Europe, the usersPerRegion KTable updates, e.g. Asia 8 → 7 and Europe 5 → 6, while Africa stays at 3.)
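The filter/groupBy/count step can be mirrored with plain Java collections to show what it computes on a snapshot of the profiles (UserProfile here is a stand-in record, not the class from the slides):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative, batch-style rendering of the usersPerRegion computation.
public class UsersPerRegion {

    public record UserProfile(String userId, String location, int age) {}

    public static Map<String, Long> usersPerRegion(List<UserProfile> profiles) {
        return profiles.stream()
                .filter(p -> p.age() < 30)                            // filter(... age < 30)
                .collect(Collectors.groupingBy(UserProfile::location, // groupBy(location)
                                               Collectors.counting())); // count()
    }

    public static void main(String[] args) {
        List<UserProfile> profiles = List.of(
                new UserProfile("alice", "Europe", 25),
                new UserProfile("bob", "Europe", 46),
                new UserProfile("carol", "Asia", 29));
        // bob is filtered out (46y), so each region counts one user under 30
        System.out.println(usersPerRegion(profiles));
    }
}
```

The crucial difference in Kafka Streams: the KTable version is not a one-shot batch job but is continuously updated as profile records change.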


Motivating example: continuously compute current users per geo-region

(Same diagram as before, now annotated: the user-locations topic (mobile team) and the user-prefs topic (web team) are each read into a KTable, the joined user profiles form a KTable, and the per-region counts behind the dashboard form a KTable as well.)

Streams meet Tables – in the Kafka Streams DSL


Kafka Streams key features


Key features in 0.10
• Native, 100%-compatible Kafka integration


Native, 100% compatible Kafka integration

Read from Kafka

Write to Kafka


Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant


Scalability, fault tolerance, elasticity


Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations


Stateful computations
• Stateful computations like aggregations or joins require state.
  • We already showed a join example in the previous slides.
  • Windowing a stream is stateful, too, but let's ignore this for now.
  • Example: count() will cause the creation of a state store to keep track of counts.
• State stores in Kafka Streams…
  • are per stream task, for isolation (think: share-nothing)
  • are local, for best performance
  • are replicated to Kafka, for elasticity and for fault-tolerance
• Pluggable storage engines
  • Default: RocksDB (a key-value store), to allow for local state that is larger than available RAM
  • Further built-in options available: in-memory store
  • You can also use your own, custom storage engine
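The "replicated to Kafka" point can be pictured with a plain-Java sketch (class names are illustrative, not the Kafka Streams API): every local update is also appended to a changelog, which stands in for the compacted Kafka changelog topic, and a restarted task rebuilds its local store by replaying it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a changelog-backed local state store.
public class ChangeloggedStore {
    private final Map<String, Long> local = new HashMap<>();
    private final List<Map.Entry<String, Long>> changelog; // stands in for the Kafka changelog topic

    public ChangeloggedStore(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
    }

    public void put(String key, long value) {
        local.put(key, value);                // update the local store (e.g. RocksDB)
        changelog.add(Map.entry(key, value)); // replicate the update to the changelog
    }

    public Long get(String key) {
        return local.get(key);
    }

    // Recovery: a fresh instance replays the changelog to rebuild local state.
    public static ChangeloggedStore restoreFrom(List<Map.Entry<String, Long>> changelog) {
        ChangeloggedStore store = new ChangeloggedStore(new ArrayList<>(changelog));
        for (Map.Entry<String, Long> update : changelog) {
            store.local.put(update.getKey(), update.getValue());
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> changelog = new ArrayList<>();
        ChangeloggedStore store = new ChangeloggedStore(changelog);
        store.put("alice", 1L);
        store.put("alice", 2L); // after replay the latest value wins, like log compaction
        ChangeloggedStore recovered = ChangeloggedStore.restoreFrom(changelog);
        System.out.println(recovered.get("alice")); // prints 2
    }
}
```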


State management with built-in fault-tolerance

State stores

(This is a bit simplified.)


Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries


Interactive Queries

Before (0.10.0):
1. Capture business events in Kafka.
2. Process the events with Kafka Streams.
3. Must use external systems to share the latest results.
4. Other apps query those external systems for the latest results.

After (0.10.1): simplified, more app-centric architecture:
1. Capture business events in Kafka.
2. Process the events with Kafka Streams.
3. Now other apps can directly query the latest results.


Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka's security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once processing guarantees (exactly-once is in the works as we speak)


Wrapping Up


Where to go from here
• Kafka Streams is available in Confluent Platform 3.0 and in Apache Kafka 0.10
  • http://www.confluent.io/download
• Kafka Streams demos: https://github.com/confluentinc/examples
  • Java 7, Java 8+ with lambdas, and Scala
  • WordCount, Interactive Queries, Joins, Security, Windowing, Avro integration, …
• Confluent documentation: http://docs.confluent.io/current/streams/
  • Quickstart, Concepts, Architecture, Developer Guide, FAQ
• Recorded talks
  • Introduction to Kafka Streams: http://www.youtube.com/watch?v=o7zSLNiTZbA
  • Application Development and Data in the Emerging World of Stream Processing (higher-level talk): https://www.youtube.com/watch?v=JQnNHO5506w


Thank You
