
Kafka Streams: The Stream Processing Engine of Apache Kafka


Page 1:

Kafka Streams: The New Smart Kid On The Block
The Stream Processing Engine of Apache Kafka

Eno Thereska, [email protected]

enothereska

Big Data London 2016
Slide contributions: Michael Noll

Page 2:

Apache Kafka and Kafka Streams API

Page 3:

What is Kafka Streams: Unix analogy

$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt

(Diagram labels: Kafka Core, Kafka Connect, Kafka Streams)
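To make the analogy concrete, here is a minimal sketch (not part of the original deck) of the same pipeline expressed with the Kafka Streams DSL; the topic names are hypothetical and the usual configuration boilerplate is omitted (a full app skeleton appears further below).

// The Unix pipeline above, re-expressed with the Kafka Streams DSL (sketch; hypothetical topic names).
KStreamBuilder builder = new KStreamBuilder();

KStream<String, String> lines = builder.stream("lines-topic");      // ~ cat < in.txt
lines.filter((key, value) -> value.contains("apache"))              // ~ grep "apache"
     .mapValues(value -> value.toUpperCase())                       // ~ tr a-z A-Z
     .to("lines-uppercased-topic");                                 // ~ > out.txt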

Page 4:

When to use Kafka Streams

• Mainstream Application Development
• When running a cluster would suck
• Microservices
• Fast Data apps for small and big data
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• The “T” in ETL
• <and more>

• Use case examples
  • Real-time monitoring and intelligence
  • Customer 360-degree view
  • Fraud detection
  • Location-based marketing
  • Fleet management
  • <and more>

Page 5:

Some use cases in the wild & external articles
• Applying Kafka Streams for internal message delivery pipeline at LINE Corp.
  • http://developers.linecorp.com/blog/?p=3960
  • Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users
• Microservices and reactive applications
  • https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
• User behavior analysis
  • https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
• Containerized Kafka Streams applications in Scala
  • https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
• Geo-spatial data analysis
  • http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
• Language classification with machine learning
  • https://dzone.com/articles/machine-learning-with-kafka-streams

Page 6:

Architecture comparison: use case example
Real-time dashboard for security monitoring

“Which of my data centers are under attack?”

Page 7:

Architecture comparison: use case example

Before: capture business events in Kafka (1), then you must process the events with a separate cluster, e.g. Spark, running your “job” (2), must share the latest results through separate systems, e.g. MySQL (3), and other apps, such as a dashboard frontend, access the latest results by querying these DBs (4).

Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities
• Conflicting priorities: infrastructure teams vs. product teams
• Complexity: a lot of moving pieces that are also complex individually
• Is all this a part of the solution or part of your problem?

With Kafka Streams: capture business events in Kafka (1), process the events with standard Java apps that use the Kafka Streams library, i.e. your app (2), and now other apps, including the dashboard frontend, can directly query the latest results (3).

With Kafka Streams: simplified, app-centric architecture that puts app owners in control

Page 8:

How do I install Kafka Streams?
• There is and there should be no “installation” – Build Apps, Not Clusters!
• It’s a library. Add it to your app like any other library.

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams</artifactId>
  <version>0.10.0.1</version>
</dependency>

Page 9:

How do I package and deploy my apps? How do I …?
• Whatever works for you. Stick to what you/your company think is the best way.
• Kafka Streams integrates well with what you already have.
• Why? Because an app that uses Kafka Streams is… a normal Java app.
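As a concrete illustration (not from the original deck), a complete Kafka Streams application can be as small as the following sketch; the application id, broker address, and topic names are hypothetical.

// A plain Java app: configure the library, define a topology, start it.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class MyStreamsApp {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> input = builder.stream("input-topic");
    input.mapValues(value -> value.toUpperCase()).to("output-topic");

    // Package and run it however you already run Java apps: java -jar, Docker, Mesos, YARN, ...
    KafkaStreams streams = new KafkaStreams(builder, props);
    streams.start();
  }
}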

Page 10:

Available APIs

Page 11:

• API option 1: Kafka Streams DSL (declarative)

KStream<Integer, Integer> input = builder.stream("numbers-topic");

// Stateless computation
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

The preferred API for most use cases.

The DSL particularly appeals to users:
• When familiar with Spark, Flink
• When fans of Scala or functional programming

Page 12:

• API option 2: Processor API (imperative)

class PrintToConsoleProcessor<K, V> implements Processor<K, V> {

  @Override
  public void init(ProcessorContext context) {}

  @Override
  public void process(K key, V value) {
    System.out.println("Received record with key=" + key + " and value=" + value);
  }

  @Override
  public void punctuate(long timestamp) {}

  @Override
  public void close() {}
}

Full flexibility, but more manual work.

The Processor API appeals to users:
• When familiar with Storm, Samza
  • Still, check out the DSL!
• When requiring functionality that is not yet available in the DSL
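For completeness, here is a sketch (not shown in the deck) of how such a processor could be wired into a topology with the low-level TopologyBuilder; the node and topic names are hypothetical, and props stands for the usual Kafka Streams configuration.

// Wiring the processor into a topology (low-level API; hypothetical names).
TopologyBuilder topology = new TopologyBuilder();
topology.addSource("Source", "input-topic")                             // consume records from a topic
        .addProcessor("Print", PrintToConsoleProcessor::new, "Source"); // attach our processor to the source

KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();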

Page 13:

“My WordCount is better than your WordCount” (?)

(Side-by-side code screenshots: Kafka vs. Spark)

These isolated code snippets are nice (and actually quite similar), but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, etc. – but we can see none of this here.

Page 14:

WordCount in Kafka

(Code screenshot: WordCount implemented with Kafka Streams)
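Since the original slide shows the code only as a screenshot, here is a hedged reconstruction of a typical WordCount in the Kafka Streams DSL (API as of Kafka 0.10.1; topic and store names are hypothetical and not necessarily those on the slide; assumes String default serdes).

// WordCount sketch with the Kafka Streams DSL.
// (imports: java.util.Arrays, org.apache.kafka.streams.KeyValue, org.apache.kafka.common.serialization.Serdes)
KStream<String, String> textLines = builder.stream("text-lines-topic");

KTable<String, Long> wordCounts = textLines
    .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))  // split lines into words
    .map((key, word) -> new KeyValue<>(word, word))                          // re-key by word
    .groupByKey()
    .count("word-counts-store");                                             // named state store

wordCounts.to(Serdes.String(), Serdes.Long(), "word-counts-topic");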

Page 15:

Compared to: WordCount in Spark 2.0

(Code screenshot with callouts 1–3)

Runtime model leaks into processing logic (here: interfacing from Spark with Kafka)

Page 16:

Compared to: WordCount in Spark 2.0

(Code screenshot with callouts 4–5)

Runtime model leaks into processing logic (driver vs. executors)

Page 17:

Page 18:

Kafka Streams key concepts

Page 19:

Key concepts

Page 20:

Key concepts

Page 21:

Key concepts

Kafka Core Kafka Streams

Page 22:

Streams meet Tables

Page 23:

Streams meet Tables

• http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
• http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables

Page 24:

Motivating example: continuously compute current users per geo-region

(Diagram: a real-time dashboard asking “How many users younger than 30y, per region?” shows per-region counts and is fed by two Kafka topics: user-locations, owned by the mobile team, and user-prefs, owned by the web team; sample records include alice: Asia, 25y, … and bob: Europe, 46y, …)

Page 25:

Motivating example: continuously compute current users per geo-region

(Diagram, next step: a new record alice → Europe arrives on the user-locations topic)

Page 26:

Motivating example: continuously compute current users per geo-region

(Diagram, next step: the alice → Europe record updates alice’s profile from “Asia, 25y, …” to “Europe, 25y, …”; bob’s profile “Europe, 46y, …” is unchanged)

Page 27:

Motivating example: continuously compute current users per geo-region

(Diagram, next step: the per-region counts on the dashboard update accordingly, -1 for alice’s old region and +1 for her new one, e.g. 8 → 7 and 5 → 6)

Page 28:

Same data, but different use cases require different interpretations

alice San Francisco

alice New York City

alice Rio de Janeiro

alice Sydney

alice Beijing

alice Paris

alice Berlin

Page 29:

Same data, but different use cases require different interpretations

alice San Francisco

alice New York City

alice Rio de Janeiro

alice Sydney

alice Beijing

alice Paris

alice Berlin

Use case 1: Frequent traveler status?

Use case 2: Current location?

Page 30:

Same data, but different use cases require different interpretations

Use case 1: Frequent traveler status?
“Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin.”

Use case 2: Current location?
“Alice is in SFO, NYC, Rio, Sydney, Beijing, Paris, Berlin right now.” (each newer location superseding the previous one)

Page 31:

Streams meet Tables

When you need all the values of a key (example: all the places Alice has ever been to), you’d read the Kafka topic into a KStream, so that the topic is interpreted as a record stream, with messages interpreted as INSERTs (append).

Page 32:

Streams meet Tables

When you need all the values of a key (example: all the places Alice has ever been to), you’d read the Kafka topic into a KStream, so that the topic is interpreted as a record stream, with messages interpreted as INSERTs (append).

When you need the latest value of a key (example: where Alice is right now), you’d read the Kafka topic into a KTable, so that the topic is interpreted as a changelog stream, with messages interpreted as UPDATEs (overwrite existing value).
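In code, the two interpretations correspond to two different read operations in the DSL; a minimal sketch (topic name hypothetical), where you would pick one or the other depending on the use case:

// Interpretation 1: record stream – every place Alice has ever been to (INSERT semantics).
KStream<UserId, Location> allLocationUpdates = builder.stream("user-locations-topic");

// Interpretation 2: changelog stream – only the latest location per user (UPDATE semantics).
KTable<UserId, Location> latestLocationPerUser = builder.table("user-locations-topic");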

Page 33:

Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

Page 34:

Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

(Diagram: the alice → Europe update flows from the user-locations topic into the userProfiles KTable, changing alice’s entry from “Asia, 25y, …” to “Europe, 25y, …”)

Page 35:

Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

// Compute per-region statistics (continuously updated)
KTable<Location, Long> usersPerRegion = userProfiles
    .filter((userId, profile) -> profile.age < 30)
    .groupBy((userId, profile) -> profile.location)
    .count();

(Diagram: the update flows through to the usersPerRegion KTable, e.g. Asia 8 → 7 and Europe 5 → 6, while Africa stays at 3)
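One possible continuation (a sketch, not shown on the slide): write the continuously updated per-region counts back to a Kafka topic so the dashboard, or any other consumer, can pick them up; the topic name is hypothetical and default serdes matching the key and value types are assumed.

// Every update to the KTable becomes a new record on the output topic.
usersPerRegion.to("users-per-region-topic");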

Page 36:

Motivating example: continuously compute current users per geo-region

(Diagram: the full picture again, now annotated with the KTables involved: the user-locations topic from the mobile team and the user-prefs topic from the web team are read into KTables, joined into the userProfiles KTable, aggregated into the usersPerRegion KTable, and the dashboard counts update accordingly, -1/+1 for alice’s old and new regions)

Page 37:

Streams meet Tables – in the Kafka Streams DSL

Page 38:

Kafka Streams key features

Page 39:

Key features in 0.10
• Native, 100%-compatible Kafka integration

Page 40:

Native, 100% compatible Kafka integration

Read from Kafka

Write to Kafka

Page 41:

Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant

Page 42:

Scalability, fault tolerance, elasticity

Page 43:

Scalability, fault tolerance, elasticity

Page 44:

Scalability, fault tolerance, elasticity

Page 45:

Scalability, fault tolerance, elasticity

Page 46:

Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations

Page 47:

Stateful computations
• Stateful computations like aggregations or joins require state
  • We already showed a join example in the previous slides.
  • Windowing a stream is stateful, too, but let’s ignore this for now.
  • Example: count() will cause the creation of a state store to keep track of counts
• State stores in Kafka Streams…
  • … are per stream task for isolation (think: share-nothing)
  • … are local for best performance
  • … are replicated to Kafka for elasticity and for fault-tolerance
• Pluggable storage engines (see the sketch below)
  • Default: RocksDB (a key-value store) to allow for local state that is larger than available RAM
  • Further built-in options available: in-memory store
  • You can also use your own, custom storage engine
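A sketch of the pluggable store API (as of Kafka 0.10.x; store and processor names are hypothetical), contrasting the default persistent RocksDB-backed store with the built-in in-memory option:

// Persistent (RocksDB-backed) key-value store: local state may exceed available RAM.
StateStoreSupplier persistentCounts = Stores.create("counts-store")
    .withStringKeys()
    .withLongValues()
    .persistent()
    .build();

// Alternative: in-memory key-value store, kept on the JVM heap.
StateStoreSupplier inMemoryCounts = Stores.create("counts-store")
    .withStringKeys()
    .withLongValues()
    .inMemory()
    .build();

// With the Processor API you register the store on the topology and fetch it in init(), e.g.:
//   topology.addStateStore(persistentCounts, "MyProcessor");
//   KeyValueStore<String, Long> counts =
//       (KeyValueStore<String, Long>) context.getStateStore("counts-store");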

Page 48:

State management with built-in fault-tolerance

State stores

(This is a bit simplified.)

Page 49:

State management with built-in fault-tolerance

State stores

(This is a bit simplified.)

(Diagram: state store records, e.g. charlie → 3, bob → 1, alice → 1, alice → 2)

Page 50:

State management with built-in fault-tolerance

State stores

(This is a bit simplified.)

Page 51:

State management with built-in fault-tolerance

State stores

(This is a bit simplified.)

(Diagram: state store records, e.g. alice → 1, alice → 2)

Page 52:

Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries

Page 53:

Interactive Queries

Before (0.10.0): capture business events in Kafka (1), process the events with Kafka Streams (2), but you must use external systems to share the latest results, and other apps query those external systems for the latest results (!).

After (0.10.1): simplified, more app-centric architecture: capture business events in Kafka (1), process the events with Kafka Streams (2), and now other apps can directly query the latest results (3).
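In code, interactive queries boil down to asking the running KafkaStreams instance for a read-only view of one of its state stores; a sketch (Kafka 0.10.1 API; the store name and key are hypothetical, and streams is the running KafkaStreams instance):

// Query the locally held, continuously updated state directly from the app (e.g. from a REST endpoint).
ReadOnlyKeyValueStore<String, Long> usersPerRegionStore =
    streams.store("users-per-region-store", QueryableStoreTypes.keyValueStore());

Long usersInEurope = usersPerRegionStore.get("Europe");   // latest value for this key held by this instance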

Page 54:

Key features in 0.10
• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once processing guarantees (exactly-once is in the works as we speak)
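To make the time model and windowing bullets concrete, here is a small sketch (0.10.1 DSL; topic and store names are hypothetical) that counts events per key over tumbling 5-minute windows:

// Count page views per user over 5-minute windows (windows are based on record timestamps).
// (imports: java.util.concurrent.TimeUnit, org.apache.kafka.streams.kstream.TimeWindows, Windowed)
KStream<String, String> pageViews = builder.stream("page-views-topic");

KTable<Windowed<String>, Long> viewsPerUserPer5Minutes = pageViews
    .groupByKey()
    .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)), "views-per-user-store");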

Page 55:

Wrapping Up

Page 56:

Where to go from here
• Kafka Streams is available in Confluent Platform 3.0 and in Apache Kafka 0.10
  • http://www.confluent.io/download
• Kafka Streams demos: https://github.com/confluentinc/examples
  • Java 7, Java 8+ with lambdas, and Scala
  • WordCount, Interactive Queries, Joins, Security, Windowing, Avro integration, …
• Confluent documentation: http://docs.confluent.io/current/streams/
  • Quickstart, Concepts, Architecture, Developer Guide, FAQ
• Recorded talks
  • Introduction to Kafka Streams: http://www.youtube.com/watch?v=o7zSLNiTZbA
  • Application Development and Data in the Emerging World of Stream Processing (higher-level talk): https://www.youtube.com/watch?v=JQnNHO5506w

Page 57:

Thank You