Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH

Confluent Meetup Munich, 8.10.2018


Integrate existing (batch) data sources?

Check consistency with data sources?

Build realtime data visualizations?

https://flic.kr/p/5eQA7e https://flic.kr/p/bpFt7U


Down the event-driven road ..

› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary


A typical analytic data platform

Layers: ingress, raw, processed, datahub, analysis, egress

› Ingress: flat files, databases, APIs, ...
› Data layers: (Hive) tables
› Processing: batch processing (Spark, Hive, ..)
› Scheduling, orchestration, metadata: Airflow, Hive Metastore
› User access, system integration, development: SQL, notebooks (Zeppelin, ..)


A typical (?) streaming data platform

Layers: ingress, raw, processed, datahub, analysis, egress

› Ingress: input data (streams), Kafka Connect
› Data layers: (Kafka) topics, KTables, ..
› Processing: stream processing (Kafka Streams, Nifi, ..)
› Scheduling, orchestration, metadata: (Confluent) Schema Registry
› User access, system integration, development: KSQL


Integrating web tracking

company website → tracking pixel → tracking service → raw tracking data

Integrating web tracking: setup / constraints

› Hortonworks-based platform, including Nifi and Confluent Platform
› Apache Airflow is the established scheduling / workflow tool, integrated into monitoring, alerting, ..
› Tracking service: currently a batch-oriented API (request data, get download links, ..), but a click event stream is planned
› Developers / analysts with mixed backgrounds w.r.t. programming skills

Apache Nifi in a nutshell

› drag-and-drop visual definition of data pipelines
› various built-in connectors (file, stream, database, service, ...)
› event-based processing paradigm
› built-in queues, data provenance, backpressure handling, registry, ...
› focus: ingest & lightweight (!) transformation
› not a complex event processor (like Kafka Streams, Flink, Spark Streaming, ...)
› integrated into HDP stack

Apache Airflow in a nutshell

› Python library to define & schedule batch workflows
› programmatic specification of a „DAG“ (= tasks + dependencies)
› clean handling of job run metadata (success, duration, ..)
› developed by Airbnb, open-sourced 2015
› built-in standard operators (bash, hive, spark, kubernetes, ..)
› easily extensible (custom operators, ..)
› once used -> never Oozie again :)
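To make the "DAG = tasks + dependencies" idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the Airflow API; it only relies on the standard library's `graphlib` (Python 3.9+). The task names and the `run` helper are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "trigger_download": set(),
    "check_status": {"trigger_download"},
    "fetch_download_links": {"check_status"},
    "store_data": {"fetch_download_links"},
}


def run_order(deps):
    """Return one valid execution order for the tasks in `deps`."""
    return list(TopologicalSorter(deps).static_order())


def run(deps, actions):
    """Run tasks in dependency order and record simple run metadata."""
    results = {}
    for task in run_order(deps):
        try:
            actions[task]()
            results[task] = "success"
        except Exception as exc:  # stop at the first failing task
            results[task] = f"failed: {exc}"
            break
    return results


# Dummy actions that only record that they were executed.
executed = []
actions = {name: (lambda n=name: executed.append(n)) for name in dependencies}
results = run(dependencies, actions)
```

A real scheduler adds retries, scheduling intervals and persisted run metadata on top of this ordering step; the sketch only shows why declaring dependencies is enough to derive an execution order.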


Integrating web tracking: options

tracking service → tracking data

Option: Airflow only
+ integrated into monitoring, ..
+ job status handling, reloading
- not prepared for future stream API
- handling file content complicated

Option: Unified abstraction (e.g. Apache Beam)
+ one model for batch / stream ingest
- comparatively high entry barrier

Option: Nifi only
+ visual pipeline definition
+ easy handling of file content
+ event-based paradigm
+ operators available
- custom status handling, reloading

Option: Kafka-Connect
+ fault-tolerant
+ scalable setup
- custom connector coding
- custom status handling, reloading

Integrating web tracking: chosen solution – Airflow + Nifi

› Combines advantages of Airflow & Nifi
› Prepared for future streaming API
› Integrated into monitoring, alerting, ..
› Status handling / reloading easy

tracking service
trigger (hourly) download
check status (sensors)
trigger, fetch download links
download, process, store data
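The steps above (trigger a download, check status with a sensor, fetch the download links) follow a simple poll-until-ready pattern. The sketch below shows that pattern in plain Python. `FakeTrackingService`, its methods, the URL and `hourly_run` are hypothetical stand-ins, not the real tracking API and not Airflow or Nifi code.

```python
import time


class FakeTrackingService:
    """Hypothetical stand-in for the batch-oriented tracking API."""

    def __init__(self, polls_until_ready=2):
        self._remaining = polls_until_ready

    def request_export(self, hour):
        # Ask the service to prepare the data of one hour.
        return f"job-{hour}"

    def is_ready(self, job_id):
        # The export becomes ready after a fixed number of polls.
        self._remaining -= 1
        return self._remaining <= 0

    def download_links(self, job_id):
        return [f"https://example.invalid/{job_id}/part-0.csv"]


def wait_until(predicate, poke_interval=0.01, timeout=1.0):
    """Sensor-style polling: re-check `predicate` until it holds or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poke_interval)
    return False


def hourly_run(service, hour, timeout=1.0):
    """Trigger the export, wait for it, then return the download links."""
    job_id = service.request_export(hour)
    if not wait_until(lambda: service.is_ready(job_id), timeout=timeout):
        raise TimeoutError(f"export {job_id} not ready after {timeout}s")
    return service.download_links(job_id)


links = hourly_run(FakeTrackingService(), "2018-10-01T11")
```

Keeping the waiting logic in the scheduler means a timeout shows up as a failed, reloadable run, while the actual download and processing can stay in the event-based tool.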


Checking consistency: Customer Consent

customer portal: grants / revokes consent
Customer (consent) database: stores consent
kafka: consent event, writes consent to hive
in sync?

https://flic.kr/p/9yHuk8

Checking consistency: setup / constraints

› Analysts need an up-to-date version of customer consent information in the platform
› Hard correctness requirements (especially regarding revoked consent)
› Continuous monitoring of correctness
› Alerting in case of differences


Checking Consistency: Statistics Events

customer portal → kafka

› use existing channel (kafka)
› source injects periodic „statistics events“ into the stream with a defined measure point (in time)

Events over time:

{type:GRANT, cid:12, ts:2018-10-01 11:00:00 ..}
{type:GRANT, cid:10, ts:2018-10-01 11:01:00 ..}
{type:REVOK, cid:09, ts:2018-10-01 11:01:05 ..}
{type=STAT, measure_ts=2018-10-01 11:01:20, stats={num_consent_v1:72625, num_consent_v2: 6252, ..}}


Checking Consistency: Evaluate Statistics Event

› perform count on target side (Hive) up to $measurePoint
› compare counts
› counts = simple plausibility check, but more elaborate checks (hashes) are conceivable

Statistics event from the source (Customer (consent) database):

{type=STAT, measure_ts=2018-10-01 11:01:20, stats={num_consent_v1:72625, num_consent_v2: 6252, ..}}

in sync?

Counts on the target side (Hive):

{measure_ts=2018-10-01 11:01:20, hive_stats={num_consent_v1:72625, num_consent_v2: 6252, ..}}
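The comparison of a statistics event against target-side counts can be sketched as follows. This is an illustration under assumptions: the row layout (`cid`, `version`, `ts`), the way rows map to `num_consent_*` counters and both helper functions are hypothetical. In the setup described here the target-side count would come from a Hive query rather than an in-memory list.

```python
from datetime import datetime


def parse_ts(text):
    """Parse timestamps in the format used by the example events."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def count_until(rows, measure_ts):
    """Count target-side rows per consent version up to the measure point."""
    limit = parse_ts(measure_ts)
    counts = {}
    for row in rows:
        if parse_ts(row["ts"]) <= limit:
            key = f"num_consent_{row['version']}"
            counts[key] = counts.get(key, 0) + 1
    return counts


def differences(stat_event, rows):
    """Return {metric: (expected, actual)} for every metric that is out of sync."""
    actual = count_until(rows, stat_event["measure_ts"])
    return {
        name: (expected, actual.get(name, 0))
        for name, expected in stat_event["stats"].items()
        if actual.get(name, 0) != expected
    }


# Hypothetical target-side rows (stand-in for the Hive table).
target_rows = [
    {"cid": 12, "version": "v1", "ts": "2018-10-01 11:00:00"},
    {"cid": 10, "version": "v1", "ts": "2018-10-01 11:01:00"},
    {"cid": 7, "version": "v2", "ts": "2018-10-01 11:01:10"},
    {"cid": 3, "version": "v1", "ts": "2018-10-01 11:02:00"},  # after measure point
]

stat_event = {
    "type": "STAT",
    "measure_ts": "2018-10-01 11:01:20",
    "stats": {"num_consent_v1": 2, "num_consent_v2": 1},
}

in_sync = differences(stat_event, target_rows) == {}
```

An empty result means the counts agree at the measure point; any non-empty result is what would feed the alerting.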


Realtime visualizations: Online Shop Purchases

online shop → purchase event (JMS) → normalization, filtering, aggregation, .. → realtime dashboard

https://flic.kr/p/9yHuk8

Realtime visualizations: setup / constraints

› Goal: timely insights into various purchase aspects (items bought last 5min, ..)
› flexible / configurable frontend (time window, aggregation dimension, ..)
› scalable to 100s / 1000s of dashboard users
› low latency of dashboard backend


Realtime visualizations: components / options

Stages: transport layer, processing, service backend, service API

Options:

› JMS → Kafka-connect → Kafka → Kafka-streams → Kafka-connect → HBase → Phoenix / JDBC → Spring Boot
› JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot
› JMS → Nifi → Kafka → Kafka-connect → HBase → Phoenix / JDBC → Spring Boot

Aggregation variants: aggregation during processing; aggregation at query-time; built-in, configurable aggregation


Realtime visualizations: chosen solution

JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot

› Druid: time series database with focus on
  › realtime ingestion, good Kafka integration
  › „slice-and-dice“ queries
  › distributed scale-out architecture
› Event processing kept simple in Nifi
  › mainly cleaning, transformation
  › aggregation is pushed down to Druid
› But: yet another distributed system .. :(
  › Experiences good so far, but needs work / skills
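To illustrate what "configurable time window and aggregation dimension" means when aggregation is pushed down to the backend, here is a small sketch in plain Python. It is not Druid's query API; the event fields (`ts`, `item`, `country`, `quantity`) and the `aggregate` function are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical cleaned purchase events as they might arrive after the Nifi step.
events = [
    {"ts": datetime(2018, 10, 8, 18, 50, 0), "item": "book", "country": "DE", "quantity": 1},
    {"ts": datetime(2018, 10, 8, 18, 56, 0), "item": "book", "country": "DE", "quantity": 2},
    {"ts": datetime(2018, 10, 8, 18, 58, 30), "item": "lamp", "country": "AT", "quantity": 1},
    {"ts": datetime(2018, 10, 8, 18, 59, 59), "item": "book", "country": "AT", "quantity": 1},
]


def aggregate(events, now, window, dimension):
    """Sum quantities per `dimension` for events in the window (now - window, now]."""
    start = now - window
    totals = Counter()
    for event in events:
        if start < event["ts"] <= now:
            totals[event[dimension]] += event["quantity"]
    return dict(totals)


now = datetime(2018, 10, 8, 19, 0, 0)
last_5min_by_item = aggregate(events, now, timedelta(minutes=5), "item")
last_5min_by_country = aggregate(events, now, timedelta(minutes=5), "country")
```

Because window and dimension are query parameters rather than part of the processing pipeline, the frontend can change both without redeploying any stream job, which is the flexibility the dashboard requirements ask for.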

The human factor ..

› Technology moves from batch to stream – what about people?
› Analysts' world = often batch world
  › tooling centered around static datasets
  › can (and must) be generated from streams
  › but: education towards stream / event-based thinking necessary!
› Incremental / stream-based data exchange = paradigm shift
  › efforts / commitment „from both ends“ necessary

https://flic.kr/p/f2Wx6t


Stream me up, Scotty ..

The future is event-based, but on the way:

› Existing batch-oriented APIs
  › use (scheduled) event-based tools for easier later migration
› Checking consistency
  › inject plausibility checks into data stream
› Realtime visualizations
  › Druid + Kafka powerful and flexible combination
› Don't forget the human in the loop!

Thank you

Dr. Dominik Benz

inovex GmbH

Park Plaza

Ludwig-Erhard-Allee 6

76131 Karlsruhe