Streaming Data and Stream Processing with Apache Kafka

Page 1: Streaming Data and Stream Processing with Apache Kafka

Streaming Data and Stream Processing with Apache Kafka™

David Tucker, Director of Partner Engineering, Confluent

Sid Goel, Partner and Solution Architect, KPI Partners

Page 2: Streaming Data and Stream Processing with Apache Kafka

The opportunity: The shift to streams & digital transformation

By 2020, 70% of organizations will adopt data streaming to enable real-time analytics.

- Gartner | Nov 2016

Streaming ingestion and analytics will become a must-have for digital winners.

- Forrester | Nov. 2015

Page 3: Streaming Data and Stream Processing with Apache Kafka

More Facts & Figures

90% of CEOs believe the digital economy will have a major impact on their industry.

- MIT Sloan / Capgemini (2013)

#1 most important capability executives hope to improve via digital transformation: Ability to support real-time transactions.

- The Economist (2015)

Digital disruptors will displace 40% of incumbent companies over the next 5 years.

- Center for Digital Transformation (2015)

Page 4: Streaming Data and Stream Processing with Apache Kafka

Vision of a Streaming Enterprise

[Diagram: a Streaming Platform surrounded by Search, NewSQL / NoSQL, RDBMS, Monitoring, Document Store, Real-time Analytics, Data Warehouse, Mobile Apps, Legacy Apps, and Hadoop]

Page 5: Streaming Data and Stream Processing with Apache Kafka

What Can You Do with a Streaming Platform?

• Publish and Subscribe to streams of data
  – Analogous to traditional messaging systems

• Store streams of data
  – Consumers can look back in time

• Process streams of data
  – Analyze and correlate events in real time
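The publish, store, and look-back capabilities listed above can be illustrated with a toy in-memory log. This is only a sketch of the idea, not the Kafka client API: the class `TopicLog` and its methods `publish`, `poll`, and `seek` are hypothetical names chosen for this example, and a real topic is partitioned, replicated, and persisted on disk.

```python
from collections import defaultdict


class TopicLog:
    """Toy stand-in for a single-partition topic (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._records = []                   # append-only; list index == offset
        self._positions = defaultdict(int)   # next offset to read, per consumer

    def publish(self, value):
        """Append a record and return the offset it was stored at."""
        self._records.append(value)
        return len(self._records) - 1

    def poll(self, consumer):
        """Return every stored record this consumer has not read yet."""
        start = self._positions[consumer]
        batch = self._records[start:]
        self._positions[consumer] = len(self._records)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer to another stored offset, e.g. to replay history."""
        if not 0 <= offset <= len(self._records):
            raise ValueError("offset out of range")
        self._positions[consumer] = offset


log = TopicLog()
for event in ["login", "click", "purchase"]:
    log.publish(event)

first_read = log.poll("analytics")    # reads all three events
log.seek("analytics", 1)              # look back in time
replayed = log.poll("analytics")      # re-reads from offset 1 onward
late_subscriber = log.poll("audit")   # a consumer that joins later still sees everything
```

Because records stay in the log after being read, each consumer only tracks its own position, which is what separates this model from a queue that deletes messages on delivery.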

Page 6: Streaming Data and Stream Processing with Apache Kafka

The typical architecture

[Diagram labels: Search, Security, Fraud Detection, Application, Monitoring, Data Warehouse; User Tracking, Operational Logs, Operational Metrics; two Apps, each with Databases, Storage, and Interfaces]

Page 7: Streaming Data and Stream Processing with Apache Kafka

Challenges abound

[Diagram labels: Search, Security, Fraud Detection, Application, Monitoring, Hadoop, Data Warehouse; User Tracking, Operational Logs, Operational Metrics; two Apps, each with Databases, Storage, and Interfaces]

• Diverse data sets, arriving at an increasing rate

• Many complex data pipelines

• Require a separate cluster for real-time

• Difficult & time consuming to change

• Require mission critical availability into most recent/relevant data

• Difficult to handle massive amounts of data

Page 8: Streaming Data and Stream Processing with Apache Kafka

Modernized architecture using Apache Kafka

[Diagram labels: Search, Security, Fraud Detection, Application, Monitoring, Data Warehouse; two Apps, each using the Streams API; User Tracking, Operational Logs, Operational Metrics]

Page 9: Streaming Data and Stream Processing with Apache Kafka

Modernized architecture using Apache Kafka

[Diagram labels: Search, Security, Fraud Detection, Application, Monitoring, Data Warehouse; two Apps, each using the Streams API; User Tracking, Operational Logs, Operational Metrics]

• Pub/sub to data streams, alleviate back pressure

• Lightweight, easy to modify with minimal disruption

• Decoupled from upstream apps creating agility

• Real-time, context specific data in the moment

• Handle any volume of data with ease

• Scale to meet demands of diverse streams

Page 10: Streaming Data and Stream Processing with Apache Kafka

Our vision: from big data to stream data

• Big Data was: The More the Better [Chart: Value of Data vs. Volume of Data]

• Stream Data is: The Faster the Better [Chart: Value of Data vs. Age of Data]

• Stream Data can be: Big or Fast (Lambda) [Diagram labels: Streams, Hadoop, Speed Table, Batch Table, DB]

• Stream Data will be: Big AND Fast (Kappa) [Diagram labels: Streams, Job 1, Job 2, Table 1, Table 2, DB]

Apache Kafka is the Enabling Technology of this Transition

Page 11: Streaming Data and Stream Processing with Apache Kafka

Kafka Adoption in Large Enterprises Growing Rapidly

• Travel: 6 of top 10

• Global Banks: 7 of top 10

• Insurance: 8 of top 10

• Telecom: 9 of top 10

Over 35% of the Fortune 500 are using Apache Kafka™

Page 12: Streaming Data and Stream Processing with Apache Kafka

Industries & Use Cases

Universal Use Cases: IoT, Data Pipelines, Microservices, Monitoring

Industry Use Cases

• Financial Services: Fraud Detection, Trade Data Capture, Customer 360

• Retail: Inventory Management, Product Catalog, A/B Testing, Proactive Alerts

• Automotive: Connected Car, Manufacturing Data Processing

• Enterprise Tech: Analytics, Security Operations, Collect Performance Data

• Telecom: Personalized Ad Placement, Customer 360, Network Integrity Systems

• Entertainment/Media: Log Delivery, Increase Ad Delivery Operations, Cross-Device Insights

• Travel/Leisure: Visitor Segmentation, Fraud Detection

• Consumer Tech: Streaming Video, Personalized Customer Experience, Device Telemetry and Analytics

• Healthcare: Patient Monitoring, Pharma Substance Control, Patient Relapse, Lab Results Alerts

Page 13: Streaming Data and Stream Processing with Apache Kafka

Kafka Adoption Across Key Companies

[Company logos grouped by sector: Financial Services, Enterprise Tech, Consumer Tech, Entertainment & Media, Telecom, Retail, Travel & Leisure]

Page 14: Streaming Data and Stream Processing with Apache Kafka

Confluent Enterprise

The only enterprise streaming platform based entirely on Apache Kafka™

Page 15: Streaming Data and Stream Processing with Apache Kafka

Confluent Platform: Enterprise Streaming based on Apache Kafka™

[Diagram: Confluent Platform]

• Data around the platform: Database Changes, Log Events, IoT Data, Web Events, …; CRM, Data Warehouse, Database, Hadoop

• Uses: Data Integration, Monitoring, Analytics, Custom Apps, Transformations, Real-time Applications

• Platform components: Apache Kafka™, Clients, Connectors, Data Compatibility, Monitoring & Administration, Operations

• Legend: Apache Open Source, Confluent Open Source, Confluent Enterprise

Complete. Open. Trusted. Enterprise Grade.

Page 16: Streaming Data and Stream Processing with Apache Kafka

Confluent Completes Kafka

Table columns: Feature, Benefit, Apache Kafka, Confluent Open Source, Confluent Enterprise

• Apache Kafka: High throughput, low latency, high availability, secure distributed streaming platform

• Kafka Connect API: Advanced API for connecting external sources/destinations into Kafka

• Kafka Streams API: Simple library that enables streaming application development within the Kafka framework

• Additional Clients: Supports non-Java clients; C, C++, Python, etc.

• REST Proxy: Provides universal access to Kafka from any network connected device via HTTP

• Schema Registry: Central registry for the format of Kafka data – guarantees all data is always consumable

• Pre-Built Connectors: HDFS, JDBC, Elasticsearch and other connectors fully certified and fully supported by Confluent

• Confluent Control Center: Enables easy connector management and stream monitoring

• Auto Data Balancing: Rebalancing data across cluster to remove bottlenecks

• Replication: Multi-datacenter replication simplifies and automates MDC Kafka clusters

• Support: Enterprise class support to keep your Kafka environment running at top performance (Apache Kafka: Community; Confluent Open Source: Community; Confluent Enterprise: 24x7x365)

Page 17: Streaming Data and Stream Processing with Apache Kafka

How do I get streams of data into and out of my apps?

Connect, Clients, REST

Page 18: Streaming Data and Stream Processing with Apache Kafka

Apache Kafka™ Connect – Streaming Data Capture

[Diagram: Connectors link Sources and Sinks (JDBC, IRC / Twitter, CDC, Elastic, NoSQL, HDFS) to a Kafka Pipeline through the Kafka Connect API]

• Fault tolerant

• Manage hundreds of data sources and sinks

• Preserves data schema

• Part of Apache Kafka project

• Integrated within Confluent Platform’s Control Center

Page 19: Streaming Data and Stream Processing with Apache Kafka

Kafka Connect API, Part of the Apache Kafka™ Project

Connect any source to any target system with Apache Kafka

Integrated

• 100% compatible with Kafka v0.9 and higher

• Integrated with Confluent’s Schema Registry

• Easy to manage with Confluent Control Center

Flexible

• 40+ open source connectors available

• Easy to develop additional connectors

• Flexible support for data types and formats

Compatible

• Maintains critical metadata

• Preserves schema information

• Supports schema evolution

Reliable

• Automated failover

• At-least-once guaranteed

• Balances workload between nodes

Page 20: Streaming Data and Stream Processing with Apache Kafka

Kafka Connect API Library of Connectors

* Denotes Connectors developed at Confluent and distributed by Confluent. Extensive validation and testing have been performed.

[Connector logos grouped by category: Databases, Analytics, Applications / Other, Datastore/File Store]

Page 21: Streaming Data and Stream Processing with Apache Kafka

New in Kafka 0.10.2: Single Message Transforms for Kafka Connect

Modify events before storing in Kafka:

• Mask sensitive information

• Add identifiers

• Tag events

• Store lineage

• Remove unnecessary columns

Modify events going out of Kafka:

• Route high priority events to faster data stores

• Direct events to different ElasticSearch indexes

• Cast data types to match destination

• Remove unnecessary columns
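The kinds of per-message changes listed above can be sketched as small functions applied in a chain. This is a conceptual illustration only: real Single Message Transforms are configured on a connector and implemented in Java, and the helper names (`mask_fields`, `drop_fields`, `add_field`, `apply_chain`, `route`), the field names, and the topic names below are all hypothetical.

```python
def mask_fields(fields, mask="****"):
    """Build a transform that overwrites the named fields with a mask."""
    def transform(record):
        return {k: (mask if k in fields else v) for k, v in record.items()}
    return transform


def drop_fields(fields):
    """Build a transform that removes unnecessary columns."""
    def transform(record):
        return {k: v for k, v in record.items() if k not in fields}
    return transform


def add_field(name, value):
    """Build a transform that tags the event, e.g. with its lineage."""
    def transform(record):
        return {**record, name: value}
    return transform


def apply_chain(record, transforms):
    """Apply each transform in order; every step sees one message at a time."""
    for transform in transforms:
        record = transform(record)
    return record


def route(record, default_topic="events", priority_topic="events-priority"):
    """Pick a destination per message, e.g. a faster store for high priority."""
    return priority_topic if record.get("priority") == "high" else default_topic


inbound = [
    mask_fields({"card_number"}),
    drop_fields({"debug_info"}),
    add_field("source", "orders-db"),
]
event = {
    "order_id": 17,
    "card_number": "4111111111111111",
    "debug_info": "trace",
    "priority": "high",
}
stored = apply_chain(event, inbound)
topic = route(stored)
```

Each transform looks at a single message in isolation, which is what keeps this style lightweight; anything that needs to combine several messages belongs in a stream processing application instead.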

Page 22: Streaming Data and Stream Processing with Apache Kafka

Kafka Clients

[Diagram: client logos grouped as Apache Kafka Native Clients, Confluent Native Clients, and Community Supported Clients; other labels: Ruby, Proxy http/REST, Stdin/stdout]

Page 23: Streaming Data and Stream Processing with Apache Kafka

REST Proxy: Talking to Non-native Kafka Apps and Outside the Firewall

[Diagram labels: Non-Java Applications, REST / HTTP, REST Proxy, Schema Registry, Native Kafka Java Applications]

• Simplifies administrative actions

• Simplifies message creation and consumption

• Provides a RESTful interface to a Kafka cluster
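As a rough illustration of message creation over HTTP, the sketch below builds (but does not send) a produce request in the shape documented for the Confluent REST Proxy v2 API: a POST to `/topics/<topic>` with a JSON body of `records`. The host, port, topic name, and payload are assumptions made for this example, the helper `build_produce_request` is hypothetical, and the content type should be checked against the REST Proxy version you actually run.

```python
import json
import urllib.request


def build_produce_request(base_url, topic, values):
    """Build a POST that asks the REST Proxy to produce JSON records to a topic."""
    body = json.dumps({"records": [{"value": v} for v in values]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/topics/{topic}",
        data=body,
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


# Assumed address: a REST Proxy listening locally on port 8082.
request = build_produce_request(
    "http://localhost:8082",
    "page-views",
    [{"user": "alice", "page": "/home"}],
)

# Sending requires a running REST Proxy, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Any language with an HTTP client can issue the same request, which is the point of the proxy for applications that cannot use a native Kafka client.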

Page 24: Streaming Data and Stream Processing with Apache Kafka

How do I maintain my data formats and ensure compatibility?

Page 25: Streaming Data and Stream Processing with Apache Kafka

The Challenge of Data Compatibility at Scale

[Diagram labels: App 1, App 2, App 3; callout: Incompatibly formatted message]

• Many sources without a policy cause mayhem in a centralized data pipeline

• Ensuring downstream systems can use the data is key to an operational stream pipeline

• Example: Date formats

• Even within a single application, different formats can be presented

Page 26: Streaming Data and Stream Processing with Apache Kafka

Schema Registry

[Diagram labels: App 1 and App 2 (each with a Serializer), Schema Registry, Kafka Topic, and Example Consumers: Elastic, Cassandra, HDFS]

• Define the expected fields for each Kafka topic

• Automatically handle schema changes (e.g. new fields)

• Prevent backwards incompatible changes

• Supports multi-datacenter environments
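A minimal sketch of what "prevent backwards incompatible changes" can mean in practice, assuming a deliberately simplified rule set: a new schema may add a field only if it has a default, and may not change the type of an existing field. The schema representation, the function `backward_incompatibilities`, and the class `ToyRegistry` are hypothetical; the real Schema Registry works with Avro schemas and supports more nuanced rules (type promotion, several compatibility modes).

```python
def backward_incompatibilities(old_schema, new_schema):
    """List reasons a reader using new_schema could not read data written with old_schema.

    A schema here is a dict: field name -> {"type": str, optional "default": value}.
    """
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            # Old records lack this field, so the reader needs a fallback value.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old_schema[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type from "
                f"{old_schema[name]['type']} to {spec['type']}"
            )
    return problems


class ToyRegistry:
    """Keeps the latest schema per topic and rejects incompatible updates."""

    def __init__(self):
        self._latest = {}

    def register(self, topic, schema):
        current = self._latest.get(topic)
        problems = backward_incompatibilities(current, schema) if current else []
        if problems:
            raise ValueError("; ".join(problems))
        self._latest[topic] = schema


v1 = {"id": {"type": "long"}, "created": {"type": "string"}}
v2_ok = {**v1, "region": {"type": "string", "default": "unknown"}}
v2_bad = {
    "id": {"type": "string"},
    "created": {"type": "string"},
    "region": {"type": "string"},
}

registry = ToyRegistry()
registry.register("orders", v1)
registry.register("orders", v2_ok)  # accepted: the new field has a default

try:
    registry.register("orders", v2_bad)
    rejected = False
except ValueError:
    rejected = True                 # type change is refused
```

Rejecting the change at registration time means the producer finds out before any incompatible message reaches the topic, rather than every downstream consumer finding out afterwards.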

Page 27: Streaming Data and Stream Processing with Apache Kafka

How do I build stream processing apps?

Page 28: Streaming Data and Stream Processing with Apache Kafka

Kafka Streams API: the Easiest Way to Process Data in Apache Kafka™

Example Use Cases

• Microservices

• Large-scale continuous queries and transformations

• Event-triggered processes

• Reactive applications

• Customer 360-degree view, fraud detection, location-based marketing, smart electrical grids, fleet management, …

Key Benefits of Apache Kafka’s Streams API

• Build Apps, Not Clusters: no additional cluster required

• Elastic, highly-performant, distributed, fault-tolerant, secure

• Equally viable for small, medium, and large-scale use cases

• “Run Everywhere”: integrates with your existing deployment strategies such as containers, automation, cloud

[Diagram: Your App, containing the Kafka Streams API]

Page 29: Streaming Data and Stream Processing with Apache Kafka

Architecture Example

Before: Complexity for development and operations, heavy footprint

1. Capture business events in Kafka

2. Must process events with separate, special-purpose clusters

3. Write results back to Kafka

[Diagram label: Your Processing Job]

Page 30: Streaming Data and Stream Processing with Apache Kafka

Architecture Example

With Kafka Streams: App-centric architecture that blends well into your existing infrastructure

1. Capture business events in Kafka

2. Process events fast, reliably, securely with standard Java applications

3a. Write results back to Kafka

3b. Query latest results directly from external apps

[Diagram labels: App, App, Your App, Kafka Streams API]

Page 31: Streaming Data and Stream Processing with Apache Kafka

New in Kafka 0.10.2: Session windows in Kafka Streams API

Group events in a stream based on session windows

• Sessions are periods of activity terminated by a gap of inactivity

• Purely time-based windows are incorrect for session-based data analysis

[Diagram: input data plotted by processing-time and event-time, with colors representing different users' events; after session windowing, the results are user sessions for Alice, Bob, and Dave, grouped by event-time session windows]
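The session window idea can be sketched in a few lines, assuming events are `(user, event_time)` pairs and that a session ends when the next event for the same user is more than `gap` time units later. This is a batch illustration with a hypothetical function `session_windows`; the Kafka Streams API merges sessions incrementally as events arrive, which the sort below only approximates.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (user, event_time) pairs into per-user sessions.

    Returns {user: [(start, end, event_count), ...]}. A new session starts
    when an event is more than `gap` after the previous event of that user.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()                  # order by event-time, not arrival order
        windows = []
        start = end = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - end > gap:         # inactivity gap exceeded: close the session
                windows.append((start, end, count))
                start, count = ts, 0
            end = ts
            count += 1
        windows.append((start, end, count))
        sessions[user] = windows
    return sessions


# Events arrive interleaved and out of order, as in the diagram.
events = [
    ("alice", 0), ("bob", 2), ("alice", 3), ("alice", 20),
    ("bob", 4), ("dave", 7), ("alice", 22),
]
result = session_windows(events, gap=5)
```

With a fixed window of, say, 10 time units, Alice's events at 0 and 3 would share a window with unrelated silence, and a burst that straddles a boundary would be split in two; grouping by inactivity gap avoids both problems.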

Page 32: Streaming Data and Stream Processing with Apache Kafka

How do I synchronize and migrate data to and from the cloud?

Page 33: Streaming Data and Stream Processing with Apache Kafka

Before: Hybrid Cloud Environments Today

[Diagram: DC1 and AWS environments containing DB1, DB2, DB3, KV, KV2, KV3, DWH, App1, App2, App3, App4, App5, App7, App8, App1-v2, and App2-v2]

Challenges

• Each team/department must execute their own cloud migration

• May be moving the same data multiple times

• Each box represented here requires development, testing, deployment, monitoring and maintenance

Page 34: Streaming Data and Stream Processing with Apache Kafka

After: Cloud Synchronization and Migrations with Confluent Platform

[Diagram: DC1 and AWS environments with Kafka (shown twice) and DB1, DB2, DB3, KV, KV2, KV3, DWH, App1, App2, App3, App4, App5, App7, App8, App1-v2, and App2-v2]

Benefits

• Continuous low-latency synchronization

• Centralized manageability and monitoring
  – Track at event level data produced in all data centers

• Security and governance
  – Track and control where data comes from and who is accessing it

• Cost Savings
  – Move Data Once

Page 35: Streaming Data and Stream Processing with Apache Kafka

How do I manage and monitor my streaming platform at scale?

Page 36: Streaming Data and Stream Processing with Apache Kafka

What Does End-to-End Mean?

“Clocks and Cables” Monitoring

How fast is the throughput?

How many CPU cycles are we using?

End-to-End Monitoring

Did you leave?

Did you arrive?
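One way to picture the "did you leave, did you arrive" check is to count messages per time bucket on the producing and consuming side and compare the two. The sketch below is an illustrative counting scheme under that assumption, not a description of how Confluent Control Center is implemented; `bucket`, `delivery_report`, and the 60-second bucket width are hypothetical choices.

```python
from collections import Counter


def bucket(ts, width):
    """Map a timestamp (seconds) to the start of its time bucket."""
    return ts - ts % width


def delivery_report(produced, consumed, width=60):
    """Compare per-bucket counts of produced and consumed message timestamps.

    Returns {bucket_start: (sent, received)} for buckets where the counts differ.
    """
    sent = Counter(bucket(t, width) for t in produced)
    received = Counter(bucket(t, width) for t in consumed)
    return {
        b: (sent[b], received[b])
        for b in sorted(set(sent) | set(received))
        if sent[b] != received[b]
    }


produced = [1, 15, 59, 61, 75, 130]   # timestamps recorded when messages left
consumed = [1, 15, 59, 61, 130]       # timestamps of messages that arrived
gaps = delivery_report(produced, consumed)
```

Throughput and CPU metrics would look healthy in this example; only the per-bucket comparison shows that one message sent in the second minute never arrived.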

Page 37: Streaming Data and Stream Processing with Apache Kafka

Confluent Control Center: Cluster Health & Administration

Cluster health dashboard

• Monitor the health of your Kafka clusters and get alerts if any problems occur

• Measure system load, performance, and operations

• View aggregate statistics or drill down by broker or topic

Cluster administration

• Monitor topic configurations

Page 38: Streaming Data and Stream Processing with Apache Kafka

Confluent Control Center: End-to-end Monitoring

See exactly where your messages are going in your Kafka cluster

Page 39: Streaming Data and Stream Processing with Apache Kafka

Confluent Control Center: Connector Management

Page 40: Streaming Data and Stream Processing with Apache Kafka

Confluent Control Center: Alerting

Alerts

• Configure alerts on incomplete data delivery, high latency, Kafka connector status, and more

• Manage alerts for different users and applications from a web UI

User authentication

• Control access to Confluent Control Center

• Integrates with existing enterprise authentication systems

Page 41: Streaming Data and Stream Processing with Apache Kafka

Auto Data Balancing

Dynamically move partitions to optimize resource utilization and reliability

• Easily add and remove nodes from your Kafka cluster

• Rack aware algorithm rebalances partitions across a cluster

• Traffic from balancer is throttled when data transfer occurs

[Diagram: partition distribution Before and After Rebalance]
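The before/after picture can be approximated with a greedy sketch that moves partitions from the most loaded broker to the least loaded one until partition counts differ by at most one. This is a simplification under stated assumptions: it balances on partition count only, ignores rack awareness, replica placement, and throttling, and the function `rebalance` and the broker and partition names are hypothetical.

```python
def rebalance(assignment):
    """Even out partition counts across brokers.

    assignment: {broker: [partition, ...]} with at least one broker.
    Returns (new_assignment, moves), where each move is
    (partition, from_broker, to_broker). The input is not modified.
    """
    new = {broker: list(parts) for broker, parts in assignment.items()}
    moves = []
    while True:
        most = max(new, key=lambda b: len(new[b]))
        least = min(new, key=lambda b: len(new[b]))
        if len(new[most]) - len(new[least]) <= 1:
            return new, moves          # already as even as counts allow
        partition = new[most].pop()    # move one partition per iteration
        new[least].append(partition)
        moves.append((partition, most, least))


before = {
    "broker-1": ["t-0", "t-1", "t-2", "t-3", "t-4"],
    "broker-2": ["t-5"],
    "broker-3": [],                    # e.g. a node that was just added
}
after, moves = rebalance(before)
```

The list of moves is the useful output: it is the set of data transfers that would need to be scheduled, and therefore the traffic a real balancer has to throttle.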

Page 42: Streaming Data and Stream Processing with Apache Kafka

Multi-Datacenter Replication

An easy, reliable way to run Kafka across datacenters

Improve reliability

• Easily configure & maintain cross-cluster replication

Simplify management

• Centralized configuration and monitoring

• Replicate entire cluster or a subset of topics

• Automatic replication of topic configuration

• Use Kafka’s SASL for Kerberos, Active Directory

• SSL encryption between datacenters

Page 43: Streaming Data and Stream Processing with Apache Kafka

Get Started with Apache Kafka Today!

https://www.confluent.io/downloads/

THE place to start with Apache Kafka!

• Thoroughly tested and quality assured

• More extensible developer experience

• Easy upgrade path to Confluent Enterprise

Page 44: Streaming Data and Stream Processing with Apache Kafka

Thank You