Transcript
Page 1: STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE … SF... · STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & ... One of the initial authors of Apache Kafka, ... Introduction to

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA

Processing billions of events every day

Page 2:

Neha Narkhede

•  Co-founder and Head of Engineering @ Stealth Startup
•  Prior to this…
   ◦ Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
   ◦ One of the initial authors of Apache Kafka, committer and PMC member
•  Reach out at @nehanarkhede

Page 3:

Agenda

•  Real-time Data Integration
•  Introduction to Logs & Apache Kafka
•  Logs & Stream processing
•  Apache Samza
•  Stateful stream processing

Page 4:

The Data Needs Pyramid

Maslow's hierarchy of needs (base to top): Physiological, Safety, Love/Belonging, Esteem, Self-actualization

Data needs (base to top): Data collection, Data processing, Understanding, Automation

Page 5:

Agenda

•  Real-time Data Integration
•  Introduction to Logs & Apache Kafka
•  Logs & Stream processing
•  Apache Samza
•  Stateful stream processing

Page 6:

Increase in diversity of data

Timeline: 1980+, 2000+, 2010+

Siloed data feeds:
•  Database data (users, products, orders, etc.)
•  IoT sensors
•  Events (clicks, impressions, pageviews)
•  Application logs (errors, service calls)
•  Application metrics (CPU usage, requests/sec)

Page 7:

Explosion in diversity of systems

•  Live Systems
   ◦ Voldemort
   ◦ Espresso
   ◦ GraphDB
   ◦ Search
   ◦ Samza
•  Batch
   ◦ Hadoop
   ◦ Teradata

Page 8:

Data integration disaster

[Diagram: many systems with pipelines between them. Systems shown: Oracle, User Tracking, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Voldemort, Espresso, Logs, Operational Metrics, Production Services, Security, ...]

Page 9:

Centralized service

[Diagram: the systems connected through a central Data Pipeline. Systems shown: Oracle, User Tracking, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine & Life, Search, Email, Voldemort, Espresso, Logs, Operational Metrics, Production Services, Security, ...]

Page 10:

Agenda

•  Real-time Data Integration
•  Introduction to Logs & Apache Kafka
•  Logs & Stream processing
•  Apache Samza
•  Stateful stream processing

Page 11:

Kafka at 10,000 ft

•  Distributed from ground up
•  Persistent
•  Multi-subscriber

[Diagram: Producers -> Cluster of brokers -> Consumers]

Page 12:

Key design principles

•  Scalability of a file system
   ◦ Hundreds of MB/sec/server throughput
   ◦ Many TBs per server
•  Guarantees of a database
   ◦ Messages strictly ordered
   ◦ All data persistent
•  Distributed by default
   ◦ Replication model
   ◦ Partitioning model

Page 13:

Kafka adoption

Page 14:

Apache Kafka @ LinkedIn

•  175 TB of in-flight log data per colo
•  Low-latency: ~1.5ms
•  Replicated to each datacenter
•  Tens of thousands of data producers
•  Thousands of consumers
•  7 million messages written/sec
•  35 million messages read/sec
•  Hadoop integration

Page 15:

The data structure every systems engineer should know

Logs

Page 16:

The Log

•  Ordered
•  Append only
•  Immutable

[Diagram: log with offsets 0 1 2 3 4 5 6 7 8 9 10 11 12; offset 0 is the 1st record, and the next record is written after offset 12]
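
The three properties on this slide can be sketched in a few lines of Java. This is a minimal in-memory illustration, not Kafka's API: the class name `AppendOnlyLog` and its methods are hypothetical, and a plain list stands in for durable storage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory log; the names are illustrative, not Kafka's API.
final class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Append only: a record can only be added at the end.
    // Returns the record's offset (0 for the 1st record).
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Ordered and immutable: an offset always refers to the same record,
    // because there is no method to update or delete one.
    String read(long offset) {
        return records.get((int) offset);
    }

    // The offset at which the next record will be written.
    long nextOffset() {
        return records.size();
    }
}
```

Because the class exposes no update or delete, a reader that remembers an offset can always come back to the same record.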

Page 17:

The Log: Partitioning

Partition 0: 0 1 2 3 4 5 6 7 8 9 10 11 12
Partition 1: 0 1 2 3 4 5 6 7 8 9
Partition 2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
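
A rough sketch of a partitioned log follows. The slide does not say how a record is assigned to a partition; hashing a key is an assumption made here for illustration, and `PartitionedLog` and its methods are hypothetical names. Each partition keeps its own offsets, which is why the partitions in the diagram have different lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical partitioned log; each partition is its own ordered log.
final class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumption: the partition is chosen by hashing the key, so records
    // with the same key always land in the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the key's partition and returns the offset within it.
    long append(String key, String value) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(value);
        return partition.size() - 1;
    }

    // Number of records currently in one partition.
    long size(int partition) {
        return partitions.get(partition).size();
    }
}
```

Ordering holds within a partition only; this sketch makes no ordering promise across partitions.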

Page 18:

Logs: pub/sub done right

[Diagram: a data source writes to a log with offsets 0 1 2 3 4 5 6 7 8 9 10 11 12; Destination system A (time = 7) and Destination system B (time = 11) each read from the same log at their own position]
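
The idea that each destination reads the same log at its own position can be sketched as below. `LogReader` is a hypothetical name, and a shared list stands in for the log; the point is only that a reader's position is private to that reader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader over a shared, append-only list of records.
final class LogReader {
    private final List<String> log; // shared by all readers, never modified here
    private int position = 0;       // this reader's own offset

    LogReader(List<String> log) {
        this.log = log;
    }

    // Returns up to max unread records and advances only this reader.
    List<String> poll(int max) {
        int end = Math.min(log.size(), position + max);
        List<String> batch = new ArrayList<>(log.subList(position, end));
        position = end;
        return batch;
    }

    int position() {
        return position;
    }
}
```

With a 13-record log, one reader can sit at position 7 while another sits at position 11, as in the diagram, and neither affects the other or the writer.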

Page 19:

Logs for data integration

[Diagram: "User updates profile with new job" flows through KAFKA; the other systems shown are Newsfeed, Search, Hadoop, and Standardization engine]

Page 20:

Agenda

•  Real-time Data Integration
•  Introduction to Logs & Apache Kafka
•  Logs & Stream processing
•  Apache Samza
•  Stateful stream processing

Page 21:

Stream processing = f(log)

Log A -> Job 1 -> Log B
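
Read literally, "stream processing = f(log)" says a job is a function from an input log to an output log. A minimal sketch, assuming a finite list as a stand-in for an unbounded log; `StreamJob.run` is a hypothetical helper, not part of Kafka or Samza.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: a job as a function applied record by record.
final class StreamJob {
    // Reads the input log in order and appends f(record) to a new output log.
    static <A, B> List<B> run(List<A> inputLog, Function<A, B> f) {
        List<B> outputLog = new ArrayList<>();
        for (A record : inputLog) {
            outputLog.add(f.apply(record));
        }
        return outputLog;
    }
}
```

Because the output is itself a log, it can serve as the input of another job, which is how jobs chain into larger graphs.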

Page 22:

Stream processing = f(log)

[Diagram: a dataflow graph of Job 1 and Job 2 connected through Log A, Log B, Log C, Log D, and Log E]

Page 23:

Apache Samza at LinkedIn

[Diagram: "User updates profile with new job" flows through KAFKA; the other systems shown are Newsfeed, Search, Hadoop, and Standardization engine]

Page 24:

Latency spectrum of data systems

Latency, lowest to highest:
•  RPC: synchronous (milliseconds)
•  Asynchronous processing (seconds to minutes)
•  Batch (hours)

Page 25:

Agenda

•  Real-time Data Integration
•  Introduction to Logs & Apache Kafka
•  Logs & Stream processing
•  Apache Samza
•  Stateful stream processing

Page 26:

Samza API

public interface StreamTask {
  void process(IncomingMessageEnvelope envelope,  // getKey(), getMsg()
               MessageCollector collector,        // sendMsg(topic, key, value)
               TaskCoordinator coordinator);      // commit(), shutdown()
}
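
To show the shape of a task written against this interface, here is a self-contained sketch. The `Sketch*` interfaces are stand-ins that copy the slide's shorthand (`getKey()`, `getMsg()`, `sendMsg(topic, key, value)`, `commit()`, `shutdown()`); they are not the real Samza classes, whose method names and signatures may differ. The topic name "cleaned" is also made up.

```java
// Stand-in types mirroring the slide's shorthand. NOT the real Samza classes.
interface SketchEnvelope {
    String getKey();
    String getMsg();
}

interface SketchCollector {
    void sendMsg(String topic, String key, String value);
}

interface SketchCoordinator {
    void commit();
    void shutdown();
}

interface SketchStreamTask {
    void process(SketchEnvelope envelope, SketchCollector collector, SketchCoordinator coordinator);
}

// Example task: forwards non-empty messages to a hypothetical "cleaned" topic
// and silently drops empty ones.
final class FilterEmptyTask implements SketchStreamTask {
    @Override
    public void process(SketchEnvelope envelope, SketchCollector collector, SketchCoordinator coordinator) {
        String msg = envelope.getMsg();
        if (msg != null && !msg.isEmpty()) {
            collector.sendMsg("cleaned", envelope.getKey(), msg);
        }
    }
}
```

The task is called once per incoming message; it reads from the envelope and writes through the collector, and this example never needs the coordinator.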

Page 27:

Samza Architecture (Logical view)

[Diagram: Task 1, Task 2, and Task 3 consume the partitions of Log A and Log B; one log is shown with partitions 0, 1, and 2, the other with partitions 0 and 1]

Page 28:

Samza Architecture (Logical view)

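[Diagram: Task 1, Task 2, and Task 3 consume the partitions of Log A and Log B; one log is shown with partitions 0, 1, and 2, the other with partitions 0 and 1. The tasks are grouped into Samza container 1 and Samza container 2]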

Page 29:

Samza Architecture (Physical view)

[Diagram: Samza container 1 on Host 1; Samza container 2 on Host 2]

Page 30:

Samza Architecture (Physical view)

[Diagram: Samza container 1 on Host 1; Samza container 2 on Host 2; a Samza YARN AM; a Node manager on each host]

Page 31:

Samza Architecture (Physical view)

[Diagram: Samza container 1 on Host 1; Samza container 2 on Host 2; a Samza YARN AM; a Node manager and Kafka on each host]

Page 32:

Samza Architecture: Equivalence to Map Reduce

[Diagram: Map Reduce tasks and a YARN AM on Host 1 and Host 2; a Node manager and HDFS on each host]

Page 33:

M/R Operation Primitives

•  Filter: records matching some condition
•  Map: record = f(record)
•  Join: two or more datasets by key
•  Group: records with same key
•  Aggregate: f(records within the same group)
•  Pipe: job 1's output => job 2's input

Page 34:

M/R Operation Primitives on streams

•  Filter: records matching some condition
•  Map: record = f(record)
•  Join: two or more datasets by key
•  Group: records with same key
•  Aggregate: f(records within the same group)
•  Pipe: job 1's output => job 2's input

Requires state maintenance
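
Grouping and aggregating by key is one case where a streaming job has to remember something between messages. A minimal sketch, assuming a running count per key held in an in-memory map; `CountByKey` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateful operator: running count of messages per key.
final class CountByKey {
    // State that must survive from one message to the next.
    private final Map<String, Long> counts = new HashMap<>();

    // Processes one message and returns the updated count for its key.
    long process(String key) {
        return counts.merge(key, 1L, Long::sum);
    }
}
```

A filter or a map can handle each record in isolation; this operator cannot, because its answer depends on every earlier record with the same key.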

Page 35:

Agenda

•  Real-time Data Integration
•  Introduction to Logs & Apache Kafka
•  Logs & Stream processing
•  Apache Samza
•  Stateful stream processing

Page 36:

Example: Newsfeed

Status update log:
  User 567 posted "Hello World"
  User 989 posted "Blah Blah"
  User ... posted "..."

External connection DB:
  567 -> [123, 679, 789, ...]
  999 -> [156, 343, ... ]

Job: fan out messages to followers

Push notification log:
  Refresh user 123's newsfeed
  Refresh user 679's newsfeed
  Refresh user ...'s newsfeed

Page 37:

Local state vs Remote state: Remote

[Diagram: Samza task partition 0 and Samza task partition 1, each at 100-500K msg/sec/node, query remote state on disk (ex: Cassandra, MongoDB, etc) at 1-5K queries/sec ??]

❌  Performance
❌  Isolation
❌  Limited APIs

Page 38:

Local state: Bring data closer to computation

[Diagram: Samza task partition 0 and Samza task partition 1, each with its own local LevelDB/RocksDB store]

Page 39:

Local state: Bring data closer to computation

[Diagram: Samza task partition 0 and Samza task partition 1, each with its own local LevelDB/RocksDB store; also shown: Disk and a Change log stream]

Page 40:

Example Revisited: Newsfeed

Status update log:
  User 567 posted "Hello World"
  User 989 posted "Blah Blah"
  User ... posted "..."

New connection log:
  User 123 followed 567
  User 890 followed 234
  User ... followed ...

Job: fan out messages to followers

Follower lists:
  567 -> [123, 679, 789, ...]
  999 -> [156, 343, ... ]

Push notification log:
  Refresh user 123's newsfeed
  Refresh user 679's newsfeed
  Refresh user ...'s newsfeed
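
One way to read the revisited example is that the fan-out job consumes both logs: "followed" events update a follower map held locally, and "posted" events are fanned out using that map. The sketch below assumes that reading; `NewsfeedFanOut` and its method names are hypothetical, and a plain `HashMap` stands in for the local store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical fan-out job that keeps follower lists as local state.
final class NewsfeedFanOut {
    // Local state: user id -> ids of that user's followers.
    private final Map<Integer, Set<Integer>> followers = new HashMap<>();

    // Handles "User <follower> followed <followee>" from the new connection log.
    void onNewConnection(int follower, int followee) {
        followers.computeIfAbsent(followee, k -> new LinkedHashSet<>()).add(follower);
    }

    // Handles "User <author> posted ..." from the status update log.
    // Returns one push notification per follower, in follow order.
    List<String> onStatusUpdate(int author) {
        List<String> notifications = new ArrayList<>();
        Set<Integer> authorFollowers = followers.get(author);
        if (authorFollowers == null) {
            return notifications; // nobody follows this user yet
        }
        for (int follower : authorFollowers) {
            notifications.add("Refresh user " + follower + "'s newsfeed");
        }
        return notifications;
    }
}
```

Compared with the first version of the example, a status update is handled without a call to an external connection database; the lookup is a local map read.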

Page 41:

Fault tolerance?

[Diagram: Samza container 1 on Host 1; Samza container 2 on Host 2; a Samza YARN AM; a Node manager and Kafka on each host]

Page 42:

Fault tolerance in Samza

[Diagram: Samza task partition 0 and Samza task partition 1, each with its own local LevelDB/RocksDB store; also shown: a Durable change log]
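
The role of the durable change log can be sketched as follows: every write to the local store is also appended to the change log, and a restarted task rebuilds its store by replaying that log. This is an illustration under simplifying assumptions, not Samza's implementation: `ChangelogBackedStore` is a hypothetical name, a `HashMap` stands in for LevelDB/RocksDB, and an in-memory list stands in for the durable log.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local store whose writes are mirrored to a change log.
final class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>(); // stands in for LevelDB/RocksDB
    private final List<String[]> changelog;                    // stands in for the durable change log

    // Opening a store replays the change log in order to rebuild local state.
    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            local.put(entry[0], entry[1]);
        }
    }

    // Writes locally and records the write so it survives losing the local store.
    void put(String key, String value) {
        local.put(key, value);
        changelog.add(new String[] {key, value});
    }

    String get(String key) {
        return local.get(key);
    }
}
```

Later entries for the same key overwrite earlier ones during replay, so the restored store ends in the state the failed task last wrote.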

Page 43:

Slow jobs

[Diagram: a dataflow graph of Job 1 and Job 2 connected through Log A, Log B, Log C, Log D, and Log E]

❌  Drop data
❌  Backpressure
❌  Queue
   ❌ In memory
   ✅ On disk (KAFKA)

Page 44:

Summary

•  Real time data integration is crucial for the success and adoption of stream processing
•  Logs form the basis for real time data integration
•  Stream processing = f(logs)
•  Samza is designed from ground-up for scalability and provides fault-tolerant, persistent state

Page 45:

Thank you!

•  The Log
   ◦ http://bit.ly/the_log
•  Apache Kafka
   ◦ http://kafka.apache.org
•  Apache Samza
   ◦ http://samza.incubator.apache.org
•  Me
   ◦ @nehanarkhede
   ◦ http://www.linkedin.com/in/nehanarkhede

