Real Time Data Analytics @ Uber Ankur Bansal Apache Big Data Europe November 14, 2016

Uber Real Time Data Analytics


Page 1: Uber Real Time Data Analytics

Real Time Data Analytics @ Uber
Ankur Bansal
Apache Big Data Europe
November 14, 2016

Page 2: Uber Real Time Data Analytics

About Me

● Sr. Software Engineer, Streaming Team @ Uber
  ○ Streaming team supports the platform for real time data analytics: Kafka, Samza, Flink, Pinot and plenty more
  ○ Focused on scaling Kafka at Uber’s pace
● Staff Software Engineer @ eBay
  ○ Built & scaled eBay’s cloud using OpenStack
● Apache Kylin: Committer, Emeritus PMC

Page 3: Uber Real Time Data Analytics

Agenda

● Real time Use Cases
● Kafka Infrastructure Deep Dive
● Our own Development:
  ○ Rest Proxy & Clients
  ○ Local Agent
  ○ uReplicator (Mirrormaker)
  ○ Chaperone (Auditing)
● Operations/Tooling

Page 4: Uber Real Time Data Analytics

Important Use Cases

Page 5: Uber Real Time Data Analytics

Real-time Price Surging

[Figure labels: Rider eyeballs, Open car information, Kafka, Stream Processing, Surge Multipliers]

Page 6: Uber Real Time Data Analytics

Real-time Machine Learning - UberEats ETD

Page 7: Uber Real Time Data Analytics
Page 8: Uber Real Time Data Analytics

● Fraud detection
● Share my ETA

And many more ...

Page 9: Uber Real Time Data Analytics

Apache Kafka is Uber’s Lifeline

Page 10: Uber Real Time Data Analytics

Kafka ecosystem @ Uber

Page 11: Uber Real Time Data Analytics

Kafka cluster stats

● 100s of billions of messages/day
● 100s of TB/day
● Multiple data centers

Page 12: Uber Real Time Data Analytics

Kafka Infrastructure Deep Dive

Page 13: Uber Real Time Data Analytics

Requirements

● Scale to 100s of Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s of TB → PB)
● Low Latency for most use cases (<5 ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs
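
To make the reliability requirement concrete, here is a small arithmetic sketch of the 99.99% target (#Msgs Available / #Msgs Produced). The function names and the one-trillion daily volume are illustrative assumptions, not figures from a real audit.

```python
from fractions import Fraction

# Target from the requirements slide: 99.99% of produced messages available.
TARGET = Fraction(9999, 10000)


def meets_target(available: int, produced: int) -> bool:
    """True when available/produced is at least the 99.99% target."""
    if produced <= 0:
        raise ValueError("produced must be positive")
    return Fraction(available, produced) >= TARGET


def loss_budget(produced: int) -> int:
    """Largest number of lost messages that still meets the target."""
    return int(produced * (1 - TARGET))


# Hypothetical daily volume of one trillion messages.
print(loss_budget(10**12))
```

Exact fractions are used so the check does not depend on floating-point rounding at these volumes.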

Page 14: Uber Real Time Data Analytics

Kafka Pipeline

[Figure: pipeline diagram; labels: Local Agent, uReplicator]

Page 15: Uber Real Time Data Analytics

Kafka Pipeline: Data Flow

[Figure: data flow through the pipeline, stages numbered 1 to 8]

Page 16: Uber Real Time Data Analytics

Kafka Clusters

[Figure: pipeline diagram; labels: Local Agent, uReplicator]

Page 17: Uber Real Time Data Analytics

Kafka Clusters

● Use case based clusters
  ○ Data (async, reliable)
  ○ Logging (High throughput)
  ○ Time Sensitive (Low Latency e.g. Surge, Push notifications)
  ○ High Value Data (At-least once, Sync e.g. Payments)
● Secondary cluster as fallback
● Aggregate clusters for all data topics.

Page 18: Uber Real Time Data Analytics

Kafka Clusters

● Scale to 100s of Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s of TB → PB)
● Low Latency for most use cases (<5 ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs

Page 19: Uber Real Time Data Analytics

Kafka Rest Proxy

[Figure: pipeline diagram; labels: Local Agent, uReplicator]

Page 20: Uber Real Time Data Analytics

Why Kafka Rest Proxy ?

● Simplified Client API
● Multi-lang support (Java, NodeJs, Python, Golang)
● Decouple client from Kafka broker
  ○ Thin clients = operational ease
  ○ Fewer connections to Kafka brokers
  ○ Future Kafka upgrades
● Enhanced Reliability
  ○ Primary & Secondary Kafka Clusters

Page 21: Uber Real Time Data Analytics

Kafka Rest Proxy: Internals

Page 22: Uber Real Time Data Analytics

Kafka Rest Proxy: Internals

Page 23: Uber Real Time Data Analytics

Kafka Rest Proxy: Internals

● Based on Confluent’s open sourced Rest Proxy
● Performance enhancements
  ○ Simple HTTP servlets on Jetty instead of Jersey
  ○ Optimized for binary payloads
  ○ Performance increase from 7K* to 45-50K QPS/box
● Caching of topic metadata
● Reliability improvements*
  ○ Support for Fallback cluster
  ○ Support for multiple Producers (SLA based segregation)
● Plan to contribute back to community

*Based on benchmarking & analysis done in June 2015
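
The fallback-cluster idea can be sketched as follows. This is a minimal illustration, not the Rest Proxy's actual code: `ProduceError`, `produce_with_fallback` and the per-cluster send callables are hypothetical names standing in for real producers.

```python
from typing import Callable, Optional, Sequence


class ProduceError(Exception):
    """Raised by a (hypothetical) cluster send function when a write fails."""


def produce_with_fallback(message: bytes,
                          clusters: Sequence[Callable[[bytes], None]]) -> int:
    """Try each cluster in order (primary first) and return the index used."""
    last_error: Optional[ProduceError] = None
    for index, send in enumerate(clusters):
        try:
            send(message)
            return index
        except ProduceError as err:
            # Remember the failure and move on to the next cluster.
            last_error = err
    raise ProduceError("all clusters rejected the message") from last_error
```

Passing `[primary_send, secondary_send]` means the secondary cluster only sees traffic while the primary is rejecting writes.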

Page 24: Uber Real Time Data Analytics

Rest Proxy: performance (1 box)

[Figure: end-to-end latency (ms) vs. message rate (K/second) at a single node]

Page 25: Uber Real Time Data Analytics

Kafka Clusters + Rest Proxy

● Scale to 100s of Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s of TB → PB)
● Low Latency for most use cases (<5 ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs

Page 26: Uber Real Time Data Analytics

Kafka Clients

[Figure: pipeline diagram; labels: Local Agent, uReplicator]

Page 27: Uber Real Time Data Analytics

Client Libraries

● Support for multiple clusters
● High Throughput
  ○ Non-blocking, async, batching
  ○ <1 ms produce latency for clients
  ○ Handles Throttling/BackOff signals from Rest Proxy
● Topic Discovery
  ○ Discovers the Kafka cluster a topic belongs to
  ○ Able to multiplex to different Kafka clusters
● Integration with Local Agent for critical data
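
A rough sketch of the batching and back-off behaviour listed above. Everything here is an assumption made for illustration: the `post` callable stands in for an HTTP call to the Rest Proxy, status 429 stands in for its throttling signal, and the flush runs inline where a real client would flush from a background thread to stay non-blocking.

```python
import time
from typing import Callable, List


class BatchingClient:
    """Buffers messages and sends them in batches, backing off when throttled."""

    def __init__(self, post: Callable[[List[bytes]], int], max_batch: int = 100,
                 max_retries: int = 5, base_delay: float = 0.01,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._post = post            # hypothetical: returns an HTTP-like status
        self._max_batch = max_batch
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep
        self._buffer: List[bytes] = []

    def produce(self, message: bytes) -> None:
        """Queue a message; flush once the batch is full."""
        self._buffer.append(message)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> bool:
        """Send the buffered batch; return True if it was accepted."""
        if not self._buffer:
            return True
        batch, self._buffer = self._buffer, []
        delay = self._base_delay
        for _ in range(self._max_retries):
            status = self._post(batch)
            if status == 200:
                return True
            if status != 429:        # not a throttle signal: stop retrying
                break
            self._sleep(delay)       # exponential back-off while throttled
            delay *= 2
        self._buffer = batch + self._buffer  # keep the batch for a later flush
        return False
```

Injecting `sleep` keeps the back-off testable without real delays.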

Page 28: Uber Real Time Data Analytics

Client Libraries

Add Figure

What if there is a network glitch / outage?

Page 29: Uber Real Time Data Analytics

Client Libraries

Add Figure

Page 30: Uber Real Time Data Analytics

Kafka Clusters + Rest Proxy + Clients

● Scale to 100s of Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s of TB → PB)
● Low Latency for most use cases (<5 ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs

Page 31: Uber Real Time Data Analytics

Local Agent

[Figure: pipeline diagram; labels: Local Agent, uReplicator]

Page 32: Uber Real Time Data Analytics

Local Agent

● Local spooling in case of downstream outage/backpressure
● Backfills at a controlled rate to avoid hammering infrastructure recovering from an outage
● Implementation:
  ○ Reuses code from rest-proxy and Kafka’s log module
  ○ Appends all topics to the same file for high throughput
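
The spool-and-backfill behaviour can be illustrated with a small in-memory sketch. The real agent spools to a file using Kafka's log module; the `LocalSpool` class, the `send` callable and the per-tick limit below are hypothetical stand-ins.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class LocalSpool:
    """Spools messages while downstream is unavailable, then backfills slowly."""

    def __init__(self, send: Callable[[str, bytes], bool],
                 backfill_per_tick: int = 2) -> None:
        self._send = send                  # returns True if downstream accepted
        self._limit = backfill_per_tick    # assumed rate limit per tick
        # One shared queue for all topics, mirroring the single-file design.
        self._spool: Deque[Tuple[str, bytes]] = deque()

    def produce(self, topic: str, payload: bytes) -> None:
        # If anything is already spooled, queue behind it to preserve order.
        if self._spool or not self._send(topic, payload):
            self._spool.append((topic, payload))

    def tick(self) -> int:
        """Backfill at most `backfill_per_tick` messages; return how many sent."""
        sent = 0
        while self._spool and sent < self._limit:
            topic, payload = self._spool[0]
            if not self._send(topic, payload):
                break                      # downstream still struggling
            self._spool.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        return len(self._spool)
```

Calling `tick()` on a timer bounds the backfill rate, so a recovering cluster is not hit with the whole backlog at once.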

Page 33: Uber Real Time Data Analytics

Local Agent Architecture

Add Figure

Page 34: Uber Real Time Data Analytics

Local Agent in Action

Add Figure

Page 35: Uber Real Time Data Analytics

Kafka Clusters + Rest Proxy + Clients + Local Agent

● Scale to 100s of Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s of TB → PB)
● Low Latency for most use cases (<5 ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs

Page 36: Uber Real Time Data Analytics

uReplicator

[Figure: pipeline diagram; labels: Local Agent, uReplicator]

Page 37: Uber Real Time Data Analytics

Multi-DC data flow

Page 38: Uber Real Time Data Analytics


Mirrormaker: existing problems

● New Topic added
● New partitions added
● Mirrormaker bounced
● New mirrormaker added

Page 39: Uber Real Time Data Analytics

uReplicator: In-house solution

[Figure: Zookeeper; Helix MM Controller; MM worker1, MM worker2, MM worker3, each with a HelixAgent and Thread 1 ... Thread N handling topic-partitions]
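
One way a central controller could hand topic-partitions to workers while moving as few as possible is sketched below. This is an illustration only, not uReplicator's or Helix's algorithm; the `Controller` class and its least-loaded policy are assumptions.

```python
from typing import Dict, List, Set, Tuple

TopicPartition = Tuple[str, int]


class Controller:
    """Tracks which worker mirrors which topic-partition."""

    def __init__(self, workers: List[str]) -> None:
        self.assignment: Dict[str, Set[TopicPartition]] = {w: set() for w in workers}

    def add_partition(self, tp: TopicPartition) -> str:
        """Give a new partition to the least-loaded worker; nothing else moves."""
        worker = min(self.assignment, key=lambda w: (len(self.assignment[w]), w))
        self.assignment[worker].add(tp)
        return worker

    def add_worker(self, worker: str) -> List[TopicPartition]:
        """Move just enough partitions to bring the new worker up to its share."""
        self.assignment[worker] = set()
        total = sum(len(parts) for parts in self.assignment.values())
        target = total // len(self.assignment)
        moved: List[TopicPartition] = []
        while len(self.assignment[worker]) < target:
            donor = max(self.assignment, key=lambda w: (len(self.assignment[w]), w))
            tp = min(self.assignment[donor])
            self.assignment[donor].remove(tp)
            self.assignment[worker].add(tp)
            moved.append(tp)
        return moved
```

With six partitions on two workers, adding a third worker moves two partitions and leaves the other four where they were.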

Page 40: Uber Real Time Data Analytics

uReplicator

[Figure: Zookeeper; Helix MM Controller; MM worker1, MM worker2, MM worker3, each with a HelixAgent and Thread 1 ... Thread N handling topic-partitions]

Page 41: Uber Real Time Data Analytics

Kafka Clusters + Rest Proxy + Clients + Local Agent

● Scale to 100s of Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s of TB → PB)
● Low Latency for most use cases (<5 ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs

Page 42: Uber Real Time Data Analytics

uReplicator

● Running in production for 1+ year
● Open sourced: https://github.com/uber/uReplicator
● Blog: https://eng.uber.com/ureplicator/

Page 43: Uber Real Time Data Analytics

Chaperone - E2E Auditing

Page 44: Uber Real Time Data Analytics

Chaperone Architecture

Page 45: Uber Real Time Data Analytics


Chaperone : Track counts

Page 46: Uber Real Time Data Analytics


Chaperone : Track Latency

Page 47: Uber Real Time Data Analytics

Chaperone

● Running in production for 1+ year
● Planning to open source in ~2 Weeks
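
Count-based auditing of the kind named on the "Track counts" slide can be sketched as below. This is not Chaperone's implementation: the 10-minute window, the `TierAudit` class and `find_loss` are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Tuple

WINDOW_SECONDS = 600  # assumed audit window; not taken from the slides

Key = Tuple[str, int]  # (topic, window start in epoch seconds)


def window_start(event_ts: float) -> int:
    """Start of the window that contains the message's event timestamp."""
    return int(event_ts) // WINDOW_SECONDS * WINDOW_SECONDS


class TierAudit:
    """Per-tier message counts, keyed by topic and time window."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def observe(self, topic: str, event_ts: float) -> None:
        self.counts[(topic, window_start(event_ts))] += 1


def find_loss(upstream: TierAudit, downstream: TierAudit) -> Dict[Key, int]:
    """Windows where downstream saw fewer messages than upstream."""
    return {key: seen - downstream.counts[key]
            for key, seen in upstream.counts.items()
            if downstream.counts[key] < seen}
```

Bucketing by the message's own timestamp rather than arrival time lets two tiers be compared even when one lags behind the other.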

Page 48: Uber Real Time Data Analytics

At-least Once Kafka

Page 49: Uber Real Time Data Analytics

Why do we need it?

[Figure: data flow through the pipeline, stages numbered 1 to 8]

● Most of the infrastructure is tuned for high throughput
  ○ Batching at each stage
  ○ Ack before produce (ack’ed != committed)
● Single node failure in any stage leads to data loss
● Need a reliable pipeline for High Value Data e.g. Payments

Page 50: Uber Real Time Data Analytics

How did we achieve it?

● Brokers:
  ○ min.insync.replicas=2, can only tolerate one node failure
  ○ unclean.leader.election.enable=false, need to wait until the old leader comes back
● Rest Proxy:
  ○ Partition Failover
● Improved Operations:
  ○ Replication throttling, to reduce impact of node bootstrap
  ○ Prevent catching-up nodes from becoming ISR
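
The "one node failure" figure follows from simple arithmetic if we assume a replication factor of 3 and producers using acks=all; neither is stated on the slide. A minimal sketch, with hypothetical function names:

```python
def can_commit(in_sync_replicas: int, min_insync_replicas: int = 2) -> bool:
    """With acks=all, a write is accepted only if enough replicas are in sync."""
    return in_sync_replicas >= min_insync_replicas


def tolerated_failures(replication_factor: int, min_insync_replicas: int = 2) -> int:
    """How many replicas can be down while writes are still accepted."""
    return max(replication_factor - min_insync_replicas, 0)


# Assumed replication factor of 3 with min.insync.replicas=2.
print(tolerated_failures(3))
print(can_commit(in_sync_replicas=1))
```

Losing a second replica does not lose data under these settings; it makes the partition reject writes until a replica rejoins the ISR.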

Page 51: Uber Real Time Data Analytics

Operations/Tooling

Page 52: Uber Real Time Data Analytics

Partition Rebalancing

Add Figure

Page 53: Uber Real Time Data Analytics

Partition Rebalancing

● Calculates partition imbalance and inter-broker dependency.

● Generates & Executes Rebalance Plan.

● Rebalance plans are incremental, can be stopped and resumed.

● Currently on-demand, Automated in the future.
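
A simplified sketch of generating and applying an incremental rebalance plan. It balances partition counts only and ignores the inter-broker dependency the tool also considers; `rebalance_plan` and `apply_moves` are hypothetical names, not the tool's API.

```python
from typing import Dict, List, Tuple

Move = Tuple[str, str, str]  # (partition, source broker, destination broker)


def rebalance_plan(assignment: Dict[str, List[str]]) -> List[Move]:
    """Greedy plan: move one partition at a time from fullest to emptiest broker."""
    state = {broker: list(parts) for broker, parts in assignment.items()}
    plan: List[Move] = []
    while True:
        src = max(state, key=lambda b: (len(state[b]), b))
        dst = min(state, key=lambda b: (len(state[b]), b))
        if len(state[src]) - len(state[dst]) <= 1:
            return plan                     # balanced to within one partition
        partition = state[src].pop()
        state[dst].append(partition)
        plan.append((partition, src, dst))


def apply_moves(assignment: Dict[str, List[str]],
                moves: List[Move]) -> Dict[str, List[str]]:
    """Apply any prefix or remainder of a plan, returning the new assignment."""
    state = {broker: list(parts) for broker, parts in assignment.items()}
    for partition, src, dst in moves:
        state[src].remove(partition)
        state[dst].append(partition)
    return state
```

Because each move is independent, the plan can be stopped after any step and resumed later by applying the remaining moves to the intermediate assignment.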

Page 54: Uber Real Time Data Analytics

XFS vs EXT4

Add Figure

Page 55: Uber Real Time Data Analytics

Summary: Scale

● Kafka Brokers:
  ○ Multiple Clusters per DC
  ○ Use case based tuning
● Rest Proxy to reduce connections and better batching
● Rest Proxy & Clients
  ○ Batch everywhere, Async produce
  ○ Replace Jersey with Jetty
● XFS

Page 56: Uber Real Time Data Analytics

Summary: Reliability

● Local Agent
● Secondary Clusters
● Multi Producer support in Rest Proxy
● uReplicator
● Auditing via Chaperone

Page 57: Uber Real Time Data Analytics

Future Work

● Open source contribution
  ○ Chaperone
  ○ Toolkit
● Data Lineage
● Active Active Kafka
● Chargeback
● Exactly once mirroring via uReplicator

Page 59: Uber Real Time Data Analytics

Extra Slides

Page 60: Uber Real Time Data Analytics

Kafka Durability (acks=1)

[Figure: replica logs]
Broker 1: 100 101 102 103
Broker 2: 100 101
Broker 3: 100 101
Labels: Leader, Committed, Producer, Acked

Page 61: Uber Real Time Data Analytics

Kafka Durability (acks=1)

[Figure: replica logs]
Broker 1: 100 101 102 103
Broker 2: 100 101
Broker 3: 100 101
Labels: Leader, Committed, Producer, Failed, Acked

Page 62: Uber Real Time Data Analytics

Kafka Durability (acks=1)

[Figure: replica logs]
Broker 1: 100 101 102 103
Broker 2: 100 101
Broker 3: 100 101
Labels: Leader, Committed, Producer

Page 63: Uber Real Time Data Analytics

Kafka Durability (acks=1)

[Figure: replica logs]
Broker 1: 100 101 102 103
Broker 2: 100 101 104 105 106
Broker 3: 100 101 104 105
Labels: Leader, Committed, Producer, Old HW

Page 64: Uber Real Time Data Analytics

Kafka Durability (acks=1)

[Figure: replica logs]
Broker 1: 100 101 102 103
Broker 2: 100 101 104 105 106
Broker 3: 100 101 104 105
Labels: Leader, Committed, Producer, Old HW, X

Page 65: Uber Real Time Data Analytics

Kafka Durability (acks=1)

[Figure: replica logs]
Broker 1: 100 101 104 105 106
Broker 2: 100 101 104 105 106
Broker 3: 100 101 105 106
Labels: Leader, Committed, Producer, data loss!!
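
The acks=1 sequence in these slides can be replayed with a small sketch. It assumes Broker 1 is the original leader and that the most caught-up follower takes over; the function name is hypothetical and the model ignores everything except which offsets each replica holds.

```python
from typing import List, Sequence


def lost_after_failover(acked: Sequence[int],
                        follower_logs: Sequence[Sequence[int]]) -> List[int]:
    """Offsets acked to the producer that the new leader does not have.

    Simplification: the most caught-up follower becomes the new leader.
    """
    new_leader_log = set(max(follower_logs, key=len))
    return [offset for offset in acked if offset not in new_leader_log]


# Logs as drawn on the slides: the assumed leader (Broker 1) holds 100-103
# and, with acks=1, has acked all four; the followers hold only 100 and 101.
acked_by_leader = [100, 101, 102, 103]
followers = [[100, 101], [100, 101]]
print(lost_after_failover(acked_by_leader, followers))
```

Offsets 102 and 103 were acknowledged but never replicated, so they disappear once a follower takes over.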

Page 66: Uber Real Time Data Analytics

Distributed Messaging system

* Supported in Kafka 0.8+

● High throughput
● Low latency
● Scalable
● Centralized
● Real-time

Page 67: Uber Real Time Data Analytics

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: Broker 1, Broker 2, Broker 3; ZooKeeper]

Page 68: Uber Real Time Data Analytics

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: partitions per broker; ZooKeeper]
Broker 1: Partition 0
Broker 2: Partition 1
Broker 3: Partition 2

Page 69: Uber Real Time Data Analytics

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: partition replicas per broker; ZooKeeper]
Broker 1: Partition 0, Partition 2
Broker 2: Partition 1, Partition 0
Broker 3: Partition 2, Partition 1

Page 70: Uber Real Time Data Analytics

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: partition replicas per broker, each shown as a log with offsets 0 1 2 3; ZooKeeper]
Broker 1: Partition 0, Partition 2
Broker 2: Partition 1, Partition 0
Broker 3: Partition 2, Partition 1

Page 71: Uber Real Time Data Analytics

Kafka Concepts