Deploying Kafka at Dropbox
Alternately: how to handle 10,000,000 QPS in one cluster (but don't)

Deploying Kafka at Dropbox, Mark Smith, Sean Fellows

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Your Speakers

• Mark Smith <[email protected]>

formerly of Google, Bump, StumbleUpon, etc

likes small airplanes and not getting paged

• Sean Fellows <[email protected]>

formerly of Google

likes corgis and distributed systems

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Dropbox

• Over 500 million signups

• Exabyte scale storage system

• Multiple hardware locations + AWS

Log Events

• Wide distribution (1,000 categories)

• Several do >1M QPS each + long tail

• About 200TB/day (raw)

• Payloads range from empty to 15MB JSON blobs
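
For a rough sense of scale, the figures on this slide can be turned into average rates. This is back-of-the-envelope arithmetic only. It assumes decimal terabytes, traffic spread evenly over the day, and (an assumption the slides do not state) that the 10,000,000 QPS in the title and the 200TB/day figure describe the same traffic.

```python
# Back-of-the-envelope rates from the slide's figures.
# Assumptions (not from the talk): 1 TB = 10**12 bytes, traffic is spread
# evenly over the day, and the title's 10M QPS refers to the same
# 200 TB/day of raw log data.
SECONDS_PER_DAY = 24 * 60 * 60


def average_bytes_per_second(terabytes_per_day):
    """Average ingest rate in bytes per second."""
    return terabytes_per_day * 10**12 / SECONDS_PER_DAY


def average_message_bytes(terabytes_per_day, messages_per_second):
    """Average message size implied by a daily volume and a message rate."""
    return average_bytes_per_second(terabytes_per_day) / messages_per_second


rate = average_bytes_per_second(200)
size = average_message_bytes(200, 10_000_000)
print(f"{rate / 10**9:.2f} GB/s average, {size:.0f} bytes per message on average")
# prints "2.31 GB/s average, 231 bytes per message on average"
```

An average in the low hundreds of bytes sits far below the 15MB maximum payload, which is consistent with a small number of very large outliers in a stream of small events.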

Current System

• Existing system based on Scribe + HDFS

• Aggregate to single destination for analytics

• Powers Hive and standard map-reduce type analytics

Want: real-time stream processing!

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Initial Design

• One big cluster

• 20 brokers: 96GB RAM, 16x2TB disk, JBOD config

• ZK ensemble run separately (5 members)

• Kafka 0.8.2 from GitHub

• LinkedIn configuration recommendations

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Unexpected Catastrophes

• Disk failures or disks reaching 100%

• Repair is manual, won't expire unless caught up

• Crash looping, controller load

• Simultaneous restarts

• Even when graceful, recovery is sometimes very bad (even on 0.9!)

• Rebalancing is dangerous

• Saturates disks; partitions fall out of ISRs, go offline, etc.

System Errors

• Controller issues

• Sometimes goes AWOL with e.g. big rebalances

• Can have multiple controllers (during serial operations)

• Cascading OOMs

• Too many connections

Lack of Tooling

• Usually left to the reader

• Few best practices

• But we love Kafka Manager

• More to come later!

Newer Clients

• State of Go/Python clients

• Bad behavior at scale

• Laserbeam, retries, backoff

• Too many connections == OOM

• Good clients take time
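
The retries and backoff bullet is the easiest one to illustrate. The following is a minimal sketch of capped exponential backoff with full jitter, not the behavior of any particular Go or Python Kafka client. The `send` callable, the function name, and the default values are all hypothetical.

```python
import random
import time


def send_with_backoff(send, payload, attempts=5, base=0.1, cap=30.0,
                      sleep=time.sleep):
    """Call send(payload), retrying on ConnectionError with jittered backoff.

    `send` is a hypothetical callable standing in for a client's produce call.
    The wait before retry n is drawn uniformly from [0, min(cap, base * 2**n)],
    so a fleet of clients that fail together does not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Without the jitter and the cap, many clients retrying immediately after a broker hiccup all reconnect at once, which is one way to end up with the "too many connections" problem listed above.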

Bad Configs

• Many, many tunables -- lots of rope

• Unclean leader election

• Preferred leader automation

• Disk threads (thanks Gwen!)

• Little modern documentation on running at scale

• Todd Palino helped us out early, though, so thank you!
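
With this many tunables, one way to keep brokers from drifting is to compare each broker's properties file against an expected set. The sketch below is illustrative only. `unclean.leader.election.enable`, `auto.leader.rebalance.enable`, and `num.io.threads` are real Kafka broker settings, but the expected values are placeholders rather than the speakers' choices, and mapping "disk threads" to `num.io.threads` is an assumption.

```python
def parse_properties(text):
    """Parse Java-style key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def config_drift(actual, expected):
    """Return {key: (expected, actual_or_None)} for every mismatched key."""
    return {key: (want, actual.get(key))
            for key, want in expected.items()
            if actual.get(key) != want}


# Placeholder expectations (hypothetical values, not from the talk).
EXPECTED = {
    "unclean.leader.election.enable": "false",
    "auto.leader.rebalance.enable": "true",
    "num.io.threads": "16",
}

sample = """
# broker settings
unclean.leader.election.enable=true
num.io.threads = 16
"""
for key, (want, got) in sorted(config_drift(parse_properties(sample), EXPECTED).items()):
    print(f"{key}: expected {want}, found {got}")
```

A missing key is reported with `None` as the found value, so settings that silently fall back to defaults are flagged too.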

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

Hardware

• Hardware RAID 10

• ~25TB usable/box (spinning rust)

• During broker replacement

• 200ms p99 commit latency down to 10ms!

• Failure tolerance, full disk protection

• Canary cluster

Monitoring

• MPS vs QPS (metadata reqs!)

• Bad Stuff graph

• Disk utilization/latency

• Heap usage

• Number of controllers
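
The "number of controllers" check can be sketched as a sum over brokers. Each Kafka broker exposes an `ActiveControllerCount` metric that is 1 on the active controller and 0 elsewhere, so the cluster-wide sum should be exactly 1. How the per-broker values are collected is left out. The function name and the input dictionary are hypothetical.

```python
def controller_alert(active_counts):
    """Return an alert message, or None when exactly one controller is active.

    `active_counts` maps a broker name to its ActiveControllerCount value
    (hypothetical input; the collection mechanism is not shown here).
    """
    total = sum(active_counts.values())
    if total == 1:
        return None
    if total == 0:
        return "no active controller"
    claimants = sorted(b for b, count in active_counts.items() if count > 0)
    return f"{total} active controllers: {', '.join(claimants)}"


print(controller_alert({"broker-1": 0, "broker-2": 1, "broker-3": 0}))
# prints "None"
print(controller_alert({"broker-1": 1, "broker-2": 1, "broker-3": 0}))
# prints "2 active controllers: broker-1, broker-2"
```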

Tooling

• Rolling restarter (health checks!)

• Rate limited partition rebalancer (MPS)

• Config verifier/enforcer

• Coordinated consumption (pre-0.9)

• Auditing framework
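
The slides do not show how the rolling restarter works, so the following is only a sketch of the general idea: restart one broker at a time and refuse to continue until a health check passes. The `restart` and `is_healthy` callables are hypothetical. A real check might, for example, wait until the cluster reports no under-replicated partitions.

```python
import time


def rolling_restart(brokers, restart, is_healthy, timeout_s=600, poll_s=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Restart brokers one at a time, gated on a health check.

    `restart(broker)` and `is_healthy(broker)` are hypothetical callables
    supplied by the operator. Returns the list of brokers restarted, or
    raises RuntimeError at the first broker that is unhealthy beforehand
    or fails to recover within timeout_s.
    """
    done = []
    for broker in brokers:
        if not is_healthy(broker):
            raise RuntimeError(f"{broker} unhealthy before restart; aborting")
        restart(broker)
        deadline = clock() + timeout_s
        while not is_healthy(broker):
            if clock() >= deadline:
                raise RuntimeError(f"{broker} did not recover; restarted so far: {done}")
            sleep(poll_s)
        done.append(broker)
    return done
```

Stopping at the first failure is deliberate: it keeps a bad config or build from taking down a second broker, which matters given how painful simultaneous restarts were described to be earlier in the talk.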

Customer Culture

• Topics : organization :: partitions : scale

• Do not hash to partitions

• No ordering requirements

• Namespaces and ownership are required
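
The advice against hashing to partitions can be illustrated with a toy comparison. When there are no ordering requirements, spreading messages round-robin keeps partitions even, while hashing on a key lets one hot key overload a single partition. The workload below is made up, and `zlib.crc32` is only a stand-in hash, not the one Kafka's partitioner uses.

```python
import itertools
import zlib
from collections import Counter

NUM_PARTITIONS = 8


def hashed_partition(key, num_partitions):
    """Pick a partition from the key (crc32 is a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


def round_robin(num_partitions):
    """Yield partitions 0, 1, ..., n-1, 0, 1, ... regardless of key."""
    return itertools.cycle(range(num_partitions))


# Hypothetical workload: 900 of 1,000 messages share one hot key.
keys = [b"hot-user"] * 900 + [f"user-{i}".encode() for i in range(100)]

hashed = Counter(hashed_partition(k, NUM_PARTITIONS) for k in keys)
chooser = round_robin(NUM_PARTITIONS)
spread = Counter(next(chooser) for _ in keys)

print("hashed, busiest partition:", max(hashed.values()))
print("round-robin, busiest partition:", max(spread.values()))
```

With round-robin every partition receives exactly 125 of the 1,000 messages. With hashing, the partition that owns the hot key receives at least 900.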

Success!

• Kafka goes fast (18M+ MPS on 20 brokers)

• Multiple parallel consumption

• Low latency (at high produce rates)

• 0.9 is leaps ahead of 0.8.2 (upgrade!)

• Supportable by a small team (at our scale)

The Plan

• Welcome

• Use Case

• Initial Design

• Iterations of Woe

• Current Setup

• Future Plans

The Future

• Big is fun but has problems

• Open source our tooling

• Moving towards replication

• Automatic up-partitioning and rebalancing

• Expanding auditing to clients

• Low volume latencies

Deploying Kafka at Dropbox

• Mark Smith <[email protected]>

• Sean Fellows <[email protected]>

We would love to talk with other people who are running Kafka at similar scales. Email us!

And... questions! (If we have time.)