
Page 1: Putting Kafka Into Overdrive

Putting Kafka Into Overdrive
Todd Palino (LinkedIn) & Gwen Shapira (Confluent)

Page 2: Putting Kafka Into Overdrive

Us

Todd Palino
• Staff SRE @ LinkedIn
• Committer @ Burrow
• Previously: Systems Engineer @ Verisign
• Find me: [email protected] / @bonkoif

Gwen Shapira
• System Architect @ Confluent
• Committer @ Apache Kafka
• Previously: Software Engineer @ Cloudera, Senior Consultant @ Pythian
• Find me: [email protected] / @gwenshap

Page 3: Putting Kafka Into Overdrive

There’s a Book on That!

Early Access available now

Get a signed copy:

● Today at 6:20 PM @ O’Reilly Booth

● Tomorrow at 1:00 PM @ Confluent Booth (#838)

Page 4: Putting Kafka Into Overdrive

You

• SRE / DevOps
• Developer
• Know some things about Kafka

Page 5: Putting Kafka Into Overdrive

Kafka

• High Throughput
• Scalable
• Low Latency
• Real-time
• Centralized
• Awesome

So, we are done, right?

Page 6: Putting Kafka Into Overdrive

When it comes to critical production systems – Never trust a vendor.


Page 7: Putting Kafka Into Overdrive

Our favorite conversation:

Kafka is super slow. Sometimes messages take over a second to show up.

Page 8: Putting Kafka Into Overdrive

Or…

I can only push 20k messages per second? You have got to be kidding me.

Page 9: Putting Kafka Into Overdrive

We want to know…

• Is this normal? What should we expect?

• What hardware and configuration should we use to avoid 3 AM calls?

• Can we tell if Kafka is slow before users call?

• What can developers do to get the best performance?

• How can developers and SREs work together to troubleshoot performance issues?

Page 10: Putting Kafka Into Overdrive

Strong Foundations
Building a Kafka cluster from the hardware up

Page 11: Putting Kafka Into Overdrive

What’s Important To You?

• Message Retention - Disk size

• Message Throughput - Network capacity

• Producer Performance - Disk I/O

• Consumer Performance - Memory

Page 12: Putting Kafka Into Overdrive

Go Wide

• RAIS - Redundant Array of Inexpensive Servers

• Kafka is well-suited to horizontal scaling

• Also helps with CPU utilization
  • Kafka needs to decompress and recompress every message batch
  • KIP-31 will help with this by eliminating recompression

• Don’t co-locate Kafka

Page 13: Putting Kafka Into Overdrive

Disk Layout

• RAID
  • Can survive a single disk failure (not RAID 0)
  • Provides the broker with a single log directory
  • Eats up disk I/O

• JBOD
  • Gives Kafka all the disk I/O available
  • Broker is not smart about balancing partitions across disks
  • If one disk fails, the entire broker stops

• Amazon EBS performance works!

Page 14: Putting Kafka Into Overdrive

Operating System Tuning

• Filesystem Options
  • EXT or XFS
  • Using unsafe mount options

• Virtual Memory
  • Swappiness
  • Dirty Pages

• Networking

Page 15: Putting Kafka Into Overdrive

Java

• Only use JDK 8 now

• Keep heap size small
  • Even our largest brokers use a 6 GB heap
  • Save the rest for page cache

• Garbage Collection - G1 all the way
  • Basic tuning only
  • Watch for humongous allocations

Page 16: Putting Kafka Into Overdrive

Monitoring the Foundation

• CPU Load
• Network inbound and outbound
• Filehandle usage for Kafka
• Disk
  • Free space - where you write logs, and where Kafka stores messages
  • Free inodes
  • I/O performance - at least average wait and percent utilization

• Garbage Collection

Page 17: Putting Kafka Into Overdrive

Broker Ground Rules

• Tuning
  • Stick (mostly) with the defaults
  • Set default cluster retention as appropriate
  • Default partition count should be at least the number of brokers

• Monitoring
  • Watch the right things
  • Don’t try to alert on everything

• Triage and Resolution
  • Solve problems, don’t mask them

Page 18: Putting Kafka Into Overdrive

Too Much Information!

• Monitoring teams hate Kafka
  • Per-topic metrics
  • Per-partition metrics
  • Per-client metrics

• Capture as much as you can
  • Many metrics are useful while triaging an issue

• Clients want metrics on their own topics

• Only alert on what is needed to signal a problem

Page 19: Putting Kafka Into Overdrive

Broker Monitoring

• Bytes In and Out, Messages In
  • Why not messages out?

• Partitions
  • Count and Leader Count
  • Under Replicated and Offline

• Threads
  • Network pool, Request pool
  • Max Dirty Percent

• Requests
  • Rates and times - total, queue, local, and send

Page 20: Putting Kafka Into Overdrive

Topic Monitoring

• Bytes In, Bytes Out
• Messages In, Produce Rate, Produce Failure Rate
• Fetch Rate, Fetch Failure Rate
• Partition Bytes
• Quota Throttling
• Log End Offset
  • Why bother? KIP-32 will make this unnecessary

• Provide this to your customers for them to alert on

Page 21: Putting Kafka Into Overdrive

How To Have a Life
Avoiding the 3 AM calls

Page 22: Putting Kafka Into Overdrive

All The Best Ops People...

• Know more about what is happening than their customers do

• Are proactive

• Fix bugs, rather than work around them

This applies to our developers too!

Page 26: Putting Kafka Into Overdrive

Anticipating Trouble

• Trend cluster utilization and growth over time

• Use default configurations for quotas and retention to require customers to talk to you

• Monitor request times
  • If you are able to develop a consistent baseline, this is early warning

Page 29: Putting Kafka Into Overdrive

Under Replicated Partitions

• The number of partitions that are not fully replicated within the cluster

• Also referred to as “replica lag”

• Primary indicator of problems within the cluster

Page 30: Putting Kafka Into Overdrive

Broker Performance Checks

• Are all the brokers in the cluster working?

• Are the network interfaces saturated?
  • Re-elect partition leaders
  • Rebalance partitions in the cluster
  • Spread out traffic more (increase partitions or brokers)

• Is the CPU utilization high? (especially iowait)
  • Is another process competing for resources?
  • Look for a bad disk

• Are you still running 0.8?

• Do you have really big messages?

Page 31: Putting Kafka Into Overdrive

Appropriately Sizing Topics

• Many theories on how to do this correctly

• The answer is “it depends”

• Questions to answer
  • How many brokers do you have in the cluster?
  • How many consumers do you have?
  • Do you have specific partition requirements?

• Keeping partition sizes manageable
  • Multiple tiers makes this more interesting
  • Don’t have too many partitions

Page 36: Putting Kafka Into Overdrive

Kafka’s Fine, It’s a Client Problem

• Even so, don’t just throw it over the fence

Page 38: Putting Kafka Into Overdrive

The Clients
Because Todd only sees 50% of the picture

Page 39: Putting Kafka Into Overdrive

The Basics

(Diagram: the application talks to the Kafka cluster through the client library - App → Client → Brokers)

Page 40: Putting Kafka Into Overdrive

How do we know it’s the app?

• Try the perf tool from the application host
  • OK, actually? Probably the app
  • Slow? Try the perf tool on the broker itself
    • Slow there too? Either the broker, max capacity, or configuration
    • OK on the broker? Look at the network

Page 41: Putting Kafka Into Overdrive

Throttling!

• Brokers can protect themselves against clients
  • client_id -> maximum bytes / sec (per broker)
  • Server responses are delayed
  • Throttle metrics are available on clients and brokers

Page 42: Putting Kafka Into Overdrive

Producer

Page 43: Putting Kafka Into Overdrive

(Diagram: application threads call send(record); the producer groups records into batches - Batch 1, Batch 2, Batch 3 - and sends them to the brokers, retrying on failure; metadata or an exception is returned to the application)

Page 44: Putting Kafka Into Overdrive

(Same diagram, annotated with the producer metrics to watch at each stage:)

• Application threads: waiting-threads, request-latency
• Batching: batch-size, compression-rate, record-queue-time, record-send-rate, records-per-request
• Failures: record-retry-rate, record-error-rate

Page 45: Putting Kafka Into Overdrive

(Same diagram, annotated with the tuning options at each stage:)

• Application threads: add threads, send async, add producers
• Batching: batch.size, linger.ms, send.buffer.bytes, receive.buffer.bytes, compression
• Brokers: acks

Page 46: Putting Kafka Into Overdrive

Send() API

Sync = Slow:

producer.send(record).get();

Async:

producer.send(record);

Or, with a callback:

producer.send(record, new Callback() {
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) { /* handle the failure */ }
    }
});

Page 47: Putting Kafka Into Overdrive

batch.size vs. linger.ms

• A batch is sent as soon as it is full
  • Therefore a small batch size can decrease throughput
  • Increase batch size if the producer is running near saturation

• If you are consistently sending near-empty batches, increasing linger.ms will add a bit of latency, but improve throughput
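A minimal sketch of these knobs in code; the broker address, topic, and values here are illustrative placeholders, not recommendations:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder address
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "65536");        // example: 64 KB batches
        props.put("linger.ms", "5");             // trade up to 5 ms latency for fuller batches
        props.put("compression.type", "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}

Start from the defaults and raise batch.size / linger.ms only when the batch-size and record-queue-time metrics show the producer is running near saturation.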

Page 48: Putting Kafka Into Overdrive

Consumer

Page 49: Putting Kafka Into Overdrive

My Consumer is not just slow – it is hanging!

• There are no messages available (try the perf consumer)
• The next message is too large
• Perpetual rebalance
  • Not polling enough
  • Multiple consumers from the same group in the same thread

Page 50: Putting Kafka Into Overdrive

Reminder!

Consumers typically live in “consumer groups”.
Partitions in a topic are balanced between the consumers in the group.

(Diagram: Topic T1 with Partitions 0-3 assigned across Consumer 1 and Consumer 2 in Consumer Group 1)

Page 51: Putting Kafka Into Overdrive

Rebalances are the consumer performance killer

Consumers must keep polling, or they die.

When consumers die, the group rebalances.

When the group rebalances, it does not consume.
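A minimal sketch of a poll loop that keeps the consumer alive; the broker address, group, and topic are placeholders:

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder
        props.put("group.id", "my-group");               // placeholder
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("my-topic"));
            while (true) {
                // A consumer that stops calling poll() is considered dead and
                // its group rebalances, so keep per-record work short enough
                // that poll() is called frequently.
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application logic goes here
    }
}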

Page 52: Putting Kafka Into Overdrive

fetch.min.bytes vs. fetch.max.wait.ms

• What if the topic doesn’t have much data?
  • “Are we there yet?” “And now?”

• Reduce load on the broker by letting fetch requests wait a bit for data

• Add latency to increase throughput

• Careful! Don’t fetch more than you can process!
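As a sketch, using the new-consumer configuration names (the values are examples, not recommendations):

import java.util.Properties;

public class FetchTuning {
    // The broker holds each fetch request until it has fetch.min.bytes of
    // data to return, or fetch.max.wait.ms has elapsed, whichever comes first.
    static Properties withFetchTuning(Properties props) {
        props.put("fetch.min.bytes", "65536");    // example value: 64 KB
        props.put("fetch.max.wait.ms", "500");    // adds up to 500 ms of latency
        return props;
    }
}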

Page 53: Putting Kafka Into Overdrive

Commits take time

• Commit less often
• Commit async
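A sketch of an asynchronous commit; the callback and its logging are illustrative, and a common pattern is one final synchronous commit on shutdown:

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Commits {
    // commitAsync() does not block the poll loop the way commitSync() does.
    static void commit(KafkaConsumer<?, ?> consumer) {
        consumer.commitAsync((offsets, exception) -> {
            if (exception != null) {
                System.err.println("Offset commit failed: " + exception);
            }
        });
    }
}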

Page 54: Putting Kafka Into Overdrive

Add partitions

• Consumer throughput is often limited by the target
  • i.e. you can only write to HDFS so fast (and it ain’t fast)

• My SLA is 1 GB/s, but single-client HDFS writes are 20 MB/s
  • If each consumer writes to HDFS, you need 50 consumers
  • Which means you need 50 partitions

• Except sometimes adding partitions is painful
  • So do the math first

Page 55: Putting Kafka Into Overdrive

I need to get data from Dallas to AWS

• Put the consumer far from Kafka
  • Because failure to pull data is safer than failure to push

• Tune network parameters in the client, Kafka, and both OSes
  • Send buffer -> bandwidth × delay
  • Receive buffer
  • fetch.min.bytes

This will maximize use of bandwidth. Note that cheap AWS nodes have low bandwidth.
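A sketch of the bandwidth-delay math as consumer configuration; the link speed, RTT, and resulting values are assumptions for illustration, and the OS must also permit buffers this large (see net.core.rmem_max in the appendix):

import java.util.Properties;

public class CrossDcTuning {
    // Example: a 1 Gbps link at 40 ms round-trip time has a bandwidth-delay
    // product of 125 MB/s * 0.04 s = 5 MB, so size the receive buffer to match.
    static Properties withWanTuning(Properties props) {
        props.put("receive.buffer.bytes", "5242880");  // ~bandwidth x delay
        props.put("send.buffer.bytes", "1048576");
        props.put("fetch.min.bytes", "1048576");       // fewer, fuller fetches
        return props;
    }
}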

Page 56: Putting Kafka Into Overdrive

Monitor

• records-lag-max
  • Burrow is useful here

• fetch-rate
• fetch-latency
• records-per-request / bytes-per-request

Apologies on behalf of the Kafka community: we forgot to document the metrics for the new consumer.
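These metrics are exposed programmatically through KafkaConsumer.metrics(); a minimal sketch that pulls records-lag-max out of that map (the lookup by metric name is the only assumption here):

import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class LagCheck {
    // Scan the consumer's built-in metrics for records-lag-max.
    static void printMaxLag(KafkaConsumer<?, ?> consumer) {
        Map<MetricName, ? extends Metric> metrics = consumer.metrics();
        for (Map.Entry<MetricName, ? extends Metric> entry : metrics.entrySet()) {
            if ("records-lag-max".equals(entry.getKey().name())) {
                System.out.println("records-lag-max = " + entry.getValue().value());
            }
        }
    }
}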

Page 57: Putting Kafka Into Overdrive

Wrapping Up

Page 58: Putting Kafka Into Overdrive

One Ecosystem

• Kafka can scale to millions of messages per second, and more
  • Operations must scale the cluster appropriately
  • Developers must use the right tuning and go parallel

• Few problems are owned by only one side
  • Expanding partitions often requires coordination
  • Applications that need higher reliability drive cluster configurations

• Either we work together, or we fail separately

Page 61: Putting Kafka Into Overdrive

Would You Like to Know More?

• Kafka Summit is April 26th in San Francisco
  • Reliability Guarantees in Kafka (Gwen)
  • Some "Kafkaesque" Days in Operations (Joel Koshy)
  • More Datacenters, More Problems (Todd)
  • Many more talks...

• ApacheCon Big Data is May 9-12 in Vancouver
  • Streaming Data Integration at Scale (Ewen Cheslack-Postava)
  • Kafka at Peak Performance (Todd)
  • Building a Self-Serve Kafka Ecosystem (Joel Koshy)


Page 63: Putting Kafka Into Overdrive

Appendix
More Information for later

Page 64: Putting Kafka Into Overdrive

JDK Options

Heap Size:
-Xmx6g -Xms6g

Metaspace:
-XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80

G1 Tuning:
-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M

GC Logging:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc

Error Handling:
-XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log

Page 65: Putting Kafka Into Overdrive

OS Tuning Parameters

• Networking:
net.core.rmem_default = 124928
net.core.rmem_max = 2048000
net.core.wmem_default = 124928
net.core.wmem_max = 2048000
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 1024

Page 66: Putting Kafka Into Overdrive

OS Tuning Parameters (cont.)

• Virtual Memory:
vm.oom_kill_allocating_task = 1
vm.max_map_count = 200000
vm.swappiness = 1
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5

Page 67: Putting Kafka Into Overdrive

Kafka Broker Sensors

kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
kafka.server:name=PartitionCount,type=ReplicaManager
kafka.server:name=LeaderCount,type=ReplicaManager
kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
kafka.controller:name=ActiveControllerCount,type=KafkaController
kafka.controller:name=OfflinePartitionsCount,type=KafkaController
kafka.log:name=max-dirty-percent,type=LogCleanerManager
kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
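These are standard JMX MBeans; a minimal sketch that reads UnderReplicatedPartitions from a broker (the JMX host/port and the gauge attribute name "Value" are assumptions about your setup):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port: point this at a broker's JMX listener.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName bean = new ObjectName(
                "kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager");
            System.out.println("UnderReplicatedPartitions = "
                + mbsc.getAttribute(bean, "Value"));
        }
    }
}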

Page 68: Putting Kafka Into Overdrive

Kafka Broker Sensors - Topics

kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*