Intro to Apache Kafka

Text of Intro to Apache Kafka

  1. Intro to Apache Kafka. Jason Hubbard | Systems Engineer. Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
  2. Kafka Overview
  3. What is Kafka?
     • Developed by LinkedIn after challenges building pipelines into Hadoop
     • Message-based store used to build data pipelines and support streaming applications
     • Kafka offers publish & subscribe semantics, horizontal scalability, and high availability
     • Nodes in a Kafka cluster (called brokers) can handle reads/writes in the 100s of MB per second, thousands of producers and consumers, and multiple node failures (with proper configuration)
  4. Why Kafka? (Or rather, why not Flume?)
     • No ability to replay events
     • Multiple sinks require event replication (via multiple channels)
     • Sinks that share a source (mostly) process events in sync
     • [Diagram: Flume topology in which logs feed spool sources, channels, and Avro sinks, which forward to an Avro source whose channels feed an HBase sink and an HDFS sink]
  5. Why Kafka for Hadoop? [Diagrams labeled 2009 and 2012]
  6. Why Kafka? Decoupling. [Diagrams labeled 2012 and 2013+?]
  7. A Departure from Legacy Models
     • Message stores have two well-known types: queues (producer-consumers) and topics (publisher-subscribers)
     • One consumer gets one message from a queue, then it's gone; consumers might work alone or in concert
     • Multiple subscribers can get one message from a topic; messages are published
     • Kafka inverts or blends these concepts: it tracks consumers by group identification, retains messages by expiration rather than consumer interaction, bakes in partitioning for scalability and parallel operations, and bakes in replication for availability and fault tolerance
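
The contrast between a legacy queue and a retained log can be sketched in a few lines. This is a toy model in plain Python, not Kafka's API; the names `queue`, `log`, and `read_from` are made up for illustration.

```python
from collections import deque

# Legacy queue: reading a message removes it from the store.
queue = deque(["m0", "m1", "m2"])
first = queue.popleft()          # one consumer takes "m0"
queue_left = list(queue)         # "m0" is now gone for every other consumer

# Retained log (toy model): reading is only a lookup by position.
log = ["m0", "m1", "m2"]

def read_from(log, position):
    """Return every message at or after `position` without removing anything."""
    return log[position:]

reader_a = read_from(log, 0)     # starts at the beginning
reader_b = read_from(log, 0)     # an independent reader sees the same messages
```

Because reading does not consume anything, removal has to be driven by something else, which is why the slide ties retention to expiration.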
  8. Components & Roles
     • A Kafka server is called a broker
     • Brokers can work together in a cluster
     • Each broker hosts message stores called topics
     • You can partition a topic across brokers for scale and parallelism
     • You can also replicate a topic for resilience to failure
     • Producers push to a Kafka topic, consumers pull
     • Kafka provides Consumer and Producer APIs
  9. Detailed Architecture: It's all about the logs! No, not application logs.
  10. Kafka Detailed Architecture
     • Brokers and consumers initialize their state in Zookeeper
     • Broker state includes host name, port address, and partition list
     • Consumer state includes group name and message offsets (deprecated)
     • [Diagram: producers, a Kafka cluster of brokers, and consumers, with Zookeeper alongside the brokers and an "Offsets" label]
  11. Kafka and Zookeeper
     • Kafka uses Zookeeper to indicate liveness of each broker, to store broker and consumer state, and to coordinate leader elections for failover
     • Zookeeper stores consumer offsets by default; this can be switched to the brokers, if desired
     • Zookeeper also tracks and supports state changes such as adding/removing brokers and consumers, rebalancing consumers, and directing producers and consumers to partition leaders
  12. Topic Partitions
     • A partition is a totally ordered store of messages (a log)
     • Partition order is immutable
     • Messages are deleted as their time runs out
     • New messages can only be appended
     • The message offset is both a sequence number and a unique identifier within a (topic, partition)
     • [Diagram: Partition 0, Partition 1, and Partition 2, each a row of numbered offsets running from old to new, with writes arriving at the new end]
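
The partition properties above (append-only, offsets as sequence numbers, deletion by age) can be sketched with a small in-memory model. This is an illustrative toy, not Kafka's storage format: `PartitionLog`, `retention_seconds`, and the explicit `now` arguments are assumptions made for the example, and real Kafka removes expired data in whole log segments rather than message by message.

```python
class PartitionLog:
    """Toy append-only log for one topic partition."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.next_offset = 0      # offset the next appended message will receive
        self.records = []         # (offset, timestamp, value), oldest first

    def append(self, value, now):
        """Append at the end of the log and return the assigned offset."""
        offset = self.next_offset
        self.records.append((offset, now, value))
        self.next_offset += 1
        return offset

    def expire(self, now):
        """Drop records older than the retention window; offsets are never reused."""
        self.records = [r for r in self.records
                        if now - r[1] < self.retention_seconds]

    def offsets(self):
        return [offset for offset, _, _ in self.records]


log = PartitionLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
log.append("c", now=70)
log.expire(now=80)    # "a" is 80s old and is dropped; "b" (50s) and "c" (10s) stay
```

After expiry the surviving offsets are 1 and 2, and the next append still receives offset 3, so an offset keeps identifying the same message for as long as that message exists.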
  13. How are partitions distributed?
     • Partitions are usually distributed across brokers
     • Each broker may host partitions of several topics
     • One broker acts as leader for any replicated partition
     • Other brokers with a replica act as followers
     • Only leaders serve read/write requests
     • If the leader blinks out, a follower is elected to take over
     • Election occurs only among in-sync replicas (ISRs)
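
The failover rule above, that only in-sync replicas are eligible to lead, can be sketched as follows. This is a simplified model, not Kafka's controller code: `elect_leader` is a hypothetical helper, and choosing the first eligible broker in replica-list order is an assumption made for the example.

```python
def elect_leader(replicas, isr, failed):
    """Pick a new leader for one partition from the surviving in-sync replicas.

    replicas: broker ids holding a replica, in preference order
    isr:      set of broker ids currently in sync with the old leader
    failed:   set of broker ids that are down
    Returns the new leader's broker id, or None if no in-sync replica survives.
    """
    for broker in replicas:
        if broker in isr and broker not in failed:
            return broker
    return None


# Broker 1 led the partition and has failed; broker 2 had fallen out of sync.
new_leader = elect_leader(replicas=[1, 2, 3], isr={1, 3}, failed={1})

# If the failed leader was the only in-sync replica, nobody is eligible.
no_leader = elect_leader(replicas=[1, 2, 3], isr={1}, failed={1})
```

In the first case broker 3 takes over even though broker 2 comes earlier in the list, because broker 2 is not in the ISR.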
  14. Scalability & Parallelism
     • Partitions can be used to allow message storage that exceeds one broker's capacity; more brokers = greater message capacity
     • Partitions also allow consumer groups to read a topic in parallel; each member can read a partition
     • Kafka ensures no consumer contention in one group for a partition
  15. Replication
     • A topic partition is the unit of replication
     • A replica remains in sync with its leader so long as it maintains communication with Zookeeper and does not fall too far behind the leader (configurable)
     • Replicating to n brokers allows Kafka to offer availability under n - 1 losses
     • The quality of this offer is tempered by the ISR group count
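
A rough sketch of the two in-sync conditions, and of why the ISR count tempers the n - 1 promise, might look like this. It is a toy model: `in_sync_followers` and the message-count threshold `max_lag` are hypothetical, and the `alive` flag stands in for a broker keeping its Zookeeper session.

```python
def in_sync_followers(leader_end_offset, followers, max_lag):
    """Return the follower broker ids that count as in sync.

    followers: dict of broker id -> (alive, end_offset)
    max_lag:   hypothetical threshold, in messages, for "too far behind"
    """
    return {broker for broker, (alive, end_offset) in followers.items()
            if alive and leader_end_offset - end_offset <= max_lag}


# Broker 1 leads at offset 100; the partition is replicated to n = 4 brokers.
followers = {2: (True, 98), 3: (True, 40), 4: (False, 100)}
isr = {1} | in_sync_followers(100, followers, max_lag=10)

replication_factor = 1 + len(followers)
nominal_tolerance = replication_factor - 1   # n - 1 losses on paper
current_tolerance = len(isr) - 1             # losses the ISR can absorb right now
```

Here broker 3 lags too far and broker 4 is down, so only brokers 1 and 2 are in sync: the partition is nominally safe under three losses but at this moment can absorb only one.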
  16. Fault Tolerance
     • A broker may lead for some partitions and follow for others
     • The replication factor for each topic determines how many brokers will follow
     • Followers passively replicate the leader
     • You can set an ISR policy; it boils down to a preference for high, medium, or low throughput
     • The right ISR policy strikes some balance between availability (electing a leader quickly in the event of failure) and latency (assuring a producer its messages are safe, i.e., durable)
  17. Producers
     • Producers publish data (messages) to Kafka topics
     • Producers choose the partition a message goes to, either by selecting in round-robin fashion to distribute the load or by assigning a semantic partitioning function to key the messages
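
The two partition-selection strategies can be sketched as a small class. This is an illustration, not Kafka's partitioner: `ToyPartitioner` is a hypothetical name, and `zlib.crc32` is used only as a convenient stable hash standing in for whatever function a real client applies to keys.

```python
import itertools
import zlib


class ToyPartitioner:
    """Choose a partition round-robin for unkeyed messages, by key hash otherwise."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: spread the load evenly across partitions.
            return next(self._round_robin)
        # Keyed: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
unkeyed = [partitioner.partition() for _ in range(4)]
keyed_first = partitioner.partition("user-42")
keyed_again = partitioner.partition("user-42")
```

Keying matters for ordering: since all messages with one key land in one partition, they keep their relative order there.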
  18. Consumers
     • A consumer reads messages published to Kafka topics by moving its offset; the offset increments by default
     • Every consumer specifies a group label
     • Consumer actions in one group do not affect other groups
     • If one group "tails" a topic's messages, it does not change what another group can consume
     • Consumers come and go with little impact on the cluster or other consumers
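
A minimal model of per-group offsets shows why one group's reads do not affect another's. The class and method names here (`ToyTopicPartition`, `poll`, `seek`) and the group labels are illustrative assumptions and do not reproduce the Kafka consumer API.

```python
class ToyTopicPartition:
    """One partition's messages plus a separate read position for each group."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.group_offsets = {}       # group label -> next offset to read

    def poll(self, group):
        """Return the group's next message and advance its offset, or None at the end."""
        offset = self.group_offsets.get(group, 0)
        if offset >= len(self.messages):
            return None
        self.group_offsets[group] = offset + 1
        return self.messages[offset]

    def seek(self, group, offset):
        """Move a group's position, for example back to 0 to replay."""
        self.group_offsets[group] = offset


tp = ToyTopicPartition(["m0", "m1", "m2"])
billing_reads = [tp.poll("billing"), tp.poll("billing")]   # advances to offset 2
audit_read = tp.poll("audit")                              # starts from offset 0
tp.seek("billing", 0)
replayed = tp.poll("billing")                              # reads "m0" again
```

The messages themselves are never modified by a read; only the per-group position changes.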
  19. Kafka Consumer Group Operation
     • Every message in a partition is read by the same instance of a consumer group
     • Group members can be processes residing on separate machines
     • The slide's diagram shows a two-broker cluster; the brokers host one topic in four partitions, P0-P3
     • Group A has two instances; each instance reads two partitions
     • Group B has four instances; each instance reads one partition
     • [Diagram: Kafka cluster with Broker 1 and Broker 2 hosting partitions P0-P3; Consumer Group A with instances C1 and C2; Consumer Group B with instances C3, C4, C5, and C6]
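
The two groups in this example can be reproduced with a simple assignment function. This is a sketch under assumptions: `assign` is a hypothetical helper that deals partitions out round-robin, and which specific partitions each instance receives is a guess, since the slide text states only the counts.

```python
def assign(partitions, members):
    """Deal partitions out round-robin so each partition has exactly one owner."""
    assignment = {member: [] for member in members}
    for index, partition in enumerate(partitions):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign(partitions, ["C1", "C2"])              # two partitions each
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])  # one partition each
```

Within a group no partition has two owners, which is what rules out contention; across groups every partition is read once per group.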
  20. Messages
     • Kafka stores messages in its own format
     • Producers and consumers also use this format for transfer efficiency
     • Any serializable object can be a message
     • Popular formats include string, JSON, and Avro
     • Each message's id is also its unique identifier in a topic partition
  21. Traditional Message Ordering
     • Traditional queues store messages in the order received
     • Consumers draw messages in store order
     • With multiple consumers, however, messages are not received in order: consumers may experience different delays, and they might also consume messages at different rates
     • To retain order, only one process may consume from the queue; this comes at the expense of parallelism
  22. Guarantees for Ordering
     • Kafka appends messages sent by a producer to one partition in sending order
     • If a producer sends M1 followed by M2 to the same partition, M1 will have a lower offset than M2 and will appear earlier in the partition
     • A consumer always sees messages in stored order
     • Given a partition with N replicas, up to N-1 server failures may occur without message loss