Amazon Kinesis Capture, Deliver, and Process Real-time Data Streams on AWS
What to Expect from this 30-minute Session
• Amazon Kinesis Overview
• Kinesis data ingestion model
• 6 things you need to know about Kinesis Streams
• How to think about partition keys
• Sizing the stream
• Extended retention
• PutRecords API for high throughput
• Kinesis Producer Library
• Timestamps in your data
• Scaling your stream
• Consuming from Kinesis Streams
Amazon Kinesis Overview
Amazon Kinesis: managed service to capture and expose data streams for processing
Amazon Web Services
• Durable, highly consistent storage replicates data across three data centers (availability zones)
• Millions of sources producing 100s of terabytes per hour
• Front end handles authentication and authorization
• Ordered stream of events supports multiple readers
• Consumers: aggregate and archive to S3; real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate analysis in Hadoop or a data warehouse
Inexpensive: $0.028 per million puts
Amazon Kinesis Streams: foundational service for stream data processing

Real-time Ingest
• Highly scalable
• Durable
• Elastic
• Replay-able reads

Continuous Processing
• Elastic
• Load-balancing of incoming streams
• Fault tolerance, checkpoint/replay
• Enables multiple processing apps in parallel

• Enables data movement into stores/processing engines
• Managed service
• Low end-to-end latency
Amazon Kinesis: Streaming Data Ingestion
• Provisioned entity called a Stream, composed of Shards
• Each Shard ingests data up to 1 MB/sec and up to 1,000 records/sec, and egresses up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams at any time by splitting or merging Shards
• Replay data inside the 24-hour window
Kinesis Stream: managed entity to capture and store data
• Producers use a PUT call to store data in a Stream; each record <= 1 MB
• PutRecord {Data, StreamName, PartitionKey}
• PutRecords {Records {Data, PartitionKey}, StreamName}
• A Partition Key is supplied by the producer and used to distribute PUTs (via MD5 hash) across the hash key ranges of the Shards
• A unique sequence number is returned to the Producer upon a successful PUT call
• A unique timestamp is affixed to each record
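The MD5-based routing described above can be sketched in Python. This is an illustrative helper, not part of any SDK; the service performs this mapping internally. It assumes the shards evenly split the 128-bit hash key space, which is the default for a newly created stream.

```python
import hashlib

# Sketch of how Kinesis routes a record: the MD5 hash of the partition key
# (read as a 128-bit integer) falls into exactly one shard's hash key range.
def shard_for_key(partition_key: str, num_shards: int) -> int:
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    hash_space = 2 ** 128  # MD5 output space
    return key_hash * num_shards // hash_space
```

The same partition key always maps to the same shard (until a reshard changes the hash key ranges), which is what makes per-key ordering possible.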
[Diagram: many Producers writing into a Kinesis stream composed of Shards 1 through n]
Putting Data into Kinesis: simple Put interface to store data in Kinesis

Managed Buffer
• Care about a reliable, scalable way to capture data
• Defer all other aggregation to the consumer
• Generate random Partition Keys
• Ensure high cardinality of Partition Keys with respect to shards, to spray PUTs evenly across available shards
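The managed-buffer approach above comes down to one line: a fresh random key per record. A minimal sketch (the stream name and the commented boto3 call are illustrative):

```python
import uuid

# Managed-buffer style: a random, high-cardinality partition key per record,
# so PUTs spray evenly across all available shards.
def random_partition_key() -> str:
    return uuid.uuid4().hex  # 32 hex chars; cardinality far exceeds any shard count

# Each record would then be sent as, for example:
# kinesis.put_record(StreamName="my-stream", Data=payload,
#                    PartitionKey=random_partition_key())
```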
Topic #1: Thinking about ingestion. Workload determines partition key strategy

Streaming Map-Reduce
• Leverage partition keys as a natural way to aggregate data, e.g. a Partition Key per billing customer, per DeviceId, per stock symbol
• Design partition keys to scale
• Be aware of "hot" partition keys or shards
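One way to spot a hot partition key before it becomes a hot shard is to watch what share of traffic the single most frequent key carries. A hypothetical helper:

```python
from collections import Counter

# Fraction of records carried by the single most frequent partition key.
# A share near or above 1/num_shards suggests one shard will run hot.
def hottest_key_share(partition_keys) -> float:
    counts = Counter(partition_keys)
    return counts.most_common(1)[0][1] / len(partition_keys)
```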
Topic #2: Sizing the Kinesis Stream. Provision adequate Shards; you can always change them
• For ingress needs, capture all incoming data
• Count likely data producers: log servers, sensors/things, smartphone app installs
• Individual payload size and (desired) frequency of Puts
• For egress needs, feed all consuming applications
• Each Shard can do 2 MB/sec on egress
• Add more Shards for more applications
• Include headroom to "catch up" with data in the stream in the event of application failures
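The sizing rules above reduce to simple arithmetic over the per-shard limits stated in these slides (1 MB/sec and 1,000 records/sec in, 2 MB/sec out). A sketch; the function name and the 25% default headroom are illustrative choices, not an AWS formula:

```python
import math

def shards_needed(ingress_mb_per_sec: float,
                  ingress_records_per_sec: float,
                  egress_mb_per_sec: float,
                  headroom: float = 1.25) -> int:
    """Shards required to satisfy both ingress and egress limits, with catch-up headroom."""
    by_ingress_mb = ingress_mb_per_sec / 1.0          # 1 MB/sec in per shard
    by_ingress_tps = ingress_records_per_sec / 1000.0  # 1,000 records/sec per shard
    by_egress_mb = egress_mb_per_sec / 2.0             # 2 MB/sec out per shard
    return math.ceil(max(by_ingress_mb, by_ingress_tps, by_egress_mb) * headroom)
```

Note that egress scales with the number of consuming applications: three readers each taking the full stream triples the egress requirement.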
Topic #3: Kinesis PutRecords API. High-throughput API for efficient writes to Kinesis
• PutRecords {Records {Data, PartitionKey}, StreamName}
• Supports up to 500 records per request
• Each record can be <= 1 MB, up to 5 MB for the entire API request
• Can include records with different partition keys
• Response
• PutRecords is not atomic; it can fail partially
• The API response includes an array of response records, both successful and unsuccessful
• An unsuccessful response record includes ErrorCode and ErrorMessage values
• You must write code that examines the PutRecordsResult objects to detect individual record failures and take appropriate action
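Detecting those individual failures means pairing request records with response records by position, since the response array is aligned with the request. A sketch of the filtering step (the response shape follows the PutRecords API; the surrounding retry/backoff loop is up to you):

```python
def records_to_retry(request_records, response):
    """Return the subset of request records whose response entry failed.

    PutRecords is not atomic: response["Records"] is positionally aligned
    with the request, and failed entries carry ErrorCode/ErrorMessage
    instead of a SequenceNumber.
    """
    if response.get("FailedRecordCount", 0) == 0:
        return []
    return [req for req, resp in zip(request_records, response["Records"])
            if "ErrorCode" in resp]
```

The returned subset would be resubmitted with a fresh PutRecords call, typically after an exponential backoff when the ErrorCode is ProvisionedThroughputExceededException.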
Topic #4: Kinesis Producer Library. Highly configurable library to write to Kinesis
• Collects records and uses PutRecords for high-throughput writes
• Can write to multiple Streams, with automatic and configurable retries
• Retries in case of errors, with the ability to distinguish between retryable and non-retryable errors
• Tracks record age and enforces maximum buffering times
• Aggregates user records to increase payload size and improve throughput
• Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to de-aggregate batched records
• Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance
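The aggregation idea, packing many small user records into one Kinesis record, can be illustrated with simple length-prefix framing. Note this is not the KPL's actual wire format (the KPL uses its own protobuf-based record format); the sketch only shows why aggregation raises throughput: one PUT now carries many logical records.

```python
import struct

def aggregate(user_records):
    """Pack small byte strings into one blob with 4-byte big-endian length prefixes."""
    return b"".join(struct.pack(">I", len(r)) + r for r in user_records)

def deaggregate(blob):
    """Inverse of aggregate(), as a KCL-style consumer would apply per Kinesis record."""
    out, i = [], 0
    while i < len(blob):
        (n,) = struct.unpack_from(">I", blob, i)
        out.append(blob[i + 4 : i + 4 + n])
        i += 4 + n
    return out
```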
Topic #5: Including "Time" in your data. ApproximateArrivalTimestamp: records in the Stream at millisecond precision
• Each Amazon Kinesis record includes an approximate arrival timestamp at millisecond precision
• Set when the Stream successfully receives the record
• No guarantees about timestamp accuracy, or that it is increasing across records in a shard or stream
• ApproximateArrivalTimestamp is exposed in the processRecords API call
• Use it when the data producer can't tell time, when you want to build a time-windowed application, or when you want to know the age of the oldest unread record in the Shard
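For instance, the age of the oldest record in a batch can be computed from that field. Records are shown here as plain dicts with an ApproximateArrivalTimestamp datetime, the shape a GetRecords/processRecords consumer sees; the helper name is illustrative:

```python
from datetime import datetime, timezone

def oldest_record_age_seconds(records, now=None):
    """Age of the oldest record in a batch, from ApproximateArrivalTimestamp."""
    now = now or datetime.now(timezone.utc)
    oldest = min(r["ApproximateArrivalTimestamp"] for r in records)
    return (now - oldest).total_seconds()
```

Because accuracy and ordering are not guaranteed, treat the result as approximate, e.g. as a lag signal for alarms rather than as an exact event time.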
New!
Topic #6: Dealing with provisioned throughput exceeded. Metrics and resharding (SplitShard/MergeShards)
• Keep track of your metrics
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes keep track of shard usage
• Retry if the rise in input rate is temporary
• Reshard to increase the number of shards
• SplitShard adds more shards
• MergeShards removes shards
• Use the Kinesis Scaling Utility (on GitHub)

Metric               Units
PutRecords.Bytes     Bytes
PutRecords.Latency   Milliseconds
PutRecords.Success   Count
PutRecords.Records   Count
IncomingBytes        Bytes
IncomingRecords      Count
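An even SplitShard places the new starting hash key at the midpoint of the parent shard's hash key range. A sketch of that arithmetic (the helper name is illustrative; the boto3 call it would feed is shown as a comment since it requires a live stream):

```python
def split_point(starting_hash_key: str, ending_hash_key: str) -> str:
    """Midpoint of a shard's hash key range, for an even two-way split.

    Kinesis hash keys are decimal strings covering a 128-bit range.
    """
    lo, hi = int(starting_hash_key), int(ending_hash_key)
    return str((lo + hi + 1) // 2)

# With boto3 this would feed SplitShard, for example:
# kinesis.split_shard(StreamName="my-stream",
#                     ShardToSplit="shardId-000000000000",
#                     NewStartingHashKey=split_point(lo_key, hi_key))
```

MergeShards is the inverse operation and requires the two shards to be adjacent in the hash key space.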
Sending & Reading Data from Kinesis Streams

Sending: AWS SDK, AWS Mobile SDK, Kinesis Producer Library, LOG4J, Flume, Fluentd
Consuming: Get* APIs, Kinesis Client Library + Connector Library, AWS Lambda, Apache Storm, Apache Spark, Amazon Elastic MapReduce