Amazon Kinesis Capture, Deliver, and Process Real-time Data Streams on AWS
What to Expect from this 30-minute Session
• Amazon Kinesis Overview
• Kinesis data ingestion model
• 6 things you need to know about Kinesis Streams
• How to think about partition keys
• Sizing the stream
• Extended retention
• PutRecords API for high throughput
• Kinesis Producer Library
• Timestamps in your data
• Scaling your stream
• Consuming from Kinesis Streams
Amazon Kinesis Overview
Amazon Kinesis: managed service to capture and expose data streams for processing
Amazon Web Services
• Durable, highly consistent storage replicates data across three data centers (availability zones)
• Millions of sources producing 100s of terabytes per hour
• Front end handles authentication and authorization
• Ordered stream of events supports multiple readers
• Consumers: aggregate and archive to S3; real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate analysis in Hadoop or a data warehouse
Inexpensive: $0.028 per million puts
Amazon Kinesis Streams: foundational service for stream data processing

Real-time Ingest
• Highly scalable
• Durable
• Elastic
• Replay-able reads

Continuous Processing
• Elastic
• Load-balancing of incoming streams
• Fault tolerance, checkpoint/replay
• Enables multiple processing apps in parallel

• Enables data movement into stores/processing engines
• Managed service
• Low end-to-end latency
Amazon Kinesis: Streaming Data Ingestion
• Provisioned entity called a Stream, composed of Shards
• Each Shard ingests data up to 1 MB/sec and up to 1,000 records/sec, and egresses up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams at any time by splitting or merging Shards
• Replay data inside the 24-hour window
Kinesis Stream: managed entity to capture and store data
• Producers use a PUT call to store data in a Stream; each record <= 1 MB
• PutRecord {Data, StreamName, PartitionKey}
• PutRecords {Records {Data, PartitionKey}, StreamName}
• A Partition Key is supplied by the producer and used to distribute PUTs (via MD5 hash) across the hash key ranges of the Shards
• A unique sequence number is returned to the Producer upon a successful PUT call
• A unique timestamp is affixed to each record
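The MD5-based routing described above can be sketched in Python. This is an illustrative helper, not part of any SDK; the service performs this mapping internally. It assumes the shards evenly split the 128-bit hash key space, which is the default for a newly created stream.

```python
import hashlib

# Sketch of how Kinesis routes a record: the MD5 hash of the partition key
# (read as a 128-bit integer) falls into exactly one shard's hash key range.
def shard_for_key(partition_key: str, num_shards: int) -> int:
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    hash_space = 2 ** 128  # MD5 output space
    return key_hash * num_shards // hash_space
```

The same partition key always maps to the same shard (until a reshard changes the hash key ranges), which is what makes per-key ordering possible.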
[Diagram: many Producers writing into a Kinesis stream composed of Shards 1 through n]
Putting Data into Kinesis: simple Put interface to store data in Kinesis

Managed Buffer
• Care about a reliable, scalable way to capture data
• Defer all other aggregation to the consumer
• Generate random Partition Keys
• Ensure high cardinality of Partition Keys with respect to shards, to spray PUTs evenly across available shards
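The managed-buffer approach above comes down to one line: a fresh random key per record. A minimal sketch (the stream name and the commented boto3 call are illustrative):

```python
import uuid

# Managed-buffer style: a random, high-cardinality partition key per record,
# so PUTs spray evenly across all available shards.
def random_partition_key() -> str:
    return uuid.uuid4().hex  # 32 hex chars; cardinality far exceeds any shard count

# Each record would then be sent as, for example:
# kinesis.put_record(StreamName="my-stream", Data=payload,
#                    PartitionKey=random_partition_key())
```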
Topic #1: Thinking about ingestion. Workload determines partition key strategy

Streaming Map-Reduce
• Leverage partition keys as a natural way to aggregate data, e.g. a Partition Key per billing customer, per DeviceId, per stock symbol
• Design partition keys to scale
• Be aware of "hot" partition keys or shards
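One way to spot a hot partition key before it becomes a hot shard is to watch what share of traffic the single most frequent key carries. A hypothetical helper:

```python
from collections import Counter

# Fraction of records carried by the single most frequent partition key.
# A share near or above 1/num_shards suggests one shard will run hot.
def hottest_key_share(partition_keys) -> float:
    counts = Counter(partition_keys)
    return counts.most_common(1)[0][1] / len(partition_keys)
```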
Topic #2: Sizing the Kinesis Stream. Provision adequate Shards; you can always change them
• For ingress needs, capture all incoming data
• Count likely data producers: log servers, sensors/things, smartphone app installs
• Individual payload size and (desired) frequency of Puts
• For egress needs, feed all consuming applications
• Each Shard can do 2 MB/sec on egress
• Add more Shards for more applications
• Include headroom to "catch up" with data in the stream in the event of application failures
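The sizing rules above reduce to simple arithmetic over the per-shard limits stated in these slides (1 MB/sec and 1,000 records/sec in, 2 MB/sec out). A sketch; the function name and the 25% default headroom are illustrative choices, not an AWS formula:

```python
import math

def shards_needed(ingress_mb_per_sec: float,
                  ingress_records_per_sec: float,
                  egress_mb_per_sec: float,
                  headroom: float = 1.25) -> int:
    """Shards required to satisfy both ingress and egress limits, with catch-up headroom."""
    by_ingress_mb = ingress_mb_per_sec / 1.0          # 1 MB/sec in per shard
    by_ingress_tps = ingress_records_per_sec / 1000.0  # 1,000 records/sec per shard
    by_egress_mb = egress_mb_per_sec / 2.0             # 2 MB/sec out per shard
    return math.ceil(max(by_ingress_mb, by_ingress_tps, by_egress_mb) * headroom)
```

Note that egress scales with the number of consuming applications: three readers each taking the full stream triples the egress requirement.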
Topic #3: Kinesis PutRecords API. High-throughput API for efficient writes to Kinesis
• PutRecords {Records {Data, PartitionKey}, StreamName}
• Supports up to 500 records per request
• Each record can be <= 1 MB, up to 5 MB for the entire API request
• Can include records with different partition keys
• Response
• PutRecords is not atomic; it can fail partially
• The API response includes an array of response records, both successful and unsuccessful
• An unsuccessful response record includes ErrorCode and ErrorMessage values
• You must write code that examines the PutRecordsResult objects to detect individual record failures and take appropriate action
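Detecting those individual failures means pairing request records with response records by position, since the response array is aligned with the request. A sketch of the filtering step (the response shape follows the PutRecords API; the surrounding retry/backoff loop is up to you):

```python
def records_to_retry(request_records, response):
    """Return the subset of request records whose response entry failed.

    PutRecords is not atomic: response["Records"] is positionally aligned
    with the request, and failed entries carry ErrorCode/ErrorMessage
    instead of a SequenceNumber.
    """
    if response.get("FailedRecordCount", 0) == 0:
        return []
    return [req for req, resp in zip(request_records, response["Records"])
            if "ErrorCode" in resp]
```

The returned subset would be resubmitted with a fresh PutRecords call, typically after an exponential backoff when the ErrorCode is ProvisionedThroughputExceededException.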
Topic #4: Kinesis Producer Library. Highly configurable library to write to Kinesis
• Collects records and uses PutRecords for high-throughput writes
• Can write to multiple Streams, with automatic and configurable retries
• Retries in case of errors, with the ability to distinguish between retryable and non-retryable errors
• Tracks record age and enforces maximum buffering times
• Aggregates user records to increase payload size and improve throughput
• Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to de-aggregate batched records
• Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance
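The aggregation idea, packing many small user records into one Kinesis record, can be illustrated with simple length-prefix framing. Note this is not the KPL's actual wire format (the KPL uses its own protobuf-based record format); the sketch only shows why aggregation raises throughput: one PUT now carries many logical records.

```python
import struct

def aggregate(user_records):
    """Pack small byte strings into one blob with 4-byte big-endian length prefixes."""
    return b"".join(struct.pack(">I", len(r)) + r for r in user_records)

def deaggregate(blob):
    """Inverse of aggregate(), as a KCL-style consumer would apply per Kinesis record."""
    out, i = [], 0
    while i < len(blob):
        (n,) = struct.unpack_from(">I", blob, i)
        out.append(blob[i + 4 : i + 4 + n])
        i += 4 + n
    return out
```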
Topic #5: Including "Time" in your data. ApproximateArrivalTimestamp: records in the Stream at millisecond precision
• Each Amazon Kinesis record includes an approximate arrival timestamp at millisecond precision
• Set when the Stream successfully receives the record
• No guarantees about timestamp accuracy, or that it is increasing across records in a shard or stream
• ApproximateArrivalTimestamp is exposed in the processRecords API call
• Use it when the data producer can't tell time, when you want to build a time-windowed application, or when you want to know the age of the oldest unread record in the Shard
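For instance, the age of the oldest record in a batch can be computed from that field. Records are shown here as plain dicts with an ApproximateArrivalTimestamp datetime, the shape a GetRecords/processRecords consumer sees; the helper name is illustrative:

```python
from datetime import datetime, timezone

def oldest_record_age_seconds(records, now=None):
    """Age of the oldest record in a batch, from ApproximateArrivalTimestamp."""
    now = now or datetime.now(timezone.utc)
    oldest = min(r["ApproximateArrivalTimestamp"] for r in records)
    return (now - oldest).total_seconds()
```

Because accuracy and ordering are not guaranteed, treat the result as approximate, e.g. as a lag signal for alarms rather than as an exact event time.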
New!
Topic #6: Dealing with provisioned throughput exceeded. Metrics and resharding (SplitShard/MergeShards)
• Keep track of your metrics
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes keep track of shard usage
• Retry if the rise in input rate is temporary
• Reshard to increase the number of shards
• SplitShard adds more shards
• MergeShards removes shards
• Use the Kinesis Scaling Utility (on GitHub)

Metric               Units
PutRecords.Bytes     Bytes
PutRecords.Latency   Milliseconds
PutRecords.Success   Count
PutRecords.Records   Count
IncomingBytes        Bytes
IncomingRecords      Count
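An even SplitShard places the new starting hash key at the midpoint of the parent shard's hash key range. A sketch of that arithmetic (the helper name is illustrative; the boto3 call it would feed is shown as a comment since it requires a live stream):

```python
def split_point(starting_hash_key: str, ending_hash_key: str) -> str:
    """Midpoint of a shard's hash key range, for an even two-way split.

    Kinesis hash keys are decimal strings covering a 128-bit range.
    """
    lo, hi = int(starting_hash_key), int(ending_hash_key)
    return str((lo + hi + 1) // 2)

# With boto3 this would feed SplitShard, for example:
# kinesis.split_shard(StreamName="my-stream",
#                     ShardToSplit="shardId-000000000000",
#                     NewStartingHashKey=split_point(lo_key, hi_key))
```

MergeShards is the inverse operation and requires the two shards to be adjacent in the hash key space.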
Sending & Reading Data from Kinesis Streams

Sending: AWS SDK, AWS Mobile SDK, Kinesis Producer Library, LOG4J, Flume, Fluentd
Consuming: Get* APIs, Kinesis Client Library + Connector Library, AWS Lambda, Apache Storm, Apache Spark, Amazon Elastic MapReduce