Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtering and transformation. Finally, we'll cover tips and techniques for emitting data into S3, DynamoDB, and Redshift.
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
1. November 13, 2014 | Las Vegas, NV
Adi Krishnan, Sr. Product Manager Amazon Kinesis
4. Scenarios Across Industry Segments
Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics/KPI Extraction, (3) Responsive Data Analysis
Data types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data
•Digital AdTech/Marketing: (1) Advertising data aggregation; (2) advertising metrics such as coverage, yield, conversion; (3) analytics on user engagement with ads, optimized bid/buy engines
•Software/Technology: (1) IT server and app log ingestion; (2) IT operational metrics dashboards; (3) device/sensor operational intelligence
•Financial Services: (1) Market/financial transaction order data collection; (2) financial market data metrics; (3) fraud monitoring, Value-at-Risk assessment, auditing of market order data
•Consumer Online/E-Commerce: (1) Online customer engagement data aggregation; (2) consumer engagement metrics such as page views, CTR; (3) customer clickstream analytics, recommendation engines
6. Amazon Kinesis: Managed service for streaming data ingestion and processing
Millions of sources producing 100s of terabytes per hour enter through a front end that handles authentication and authorization. Durable, highly consistent storage replicates data across three data centers (Availability Zones). The ordered stream of events supports multiple readers: real-time dashboards and alarms, machine learning algorithms or sliding-window analytics, aggregate analysis in Hadoop or a data warehouse, and aggregation/archival to S3. Inexpensive: $0.028 per million puts.
7. Real-time Ingest
•Highly Scalable
•Durable
•Elastic
•Replay-able Reads
Continuous Processing
•Elastic
•Load-balancing incoming streams
•Fault-tolerance, Checkpoint / Replay
•Enable multiple processing apps in parallel
Enable data movement into Stores/ Processing Engines
Managed Service
Low end-to-end latency
10. Putting Data into Kinesis
Simple Put interface to store data in Kinesis
11. Best Practices: Putting Data in Kinesis - Determine Your Partition Key Strategy
•Decide whether Kinesis is a managed buffer or a streaming map-reduce
•Ensure high cardinality for partition keys with respect to shards, to prevent a "hot shard" problem
–Generate random partition keys
•Streaming map-reduce: leverage partition keys for business-specific logic as applicable
–Partition key per billing customer, per device ID, per stock symbol
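The two key strategies above can be sketched side by side. A minimal sketch, assuming nothing beyond the talk's advice; the class and method names are illustrative:

```java
import java.util.UUID;

// Sketch: two partition-key strategies. Kinesis hashes the key (MD5)
// to pick a shard, so key cardinality that is high relative to the
// shard count avoids a "hot shard".
class PartitionKeys {
    // Managed-buffer style: a UUID gives effectively unlimited key
    // cardinality, spreading records evenly across shards.
    public static String randomKey() {
        return UUID.randomUUID().toString();
    }

    // Streaming map-reduce style: derive the key from a business
    // attribute so all records for one entity land on the same shard.
    public static String keyForCustomer(String billingCustomerId) {
        return billingCustomerId;
    }
}
```

With the random strategy you trade per-entity ordering for even load; the business-key strategy keeps ordering per entity but requires the key distribution itself to be reasonably uniform.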
12. Best Practices: Putting Data in Kinesis - Provisioning Adequate Shards
•For ingress needs
•For egress needs of all consuming applications, especially if there are more than two simultaneous consumers
•Include head-room for catching up with data in stream in the event of application failures
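The three sizing inputs above can be combined into simple arithmetic. A sketch, assuming the published per-shard throughput limits of the time (1 MB/s ingress, 2 MB/s egress per shard); the class name and headroom factor are illustrative:

```java
// Sketch: size a stream from ingress, egress, and headroom,
// assuming per-shard limits of 1 MB/s in and 2 MB/s out.
class ShardSizing {
    static final double SHARD_IN_MBPS = 1.0;
    static final double SHARD_OUT_MBPS = 2.0;

    // headroomFactor > 1.0 leaves room to catch up with the stream
    // after a consuming-application failure.
    public static int shardsNeeded(double ingressMBps, double egressMBps,
                                   double headroomFactor) {
        double forIngress = ingressMBps / SHARD_IN_MBPS;
        double forEgress = egressMBps / SHARD_OUT_MBPS;
        return (int) Math.ceil(Math.max(forIngress, forEgress) * headroomFactor);
    }
}
```

For example, 10 MB/s in with three consumers reading 10 MB/s each (30 MB/s out) and 25% headroom needs 19 shards: egress dominates (30 / 2 = 15 shards), scaled by 1.25.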
14. # KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
# DO NOT use a trailing %n unless you want a newline to be transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream
#optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
#optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
#optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
#optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
#optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender

Best Practices: Putting Data in Kinesis - Pre-Batch before Puts for better efficiency
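Pre-batching means packing several small messages into a single put payload, so each HTTP call carries more data. A minimal sketch, with newline as an assumed delimiter (the consumer must split on it, which is why the log4j config warns against a trailing %n):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch: pack several small messages into one payload before a put,
// and unpack them on the consumer side. Newline is an assumed
// delimiter, so messages must not themselves contain newlines.
class PreBatcher {
    public static byte[] pack(List<String> messages) {
        String joined = String.join("\n", messages);
        return joined.getBytes(StandardCharsets.UTF_8);
    }

    public static String[] unpack(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8).split("\n");
    }
}
```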
15. •Retry if the rise in input rate is temporary
•Reshard to increase the number of shards
•Monitor CloudWatch metrics: the PutRecord.Bytes and GetRecords.Bytes metrics keep track of shard usage
Metric             Units
PutRecord.Bytes    Bytes
PutRecord.Latency  Milliseconds
PutRecord.Success  Count
•Keep track of your metrics
•Log the hash key values generated by your partition keys
•Log shard IDs
•Determine which shards receive the most (hash key) traffic
String shardId = putRecordResult.getShardId();
putRecordRequest.setPartitionKey(String.format("myPartitionKey"));
16. Options:
•stream-name - The name of the stream to be scaled
•scaling-action - The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"
•count - Number of shards by which to absolutely scale up or down, or to resize to; or:
•pct - Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
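The options above map to simple shard-count arithmetic. A sketch of that mapping; the rounding behavior is an assumption, not taken from the amazon-kinesis-scaling-utils source:

```java
// Sketch: how scaleUp / scaleDown / resize translate a count or a
// percentage into a new shard total. Rounding up the percentage and
// the floor of 1 shard are assumptions for illustration.
class ScalingMath {
    public static int newShardCount(int current, String action,
                                    Integer count, Integer pct) {
        // pct, when given, is a percentage of the current shard count.
        int delta = (pct != null) ? (int) Math.ceil(current * (pct / 100.0))
                                  : count;
        switch (action) {
            case "scaleUp":   return current + delta;
            case "scaleDown": return Math.max(1, current - delta);
            case "resize":    return count;  // resize takes an absolute count
            default: throw new IllegalArgumentException("unknown action: " + action);
        }
    }
}
```

So scaling a 10-shard stream up by 25% yields 13 shards, and resizing it with count=16 yields 16.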
17. Sending & Reading Data from Kinesis Streams

Sending: HTTP POST, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK

Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
19. Building Kinesis Applications: Kinesis Client Library - Open source library for fault-tolerant, continuous processing apps
•Java client library, also available for Python Developers
•Source available on Github
•Build app with Kinesis Client Library
•Deploy on your set of EC2 instances
•Every KCL application includes these components:
•Record processor factory: Creates the record processor
•Record processor: The processing unit that processes data from a shard of a Kinesis stream
•Worker: The processing unit that maps to each application instance
20. •The KCL uses the IRecordProcessor interface to communicate with your application
•A Kinesis application must implement the KCL's IRecordProcessor interface
•Contains the business logic for processing the data retrieved from the Kinesis stream
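The lifecycle described above can be sketched without the AWS dependency. The class below is a simplified stand-in for an IRecordProcessor implementation (initialize / processRecords / shutdown), not the real KCL interface; checkpointing is modeled as remembering the last processed position rather than calling the KCL's checkpointer:

```java
import java.util.List;

// Simplified stand-in for a KCL record processor. The real
// IRecordProcessor receives Kinesis Record objects and a
// checkpointer; here records are plain strings and the checkpoint
// is an in-memory counter.
class CountingProcessor {
    private String shardId;
    private long processed = 0;
    private long lastCheckpoint = -1;

    // Called once when the processor is bound to a shard.
    public void initialize(String shardId) {
        this.shardId = shardId;
    }

    // Called repeatedly with batches of records from that shard.
    public void processRecords(List<String> records) {
        for (String r : records) {
            processed++;  // business logic would go here
        }
        // Checkpoint after each batch; the KCL would persist this
        // position durably so a replacement worker can resume.
        lastCheckpoint = processed;
    }

    public long checkpointed() { return lastCheckpoint; }
}
```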
21. •One record processor maps to one shard and processes data records from that shard
•One worker maps to one or more record processors
•Balances shard-worker associations when worker / instance counts change
•Balances shard-worker associations when shards split or merge
23. Amazon Kinesis Connector Library - Customizable, Open Source Apps to Connect Kinesis with S3, Redshift, DynamoDB
ITransformer
•Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model
IFilter
•Excludes irrelevant records from the processing.
IBuffer
•Buffers the set of records to be processed by specifying a size limit (number of records) and total byte count
IEmitter
•Makes client calls to other AWS services and persists the records stored in the buffer.
(Diagram: Kinesis connectors emit to S3, DynamoDB, and Redshift)
28. Amazon Kinesis Connectors (Kinesis → S3, DynamoDB, Redshift)
• S3 Connector
– Batch writes files for archive into S3
– Uses sequence-based file naming scheme
• Redshift Connector
– Once written to S3, loads to Redshift
– Provides manifest support
– Supports user defined transformers
• DynamoDB Connector
– BatchPut appends to a table
– Supports user defined transformers
29. Best Practices: Processing Data From Kinesis - Build applications as part of an Auto Scaling group
•Helps with application availability
•Scales in response to incoming spikes in data volume, assuming shards have been provisioned
•Select scaling metrics based on the nature of the Kinesis application
–Instance metrics: CPU, memory, and others
–Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes
31. Best Practices: Processing Data From Kinesis - Build a flush-to-S3 consumer app
•App can specify three conditions that can trigger a buffer flush:
–Number of records
–Total byte count
–Time since last flush
•The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed.
# Flush when buffer exceeds 8 Kinesis records, 1 KB size limit or when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
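The three triggers compose with OR: crossing any one of them flushes the buffer. A minimal sketch mirroring the configured thresholds (the class and method names are illustrative):

```java
// Sketch: a buffer flushes when ANY configured limit is crossed,
// mirroring bufferSizeByteLimit / bufferRecordCountLimit /
// bufferMillisecondsLimit above.
class FlushPolicy {
    static final int  BYTE_LIMIT   = 1024;     // 1 KB
    static final int  RECORD_LIMIT = 8;        // 8 Kinesis records
    static final long MILLIS_LIMIT = 600_000;  // 10 minutes

    public static boolean shouldFlush(int records, int bytes,
                                      long millisSinceLastFlush) {
        return records >= RECORD_LIMIT
            || bytes >= BYTE_LIMIT
            || millisSinceLastFlush >= MILLIS_LIMIT;
    }
}
```

The time limit matters on low-volume streams: without it, a trickle of small records could sit in the buffer indefinitely.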
32. Best Practices: Processing Data From Kinesis
•In a KCL app, ensure data being processed is persisted to a durable store such as DynamoDB or S3 prior to checkpointing.
•Duplicates: make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy: keep retrying until you succeed.
•Idempotent processing: use the number of records since the previous checkpoint to get repeatable results when record processors fail over.
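Making the authoritative store resilient to duplicates can be as simple as keying writes by each record's unique sequence number, so a retry after failover is a no-op. A sketch, with an in-memory map standing in for the real durable store (e.g., a DynamoDB table keyed by sequence number):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: an emitter that tolerates retries. Every Kinesis record
// carries a unique sequence number; writing keyed by it makes
// re-delivery of the same record after a record-processor failover
// harmless. A HashMap stands in for the durable store.
class IdempotentEmitter {
    private final Map<String, String> store = new HashMap<>();

    // Returns true only on the first write; retrying the same
    // sequence number changes nothing.
    public boolean emit(String sequenceNumber, String payload) {
        return store.putIfAbsent(sequenceNumber, payload) == null;
    }

    public int size() { return store.size(); }
}
```

With the sink idempotent, upstream components can follow the simple policy above: keep retrying until the emit succeeds.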
33. Best Practices: Processing Data From Kinesis
•Creates a manifest file based on a custom set of input files
•Use a manifest stream with only one shard
•Adjust checkpoint frequency, connector buffer, and filter to align with your Redshift load models