Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtering and transformation. Finally, we'll cover tips and techniques for emitting data into S3, DynamoDB, and Redshift.
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
1. November 13, 2014 | Las Vegas, NV
Adi Krishnan, Sr. Product Manager Amazon Kinesis
4. Scenarios Across Industry Segments
Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics/KPI Extraction, (3) Responsive Data Analysis
Data types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data
•Digital AdTech/Marketing: (1) Advertising data aggregation; (2) advertising metrics such as coverage, yield, conversion; (3) analytics on user engagement with ads, optimized bid/buy engines
•Software/Technology: (1) IT server and app log ingestion; (2) IT operational metrics dashboards; (3) device/sensor operational intelligence
•Financial Services: (1) Market/financial transaction order data collection; (2) financial market data metrics; (3) fraud monitoring, Value-at-Risk assessment, auditing of market order data
•Consumer Online/E-Commerce: (1) Online customer engagement data aggregation; (2) consumer engagement metrics such as page views, CTR; (3) customer clickstream analytics, recommendation engines
6. Amazon Kinesis: Managed service for streaming data ingestion and processing
Millions of sources producing 100s of terabytes per hour enter through a front end that handles authentication and authorization. Durable, highly consistent storage replicates data across three data centers (Availability Zones). The ordered stream of events supports multiple readers: real-time dashboards and alarms, machine learning algorithms or sliding-window analytics, aggregate analysis in Hadoop or a data warehouse, and aggregation/archival to S3. Inexpensive: $0.028 per million puts.
7. Real-time Ingest
•Highly Scalable
•Durable
•Elastic
•Replay-able Reads
Continuous Processing
•Elastic
•Load-balancing incoming streams
•Fault-tolerance, Checkpoint / Replay
•Enable multiple processing apps in parallel
Enable data movement into Stores/ Processing Engines
Managed Service
Low end-to-end latency
10. Putting Data into Kinesis
Simple Put interface to store data in Kinesis
11. Best Practices: Putting Data in Kinesis - Determine Your Partition Key Strategy
•Decide whether Kinesis is a managed buffer or a streaming map-reduce
•Ensure high cardinality for partition keys with respect to shards, to prevent a "hot shard" problem
–Generate random partition keys
•Streaming map-reduce: leverage partition keys for business-specific logic as applicable
–Partition key per billing customer, per device ID, per stock symbol
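The two key strategies above can be sketched side by side. A minimal sketch, assuming nothing beyond the talk's advice; the class and method names are illustrative:

```java
import java.util.UUID;

// Sketch: two partition-key strategies. Kinesis hashes the key (MD5)
// to pick a shard, so key cardinality that is high relative to the
// shard count avoids a "hot shard".
class PartitionKeys {
    // Managed-buffer style: a UUID gives effectively unlimited key
    // cardinality, spreading records evenly across shards.
    public static String randomKey() {
        return UUID.randomUUID().toString();
    }

    // Streaming map-reduce style: derive the key from a business
    // attribute so all records for one entity land on the same shard.
    public static String keyForCustomer(String billingCustomerId) {
        return billingCustomerId;
    }
}
```

With the random strategy you trade per-entity ordering for even load; the business-key strategy keeps ordering per entity but requires the key distribution itself to be reasonably uniform.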
12. Best Practices: Putting Data in Kinesis - Provisioning Adequate Shards
•For ingress needs
•For egress needs of all consuming applications, especially if there are more than two simultaneous consumers
•Include head-room for catching up with data in stream in the event of application failures
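The three sizing inputs above can be combined into simple arithmetic. A sketch, assuming the published per-shard throughput limits of the time (1 MB/s ingress, 2 MB/s egress per shard); the class name and headroom factor are illustrative:

```java
// Sketch: size a stream from ingress, egress, and headroom,
// assuming per-shard limits of 1 MB/s in and 2 MB/s out.
class ShardSizing {
    static final double SHARD_IN_MBPS = 1.0;
    static final double SHARD_OUT_MBPS = 2.0;

    // headroomFactor > 1.0 leaves room to catch up with the stream
    // after a consuming-application failure.
    public static int shardsNeeded(double ingressMBps, double egressMBps,
                                   double headroomFactor) {
        double forIngress = ingressMBps / SHARD_IN_MBPS;
        double forEgress = egressMBps / SHARD_OUT_MBPS;
        return (int) Math.ceil(Math.max(forIngress, forEgress) * headroomFactor);
    }
}
```

For example, 10 MB/s in with three consumers reading 10 MB/s each (30 MB/s out) and 25% headroom needs 19 shards: egress dominates (30 / 2 = 15 shards), scaled by 1.25.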
14. # KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
# DO NOT use a trailing %n unless you want a newline to be transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream
#optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
#optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
#optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
#optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
#optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender

Best Practices: Putting Data in Kinesis - Pre-Batch before Puts for better efficiency
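Pre-batching means packing several small messages into a single put payload, so each HTTP call carries more data. A minimal sketch, with newline as an assumed delimiter (the consumer must split on it, which is why the log4j config warns against a trailing %n):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch: pack several small messages into one payload before a put,
// and unpack them on the consumer side. Newline is an assumed
// delimiter, so messages must not themselves contain newlines.
class PreBatcher {
    public static byte[] pack(List<String> messages) {
        String joined = String.join("\n", messages);
        return joined.getBytes(StandardCharsets.UTF_8);
    }

    public static String[] unpack(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8).split("\n");
    }
}
```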
15. •Retry if the rise in input rate is temporary
•Reshard to increase the number of shards
•Monitor CloudWatch metrics: the PutRecord.Bytes and GetRecords.Bytes metrics keep track of shard usage
Metric             Units
PutRecord.Bytes    Bytes
PutRecord.Latency  Milliseconds
PutRecord.Success  Count
•Keep track of your metrics
•Log the hash key values generated by your partition keys
•Log shard IDs
•Determine which shards receive the most (hash key) traffic
String shardId = putRecordResult.getShardId();
putRecordRequest.setPartitionKey(String.format("myPartitionKey"));
16. Options:
•stream-name - The name of the stream to be scaled
•scaling-action - The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"
•count - Number of shards by which to absolutely scale up or down, or to resize to; or:
•pct - Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
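The options above map to simple shard-count arithmetic. A sketch of that mapping; the rounding behavior is an assumption, not taken from the amazon-kinesis-scaling-utils source:

```java
// Sketch: how scaleUp / scaleDown / resize translate a count or a
// percentage into a new shard total. Rounding up the percentage and
// the floor of 1 shard are assumptions for illustration.
class ScalingMath {
    public static int newShardCount(int current, String action,
                                    Integer count, Integer pct) {
        // pct, when given, is a percentage of the current shard count.
        int delta = (pct != null) ? (int) Math.ceil(current * (pct / 100.0))
                                  : count;
        switch (action) {
            case "scaleUp":   return current + delta;
            case "scaleDown": return Math.max(1, current - delta);
            case "resize":    return count;  // resize takes an absolute count
            default: throw new IllegalArgumentException("unknown action: " + action);
        }
    }
}
```

So scaling a 10-shard stream up by 25% yields 13 shards, and resizing it with count=16 yields 16.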
17. Sending & Reading Data from Kinesis Streams

Sending: HTTP POST, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK

Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
19. Building Kinesis Applications: Kinesis Client Library - Open source library for fault-tolerant, continuous processing apps
•Java client library, also available for Python Developers
•Source available on Github
•Build app with Kinesis Client Library
•Deploy on your set of EC2 instances
•Every KCL application includes these components:
•Record processor factory: Creates the record processor
•Record processor: The processing unit that processes data from a shard of a Kinesis stream
•Worker: The processing unit that maps to each application instance
20. •The KCL uses the IRecordProcessor interface to communicate with your application
•A Kinesis application must implement the KCL's IRecordProcessor interface
•Contains the business logic for processing the data retrieved from the Kinesis stream
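The lifecycle described above can be sketched without the AWS dependency. The class below is a simplified stand-in for an IRecordProcessor implementation (initialize / processRecords / shutdown), not the real KCL interface; checkpointing is modeled as remembering the last processed position rather than calling the KCL's checkpointer:

```java
import java.util.List;

// Simplified stand-in for a KCL record processor. The real
// IRecordProcessor receives Kinesis Record objects and a
// checkpointer; here records are plain strings and the checkpoint
// is an in-memory counter.
class CountingProcessor {
    private String shardId;
    private long processed = 0;
    private long lastCheckpoint = -1;

    // Called once when the processor is bound to a shard.
    public void initialize(String shardId) {
        this.shardId = shardId;
    }

    // Called repeatedly with batches of records from that shard.
    public void processRecords(List<String> records) {
        for (String r : records) {
            processed++;  // business logic would go here
        }
        // Checkpoint after each batch; the KCL would persist this
        // position durably so a replacement worker can resume.
        lastCheckpoint = processed;
    }

    public long checkpointed() { return lastCheckpoint; }
}
```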
21. •One record processor maps to one shard and processes data records from that shard
•One worker maps to one or more record processors
•Balances shard-worker associations when worker / instance counts change
•Balances shard-worker associations when shards split or merge
23. Amazon Kinesis Connector Library - Customizable, Open Source Apps to Connect Kinesis with S3, Redshift, DynamoDB
ITransformer
•Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model
IFilter
•Excludes irrelevant records from the processing.
IBuffer
•Buffers the set of records to be processed by specifying a size limit (number of records) and total byte count
IEmitter
•Makes client calls to other AWS services and persists the records stored in the buffer.
(Diagram: Kinesis connectors emit to S3, DynamoDB, and Redshift)
28. Amazon Kinesis Connectors (Kinesis → S3, DynamoDB, Redshift)
• S3 Connector
– Batch writes files for archive into S3
– Uses sequence-based file naming scheme
• Redshift Connector
– Once written to S3, loads to Redshift
– Provides manifest support
– Supports user defined transformers
• DynamoDB Connector
– BatchPut appends to a table
– Supports user defined transformers
29. Best Practices: Processing Data From Kinesis - Build applications as part of an Auto Scaling group
•Helps with application availability
•Scales in response to incoming spikes in data volume, assuming shards have been provisioned
•Select scaling metrics based on the nature of the Kinesis application
–Instance metrics: CPU, memory, and others
–Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes
31. Best Practices: Processing Data From Kinesis - Build a flush-to-S3 consumer app
•App can specify three conditions that can trigger a buffer flush:
–Number of records
–Total byte count
–Time since last flush
•The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed.
# Flush when buffer exceeds 8 Kinesis records, 1 KB size limit or when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
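The three triggers compose with OR: crossing any one of them flushes the buffer. A minimal sketch mirroring the configured thresholds (the class and method names are illustrative):

```java
// Sketch: a buffer flushes when ANY configured limit is crossed,
// mirroring bufferSizeByteLimit / bufferRecordCountLimit /
// bufferMillisecondsLimit above.
class FlushPolicy {
    static final int  BYTE_LIMIT   = 1024;     // 1 KB
    static final int  RECORD_LIMIT = 8;        // 8 Kinesis records
    static final long MILLIS_LIMIT = 600_000;  // 10 minutes

    public static boolean shouldFlush(int records, int bytes,
                                      long millisSinceLastFlush) {
        return records >= RECORD_LIMIT
            || bytes >= BYTE_LIMIT
            || millisSinceLastFlush >= MILLIS_LIMIT;
    }
}
```

The time limit matters on low-volume streams: without it, a trickle of small records could sit in the buffer indefinitely.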
32. Best Practices: Processing Data From Kinesis
•In a KCL app, ensure data being processed is persisted to a durable store such as DynamoDB or S3 prior to checkpointing.
•Duplicates: make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy: keep retrying until you succeed.
•Idempotent processing: use the number of records since the previous checkpoint to get repeatable results when record processors fail over.
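Making the authoritative store resilient to duplicates can be as simple as keying writes by each record's unique sequence number, so a retry after failover is a no-op. A sketch, with an in-memory map standing in for the real durable store (e.g., a DynamoDB table keyed by sequence number):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: an emitter that tolerates retries. Every Kinesis record
// carries a unique sequence number; writing keyed by it makes
// re-delivery of the same record after a record-processor failover
// harmless. A HashMap stands in for the durable store.
class IdempotentEmitter {
    private final Map<String, String> store = new HashMap<>();

    // Returns true only on the first write; retrying the same
    // sequence number changes nothing.
    public boolean emit(String sequenceNumber, String payload) {
        return store.putIfAbsent(sequenceNumber, payload) == null;
    }

    public int size() { return store.size(); }
}
```

With the sink idempotent, upstream components can follow the simple policy above: keep retrying until the emit succeeds.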
33. Best Practices: Processing Data From Kinesis
•Creates a manifest file based on a custom set of input files
•Use a manifest stream with only one shard
•Adjust checkpoint frequency, connector buffer, and filter to align with your Redshift load models