This session is recommended for anyone interested in building real-time streaming applications on AWS. In this session, you will gain a deep understanding of how data can be ingested by Amazon Kinesis and made available for real-time analysis and processing. We’ll also show how you can leverage the Amazon Kinesis Client Library to make your applications highly available and fault tolerant. We’ll explore various design considerations in implementing real-time solutions and explain key concepts against the backdrop of an actual use case. Finally, we’ll situate stream processing in the broader context of your big data applications.
3. The gap between generated data and data available for analysis
[Chart: data volume generated vs. data volume available for analysis, diverging over time]
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
7. Big data: best served fresh
Big data
• Hourly server logs: were your systems misbehaving one hour ago?
• Weekly/monthly bill: what did you spend this billing cycle?
• Daily customer-preferences report from your website’s clickstream: what deal or ad to try next time?
• Daily fraud reports: was there fraud yesterday?
Real-time big data
• Amazon CloudWatch metrics: what went wrong now?
• Real-time spending alerts/caps: prevent overspending now.
• Real-time analysis: what to offer the current customer now?
• Real-time detection: block fraudulent use now.
8. Talk outline
• A tour of Amazon Kinesis concepts in the context of a Twitter Trends service
• Implementing the Twitter Trends service
• Amazon Kinesis in the broader context of a big data ecosystem
• Q&A
9. Sending and reading data from Amazon Kinesis streams
Sending: HTTP POST, AWS SDK, Log4j, Flume, Fluentd
Reading: Get* APIs, Amazon Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
10. A tour of Amazon Kinesis concepts in the context of a Twitter Trends service
17. Core concepts recapped
• Data record - a tweet
• Stream - all tweets (the Twitter firehose)
• Partition key - Twitter topic (every tweet record belongs to exactly one of these)
• Shard - all the data records belonging to a set of Twitter topics that get grouped together
• Sequence number - assigned to each data record when it is first ingested
• Worker - processes the records of a shard in sequence-number order
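To make the partition key and shard concepts concrete, here is a small sketch of how Kinesis assigns a record to a shard: the partition key (here, a Twitter topic) is MD5-hashed into a 128-bit hash space, and each shard owns a contiguous range of that space. The class name is ours, and the evenly split shard ranges are an assumption; real shards can own uneven ranges after resharding.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: map a partition key (a Twitter topic) to a shard index by
// MD5-hashing it into the 128-bit hash-key space, assuming shards split
// that space evenly.
public class ShardMapper {
    private static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128);

    public static int shardFor(String partitionKey, int numShards) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            BigInteger hash = new BigInteger(1, digest); // unsigned 128-bit value
            // Width of each shard's hash-key range when split evenly.
            BigInteger width = HASH_SPACE.divide(BigInteger.valueOf(numShards));
            int shard = hash.divide(width).intValue();
            return Math.min(shard, numShards - 1); // guard the top edge of the range
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }
}
```

Because the hash is deterministic, every tweet tagged with the same topic lands on the same shard, which is what lets one worker keep that topic's counts locally.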
18. twitter-trends.com
Using the Amazon Kinesis API directly

Per-shard worker:
iterator = getShardIterator(shardId, LATEST);
while (true) {
  [records, iterator] =
    getNextRecords(iterator, maxRecsToReturn);
  process(records);
}

process(records): {
  for (record in records) {
    updateLocalTop10(record);
  }
  if (timeToDoOutput()) {
    writeLocalTop10ToDDB();
  }
}

Global aggregator:
while (true) {
  localTop10Lists = scanDDBTable();
  updateGlobalTop10List(localTop10Lists);
  sleep(10);
}
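The aggregator's updateGlobalTop10List step can be sketched in Java as follows. Each worker writes its shard-local hashtag counts to DynamoDB; a single aggregator then merges the partial counts and keeps the N largest. The class and method names here are ours for illustration, not part of any Kinesis API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: merge shard-local hashtag counts into a single global top-N list.
public class GlobalTopN {
    public static List<String> merge(List<Map<String, Integer>> localCounts, int n) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> local : localCounts) {
            // Partial counts for the same hashtag may arrive from several shards.
            local.forEach((tag, count) -> global.merge(tag, count, Integer::sum));
        }
        List<String> top = new ArrayList<>(global.keySet());
        top.sort((a, b) -> global.get(b) - global.get(a)); // descending by count
        return top.subList(0, Math.min(n, top.size()));
    }
}
```

Summing before ranking matters: a hashtag that is second place on every shard can still be first globally.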
25. Checkpoint, replay design pattern
[Diagram: shard-i holds records with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. A lock table (shard ID, lock, sequence number) lets the worker on Host A claim shard-i; when Host A fails, Host B takes over the lock and replays from the last checkpointed sequence number. A second table maps each shard ID to its local top-10, e.g. {#Obama: 10235, #Seahawks: 9835, …}.]
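The replay behavior in this pattern can be simulated without any Kinesis calls. In this toy sketch (names and structure are ours), a worker processes sequence numbers in order and periodically records a checkpoint; after a crash, its replacement resumes just past the last checkpoint. Anything processed since that checkpoint is replayed, which is why Kinesis applications get at-least-once rather than exactly-once delivery.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the checkpoint/replay pattern: process in sequence
// order, checkpoint periodically, and on restart skip everything at or
// before the last checkpoint.
public class CheckpointReplay {
    private long checkpoint = -1;          // last durably recorded sequence number
    public final List<Long> processed = new ArrayList<>();

    public void run(long[] stream, int checkpointEvery, int crashAfter) {
        int sinceCheckpoint = 0, handled = 0;
        for (long seq : stream) {
            if (seq <= checkpoint) continue; // resume: skip checkpointed records
            processed.add(seq);              // records after the checkpoint may repeat
            if (++sinceCheckpoint == checkpointEvery) {
                checkpoint = seq;            // durable checkpoint (DynamoDB in the KCL)
                sinceCheckpoint = 0;
            }
            if (++handled == crashAfter) return; // simulated host failure
        }
    }
}
```

Running it over the sequence numbers from the diagram shows record 8 being processed twice: once before the crash and once on replay, because the last checkpoint was at 5.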
26. Catching up from delayed processing
[Diagram: Host B takes over shard-i from failed Host A and resumes at sequence number 2 while new records keep arriving.]
How long until we’ve caught up?
Shard throughput SLA:
- 1 MBps in
- 2 MBps out
=> Catch-up time ~ down time
Unless your application can’t process 2 MBps on a single host
=> Provision more shards
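The "catch-up time ~ down time" claim follows from simple arithmetic: while a worker is down, backlog grows at the ingest rate, and once it restarts the backlog drains at the read rate minus the ongoing ingest rate. The sketch below makes that explicit; the class and method names are ours, for illustration only.

```java
// Back-of-the-envelope catch-up math for the per-shard limits above
// (1 MB/s in, 2 MB/s out): backlog accumulates at the ingest rate and
// drains at (read rate - ingest rate).
public class CatchUpMath {
    public static double catchUpSeconds(double downSeconds, double inMBps, double outMBps) {
        double backlogMB = downSeconds * inMBps; // data accumulated while down
        double drainMBps = outMBps - inMBps;     // net rate at which backlog shrinks
        if (drainMBps <= 0) {
            throw new IllegalArgumentException("cannot catch up: read rate <= ingest rate");
        }
        return backlogMB / drainMBps;
    }
}
```

With a full 1 MBps of ingest and 2 MBps of read throughput, ten minutes of downtime takes ten minutes to drain; if the shard ingests less than its limit, catch-up is proportionally faster.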
28. Why not combine applications?
[Diagram: one Amazon Kinesis stream feeding two consumers: twitter-trends.com, which outputs state and checkpoints every 10 seconds, and twitter-stats.com, which outputs state and checkpoints every hour.]
31. How many shards?
Additional considerations:
- How much processing is needed per data record/data byte?
- How quickly do you need to catch up if one or more of your workers stalls or fails?
You can always dynamically reshard to adjust to changing workloads.
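A reasonable starting point before those considerations is to size shards from the throughput limits quoted earlier (1 MBps in, 2 MBps out per shard). The sketch below does exactly that; processing cost and desired catch-up speed may push the number higher, and resharding can adjust it later. The names here are ours, not an AWS API.

```java
// Starting-point shard count from per-shard throughput limits:
// each shard accepts 1 MB/s of writes and serves 2 MB/s of reads,
// so take whichever constraint demands more shards.
public class ShardSizing {
    public static int initialShards(double ingestMBps, double egressMBps) {
        int forWrites = (int) Math.ceil(ingestMBps / 1.0); // 1 MB/s in per shard
        int forReads = (int) Math.ceil(egressMBps / 2.0);  // 2 MB/s out per shard
        return Math.max(1, Math.max(forWrites, forReads));
    }
}
```

For example, a stream ingesting 10 MBps but fanned out to consumers reading 30 MBps in total is read-bound and needs 15 shards, not 10.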
32. Twitter Trends shard processing code
Class TwitterTrendsShardProcessor implements IRecordProcessor {
public TwitterTrendsShardProcessor() { … }
@Override
public void initialize(String shardId) { … }
@Override
public void processRecords(List<Record> records,
IRecordProcessorCheckpointer checkpointer) { … }
@Override
public void shutdown(IRecordProcessorCheckpointer checkpointer,
ShutdownReason reason) { … }
}
33. Twitter Trends shard processing code
Class TwitterTrendsShardProcessor implements IRecordProcessor {
private Map<String, AtomicInteger> hashCount = new HashMap<>();
private long tweetsProcessed = 0;
@Override
public void processRecords(List<Record> records,
IRecordProcessorCheckpointer checkpointer) {
computeLocalTop10(records);
if ((tweetsProcessed++) >= 2000) {
emitToDynamoDB(checkpointer);
}
}
}
34. Twitter Trends shard processing code
private void computeLocalTop10(List<Record> records) {
  for (Record r : records) {
    String tweet = new String(r.getData().array());
    String[] words = tweet.split("\\s+");
    for (String word : words) {
      if (word.startsWith("#")) {
        if (!hashCount.containsKey(word)) hashCount.put(word, new AtomicInteger(0));
        hashCount.get(word).incrementAndGet();
      }
    }
  }
}

private void emitToDynamoDB(IRecordProcessorCheckpointer checkpointer) {
  persistMapToDynamoDB(hashCount);
  try {
    checkpointer.checkpoint();
  } catch (IOException | KinesisClientLibDependencyException | InvalidStateException
      | ThrottlingException | ShutdownException e) {
    // Error handling
  }
  hashCount = new HashMap<>();
  tweetsProcessed = 0;
}
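One reasonable body for the "Error handling" comment above: the KCL's checkpoint() can throw ThrottlingException when the DynamoDB checkpoint table is briefly overloaded, and a common response is to retry with a backoff before giving up. This standalone sketch models the checkpoint call as a supplied action; the interface, the retry policy, and the linear backoff are our assumptions, not part of the KCL.

```java
// Sketch: retry a checkpoint-like action a bounded number of times with a
// linear backoff between attempts, rethrowing the last failure if all
// attempts are exhausted.
public class CheckpointRetry {
    public interface Action { void run() throws Exception; }

    /** Returns the number of attempts made; rethrows if every retry fails. */
    public static int withRetries(Action checkpoint, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                checkpoint.run();
                return attempt;
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;   // give up; caller decides what's next
                Thread.sleep(backoffMillis * attempt); // linear backoff between attempts
            }
        }
    }
}
```

Failing to checkpoint is not fatal in itself: the worst case of skipping a checkpoint is a larger replay window on the next failover, so a bounded retry is usually preferable to crashing the worker.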
42. Bits & bytes (lots of them)
• 151 billion bid opportunities analyzed per day
• 120 terabytes of data analyzed per day
• 9 PoPs + cloud, and thousands of servers across the globe
• 40 ms average response time
• Transactional/financial data (so every record counts)
43. The problem: batch log shipping
[Diagram: impression, event/click, and segment log files, along with bidder data (wins), site events, and third-party segments, are batch-shipped into a warehouse used for analytics, decisioning, optimization, and archive.]
46. The solution: Streaming architecture
• Real-time data access
• Record-by-record routing and processing
• Multiple consumers (push and pull)
• Terabyte scale and beyond
48. MediaMath firehose
Multiple Amazon Kinesis streams
• Hundreds of shards
• Terabytes of daily data
• Topic-level streams
Seamless integration with existing producers and consumers
Consumers built on the Amazon Kinesis Client Library (KCL)
• KCL-based consumers on (mostly) c3.4xlarge instances
• Connectors to Amazon S3, Amazon EMR, Amazon Kinesis, and more