AWS Lambda allows any Node.js app to be run at scale in a massively parallel environment with no up-front costs or planning. This session shows how to use Lambda to build dynamic analytic data flows that can be tuned as they execute, based on initial results, to provide real-time output streamed to web clients. This process enables a cost-effective and responsive user experience for ad hoc big data jobs and lets developers focus on how data is consumed and presented, instead of how it is obtained.
2. What to Expect from the Session
This is a deep dive into general computing uses for
AWS Lambda.
• You will understand what makes Lambda a big deal for
big data.
• You will not learn about asynchronously triggered
workloads (see related sessions for that).
• You will see interactive, data-driven user experiences
that work with minimal ops overhead and at any scale.
3. Problem: Big data, little time
At FireEye, one of the ways we protect customers is by
analyzing mountains of event data to find “evil.”
Some of it we have online in indexes, some of it we have in
cold storage on Amazon S3.
We needed to be able to take advantage of the rich history
in our archived data without hurting our user experience.
5. What analysis are we doing?
Amazon EMR
Scheduled jobs that process all
data for anomaly detection:
• K-means
• Linear regression
• Geographic time-lining
AWS Lambda
Free-form searching to drive ad hoc:
• Reports
• Visualizations
• Analytical statistics (clustering, correlation, linear regression, etc.)
6. Visualize search results analytically
User-defined analytics
based on ad hoc features
of the search result set
draw attention to otherwise
uninteresting facets of the
data.
7. How big is our Big?
For an average customer:
Average security event size is about 3k bytes at 20k
events/sec ~= 60 MB/sec, which is about 5 TB/day.
One week = 35 TB, 12 billion events.
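The volume figures above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes "3k bytes" means 3,000 bytes and uses decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the constant names are illustrative.

```javascript
// Back-of-the-envelope volume estimate using the figures from the slide.
const EVENT_BYTES = 3000;       // ~3k bytes per event (assumed to mean 3,000)
const EVENTS_PER_SEC = 20000;   // 20k events/sec
const SECONDS_PER_DAY = 86400;

const bytesPerSec = EVENT_BYTES * EVENTS_PER_SEC;
const tbPerDay = (bytesPerSec * SECONDS_PER_DAY) / 1e12;
const eventsPerWeek = EVENTS_PER_SEC * SECONDS_PER_DAY * 7;

console.log((bytesPerSec / 1e6) + ' MB/sec');
console.log(tbPerDay.toFixed(1) + ' TB/day');
console.log((eventsPerWeek / 1e9).toFixed(1) + ' billion events/week');
```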
8. How long does this take?
A single process downloads, decompresses, greps, and
processes at about 35k events/sec (105 MB/sec).
To process a week of data:
Processes    Time
1            ~4 days
10           ~10 hours
100          ~1 hour
1,000        ~5 minutes
10,000       seconds
[Chart: time scale (0 to 400,000) plotted against number of processes (1, 10, 100, 1000)]
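As a rough check on these timings, the sketch below divides a week of events by the per-process throughput. It assumes perfectly linear scaling with no invocation or aggregation overhead, which real runs will not achieve; the function names are illustrative.

```javascript
const EVENTS_PER_WEEK = 12e9;              // ~12 billion events
const EVENTS_PER_SEC_PER_PROCESS = 35000;  // single-process throughput from the slide

// Ideal wall-clock seconds if the work splits evenly across `processes`.
function secondsToProcess(processes) {
  return EVENTS_PER_WEEK / (EVENTS_PER_SEC_PER_PROCESS * processes);
}

// Render a duration in the largest unit that fits.
function humanize(seconds) {
  if (seconds >= 86400) return (seconds / 86400).toFixed(1) + ' days';
  if (seconds >= 3600) return (seconds / 3600).toFixed(1) + ' hours';
  if (seconds >= 60) return (seconds / 60).toFixed(1) + ' minutes';
  return Math.round(seconds) + ' seconds';
}

[1, 10, 100, 1000, 10000].forEach(function (n) {
  console.log(n + ' processes: ' + humanize(secondsToProcess(n)));
});
```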
9. Lambda FTW
What if you could spin up 10k
processes in 100 ms?
Standard map-reduce pattern
without the startup time or hassle
of map-reduce frameworks.
Write your simple worker code,
and let cascading Lambda
functions handle the heavy lifting.
10. Lambda cascade
AWS Big Data blog: “Building Scalable and Responsive Big Data Interfaces with AWS Lambda”
11. Code components
Basic web app
Handles UI request,
invokes cascade
functions, streams
results.
Cascade function
Invokes workers,
aggregates and
returns results. Can
be made recursive.
Worker function
Performs atomic
work, returns
results to invoker.
12. Basic web app
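var listStream = new S3KeyListStream(searchParams); // emits the S3 keys to search
var lambdaStream = new LambdaStream(maxWorkers);    // invokes Lambda, up to maxWorkers at once
listStream
  // end: false keeps lambdaStream open after the key list ends;
  // the pipeline code calls end() itself once all invocations finish.
  .pipe(lambdaStream, { end: false })
  .pipe(serverSentStream)
  .pipe(httpResponse);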
13. Basic web app key points
• Batched async execution within an async pipeline is very
unintuitive.
• The trick is to pipe with end: false and call end() manually in the
pipeline code once all work is done.
• Pipeline will naturally queue up batches to stay under
configured Lambda provisioning limits.
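The end: false pattern and the natural queuing can be sketched with plain Node streams. This is not the LambdaStream from the talk: InvokeStream, fakeInvoke, and the peak counter are hypothetical names, and the Lambda call is replaced by a timer. The stream holds back its write callback while maxWorkers invocations are in flight, and ends itself only after the source has ended and every invocation has returned.

```javascript
const { Readable, Transform } = require('stream');

// Hypothetical stand-in for a Lambda invocation; a real app would call the AWS SDK.
function fakeInvoke(key, callback) {
  setTimeout(() => callback(null, 'result:' + key), 5);
}

class InvokeStream extends Transform {
  constructor(maxWorkers) {
    super({ objectMode: true });
    this.maxWorkers = maxWorkers;
    this.pending = 0;        // invocations currently in flight
    this.peak = 0;           // highest concurrency seen (for illustration only)
    this.sourceDone = false;
    this.release = null;     // write callback held back while at the limit
    // The source is piped with { end: false }, so watch for its end ourselves.
    this.on('pipe', (source) => {
      source.on('end', () => { this.sourceDone = true; this.endIfDone(); });
    });
  }

  _transform(key, encoding, callback) {
    this.pending++;
    this.peak = Math.max(this.peak, this.pending);
    fakeInvoke(key, (err, result) => {
      this.pending--;
      if (err) { this.destroy(err); return; }
      this.push(result);
      if (this.release) {    // a slot opened up: accept the next key
        const release = this.release;
        this.release = null;
        release();
      }
      this.endIfDone();
    });
    if (this.pending < this.maxWorkers) callback();
    else this.release = callback;  // holding the callback applies backpressure
  }

  // End only when the source is done, nothing is queued, and nothing is in flight.
  endIfDone() {
    if (this.sourceDone && this.pending === 0 && this.writableLength === 0) this.end();
  }
}

const keys = Readable.from(['k1', 'k2', 'k3', 'k4', 'k5']);
const invoker = new InvokeStream(2);
const collected = [];
keys.pipe(invoker, { end: false });
invoker.on('data', (result) => collected.push(result));
invoker.on('end', () => {
  console.log(collected.length + ' results, peak concurrency ' + invoker.peak);
});
```

Without end: false, the key list ending would end the stream while invocations are still outstanding, and late results would be pushed after end-of-stream.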
14. Lambda cascade function
// Chop our given list of keys up into batches
var batches = [];
var batch = [];
for (var i = 0, len = allKeys.length; i < len; i++) {
  batch.push(allKeys[i]);
  if (batch.length >= batchSize) {
    batches.push(batch);
    batch = [];
  }
}
// Keep the final, partially filled batch
if (batch.length > 0) {
  batches.push(batch);
}
15. Lambda cascade function (continued)
// Invoke each batch in parallel, returning the
// aggregated result when all are finished.
async.map(batches, invoke, function (err, results) {
  if (err) {
    context.fail('async.map error: ' + err.toString());
    return;
  }
  context.succeed(results);
});
16. Lambda cascade function key points
• The nature of the data and workload dictates the correct
batch sizes to give a cascade function. Avoid running out
of memory while aggregating results.
• 100:1 seems to work well: a good balance between low
cascade overhead and manageable intermediate result
size.
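A recursive cascade with a fixed fan-out might look like the sketch below. It runs in a single process: worker and the recursive cascade call are hypothetical stand-ins for Lambda invocations, and each level reduces its results to a single count so that intermediate results stay small.

```javascript
// Split a list into batches of at most `size` items (keeps the final partial batch).
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical worker: stands in for one Lambda worker invocation.
async function worker(key) {
  return { key: key, matches: 1 };
}

// If the key list is small enough, run workers directly; otherwise fan out
// to at most `fanOut` child cascades. `fanOut` must be at least 2.
async function cascade(keys, fanOut) {
  if (keys.length <= fanOut) {
    const results = await Promise.all(keys.map(worker));
    return results.reduce((sum, r) => sum + r.matches, 0);
  }
  const childSize = Math.ceil(keys.length / fanOut);
  const partials = await Promise.all(
    toBatches(keys, childSize).map((batch) => cascade(batch, fanOut))
  );
  return partials.reduce((sum, n) => sum + n, 0);
}

const sampleKeys = Array.from({ length: 10000 }, (_, i) => 'key-' + i);
cascade(sampleKeys, 100).then((total) => console.log('total matches: ' + total));
```

With 10,000 keys and a fan-out of 100, the top level invokes 100 child cascades, each of which runs 100 workers.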
17. Worker function
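var AWS = require('aws-sdk');
var zlib = require('zlib');
var eventstream = require('event-stream');
var s3 = new AWS.S3();
// processLine handles one line and cb fires at end of input; both are defined elsewhere
var lineSplitter = eventstream.split();
lineSplitter.on('data', processLine).on('end', cb);
// Create our pipeline: download, decompress, split into lines
s3.getObject({
  Bucket: srcBucket,
  Key: srcKey
})
  .createReadStream()
  .pipe(zlib.createGunzip())
  .pipe(lineSplitter);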
18. Worker function key points
• Use the full 1.5 GB of memory.
• Download Amazon S3 keys concurrently.
• 5 seems to be the magic number for files in the 2-3 MB
range.
• Use a faster decompression algorithm like LZ4 high-
compression, which is up to 32x faster than zlib.
• Make sure warnings and failures percolate up with
results.
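Two of these points, bounded download concurrency and percolating warnings, can be sketched together. The code below is an illustration under assumptions: processKey is a hypothetical stand-in for downloading and scanning one S3 object, and the limit of 5 is taken from the slide rather than measured here.

```javascript
// Run `task` over `items` with at most `limit` tasks in flight at once.
async function mapLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }
  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) runners.push(runner());
  await Promise.all(runners);
  return results;
}

// Hypothetical stand-in for downloading and scanning one S3 object.
// A failure becomes a warning on the result instead of aborting the batch.
async function processKey(key) {
  try {
    if (key.endsWith('.bad')) throw new Error('could not read ' + key);
    return { key: key, matches: 1, warnings: [] };
  } catch (err) {
    return { key: key, matches: 0, warnings: [err.message] };
  }
}

// Aggregate per-key results so warnings travel up with the counts.
async function runWorker(keys) {
  const perKey = await mapLimit(keys, 5, processKey);
  return {
    matches: perKey.reduce((sum, r) => sum + r.matches, 0),
    warnings: perKey.reduce((all, r) => all.concat(r.warnings), []),
  };
}

runWorker(['a.gz', 'b.bad', 'c.gz']).then((out) => console.log(JSON.stringify(out)));
```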
21. How do my followers feel about _____
1. Enter a keyword in the UI.
2. A Lambda worker executes for each follower.
3. Sentiment is reviewed (positive/negative/neutral).
4. Results are aggregated.
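The aggregation step could be as simple as a tally. The sketch below assumes each worker returns an object with a sentiment label; that result shape is hypothetical, and the scoring itself is out of scope here.

```javascript
// Tally per-follower sentiment labels into a summary the UI can chart.
// Labels other than positive/negative/neutral are ignored.
function aggregateSentiment(results) {
  const totals = { positive: 0, negative: 0, neutral: 0 };
  for (const r of results) {
    if (Object.prototype.hasOwnProperty.call(totals, r.sentiment)) {
      totals[r.sentiment]++;
    }
  }
  return totals;
}

const sample = [
  { follower: 'a', sentiment: 'positive' },
  { follower: 'b', sentiment: 'neutral' },
  { follower: 'c', sentiment: 'positive' },
];
console.log(aggregateSentiment(sample));
```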
24. Progressive results
Thirty seconds is an eternity in UX time.
Go beyond a progress bar: return streaming, progressive
results.
Show something meaningful in 3-5 seconds, final result in
30.
Graphically represent the updating data.
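One way to deliver progressive results to a browser is server-sent events, where each message is an optional "event:" line and a "data:" line followed by a blank line. The sketch below keeps a running aggregate and emits a message per partial result. The event names, payload shape, and function names are assumptions for illustration; a real HTTP response would also set Content-Type: text/event-stream.

```javascript
// Format one server-sent event: event name, one data line, blank-line terminator.
function sseMessage(eventName, payload) {
  return 'event: ' + eventName + '\n' +
         'data: ' + JSON.stringify(payload) + '\n\n';
}

// Keep a running aggregate and write a message after each partial result.
// The last expected batch is sent as a "final" event.
function createProgressReporter(totalBatches, write) {
  let done = 0;
  let matches = 0;
  return function onPartial(partial) {
    done++;
    matches += partial.matches;
    const name = done === totalBatches ? 'final' : 'progress';
    write(sseMessage(name, { done: done, total: totalBatches, matches: matches }));
  };
}

const report = createProgressReporter(3, (chunk) => process.stdout.write(chunk));
report({ matches: 4 });
report({ matches: 0 });
report({ matches: 7 });
```

On the client, an EventSource can listen for the progress and final events and redraw the chart as each one arrives.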
27. Lambda is the future (and past)
It demonstrates the essence of AWS: capability through
simplicity.
These things are no longer needed:
• Servers
• Operating systems
• Networking
Dev effort focuses only on core competencies, not
infrastructure.
28. Dev advantages
• If the code works once, it works
at any scale.
• Unit and integration testing are
easy (no cluster setup required).
• Any failures are due to faulty
code or bad input, which are
caught by good unit tests.
29. Beyond containers
• No patching, all upgrades are core
competency updates
• No instance monitoring, only app
monitoring
• Goes beyond containers: devs
have an ultra-consistent environment
30. Remember mainframes?
1970s: Mainframes offer an attractive operating model but
unattractive graphical capabilities.
1990s: PCs take over by bringing the compute to
the people for a rich, graphical experience.
2010s: Ubiquitous mobile broadband centralizes the
compute again, allowing the best of both
worlds.
31. Related Sessions
ARC308 - The Serverless Company Using AWS Lambda:
Streamlining Architecture with AWS
CMP301 - AWS Lambda: Event-Driven Code in the Cloud