(CMP403) AWS Lambda: Simplifying Big Data Workloads

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Martin Holste, FireEye
October 2015
CMP403
AWS Lambda
Simplifying Big Data Workloads

What to Expect from the Session
This is a deep-dive on general computing uses for
AWS Lambda.
• You will understand what makes Lambda a big deal for
big data.
• You will not learn about asynchronously triggered
workloads (see related sessions for that).
• You will see interactive, data-driven user experiences
that work with minimal ops overhead and at any scale.

Problem: Big data, little time
At FireEye, one of the ways we protect customers is by
analyzing mountains of event data to find “evil.”
Some of it we have online in indexes, some of it we have in
cold storage on Amazon S3.
We needed to be able to take advantage of the rich history
in our archived data without hurting our user experience.

Our app creates questions and finds answers
[Diagram: Questions and Answers. Components: Lambda-driven search and analytics; EMR analytic output; EC2-based proprietary detection; EC2-based indexed search. Captions: "Amazon EMR triggers investigations" and "AWS Lambda provides context".]

What analysis are we doing?
Amazon EMR
Scheduled jobs that process all data for anomaly detection:
• K-means
• Linear regression
• Geographic time-lining
AWS Lambda
Free-form searching to drive ad hoc:
• Reports
• Visualizations
• Analytical statistics (clustering, correlation, linear regression, etc.)

Visualize search results analytically
User-defined analytics
based on ad hoc features
of the search result set
draw attention to otherwise
uninteresting facets of the
data.

How big is our Big?
For an average customer:
Average security event size is about 3k bytes at 20k
events/sec ~= 60 MB/sec, which is about 5 TB/day.
One week = 35 TB, 12 billion events.
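The figures above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes 3,000 bytes per event and decimal units (1 TB = 10^12 bytes).

```javascript
// Back-of-the-envelope volume estimate from the slide's figures.
const eventBytes = 3000;      // ~3k bytes per event
const eventsPerSec = 20000;   // 20k events/sec
const secondsPerDay = 86400;

const bytesPerSec = eventBytes * eventsPerSec;           // bytes/sec
const tbPerDay = (bytesPerSec * secondsPerDay) / 1e12;   // TB/day
const tbPerWeek = tbPerDay * 7;                          // TB/week
const eventsPerWeek = eventsPerSec * secondsPerDay * 7;  // events/week

console.log((bytesPerSec / 1e6) + ' MB/sec');
console.log(tbPerDay.toFixed(1) + ' TB/day');
console.log(tbPerWeek.toFixed(0) + ' TB/week');
console.log((eventsPerWeek / 1e9).toFixed(1) + ' billion events/week');
```

The unrounded weekly total comes out slightly above the slide's 35 TB, which rounds the daily figure down to 5 TB first.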

How long does this take?
A single process downloads, decompresses, greps, and
processes at about 35k events/sec (105 MB/sec).
To process a week of data:
Processes   Time scale
1           ~4 days
10          ~10 hours
100         ~1 hour
1,000       ~5 minutes
10,000      seconds
[Chart: processing time versus number of processes, 1 to 1,000]
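As a rough check on the scaling, the sketch below divides a week of events by the aggregate processing rate. It assumes perfectly linear scaling with no startup or coordination overhead, which real systems will not quite achieve; the function names are illustrative.

```javascript
// Time to process one week of events with N parallel processes,
// assuming each process sustains 35k events/sec and scaling is linear.
const weekOfEvents = 12e9;          // ~12 billion events per week
const perProcessRate = 35000;       // events/sec for a single process

function secondsToProcess(processes) {
  return weekOfEvents / (perProcessRate * processes);
}

// Render a duration in the largest sensible unit.
function formatDuration(seconds) {
  if (seconds >= 86400) return (seconds / 86400).toFixed(1) + ' days';
  if (seconds >= 3600) return (seconds / 3600).toFixed(1) + ' hours';
  if (seconds >= 60) return (seconds / 60).toFixed(1) + ' minutes';
  return seconds.toFixed(0) + ' seconds';
}

[1, 10, 100, 1000, 10000].forEach(function (n) {
  console.log(n + ' processes: ' + formatDuration(secondsToProcess(n)));
});
```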

Lambda FTW
What if you could spin up 10k
processes in 100 ms?
Standard map-reduce pattern
without the startup time or hassle
of map-reduce frameworks.
Write your simple worker code,
and let cascading Lambda
functions handle the heavy lifting.

Lambda cascade
AWS Big Data blog: “Building Scalable and Responsive Big Data Interfaces with AWS Lambda”

Code components
• Basic web app: handles UI request, invokes cascade functions, streams results.
• Cascade function: invokes workers, aggregates and returns results. Can be made recursive.
• Worker function: performs atomic work, returns results to invoker.

Basic web app
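// Stream of S3 keys matching the search, fed into a stream that fans out to Lambda
var listStream = new S3KeyListStream(searchParams);
var lambdaStream = new LambdaStream(maxWorkers);
listStream
  // end: false keeps lambdaStream open after the key list ends;
  // the pipeline code calls end() itself once all invocations have returned
  .pipe(lambdaStream, { end: false })
  .pipe(serverSentStream)
  .pipe(httpResponse);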

Basic web app key points
• Batched async execution within an async pipeline is very unintuitive.
• The trick is to pipe with end: false and call end() manually in the pipeline code when all work is done.
• Pipeline will naturally queue up batches to stay under
configured Lambda provisioning limits.
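The end: false point can be illustrated with plain Node.js streams. This is a minimal sketch, not the talk's LambdaStream: `createLambdaStream`, `sourceFinished`, and `fakeInvoke` are hypothetical names, the Lambda call is replaced by a stub, and the concurrency limit is omitted to keep the idea visible. Without end: false, pipe() would end the downstream stream as soon as the key list finished, while results were still in flight.

```javascript
const { Readable, Transform } = require('stream');

// Hypothetical stand-in for an asynchronous Lambda invocation.
function fakeInvoke(key, callback) {
  setImmediate(() => callback(null, 'result:' + key));
}

// Accepts keys immediately and pushes results whenever invocations complete.
function createLambdaStream(invoke) {
  let pending = 0;         // invocations still in flight
  let sourceDone = false;  // set once the upstream key list has ended

  const stream = new Transform({
    objectMode: true,
    transform(key, _encoding, done) {
      pending++;
      invoke(key, (err, result) => {
        pending--;
        if (err) { stream.destroy(err); return; }
        stream.push(result);
        endIfFinished();
      });
      done(); // accept the next key without waiting for this result
    }
  });

  function endIfFinished() {
    if (sourceDone && pending === 0) stream.end();
  }

  // The pipeline code calls this when the upstream list is exhausted.
  stream.sourceFinished = () => { sourceDone = true; endIfFinished(); };
  return stream;
}

const listStream = Readable.from(['key1', 'key2', 'key3']);
const lambdaStream = createLambdaStream(fakeInvoke);
const collected = [];

listStream.on('end', () => lambdaStream.sourceFinished());
listStream.pipe(lambdaStream, { end: false }); // do not end downstream automatically
lambdaStream.on('data', (result) => collected.push(result));
lambdaStream.on('end', () => console.log(collected));
```

The manual end() fires only when the source has finished and no invocation is outstanding, whichever happens last.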

Lambda cascade function
// Chop our given list of keys up into batches
var batches = [];
var batch = [];
for (var i = 0, len = allKeys.length; i < len; i++){
  batch.push(allKeys[i]);
  if (batch.length >= batchSize){
    batches.push(batch.slice());
    batch = [];
  }
}
// Keep the final, partially filled batch
if (batch.length > 0){
  batches.push(batch);
}

Lambda cascade function (continued)
// Invoke each batch in parallel, returning the
// aggregated result when all are finished.
async.map(batches, invoke,
  function (err, results) {
    if (err) {
      context.fail('async.map error: ' + err.toString());
      return;
    }
    context.succeed(results);
  });
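The slides do not show the `invoke` function passed to async.map. One plausible shape is sketched below, assuming a client with the same `invoke(params, callback)` interface as `AWS.Lambda` in the AWS SDK for JavaScript v2. The `makeInvoker` helper, the `search-worker` function name, and the `{ keys: ... }` payload layout are assumptions for illustration, and a fake client stands in for the real one so the sketch runs without credentials.

```javascript
// Builds an invoke(batch, callback) function suitable for async.map.
// lambdaClient is expected to behave like an AWS SDK v2 AWS.Lambda instance.
function makeInvoker(lambdaClient, workerName) {
  return function invoke(batch, callback) {
    lambdaClient.invoke({
      FunctionName: workerName,
      InvocationType: 'RequestResponse',      // wait for the worker's result
      Payload: JSON.stringify({ keys: batch }) // hypothetical payload layout
    }, function (err, data) {
      if (err) return callback(err);
      if (data.FunctionError) {
        return callback(new Error('worker failed: ' + data.Payload));
      }
      let parsed;
      try {
        parsed = JSON.parse(data.Payload);
      } catch (parseErr) {
        return callback(parseErr);
      }
      callback(null, parsed);
    });
  };
}

// Fake client for local experimentation; it reports how many keys it received.
const fakeLambda = {
  invoke(params, callback) {
    const keys = JSON.parse(params.Payload).keys;
    callback(null, { StatusCode: 200, Payload: JSON.stringify({ matched: keys.length }) });
  }
};

const invokeWorker = makeInvoker(fakeLambda, 'search-worker');
invokeWorker(['a.gz', 'b.gz', 'c.gz'], function (err, result) {
  console.log(err, result);
});
```

Surfacing `FunctionError` as an error keeps a failed worker from being silently aggregated as if it were a normal result.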

Lambda cascade function key points
• Nature of the data and workload will dictate the correct
batch sizes to give a cascade function. Need to avoid
running out of memory to aggregate results.
• 100:1 seems to work well, good balance between low
cascade overhead and manageable intermediate result
size.

Worker function
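var AWS = require('aws-sdk');
var zlib = require('zlib');
var eventstream = require('event-stream');
var s3 = new AWS.S3();

// Split the decompressed stream into lines; handle each line, then call cb.
// (The handler is named processLine so it does not shadow Node's global `process`.)
var lineSplitter = eventstream.split();
lineSplitter.on('data', processLine).on('end', cb);
// Create our pipeline: S3 object -> gunzip -> line splitter
s3.getObject({
  Bucket: srcBucket,
  Key: srcKey
})
  .createReadStream()
  .pipe(zlib.createGunzip())
  .pipe(lineSplitter);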

Worker function key points
• Use the full 1.5 GB of memory.
• Download Amazon S3 keys concurrently; 5 concurrent downloads seems to be the magic number for files in the 2-3 MB range.
• Use a faster decompression algorithm like LZ4 high-compression, which is up to 32x faster than zlib.
• Make sure warnings and failures percolate up with
results.
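Two of the points above, bounded download concurrency and failures that travel with the results, can be sketched together. This is an illustrative sketch only: `mapWithLimit` and `fakeDownload` are hypothetical names, and the stub download stands in for real S3 reads.

```javascript
// Run task over items with at most `limit` tasks in flight.
// Failures are recorded next to results instead of aborting the batch.
function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to start

  function runNext() {
    if (next >= items.length) return Promise.resolve();
    const index = next++;
    return Promise.resolve()
      .then(() => task(items[index]))
      .then(
        (value) => { results[index] = { ok: true, value: value }; },
        (err) => { results[index] = { ok: false, error: String((err && err.message) || err) }; }
      )
      .then(runNext);
  }

  const lanes = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) lanes.push(runNext());
  return Promise.all(lanes).then(() => results);
}

// Stub download that tracks how many calls overlap.
let inFlight = 0;
let maxInFlight = 0;
function fakeDownload(key) {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  return new Promise((resolve, reject) => setImmediate(() => {
    inFlight--;
    if (key === 'missing') reject(new Error('not found: ' + key));
    else resolve('contents of ' + key);
  }));
}

const keys = [];
for (let i = 0; i < 12; i++) keys.push(i === 3 ? 'missing' : 'part-' + i);

const downloads = mapWithLimit(keys, 5, fakeDownload);
downloads.then((results) => {
  console.log('max concurrent downloads:', maxInFlight);
  console.log('failures:', results.filter((r) => !r.ok));
});
```

Each result carries an `ok` flag, so the invoking cascade function can report warnings and failures alongside the data rather than losing them.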

Non–Amazon S3–sourced workloads
Lambda can source from anything:
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon EC2 endpoints
The Internet

How do my followers feel about _____
1. Enter a keyword in the UI.
2. A Lambda worker executes for each follower.
3. Sentiment is reviewed (positive/negative/neutral).
4. Results are aggregated.
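The four steps above can be sketched end to end with stubs. Everything here is hypothetical: the word lists and `classify` are a deliberately naive stand-in for real sentiment analysis, `workerForFollower` represents what one Lambda worker might do, and the follower data is made up for the example.

```javascript
// Deliberately naive classifier; a real one would be far more involved.
const POSITIVE = ['love', 'great', 'good'];
const NEGATIVE = ['hate', 'awful', 'bad'];

function classify(text) {
  let score = 0;
  text.toLowerCase().split(/\W+/).forEach((word) => {
    if (POSITIVE.includes(word)) score++;
    if (NEGATIVE.includes(word)) score--;
  });
  return score > 0 ? 'positive' : score < 0 ? 'negative' : 'neutral';
}

// What one worker might do for one follower: keep posts that mention
// the keyword and label each one.
function workerForFollower(posts, keyword) {
  const needle = keyword.toLowerCase();
  return posts
    .filter((post) => post.toLowerCase().includes(needle))
    .map(classify);
}

// Aggregate the per-follower labels into totals.
function aggregate(perFollowerLabels) {
  const totals = { positive: 0, negative: 0, neutral: 0 };
  perFollowerLabels.forEach((labels) => labels.forEach((label) => { totals[label]++; }));
  return totals;
}

const followers = {
  alice: ['I love lambda', 'bad weather today'],
  bob: ['Lambda is awful', 'lambda exists'],
  carol: ['great lambda talk']
};

const sentimentTotals = aggregate(
  Object.keys(followers).map((name) => workerForFollower(followers[name], 'lambda'))
);
console.log(sentimentTotals);
```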

Progressive results
Thirty seconds is an eternity in UX time.
Go beyond a progress bar: return streaming, progressive results.
Show something meaningful in 3-5 seconds, and the final result in 30.
Graphically represent the updating data.
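One way to produce progressive results is to fold each batch result into running totals and send a snapshot to the browser each time. The sketch below formats snapshots as server-sent events (`data: ...` followed by a blank line); `createProgressiveAggregator` and the shape of the snapshot object are assumptions for illustration, not the talk's code.

```javascript
// Returns a function that merges one batch's counts into running totals
// and formats the current snapshot as a server-sent event.
function createProgressiveAggregator(totalBatches) {
  const totals = {};
  let completed = 0;

  return function onBatchResult(counts) {
    completed++;
    Object.keys(counts).forEach((key) => {
      totals[key] = (totals[key] || 0) + counts[key];
    });
    const snapshot = {
      completed: completed,
      totalBatches: totalBatches,
      done: completed === totalBatches,
      totals: Object.assign({}, totals)
    };
    return 'data: ' + JSON.stringify(snapshot) + '\n\n';
  };
}

const onBatchResult = createProgressiveAggregator(2);
const firstEvent = onBatchResult({ error: 2, warn: 1 });
const secondEvent = onBatchResult({ error: 1 });
process.stdout.write(firstEvent);
process.stdout.write(secondEvent);
```

Because every event carries the totals so far, the UI can redraw a chart on each message instead of waiting for the final one.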

Mechanical sympathy
Visualizing the result stream as it matures communicates
the magnitude of the work being performed and shows
value.

Lambda is the future (and past)
It demonstrates the essence of AWS: capability through
simplicity.
These things are no longer needed:
• Servers
• Operating systems
• Networking
Dev effort focuses only on core competencies, not
infrastructure.

Dev advantages
• If the code works once, it works
at any scale.
• Unit and integration testing are
easy (no cluster setup required).
• Any failures are due to faulty
code or bad input, which are
caught by good unit tests.

Beyond containers
• No patching, all upgrades are core
competency updates
• No instance monitoring, only app
monitoring
• Goes beyond containers, devs
have ultra-consistent environment

Remember mainframes?
1970s: Mainframes offer an attractive operating model but unattractive graphical capabilities.
1990s: PCs take over by bringing the compute to the people for a rich, graphical experience.
2010s: Ubiquitous mobile broadband centralizes the compute again by allowing the best of both worlds.

Related Sessions
ARC308 - The Serverless Company Using AWS Lambda:
Streamlining Architecture with AWS
CMP301 - AWS Lambda: Event-Driven Code in the Cloud

Remember to complete
your evaluations!
