Today, AdRoll runs its infrastructure by instrumentation: constantly asking empirical questions, analyzing data for answers, and designing new features with instrumentation in mind to understand how functionality will work upon release. AdRoll’s development methodology did not start out this way, however. It took a cultural shift and many new tools and processes to adopt this approach. In this session, AdRoll and Datadog will discuss how to evolve your organization from a state of “flying blind” to a culture focused on monitoring and data-based decisions. Session sponsored by Datadog.
3. Quick Overview of Datadog
• Monitoring for modern applications
• Dynamic Infrastructure
• Microservices
• Time series storage of metrics and events
• 100s of built-in integrations
• E.g., EC2, ELB, ECS, and more.
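To make the integrations concrete, here is a minimal sketch of reporting a custom metric through a locally running Datadog Agent (DogStatsD) with the datadog Python client; the metric names and tags are illustrative, not taken from AdRoll's setup.

    from datadog import initialize, statsd

    # Assumes a Datadog Agent with DogStatsD listening on localhost:8125.
    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # Hypothetical metric names and tags, for illustration only.
    statsd.increment("bidder.requests", tags=["region:us-east-1"])
    statsd.histogram("bidder.response_time_ms", 42, tags=["region:us-east-1"])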
15. How does our current monitoring fit in?
• Host-centric
• Static configurations tracking dynamic infrastructure
• Focused on resources, rather than work
• Difficult to pull together and compare data from multiple sources
16. So what to monitor?
More at: http://goo.gl/t1Rgcg
17. How to use that data?
More at: http://goo.gl/t1Rgcg
19. Query-based monitoring
• Aggregates matter because the underlying infrastructure is dynamic
• Express our monitors or alerts as queries on predicates:
• “avg response time for requests to hosts running nginx > 500 ms”
• “min # of hosts running nginx < 3”
• Mash up data sources for a 360-degree view of a problem
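As a sketch of what the two example alerts above could look like in practice, here they are expressed as Datadog monitor queries via the datadog Python client; the metric names, tags, and thresholds are assumptions for illustration, not the monitors AdRoll actually runs.

    from datadog import initialize, api

    initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

    # Hypothetical metric: average nginx request latency (seconds),
    # aggregated across all matching hosts, alerting above 0.5 s.
    api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:nginx.request.latency{role:frontend} > 0.5",
        name="High average nginx response time",
        message="Average response time for nginx hosts is above 500 ms.",
    )

    # Counting reporting hosts via the agent heartbeat metric, alerting
    # if fewer than three hosts tagged role:frontend are reporting.
    api.Monitor.create(
        type="metric alert",
        query="min(last_5m):sum:datadog.agent.running{role:frontend} < 3",
        name="Fewer than 3 nginx hosts",
        message="Fewer than three nginx hosts are reporting.",
    )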
24. The problem domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation
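(A back-of-envelope figure, not from the deck: by Little's law, roughly 2,000,000 transactions per second with a 100 ms budget implies on the order of 2,000,000 × 0.1 ≈ 200,000 transactions in flight at peak, well past the point where intuition alone can track system behavior.)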
25. In the early days of the AdRoll real-time bidding (RTB) project, we could use our intuition.
26. • The system was simple.
• The total number of requests was small.
• The impact of mistakes was minor.
27. We could be reasonably confident that our mental model of the system’s behavior was accurate.
28. The trouble with a complex system is that its behavior in practice gets away from you pretty fast.
29. Our first approach was to batch process logs generated by individual bidders.
Batch processing
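A minimal sketch of this kind of batch job, assuming a hypothetical log layout of one JSON record per line with a latency_ms field; it is meant to illustrate the approach, not AdRoll's actual pipeline.

    import glob
    import json

    # Hypothetical layout: one log file per bidder, one JSON record per line.
    latencies = []
    for path in glob.glob("/var/log/bidder/*.log"):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                latencies.append(record["latency_ms"])

    if latencies:
        latencies.sort()
        p99 = latencies[int(0.99 * (len(latencies) - 1))]
        print(f"requests={len(latencies)} p99_latency_ms={p99}")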
30. Pros:
• We were already doing this.
• It’s simple to implement.
• It’s straightforward to conceive.
Batch processing
32. Our second approach was to generate coarse real-time metrics and analyze those.
Coarse real-time metrics
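One way to picture coarse real-time metrics: in-memory counters bumped on the hot path and flushed on a fixed interval. The sketch below is an assumption-laden illustration (names, interval, and the print stand-in for a metrics store are all invented), not the bidder's real implementation.

    import threading
    import time
    from collections import Counter

    counters = Counter()
    lock = threading.Lock()

    def record_bid(won: bool) -> None:
        # Hot path: only bump in-memory counters, no I/O.
        with lock:
            counters["bids.total"] += 1
            if won:
                counters["bids.won"] += 1

    def flush(interval_s: int = 10) -> None:
        # Coarse: one aggregated data point per counter every interval.
        while True:
            time.sleep(interval_s)
            with lock:
                snapshot = dict(counters)
                counters.clear()
            print(int(time.time()), snapshot)  # stand-in for shipping to a metrics store

    threading.Thread(target=flush, daemon=True).start()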
33. Pros:
• Iterative step up from batch processing
• Proves out the concept
• Simple to implement
Coarse real-time metrics
34. Cons:
• Still relied on intuition
• Bidder implementation was sub-optimal
• Dashboards were a one-size-fits-all approach
Coarse real-time metrics
35. By this point, the complexity of the system and our ambitions were growing.
• Two engineers were added to the team.
• Tens more in the department.
• RTB became a central project.
57. • VM scheduler threads are locked to CPUs
• A CPU-intensive background process kicks in every 20 minutes
• No CPU shield on the server
• The VM scheduler thread gets kicked from its assigned CPU, and processing backs up
Timeout spikes
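For reference, a CPU shield can be approximated by pinning the background work away from the cores the VM scheduler threads are bound to. Below is a minimal, Linux-only sketch using Python's os.sched_setaffinity; the core split is illustrative.

    import os

    # Illustrative split: cores 0-6 reserved for the pinned VM scheduler
    # threads, core 7 left for housekeeping jobs.
    BACKGROUND_CORES = {7}

    def run_background_job() -> None:
        # Restrict the calling process (pid 0) to the housekeeping core so it
        # can never evict a scheduler thread from its assigned CPU.
        os.sched_setaffinity(0, BACKGROUND_CORES)
        # ... CPU-intensive maintenance work would run here ...

    if __name__ == "__main__":
        run_background_job()
        print("running on cores:", os.sched_getaffinity(0))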
63. This is a weekend’s worth of traffic lost.
Traffic crash
64. • Confirmed with CloudWatch that networking to the machines was fine
• No changes had been made to the production system (it was a quiet period)
• All detailed metrics from the Erlang VM looked acceptable
Traffic crash
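For completeness, this is roughly how a networking sanity check against CloudWatch can be done with boto3; the instance ID, region, time window, and the choice of NetworkIn are illustrative, not the exact check used during this incident.

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Average NetworkIn for one (hypothetical) bidder instance over the window
    # of the crash, to confirm traffic was actually reaching the host.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])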