Today, AdRoll runs its infrastructure by instrumentation: constantly asking empirical questions, analyzing data for answers, and designing new features with instrumentation in mind to understand how functionality will work upon release. AdRoll’s development methodology did not start out this way, however. It took a cultural shift and many new tools and processes to adopt this approach. In this session, AdRoll and Datadog will discuss how to evolve your organization from a state of “flying blind” to a culture focused on monitoring and data-based decisions. Session sponsored by Datadog.
3. Quick Overview of Datadog
• Monitoring for modern applications
• Dynamic Infrastructure
• Microservices
• Time series storage of metrics and events
• 100s of built-in integrations
• E.g., EC2, ELB, ECS, and more.
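To make the integrations concrete, here is a minimal sketch of reporting a custom metric through a locally running Datadog Agent (DogStatsD) with the datadog Python client; the metric names and tags are illustrative, not taken from AdRoll's setup.

    from datadog import initialize, statsd

    # Assumes a Datadog Agent with DogStatsD listening on localhost:8125.
    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # Hypothetical metric names and tags, for illustration only.
    statsd.increment("bidder.requests", tags=["region:us-east-1"])
    statsd.histogram("bidder.response_time_ms", 42, tags=["region:us-east-1"])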
15. How does our current monitoring fit in?
• Host-centric
• Static configurations tracking dynamic infrastructure
• Focused on resources, rather than work
• Difficult to pull together and compare data from multiple sources
16. So what to monitor?
More at: http://goo.gl/t1Rgcg
17. How to use that data?
More at: http://goo.gl/t1Rgcg
19. Query-based monitoring
• Aggregates matter because the underlying infrastructure is dynamic
• Express our monitors or alerts as queries on predicates:
• “avg response time for requests to hosts running nginx > 500 ms”
• “min # of hosts running nginx < 3”
• Mash up data sources for a 360-degree view of a problem
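As a sketch of what the two example alerts above could look like in practice, here they are expressed as Datadog monitor queries via the datadog Python client; the metric names, tags, and thresholds are assumptions for illustration, not the monitors AdRoll actually runs.

    from datadog import initialize, api

    initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

    # Hypothetical metric: average nginx request latency (seconds),
    # aggregated across all matching hosts, alerting above 0.5 s.
    api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:nginx.request.latency{role:frontend} > 0.5",
        name="High average nginx response time",
        message="Average response time for nginx hosts is above 500 ms.",
    )

    # Counting reporting hosts via the agent heartbeat metric, alerting
    # if fewer than three hosts tagged role:frontend are reporting.
    api.Monitor.create(
        type="metric alert",
        query="min(last_5m):sum:datadog.agent.running{role:frontend} < 3",
        name="Fewer than 3 nginx hosts",
        message="Fewer than three nginx hosts are reporting.",
    )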
24. The problem domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation
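(A back-of-envelope figure, not from the deck: by Little's law, roughly 2,000,000 transactions per second with a 100 ms budget implies on the order of 2,000,000 × 0.1 ≈ 200,000 transactions in flight at peak, well past the point where intuition alone can track system behavior.)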
25. In the early days of the AdRoll real-time bidding (RTB) project, we could use our intuition.
26. • The system was simple.
• The total number of requests was small.
• The impact of mistakes was minor.
27. We could be reasonably confident that our mental model of the system’s behavior was accurate.
28. The trouble with a complex system is that its behavior in practice gets away from you pretty fast.
29. Our first approach was to batch process logs generated by individual bidders.
Batch processing
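A minimal sketch of this kind of batch job, assuming a hypothetical log layout of one JSON record per line with a latency_ms field; it is meant to illustrate the approach, not AdRoll's actual pipeline.

    import glob
    import json

    # Hypothetical layout: one log file per bidder, one JSON record per line.
    latencies = []
    for path in glob.glob("/var/log/bidder/*.log"):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                latencies.append(record["latency_ms"])

    if latencies:
        latencies.sort()
        p99 = latencies[int(0.99 * (len(latencies) - 1))]
        print(f"requests={len(latencies)} p99_latency_ms={p99}")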
30. Pros:
• We were already doing this.
• It’s simple to implement.
• It’s straightforward to conceive.
Batch processing
32. Our second approach was to generate coarse real-time metrics and analyze those.
Coarse real-time metrics
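One way to picture coarse real-time metrics: in-memory counters bumped on the hot path and flushed on a fixed interval. The sketch below is an assumption-laden illustration (names, interval, and the print stand-in for a metrics store are all invented), not the bidder's real implementation.

    import threading
    import time
    from collections import Counter

    counters = Counter()
    lock = threading.Lock()

    def record_bid(won: bool) -> None:
        # Hot path: only bump in-memory counters, no I/O.
        with lock:
            counters["bids.total"] += 1
            if won:
                counters["bids.won"] += 1

    def flush(interval_s: int = 10) -> None:
        # Coarse: one aggregated data point per counter every interval.
        while True:
            time.sleep(interval_s)
            with lock:
                snapshot = dict(counters)
                counters.clear()
            print(int(time.time()), snapshot)  # stand-in for shipping to a metrics store

    threading.Thread(target=flush, daemon=True).start()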
33. Pros:
• Iterative step up from batch processing
• Proves out the concept
• Simple to implement
Coarse real-time metrics
34. Cons:
• Still relied on intuition
• Bidder implementation was sub-optimal
• Dashboards were a one-size-fits-all approach
Coarse real-time metrics
35. By this point, the complexity of the system and our ambitions were growing.
• Two engineers were added to the team.
• Tens more in the department.
• RTB became a central project.
57. • VM scheduler threads are locked to CPUs
• A CPU-intensive background process kicks in every 20 minutes
• No CPU shield on the server
• The VM scheduler thread gets kicked from its assigned CPU, and processing backs up
Timeout spikes
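For reference, a CPU shield can be approximated by pinning the background work away from the cores the VM scheduler threads are bound to. Below is a minimal, Linux-only sketch using Python's os.sched_setaffinity; the core split is illustrative.

    import os

    # Illustrative split: cores 0-6 reserved for the pinned VM scheduler
    # threads, core 7 left for housekeeping jobs.
    BACKGROUND_CORES = {7}

    def run_background_job() -> None:
        # Restrict the calling process (pid 0) to the housekeeping core so it
        # can never evict a scheduler thread from its assigned CPU.
        os.sched_setaffinity(0, BACKGROUND_CORES)
        # ... CPU-intensive maintenance work would run here ...

    if __name__ == "__main__":
        run_background_job()
        print("running on cores:", os.sched_getaffinity(0))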
63. This is a weekend’s worth of traffic lost.
Traffic crash
64. • Confirmed with CloudWatch that networking to the machines was fine
• No changes had been made to the production system (it was a quiet period)
• All detailed metrics from the Erlang VM looked acceptable
Traffic crash
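For completeness, this is roughly how a networking sanity check against CloudWatch can be done with boto3; the instance ID, region, time window, and the choice of NetworkIn are illustrative, not the exact check used during this incident.

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Average NetworkIn for one (hypothetical) bidder instance over the window
    # of the crash, to confirm traffic was actually reaching the host.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkIn",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])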