Zsolt Várnai, Principal Software Engineer at Skyscanner, presented "The advantages of real-time monitoring in apps development" as part of the Big Data, Budapest v 3.0 meetup organised on the 19th of May 2016 at Skyscanner's headquarters.
3. • Monthly or less frequent releases
• Big release test cycle and bugfixing period
• Analytics data is used occasionally
• Minimal information about what is happening in the production app
The past (1-2 years ago)
4. • If it worked before release then it will work later as well
=> NOT TRUE
• What can change
• OS update
• New devices
• Server side behavior (any 3rd party tool + internal servers)
• Higher diversity, lots of use cases on real devices
The past (1-2 years ago)
5. • Bi-weekly release trains
• Feature flag controls feature visibility
• Checking GA and MP data through the API on a daily basis (with a daily summary)
• Reacting to issues within 1-2 days
• Monitoring app reviews regularly (Slack channel feed)
The past (6 months ago)
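
The feature-flag bullet above can be illustrated with a minimal sketch. It is not Skyscanner's implementation: the `FeatureFlags` class, the flag names and the idea that the flag values arrive as a plain dictionary from some remote configuration source are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Holds remotely configurable flags; unknown flags default to off."""
    flags: dict = field(default_factory=dict)

    def update(self, remote_config: dict) -> None:
        # Assumption: in a real app this dict would come from a config endpoint.
        self.flags.update(remote_config)

    def is_enabled(self, name: str) -> bool:
        return bool(self.flags.get(name, False))


def visible_features(flags: FeatureFlags, shipped: list) -> list:
    """Return the shipped features that users should currently see."""
    return [name for name in shipped if flags.is_enabled(name)]


# Hypothetical feature names, for illustration only.
shipped = ["new_search", "price_alerts", "map_view"]
flags = FeatureFlags()
flags.update({"new_search": True, "price_alerts": False})
shown = visible_features(flags, shipped)
print(shown)  # ['new_search']
```

Because code for a feature ships in the release train but stays hidden until its flag is switched on, a problematic feature can be switched off again without waiting for the next release.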
6. • What could possibly go wrong?
• Failing network requests
• Looks OK on the server side, remains unnoticed
• Client fails when it tries to process the response
• 3rd party tools causing crashes
• There is no failure, but the app doesn’t show the relevant content
• Invalid state causing permanent crash/error loops
Issues
7. • Collect, process and monitor as much data as possible from various sources
• Analytics data (conversion metrics and other metrics for core functionality)
• Monitor store reviews (manual, but a good source of direct information)
• Low level application logs, visible and silent errors/warnings
Solutions
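
As a rough sketch of what reporting "silent" errors could look like, the snippet below records an event whenever a response that the server considered fine cannot be processed on the client, or when it parses but contains nothing to show. The `record_event` helper, the in-memory `events` list, the `results` field and the event names are hypothetical stand-ins, not part of the talk.

```python
import json
import time

events = []  # stand-in for an event stream leaving the app (assumption)


def record_event(kind, **dimensions):
    """Append a structured event with a timestamp and free-form dimensions."""
    events.append({"kind": kind, "ts": time.time(), **dimensions})


def parse_results(body: str, app_version: str) -> list:
    """Parse a response body; report failures instead of hiding them."""
    try:
        payload = json.loads(body)
        items = payload["results"]  # hypothetical response field
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # The request succeeded on the server, but the client cannot use it.
        record_event("parse_error", app_version=app_version,
                     error=type(exc).__name__)
        return []
    if not items:
        # No failure, yet the user sees no relevant content.
        record_event("empty_results", app_version=app_version)
    return items


ok = parse_results('{"results": [1, 2]}', "4.1")
broken = parse_results('{"resul', "4.1")
empty = parse_results('{"results": []}', "4.1")
kinds = [event["kind"] for event in events]
```

After these three calls `kinds` holds one `parse_error` and one `empty_results` entry, while the successful call produces no error event.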
8. • Deep instrumentation throughout the application code
• Stream based real-time metrics from production apps (Kafka)
• Aggregating relevant metrics from the event stream (OpenTSDB, a time series database)
• Alerting on metrics (Bosun)
• Incident management system (VictorOps)
• Dashboards (Grafana)
• Drilling down on detailed events in case of an incident (Elasticsearch)
• Good chance of fixing big issues remotely before a new release (feature flag coverage)
Today
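
The aggregation and alerting steps in the pipeline above can be sketched in plain Python. This deliberately does not use the Kafka, OpenTSDB or Bosun APIs: the event shape (`ts`, `kind`), the one-minute bucket size and the 5% threshold are assumptions chosen only to show the idea of turning an event stream into a time series and alerting on it.

```python
from collections import Counter


def aggregate_per_minute(events):
    """Count events per (minute, kind): a tiny time series keyed by metric."""
    series = Counter()
    for event in events:
        minute = int(event["ts"] // 60) * 60  # start of the minute bucket
        series[(minute, event["kind"])] += 1
    return series


def error_rate_alerts(series, threshold=0.05):
    """Return (minute, rate) pairs where errors / requests exceeds threshold."""
    alerts = []
    for minute in sorted({minute for minute, _ in series}):
        requests = series[(minute, "request")]
        errors = series[(minute, "error")]
        if requests and errors / requests > threshold:
            alerts.append((minute, errors / requests))
    return alerts


# Minute 0: 40 requests, 1 error. Minute 60: 10 requests, 2 errors.
sample = (
    [{"ts": t, "kind": "request"} for t in range(40)]
    + [{"ts": 5, "kind": "error"}]
    + [{"ts": 60 + t, "kind": "request"} for t in range(10)]
    + [{"ts": 61, "kind": "error"}, {"ts": 75, "kind": "error"}]
)
series = aggregate_per_minute(sample)
alerts = error_rate_alerts(series)
print(alerts)  # [(60, 0.2)]
```

In the setup described on the slide, the aggregated series would be stored in the time series database and the threshold rule would live in the alerting tool; the sketch only mirrors the data flow.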
12. • Smarter alerting capabilities
• General error/crash rates are misleading
• Ability to alert on big changes within a specific dimension (app version, running experiments, different error types/services)
• Proper green flag system to alert relevant people without a dedicated squad to supervise (“You build it, you run it” model)
• Automated staged rollout progression based on real time metrics
• Automated review analysis
Future
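
To illustrate why a general error rate can mislead and how a per-dimension view could drive a staged rollout, here is a small sketch. The event fields (`app_version`, `is_error`), the rollout stages and the 2% limit are hypothetical values, and the halt-at-zero policy is only one possible choice, not something stated in the talk.

```python
from collections import defaultdict


def error_rates_by(events, dimension):
    """Return {dimension value: error rate} for the given events."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if event["is_error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


def next_rollout_percent(current, error_rate, max_error_rate=0.02,
                         stages=(1, 5, 20, 50, 100)):
    """Advance to the next stage if healthy, otherwise halt the rollout (0)."""
    if error_rate > max_error_rate:
        return 0
    later = [stage for stage in stages if stage > current]
    return later[0] if later else current


# 980 events from the old version (5 errors), 20 from the new one (10 errors).
sample = (
    [{"app_version": "4.0", "is_error": i < 5} for i in range(980)]
    + [{"app_version": "4.1", "is_error": i < 10} for i in range(20)]
)
overall = sum(event["is_error"] for event in sample) / len(sample)
rates = error_rates_by(sample, "app_version")
decision_new = next_rollout_percent(5, rates["4.1"])
decision_old = next_rollout_percent(5, rates["4.0"])
```

Here the overall error rate is 1.5% and looks healthy, while version 4.1 fails half of the time; the per-version rate halts the rollout of 4.1 (`decision_new` is 0), whereas a version with the old version's rate would progress from 5% to 20%.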