The Observability Pipeline

The pervasiveness of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. Specifically, it raises questions about how we monitor and debug these types of systems at scale. And with the rise of DevOps and a product mindset, making data-driven decisions is becoming increasingly important for agile development teams.

In this talk, we discuss a new approach to system monitoring and data collection: the observability pipeline. For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. The observability pipeline provides a layer of abstraction that allows you to get operational data such as logs and metrics everywhere it needs to be without impacting developers and the core system. Unlocking this data can also be a huge win for the business with things like auditability, business analytics, and pricing. Lastly, it allows you to change backing data systems easily or test multiple in parallel. With the amount of data and the number of tools modern systems demand these days, we'll see how the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.

  1. 1. @tyler_treat The Observability Pipeline Tyler Treat / deliver:Agile 2019 / April 29, 2019
  2. 2. @tyler_treat The way we build systems has fundamentally changed.
  3. 3. @tyler_treat Our systems are more complex than they’ve ever been.
  4. 4. @tyler_treat Don’t believe me?
  5. 5. @tyler_treat https://www.youtube.com/watch?v=xy3w2hGijhE
  6. 6. @tyler_treat Pets vs. Cattle
  7. 7. @tyler_treat This is our server. His name is Toby.
  8. 8. @tyler_treat We take good care of Toby.
  9. 9. @tyler_treat We release to him twice a year.
 (quarterly if we’re feeling dangerous)
  10. 10. @tyler_treat Toby is compatible with most
 versions of Internet Explorer.
  11. 11. @tyler_treat Toby likes to go on long walks,
 so sometimes we’ll take him 
 offline for a bit.
 (usually just nights and weekends)
  12. 12. @tyler_treat No one seems to mind.
  13. 13. @tyler_treat Sometimes Toby crashes,
 but we always make sure
 to restart him.
  14. 14. @tyler_treat We like Toby.
  15. 15. @tyler_treat This is 74db150601cd.
  16. 16. @tyler_treat It’s best not to get too
 attached because when he’s
 no longer needed, well…
  17. 17. @tyler_treat
  18. 18. @tyler_treat Transactional
 DB App Server Reporting
 DB
  19. 19. @tyler_treat Transactional
 DB App Server Reporting
 DB
  20. 20. @tyler_treat “We need to be highly available.”
  21. 21. @tyler_treat Transactional
 DB App Server Reporting
 DB
  22. 22. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server
  23. 23. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server
  24. 24. @tyler_treat “We need to support every device.”
  25. 25. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server
  26. 26. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server
  27. 27. @tyler_treat “We need faster response times.”
  28. 28. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server
  29. 29. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server Node 1 Node 2 Node 3 Node 4 Node 5 Cache Cluster
  30. 30. @tyler_treat “We need real-time analytics, not batch.”
  31. 31. @tyler_treat Node 1 App Server Reporting
 DB Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server Node 1 Node 2 Node 3 Node 4 Node 5 Cache Cluster
  32. 32. @tyler_treat App Server Node 1 Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server Node 1 Node 2 Node 3 Node 4 Node 5 Cache Cluster Node 1 Node 2 Node 3 Node 4 Node 5 BI Data Cluster BI Server BI Server Data Pipeline
  33. 33. @tyler_treat “We need to release multiple times a day.”
  34. 34. @tyler_treat App Server Node 1 Node 2 Node 3 Node 4 Node 5 Database Cluster App Server App Server Node 1 Node 2 Node 3 Node 4 Node 5 Cache Cluster Node 1 Node 2 Node 3 Node 4 Node 5 BI Data Cluster BI Server BI Server Data Pipeline
  35. 35. @tyler_treat Node 1 Node 2 Node 3 Node 4 Node 5 BI Data Cluster BI Server BI Server 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice Data Pipeline
  36. 36. @tyler_treat “We need to support multiple geos.”
  37. 37. @tyler_treat Node 1 Node 2 Node 3 Node 4 Node 5 BI Data Cluster BI Server BI Server 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice 1 2 3 4 5 Database Cluster 1 2 3 4 5 Cache Cluster Microservice Data Pipeline
  38. 38. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice Asia Pacific BI Server BI Server Microservice Microservice Microservice Microservice
  39. 39. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN
  40. 40. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN Infrastructure Load Balancers Orchestrators DNS Configuration . . .
  41. 41. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN CI/CD Repo Repo Repo Repo Builder Builder Builder Builder Builder Builder Artifacts Artifacts Artifacts Deployer Deployer Infrastructure Load Balancers Orchestrators DNS Configuration . . .
  42. 42. @tyler_treat “Oh, and one more thing…”
  43. 43. @tyler_treat “…we need to do DevOps.”
  44. 44. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN CI/CD Repo Repo Repo Repo Builder Builder Builder Builder Builder Builder Artifacts Artifacts Artifacts Deployer Deployer Infrastructure Load Balancers Orchestrators DNS Configuration . . .
  45. 45. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN CI/CD Repo Repo Repo Repo Builder Builder Builder Builder Builder Builder Artifacts Artifacts Artifacts Deployer Deployer “DevOps” Infrastructure Load Balancers Orchestrators DNS Configuration . . .
  46. 46. @tyler_treat The way we build systems has fundamentally changed.
  47. 47. @tyler_treat Because our constraints and expectations have fundamentally changed.
  48. 48. @tyler_treat Cloud and containers have led to much more distributed and dynamic systems.
  49. 49. @tyler_treat Transactional
 DB App Server Reporting
 DB
  50. 50. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN CI/CD Repo Repo Repo Repo Builder Builder Builder Builder Builder Builder Artifacts Artifacts Artifacts Deployer Deployer Infrastructure Load Balancers Orchestrators DNS Configuration . . . “DevOps”
  51. 51. @tyler_treat This shift has exposed deficiencies in our tools and practices…
  52. 52. @tyler_treat …and has led to new tools created to help us support our systems.
  53. 53. @tyler_treat How do we make sense of it all?
  54. 54. @tyler_treat In particular, how do we make this…
  55. 55. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN CI/CD Repo Repo Repo Repo Builder Builder Builder Builder Builder Builder Artifacts Artifacts Artifacts Deployer Deployer Infrastructure Load Balancers Orchestrators DNS Configuration . . . “DevOps”
  56. 56. @tyler_treat more like this…
  57. 57. @tyler_treat North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice North America BI Server BI Server Microservice Microservice Microservice Microservice CDN CI/CD Repo Repo Repo Repo Builder Builder Builder Builder Builder Builder Artifacts Artifacts Artifacts Deployer Deployer Infrastructure Load Balancers Orchestrators DNS Configuration . . . “DevOps”
  58. 58. @tyler_treat “The Observability Pipeline”
  59. 59. @tyler_treat A Brave New World
  60. 60. @tyler_treat Operations for
  61. 61. @tyler_treat APM Debugger Profiler SSH grep
  62. 62. @tyler_treat APM Debugger Profiler SSH grep
  63. 63. @tyler_treat APM Debugger Profiler SSH grep
  64. 64. @tyler_treat APM Debugger Profiler SSH grep
  65. 65. @tyler_treat APM Debugger Profiler SSH grep
  66. 66. @tyler_treat APM Debugger Profiler SSH System Behavior grep
  67. 67. @tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact grep
  68. 68. @tyler_treat Operations for
  69. 69. @tyler_treat APM Debugger Profiler SSH grep
  70. 70. @tyler_treat APM Debugger Profiler SSH grep
  71. 71. @tyler_treat APM Debugger Profiler SSH Testing in Production at Scale, Amit Gud grep
  72. 72. @tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact ??? grep
  73. 73. @tyler_treat grep APM Debugger Profiler SSH System Behavior Actual Customer Impact ???
  74. 74. @tyler_treat Also, culture.
  75. 75. @tyler_treat Many companies rely on a separate operations team to monitor, triage, and even resolve issues.
  76. 76. @tyler_treat This model doesn’t map to the world of microservices and containers.
  77. 77. @tyler_treat And it leads to ineffective feedback loops.
  78. 78. @tyler_treat In order for developers to take on this responsibility, they need to be enabled.
  79. 79. @tyler_treat “DevOps” teams are really “Developer Enablement” teams.
  80. 80. @tyler_treat This shift in how we build systems has caused an explosion of new tools and terminology.
  81. 81. @tyler_treat “Observability”
  82. 82. @tyler_treat Post Hoc vs. Ad Hoc
  83. 83. @tyler_treat Data Available Understanding
  84. 84. @tyler_treat Data Available Understanding Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit”
  85. 85. @tyler_treat Data Available Understanding Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  86. 86. @tyler_treat Data Available Understanding Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running” Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  87. 87. @tyler_treat Data Available Understanding Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running” Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage”
  88. 88. @tyler_treat Data Available Understanding Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running” Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage” FACTS
  89. 89. @tyler_treat Data Available Understanding Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running” Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage” FACTS HYPOTHESES
  90. 90. @tyler_treat Data Available Understanding Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running” Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage” ASSUMPTIONS FACTS HYPOTHESES
  91. 91. @tyler_treat Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” DISCOVERIES Data Available Understanding Unknown Knowns • Things we understand but are not aware of • “We implemented an orchestrator to ensure the system is always running” Known Knowns • Things we are aware of and understand • “The system has a 1GB memory limit” Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage” ASSUMPTIONS FACTS HYPOTHESES
  92. 92. @tyler_treat Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” DISCOVERIES Data Available Understanding Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage” HYPOTHESES Monitoring Observability
  93. 93. @tyler_treat Unknown Unknowns • Things we are neither aware of nor understand • “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing
 sporadic failures and slowdowns” DISCOVERIES Data Available Understanding Known Unknowns • Things we are aware of but don’t understand • “The system exceeded its memory limit and crashed, causing an outage” HYPOTHESES Testing Exploring
  94. 94. @tyler_treat “The army is now fully prepared to fight the previous war.”
  95. 95. @tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
  96. 96. @tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events. Some challenges… - Locked up inside a single vendor’s solution - Not readily available across the enterprise (or in some cases, too readily available) - Many tools and products needed for different data and use cases - Tool and data needs vary from team to team - Ever-changing landscape of tools, products, and services - Sheer volume of data can be overwhelming
  97. 97. @tyler_treat System
  98. 98. @tyler_treat System Splunk Universal Forwarder
  99. 99. @tyler_treat System Splunk Universal Forwarder Datadog Metrics Agent Datadog APM Agent
  100. 100. @tyler_treat System Splunk Universal Forwarder Datadog Metrics Agent Datadog APM Agent Universal Analytics Client
  101. 101. @tyler_treat System Splunk Universal Forwarder Datadog Metrics Agent Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client
  102. 102. @tyler_treat System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent
  103. 103. [Diagram: a fleet of systems, each running its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]
  104. 104. @tyler_treat “Oh, actually we want to change how we parse our logs.”
  105. 105. [Diagram: a fleet of systems, each running its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]
  106. 106. @tyler_treat “Re-roll the agents.”
  107. 107. @tyler_treat “Oh, actually we want to use Sumo Logic for logging.”
  108. 108. [Diagram: a fleet of systems, each running its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]
  109. 109. @tyler_treat “Re-roll the agents.”
  110. 110. [Diagram: a fleet of systems, each running its own Sumo Logic Collector, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]
  111. 111. @tyler_treat “Oh, actually we want to use New Relic for APM.”
  112. 112. [Diagram: a fleet of systems, each running its own Sumo Logic Collector, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]
  113. 113. @tyler_treat “Re-roll the agents.”
  114. 114. [Diagram: a fleet of systems, each running its own Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …]
  115. 115. @tyler_treat “Oh, actually we want to evaluate Honeycomb for debugging.”
  116. 116. [Diagram: a fleet of systems, each running its own Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …]
  117. 117. @tyler_treat “Re-roll the agents.”
  118. 118. [Diagram: a fleet of systems, each running its own Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …, now with a Honeytail Agent added to each]
  119. 119. @tyler_treat You get the idea.
  120. 120. @tyler_treat How big of a lift is it for your organization to change tools?
  121. 121. @tyler_treat How easy is it to experiment with new ones?
  122. 122. @tyler_treat Data Sources • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • … Data Sinks • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • … What data to send? Where to send it? How to send it?
  123. 123. @tyler_treat A decoupled approach
  124. 124. @tyler_treat What data to send? Where to send it? How to send it? Data Sources • VMs • Containers • Load balancers • Service meshes • Audit logs • VPC flow logs • Firewall logs • … Data Sinks • Centralized logging • SIEM • Monitoring • APM • Alerting • Cold storage • BI • … Observability Pipeline
  125. 125. @tyler_treat Anatomy of an Observability Pipeline
  126. 126. @tyler_treat 1. Data Specifications: Structure your damn data.
  127. 127. @tyler_treat log.error("User '{}' login failed".format(user))
  128. 128. @tyler_treat ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed
  129. 129. @tyler_treat log.error("User login failed", event=LOGIN_ERROR, user="tylertreat", email="tyler.treat@realkinetic.com", error=error)
  130. 130. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "user": "tylertreat", "email": "tyler.treat@realkinetic.com", "error": "Invalid username or password", "message": "User login failed" }
  131. 131. @tyler_treat JSON is fine.
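One way to get from the key-value `log.error(...)` call on the slides to JSON output, sketched with only the Python standard library. `JsonFormatter` and `log_event` are hypothetical names for illustration; the slides do not say which logging library was used, and the timestamp format here is an assumption.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields are attached to the record by log_event below.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def log_event(logger, level, message, **fields):
    """Hypothetical helper: log a message plus arbitrary key-value fields."""
    logger.log(level, message, extra={"fields": fields})


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log_event(logger, logging.ERROR, "User login failed",
          event="user_login_error", user="tylertreat",
          error="Invalid username or password")
```

Keeping the human-readable message constant and moving the variable parts into named fields is what makes the output queryable downstream.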
  132. 132. @tyler_treat Pass a context object to everything.
  133. 133. @tyler_treat def login(ctx, username, email, password): ctx.set(user=username, email=email) ... log.error("User login failed", event=LOGIN_ERROR, context=ctx, error=error) ...
  134. 134. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }
  135. 135. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "ERROR", "event": "user_login_error", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat", "email": "tyler.treat@realkinetic.com" }, "error": "Invalid username or password", "message": "User login failed" }
  136. 136. @tyler_treat What goes on the context?
  137. 137. @tyler_treat What can you get for “free” and what do you need to pass along?
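A minimal sketch of what such a context object might look like. The `Context` class and `handle_request` are assumptions made for illustration; only `ctx.set(...)` and the field names come from the slides. It separates fields the framework can supply "for free" at the edge of a request from fields the application has to add itself.

```python
import uuid


class Context:
    """Hypothetical request-scoped bag of fields attached to every log entry."""

    def __init__(self, **fields):
        # An ID is generated up front so all entries for one request correlate.
        self._fields = {"id": uuid.uuid4().hex}
        self._fields.update(fields)

    def set(self, **fields):
        self._fields.update(fields)

    def to_dict(self):
        return dict(self._fields)


def handle_request(http_method, http_path):
    # "Free" fields: known to the framework before any application code runs.
    return Context(http_method=http_method, http_path=http_path)


def login(ctx, username, email):
    # Passed-along fields: only the application knows who is logging in.
    ctx.set(user=username, email=email)
    return ctx


ctx = login(handle_request("POST", "/login"), "tylertreat", "tyler.treat@realkinetic.com")
print(ctx.to_dict())
```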
  138. 138. @tyler_treat Create standard specs for each data type collected (logs, metrics, traces).
  139. 139. @tyler_treat Specs can enforce required fields (e.g. user id, license, trace id) and data types.
  140. 140. @tyler_treat { "timestamp": "2019-04-05 13:26.42", "level": "INFO", "event": "user_login", "context": { "id": "accfbb8315c44a52ad893ca6772e1caf", "http_method": "POST", "http_path": "/login", "user": "tylertreat",
 "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
 "license": "942e6543f0844be680e72003d5e060fd", "email": "tyler.treat@realkinetic.com" } }
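As a rough illustration of a spec that enforces required fields and data types, the check below validates an entry shaped like the one above. The field lists in `LOG_SPEC` and `CONTEXT_SPEC` are assumptions; a real spec would be agreed on by the organization and might be expressed in a schema language instead.

```python
# Hypothetical spec: field name -> required Python type.
LOG_SPEC = {"timestamp": str, "level": str, "event": str, "context": dict}
CONTEXT_SPEC = {"id": str, "user_id": str, "license": str}


def validate(entry, spec):
    """Return a list of spec violations (empty when the entry conforms)."""
    problems = []
    for field, expected in spec.items():
        if field not in entry:
            problems.append(f"missing required field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


def validate_log(entry):
    """Validate the top-level entry and, if present, its nested context."""
    problems = validate(entry, LOG_SPEC)
    if isinstance(entry.get("context"), dict):
        problems += validate(entry["context"], CONTEXT_SPEC)
    return problems


entry = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
    },
}
print(validate_log(entry))
```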
  141. 141. @tyler_treat Be mindful not to log sensitive data like passwords.
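One possible safeguard is to mask known-sensitive keys before an entry is written. This is only a sketch: the key list in `SENSITIVE_KEYS` is an assumption, and key-based masking will not catch secrets embedded in free-text messages.

```python
SENSITIVE_KEYS = {"password", "token", "secret"}  # assumed list; extend per spec


def redact(value):
    """Return a copy with sensitive keys masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


print(redact({"event": "user_login", "context": {"user": "tylertreat", "password": "hunter2"}}))
```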
  142. 142. @tyler_treat 2. Specification Libraries: Specs alone aren’t enough!
  143. 143. @tyler_treat Empowering developers requires providing tools that align the “easy” path with the “right” path.
  144. 144. @tyler_treat We need libraries that implement the specs and make it easy for devs to instrument their systems.
  145. 145. @tyler_treat There are many existing libraries for structured logging. • Java: log4j • Go: logrus • Python: structlog • Ruby: ruby-cabin • .NET: serilog • JS: structured-log • etc.
  146. 146. @tyler_treat For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.
  147. 147. @tyler_treat 3. Data Collector: We need a lightweight agent that can collect data from hosts/containers.
  148. 148. @tyler_treat Collect data, perform transformations/filters, and write it to the data pipeline.
  149. 149. @tyler_treat Typically runs as an agent on the host (DaemonSet in Kubernetes).
  150. 150. @tyler_treat Data is written to stdout/stderr or a Unix domain socket.
  151. 151. @tyler_treat Just use Fluentd or Logstash (+Beats).
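To make the collector's job concrete, here is a toy sketch of the collect, filter/transform, write loop. It is not a substitute for Fluentd or Logstash; the `collect` function, its parameters, and the `host` field are all hypothetical, and a real agent would also handle batching, retries, and backpressure.

```python
import io
import json


def collect(source, sink, keep=lambda entry: True, transform=lambda entry: entry):
    """Read JSON lines from `source`, filter and transform them, write to `sink`.

    Returns the number of entries forwarded.
    """
    forwarded = 0
    for line in source:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            # Unstructured output is wrapped rather than dropped.
            entry = {"message": line}
        if not keep(entry):
            continue
        sink.write(json.dumps(transform(entry)) + "\n")
        forwarded += 1
    return forwarded


# Stand-ins for a container's stdout and the pipeline's ingest endpoint.
app_output = io.StringIO(
    '{"level": "DEBUG", "message": "noise"}\n'
    '{"level": "ERROR", "message": "User login failed"}\n'
)
pipeline = io.StringIO()
collect(
    app_output,
    pipeline,
    keep=lambda e: e.get("level") != "DEBUG",      # filter
    transform=lambda e: {**e, "host": "web-1"},    # enrich
)
print(pipeline.getvalue(), end="")
```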
  152. 152. @tyler_treat 4. Data Pipeline: We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.
  153. 153. @tyler_treat This also provides a buffer that decouples producers from consumers.
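The buffering idea can be illustrated in miniature with an in-memory bounded queue. This is an analogy only: a real data pipeline would be a durable, distributed stream, whereas `run_pipeline` here is a hypothetical single-process demo. The point it shows is that the producer and consumer never call each other directly and can run at different speeds.

```python
import queue
import threading

_DONE = object()  # sentinel telling the consumer to stop


def run_pipeline(n_events, capacity):
    """Push n_events through a bounded buffer and return what the consumer saw."""
    buffer = queue.Queue(maxsize=capacity)
    received = []

    def producer():
        for seq in range(n_events):
            # put() blocks while the buffer is full, which applies backpressure.
            buffer.put({"seq": seq})
        buffer.put(_DONE)

    def consumer():
        while True:
            item = buffer.get()
            if item is _DONE:
                break
            received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(run_pipeline(5, capacity=2))
```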
  154. 154. @tyler_treat System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent
  155. 155. @tyler_treat System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent
  156. 156. @tyler_treat Lots of options…
  157. 157. @tyler_treat
  158. 158. @tyler_treat 5. Data Router: We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.
  159. 159. @tyler_treat May perform transformations and processing of data, but heavy processing should be the responsibility of a backend system (e.g. alerting or aggregations).
  160. 160. @tyler_treat This is where the data spec comes into play.
  161. 161. @tyler_treat The data type determines how incoming data is routed.
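Routing by data type could be sketched as a small lookup table from type to backend writers. The `Router` class and the `"type"` field on each record are assumptions for illustration; in practice each backend callable would wrap a vendor SDK or API client.

```python
class Router:
    """Hypothetical stateless router: data type -> list of backend writers."""

    def __init__(self):
        self._routes = {}

    def register(self, data_type, backend):
        self._routes.setdefault(data_type, []).append(backend)

    def route(self, record):
        """Send the record to every backend registered for its type.

        Returns the number of backends that received it.
        """
        backends = self._routes.get(record.get("type"), [])
        for backend in backends:
            backend(record)
        return len(backends)


# Lists stand in for real backends such as a logging service or cold storage.
log_store, archive, metric_store = [], [], []

router = Router()
router.register("logs", log_store.append)
router.register("logs", archive.append)      # the same data can fan out
router.register("metrics", metric_store.append)

router.route({"type": "logs", "message": "User login failed"})
router.route({"type": "metrics", "name": "login_failures", "value": 1})
print(len(log_store), len(archive), len(metric_store))
```

Swapping or trialing a tool then becomes a change to the routing table rather than a redeploy of agents on every host.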
  162. 162. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics
  163. 163. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics
  164. 164. @tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics
  165. 165. @tyler_treat This is primarily a stateless component writing to APIs.
  166. 166. @tyler_treat Good fit for “serverless” solutions.
  167. 167. @tyler_treat Piecing It All Together
  168. 168. @tyler_treat
  169. 169. @tyler_treat You don’t need to build it out all in one go.
  170. 170. @tyler_treat There are quick wins along the way!
  171. 171. @tyler_treat Evolving to an Observability Pipeline • Adopt structured logging • Move log/data collection out of process • Use a centralized logging system • Introduce a streaming data solution • Start adding data consumers
  172. 172. @tyler_treat Moving from host-centric to service-centric observability.
  173. 173. @tyler_treat This maps to VMs and containers as well as it does to “serverless” models.
  174. 174. @tyler_treat Ops, Systems, Production, Product Development, Product Management, Security & Compliance, Support/Helpdesk
  175. 175. @tyler_treat Dev/Ops/SRE Systems Production Audit Business Analytics Pricing Decisions Data-Driven Product Decisions Threat Detection Monitoring Debugging & Operational Insights ...
  176. 176. @tyler_treat Dev/Ops/SRE Systems Production
  177. 177. @tyler_treat Dev/Ops/SRE Systems Production
  178. 178. @tyler_treat Dev/Ops/SRE Systems Production
  179. 179. @tyler_treat Dev/Ops/SRE Systems Production
  180. 180. @tyler_treat Dev/Ops/SRE Systems Production
  181. 181. @tyler_treat Dev/Ops/SRE Systems Production
  182. 182. @tyler_treat Benefits • Pattern can be adopted incrementally, with quick wins along the way • Maps better to elastic and serverless architectures • Empowers teams in siloed organizations and unlocks data for other parts of the business • Enables teams to use the tools best suited to their needs • Decoupling makes it easier to change tools or evaluate them side by side • Minimizes impact on developers and the core system
  183. 183. @tyler_treat But it’s not a silver bullet.
  184. 184. @tyler_treat Downsides • Moving away from an agent-based model means we have to handle data routing ourselves • Many of the Data Router components might need to be custom-made using various vendor SDKs or client libraries (assuming they have APIs) • This also means we might lose some of the value-add features of certain agents • Unclear how well this maps to pull-based models (e.g. Prometheus)
  185. 185. @tyler_treat CI/CD Pipeline +
 Observability Pipeline
  186. 186. @tyler_treat CI/CD: Pre-Production (theorizing about known unknowns). Observability: Post-Production (learning from unknown unknowns).
  187. 187. @tyler_treat Thank You realkinetic.com
 bravenewgeek.com