Meetup presentation on Feb 27th 2019 at the Dock8s Meetup in Heidelberg/Rhein-Neckar, at the Verivox campus.
The talk touches on all areas involved in the cloud journey of a major product (iDesk2) of the Haufe Group: planning & politics, technology, and doing operations for that product as a DevOps team.
Cloud Journey: Lifting a Major Product to Kubernetes
1. Welcome to the Dock8s Meetup
Robert Werlich
Site Reliability Engineer
robert.werlich@verivox.com
Marlen Blaube
Senior HR Business Partner
marlen.blaube@verivox.com
2. Cloud Journey: Lifting a Major
Product to Kubernetes
Dock8s Meetup Heidelberg
Feb 27th 2019
Martin Danielsson, Haufe Group, Freiburg
@donmartin76 (Twitter, Github)
3. Dock8s Meetup Heidelberg, February 27th 2019
whoami
C:> WINDOWS.EXE: C/C++/C# background, 10+ years
$ docker ps: containers & Kubernetes for ~4 years
wicked.haufe.io maintainer (OSS API Management)
Solution Architect & Developer since 2006
7. Dock8s Meetup Heidelberg, February 27th 2019
Some numbers
100+ active git repos
874k LOC
10-15 developers
200-500 concurrent users
Typically 100 req/s
448 GB RAM, 56 cores
9. Major revenue
Strategic move to containers
Modular architecture
No prior container experience
Hosted with an external hoster (€€€)
Long release cycles
(LOTS of) manual work for releases
Little operations insight
Error tracking very difficult
Non-parity Dev/Test/Prod (cost!)
Legacy web app (Java based)
10. Dock8s Meetup Heidelberg, February 27th 2019
Vision – Goals
Enabling CI/CD
Automatic Provisioning
Full Insight
Minimize Ops
13. Dock8s Meetup Heidelberg, February 27th 2019
Stakeholder Management
Convince them, don't persuade them
Communicate often and clearly
Don't underestimate the tasks at hand
Be transparent
Share successes, but also failures!
14. Dock8s Meetup Heidelberg, February 27th 2019
Team Setup – Vision
100% DevOps Engineers
T-Shaped Engineers
No dedicated manual testers
Automate! You build it, you run it (YBI, YRI). Ops experience?
15. Dock8s Meetup Heidelberg, February 27th 2019
Some HR topics
Release managers?
Operations responsibility?
Quality engineers (testers)?
On-call duty?
18. Dock8s Meetup Heidelberg, February 27th 2019
Steps to DevOps Happiness
Provision → Deploy → CI/CD
Weekly for Production, Daily for Dev/Test
Ship when ready!
19. Dock8s Meetup Heidelberg, February 27th 2019
Wait, uh, what…?
Target: "No-Ops"
No long-running systems
Enable validation of 3rd-party component upgrades
Incremental changes
Practice disaster recovery daily
100% reproducible deployments
On-demand production-identical environments
20. Dock8s Meetup Heidelberg, February 27th 2019
So, it's all… Code & Pipelines
… and pipelines are also code
21. Dock8s Meetup Heidelberg, February 27th 2019
Incremental Backend Development
1. Merge feature to master (after code review, including test suite changes)
2. Build master branch (includes unit testing, first integration tests)
3. Deploy to integration system (Blue/Green with integration tests)
4. Deploy to production (Blue/Green with integration tests)
A minimal sketch of the Blue/Green switch follows below.
22. Dock8s Meetup Heidelberg, February 27th 2019
Incremental Frontend Development
1. Merge feature to master (after code review, including test suite changes)
2. Build master branch (includes unit testing, first integration tests)
3. Deploy to integration system (run e2e integration tests, roll back if failing)
4. Deploy to production (run e2e integration tests, roll back if failing)
24. Dock8s Meetup Heidelberg, February 27th 2019
Full Provisioning
1. Create backup
2. Provision new infrastructure (from backups, same path as disaster recovery!)
3. Deploy components (using deployment pipelines, partly parallelized)
4. Top-level DNS switch (using a DNS traffic manager)
5. Destroy old infrastructure (if tests succeed)
A skeleton of this flow follows below.
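As a rough illustration, here is a Python skeleton of that flow. Every helper is a hypothetical stub standing in for the real infrastructure-as-code templates, deployment pipelines and DNS traffic manager.

```python
# Skeleton of the full-provisioning flow above; all helpers are stubs.

def create_backup() -> str:
    print("1. creating backup of persistent data")
    return "backup-2019-02-27"

def provision_infrastructure(backup_id: str) -> str:
    print(f"2. provisioning new infrastructure from {backup_id} "
          "(same path as disaster recovery)")
    return "stack-green"

def deploy_components(stack: str) -> None:
    print(f"3. deploying components to {stack} via pipelines, partly in parallel")

def tests_pass(stack: str) -> bool:
    print(f"running tests against {stack}")
    return True

def switch_dns(stack: str) -> None:
    print(f"4. pointing the DNS traffic manager at {stack}")

def destroy_old_infrastructure() -> None:
    print("5. destroying the old infrastructure")

if __name__ == "__main__":
    stack = provision_infrastructure(create_backup())
    deploy_components(stack)
    if tests_pass(stack):
        switch_dns(stack)
        destroy_old_infrastructure()
```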
25. Dock8s Meetup Heidelberg, February 27th 2019
Persistence Options
Roll your own persistence:
• Self-managed VMs (incl. NFS)
• Gluster/Ceph FS (cluster)
Persistence "as a service":
• Managed disks (AWS EBS, Azure Managed Disks)
• DBaaS (many options)
• Files as a service (AWS EFS, Azure Files)
26. Dock8s Meetup Heidelberg, February 27th 2019
iDesk2 Deployment Architecture
Resource Group: Kubernetes cluster (k8s master, k8s agents 1…n), NFS VM(s) with disks, Postgres VM(s) with disks
Why self-managed NFS and Postgres VMs?
• Azure Files not fast enough
• Legacy components depend on UNIX rights (Azure Files is SMB)
• Azure Disks only support ReadWriteOnce
• Azure PGaaS was not yet available
• More "bang for your buck"
• PG admin knowledge in the team
28. Dock8s Meetup Heidelberg, February 27th 2019
Some hints…
Assess your persistence needs early on
If possible, use DBaaS (avoid NIH syndrome)
Externalize configuration (see the sketch below)
Shared file storage is not "Cloud Native"
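"Externalize configuration" in practice often means reading settings from the environment, e.g. injected via Kubernetes ConfigMaps and Secrets, instead of baking them into the image. A minimal sketch; the variable names are made up for illustration:

```python
# Externalized configuration sketch: all settings come from the
# environment, so the same image runs in Dev, Test and Prod.
import os

DATABASE_URL = os.environ["DATABASE_URL"]        # required: fail fast if missing
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")  # optional, with a default
FEATURE_FLAGS = [f for f in os.environ.get("FEATURE_FLAGS", "").split(",") if f]
```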
30. Dock8s Meetup Heidelberg, February 27th 2019
Now that we have Kubernetes…?
Self-healing
Robust
Production ready
Battle proven
"Vertrauen ist gut… Kontrolle ist besser!" ("Trust is good… control is better!")
But also: complex, and an additional abstraction layer
31. Dock8s Meetup Heidelberg, February 27th 2019
"Kontrolle" (control) - What do you mean?
Detecting these things is a start...
32. Dock8s Meetup Heidelberg, February 27th 2019
Fail: Lyin' Monitors
End-to-end monitoring reported ALL GOOD…
…while people logging in got 500 errors…
… for an entire weekend.
34. Dock8s Meetup Heidelberg, February 27th 2019
Prometheus
A service exposes a metrics endpoint (e.g. http://A:8080/metrics); Prometheus scrapes it at regular intervals into its time series DB.
Metric sources:
JVM metrics
Node.js metrics
VM exporters (node_exporter)
DB exporters (pg_exporter)
Kubernetes statistics
Custom exporters based on Prometheus client libraries
...
An instrumentation sketch follows below.
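A minimal instrumentation sketch with the official Python client, prometheus_client. The talk's services are JVM and Node.js based, so this is illustrative only; the metric names are made up.

```python
# Expose a /metrics endpoint that Prometheus can scrape.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

LOGINS = Counter("app_logins_total", "Number of successful logins")
REQUEST_SECONDS = Histogram(
    "app_request_duration_seconds",
    "Request latency in seconds",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_request() -> None:
    with REQUEST_SECONDS.time():               # record the handler's duration
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real work
    LOGINS.inc()

if __name__ == "__main__":
    start_http_server(8080)  # serves http://localhost:8080/metrics
    while True:
        handle_request()
```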
36. Dock8s Meetup Heidelberg, February 27th 2019
Metrics
Origin: white box (measured inside your stack) vs. black box (probed from outside)
Types: counters, gauges, histograms, summaries
Sources: application metrics; network (latencies, errors, timeouts); infrastructure (disk space, CPU, memory, pod status)
46. Dock8s Meetup Heidelberg, February 27th 2019
Service Level Indicators
Percentage of document retrieval requests served within 0.25 and 1s: 95% and 98.5%
Percentage of search requests answered within 1, 3 and 7.5s: 50%, 95% and 98.5%
Percentage of error pages: <1%
These indicators back our Service Level Agreements.
A sketch of computing such an indicator follows below.
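Such an indicator can be derived from a latency histogram. A hedged sketch, assuming a reachable Prometheus server, an instant query via its HTTP API, and a made-up metric name and time window:

```python
# Compute "fraction of document requests served within 1s" from a
# Prometheus histogram. URL, metric name and window are assumptions.
import requests

PROMETHEUS = "http://prometheus:9090/api/v1/query"
QUERY = (
    'sum(rate(document_request_duration_seconds_bucket{le="1.0"}[1h]))'
    " / "
    "sum(rate(document_request_duration_seconds_count[1h]))"
)

def fraction_within_1s() -> float:
    resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1])  # instant vector: [timestamp, value]

if __name__ == "__main__":
    print(f"{fraction_within_1s():.2%} served within 1s (target: 98.5%)")
```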
48. Dock8s Meetup Heidelberg, February 27th 2019
Holistic View
Instrument early (and lots)
Deployments are easier
Less fear of change
We are in control! (At least, we hope and think we are.)
49. Dock8s Meetup Heidelberg, February 27th 2019
Fails: Resiliency Issues
VMs are sometimes patched and restarted. Or they just die. So will any service on them.
Networks are unreliable. Connections will fail. Use (libraries for) circuit breakers and retries.
Re-establishing TLS on each call to external services is expensive… and the service will hate you. Use Keep-Alive.
SPOFs will eventually fail. Assess and act.
Learn how to detect problems.
52. Dock8s Meetup Heidelberg, February 27th 2019
Key Performance Indicators
• >70% cost saving
• Release effort down >98% via automation
• Higher release pace (3-5 per year to 15-20 per month)
• Performance measurable
• Faster reaction to issues
• Unlocks cloud technology
53. Dock8s Meetup Heidelberg, February 27th 2019
k8s ops is possible as a team
Requires full automation (including tests)
Team dedication: rethinking ops is challenging
No silver bullet: assess your requirements
Could just as well have been AWS; Azure was investigated first as we didn't know whether we would need to go to Azure Germany (this was 2017).
This has a couple of implications:
You need backups for persistent data inside the cluster
You must be able to automatically restore them
You will also get a certain amount of "non-persisted" time (time where you cannot persist user changes). For Aurora, this is around 90 minutes each Tuesday early morning. Acceptable for us, but it may not be acceptable for other teams.
Instrument your components to expose (possibly) interesting metrics.
Rather instrument more: if you do it from the start, it doesn't hurt much, and adding metrics later is also rather easy.
Monitor and alert on anticipated failures or known previous issues, if for some reason you cannot find or fix the root cause. (With "monitor and alert", I also subsume logging and tracing here.)
Enable insight and visualization - or “debugging” if you will - to see inside your system what might have gone wrong.
“This is what you would call ‘instrumenting’ your code” - exporting metrics from it
You would use a client library (there are client libraries for most programming languages). It takes your application's current state of all tracked metrics, transforms it into a format that Prometheus understands, and exposes it via an HTTP endpoint, which Prometheus scrapes at regular intervals.
There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly.
What can we do with that data - Two examples: Dashboarding and Alerting
E.g. Grafana can use Prometheus as a data source via the Prometheus Query Language to display time series as a graph, e.g. for dashboarding.
Simultaneously, Prometheus can evaluate certain expressions to see whether alerts have to be triggered. These are then passed on to another component of Prometheus, the Alertmanager, which in turn makes sure the alerts are delivered to wherever they should go. For us, that's both Rocket Chat and e-mail.
One step back, what kind of metrics exist? Let’s look at a couple of categories - first, white box and black box. That’s where the metrics come from - do you measure them inside your stack (white box), or do you probe from the outside - black box.
Hint: You should do both.
Bottom left you see the metric types Prometheus specifically supports: counters (things which only increase), gauges (things which go up and down), histograms (to see a discrete distribution) and summaries (for quantiles). A short example in the Python client follows below.
Bottom right you see the sources of metrics - infrastructure (things like disk space, CPU and memory utilization), network (latencies, errors, timeouts and such) and perhaps the most interesting bit - your own application metrics.
Recall - there is no automatic way of retrieving all of your application specific metrics - this is the instrumentation bit.
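For illustration, the four metric types in the Python client (prometheus_client); the metric names are made up:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

REQUESTS = Counter("requests_total", "Only ever increases")
ACTIVE_SESSIONS = Gauge("active_sessions", "Goes up and down")
LATENCY = Histogram("latency_seconds", "Distribution via discrete buckets")
PAYLOAD = Summary("payload_bytes", "Count and sum; some clients add quantiles")

REQUESTS.inc()           # count an event
ACTIVE_SESSIONS.set(42)  # set the current value
LATENCY.observe(0.25)    # increments every bucket with le >= 0.25
PAYLOAD.observe(512)     # contributes to count and sum
```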
It was in parts an eye opener to us when we started looking at metrics...
By simply inspecting response times on various endpoints, we could pinpoint issues we weren't really aware of, which helped us deliver an even better experience on our website.
Mind you, all of these things were already in the logs. But who reads logs when you don't REALLY have a problem? Any takers?
Typical "Newsletter Friday": the editors of one of the largest products send out newsletters each Friday, which we immediately see in the login numbers.
So, what's this number?
It's the number of individual time series we collect from our production system. Prometheus can handle lots more, up to millions, but it's still quite a number of things to look at and evaluate.
OK, so, great. We have a bunch of metrics. What do we do with those?
Of course you should alert on infrastructure failure, if the failure entails any need for intervention. If you can automatically recover, there is no need to alert.
Rule of thumb: Alerts should be ACTIONABLE. If there’s an alert - you should have to do something (even if it’s just investigating). If an alert doesn’t require any actions - chances are good you should not alert on it (and just collect statistics).
We have found out that this is dang hard though.
The other thing that is just plain clear is that you must make sure that your application is available, probably by using some black-box type of end-to-end test. If your application isn't available, getting it back up and running must be your top priority (but that's obvious).
Is that enough though?
Let's say we have 99.99% availability, does that mean everything is fine? No. We must find additional metrics to measure how well we are doing.
Actually, we would like to measure user happiness. We are doing that with NPS and the "Kundenbarometer" (customer barometer) survey, but we'd like to have at least an approximation in real time. Well, you can't measure happiness directly, but you can approximate it via the definition of functional and non-functional requirements you know (or at least assume) are important for customer happiness.
Typical things are: Latencies or expected runtimes, and of course that your application does what it’s intended to do.
This takes us back to metrics and calculated metrics, in other words KPIs, or SLIs, Service Level Indicators.
Disclaimer: This is not an exact science, but always a guesstimate. Rule of thumb should at least be: If these indicators are off, the customer will definitely be UNHAPPY.
And in addition to these, we of course also track the availability, where we also have an SLA.
So, as these are the values to which we will be held accountable, we better also alert on these.
We have gathered a more holistic view on our application - we no longer just look at what has to be developed, we also, from the start, look at how the components will behave at runtime, and how we can observe them.
We don't have to think very hard about how and where to run things: we have solved most tricky problems using Kubernetes and the toolset around it, and we just have to re-apply patterns, relying on the fact that most problems aren't so exotic that nobody has solved them yet.
We have a lot less fear of changing things. Since everything is built up as code, everything is easily and fairly quickly reproducible, and we can efficiently test changes up front.
We have gathered a feeling that we are in control. At least, we hope and think we are in control. And that’s a nice feeling.
Restarted VMs:
Redis cluster failed after restart
AppServer could not reconnect to Redis
Pods running only once? SPOF. Expect failures.
Circuit breakers: currently Hystrix, investigating Istio/linkerd.
TLS: the external semantic search service's load balancer clogged up after a couple of hours of traffic without keep-alive. A sketch of retries plus keep-alive follows below.
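A sketch of both fixes in Python, using requests and urllib3: a shared Session pools connections, so TLS is negotiated once and kept alive, and urllib3's Retry adds bounded retries with backoff. The search URL, retry policy and timeouts are placeholders, not the talk's actual setup.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()  # connection pooling => HTTP keep-alive, so the
                              # TLS handshake is not repeated on every call
retries = Retry(
    total=3,                           # bounded retries, no infinite hammering
    backoff_factor=0.5,                # exponential backoff between attempts
    status_forcelist=(502, 503, 504),  # retry only transient server errors
)
session.mount("https://", HTTPAdapter(max_retries=retries))

def semantic_search(query: str) -> dict:
    resp = session.get(
        "https://search.example.com/api/search",  # placeholder URL
        params={"q": query},
        timeout=(3.05, 10),  # connect/read timeouts: fail fast, then retry
    )
    resp.raise_for_status()
    return resp.json()
```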