Talk given by Thomas Stocking at Icinga Camp San Francisco 2016 - https://www.icinga.org/community/events/archive/2016-archive/icinga-camp-san-francisco/
Monitor OpenStack Environments from the bottom up and front to back
Thomas Stocking, Director of Systems Engineering, Founder, GroundWork Open Source Inc.
February 16, 2016 | Icinga Camp
Thanks for attending Icinga Camp San Francisco! I’m happy to be here and give this little talk about OpenStack monitoring. I think this will take about 40-45 minutes, maybe less if I talk fast. I’ll be happy to take your questions at the end, so write them down or remember them, and ask away when we get there. I’ll try to keep it interesting.
So, I work for GroundWork Open Source Inc. We are involved in the effort to make OpenStack easier to manage and monitor, and this presentation will show you a little more about that, and how and where the various tools fit in.
I’ll talk about how the space is evolving, tell the story of monitoring, as it were. Things are changing, and with DevOps, automation, and a changing landscape, monitoring will have to change.
We will go over some high-level concepts and ideas we have, and in conclusion share some of the things we have learned after 12 years in the space.
Lots of companies are adopting OpenStack, for lots of reasons, mostly efficiency (more on that later). It’s for real, and growing.
It’s open source, which really means that innovation is free to move ahead, and it is. OpenStack is innovative and flexible, and is getting a lot of attention.
Its newness is driving more waves of change, and retooling for OpenStack management is under way.
Geeks love it because it is new and shiny, but it’s gaining ground because it’s not just a geek tool, or a Silicon Valley fad. There are a lot of serious members and corporate sponsors.
Let me tell you a story about the old way we used to do deployments, maybe 10 years ago or more.
This will be familiar to some of you, I’m sure.
You want to deploy an app? Make a multi-tiered network,
Put some servers into racks and plug them in to the top-of-rack switches
Load the apps on the servers
Add monitoring to each one
Add your ssh keys and plugins, turn on SNMP
Lather, rinse and repeat
You get a nice dashboard and email alerts. It works.
But virtualization came along, for a lot of reasons. Ease of deployment, sure, but also application isolation in VMs made the apps more secure.
A big driver was overprovisioning: in order to handle peaks, servers were overprovisioned, so most pre-virtualization servers averaged only about 15% of capacity.
Afterwards this jumped to 35% (some claim more). It’s a big jump.
Fundamentally, though, monitoring didn’t change much, except that you were checking VMs, so the results for CPU and memory were not exactly true when checked with plugins.
An important addition was that the hypervisors were instrumented with APIs (like vSphere). These exposed some data on the hypervisor, and if you knew how, you could grab it for monitoring.
But the story continued. Soon the network was virtual, and the switch fabric more and more became defined in software.
We are moving to software-defined everything; even storage systems use virtualized disks, LUNs, etc.
Cloud services started getting more widely used, and APIs were offered.
Microsoft deprecated its Azure API a while back, and is relaunching it in a more hybrid-cloud way. We will see what happens!
At this point, monitoring became blind in some areas. We could no longer get to what we wanted with SNMP and SSH with plugins.
Also, things were getting deployed and redeployed too fast to keep up with the monitoring configuration. Something had to be done.
This is another view of the landscape.
Notice that the silo views have multiplied here. Not that it was a new problem, but the areas of coverage for monitoring are really fragmented now, with the hypervisor seeing a few layers, and the apps seeing the top. There is shared fog on the network side, since the old way of looking at traffic can’t really pick out flows and find problems when services (like storage) are increasingly virtualized, and so hit a network somewhere in the process.
Everything was always connected to everything else, but now it’s more dependent.
And just when you thought it was safe to go back in the water, there are containers floating around.
Containers are interesting from a monitoring perspective: they are opaque from the host OS. You can’t see into them to find out what’s going on.
In fact you need another container to see the details, which is exactly what we did with Boxspy (it’s on GitHub if you need it).
Containers are great for app isolation, but they complicate monitoring the apps quite a bit.
They are also great for dynamic provisioning: just spin up more or fewer of a given component, like app or web servers, or more caching systems, to scale up or down.
The focus so far has been on provisioning, though, and monitoring is a low priority.
The tools we use have to account for this situation, and the likely trends towards more micro-services and isolation.
So, that’s a lot of things to deal with. But don’t panic. Let’s summarize:
There are a lot of virtual stacks around these days, but all of them have APIs that you can use to access them. The flavor of the day is REST APIs, with JSON, so we had better learn how to speak REST.
Pretty much no one uses SOAP any more, and does anyone even remember CORBA? XML-RPC is also considered really old by now.
The trend is also strongly towards mobile-friendly web UIs, and that usually means JavaScript. This is the direction we are seeing the big players take, and through systems like OpenStack they are driving it more and more into the companies that we have as customers. We have to learn to speak REST and JSON.
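To make “speaking REST and JSON” a little more concrete, here is a minimal sketch in Python of the decoding half of the job. The response body and its field names (`resource`, `metrics`, `name`, `value`) are made up for illustration and do not come from any particular API; a real collector would fetch the body over HTTP first.

```python
import json

# Hypothetical response body: the field names below are made up for
# illustration and do not come from any particular API.
SAMPLE_BODY = """
{
  "resource": "vm-01",
  "metrics": [
    {"name": "cpu_util", "value": 42.5},
    {"name": "memory_used_mb", "value": 1024}
  ]
}
"""


def parse_metrics(body):
    """Decode a JSON document and return a {metric name: value} dict."""
    doc = json.loads(body)
    # A document without a "metrics" list simply yields no metrics.
    return {m["name"]: float(m["value"]) for m in doc.get("metrics", [])}


if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE_BODY).items():
        print(name, value)
```

Once the data is a plain dictionary, it hardly matters which stack produced it, which is the appeal of everyone settling on the same wire format.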
So how does OpenStack fit in? Let’s go over how it works at a high level; deeper dives are the subject of much longer talks. We will be in Austin in April for those.
OpenStack uses a service-oriented architecture. It splits off compute, network, storage, authentication and authorization all into their own services.
Using a pluggable model, you can replace services with other compatible components. For instance, you can swap out the default KVM hypervisor for VMware, or even Hyper-V if you want.
On the hardware side, it’s completely transparent: the APIs make hardware vendor choices irrelevant to programming, as long as they are supported. A lot of hardware types are supported, because the device vendors are on board, which accounts for a lot of the success of OpenStack overall, actually.
The other important thing OpenStack does is break down resource allocation into small “elastic compute units”, or ECUs. These are used like quanta: you apply a multiplier to a small unit to get a bigger unit.
That makes the edges of these units align in memory, storage, transport, etc. across the whole system, and makes performance more predictable and easier to optimize. Contrast this with VMware, where you can set the resource level you want for a VM, and then the hypervisor figures out how to deliver that to you when you need it.
Wow, it sounds good! So why hasn't OpenStack taken over the world yet?
The reality is a little more complicated.
This is the OpenStack architecture, and the black box services it contains. Each service has an API that allows control of and visibility into the black box. The focus so far in development has been on provisioning (like with Docker), and monitoring is largely under-coded, and spread among the APIs.
What we are working on at GroundWork is a monitoring façade API that gets all the metrics wherever they are and allows simple interaction with the OpenStack installation. We coded this, and are prepping it for incubation in the OpenStack project, but that process takes time and attention from the managers. In the meantime we put it in our commercial product, but ultimately we want to contribute it to OpenStack.
Here are some counters and metrics we can pull from the façade.
Also for VMs. We found that VMs (part of compute) were better instrumented for monitoring than other parts of the system. Let’s hope that the other managers put some emphasis on monitoring in future!
Ok, this is where it gets interesting.
At a high level, when you are implementing OpenStack, you really aren’t doing it in a vacuum. There are always other components around that you will be monitoring in the normal, older way with SSH and plugins or SNMP, or other common techniques.
There’s going to be a network that is not software-defined. SNMP is dead (long live SNMP), and unless you are at a startup which uses only OpenStack all the time (anyone?), there will be legacy systems running. Also, you are going to have the provisioning systems around, with tools like Mirantis, ManageIQ, Ansible, Chef, etc. You need to monitor those.
In OpenStack itself, also, there are places you need to check that aren’t API-accessible. Is the MySQL instance running? Are the ports for the APIs available? This is basic monitoring that makes sense outside the black boxes of OpenStack.
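As a rough illustration of that kind of basic check, here is a minimal Python sketch that tests whether a TCP port accepts connections. The host address and the port numbers in `CHECKS` are assumptions for the example (3306 is the usual MySQL default, 5000 a common identity API port); substitute whatever your installation actually uses. An open port only shows that something is listening, not that the service behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def run_checks(checks):
    """Map each check name to 'OK' or 'CRITICAL'."""
    return {
        name: "OK" if port_is_open(host, port) else "CRITICAL"
        for name, (host, port) in checks.items()
    }


# Assumed endpoints, for illustration only.
CHECKS = {
    "mysql": ("127.0.0.1", 3306),
    "identity-api": ("127.0.0.1", 5000),
}

if __name__ == "__main__":
    for name, status in run_checks(CHECKS).items():
        print(name, status)
```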
And all of these systems will produce performance data that you will want to capture and store and compare. More on that in a minute.
What we are talking about here from a high-level management perspective is a unified view: all the data from all the monitoring combined into one place to allow for analysis and alerting. To get that, we need to standardize the data.
You also have (in any environment) a plethora of monitoring tools. Usually these are what we call “silo tools”: specific to certain groups of admins, with certain functions like network, database, security, etc.
This makes for all sorts of problems, from finger-pointing to over-monitoring. The biggest issue, in our opinion, is the lack of an overall big-picture dashboard, which allows you to see what’s going on from multiple angles, provided by various monitoring methods and tools.
This unified monitoring dashboard has a feature we call “stack monitoring”, from bottom to top, which means that you get to see a given resource and all the status and metrics associated with it in a single view. The collection of the metrics can be done by the tool of your choice, be it an API call, Icinga, or any other tool, even one from a silo you don’t control.
The stack allows you to correlate metrics across the whole infrastructure.
It also allows you to create filtered sets of infrastructure that are of interest to individual clients (like exec, networking, security, support, etc).
If you are collecting and aggregating data from multiple systems, you will run into the problem of aliasing.
For example, containers, VMs, and storage systems may all have different names for the same resource. Or hostnames under some OSes are capitalized (any guesses here?), while others are not.
By far the biggest issue is agreement among humans. In our experience, as hard as it is to build an effective aliasing system, it’s much harder to convince people to use consistent host naming conventions. And some places have no conventions at all – one person we know named a stack of systems “who?”, “when?”, “where?”, and “are you crazy?”.
The results, while funny, were far from productive!
So here you see two examples. In the first, network interface data and netflow data are combined with IPMI and ID server data, and someone even reported back results under an IPv4 address, but in the unified view we want it all under one device/hostname. The alias we report all of this data under is os-eng-h1, which is an OpenStack installation grouped under hypervisors.
There’s a predictive alert here from IPMI saying that the power supply is apt to fail. Might be good to pay some attention to that soon.
The second example is a Docker container, where we are bringing in data from Boxspy, from the API of the hypervisor where the Docker host VM is running, and from that VM itself, all under the alias of the container.
So you can see that aliasing gives us the unification we want and need.
Full auto-discovery and name alignment just doesn’t exist (yet). You need this kind of aliasing.
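The aliasing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the GroundWork product implements it: the `AliasRegistry` class, the `unify` helper, and the alternate names and IP address in the usage example are all hypothetical. Only the canonical name os-eng-h1 comes from the example above.

```python
class AliasRegistry:
    """Maps the many names a resource is reported under to one canonical name."""

    def __init__(self):
        self._aliases = {}

    @staticmethod
    def _normalize(name):
        # Case-insensitive matching handles systems that capitalize hostnames.
        return name.strip().lower()

    def register(self, canonical, *aliases):
        for name in (canonical, *aliases):
            self._aliases[self._normalize(name)] = canonical

    def resolve(self, name):
        # Unknown names pass through unchanged so that no data is dropped.
        return self._aliases.get(self._normalize(name), name)


def unify(registry, reports):
    """Group (reported name, metric, value) tuples under canonical names."""
    unified = {}
    for reported_name, metric, value in reports:
        unified.setdefault(registry.resolve(reported_name), {})[metric] = value
    return unified


if __name__ == "__main__":
    registry = AliasRegistry()
    # Hypothetical alternate names for the host from the example.
    registry.register("os-eng-h1", "OS-ENG-H1.example.com", "10.0.0.15")
    reports = [
        ("OS-ENG-H1.example.com", "if_in_octets", 1200),
        ("10.0.0.15", "psu_status", "predictive failure"),
        ("os-eng-h1", "cpu_util", 37.0),
    ]
    print(unify(registry, reports))
```

The lookup table is the easy part. The work is in deciding which names belong together, which is exactly where the human disagreement comes in.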
But what about automation? We are definitely seeing a trend towards no manual anything as an ideal, and the closer we get in monitoring the more time we can spend on important things like root cause analysis.
So how do we deal with the fact that virtualization makes dynamic provisioning of VMs and containers really easy, and in fact, really useful? We want to monitor the newly added (and remove the newly deleted) systems right away.
There’s no way we will do this by hand – there’s no time, and it’s going to happen at 3am, when no one will make any changes anyway.
So we need a way to make things show up automatically in monitoring.
We can spin up instances with agents installed. These will auto-register and start monitoring with plugins, and extra plugins can be automatically downloaded when needed.
We can also collect the data from the API and start metric collection on the new VMs and containers automatically, so the data will quickly appear in the monitoring system.
We call this process continuous configuration discovery.
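At its core, this kind of discovery is a repeated comparison between what the monitoring system knows about and what the platform API currently reports. Here is a minimal Python sketch of one pass, assuming both sides can be reduced to sets of resource names; the function names and the instance names in the usage example are hypothetical.

```python
def plan_changes(monitored, discovered):
    """Return (to_add, to_remove) given two sets of resource names."""
    to_add = sorted(discovered - monitored)      # new VMs or containers
    to_remove = sorted(monitored - discovered)   # resources that were deleted
    return to_add, to_remove


def sync(monitored, discovered):
    """Apply one discovery pass and return the updated monitored set."""
    to_add, to_remove = plan_changes(monitored, discovered)
    updated = (set(monitored) | set(to_add)) - set(to_remove)
    return updated, to_add, to_remove


if __name__ == "__main__":
    monitored = {"web-01", "web-02", "db-01"}
    # In practice this set would come from the platform API on each polling cycle.
    discovered = {"web-01", "web-03", "db-01"}
    updated, added, removed = sync(monitored, discovered)
    print("add:", added, "remove:", removed)
```

Run on a schedule, this keeps the monitoring configuration in step with the platform, even at 3am.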
Let’s take a look at some of the tools we can use to set up OpenStack for monitoring.
In the context of deployment, you will need a way to gather all the particulars of the OpenStack installation together and deploy the installation, and then use that same tool to manage it and as a repository of the data needed to monitor it.
We find Fuel (open source) to be excellent at this. It has the ability to fully validate the parameters of the install, including the problematic networking details like VLAN tagging, addressing schemes, etc. that are the cause of most issues with new deployments.
Once validated, you can deploy it with one click, and half an hour later have a guaranteed working install of OpenStack.
You can then use another tool like Murano, which is also open source, and provides a repo of apps in VM form. You can then deploy your apps, including Icinga for monitoring if you like, and the agents you might want or need, like Boxspy to monitor containers and OpenTSDB to collect performance data.
Using Murano, you can be sure to have consistency in the versions you deploy in OpenStack.
What about monitoring configuration? Well, Fuel has a lot of details you may need, like SSH keys, the addresses of database servers, and credentials you need for monitoring. Fuel has a lot of plugins, some of which allow you to grab this data out and use it to configure Icinga.
Once you get that going, you can forward all the data from Icinga, the APIs, the agents, etc. to your unified view, and send the performance data to OpenTSDB.
Let’s talk about OpenTSDB.
It’s become more and more important to gather and store raw performance data as time series for graphing and correlation, and ultimately root cause analysis and pattern recognition.
We have found that the combination of OpenTSDB and Grafana (both open source) is excellent for this. We even integrated them into GroundWork!
They can provide a centralized data collector for performance data. OpenTSDB is very storage efficient, and can be expanded on demand and trimmed as needed.
Grafana is a front-end that allows you to make nice dashboards, ad-hoc as well, and to find the outliers in the time series of metrics you store.
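For a sense of what feeding OpenTSDB looks like, here is a minimal Python sketch that builds data points in the JSON shape accepted by OpenTSDB’s HTTP `/api/put` endpoint (metric, timestamp, value, and at least one tag). The metric name, the tag values, and the `http://localhost:4242` base URL are assumptions for the example, and the `send` helper is not called here because it needs a reachable OpenTSDB instance.

```python
import json
import time
import urllib.request


def build_datapoint(metric, value, timestamp=None, **tags):
    """Build one data point in the shape OpenTSDB's /api/put expects."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    if timestamp is None:
        timestamp = time.time()
    return {
        "metric": metric,
        "timestamp": int(timestamp),  # seconds since the epoch
        "value": value,
        "tags": tags,
    }


def encode_batch(points):
    """Serialize a list of data points as a JSON request body."""
    return json.dumps(points).encode("utf-8")


def send(points, base_url="http://localhost:4242"):
    """POST a batch to OpenTSDB. The base URL is an assumption for this sketch."""
    request = urllib.request.Request(
        base_url + "/api/put",
        data=encode_batch(points),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    point = build_datapoint("cpu.util", 42.5, host="os-eng-h1")
    print(encode_batch([point]).decode("utf-8"))
```

The tags are what make correlation possible later: tagging every point with the aliased host name lets a Grafana dashboard line up metrics that arrived from different collectors.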
So, to summarize again, we have built up an architecture for monitoring OpenStack and other similar systems.
Using tools of choice, which have APIs we can use and query, we collect the metrics and compare them to configured thresholds to create status data. We then send this to our REST API internally, normalize it, apply aliases, and present it in a UI (through another API, actually).
The performance data follows a similar path, and ends up where we point it – we can send it to RRDs, or to OpenTSDB.
The good part of this design is that it is easy to write and update the collectors. If the data we want and need is available in a REST API, one collector looks much like another, and can quickly be hooked up to the collector hub.
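To show why one collector looks much like another, here is a minimal Python sketch of the common core: fetch metrics, compare them to configured thresholds, and emit status records. It is an illustration only, not the GroundWork collector code. The `fetch` callable stands in for whatever REST call a given collector makes, and the metric names and threshold values are hypothetical.

```python
OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def evaluate(value, warn, crit):
    """Turn a numeric value into a status using upper-bound thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def collect(fetch, thresholds):
    """Call fetch() for {metric: value} and attach a status to each metric.

    thresholds maps metric name to a (warn, crit) pair. Metrics with no
    configured threshold get status None, i.e. performance data only.
    """
    results = []
    for metric, value in fetch().items():
        status = None
        if metric in thresholds:
            warn, crit = thresholds[metric]
            status = evaluate(value, warn, crit)
        results.append({"metric": metric, "value": value, "status": status})
    return results


if __name__ == "__main__":
    def fake_fetch():
        # Stand-in for a real API call; the values are made up.
        return {"cpu_util": 95.0, "mem_util": 70.0, "disk_io": 5.0}

    for record in collect(fake_fetch, {"cpu_util": (80, 90), "mem_util": (60, 90)}):
        print(record)
```

Only the `fetch` function changes from one data source to the next, so a new collector is mostly a matter of writing that one function.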
We just recently finished the first draft of the Icinga connector, for example, so that Icinga data can be easily integrated into the GroundWork product.
Using this approach, we get true unified monitoring of OpenStack.
Let’s go over the advantages and benefits of this approach.
Automation reduces the cost of maintenance and improves the quality of data, especially in dynamic configuration scenarios.
It leads to streamlined operations, which is good, since operations staff are getting trimmed while infrastructures grow and become more complex.
In terms of hardware, you avoid lock-in by using Linux and OpenStack for everything, and you gain visibility into the hardware you use for OpenStack via the API, in terms of configuration, status, and performance.
With stack monitoring you can reduce finger-pointing and over-monitoring, and gain a view of your applications that is complete, bottom to top and front to back. This helps you optimize resource allocation and put the resources in the right place, at the right level.
Finally, we see more and more that the roles of DevOps and sysadmins are merging (more ops and troubleshooting in DevOps). This is part of a trend we see towards more systems per person, even as high as 1000 to 1 in some cases.
So the conclusions we have reached here actually take two slides to explain. Sorry for the repetition…
First of all, we need to change our tools. Old techniques still have a role, but leave us blind – we need to have a new approach.
Don’t throw away the old tools.
Integrate everything with an API. It’s the only way to keep up.
Aliasing is easier than arguing. Never underestimate the human capacity to delay and deny!
Pick the best tool for the task: no one tool can do everything, and you need the specialization.
Don’t overlap your monitoring tools. Monitoring is expensive, and you are wasting more than just bandwidth. People need to respond to alerts, even if they automate configuring them.
Open source tools are still driving innovation. We will see advances in open source before we see them in proprietary code, usually.
Part of DevOps is to automate, and part of automation means having tools that respond to automation. Some will be proprietary, some open, but all should be possible to configure and monitor, preferably via a REST API!
You will not find one tool to rule them all. If you do, someone will install something it can’t monitor within 15 minutes.
This is complicated stuff. Don’t expect agreement, especially on names.
We see some trends here:
Pragmatic approaches will rule, as DevOps shows. More and more virtualization and app isolation will happen.
Again, automation and continuous configuration discovery are needed to make it possible to keep up with dynamic changes.
OpenStack has feet of clay. It’s real, and will evolve, but it still needs work to get fully instrumented, even with the architecture we are advocating here. We do think that this is the right approach, and we will keep working on developing the façade, as well as the connectors, as the APIs mature and add more monitoring metrics.