Scaling Graphite At Yelp
1. …And Metrics For All
Paul O’Connor
github.com/pauloconnor
2015-05-19
2. About Yelp
Founded: 2004
Monthly Active Users: ~142 Million
Non-US Monthly Users: ~31 Million
Reviews: ~77 Million
Local Businesses: 2.1 Million
Territories: Available in 31 countries
15. Aggregation Rules
• Not to be confused with storage aggregation
• Tells the carbon aggregator what to aggregate and how
output_template (frequency) = method input_pattern
<env>.applications.<app>.all.requests (60) = sum
<env>.applications.<app>.*.requests
prod.applications.apache.www01.requests
prod.applications.apache.www02.requests
prod.applications.apache.www03.requests
prod.applications.apache.www04.requests
prod.applications.apache.www05.requests
prod.applications.apache.all.requests
16. Whisper
• Fixed size database
• Allows for roll ups
• Allows for backfilling data
24. Where does the aggregator fit?
• Aggregator uses a lot of CPU. Put it on its own node
25. Scaling further
• Use nodes for particular functions:
• Use forwarding relay nodes solely to forward
• Have consistent hashing nodes
• Have aggregation nodes
28. Getting your data back out
• Graphite Dashboard
• Third Party Dashboard
• We use Grafana http://grafana.org/
• Graphite-api https://github.com/brutasse/graphite-api
30. Tips
• Aggregate before ingestion
• Control the metrics that can be sent
• Metrics are a gas - they expand to fill all available room
• Use C implementation of carbon
• Use the latest webapp.
31. Optimize your dashboard queries
• services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99
• 2154 results
• 35 seconds to just find these files on disk
• Running functions against these results
• Timeout after a minute
• Dashboard automatically refreshing every 10 seconds
Hi, I’m Paul. I’m an SRE in Yelp’s Dublin office, where I’ve been for about a year. Today, I’m going to talk a bit about metrics in Yelp - in particular how we’ve scaled Graphite to handle over 12,000,000 metrics a minute.
For those of you who don’t know, Yelp is a company that produces huge amounts of logs, and huge amounts of metrics, and also has a side business for finding and reviewing local businesses. Founded in 2004, about 142 million MAU of which 31 million are outside the US.
So, let’s get started with the basics. What is a metric? Simply, it’s a name and a value. The problem is that the value is only correct for the moment it was recorded, and we don’t know when that was. Simple answer…
Let’s add a timestamp to the value. Now we know what the metric’s value was, and when. This is getting useful now. Let’s look at an example
I’ve colour coded these just to make it easier to follow along. We can see that we’re looking at a metric called server1.load.1m, which has a value of 28.8ish (Yes, I know this is high. This is the actual load average from one of the graphite nodes we use. More on this later). Finally, we have an Epoch timestamp. Now, a single data point on its own isn’t terribly useful, especially if you want to look for trends.
Now we have five data points, spanning 5 minutes. We have some accurate historical data, so we can see how this server’s load behaved over that period. Unfortunately, raw numbers aren’t terribly good at showing changes in data quickly.
Let’s throw it into a graph, and immediately we can see what’s happening with our data. Now that we know what we’re storing, and how we want to present the data, let’s have a look at our solution - Graphite
Graphite is made up of three main components - the Carbon daemon, Whisper and the web app. The carbon daemon has three components in it, which we’ll go into separately.
So, the relay is pretty simple. It does two things. It will forward received metrics somewhere else based on a set of rules, or it will forward them based on sharding, using a consistent hashing algorithm. The hashing simply means that whichever relay receives a metric, it will always be forwarded to the same destination.
Relay rules are fairly simple. A rule consists of 4 parts - a distinct name for the rule, a regex pattern for matching the metric, a comma separated list of destinations, and an optional flag telling carbon whether or not to continue processing rules once one matches a metric. This is useful for splitting metrics between multiple nodes. A rule tells the relay daemon “If you see a metric that matches this regex, forward it to these destinations”. This is very useful for replicating data, or splitting data between multiple storage backends.
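As a rough sketch, a couple of entries in relay-rules.conf look something like this (the rule names, patterns and destinations below are made up for illustration):
[business_metrics]
pattern = ^prod\.business\.
destinations = 10.0.0.10:2004, 10.0.0.11:2004
continue = True
[default]
default = true
destinations = 127.0.0.1:2004
Each destination is a host:port of the next carbon daemon in the chain, and the default rule catches anything nothing else matched.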
With consistent hashing, carbon relay will shard the metrics across a list of backends. This is a nice way of scaling out the storage layer. We’ll cover this in detail shortly.
The carbon cache is the daemon responsible for writing to disk. The cache will hold metrics in memory until it can write them to disk in as efficient a manner as possible. When it writes the metrics to disk, it follows a storage schema which is configurable per metric name or type.
It would be lovely to store all data points for all metrics for all time, but there’s a problem. Each data point takes 12 bytes, so if we received a metric every 10 seconds, that would be about 37MB per metric for a year of data. Given a system with a million unique metrics, that would be 37TB of storage that’s fast enough to handle that many metrics. That’s expensive, and quite wasteful.
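For the arithmetic behind that figure: one datapoint every 10 seconds is 6 × 60 × 24 × 365 ≈ 3.15 million datapoints a year, and at 12 bytes each that works out to the roughly 37MB per metric per year mentioned above.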
Instead, Whisper and carbon cache can use different retention policies. A retention policy has three parts - a name for the policy, a regex pattern to match on, and the retention policy itself. This retention policy says that for any database server, we will store the metrics at a resolution of 10 seconds for 7 days, 1 minute for 30 days, 5 minutes for 90 days, and 30 minutes for 365 days. What does this actually mean though? When the carbon cache receives a metric within a 10 second window, it will store that metric as is. Over seven days, that’s just over 60,000 datapoints per metric. As metrics slide outside the 7 day window, they will be taken in groups of 6 (six 10-second intervals in a minute), and processed so that the 6 datapoints become 1. Let’s talk about how these metrics are processed.
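As a sketch, the policy described above would sit in storage-schemas.conf and look something like this (the pattern is made up for illustration):
[database_servers]
pattern = ^servers\.db
retentions = 10s:7d,1m:30d,5m:90d,30m:365d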
As with everything else we’ve seen so far, Carbon lets you do whatever you want with your metrics. In this case, we can decide how we want to aggregate our metrics as we step from our 10 second resolution to 1 minute. Again, these rules have 4 parts - a name, a regex for matching the metrics, an xFilesFactor and an aggregation method. The xFilesFactor is an important option here. It defines what fraction of the points we are aggregating must be non-null in order to produce a non-null aggregated point. The aggregationMethod defines how the points should be aggregated. Options for this are sum, min, max, last, and average, with the default being average.
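These rules live in storage-aggregation.conf. A minimal sketch, with made-up patterns:
[min_metrics]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
With an xFilesFactor of 0.5, at least half of the six 10-second points have to be non-null for the aggregated 1-minute point to be written at all.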
And so, the last of our three carbon daemons is the aggregator. The aggregator runs alongside the relay, will accept metrics, and as the name suggests, will aggregate them based on a set of rules which we’ll talk about in a moment. This is really handy if you want to create totals across a number of nodes - for example, you could create a set of metrics for a cluster so you can easily see performance, egress and combined disk space before the metrics are written to disk. I’ll cover why this is a really useful feature shortly.
Aggregation rules are quite simple, but they are very powerful. Don’t confuse them with storage aggregation rules though, which only deal with on disk aggregation.
A rule is basically asking what it should write, how often, how to aggregate and from what. In the example above, we’re using two variables in the names - env and app. These variables map to the input metric name based on their position within the name, so in position 0 we have env becoming prod, and in position 2, the app is apache. The new metric that we generate will therefore be called prod.applications.apache.all.requests. Because we have an asterisk in position 3 of the input pattern, this will aggregate all nodes that match. The final metric will be a sum of all matching metrics, forwarded to the carbon cache every 60 seconds.
As I said, this is very powerful, but requires a lot of CPU to run.
The storage mechanism for Graphite is the Whisper file format. It’s pretty close to a rewrite of RRD. There are some downsides to Whisper - each datapoint is stored with its timestamp, rather than the time being inferred from its position in the file, and the file is fixed size, so a metric that sends a single datapoint once will take up the same disk space as a fully populated metric.
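Whisper also ships with a handful of command-line tools that are handy for poking at the files on disk; for example (the paths here are illustrative):
whisper-info.py /opt/graphite/storage/whisper/server1/load/1m.wsp
whisper-fetch.py --from=$(date -d '-1 hour' +%s) /opt/graphite/storage/whisper/server1/load/1m.wsp
whisper-resize.py /opt/graphite/storage/whisper/server1/load/1m.wsp 10s:7d 1m:30d
whisper-info shows a file’s archives and aggregation method, whisper-fetch reads datapoints back out, and whisper-resize rewrites a file with a new retention policy.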
So, the last piece of the Graphite stack is the web app. It’s a Django web app that reads from both the carbon caches and the whisper files on disk.
The very simplest graphite setup you can have is simply having carbon cache listening on TCP port 2003 (which is the standard graphite port), and writing all metrics directly to disk. This works fine for low volumes of metrics for testing. In this situation, you will be bound by disk I/O, unless you’re backed by SSD.
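If you want to try a setup like this, the plaintext protocol is just “name value timestamp” over TCP, so something along these lines works from any box (the hostname is made up):
echo "server1.load.1m 28.8 $(date +%s)" | nc graphite.example.com 2003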
Now, we’re bringing in the carbon relay to use consistent hashing between caches. Why would we do this? Queues and back pressure. When a carbon cache is waiting for the optimal time to write to disk, it may start dropping metrics. This is a decent way of doing things if you have plenty of CPU, RAM and disk IO to use. In this situation, you can scale out the carbon caches until you run out of CPU cores. Don’t forget, carbon is single threaded, so you’ll lose a CPU core to each process.
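As a rough sketch, the carbon.conf for this has one named cache instance per core you want to use, plus a relay section doing the hashing (the ports below are illustrative):
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202
[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
Each instance is then started separately, e.g. carbon-cache.py --instance=a start.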
OK, so we’re getting a bit more complex here now. Each of our carbon daemons has a queue, and these queues can fill quite quickly. Since CPU and RAM are cheaper than super speedy storage, we can offload a lot of work onto those. The consistent hashing algorithm is quite computationally intensive, so splitting the load across multiple relay nodes helps keep performance good.
So, you can see here now that we have another HAProxy layer, and another Carbon Relay layer. Let’s walk through the layers again, top to bottom: the top HAProxy layer receives metrics on TCP port 2003 (and 2004 for pickle) and forwards them round robin to the first layer of carbon relay daemons. This layer forwards metrics to destinations based on rules. One of those destinations is the next HAProxy layer, which round robins to the next carbon relay layer, which is responsible for consistent hashing, which forwards to the appropriate carbon cache, which writes to Whisper on disk, which is read by the webapp. With me so far? Excellent!
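For what it’s worth, the HAProxy side of this is just a plain TCP listener round-robining to the relays; a minimal sketch with made-up addresses:
listen carbon_line
    bind *:2003
    mode tcp
    balance roundrobin
    server relay1 10.0.0.11:2003 check
    server relay2 10.0.0.12:2003 check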
So, since we have one server working, let’s spin up a second identical one, and start duplicating data. As you can see (hopefully), the first carbon relay layer on each box is writing to the second carbon relay on the second box. This means that no matter which server the metric comes into, it will be persisted onto both boxes. Think of this as RAID 1 - mirrored copies of the data. Obviously, this may not be the ideal solution for everyone.
If you don’t particularly care about duplication, you can just use the second carbon relay layer, and consistent hash across both servers. Think of this as RAID 0. You will get more storage, and more performance from your nodes, but if one of your nodes goes down, you lose that node’s share of the data.
Because the aggregator is so CPU intensive, I find it’s easier to move it onto its own node. This might sound expensive, but it will give you more metrics, which will be useful. The flow in the above diagram is: the first layer of relays on server1 forwards to the HAProxy layer on aggregation1. From there, the metric is round robined to a relay, which uses consistent hashing to write to a particular carbon aggregator. From there, the carbon aggregator flushes to the second HAProxy layer on server1, which forwards on to the second layer of carbon relay, which consistent hashes onto the cache. I like to have a carbon aggregator attached to every storage node I have.
Scaling graphite is hard, and it’s expensive. Most people start with a single node, with a single cache and a single webapp. There are better ways of scaling. I had an issue where the forwarding relays were so overloaded that they started dropping metrics, so people started seeing gaps in their metrics. Spinning up new nodes that existed just to forward metrics to destinations reduced the load on the storage nodes, and allowed for more carbon cache daemons so we could use the full performance of our storage card.
Consistent hashing is an expensive operation but it’s stateless, so you can shard this function across many load balanced nodes.
The storage node ideally will have 3 things running on it - carbon cache for writing to disk, the webapp to get the metrics back out, and memcached for storing generated metrics.
This is a reasonably up to date diagram of the graphite infrastructure in Yelp. We have more relay nodes, and we don’t have the aggregators shown, but this is the bulk of the system. The two lower nodes are the powerhouses of the system. Both have dual 10 core 2.8GHz CPUs, 256GB RAM, and 3.2TB Fusion-io cards, which are basically SSDs that sit on the PCIe bus. This allows us to record about 12,000,000 updates a second, across about 1,000,000 metrics.
This was sent to me the day we started getting traffic in. Figured I needed a meme somewhere!
So, we have all of our metrics stored safely on disk, we’re not dropping anything on the floor, and we’re not overloading the nodes. Excellent. How do we get the data back out? The default webapp dashboard is fine. It does a lot of work, it’s embeddable, and it is powerful. Unfortunately, it’s not the prettiest thing in the world.
We’ve settled on using Grafana. It’s an open source project, originally based on the Kibana code base for those of you who use the ELK stack. It’s a Node.js based application which stores its dashboards in Elasticsearch. It’s very simple to get running, especially in Docker, and does a lot of very cool stuff.
The last option is graphite-api, for those of you who want quick, lightweight access to the data. There are a number of drawbacks to graphite-api, which are listed on the github page, but it can be very useful for servers where you don’t want to run apache.
So, we have a scaled system which works well for now. We’re growing, rather quickly. We’re getting more and more metrics daily, and we’ll need to revisit our metrics system. There are some very interesting tools coming down the line that move away from the python carbon daemons and the whisper files. InfluxDB is a time series database designed for the sole purpose of storing metrics. There is already an ecosystem of tools built around it, including Grafana, and it is designed to be run on multiple nodes, which helps with horizontal scaling.
Cassandra is a well known database that can shard and scale easily. It’s reasonably mature, and is used by many metrics companies including Librato and SignalFx. Again, there is a large ecosystem of tooling built around it, and it can plug into Graphite and the carbon daemons easily.
The last option? Just pay someone else to do it. Sometimes it’s easier to offload the work onto a company with a dedicated team and the expertise, rather than spending money on nodes and an engineer to maintain it internally. Of course, this may not be an option for everyone, but sometimes outsourcing can be very beneficial.
And of course, we’re hiring. We’re looking for people to join our site reliability team - we’ve got openings in Dublin, London, New York City and San Francisco