Scaling Graphite At Yelp
1. …And Metrics For All
Paul O’Connor
github.com/pauloconnor
2015-05-19
2. About Yelp
Founded: 2004
Monthly Active Users: ~142 Million
Non-US Monthly Users: ~31 Million
Reviews: ~77 Million
Local Businesses: 2.1 Million
Territories: Available in 31 countries
15. Aggregation Rules
• Not to be confused with storage aggregation
• Tells the carbon aggregator what to aggregate and how
output_template (frequency) = method input_pattern
<env>.applications.<app>.all.requests (60) = sum
<env>.applications.<app>.*.requests
prod.applications.apache.www01.requests
prod.applications.apache.www02.requests
prod.applications.apache.www03.requests
prod.applications.apache.www04.requests
prod.applications.apache.www05.requests
prod.applications.apache.all.requests
16. Whisper
• Fixed size database
• Allows for roll ups
• Allows for backfilling data
24. Where does the aggregator fit?
• Aggregator uses a lot of CPU. Put it on its own node
25. Scaling further
• Use nodes for particular functions:
• Use forwarding relay nodes solely to forward
• Have consistent hashing nodes
• Have aggregation nodes
28. Getting your data back out
• Graphite Dashboard
• Third Party Dashboard
• We use Grafana http://grafana.org/
• Graphite-api https://github.com/brutasse/graphite-api
30. Tips
• Aggregate before ingestion
• Control the metrics that can be sent
• Metrics are a gas - they expand to fill all available room
• Use C implementation of carbon
• Use the latest webapp.
31. Optimize your dashboard queries
• services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99
• 2154 results
• 35 seconds to just find these files on disk
• Running functions against these results
• Timeout after a minute
• Dashboard automatically refreshing every 10 seconds
Hi, I’m Paul. I’m an SRE in Yelp’s Dublin office, where I’ve been for about a year. Today, I’m going to talk a bit about metrics in Yelp - in particular how we’ve scaled Graphite to handle over 12,000,000 metrics a minute.
For those of you who don’t know, Yelp is a company that produces huge amounts of logs, and huge amounts of metrics, and also has a side business for finding and reviewing local businesses. Founded in 2004, about 142 million MAU of which 31 million are outside the US.
So, let’s get started with the basics. What is a metric? Simply, it’s a name and a value. The problem is that the value is only correct for the moment it was recorded, and we don’t know when that was. Simple answer…
Let’s add a timestamp to the value. Now we know what the metric’s value was, and when. This is getting useful now. Let’s look at an example
I’ve colour coded these just to make it easier to follow along. We can see that we’re looking at a metric called server1.load.1m, which has a value of 28.8ish (Yes, I know this is high. This is the actual load average from one of the graphite nodes we use. More on this later). Finally, we have an Epoch timestamp. Now, a single data point on its own isn’t terribly useful, especially if you want to look for trends.
Now we have five data points, spanning 5 minutes. We have some accurate historical data, so we can see how this server’s load behaved over that period. Unfortunately, raw numbers aren’t terribly good at showing changes in data quickly.
Let’s throw it into a graph, and immediately we can see what’s happening with our data. Now that we know what we’re storing, and how we want to present the data, let’s have a look at our solution - Graphite
Graphite is made up of three main components - the Carbon daemon, Whisper and the web app. The carbon daemon has three components in it, which we’ll go into separately.
So, the relay is pretty simple. It does two things. It will forward received metrics somewhere else based on a set of rules, or it will forward them based on sharding, using a consistent hashing algorithm. The hashing simply means that whichever relay receives a metric, it will always be forwarded to the same destination.
Relay rules are fairly simple. A rule consists of 4 parts - a distinct name for the rule, a regex pattern for matching the metric, a comma separated list of destinations, and an optional flag telling carbon whether or not to continue processing rules once one matches a metric. This is useful for splitting metrics between multiple nodes. A rule tells the relay daemon “If you see a metric that matches this regex, forward it to these destinations”. This is very useful for replicating data, or splitting data between multiple storage backends.
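As a rough sketch, a couple of entries in relay-rules.conf look something like this (the rule names, patterns and destinations below are made up for illustration):
[business_metrics]
pattern = ^prod\.business\.
destinations = 10.0.0.10:2004, 10.0.0.11:2004
continue = True
[default]
default = true
destinations = 127.0.0.1:2004
Each destination is a host:port of the next carbon daemon in the chain, and the default rule catches anything nothing else matched.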
With consistent hashing, carbon relay will shard the metrics across a list of backends. This is a nice way of scaling out the storage layer. We’ll cover this in detail shortly.
The carbon cache is the daemon responsible for writing to disk. The cache will hold metrics in memory until it can write them to disk in as efficient a manner as possible. When it writes the metrics to disk, it follows a storage schema which is configurable per metric name or type.
It would be lovely to store all data points for all metrics for all time, but there’s a problem. Each data point takes 12 bytes, so if we received a metric every 10 seconds, that would be about 37MB per metric for a year of data. Given a system with a million unique metrics, that would be 37TB of storage that’s fast enough to handle that many metrics. That’s expensive, and quite wasteful.
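For the arithmetic behind that figure: one datapoint every 10 seconds is 6 × 60 × 24 × 365 ≈ 3.15 million datapoints a year, and at 12 bytes each that works out to the roughly 37MB per metric per year mentioned above.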
Instead, Whisper and carbon cache can use different retention policies. A retention policy has three parts - a name for the policy, a regex pattern to match on, and the retention policy itself. This retention policy says that for any database server, we will store the metrics at a resolution of 10 seconds for 7 days, 1 minute for 30 days, 5 minutes for 90 days, and 30 minutes for 365 days. What does this actually mean though? When the carbon cache receives a metric within a 10 second window, it will store that metric as is. Over seven days, that’s just over 60,000 datapoints per metric. As metrics slide outside the 7 day window, they will be taken in groups of 6 (six 10-second intervals in a minute), and processed so that the 6 datapoints become 1. Let’s talk about how these metrics are processed.
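As a sketch, the policy described above would sit in storage-schemas.conf and look something like this (the pattern is made up for illustration):
[database_servers]
pattern = ^servers\.db
retentions = 10s:7d,1m:30d,5m:90d,30m:365d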
As with everything else we’ve seen so far, Carbon lets you do whatever you want with your metrics. In this case, we can decide how we want to aggregate our metrics as we step from our 10 second resolution to 1 minute. Again, these rules have 4 parts - a name, a regex for matching the metrics, an xFilesFactor and an aggregation method. The xFilesFactor is an important option here. It defines what fraction of the points we are aggregating must be non-null in order to produce a non-null aggregated point. The aggregationMethod defines how the points should be aggregated. Options for this are sum, min, max, last, and average, with the default being average.
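These rules live in storage-aggregation.conf. A minimal sketch, with made-up patterns:
[min_metrics]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
With an xFilesFactor of 0.5, at least half of the six 10-second points have to be non-null for the aggregated 1-minute point to be written at all.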
And so, the last of our three carbon daemons is the aggregator. The aggregator runs alongside the relay, will accept metrics, and as the name suggests, will aggregate them based on a set of rules which we’ll talk about in a moment. This is really handy if you want to create totals across a number of nodes - for example, you could create a set of metrics for a cluster so you can easily see performance, egress and combined disk space before the metrics are written to disk. I’ll cover why this is a really useful feature shortly.
Aggregation rules are quite simple, but they are very powerful. Don’t confuse them with storage aggregation rules though, which only deal with on disk aggregation.
A rule is basically asking what it should write, how often, how to aggregate and from what. In the example above, we’re using two variables in the names - env and app. These variables map to the input metric name based on their position within the name, so in position 0 we have env becoming prod, and in position 2, the app is apache. The new metric that we generate will therefore be called prod.applications.apache.all.requests. Because we have an asterisk in position 3 of the input pattern, this will aggregate all nodes that match. The final metric will be a sum of all matching metrics, forwarded to the carbon cache every 60 seconds.
As I said, this is very powerful, but requires a lot of CPU to run.
The storage mechanism for Graphite is the Whisper file format. It’s pretty close to a rewrite of RRD. There are some downsides to Whisper - each datapoint is stored with its timestamp, rather than the time being inferred from its position in the file, and the file is fixed size, so a metric that sends a single datapoint once will take up the same disk space as a fully populated metric.
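Whisper also ships with a handful of command-line tools that are handy for poking at the files on disk; for example (the paths here are illustrative):
whisper-info.py /opt/graphite/storage/whisper/server1/load/1m.wsp
whisper-fetch.py --from=$(date -d '-1 hour' +%s) /opt/graphite/storage/whisper/server1/load/1m.wsp
whisper-resize.py /opt/graphite/storage/whisper/server1/load/1m.wsp 10s:7d 1m:30d
whisper-info shows a file’s archives and aggregation method, whisper-fetch reads datapoints back out, and whisper-resize rewrites a file with a new retention policy.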
So, the last piece of the Graphite stack is the web app. It’s a Django web app that reads from both the carbon caches and the whisper files on disk.
The very simplest graphite setup you can have is simply having carbon cache listening on TCP port 2003 (which is the standard graphite port), and writing all metrics directly to disk. This works fine for low volumes of metrics for testing. In this situation, you will be bound by disk I/O, unless you’re backed by SSD.
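If you want to try a setup like this, the plaintext protocol is just “name value timestamp” over TCP, so something along these lines works from any box (the hostname is made up):
echo "server1.load.1m 28.8 $(date +%s)" | nc graphite.example.com 2003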
Now, we’re bringing in the carbon relay to use consistent hashing between caches. Why would we do this? Queues and back pressure. When a carbon cache is waiting for the optimal time to write to disk, it may start dropping metrics. This is a decent way of doing things if you have plenty of CPU, RAM and disk IO to use. In this situation, you can scale out the carbon caches until you run out of CPU cores. Don’t forget, carbon is single threaded, so you’ll lose a CPU core to each process.
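As a rough sketch, the carbon.conf for this has one named cache instance per core you want to use, plus a relay section doing the hashing (the ports below are illustrative):
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202
[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
Each instance is then started separately, e.g. carbon-cache.py --instance=a start.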
OK, so we’re getting a bit more complex here now. Each of our carbon daemons has a queue, and these queues can fill quite quickly. Since CPU and RAM are cheaper than super speedy storage, we can offload a lot of work onto those. The consistent hashing algorithm is quite computationally intensive, so splitting the load across multiple relay nodes helps keep performance good.
So, you can see here now that we have another HAProxy layer, and another Carbon Relay layer. Let’s walk through the layers again, top to bottom: the top HAProxy layer receives metrics on TCP port 2003 (and 2004 for pickle) and forwards them round robin to the first layer of carbon relay daemons. This layer forwards metrics to destinations based on rules. One of those destinations is the next HAProxy layer, which round robins to the next carbon relay layer, which is responsible for consistent hashing, which forwards to the appropriate carbon cache, which writes to Whisper on disk, which is read by the webapp. With me so far? Excellent!
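For what it’s worth, the HAProxy side of this is just a plain TCP listener round-robining to the relays; a minimal sketch with made-up addresses:
listen carbon_line
    bind *:2003
    mode tcp
    balance roundrobin
    server relay1 10.0.0.11:2003 check
    server relay2 10.0.0.12:2003 check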
So, since we have one server working, let’s spin up a second identical one, and start duplicating data. As you can see (hopefully), the first carbon relay layer on each box is writing to the second carbon relay on the second box. This means that no matter which server the metric comes into, it will be persisted onto both boxes. Think of this as RAID 1 - mirrored copies of the data. Obviously, this may not be the ideal solution for everyone.
If you don’t particularly care about duplication, you can just use the second carbon relay layer, and consistent hash across both servers. Think of this as RAID 0. You will get more storage, and more performance from your nodes, but if one of your nodes goes down, you lose that node’s share of the data.
Because the aggregator is so CPU intensive, I find it’s easier to move it onto its own node. This might sound expensive, but it will give you more metrics, which will be useful. The flow in the above diagram is: the first layer of relays on server1 forwards to the HAProxy layer on aggregation1. From there, the metric is round robined to a relay, which uses consistent hashing to write to a particular carbon aggregator. From there, the carbon aggregator flushes to the second HAProxy layer on server1, which forwards on to the second layer of carbon relay, which consistent hashes onto the cache. I like to have a carbon aggregator attached to every storage node I have.
Scaling graphite is hard, and it’s expensive. Most people start with a single node, with a single cache and a single webapp. There are better ways of scaling. I had an issue where the forwarding relays were so overloaded that they started dropping metrics, so people started seeing gaps in their metrics. Spinning up new nodes that existed just to forward metrics to destinations reduced the load on the storage nodes, and allowed for more carbon cache daemons so we could use the full performance of our storage card.
Consistent hashing is an expensive operation but it’s stateless, so you can shard this function across many load balanced nodes.
The storage node ideally will have 3 things running on it - carbon cache for writing to disk, the webapp to get the metrics back out, and memcached for storing generated metrics.
This is a reasonably up to date diagram of the graphite infrastructure in Yelp. We have more relay nodes, and we don’t have the aggregators shown, but this is the bulk of the system. The two lower nodes are the powerhouses of the system. Both have dual 10 core 2.8GHz CPUs, 256GB RAM, and 3.2TB Fusion-io cards, which are basically SSDs that sit on the PCIe bus. This allows us to record about 12,000,000 updates a second, across about 1,000,000 metrics.
This was sent to me the day we started getting traffic in. Figured I needed a meme somewhere!
So, we have all of our metrics stored safely on disk, we’re not dropping anything on the floor, and we’re not overloading the nodes. Excellent. How do we get the data back out? The default webapp dashboard is fine. It does a lot of work, it’s embeddable, and it is powerful. Unfortunately, it’s not the prettiest thing in the world.
We’ve settled on using Grafana. It’s an open source project, originally based on the Kibana code base for those of you who use the ELK stack. It’s a Node.js based application which stores its dashboards in Elasticsearch. It’s very simple to get running, especially in Docker, and does a lot of very cool stuff.
The last option is graphite-api, for those of you who want quick, lightweight access to the data. There are a number of drawbacks to graphite-api, which are listed on the github page, but it can be very useful for servers where you don’t want to run apache.
So, we have a scaled system which works well for now. We’re growing, rather quickly. We’re getting more and more metrics daily, and we’ll need to revisit our metrics system. There are some very interesting tools coming down the line that move away from the python carbon daemons and the whisper files. InfluxDB is a time series database designed for the sole purpose of storing metrics. There is already an ecosystem of tools built around it, including Grafana, and it is designed to be run on multiple nodes, which helps with horizontal scaling.
Cassandra is a well known database that can shard and scale easily. It’s reasonably mature, and is used by many metrics companies including Librato and SignalFx. Again, there is a large ecosystem of tooling built around it, and it can plug into Graphite and the carbon daemons easily.
The last option? Just pay someone else to do it. Sometimes it’s easier to offload the work onto a company with a dedicated team and the expertise, rather than spending money on nodes and an engineer to maintain it internally. Of course, this may not be an option for everyone, but sometimes outsourcing can be very beneficial.
And of course, we’re hiring. We’re looking for people to join our site reliability team - we’ve got openings in Dublin, London, New York City and San Francisco