LinkedIn started in May 2003; I started in August 2011. Over eight years of cruft and confusion piled up before we even considered moving to Apache Traffic Server. This talk will focus on the journey and what we learned along the way:
* What LinkedIn is doing with ATS to effect change across the entire stack with an infrastructure tier
* Building automation and tooling
* Bizarre scenarios in how users query the site
* Metrics and monitoring
* Patches we contributed
Really, this talk is about how introducing ATS into the LinkedIn stack completely changed how we tackled several complex problems.
I started working at LinkedIn in August 2011; I manage the SRE team responsible for Core, Security and Presentation Infrastructure. We support Identity infrastructure, Growth/Registration, Engagement and several other systems whose details I'll spare you, since you're here to learn about ATS… so what is it?
* Badass HTTP proxy
* Multi-threaded
* Non-blocking I/O
* Pluggable
* Well known for caching
Inktomi wrote it in the mid-to-late '90s and Yahoo open-sourced it in 2010.
So a few companies are using it… and these are just the people that bothered to put their logo on the Customers page
If you wanted a feature, it would be built as a Tomcat filter and deployed out to the majority of the site. For anything not running in Tomcat, there was no solution. Lots of frontends on lots of hosts.
LinkedIn started acquiring companies, and it was really difficult to integrate their stacks into our own. We're in a heterogeneous environment; supporting features across multiple platforms is a requirement, and abstracting that from the frontend itself into an infrastructure tier completely changes the game.
Centralize the effort of these features. Acquisitions become first-ish-class citizens. Need to make a change? Push out the plugin to the ATS tier instead of coordinating with all the service owners for weeks to update/deploy their code. Reduce the time to deploy.
HA Proxy and Varnish were not considered. We evaluated Nginx and ATS on maturity, scalability, modularity and in-house knowledge. The ability to dynamically load plugins without having to recompile a new server binary gave us the flexibility we needed to enable/disable features quickly.
Before we even started, we had patches for ATS addressing an issue with keep-alive handling and adding support for remap_with_recv_port, to allow routing requests from different incoming ports to different origins. There was no good source of truth for how to route requests, so we had to audit configs and access logs to build the config (a sketch of that audit follows). Metrics for non-Java services at LinkedIn really didn't exist, so we built a Python-based framework to seamlessly fit into our monitoring model. After all this, we could start migrating!
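The audit itself was mostly grepping and tallying. Here's a sketch of the access-log side; the log format and regex are illustrative, not our real format:

```python
import gzip
import re
from collections import Counter

# Tally which URL prefixes show up in an access log, so we know what
# routes the proxy config needs to cover.
PATH_RE = re.compile(r'"(?:GET|POST) (/[^/ ?]*)')

prefixes = Counter()
with gzip.open("access.log.gz", "rt") as f:
    for line in f:
        m = PATH_RE.search(line)
        if m:
            prefixes[m.group(1)] += 1

for prefix, count in prefixes.most_common(20):
    print(f"{prefix}\t{count}")
```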
We were a little ambitious
Ok, very ambitious
L1 Proxy will be ATS with a few plugins
First, we migrated our SEO-optimized Profile pages to L1 Proxy. This allowed us to support routing unauthenticated users out of ECH3.
Certain requests are able to be served out of different data centers. So for public profile requests from our signed-out users, we can route them to our Chicago data center for improved RTT.
If there is a cookie named foo and it starts with bar, route it to the moon.
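As a toy Python rendering of that rule's semantics (the real logic lives in an ATS plugin, and the pool names are made up):

```python
from http.cookies import SimpleCookie

# If cookie "foo" starts with "bar", pick the special pool; otherwise default.
def pick_pool(cookie_header: str) -> str:
    cookies = SimpleCookie(cookie_header)
    if "foo" in cookies and cookies["foo"].value.startswith("bar"):
        return "moon"       # route it to the moon
    return "default"

assert pick_pool("foo=barbecue; other=1") == "moon"
assert pick_pool("foo=baz") == "default"
```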
Drop the request at L1 Proxy instead of wasting cycles on the frontend. This allowed us to automatically deny requests based on limits, without the need for people to scan access logs and manually block IP addresses. We were prepared with our new Sentinel plugin before Christmas, but decided to delay enabling it until after the holidays. Since scrapers do not seem to celebrate New Year's, we were forced to enable it on New Year's Day 2012, and it worked!
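To make the idea concrete, here's a toy sliding-window limiter in Python; Sentinel itself is an ATS plugin, and the window and limit below are placeholders, not our production values:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60   # placeholder window
LIMIT = 300           # placeholder per-IP request limit

hits = defaultdict(list)  # ip -> timestamps of recent requests

def allow(ip: str) -> bool:
    # Keep only requests inside the window, add this one, then check the limit.
    now = time.time()
    recent = [t for t in hits[ip] if now - t < WINDOW_SECONDS]
    recent.append(now)
    hits[ip] = recent
    return len(recent) <= LIMIT
```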
Hopefully you had a chance to attend Veena’s talk yesterday on The Curious Case of Dust JavaScript and Performance. In case you didn’t make it, I’ll explain what it does
At a high level, this allows your app to do specifically what it's supposed to do. Before Fizzy, LinkedIn would ship code into multiple frontends so they could render the module within the various services. Now, People You May Know's module can be fetched and embedded into the Profile page, while an Ad can be pulled in from another frontend.
Up until this point, configs were manually edited and deployed. This sucked, big time. When we started, there was no great source of truth for the data we needed. By this point, LinkedIn SRE had a metadata store in place with most of the data, and we just needed to fill the gaps. This significantly improved managing configs and reduced the amount of human error.
Teams across the company wanted to build ATS plugins. This was both good and bad: good that we were able to solve some difficult problems, bad that our proxy tier was becoming more complicated. The Mobile team wrote a plugin to detect when to issue redirects to Mobile pages, instead of handling it in the Tomcat frontends. Security started manipulating and enforcing cookies, addressing legacy issues that were traditionally difficult to track down. This also led to the development of QD Proxy…
The key design behind Quick Deploy is that if you develop a service locally, you can initiate a request to that service, either directly or indirectly, by using QD Proxy in LinkedIn's staging environment. All other components of the request go to the Staging environment.
If I have a minor tweak to make and I'm not ready to commit, I can set up a QD Proxy profile to route the request to my dev box for the Frontend request, and my Frontend will talk back to QD Proxy for all the backend calls, which will be sent to the backends in Staging. Really freakin' sweet.
I can also have a frontend in Staging send the backend requests to my dev box based on my QD Proxy profile. Testing features against a complete environment before committing is now possible.
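A toy sketch of the QD Proxy routing decision: a per-user profile maps service names to overrides such as a dev box, and anything without an override falls through to Staging. All names here are illustrative:

```python
STAGING = "staging.linkedin.biz"  # hypothetical staging VIP

def resolve(service: str, profile: dict) -> str:
    # If the profile overrides this service, send it to the override;
    # otherwise the request goes to Staging like everything else.
    return profile.get(service, STAGING)

profile = {"frontend": "mydevbox.corp:8080"}
assert resolve("frontend", profile) == "mydevbox.corp:8080"
assert resolve("backend-people", profile) == STAGING
```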
Not because of Fizzy, but from a compounding loop through a single pair of load balancers, causing the LB pair's CPU to spike and drop requests. This sucks! We were so close to finishing the migration, and now we'll have to buy new load balancers to handle the load… or will we?
That's when HA Proxy came into our lives: The Reliable, High Performance TCP/HTTP Load Balancer. Since we already had all the data in Range, we could generate our HA Proxy configs in seconds and deploy them in minutes (a sketch of the generator follows). This allows us to automate load balancing changes without having to make changes on network gear.
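A sketch of the config generation idea, with a stand-in for the Range lookup (the real query syntax and host data are omitted):

```python
# Render an HA Proxy backend block from host data.
def get_hosts(cluster):
    # Stand-in for the Range query; returns hostnames for the cluster.
    return ["fizzy01.prod", "fizzy02.prod", "fizzy03.prod"]

def render_backend(name, port=8080):
    lines = [f"backend {name}", "    balance roundrobin"]
    for host in get_hosts(name):
        lines.append(f"    server {host} {host}:{port} check")
    return "\n".join(lines)

print(render_backend("fizzy"))
```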
Moving to HA Proxy gave SRE:
* complete control over how we handled the load balancing
* reduced requests to NetOps, which in turn reduced our turnaround time
* reduced network hops between ATS and the Frontend
* removed single points of failure between L1 Proxy and Fizzy by eliminating the load balancer
Month 10: we did it! www.linkedin.com migrated behind L1 Proxy and Fizzy.
Started out at 4,000 QPS and 120M members; today we're nearing 70,000 QPS at 225M members. That's approximately a 4x increase in QPS, year over year. NetScaler is still in play, but now only provides load balancing to L1 Proxy as well as SSL termination. At this point, we're able to consider the possibility of removing the NetScaler altogether. Bug fixes for features in ATS can be rolled out in hours, not days, and our acquisitions get all the goodness the rest of the site does.
Stability:
* When you're introducing a critical tier and forcing everyone onto it, customer service is key. We spent many hours debugging invalid (and some valid) escalations to help build confidence.
Invalid requests:
* POST requests with no Content-Length and no body
Connection failed:
* Clients using CONNECT for no reason!
Here are 15 of the 30 outages since we set up L1 Proxy and Fizzy. Each outage reminded us of the impact of even the smallest of changes. There will be mistakes; there will be unexpected surprises. If you're going to fail, do it quickly and recover fast. Learn from the mistakes, and avoid repeating them.
We're doing more with ATS than ever before, and the outage rate is not affected by it. Issues with plugins are now caught earlier in the development process: they're performance tested before going to staging, and the deployment schedule has strict guidelines to ensure testing/verification is done before promoting to production. And you can see a downward trend with the Human factor.
We strive to keep our graphs looking good, even when they're bad… so much so that my team will draw on post-it notes to cover up nasty outages. So how did we do this? With a few different tools.
I suggest summarizing some of the data, unless you’re prepared to consume all the metrics
Great for reading variables (core + plugins) to monitor.
We don't want to shell out to gather metrics; there's an HTTP endpoint! Awesome, right?
We take start_time and calculate uptime by subtracting it from time.time().
Tracking start_time helps highlight crashes, deployments, and people doing things to the service that they shouldn't be (sketch below).
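A minimal sketch of how this polling works, assuming the stats_over_http plugin is serving JSON at /_stats (the endpoint path and exact metric name may differ in your build):

```python
import json
import time
import urllib.request

STATS_URL = "http://localhost:8080/_stats"  # assumed stats_over_http mount point

def fetch_stats():
    # stats_over_http returns a JSON document with the metrics under "global".
    with urllib.request.urlopen(STATS_URL, timeout=5) as resp:
        return json.loads(resp.read())["global"]

stats = fetch_stats()
start_time = float(stats["proxy.node.restarts.proxy.start_time"])  # assumed metric name
uptime = time.time() - start_time
print(f"uptime: {uptime:.0f}s")
# A sudden drop in uptime flags a crash, a deployment, or someone
# restarting the service when they shouldn't be.
```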
monitor trends coming in and going out
We track how close we’re getting to the throttle limit.
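A sketch of that check against the stats endpoint; the metric name is from the ATS build we ran and may differ in yours, and the throttle value really comes from records.config (hard-coded here for the sketch):

```python
import json
import urllib.request

CONNECTIONS_THROTTLE = 30000  # placeholder for proxy.config.net.connections_throttle

# Pull current open connections from the stats endpoint and compare.
with urllib.request.urlopen("http://localhost:8080/_stats", timeout=5) as resp:
    stats = json.loads(resp.read())["global"]

open_conns = int(stats["proxy.process.net.connections_currently_open"])
pct = 100.0 * open_conns / CONNECTIONS_THROTTLE
print(f"{open_conns} open connections ({pct:.0f}% of throttle limit)")
```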
that’s bad
* Core dump rate: monitor the file system for core dumps < 24 hours old, alert if > N
* TCP states: captured from netstat, watching for a spike in TIME_WAIT
* Proc: memory usage, swap usage, file descriptor usage
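As a sketch, the TCP-state check can be as simple as counting states from netstat output (the threshold below is a placeholder, not our alerting value):

```python
import subprocess
from collections import Counter

def tcp_state_counts():
    # Count TCP connection states from `netstat -ant` output.
    out = subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout
    return Counter(line.split()[-1] for line in out.splitlines()
                   if line.startswith("tcp"))

TIME_WAIT_THRESHOLD = 20000  # placeholder value
states = tcp_state_counts()
if states.get("TIME_WAIT", 0) > TIME_WAIT_THRESHOLD:
    print("ALERT: TIME_WAIT spike:", states["TIME_WAIT"])
```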
They don't give enough of a picture for the requests we're processing. If you need to debug a problem, you need a combination of these familiar logging formats… fortunately, there's custom logging.
We log request headers, response headers, timing, and origin. We tail -f the log, then aggregate and report timing for given paths (something we don't get with traffic_logstats).
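Here's a rough sketch of that tail-and-aggregate reporting, assuming a custom log format that carries the total request time in milliseconds (the ttms= field and regex are illustrative, not our real format). Pipe it like: tail -f custom.log | python timing_report.py

```python
import re
import sys
from collections import defaultdict

LINE_RE = re.compile(r'\s(GET|POST|PUT|DELETE)\s+(\S+).*?ttms=(\d+)')

totals = defaultdict(lambda: [0, 0])  # path -> [request count, total ms]

for line in sys.stdin:
    m = LINE_RE.search(line)
    if not m:
        continue
    _, path, ms = m.groups()
    bucket = totals[path.split("?")[0]]
    bucket[0] += 1
    bucket[1] += int(ms)
    if bucket[0] % 1000 == 0:  # report every 1000 requests per path
        print(f"{path}: avg {bucket[1] / bucket[0]:.1f}ms over {bucket[0]} reqs")
```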
If someone adds or removes a host from the deployment system's topology, our config generators will pick it up. We even have some of these configs ready to be headless, so changes will be automatically propagated.
Salt is an open-source remote execution framework written in Python. Since we can write Python modules to do whatever we want, we're able to create the pre/post hooks necessary for rolling out changes (a sketch follows this list):
* take host out of rotation
* bleed traffic
* confirm it's out of rotation
* upgrade packages
* install configs
* restart trafficserver
* verify process is running
* review log files
* go back into rotation
Before, these steps were all done by a human, which ultimately led to mistakes. We now automate these tasks and iterate on them every time we learn how to better the process.
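A minimal sketch of what one of these rolling-upgrade hooks looks like as plain Python (Salt execution modules are ordinary Python functions). The load balancer CLI and the package/config commands below are placeholders, not our actual tooling:

```python
import subprocess
import time

def _run(cmd):
    # Run a shell command and fail loudly if it does.
    subprocess.run(cmd, shell=True, check=True)

def roll(host):
    _run(f"lb-ctl disable {host}")          # take host out of rotation (hypothetical CLI)
    time.sleep(30)                           # bleed traffic
    _run(f"lb-ctl status {host} | grep -q DOWN")   # confirm it's out of rotation
    _run("yum -y upgrade trafficserver")     # upgrade packages
    _run("install-configs trafficserver")    # install configs (placeholder)
    _run("service trafficserver restart")    # restart trafficserver
    _run("pgrep traffic_server")             # verify process is running
    _run("tail -n 50 /var/log/trafficserver/error.log")  # review log files
    _run(f"lb-ctl enable {host}")            # go back into rotation
```

Quick plug on inFormed…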
inFormed is our in-house report of things happening in production. It's fed through multiple bridges:
* JIRA ticketing system
* IRC
* deployments
* whatever
available in the experimental section of plugins
This is an example
Google DWR was updated in the last couple of weeks and caused Chrome to bomb out on one of our JavaScript files. Within 20 minutes, we had a temporary fix deployed to production to issue the correct Content-Type header.
Why would you need Boom?
This enabled anyone to debug production issues against a single host instead of scanning for your request across 50+ servers. I can pin my requests to a specific L1 Proxy host, through a specific Fizzy host, and then to a specific Profile host. Hell yes!
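From the client side, pinning looks something like the sketch below; the header name and value format are invented for illustration, since the real plugin defines its own convention:

```python
import urllib.request

# Send one request pinned through specific hosts at each tier.
req = urllib.request.Request(
    "https://www.linkedin.com/profile/view",
    headers={"X-Boom-Route": "l1proxy=host17;fizzy=host03;profile=host42"},  # hypothetical header
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.getheader("Via"))
```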
This is even more awesome thanks to ATS's non-blocking I/O, which avoids burning up threads on the frontends. Who has looked at LinkedIn's "View as Source"?
Saving 10% per request at the expense of idle CPU is a huge win!
Being tested in our staging environment as we speak. Potential CDN savings still to be calculated.
traffic_manager was unable to communicate with traffic_server because of a hard-coded file descriptor limit of 32 for the internal health check, so traffic_server would restart every ~2 minutes and you'd see an uptime graph like this…
When ATS hit the connection_throttle limit, it would never get out of the throttle until ATS was restarted.
As we started adding our own stats, there was no check in place to prevent a plugin from creating too many variables/metrics, and the {stat} endpoint was not able to return the results within the given buffer.
https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=briang
https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=manjeshnilange
https://issues.apache.org/jira/issues/?jql=project%20%3D%20TS%20AND%20reporter%20in%20(manjeshnilange%2C%20briang%2C%20manjesh)%20AND%20updated%20%3E%202011-06-01
We have rewritten a few of our plugins to use this new API; one of them is literally half the code it was before. This has enabled us to grow from 2 engineers working on ATS plugins to 6, and the ramp-up time for plugin development is dramatically reduced.
Doug's comment on atscppapi: "I'd say the main feedback is that it's *really* easy to use compared to the raw API. Hides all the grunge, and just lets you focus on your logic. I wrote a transform plugin that would probably have taken me weeks of struggling with virtual I/O buffers and so on in just a few hours, and that included learning the basics of the API. Now that I've done it once, it would be even faster. So far I haven't hit any limitations of the abstraction. It does an excellent job of providing the functionality of ATS in a way that matches the plugin developer's mental view of the tasks to be performed, rather than going from the mindset of internal ATS implementation. As long as you understand the basic concept of the ATS state machine, writing a new plugin is almost trivial."
Earlier this year, our Media origin for the CDN was nearing capacity. The NetApp filer's CPUs were over 50%, and if we needed to fail over, we would not have been able to serve Media requests (profile pictures, cached external content). Since we had so much success with ATS as a reverse proxy, why not try using it for its bread and butter… caching.
After a couple weeks of tweaking configs and $30,000 in gear later, a caching layer was built on top of our Media origin. We had a 98% cache hit rate, serving requests in < 2ms. This reduced our NetApp filer's CPU to less than 1%. The team responsible for the Media origin thought the NetApp CPU graphs were broken (we kind of forgot to mention we finished migrating the traffic over to the new cache), and it saved the company $400,000 by avoiding having to upgrade our filers.
Recap: $30,000 of commodity gear + ATS saved LinkedIn $400,000.
Thank you for your time! Come meet the team behind ATS @ LinkedIn during our office hours at 1:15PM. We’re interested in answering any questions around our experiences of solving problems at LinkedIn with Apache Traffic Server.