6. Meebo Bar: 1000+ sites. Quantcast: 197 MM monthly uniques*. LOTS of pageviews. LOTS of ad requests. (* http://bit.ly/xAPXx)
7. Meebo’s Ad Server. Given: user features, available ads. Objective: maximize revenue (P(click), price); satisfy advertisers (respect targeting, smooth campaign delivery). Complex application. Lots of concurrent requests.
9. Sample App: FortuneTeller. Given: username, available fortunes. Objective: select fortune for user with JaccardSimilarity(username, fortune). Example: username=PyConIsForLovers, “Generosity and perfection are your everlasting goals.”
17. Take Two: Apache mod_wsgi. Using mpm_prefork: worker processes handle requests; one concurrent request per process; memory cached between requests; O/S schedules CPU.
18. Take Two: Apache mod_wsgi. Advantages: straightforward, synchronous code; cached memory. Disadvantages: resource inefficient (need working set in each process; cold cache on restart); managing worker count (too few: 502; too many: OOM? Database DoS?).
21. Take Three: Twisted. Asynchronous framework: events and callbacks; Twisted orchestrates context switches. Twisted server: single event loop; concurrent requests.
22. Quick Break: Event Loops.
    s = socket.socket(…)
    s.setblocking(ISBLOCKING)
    s.connect((HOST, PORT))
    greeting = s.recv(1024)
    s.close()
Blocking: wait for data. Nonblocking: initiate, return immediately with data (if available) or an exception (“I’m not done yet”); requires more plumbing.
23. Quick Break: Event Loops. Nonblocking sockets in an event loop.
    1. f(x): s = NonBlockingSocket(…); greeting = s.recv(1024); print x, “|”, greeting
    2. Call recv().
    3. Create context, add to the event loop.
    4. Process events that are ready (select/poll).
    5. Return to context when data is ready.
    6. “80 | Hello from socket s!”
Events: fd=5, fp=g, {s: ‘hi’, a: 5}; fd=2, fp=f, {x: 8080}; fd=3, fp=myfunc, {}; fd=18, fp=f, {x: 80}
24. Take Three: Twisted. Asynchronous framework: events and callbacks; Twisted orchestrates context switches. Twisted server: single event loop; concurrent requests.
25. Take Three: Twisted. Advantages: shared memory; user space context switches. Disadvantages: develop asynchronously; stuck in the framework (asynchronous libraries; no I/O in C); unfair scheduling; using multiple cores.
28. Take Four: gevent + gunicorn. gevent: networking library; uses event loop; synchronous API. Synchronous code running asynchronously: monkey patching (rewrites standard modules); coroutines for function context (lightweight threads, no stack; greenlet implementation).
29. Take Four: gevent + gunicorn. gunicorn (“Green Unicorn”): lightweight WSGI server; multiple worker processes; share queued requests; gevent support.
30. Take Four: gevent + gunicorn. Advantages: best of both worlds! From mod_wsgi: straightforward, synchronous code; no framework, just python; multicore support. From Twisted: shared memory; user space context switches. Disadvantages: pure-python libraries; unfair scheduling.
37. “Evented” Development. Synchronous code still runs asynchronously. Requests aren’t independent. Things to keep in mind: duplicate work; socket caching; CPU hogging.
38. gunicorn + gevent in Production. Managing gunicorn: greins, by Randall Leeds (tilgovi), github/meebo/greins: multiple apps; URL routing; server hooks (worker launch, pre/post requests); daemon interface. Debugging gevent: gevent-profiler, by Shaun Lindsay (srlindsay), github/meebo/gevent-profiler: execution trace; time spent.
Introduction: Matt Spitz, Software Engineer at meebo. Here today to talk about building web applications in Python and the pros and cons of the various ways we can serve them.
Users make requests to an application, which uses a shared storage backend.
Same thing, just lots and lots and lots of concurrent requests
With such a large-scale application, small optimizations can have a huge impact: saving money on hardware (machines, RAM, CPU); faster response times and a better user experience; handling more concurrent requests; substantially decreasing the impact on shared resources. One example of a high-concurrency web application is the adserver we run at meebo. Before I talk about the adserver, let me introduce the meebo bar.
The meebo bar is deployed to our partner sites. It offers a neat way to share content on the site and lets users chat with other members of the site.
Show off the chat in the corner, the sharing buttons, and the ad unit. Can’t give you numbers, but suffice it to say that any adserver to which you’re making those calls can be considered a “high-concurrency web application”.
What the adserver has to do: select the ad a user is most likely to click on; serve the most valuable ads (e.g. highest CPC); respect whatever targeting the advertisers have selected; ensure smooth, complete delivery for each ad campaign. The adserver is a pretty complicated beast, and I think going through it wouldn’t really help make my point for this talk, so I wrote a sample application that has a similar structure and similar resource-usage patterns.
Describe JaccardSimilarity: size(intersection(x, y)) / size(union(x, y)). Super arbitrary, just to represent some CPU processing in the application. SHOW OFF THE CODE (make sure to show off the user fortune caching).
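The talk's FortuneTeller code isn't reproduced in these notes, so here is a minimal Python 3 sketch of the similarity formula above. It assumes the username and the fortune are compared as sets of lowercase characters; that choice, and the helper name `pick_fortune`, are illustrative assumptions rather than the original implementation.

```python
def jaccard_similarity(x, y):
    """Return size(intersection(x, y)) / size(union(x, y)) for two iterables."""
    a, b = set(x), set(y)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(union)


def pick_fortune(username, fortunes):
    """Hypothetical selector: the fortune whose characters best match the username."""
    name = username.lower()
    return max(fortunes, key=lambda f: jaccard_similarity(name, f.lower()))
```

The work per request grows with the number of fortunes, which is what makes it a reasonable stand-in for CPU processing.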
We’re gonna try four different serving implementations
How difficult is it to write code for these applications? To what extent do these applications allow us to use 3rd-party libraries? How efficient is the application in terms of memory? Can we take advantage of multi-core machines?
SHOW OFF THE CODE
Simple to write. Requests don’t affect one another. On the downside: need to reload the whole working set (all fortunes) with each request, and there is no database connection caching. It’s a start, but it doesn’t scale.
Before I show you a performance graph, I want to go over the benchmarks. There is a 25ms delay on the interface between guest and host to exaggerate the effects of I/O on response time.
8 processes maximum. Requires loading all fortunes with each request.
Apache spins up a number of worker processes to handle requests. Workers handle a configurable number of requests before being replaced. Workers handle exactly one request at a time. Memory is cached in the worker, so we can re-use the set of fortunes between requests. The operating system handles scheduling. MAKE SURE TO SHOW OFF THE HANDLER.
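The handler itself isn't included in these notes. As a rough Python 3 sketch of "memory is cached in the worker", the module-level variable below survives between requests for as long as the worker process lives. `load_fortunes` is a hypothetical stand-in for the real database query; `application` is the callable name mod_wsgi looks for by default.

```python
_FORTUNES = None  # module-level cache: lives as long as this worker process


def load_fortunes():
    """Hypothetical stand-in for the database query that loads every fortune."""
    return ["Generosity and perfection are your everlasting goals."]


def get_fortunes():
    global _FORTUNES
    if _FORTUNES is None:  # only the first request in this worker pays the cost
        _FORTUNES = load_fortunes()
    return _FORTUNES


def application(environ, start_response):
    """WSGI entry point."""
    body = get_fortunes()[0].encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Each worker process has its own copy of `_FORTUNES`, so the cache is filled once per worker, not once per server.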
Uses almost the same simple, synchronous code as we had in the CGI version. Memory is cached across requests in the same worker. There is no shared memory between workers: the set of all fortunes has to be loaded in each worker, so more workers require more RAM. Each worker load requires a DB request, which hammers the database on Apache restart.
Using 8 worker processes
Twisted is an asynchronous framework for building network applications. The developer structures code as events and callbacks. Twisted orchestrates context switches among requests, typically on things that take a long time (I/O). A Twisted server has a single event loop, and therefore a single process, and handles multiple requests simultaneously in that loop. And since we’re all in one process, memory is shared among requests.
Some of this may be review, but it’s important that everyone understands this. Blocking: connect and recv wait until their actions complete before returning. Nonblocking: connect and recv initiate the action (if it hasn’t been initiated already) and immediately either return the data or raise an exception. Nonblocking code requires a lot more plumbing than the example above.
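To see both behaviours without a real server, the Python 3 sketch below uses `socket.socketpair()` in place of the slide's HOST/PORT connection (an assumption made so the example is self-contained). In Python 3 the "I'm not done yet" exception on a nonblocking `recv` is `BlockingIOError`.

```python
import socket

# socketpair() returns two already-connected sockets, so no server is needed.
a, b = socket.socketpair()

b.setblocking(False)
try:
    b.recv(1024)  # nothing has been sent yet
    not_ready = False
except BlockingIOError:  # nonblocking: "I'm not done yet"
    not_ready = True

a.sendall(b"hello")
b.setblocking(True)
greeting = b.recv(1024)  # blocking: waits until data is available

a.close()
b.close()
```

The "more plumbing" is everything this sketch leaves out: deciding when to retry the nonblocking call and what to do in the meantime.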
…so let’s go back to this slide (the next one)
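The event-loop walkthrough (create a context, add it to the loop, process ready events, return to the context) can be sketched with Python 3's standard `selectors` module. This is an illustration of the idea only, not how Twisted is implemented; `register` and `process_ready_events` are made-up helper names, and socket pairs stand in for real client connections.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
output = []


def f(x, sock):
    """Resumed by the loop once `sock` has data to read."""
    greeting = sock.recv(1024)
    output.append("%d | %s" % (x, greeting.decode()))


def register(x):
    """Create a context (function + state) and add it to the event loop."""
    writer, reader = socket.socketpair()
    reader.setblocking(False)
    sel.register(reader, selectors.EVENT_READ, data=(f, x))
    return writer, reader


def process_ready_events():
    """One pass of the loop: run the callback for every socket that is ready."""
    for key, _mask in sel.select(timeout=1):
        callback, x = key.data
        sel.unregister(key.fileobj)
        callback(x, key.fileobj)  # return to the saved context


w1, r1 = register(80)
w2, r2 = register(8080)

w2.sendall(b"Hello from socket 2!")  # the second request becomes ready first
process_ready_events()
w1.sendall(b"Hello from socket 1!")
process_ready_events()

for s in (w1, r1, w2, r2):
    s.close()
sel.close()
```

Requests finish in the order their data arrives, not the order they were started.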
Twisted is a framework built around an event loop. It provides a nice interface for setting up your functions and callbacks (for success or error). It keeps track of multiple execution paths simultaneously, just as we saw in the previous example. The big problem with Twisted is that you can’t just plug in your synchronous app: you have to set up these events and callbacks for every piece of code that might block. MAKE SURE TO SHOW OFF THE CODE AND HOW MUCH OF A PAIN IT IS.
Advantages: Memory is shared among requests (we only have to load the fortunes once to service many simultaneous requests). Context switches happen in user space (fast).
Disadvantages: You need to rewrite code to be asynchronous. Guido sez: “I hate callback-based programming.” It’s hard to wrap your brain around. You’re stuck in the framework: everything has to be asynchronous, and you have to use Twisted’s standard libraries, which may not behave quite as you’d like. 3rd-party libraries must also be asynchronous. No I/O in C libraries (at least not out of the box). CPU-intense requests monopolize the processor. With mod_wsgi, the O/S handles scheduling, processes can be scheduled at any time, and CPU time is shared “fairly”. With Twisted, the CPU is scheduled explicitly, so CPU-bound blocks of code prevent other requests from running. Taking advantage of multiple cores isn’t trivial: load balancer? multiprocessing module?
Note that Twisted is running only on a single core
gevent: a networking library using libevent. It has an event loop, but its API is synchronous. It transforms synchronous applications to be asynchronous automatically!!! It “monkey patches” Python system modules (socket), rewriting socket calls to set up a callback and a context after writing the request to the socket. Function context lives in coroutines. Think of coroutines as lightweight threads: a pointer to code plus context, no stack (e.g. closures and generators). gevent uses an event loop to manage all concurrent requests and context switches on network I/O (just like Twisted).
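As a rough picture of "coroutines as lightweight threads", the Python 3 sketch below uses plain generators and a round-robin loop. It is not gevent's greenlet implementation; each `yield` simply marks the spot where a monkey-patched socket call would hand control back to the event loop. `handle_request` and `run` are illustrative names.

```python
from collections import deque


def handle_request(name, steps, log):
    """A 'request' that yields wherever real code would wait on network I/O."""
    for i in range(steps):
        log.append("%s step %d" % (name, i))
        yield  # hand control back to the loop


def run(tasks):
    """Round-robin loop: resume each coroutine in turn until it finishes."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
            ready.append(task)  # not finished: queue it again
        except StopIteration:
            pass  # finished: drop it


log = []
run([handle_request("A", 2, log), handle_request("B", 2, log)])
```

A request that never reaches a `yield` (a long CPU-bound block) would keep every other request waiting, which is the scheduling caveat this approach shares with Twisted.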
gunicorn: a fast, lightweight WSGI server written by Benoit. It uses multiple workers to handle requests. Big win: it supports gevent workers out of the box. Each worker maintains a pool of coroutines to handle incoming requests, and each of those workers shares memory among its requests. At this point, we look at the code.
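The configuration used in the talk isn't shown in these notes. A gunicorn config file is itself a Python module, so a minimal sketch might look like the following. The setting names are gunicorn's; the values, the file name, and the `myapp:application` target are placeholders.

```python
# gunicorn_conf.py (placeholder name); values below are illustrative only.
bind = "127.0.0.1:8000"      # address the workers listen on
workers = 4                  # worker processes, e.g. roughly one per core
worker_class = "gevent"      # run each worker on gevent's event loop
worker_connections = 1000    # max simultaneous clients per gevent worker
```

It would be started with something like `gunicorn -c gunicorn_conf.py myapp:application`, with gevent installed alongside gunicorn.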
Advantages: Best of both worlds! From mod_wsgi: easy to write; no framework to do everything asynchronously, just Python; can take advantage of multiple cores. From Twisted: shared memory among requests within each worker; context switches in user space. Disadvantages (similar to Twisted): no I/O in C libraries; CPU-intense requests monopolize the processor.
gunicorn_1 is comparable to Twisted. Negligible performance impact when the application is made asynchronous.
gunicorn_4-8 is faster than mod_wsgi. Making context switches deterministically and in user space is more efficient than OS scheduling.
gevent takes care of transforming synchronous code, but it’s still executed in an event loop. Synchronous code is not necessarily executed synchronously. Duplicated loads: simultaneous database requests. 1) No fortunes? Load up the fortunes! 2) No fortunes? Load up the fortunes! => Use “events” to guard against duplicate effort. Socket caching: you can’t naively cache sockets. You can’t use the same socket for two simultaneous operations; you must create a new socket per connection or use a pool. CPU hogging: you might want to offload CPU-intense things to another daemon/process.
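One way to picture the "events" guard against duplicated loads: the first request to notice the empty cache does the load, and every other request waits on an event until it finishes. The Python 3 sketch below uses `threading` so it runs with the standard library alone; under gevent the same shape would use gevent's own `Event` from `gevent.event`. `slow_load` is a hypothetical stand-in for the database query, and error handling for a failed load is left out.

```python
import threading
import time

_fortunes = None
_loaded = threading.Event()  # set once the fortunes are available
_claim = threading.Lock()    # decides which request performs the load
_claimed = False
load_count = 0


def slow_load():
    """Hypothetical stand-in for the slow database query."""
    global load_count
    load_count += 1
    time.sleep(0.05)
    return ["Generosity and perfection are your everlasting goals."]


def get_fortunes():
    global _fortunes, _claimed
    with _claim:
        first = not _claimed
        _claimed = True
    if first:
        _fortunes = slow_load()  # only the first caller hits the database
        _loaded.set()
    else:
        _loaded.wait()           # everyone else waits for that single load
    return _fortunes


results = []
threads = [threading.Thread(target=lambda: results.append(get_fortunes()))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Five simultaneous requests trigger one load and all receive the same list.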
Managing gunicorn: greins (Randall Leeds). Enables running multiple apps in a single gunicorn instance. Routes traffic based on URL. Allows for global and per-app server hooks: on worker startup (preloading a working set) and pre/post request (Apache-style request logging). Provides a standard start/stop/reload/restart interface to gunicorn.
Debugging gevent applications: gevent-profiler (Shaun Lindsay). Provides a linear trace of all function calls and context switches. Analyzes where CPU time is spent in a given application.
Blocking code is easy to understand, but traditional deployments aren’t very efficient. Asynchronous applications make the best use of resources, but they’re a pain to write. Running gevent workers in gunicorn is both simple and efficient, as it allows you to write blocking code that is converted to be asynchronous automatically. At meebo, we've found this setup to be amazingly efficient and reliable, even under extreme load. A number of our mission-critical, high-concurrency web applications have been running under this setup for the last 7 months with no major issues or outages. We've been able to save money on hardware with no impact on response time. …we even got a Halloween costume out of it.