Skyscanner Hotels is built on top of Aiohttp, which has been chosen as one of the formally supported frameworks at the company. We recently presented this talk, which covers how Aiohttp is used, what we have learned, and how we are trying to generalize some implementations and patterns so that any new microservice at Skyscanner can use them from scratch.
The talk also covers essentials of Asyncio, the HTTP protocol, and the AWS infrastructure, all of which are needed to understand some of the points made during the talk.
2. ● About me
● Skyscanner and aiohttp, why?
● Tracing incoming requests
● Calling other microservices
● DNS in AWS with Aiohttp
● Misleading timeouts: the reactor saturation side effect
● Desired plans
● Questions
Running Aiohttp at scale @SkyscannerEng
3. Pau Freixes @pfreixes
● Senior Software Engineer, working at Skyscanner for almost 2 years.
● Member of the Hotels Attachment Squad.
● Also a collaborator of the Mshell Squad, helping with the Python stuff.
● Open source contributor: aioredis, aiohttp, etc.
5. ● Skyscanner loves the microservice architecture pattern.
● Microservices talk to each other using HTTP.
● The most used languages at Skyscanner (Java, JavaScript, Python) each have an official, supported HTTP framework.
● Adopting a standard HTTP framework means shared knowledge, less fragmentation, and a single implementation of commonalities.
6. Why Aiohttp? Or why Asyncio?
● We were looking for a framework suited to IO-bound scenarios:
  ● AWS APIs: DynamoDB, S3, etc.
  ● Microservices architecture, i.e. databases abstracted behind HTTP REST services.
● Aiohttp meets the basic requirements:
  ● An acceptable trade-off between performance and convenience.
  ● There is an active community.
● Asyncio is mature enough:
  ● The Asyncio API has been stable since Python 3.5.
  ● Reputable Python HTTP frameworks, e.g. Tornado, have plans to adopt Asyncio.
7. Running Aiohttp at scale @SkyscannerEng
But choosing Aiohttp/Asyncio also means face some uncertainties:
โ Asynchronous code != synchronous code
โ Different problems
โ Different patterns
โ Less experience
โ Some libraries might be not mature at all
โ Not enough time for growing
โ Small community
โ Less feedback
โ etc
9. We need to know what is happening, and what happened, in our microservice.
● HTTP endpoint statistics:
  ● Time per request
  ● Statistics such as avg, p90, p99
  ● Number of requests
  ● Status code per request; errors
● HTTP access log:
  ● Historical access
  ● Indexed by fields such as status code and endpoint
  ● Each log line can be tied to a specific request
10. Docker foundations of our microservice that enable request tracing
● Requests come in through HAProxy.
● The Aiohttp microservice handles the incoming request.
● Aiohttp sends metrics to the StatsD container:
  ● One message per metric (real time)
  ● Low latency network
● Aiohttp sends logs to the Heka container:
  ● One message per log line (real time)
  ● Low latency network
● StatsD sends aggregations of metrics to OpenTSDB:
  ● Avg, p90, p99
  ● Not quite real time: aggregated per minute
● Heka parses and sends the data to ElasticSearch:
  ● Not quite real time
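The StatsD wire format is simple enough to sketch: each metric is a small plain-text datagram sent over UDP, which is why a lost metric never blocks the service. A minimal illustration (the metric name and the address are made up for the example):

```python
import socket

def format_timing(metric, elapsed_ms):
    # StatsD timer format: <name>:<value>|ms
    return "{}:{}|ms".format(metric, int(elapsed_ms))

def send_timing(metric, elapsed_ms, addr=("127.0.0.1", 8125)):
    # Fire-and-forget UDP datagram: no connection, no blocking on the receiver.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_timing(metric, elapsed_ms).encode(), addr)
    sock.close()

send_timing("hotels.request.time", 42)  # emits "hotels.request.time:42|ms"
```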
11. Aiohttp middlewares are the perfect place to instrument incoming requests:
async def middleware_timing(app, handler):
    async def timing(request):
        start = app.loop.time()
        response = await handler(request)
        print(app.loop.time() - start)
        return response
    return timing

app = web.Application(middlewares=[middleware_timing])
12. Aiohttp microservices at Skyscanner come with the following middlewares for free:
● Request metrics: produce statistics per request.
● Access log: upload the access log.
● Correlation id: uniquely identify each incoming request.
13. Demo about metrics and access log
14. How could we follow the code path executed by a specific request?
async def foo(request):
    logging.info("Doing some complicated stuff")
    await asyncio.sleep(1)

async def bar(request):
    start = loop.time()
    await foo(request)
    logging.info("Time for foo {}".format(loop.time() - start))

async def view(request):
    logging.info("New request")
    await bar(request)
15. Any request at Skyscanner is identified by a unique ID. This identifier is stored in a place that is used automatically by any logging call.
16. aiotask-context stores information within the current asyncio.Task instance.
async def foo():
    aiotask_context.set("key", True)
    await asyncio.sleep(1)
    aiotask_context.get("key")
17. The request id is stored as a task attribute by a middleware, making it available anywhere in the code.
async def correlation_id(app, handler):
    async def save_correlation_id(request):
        correlation_id = request.headers.get(
            "Skyscanner-Correlation-Id",
            request.headers.get(
                "X-Correlation-Id",
                str(uuid.uuid4())
            )
        )
        context.set("Skyscanner-Correlation-Id", correlation_id)
        return await handler(request)
    return save_correlation_id
18. When the aiohttp microservice starts, a logging filter is installed to automatically populate the request id on each logging call.
class RequestId(logging.Filter):
    def filter(self, record):
        correlation_id = context.get("Skyscanner-Correlation-Id")
        record.correlationid = correlation_id
        return True
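Wiring such a filter in is straightforward: attach it to a handler whose formatter references the new record attribute. A self-contained sketch, with a plain dict standing in for aiotask-context and a hard-coded, hypothetical correlation id:

```python
import io
import logging

# Stand-in for aiotask-context: a plain dict (illustrative only).
context = {"Skyscanner-Correlation-Id": "abc-123"}

class RequestId(logging.Filter):
    def filter(self, record):
        # Attach the correlation id so the formatter can reference it.
        record.correlationid = context.get("Skyscanner-Correlation-Id", "-")
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(correlationid)s] %(message)s"))
handler.addFilter(RequestId())

logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.warning("New request")

print(stream.getvalue())  # [abc-123] New request
```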
20. All logging calls log the request id, both as a literal in the message and as a separate field to be indexed by ElasticSearch.
try:
    price_per_night = request.price / request.nights
except ZeroDivisionError:
    logging.warning("Invalid `nights` value param")
    raise
23. E.g. validating whether the currency sent in a request is valid. We use an external microservice for that:
async def validate_currency(request):
    session = ClientSession()
    resp = await session.post(
        "http://currency.eu-west-1.skyscnr.local",
        data={'currency': request.currency}
    )
    if resp.status != 200:
        raise ValidationError(
            "Currency {} invalid".format(request.currency))
24. We need to know what is happening, and what happened, with the calls to external microservices.
● Time per request
● Statistics such as avg, p90, p99
● Number of requests
● Status code per request; errors
25. Aiohttp does not yet provide an official way to trace request events. An ad hoc class called MetricsClient is implemented to replace the official ClientSession.
class MetricsClient(ClientSession):
    async def _request(self, *args, **kwargs):
        start = self._loop.time()
        response = await super()._request(*args, **kwargs)
        elapsed = self._loop.time() - start
        logging.info("Time spent {}".format(elapsed))
        return response
26. Demo about calling other services
27. DNS in AWS with Aiohttp
28. $ dig currency.eu-west-1.skyscnr.local
currency.eu-west-1.skyscnr.local. 59 IN A 10.51.106.106
currency.eu-west-1.skyscnr.local. 59 IN A 10.51.165.90
currency.eu-west-1.skyscnr.local. 59 IN A 10.51.35.2
AWS:
● DNS TTL of 60 seconds
● IP addresses can change
● The number of IP addresses can grow
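These constraints boil down to a host cache whose entries expire with the TTL. A minimal sketch of the idea (class and method names are illustrative, not aiohttp's internals):

```python
import time

class DNSCache:
    """Cache resolved addresses for at most `ttl` seconds."""

    def __init__(self, ttl=60):
        self._ttl = ttl
        self._entries = {}  # host -> (addresses, expiry timestamp)

    def put(self, host, addresses):
        self._entries[host] = (addresses, time.monotonic() + self._ttl)

    def get(self, host):
        entry = self._entries.get(host)
        if entry is None or time.monotonic() >= entry[1]:
            return None  # miss or expired: the caller must resolve again
        return entry[0]

cache = DNSCache(ttl=60)
cache.put("currency.eu-west-1.skyscnr.local", ["10.51.106.106", "10.51.165.90"])
cache.get("currency.eu-west-1.skyscnr.local")  # -> the cached addresses
```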
29. Aiohttp versions < 2 do not support DNS caching.
A DNS cache was implemented based on the AWS requirements. This implementation would become the official one in Aiohttp 2.
31. DNS cache and the dog pile effect. The following code makes 100 DNS queries, none of them served by the cache:
import asyncio
tasks = [validate_currency('EUR') for i in range(100)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
32. The dog pile effect happens when there is a miss in the cache: all in-flight requests end up performing their own DNS query.
To get rid of this side effect, a throttling mechanism was implemented. It is available since Aiohttp 2.3.
33.
# there was a miss in the cache
if host in self._throttle_dns_events:
    yield from self._throttle_dns_events[host].wait()
else:
    self._throttle_dns_events[host] = EventResultOrError(self._loop)
    addrs = yield from self._resolver.resolve(
        host, port, family=self._family)
    self._cached_hosts.add(host, addrs)
    self._throttle_dns_events[host].set()
return self._cached_hosts.next_addrs(host)
34. Misleading timeouts: the reactor saturation side effect
35. Calls to third-party services are protected by timeouts, giving us the chance to apply the proper countermeasures.
async def validate_currency(request):
    session = ClientSession()
    try:
        resp = await session.post(
            "http://currency.eu-west-1.skyscnr.local",
            data={'currency': request.currency},
            timeout=1,
        )
    except asyncio.TimeoutError:
        raise HttpError(504)
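The same pattern works for any awaitable via asyncio.wait_for, which is the primitive this kind of timeout builds on. A self-contained illustration:

```python
import asyncio

async def slow_call():
    # Stands in for a request to a slow third-party service.
    await asyncio.sleep(2)
    return "ok"

async def main():
    try:
        return await asyncio.wait_for(slow_call(), timeout=0.1)
    except asyncio.TimeoutError:
        return "504"  # countermeasure: answer with a gateway timeout

print(asyncio.run(main()))  # 504: the call was cancelled after 0.1s
```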
36. When the reactor is saturated, timeouts might be triggered spuriously. Timeouts are handled internally by asyncio as future callbacks that cancel a specific Future:
def cancel_future(future):
    future.cancel()

async def request(*args, timeout=2):
    f = asyncio.Future()
    loop.call_later(timeout, cancel_future, f)
    # some internal stuff that triggers the network
    # operations
    return f
37. Let's try to monitor the reactor saturation. How?
38. … with the lag of a scheduled function. The time elapsed between executions can be used to measure how busy the reactor is.
def lag():
    global before
    elapsed = loop.time() - before
    if elapsed > 1.1:  # scheduled every second; anything above is reactor delay
        print("Reactor had a delay")
    before = loop.time()
    loop.call_later(1, lag)
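To see the metric in action, saturate the loop with a blocking call and watch a scheduled callback fire late. A runnable sketch (the 0.1s schedule and 0.3s block are arbitrary values for the demo):

```python
import asyncio
import time

async def main():
    loop = asyncio.get_running_loop()
    scheduled_at = loop.time()
    lag = None

    def probe():
        nonlocal lag
        # How late the callback fired relative to its 0.1s schedule.
        lag = loop.time() - scheduled_at - 0.1

    loop.call_later(0.1, probe)
    time.sleep(0.3)            # blocking work: the reactor cannot run callbacks
    await asyncio.sleep(0.01)  # yield so the (late) probe can run
    return lag

print(asyncio.run(main()))  # roughly 0.2s of lag
```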
39. Example of the lag metric. You can identify the reactor saturation that happened at some point.
41. ● Trace queued operations
  ● The HTTP pool has a connection limit
  ● Once the limit is reached, the operation is queued
  ● When a connection is freed, the operation is dequeued
● AWS X-Ray support
  ● Another middleware
  ● Trace calls to third-party services
● Back pressure at the HTTP layer
  ● When the reactor is too busy, return 504
  ● Scale horizontally when there is a flood of 504s
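The back-pressure plan can be sketched as a guard in front of each handler; the threshold, names, and plain-function shape here are illustrative, not an existing implementation:

```python
LAG_THRESHOLD = 0.5  # seconds; illustrative value
current_lag = 0.0    # would be updated by a lag() probe like the one above

def guarded(handler):
    # Shed load instead of queueing more work on a saturated reactor.
    def wrapper(request):
        if current_lag > LAG_THRESHOLD:
            return 504  # the platform scales out on a flood of 504s
        return handler(request)
    return wrapper

@guarded
def handle(request):
    return 200

handle(None)  # -> 200 while the reactor is healthy
```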
42. Edinburgh • Glasgow • Singapore • Beijing • Miami • Barcelona • Shenzhen • Sofia • Budapest • London • Tokyo
Questions?
Slides http://bit.ly/runningatscale