Skyscanner Hotels is built on top of Aiohttp, which has been chosen as one of the formally supported frameworks at the company. We recently presented this talk, which covers how Aiohttp is used, what we have learned, and how we are trying to generalize some implementations and patterns so that any new microservice at Skyscanner can use them from scratch.
The talk also covers essentials of Asyncio, the HTTP protocol, and the AWS infrastructure, all of which are needed to understand some of the points made during the talk.
2. ● About me
● Skyscanner and aiohttp, why?
● Tracing incoming requests
● Calling other microservices
● DNS in AWS with Aiohttp
● Misleading timeouts: the reactor saturation side effect
● Desired plans
● Questions
Running Aiohttp at scale @SkyscannerEng
3. Pau Freixes @pfreixes
● Senior Software Engineer, working at Skyscanner for almost 2 years.
● Member of the Hotels Attachment Squad.
● Also a collaborator of the Mshell Squad, helping with the Python stuff.
● Open source contributor: aioredis, aiohttp, etc.
5. ● Skyscanner loves the microservice architecture pattern.
● Microservices talk to each other using HTTP.
● The most used languages at Skyscanner (Java, JavaScript, Python) each have an official, supported HTTP framework.
● Adopting a standard HTTP framework means shared knowledge, less fragmentation, and a single implementation of commonalities.
6. Why Aiohttp? Or why Asyncio?
● We were looking for a framework suited to IO-bound scenarios:
  ● AWS APIs: DynamoDB, S3, etc.
  ● Microservices architecture, i.e. databases abstracted behind HTTP REST services.
● Aiohttp meets the basic requirements:
  ● An acceptable trade-off between performance and convenience.
  ● There is an active community.
● Asyncio is mature enough:
  ● The Asyncio API has been stable since Python 3.5.
  ● Reputable Python HTTP frameworks, e.g. Tornado, have plans to adopt Asyncio.
7. Running Aiohttp at scale @SkyscannerEng
But choosing Aiohttp/Asyncio also means face some uncertainties:
โ Asynchronous code != synchronous code
โ Different problems
โ Different patterns
โ Less experience
โ Some libraries might be not mature at all
โ Not enough time for growing
โ Small community
โ Less feedback
โ etc
9. We need to know what is happening, and what happened, in our microservice.
● HTTP endpoint statistics:
  ● Time per request
  ● Statistics such as avg, p90, p99
  ● Number of requests
  ● Status code per request; errors
● HTTP access log:
  ● Historical access
  ● Indexed by fields such as status code and endpoint
  ● Each log line can be tied to a specific request
10. Docker foundations of our microservice that enable request tracing
● Requests come in through HAProxy.
● The Aiohttp microservice handles the incoming request.
● Aiohttp sends metrics to the StatsD container:
  ● One message per metric (real time)
  ● Low latency network
● Aiohttp sends logs to the Heka container:
  ● One message per log line (real time)
  ● Low latency network
● StatsD sends aggregations of metrics to OpenTSDB:
  ● Avg, p90, p99
  ● Not quite real time: aggregated per minute
● Heka parses and sends the data to ElasticSearch:
  ● Not quite real time
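The StatsD wire format is simple enough to sketch: each metric is a small plain-text datagram sent over UDP, which is why a lost metric never blocks the service. A minimal illustration (the metric name and the address are made up for the example):

```python
import socket

def format_timing(metric, elapsed_ms):
    # StatsD timer format: <name>:<value>|ms
    return "{}:{}|ms".format(metric, int(elapsed_ms))

def send_timing(metric, elapsed_ms, addr=("127.0.0.1", 8125)):
    # Fire-and-forget UDP datagram: no connection, no blocking on the receiver.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_timing(metric, elapsed_ms).encode(), addr)
    sock.close()

send_timing("hotels.request.time", 42)  # emits "hotels.request.time:42|ms"
```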
11. Aiohttp middlewares are the perfect place to instrument incoming requests:
async def middleware_timing(app, handler):
    async def timing(request):
        start = app.loop.time()
        response = await handler(request)
        print(app.loop.time() - start)
        return response
    return timing

app = web.Application(middlewares=[middleware_timing])
12. Aiohttp microservices at Skyscanner come with the following middlewares for free:
● Request metrics: produce statistics per request.
● Access log: upload the access log.
● Correlation id: uniquely identify each incoming request.
13. Demo about metrics and access log
14. How could we follow the code path executed by a specific request?
async def foo(request):
    logging.info("Doing some complicated stuff")
    await asyncio.sleep(1)

async def bar(request):
    start = loop.time()
    await foo(request)
    logging.info("Time for foo {}".format(loop.time() - start))

async def view(request):
    logging.info("New request")
    await bar(request)
15. Any request at Skyscanner is identified by a unique ID. This identifier is stored in a place that is used automatically by any logging call.
16. aiotask-context stores information within the current asyncio.Task instance.
async def foo():
    aiotask_context.set("key", True)
    await asyncio.sleep(1)
    aiotask_context.get("key")
17. The request id is stored as a task attribute by a middleware, making it available anywhere in the code.
async def correlation_id(app, handler):
    async def save_correlation_id(request):
        correlation_id = request.headers.get(
            "Skyscanner-Correlation-Id",
            request.headers.get(
                "X-Correlation-Id",
                str(uuid.uuid4())
            )
        )
        context.set("Skyscanner-Correlation-Id", correlation_id)
        return await handler(request)
    return save_correlation_id
18. When the aiohttp microservice starts, a logging filter is installed to automatically populate the request id on each logging call.
class RequestId(logging.Filter):
    def filter(self, record):
        correlation_id = context.get("Skyscanner-Correlation-Id")
        record.correlationid = correlation_id
        return True
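Wiring such a filter in is straightforward: attach it to a handler whose formatter references the new record attribute. A self-contained sketch, with a plain dict standing in for aiotask-context and a hard-coded, hypothetical correlation id:

```python
import io
import logging

# Stand-in for aiotask-context: a plain dict (illustrative only).
context = {"Skyscanner-Correlation-Id": "abc-123"}

class RequestId(logging.Filter):
    def filter(self, record):
        # Attach the correlation id so the formatter can reference it.
        record.correlationid = context.get("Skyscanner-Correlation-Id", "-")
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(correlationid)s] %(message)s"))
handler.addFilter(RequestId())

logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.warning("New request")

print(stream.getvalue())  # [abc-123] New request
```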
20. All logging calls log the request id, both as a literal in the message and as a separate field to be indexed by ElasticSearch.
try:
    price_per_night = request.price / request.nights
except ZeroDivisionError:
    logging.warning("Invalid `nights` value param")
    raise
23. E.g. validating whether the currency sent in a request is valid. We use an external microservice for that:
async def validate_currency(request):
    session = ClientSession()
    resp = await session.post(
        "http://currency.eu-west-1.skyscnr.local",
        data={'currency': request.currency}
    )
    if resp.status != 200:
        raise ValidationError(
            "Currency {} invalid".format(request.currency))
24. We need to know what is happening, and what happened, with the calls to external microservices.
● Time per request
● Statistics such as avg, p90, p99
● Number of requests
● Status code per request; errors
25. Aiohttp does not yet provide an official way to trace request events. An ad hoc class called MetricsClient is implemented to replace the official ClientSession.
class MetricsClient(ClientSession):
    async def _request(self, *args, **kwargs):
        start = self._loop.time()
        response = await super()._request(*args, **kwargs)
        elapsed = self._loop.time() - start
        logging.info("Time spent {}".format(elapsed))
        return response
26. Demo about calling other services
27. DNS in AWS with Aiohttp
28. $ dig currency.eu-west-1.skyscnr.local
currency.eu-west-1.skyscnr.local. 59 IN A 10.51.106.106
currency.eu-west-1.skyscnr.local. 59 IN A 10.51.165.90
currency.eu-west-1.skyscnr.local. 59 IN A 10.51.35.2
AWS:
● DNS TTL of 60 seconds
● IP addresses can change
● The number of IP addresses can grow
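These constraints boil down to a host cache whose entries expire with the TTL. A minimal sketch of the idea (class and method names are illustrative, not aiohttp's internals):

```python
import time

class DNSCache:
    """Cache resolved addresses for at most `ttl` seconds."""

    def __init__(self, ttl=60):
        self._ttl = ttl
        self._entries = {}  # host -> (addresses, expiry timestamp)

    def put(self, host, addresses):
        self._entries[host] = (addresses, time.monotonic() + self._ttl)

    def get(self, host):
        entry = self._entries.get(host)
        if entry is None or time.monotonic() >= entry[1]:
            return None  # miss or expired: the caller must resolve again
        return entry[0]

cache = DNSCache(ttl=60)
cache.put("currency.eu-west-1.skyscnr.local", ["10.51.106.106", "10.51.165.90"])
cache.get("currency.eu-west-1.skyscnr.local")  # -> the cached addresses
```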
29. Aiohttp versions < 2 do not support DNS caching.
A DNS cache was implemented based on the AWS requirements. This implementation would become the official one in Aiohttp 2.
31. DNS cache and the dog pile effect. The following code makes 100 DNS queries, none of them served by the cache:
import asyncio
tasks = [validate_currency('EUR') for i in range(100)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
32. The dog pile effect happens when there is a miss in the cache: all in-flight requests end up performing their own DNS query.
To get rid of this side effect, a throttling mechanism was implemented. It is available since Aiohttp 2.3.
33.
# there was a miss in the cache
if host in self._throttle_dns_events:
    yield from self._throttle_dns_events[host].wait()
else:
    self._throttle_dns_events[host] = EventResultOrError(self._loop)
    addrs = yield from self._resolver.resolve(
        host, port, family=self._family)
    self._cached_hosts.add(host, addrs)
    self._throttle_dns_events[host].set()
return self._cached_hosts.next_addrs(host)
34. Misleading timeouts: the reactor saturation side effect
35. Calls to third-party services are protected by timeouts, giving us the chance to apply the proper countermeasures.
async def validate_currency(request):
    session = ClientSession()
    try:
        resp = await session.post(
            "http://currency.eu-west-1.skyscnr.local",
            data={'currency': request.currency},
            timeout=1,
        )
    except asyncio.TimeoutError:
        raise HttpError(504)
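The same pattern works for any awaitable via asyncio.wait_for, which is the primitive this kind of timeout builds on. A self-contained illustration:

```python
import asyncio

async def slow_call():
    # Stands in for a request to a slow third-party service.
    await asyncio.sleep(2)
    return "ok"

async def main():
    try:
        return await asyncio.wait_for(slow_call(), timeout=0.1)
    except asyncio.TimeoutError:
        return "504"  # countermeasure: answer with a gateway timeout

print(asyncio.run(main()))  # 504: the call was cancelled after 0.1s
```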
36. When the reactor is saturated, timeouts might be triggered spuriously. Timeouts are handled internally by asyncio as future callbacks that cancel a specific Future:
def cancel_future(future):
    future.cancel()

async def request(*args, timeout=2):
    f = asyncio.Future()
    loop.call_later(timeout, cancel_future, f)
    # some internal stuff that triggers the network
    # operations
    return f
37. Let's try to monitor the reactor saturation. How?
38. … with the lag of a scheduled function. The time elapsed between executions can be used to measure how busy the reactor is.
def lag():
    global before
    elapsed = loop.time() - before
    if elapsed > 1.1:  # scheduled every second; anything above is reactor delay
        print("Reactor had a delay")
    before = loop.time()
    loop.call_later(1, lag)
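To see the metric in action, saturate the loop with a blocking call and watch a scheduled callback fire late. A runnable sketch (the 0.1s schedule and 0.3s block are arbitrary values for the demo):

```python
import asyncio
import time

async def main():
    loop = asyncio.get_running_loop()
    scheduled_at = loop.time()
    lag = None

    def probe():
        nonlocal lag
        # How late the callback fired relative to its 0.1s schedule.
        lag = loop.time() - scheduled_at - 0.1

    loop.call_later(0.1, probe)
    time.sleep(0.3)            # blocking work: the reactor cannot run callbacks
    await asyncio.sleep(0.01)  # yield so the (late) probe can run
    return lag

print(asyncio.run(main()))  # roughly 0.2s of lag
```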
39. Example of the lag metric. You can identify the reactor saturation that happened at some point.
41. ● Trace queued operations
  ● The HTTP pool has a connection limit
  ● Once the limit is reached, the operation is queued
  ● When a connection is freed, the operation is dequeued
● AWS X-Ray support
  ● Another middleware
  ● Trace calls to third-party services
● Back pressure at the HTTP layer
  ● When the reactor is too busy, return 504
  ● Scale horizontally when there is a flood of 504s
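The back-pressure plan can be sketched as a guard in front of each handler; the threshold, names, and plain-function shape here are illustrative, not an existing implementation:

```python
LAG_THRESHOLD = 0.5  # seconds; illustrative value
current_lag = 0.0    # would be updated by a lag() probe like the one above

def guarded(handler):
    # Shed load instead of queueing more work on a saturated reactor.
    def wrapper(request):
        if current_lag > LAG_THRESHOLD:
            return 504  # the platform scales out on a flood of 504s
        return handler(request)
    return wrapper

@guarded
def handle(request):
    return 200

handle(None)  # -> 200 while the reactor is healthy
```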
42. Edinburgh • Glasgow • Singapore • Beijing • Miami • Barcelona • Shenzhen • Sofia • Budapest • London • Tokyo
Questions?
Slides http://bit.ly/runningatscale