Horizontal scaling in the Cloud is the way to adapt resources to the load on your systems. The Cloud allows users to scale virtually indefinitely, or at least enough for their needs.
This way the number of servers follows the trend of requests, and the TCO (Total Cost of Ownership) of the IT infrastructure can be reduced. Moreover, companies can avoid dealing with capacity planning and pre-provisioning issues.
This talk will show how to use Python and the Rackspace/OpenStack API and SDK to implement an event-based scaling solution (software released under the open-source Apache License: stay tuned).
PyCon Russia 2014 - Auto Scale in the Cloud
1. An introduction
Scale in the Cloud
Created by: Simone Soldateschi
Modified Date: 2014-06-02
Classification: Public Conference
2. RACKSPACE® HOSTING | WWW.RACKSPACE.COM
Who am I?
Simone Soldateschi
• Java, C/C++, PHP, Python developer
• More than 8 years experience as SysAdm/SysEng
• Developer Support Engineer at Rackspace
• Task automation enthusiast
• MTB’ing, triathlon, photo, manga
@soldasimo
simonesoldateschi
3. Who are Rackspace?
Founded in 1998 in San Antonio, TX by three guys who wanted to create a hosting company
Home of Fanatical Support /o/
Second biggest Public Cloud in the world
OpenStack Project co-founder
4. Rackspace Vision
“To be recognized as one of the world’s great service companies.”
5. Roadmap
• Python SDK, Cloud
• Auto Scaling
• Management System
• Control law
• Garçon, all together now!
16. simone.soldateschi@rackspace.co.uk
New Usage Models – CLOUDSMART
Dedicated Servers are Pets
• Great thought given to their acquisition
• Name them and know each one
• Willing to pay big money for their care
Cloud Servers are Livestock
• Use them as long as they provide value
• Acquire more of them when needed
• Dispose of any that aren’t needed
• Get rid of them if they become ill
56. Desired State
Write code to tell the computer how to set itself up!
57. Roadmap
• Python SDK, Cloud
• Auto Scaling
• Management System
• Control law
• Garçon, all together now!
67. Closed-Loop Control Law – Garçon implementation
Garçon
68. Garçon in-depth – cm2asd
# fetch current list of servers
l_current_servers = scaling_group_servers(scaling_group_id)
69. Garçon in-depth – cm2asd
for i in range(len(l_current_servers)-1, -1, -1):
    server_id = l_current_servers[i]
    s = get_server(server_id)
    if s.status != 'ACTIVE':
        # server not active
        l_current_servers.pop(i)
        continue
    m = get_server_metadata(s.id)
    try:
        if m['aspoc.server_status'] != 'configured':
            # server not configured
            l_current_servers.pop(i)
            continue
    except KeyError:
        # server not configured (no metadata)
        l_current_servers.pop(i)
        continue
71. Garçon in-depth – cm2asd
# compute average system load for scaling group
servers_avg_load = servers_average_load(l_checks, samples, sample_time)

# compare current load against configured threshold
if servers_avg_load >= threshold_high:
    # trigger scale_up_webhook
    r = requests.post(scale_up_webhook)
    if r.status_code != 202:
        logger.error('scale_up_webhook (%s) returned HTTP %d' %
                     (scale_up_webhook, r.status_code))
72. Garçon in-depth – cm2asd
if servers_avg_load <= threshold_low:
    # trigger scale_down_webhook
    r = requests.post(scale_down_webhook)
    if r.status_code != 202:
        logger.error('scale_down_webhook (%s) returned HTTP %d' %
                     (scale_down_webhook, r.status_code))
73. Garçon in-depth – cfgmgmtd
for s_id in l_current_servers:
    ...
    # server exists?
    try:
        cs = pyrax.cloudservers
        cs.servers.get(s_id)
    except Exception:
        logging.warning('Auto Scale server (%s, %s) missing '
                        '(maybe deleted manually?)' % ('-', s_id))
        continue
    ...
    try:
        # read server metadata
        m = get_server_metadata(s_id)
        ...
74. Garçon in-depth – cfgmgmtd
Use metadata:
try:
    if (server_status != 'configured' and
            server_status != 'configuring'):
        ...
        # run thread to configure server
        threading.Thread(target=configure_server,
                         args=(s_id, ansible_timeout,)).start()
No metadata?
except KeyError:
    # CONFIGURE server (KeyError, no metadata) in thread
    threading.Thread(target=configure_server,
                     args=(s_id, ansible_timeout,)).start()
75. Garçon in-depth – cfgmgmtd, Ansible
Reset server’s password:
# set server password
password = generate_password(10, punctuation=False)
set_server_password(server_id, password)
Server’s info (e.g. IP address):
# fetch server info
ip = get_server_ipv4(server_id, MGMT_NETWORK)
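The `get_server_ipv4` helper used above is not spelled out in the slides. Here is a minimal sketch, assuming the usual novaclient/pyrax `addresses` layout; the pure extraction step is a separate function so it can be tested without API access, and `get_server_ipv4` itself is a hypothetical reconstruction:

```python
def extract_ipv4(addresses, network_label):
    """Return the first IPv4 address on the given network, or None.

    `addresses` follows the novaclient/pyrax shape:
    {"network-label": [{"version": 4, "addr": "10.0.0.5"}, ...], ...}
    """
    for entry in addresses.get(network_label, []):
        if entry.get("version") == 4:
            return entry["addr"]
    return None

def get_server_ipv4(server_id, network_label):
    """Hypothetical reconstruction of the helper in the slide above.

    Assumes pyrax has already been configured and authenticated.
    """
    import pyrax  # imported lazily; requires the SDK and credentials
    server = pyrax.cloudservers.servers.get(server_id)
    return extract_ipv4(server.addresses, network_label)
```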
77. Garçon in-depth – cfgmgmtd, Monitoring System
Create checks for the new server to be managed:
# cloud monitoring (agent_id := server_uuid)
add_cm_cpu_check(server_id)
# good, set 'aspoc.server_status=configured' in metadata
set_server_metadata(server_id, 'aspoc.server_status', 'configured')
78. RECAP
• Python SDK, Cloud
• Auto Scaling
• Management System
• Control law
• Garçon, all together now!
That’s me! I started as a developer, then worked as a Systems Engineer.
Since I moved to the UK I have strived to combine programming and SysEng skills.
I like real life too: cycling and all sorts of outdoor activities.
If you’ve seen a Rackspace presentation before, you have likely seen a statement like this about great customer service.
At Rackspace, we believe that Service is our key strategic differentiator and the reason customers will continue to trust their business to Rackspace. While technology is playing a bigger role than ever in RAX capabilities (OpenStack, Public/Private Cloud, RackConnect, etc.), we’ll continue to rely on service as our primary differentiator.
First of all
You are Python developers; should you want to use OpenStack you’ll need a Python SDK, namely pyrax.
Beware that the name may change, as it references Rackspace (i.e. the RAX suffix), but it is meant to support OpenStack too.
Let’s create a Python Virtual Environment, to keep things clean.
And install the pyrax SDK and ipython for testing purposes
We will use Python 2
See the pyrax project, docs and snippets on GitHub
Define environment variables
Use pyrax to authenticate
Define environment variables
Give it a shot -- Use pyrax to authenticate, and see what happens
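As a sketch of the two steps above (the environment-variable names OS_USERNAME and OS_API_KEY are my assumption, not something the talk prescribes), authentication with pyrax could look like this:

```python
import os

def rackspace_credentials(env=None):
    """Read the username and API key from the environment.

    The variable names OS_USERNAME / OS_API_KEY are assumptions here;
    use whatever names your deployment defines.
    """
    env = os.environ if env is None else env
    return env["OS_USERNAME"], env["OS_API_KEY"]

def authenticate():
    """Authenticate against the Rackspace identity service with pyrax.

    pyrax is imported lazily so this module can be loaded and tested
    without the SDK installed.
    """
    import pyrax
    pyrax.set_setting("identity_type", "rackspace")
    username, api_key = rackspace_credentials()
    pyrax.set_credentials(username, api_key)
    return pyrax.identity.authenticated
```

Calling `authenticate()` with valid credentials exported should return True; with bad credentials pyrax raises an authentication error.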
pyrax supports many Cloud components that you can choose from when writing software.
For our purposes we need to use just the following three Cloud components: Monitoring, Servers, Auto Scale
To fully understand how scaling infrastructure works in the Cloud, let’s discuss what Auto Scaling means.
This leads us to the next section of this presentation.
Discuss what Auto Scale is, how to scale, when and why. Then go deeper into scaling techniques.
Traditionally if you wanted a more powerful server, you would buy more RAM and add CPUs.
That approach is called vertical scaling.
Say that 40 servers can serve the highest traffic peak your infrastructure is ever going to see.
You need to provision those 40 servers by that date, if you know when it is going to happen,
or just buy 40 servers from the very beginning.
The gray area is wasted money.
You have many more servers than you need, just to be prepared for the high-traffic moment.
Chances are you are going to have a hard time explaining to someone on the finance team why you should buy and maintain 40 servers when 10 or 20 are fine most of the time.
CAPEX and OPEX are extremely high: you own every piece of hardware and have to maintain it.
Now, say that you are somehow able to make those two curves match.
This way you provision just a bit more capacity than you need, just in case, but:
CAPEX is shifted toward OPEX – you do not really own anything, you just use servers when you need them.
OPEX is minimised too – should something go wrong, destroy the faulty server and spin a new one up.
Auto Scaling is the ability to automatically or semi-automatically scale a group of servers up and down, based on computing or traffic demand, by provisioning new servers.
Does any of you name your servers?
ON & OFF
FAST GROWTH
VARIABLE
CONSISTENT
Boolean load
There are three different Autoscaling methodologies to choose from.
They can also be combined and mixed together.
The typical scenario for a web application is…
At 9 o’clock…
…on the 1st of November…
…spinning a Cloud Server up is scheduled!
For Reactive Autoscaling, let’s say that there are two servers working at 60%.
Load increases and the overall load rises to 80%.
Autoscaling adds a new server to the cluster, and the overall load decreases.
Then the overall load drops to 30% (e.g. fewer requests).
Autoscaling spins one server down.
Let’s discuss the last scaling type: Predictive Autoscaling
It is somehow possible to forecast traffic
Servers are spun up and down according to forecast
Let’s RECAP
Schedule based scaling: set time to scale up…
…then set time to scale down
On the other hand, event-based scaling:
Set thresholds; when they are hit, scale-up or scale-down policies are triggered.
The idea behind Cooldown is to set the right pace, much like the pace car in a race.
Let’s say your server requires 3 minutes to be fully provisioned, configured, deployed.
Within those three minutes there is no reason to scale up again.
Wait for the server being built to go live, then re-enable scaling up.
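The cooldown idea can be sketched as a small guard object (my own illustration, not the Auto Scale implementation); the clock is injectable so the behaviour is easy to test:

```python
import time

class Cooldown(object):
    """Suppress repeated scale actions while a new server is still building.

    `period` is the cooldown in seconds (e.g. 180 for the 3-minute
    provisioning time mentioned above).
    """

    def __init__(self, period, clock=time.time):
        self.period = period
        self.clock = clock
        self.last_fired = None

    def allow(self):
        """Return True if a scale action may fire now, and record it."""
        now = self.clock()
        if self.last_fired is None or now - self.last_fired >= self.period:
            self.last_fired = now
            return True
        return False
```

A second scale-up request arriving within the period is simply ignored; once the period has elapsed, scaling is re-enabled.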
Rings a bell?
How could we apply scaling to existing infrastructures?
Let’s view some scenarios.
Cluster of application or front-end servers
Adopting stateless servers
Scaling Boss-Worker clusters
Enough theory for now!
Let’s start discussing what tools you might want for Auto Scale in the Cloud.
How many of you use a Configuration Management System?
How many of you use Ansible?
How many of you use, or have used, Puppet/Chef/SaltStack?
Let’s see what a Configuration Management System does, and what desired state means.
How do you install Ansible on your laptop, or on a management server?
In fact, Ansible is agent-less. OK, OK, SSH is an agent ;)
Ansible streamlines managing remote servers, as there is no need for a pre-installed agent. So no golden image (which is not DevOpsy!), and no start-up script.
How would you provision a server manually?
Build server up
Attach block devices
Create filesystem
Install packages
Configure it (e.g. users, daemons, firewall policies, etc)
Now think that you can achieve the very same result with Ansible.
You just need to decide what you are aiming for.
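The desired-state idea can be shown in miniature (a sketch of the concept, not Ansible's actual mechanism): compare the state you are aiming for with the actual state, and apply only the difference.

```python
def plan(desired, actual):
    """Return only the settings that differ from the desired state.

    Running plan() against an already converged state yields an empty
    dict: re-applying is a no-op, which is the essence of idempotency.
    """
    return {key: value for key, value in desired.items()
            if actual.get(key) != value}
```

On a pristine server the whole desired state is returned; on an already configured one the plan is empty and there is nothing to do.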
What does a closed control loop look like?
This diagram shows components of a closed-loop control law.
It is called closed-loop because there is feedback, which is taken into account.
Systems load is monitored, and if it hits certain thresholds, then Auto Scale policies are triggered.
In the OpenStack universe, Otter replaces Auto Scale, and a monitoring system of the customer’s choice replaces Cloud Monitoring.
Auto Scale puts messages on the Cloud Servers message queue, and servers are spun up or down accordingly.
Ideally you would like a piece of software that is able to do the following:
read the configuration
get the list of current servers in the scaling group
fetch data and stats from the monitoring system
compute the average load across all systems
trigger scaling policies
The infrastructure scales according to the reference, which is the configuration file.
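One iteration of that loop could be sketched as follows. The collaborators are passed in as callables standing in for the Cloud Monitoring and Auto Scale calls shown later in the talk; function names and config keys are illustrative:

```python
def control_loop_iteration(config, get_servers, get_avg_load, trigger):
    """One pass of the closed-loop control law described above.

    `get_servers` lists the scaling group, `get_avg_load` queries the
    monitoring system, and `trigger` fires an Auto Scale webhook.
    """
    servers = get_servers(config["scaling_group_id"])
    if not servers:
        # nothing to measure yet
        return None
    load = get_avg_load(servers)
    if load >= config["threshold_high"]:
        trigger(config["scale_up_webhook"])
        return "up"
    if load <= config["threshold_low"]:
        trigger(config["scale_down_webhook"])
        return "down"
    return "steady"
```

Run from a daemon (or cron), each pass compares the measured load against the reference in the configuration file and triggers at most one policy.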
Putting it all together with Garçon.
Garçon is software I wrote in Python to integrate Cloud Monitoring and Auto Scale.
Garçon is the glue between Cloud Monitoring and Auto Scale.
It queries Cloud Monitoring, checks the load of scaling groups against configured thresholds, and triggers Auto Scale policies.
Garçon is composed of two daemons (which can also be run from cron), cm2asd and cfgmgmtd.
The former fetches stats from Cloud Monitoring. The latter triggers Auto Scale policies and runs the configuration management system on fresh new servers.
Just to recap: run cfgmgmtd, which runs Ansible and configures new/pristine Cloud Servers → makes them ready to go live!
Fetch the current list of servers in the scaling group.
Only ACTIVE servers within the scaling group participate in computing the overall load (e.g. average CPU load, average memory usage, length of the message queue).
AIM -- Fetch monitoring data for ACTIVE servers only.
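The `servers_average_load` function used by cm2asd is not spelled out in the slides. A plausible sketch, averaging the most recent samples per ACTIVE server and then across the group (the `sample_time` parameter is kept only to mirror the slide's signature):

```python
def servers_average_load(checks, samples, sample_time=None):
    """Average load across a scaling group.

    `checks` maps server id -> list of recent load samples (newest
    last), as fetched from the monitoring system; only the last
    `samples` readings per server are considered.
    """
    per_server = []
    for readings in checks.values():
        recent = readings[-samples:]
        if recent:
            # mean of this server's most recent samples
            per_server.append(sum(recent) / float(len(recent)))
    if not per_server:
        return 0.0
    # mean across all servers in the group
    return sum(per_server) / float(len(per_server))
```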
Get server status
Check whether the key/value pair exists in the metadata, and whether the value is 'configured'.
Scale up.
The webhook returns HTTP 202 regardless of the outcome, to prevent information leakage.
Scale down.
Again, HTTP 202 is returned regardless, to prevent information leakage.
What’s the servers’ status? Let’s cycle through them to find out…
Just ensure that every single server exists. REMEMBER, we are in the Cloud.
If the metadata tag does not indicate an already managed server,
OR there is no metadata,
THEN run Ansible against that server.
Now we are going to discuss how to configure server
Reset the server’s password, to let Ansible SSH into it.
Do you remember we said «servers are livestock»?
They are supposed to be managed programmatically. Nobody should ever SSH into a Cloud Server, especially if it is part of a scaling group.
Let the Configuration Management System do its work.
Set playbook related variables
Prepare the command statement with its options, and run it.
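Building and running that command could look like the sketch below. The `-i "host,"` trick (note the comma) is the standard way to pass an ad-hoc one-host inventory to ansible-playbook; the playbook path and extra variables are hypothetical, and the command builder is separated out so it can be tested without running Ansible:

```python
import subprocess

def build_ansible_cmd(playbook, host, extra_vars=None, timeout=None):
    """Assemble an ansible-playbook invocation for a single new server."""
    cmd = ["ansible-playbook", "-i", host + ",", playbook]
    if extra_vars:
        # pass playbook-related variables via -e key=value
        for key, value in extra_vars.items():
            cmd += ["-e", "%s=%s" % (key, value)]
    if timeout:
        # SSH connection timeout, matching ansible_timeout in the talk
        cmd += ["--timeout", str(timeout)]
    return cmd

def configure_server_with_ansible(playbook, host, **kwargs):
    """Run the playbook against the host; returns the exit code."""
    return subprocess.call(build_ansible_cmd(playbook, host, **kwargs))
```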
Create and attach new checks to new servers, so that the Monitoring System is aware of the new servers to monitor.
State that you are finished managing the server, i.e. TAG the server as CONFIGURED.
Configuration Management Systems are idempotent, meaning you can run the same playbook/recipe/manifest against an already configured server over and over again.
For performance, just skip servers that are already configured.