RightScale discussed best practices for operating applications in the cloud, based on its own experience running its SaaS platform in the cloud. Key practices included: using ServerTemplates to keep environments consistent, changing configurations rather than servers, monitoring efficiently with off-the-shelf and custom metrics, automating processes, and backing up databases across clouds. The presentation provided examples of how RightScale applies these practices in its own operations.
2. 2#
Your Panel Today
Presenting
• Rafael H. Saavedra, VP Engineering, RightScale
• Josep Blanquer, Sr. Systems Architect, RightScale
Q&A
• David Manriquez, Account Manager, RightScale
Please use the “Questions” window to ask questions any time!
Cloud Management Platform
3. 3#
Agenda
• RightScale architecture
• The release cycle
• Monitoring, alerts and escalations
• When servers fail
• Our best practices
Today’s material discusses how we run RightScale in the cloud.
From this, we distill best practices that apply to anyone running in the cloud.
5. 5#
The scale of RightScale
• > 3M servers launched by RightScale
• RightScale continuously monitors > 100K servers
• Every day at RightScale:
• 2,000 array resize actions are executed
• 35,000 alert escalations are triggered
• 20,000 escalation emails are sent to users
• 9.0TB of monitoring data is exchanged with our servers
• 1.6TB of logging data is sent to our servers
6. 6#
Architecture of a cloud-based SaaS app
• RightScale is a SaaS application that runs completely in the cloud
• Databases
• Core web app and API
• Services such as monitoring, logging, and MultiCloud Marketplace
7. 7#
A quick primer on ServerTemplates
[Figure: configuring servers by bundling images vs. with ServerTemplates]
Configuring servers through bundling images:
• Custom MySQL 5.0.24 (CentOS 5.2)
• Custom MySQL 5.0.24 (CentOS 5.4)
• MySQL 5.0.36 (CentOS 5.4)
• MySQL 5.0.36 (Ubuntu 8.10)
• MySQL 5.0.36 (Ubuntu 8.10) 64bit
• Frontend Apache 1.3 (Ubuntu 8.10)
• Frontend Apache 2.0 (Ubuntu 9.10) - patched
• CMS v1.0 (CentOS 5.4)
• CMS v1.1 (CentOS 5.4)
• My ASP appserver (windows 2008)
• My ASP.net (windows 2008) – security update 1
• My ASP.net (windows 2008) – security update 8
• SharePoint v4 (windows 2003) – 32bit
• SharePoint v4 (windows 2003) – 64bit
• SharePoint v4.5 (windows 2003) – 64bit
• …
Configuring servers with ServerTemplates:
• A set of configuration directives that will install and configure software on top of the base image
• Example boot sequence: Install monitoring, Install MySQL Server, Configure MySQL, Restore last backup, Setup DNS and IPs
• Base images (MultiCloud Images): very few and basic, e.g. CentOS 5.2, CentOS 5.4, Ubuntu 8.10, Ubuntu 9.10, Win 2003, Win 2007
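To make the contrast concrete, here is a minimal Python sketch that models a ServerTemplate as an ordered list of boot scripts applied on top of a base image. The `ServerTemplate` class and `boot_plan` function are hypothetical illustrations, not RightScale's API, and the script order is an assumption read from the MySQL example on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerTemplate:
    """Hypothetical model: an ordered boot sequence run on top of a base image."""
    name: str
    boot_scripts: tuple[str, ...]


def boot_plan(template: ServerTemplate, base_image: str) -> list[str]:
    """List the steps a new server performs: boot the image, then run each script in order."""
    return [f"boot {base_image}"] + [f"run {script}" for script in template.boot_scripts]


# Script order assumed from the slide's MySQL example.
mysql = ServerTemplate(
    name="MySQL",
    boot_scripts=(
        "Install monitoring",
        "Install MySQL Server",
        "Configure MySQL",
        "Restore last backup",
        "Setup DNS and IPs",
    ),
)

# One template serves every base image, instead of one bundled image per OS.
base_images = ["CentOS 5.4", "Ubuntu 8.10", "Ubuntu 9.10"]
plans = {image: boot_plan(mysql, image) for image in base_images}
```

Supporting another OS means adding one base image to the list, rather than bundling a new image for every software stack on that OS.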
8. 8#
We use the same ServerTemplates our customers do
• RightScale uses 15-20 different ServerTemplates in production
• We don’t build images; we use pre-built MultiCloud Images with RightLink
• We make heavy use of RightScale-provided toolboxes (EBS, DNS, LB)
• Off-the-shelf: 1 template (MySQL)
• Customized: App servers and load balancers
• Written with RightScripts in Ruby, Bash, etc.
• Mostly Rails apps to run our core services: front end, API, Marketplace, etc.
• From MultiCloud Image: Messaging and databases
• RabbitMQ, Cassandra
10. 10#
Best practices: Architecture
• ServerTemplates can be used off the shelf or customized
• Don’t bundle images
• Make heavy use of MCIs (MultiCloud Images) instead of hardcoding base RightImages
• Deployments let you stage servers in the cloud
• Inputs guarantee consistency across all servers
• Easily test or fail over
• Macros/API automation can quickly stand up entire deployments
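Inputs keep servers consistent because a value set once at the deployment level reaches every server in that deployment. The sketch below assumes a three-level precedence (ServerTemplate defaults, then deployment, then server, with the most specific level winning); the precedence order, the input names, and the `resolve_inputs` helper are illustrative assumptions rather than RightScale's documented behavior.

```python
def resolve_inputs(
    template_defaults: dict[str, str],
    deployment_inputs: dict[str, str],
    server_inputs: dict[str, str],
    required: frozenset[str] = frozenset(),
) -> dict[str, str]:
    """Merge inputs, letting more specific levels override less specific ones (assumed order)."""
    resolved = {**template_defaults, **deployment_inputs, **server_inputs}
    missing = sorted(name for name in required if name not in resolved)
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return resolved


# Hypothetical input names and values.
template = {"DB_NAME": "app", "LOG_LEVEL": "info"}
deployment = {"DB_HOST": "db.staging.example.com"}
servers = {"app-1": {}, "app-2": {"LOG_LEVEL": "debug"}}

resolved = {
    name: resolve_inputs(template, deployment, overrides, required=frozenset({"DB_HOST"}))
    for name, overrides in servers.items()
}
```

Both servers pick up the same `DB_HOST` from the deployment, while `app-2` can still override a single value for debugging.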
12. 12#
Challenges of the release cycle
• Limited resources and lead time for procuring and
provisioning equipment
• Maintaining multiple environments from development
through production
• Maintaining consistency for reusability and QA
• Distributed teams and team members
14. 14#
Our development environment
• We keep a number of different deployments
• Each development team has its own mini-environment
• A larger integrated staging environment
• One production environment
• Accounts keep things organized and secure
• We keep separate accounts for staging and production
• One team of sysadmins manages all environments
15. 15#
RightScale release cycle
• One set of scripts and ServerTemplates is used everywhere
• Gate accounts for security, development vs. production, etc.
• Less test variance between production and staging
• The only difference is the size of the environment
• Easy to bring up a development environment on demand using deployments and macros
• Up and running on demand in less than an hour
• Cloud servers are billed by the hour, so temporary environments are cheap to run
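Because servers are billed by the hour, the main risk with on-demand environments is forgetting to tear them down. This sketch uses a Python context manager so teardown runs even if the test run fails. `FakeCloud` and its `launch`/`terminate` methods are hypothetical stand-ins for whatever cloud or RightScale API a real macro would call.

```python
import itertools
from contextlib import contextmanager
from typing import Iterator


class FakeCloud:
    """Hypothetical stand-in for a cloud API; tracks running servers in memory."""

    def __init__(self) -> None:
        self.running: dict[str, str] = {}
        self._ids = itertools.count(1)

    def launch(self, template: str) -> str:
        server_id = f"i-{next(self._ids):04d}"
        self.running[server_id] = template
        return server_id

    def terminate(self, server_id: str) -> None:
        self.running.pop(server_id, None)


@contextmanager
def temporary_environment(cloud: FakeCloud, templates: list[str]) -> Iterator[list[str]]:
    """Launch one server per template and always terminate them on exit, even after errors."""
    launched: list[str] = []
    try:
        for template in templates:
            launched.append(cloud.launch(template))
        yield launched
    finally:
        for server_id in launched:
            cloud.terminate(server_id)


cloud = FakeCloud()
with temporary_environment(cloud, ["Load balancer", "App server", "MySQL"]) as servers:
    running_during_test = len(cloud.running)  # run the quick test here
```

After the `with` block exits, nothing is left running, so the environment costs only the hours it was actually used.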
16. 16#
Best practices: Release cycle
• Don’t be afraid to run many environments
• Dynamically clone, launch and teardown environments for quick tests
• Configure a fixed set of environments for development, integration, and staging
• Use different accounts to segregate users and configurations
• Sysadmins are expensive. Cloud servers are cheap.
• Reuse ServerTemplates to keep environments consistent
• Make use of versioning and freeze software repositories
• Share or publish them through the MultiCloud Marketplace
• Create all-in-one ServerTemplates from the same RightScripts and recipes
• Avoid upgrading existing servers; fail forward instead
• Keep old servers running so you can roll back or do a post-mortem later
• For databases: launch additional slaves, freeze replication at the upgrade point, and take snapshots!
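Failing forward means launching new servers with the new code and moving traffic to them, instead of upgrading servers in place. Here is a minimal sketch, assuming a load balancer whose pool can be swapped in one step; `LoadBalancer` and `Release` are hypothetical names. The old servers are never modified, so rolling back just points traffic at them again.

```python
from dataclasses import dataclass, field


@dataclass
class LoadBalancer:
    """Hypothetical load balancer pool: the servers currently receiving traffic."""
    active: list[str] = field(default_factory=list)


@dataclass
class Release:
    lb: LoadBalancer
    previous: list[str] = field(default_factory=list)

    def fail_forward(self, new_servers: list[str]) -> None:
        """Send traffic to freshly launched servers; the old ones keep running untouched."""
        self.previous = list(self.lb.active)
        self.lb.active = list(new_servers)

    def rollback(self) -> None:
        """Send traffic back to the old servers; keep the new ones for a post-mortem."""
        if not self.previous:
            raise RuntimeError("no previous servers to roll back to")
        self.lb.active, self.previous = self.previous, self.lb.active


lb = LoadBalancer(active=["app-v1-a", "app-v1-b"])
release = Release(lb)
release.fail_forward(["app-v2-a", "app-v2-b"])
```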
17. 17#
Release night steps
[Diagram: front ends, main app servers, and databases (DB master, DB slave, main app DB slave)]
1) Servers with current code
2) Servers with new code
3) Add second slave
4) Cut access to site
5) Stop all access to databases
6) Stop replication
7) Take snapshot at cutoff
8) Update schema
9) Reconnect all servers
10) Open access to site
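Release night is an ordered runbook, and one safe way to automate it is to stop at the first failed step so an operator can decide whether to continue or roll back. The sketch encodes the ten release-night steps as (name, action) pairs; the actions are placeholders, and `run_release` is a hypothetical helper, not a RightScale macro.

```python
from typing import Callable, Optional

Step = tuple[str, Callable[[], None]]

RELEASE_STEPS = [
    "Servers with current code",
    "Servers with new code",
    "Add second slave",
    "Cut access to site",
    "Stop all access to databases",
    "Stop replication",
    "Take snapshot at cutoff",
    "Update schema",
    "Reconnect all servers",
    "Open access to site",
]


def run_release(steps: list[Step]) -> tuple[list[str], Optional[str]]:
    """Run steps in order; stop at the first failure and report what completed."""
    completed: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None


# Placeholder actions that only record the step name.
log: list[str] = []
steps = [(name, lambda name=name: log.append(name)) for name in RELEASE_STEPS]
done, failure = run_release(steps)
```

If, say, the schema update fails, the run stops before servers are reconnected or the site is reopened, leaving the frozen slave and its snapshot available for recovery.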
19. 19#
Monitoring and alerts: Diagnose & optimize
• Off-the-shelf monitoring
• OS: CPU, Disk, Memory, Network, Processes, System
• App: Apache, IIS, MySQL, Nginx, SQL Server
• Plus many more collectd plug-ins!
• Custom monitoring
• Cluster monitoring
• Alerts & escalations
20. 20#
Monitoring, alerts & escalations
• We monitor as much relevant data as possible and display it in insightful ways to quickly detect patterns and abnormalities
• We proactively eliminate the conditions that raise critical alerts
• No-broken-windows policy: no critical alert can remain unresolved.
[Graphs: API network activity; dashboard network activity]
22. 22#
Off-the-shelf: MySQL reads graphs
• Read-random-next represents a table scan
• Read-next represents an index scan
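The two graphs correspond to MySQL's cumulative status counters `Handler_read_rnd_next` (rows read sequentially, typical of table scans) and `Handler_read_next` (rows read in key order, typical of index scans), both reported by `SHOW GLOBAL STATUS`. Because the counters only grow, a graph plots their rate between two samples. A minimal sketch; the sample values are made up for illustration.

```python
COUNTERS = ("Handler_read_rnd_next", "Handler_read_next")


def read_rates(before: dict[str, int], after: dict[str, int], seconds: float) -> dict[str, float]:
    """Per-second rate of each counter between two SHOW GLOBAL STATUS samples.

    Assumes the server did not restart between samples (counters would reset).
    """
    return {name: (after[name] - before[name]) / seconds for name in COUNTERS}


# Made-up samples taken 60 seconds apart.
before = {"Handler_read_rnd_next": 1_000, "Handler_read_next": 5_000}
after = {"Handler_read_rnd_next": 61_000, "Handler_read_next": 8_000}
rates = read_rates(before, after, 60)
table_scan_share = rates["Handler_read_rnd_next"] / sum(rates.values())
```

A high read-random-next rate relative to read-next, as in these samples, suggests queries are scanning tables rather than using indexes.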
23. 23#
Custom: Whatever you want with collectd
• Any statistic you can think of can easily be added as a monitor.
• All of these are graph-able and alert-able in our dashboard!
• Many can be written in less than an hour.
• As easy as printing a line of formatted numbers every few seconds
• support.rightscale.com has extensive collectd documentation
• How we do it:
• We use Ruby to write our custom monitors
• Cassandra: jcollectd with JMX to pull out monitoring data from JavaBeans
• Passenger: Ruby script that parses data from the Passenger command-line interface
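A custom collectd monitor can be a script run by collectd's exec plugin that prints one `PUTVAL` line per value. RightScale writes these in Ruby; the sketch below is an equivalent in Python. The plugin name `passenger`, the `queue_length` metric, and `read_queue_length` are hypothetical placeholders; `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` are environment variables the exec plugin provides.

```python
import os
import sys
import time

# collectd's exec plugin sets these variables for every script it launches.
HOSTNAME = os.environ.get("COLLECTD_HOSTNAME", "localhost")
INTERVAL = float(os.environ.get("COLLECTD_INTERVAL", "20"))


def read_queue_length() -> int:
    """Hypothetical metric source, e.g. parsed from a command-line tool's output."""
    return 0


def putval(plugin: str, type_instance: str, value: float) -> str:
    """Format one gauge value in the exec plugin's plain-text PUTVAL protocol."""
    return f'PUTVAL "{HOSTNAME}/{plugin}/gauge-{type_instance}" interval={INTERVAL:g} N:{value}'


def run_forever() -> None:
    while True:
        print(putval("passenger", "queue_length", read_queue_length()))
        sys.stdout.flush()  # stdout is block-buffered when piped, so flush each line
        time.sleep(INTERVAL)


# Loop only when started by collectd, so the module can be imported safely.
if __name__ == "__main__" and "COLLECTD_INTERVAL" in os.environ:
    run_forever()
```

The script would be registered with an `Exec` option inside collectd's `<Plugin exec>` block, which runs it as a non-root user.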
25. 25#
Cluster: Monitor hundreds of servers
• We leverage a monitoring data warehouse to develop heat maps & stacked graphs
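Heat maps and stacked graphs are two aggregations of the same per-server time series. The sketch assumes the warehouse can return a list of samples per server (for example, CPU percentage at each time step); the function names, sample data, and color thresholds are hypothetical.

```python
def stacked_totals(series: dict[str, list[float]]) -> list[float]:
    """Sum every server's value at each time step (the top edge of a stacked graph)."""
    # zip stops at the shortest series, so all series should cover the same time steps.
    return [sum(step) for step in zip(*series.values())]


def heat_map(
    series: dict[str, list[float]],
    thresholds: tuple[float, ...] = (25.0, 50.0, 75.0),
) -> dict[str, list[int]]:
    """Map each value to a color level: the number of thresholds it meets or exceeds."""
    return {
        server: [sum(value >= t for t in thresholds) for value in values]
        for server, values in series.items()
    }


# Made-up CPU utilization (%) for two servers over three time steps.
cpu = {"app-1": [10.0, 40.0, 90.0], "app-2": [30.0, 60.0, 80.0]}
totals = stacked_totals(cpu)
levels = heat_map(cpu)
```

With hundreds of servers, each row of `levels` becomes one row of colored cells, so a hot server or a hot time window stands out at a glance.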
26. 26#
Automated actions using alerts from monitors
• Create an alert for any monitor, even your custom ones
• RightScale example: Cassandra pending reads signal overload
• Break alerts into critical and warning
• Critical: Wake me up! Page me!
• Warning: Send email to team.
• Trigger many actions: email, run script, scale, relaunch, reboot,…
• Customize to your monitor, situation, and IT processes
• RightScale example: Run a RightScript if swap is too high
• Integrate with 3rd party services like PagerDuty
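An alert can be modeled as a threshold on a monitor, a duration, and a severity that selects the escalation. A minimal sketch under those assumptions; `AlertSpec`, the thresholds, and the action strings are hypothetical, and a real escalation would page, email, or run a script rather than return text.

```python
from dataclasses import dataclass
from typing import Optional

# Default escalation per severity, following the critical vs. warning split.
ACTIONS = {"critical": "page on-call", "warning": "email team"}


@dataclass(frozen=True)
class AlertSpec:
    metric: str
    threshold: float
    duration: int                 # consecutive samples that must exceed the threshold
    severity: str                 # "critical" or "warning"
    action: Optional[str] = None  # overrides the severity's default escalation


def evaluate(spec: AlertSpec, samples: list[float]) -> Optional[str]:
    """Return the escalation action if the last `duration` samples all exceed the threshold."""
    if spec.duration < 1:
        raise ValueError("duration must be at least 1 sample")
    recent = samples[-spec.duration:]
    if len(recent) == spec.duration and all(v > spec.threshold for v in recent):
        return spec.action or ACTIONS[spec.severity]
    return None


# Hypothetical thresholds for the two examples on the slide.
pending_reads = AlertSpec("cassandra pending reads", threshold=100, duration=3, severity="warning")
swap_used = AlertSpec("swap used (%)", threshold=50, duration=2, severity="critical",
                      action="run RightScript to relieve swap")
```

Requiring several consecutive breaches keeps a single noisy sample from waking anyone up.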
27. 27#
Best practices: Monitoring and alerts
• Monitor your critical processes off-the-shelf
• Set monitors with scripts on your ServerTemplates
• Use mon_process (e.g. Ruby)
• Customize to your application needs
• Use collectd plug-ins or easily build your own
• The monitor is graphed in the RightScale dashboard
• Plan out your critical alerts
• Set your response plan: warnings vs. critical
29. 29#
How to think about server failure in the cloud
• Design for failure
• Make sure your application remains healthy after the failure of a node
• Don’t use sticky sessions
• Distribute your application services
• Debug ServerTemplates and not servers
• Use alerts to reboot and/or relaunch
• Auto-scale app server arrays
• Use dynamic DNS and static IPs for load balancers
• Your app servers and databases will always know where to look
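One way to frame auto-scaling an app server array is as a vote. Each server's alerts vote to grow or shrink the array, and the array acts only when enough servers agree. The sketch assumes a 51% decision threshold and min/max bounds on array size. These parameters and the function names are illustrative, not RightScale's exact implementation.

```python
def scaling_decision(votes: dict[str, str], threshold: float = 0.51) -> str:
    """Return "grow", "shrink", or "hold" from each server's vote."""
    if not votes:
        return "hold"
    # With a threshold above 0.5, grow and shrink can never both pass.
    for decision in ("grow", "shrink"):
        share = sum(v == decision for v in votes.values()) / len(votes)
        if share >= threshold:
            return decision
    return "hold"


def next_size(current: int, decision: str, min_count: int, max_count: int, step: int = 1) -> int:
    """Apply the decision while keeping the array within its configured bounds."""
    delta = {"grow": step, "shrink": -step}.get(decision, 0)
    return max(min_count, min(max_count, current + delta))


votes = {"app-1": "grow", "app-2": "grow", "app-3": "grow", "app-4": "none"}
decision = scaling_decision(votes)
```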
30. 30#
Deep dive on database failure
• Use database backups for rollbacks or disaster scenarios
• Restore from backups in event of complete system failure
• One-click with fully automated RightScale Database Managers
• Use database redundancy for high availability (e.g. master/slave)
• Promote slave if master fails
• Possible to prime your slave database to make failover more seamless
• After promotion is complete, quick to launch a new slave
• Worry about troubleshooting when you have time
• One-click with fully automated RightScale Database Managers
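When promoting a slave, a common choice is the slave that has applied the most of the master's binary log, so the least data is lost. This sketch assumes you have collected `Relay_Master_Log_File` and `Exec_Master_Log_Pos` from `SHOW SLAVE STATUS` on each slave. The host names are hypothetical, and the automated RightScale Database Managers may choose differently.

```python
def pick_promotion_candidate(slaves: dict[str, tuple[str, int]]) -> str:
    """Pick the slave that has executed the furthest point in the master's binary log.

    Each value is (Relay_Master_Log_File, Exec_Master_Log_Pos) from SHOW SLAVE STATUS.
    Comparing file names as strings works because MySQL zero-pads the numeric suffix
    (e.g. mysql-bin.000042), assuming all slaves replicate from the same master.
    """
    if not slaves:
        raise ValueError("no slaves available to promote")
    return max(slaves, key=lambda host: slaves[host])


# Hypothetical hosts and replication positions.
slaves = {
    "db-slave-1": ("mysql-bin.000041", 98_000),
    "db-slave-2": ("mysql-bin.000042", 120),
    "db-slave-3": ("mysql-bin.000042", 4_500),
}
candidate = pick_promotion_candidate(slaves)
```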
31. 31#
Backups to block volumes and object stores
Block volumes (EBS snapshots):
• + Easy to snapshot
• + Easy to rotate
• + Easy consistency
• + Instant restore (mount)
• - Difficult to move between clouds/regions
• - Must back up the entire volume
• What we do: EBS for databases
Object stores (S3/Cloud Files):
• + Back up into other clouds
• + Back up individual folders or files
• + Incremental backups (e.g. as files/data are flushed)
• - More coding, customization
• - Custom rotation strategy
• - Download time
• What we do: S3 for the monitoring system (Cassandra in the future)
32. 32#
Best practices: Planning for failure
• No excuse for not backing up your servers
• RightScale Database Manager + EBS tools make it easy to take backups
• Plan your rotation policy
• Database Manager helps you tailor daily, weekly, and monthly backups
• Back up across clouds and regions
• The Database Managers for MySQL and SQL Server make it easy to back up to S3 or Cloud Files from AWS, CloudStack, Eucalyptus, and Rackspace
• Organize your backups
• Keep track with lineages and timelines using the Database Managers
• Test your backups!
• It is easy and cheap on the cloud
• A crisis is the worst time to find out your backups are corrupted
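A daily/weekly/monthly rotation policy is a grandfather-father-son scheme: keep the newest backup from each of the last N days, weeks, and months, and prune the rest. A minimal sketch; the retention counts are example values, not Database Manager defaults.

```python
from collections.abc import Callable, Hashable
from datetime import date, timedelta


def backups_to_keep(backups: list[date], daily: int = 7, weekly: int = 4, monthly: int = 12) -> set[date]:
    """Grandfather-father-son rotation over backup dates.

    Keeps the newest backup from each of the most recent `daily` days,
    `weekly` ISO weeks, and `monthly` calendar months; everything else can be pruned.
    """
    newest_first = sorted(set(backups), reverse=True)
    keep: set[date] = set()

    def keep_newest_per_period(period: Callable[[date], Hashable], limit: int) -> None:
        seen: set = set()
        for d in newest_first:
            key = period(d)
            if key in seen:
                continue
            if len(seen) == limit:
                break
            seen.add(key)
            keep.add(d)  # first date seen for a period is its newest backup

    keep_newest_per_period(lambda d: d, daily)
    keep_newest_per_period(lambda d: tuple(d.isocalendar())[:2], weekly)  # (ISO year, week)
    keep_newest_per_period(lambda d: (d.year, d.month), monthly)
    return keep


# Example: one backup per day from 2024-01-01 to 2024-03-30.
history = [date(2024, 1, 1) + timedelta(days=i) for i in range(90)]
kept = backups_to_keep(history, daily=7, weekly=4, monthly=3)
```

Restoring a few of the kept backups on a scratch server is a cheap way to test them before a crisis does.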
34. 34#
Best practices for operating in the cloud
• Keep your environment organized and consistent
• Accounts, deployments, ServerTemplates, and macros
• Change and debug configurations not servers
• ServerTemplates, MultiCloudImages, fail-forward
• Monitor your servers efficiently
• Off-the-shelf and custom monitoring and alerts
• Automate, automate and also automate
• Server arrays, macros/API for more complex flows, alert actions …
• Backup your databases (organize, multi-cloud, rotate, test)
• Database Manager ServerTemplates
35. 35#
Getting Started and Q&A
Contact RightScale
(866) 720-0208
sales@rightscale.com
www.rightscale.com
RightScale Conference
Nov 9 in Santa Clara, CA
www.RightScale.com/Conference
• Attend technical breakout sessions
• Talk with RightScale customers
• Ask questions at the Expert Bar
• Training on 11/8 and 11/10
More Info
Webinar archive: RightScale.com/webinars
White Papers: RightScale.com/whitepapers
Free Edition: RightScale.com/Free
Editor's Notes
RightScale's ServerTemplates allow you to capture best practices for provisioning and automating cloud infrastructure. In this breakout session, we will explore how you can leverage the RightScale platform to share ServerTemplates with others. Specifically, we'll walk through the steps to share and update ServerTemplates across your organization. We'll also show you how to publish ServerTemplates publicly for the whole world to use. This topic is best for: IT members who are responsible for maintaining server configurations within the organization, developers who would like to share work product within their group, or ISVs wishing to reach cloud users by publishing through RightScale.
Cluster monitoring is powerful because it provides several different views into the operation of large clusters of servers.
More specifically, we hear the following challenges. (Again, use this to unearth where they are having challenges.)
Limited resources: In almost every phase, limited hardware poses problems. In architecting new systems there are rarely enough resources to experiment with alternative architectures or new technologies. For developers, limited resources usually mean sharing hardware for testing. Testers rarely have enough hardware or time to do all the testing they would like to do: full performance and load testing, testing on complete production architectures, or testing disaster recovery scenarios. And delays in development often put pressure on testers to work faster to still meet the same deadline. The inability to spin up additional testing resources at these times causes quality to suffer. As a result, errors are found later in the cycle, where they are more expensive to fix. Limited equipment also means staff are constantly provisioning, tearing down, and re-provisioning the same equipment. This takes time, and if environments are not completely wiped clean, additional errors can be introduced.
Time to procure and provision equipment: As the load on IT departments increases and release cycles shorten, waiting for equipment to be procured and provisioned takes time away from valuable work. One customer stated it took 3-5 weeks to procure and provision new hardware.
Maintaining consistent environments: As code moves through development, test, staging, and production, configuration changes made in one stage rarely make it back into earlier stages. When new code is deployed from environments that haven’t been updated, the same errors are reintroduced.
Maintaining multiple environments: As if maintaining one consistent environment across many servers isn’t hard enough, most software requires testing on several different configurations (different stack versions, different end-user environments), one for each possible production scenario. For example, a software company may need to test its software on different operating systems or alongside various software packages. Most companies need to clone production environments to debug problems without impacting current users. Whether it happens in development or QA, maintaining and reproducing environments is time-consuming. If the task is distributed across multiple administrators, coordinating their changes becomes challenging. If it is consolidated under one administrator, there is a limit to the number of different environments that person can reliably maintain.
Distributed teams or team members: These add collaboration requirements and exacerbate all of the issues mentioned.
With RightScale it’s easy to create consistent, reproducible configurations in each stage. In a typical development lifecycle, the systems architect creates a reference architecture that serves as a model for production, and then that architecture specifies what components are needed in each configuration.