Why VM Replication Is Your Lifeline When Disaster Strikes
1. VM Replication Is Your Lifeline When Disaster Strikes
Nathan Schmidt
Services Account Manager, Strategic Solutions
February 12, 2013
2. Agenda
• Our Top 5 Myths about Disaster Recovery (DR)
• Disaster Recovery Overview
• Most Common Problems with Replication
• How VM Replication Solves Those Problems
• Q&A Session
RACKSPACE® HOSTING | WWW.RACKSPACE.COM
7. Top 5 Myths About DR
I back up my data; that should suffice.

Example Schedule for Data Backups
Sun.: Full Backup
Mon.–Fri.: Incremental
(A differential backup would instead copy all changes since the last full backup.)
8. Top 5 Myths About DR
I back up my data; that should suffice.

Example Schedule for Data Backups
Sun.: Full Backup
Mon.–Fri.: Incremental
Wed.: DC Outage
9. Top 5 Myths About DR
I back up my data; that should suffice.

“Less than half of SMBs back up their data weekly or more frequently, and only 23% back up daily. 41% of the SMBs surveyed said that putting together a Disaster Recovery plan never occurred to them.”
Source: Symantec 2011 SMB Disaster Preparedness Survey
11. Top 5 Myths About DR
I don’t need a DR plan. My company can weather the storm or survive whatever disruption may lie ahead.
13. What We’ve Heard from Our Customers
“I know I need DR, but I’m not really sure what that should be.”
• Many firms think that a backup of their data is enough
• The goal of a backup is to enable data restoration
• A DR plan helps quickly restore operations
15. Building the DR Framework

Basic Concept    Concern
DC Level         Primary DC Outage
App Level        App Configuration
Info Level       Critical Data
16. How to Make DR a Goal for 2013
“I’ve tried to socialize the need for a DR plan, but it isn’t considered a priority by senior leadership.”
“How do I convince my boss that the additional cost of resiliency tools is justified?”
22. Resiliency Solutions Across the Portfolio
[Diagram: resiliency tools spanning two Rackspace data centers, the customer’s data center, and the Rackspace Cloud (legend: RAX solutions vs. partner solutions) – DNS Failover (Neustar) and Global Load Balancer at the network layer; VM Replication between virtual servers; Database Replication between managed DB servers; Host-Based Replication via the MBU (Managed Backup) infrastructure; Array-Based Storage Replication between SANs; Remote Data Replication; and Cloud Backup to Cloud Files.]
23. VM Replication – Quick Overview
• Helps protect and recover business-critical VMs when disaster strikes
• Offers geographic redundancy by replicating VMs between Rackspace DCs
24. Solving Real-World Problems
The VM Replication solution was designed by working closely with our beta customer, Virtual, Inc.

“We had the opportunity to provide feedback on the early-stage replication product and talk through the options that best met our customer’s needs. I felt that Rackspace listened carefully to our feedback and even anticipated how we intended to use and implement the solution.”
– Russell Kuhl, V.P. of Technology at Virtual
25. Common Hurdles with Replication Tools
• Cost
• Complexity
• Failover Testing
26. Host-Based vs. Guest-Based Replication

Host-Based Replication:
• Occurs at the hypervisor layer
• Replication process controlled by the virtual appliance (VA)
• Replicated VMs are inactive

Guest-Based Replication:
• Occurs at the VM layer
• Replication process controlled by the VMs
• Replicated VMs in the target DC are active
28. VM Replication Is Host-Based
It’s cost-effective.
• Replicated VMs in the target site remain off
• Only pay for the replicated VMs when they are powered on after failover
• Replicated VMs are powered down once you fail back to the source site
29. Infrastructure Costs
Redundant infrastructure represents a significant cost.
• Repurpose the redundant hypervisor in the secondary DC (e.g., as a test/dev environment)
• Downsize the server footprint in the target site to accommodate just the critical VMs
• Consider using less powerful servers, if degraded performance is acceptable
31. VM Replication Allows Different Hardware
It reduces the hardware costs related to redundancy.
• No need to replicate the entire source environment – simply select specific VMs
• Heterogeneous storage options are available for the target site (e.g., dedicated SAN to local storage)
• While replicated VMs are powered down, repurpose the redundant server
32. Lack of Expertise & Assistance
Companies may not have the expertise, the manpower, or both.
• What do I need?
• Who’s going to design it?
• Who’s going to manage and monitor it?
• Who’s going to assist with the failover?
• Who’s going to “push the button” for failover?
• Who owns the overall DR strategy?
33. Levels of Responsibility

Customer:
• DR Plan
• Failover Runbook
• “Pushing the Failover Button”

Rackspace:
• Failover Process
• Replication App. (VA)
• Virtual Machine Layer
• Guest OS Layer
• Hypervisor Layer
• Server Hardware
• DC & Network
34. Rackspace Fanatical Support Is the Key
• Replication team is available to design, monitor & manage
• Virtualization team has VCP-certified architects available

[Diagram: Fanatical Support teams – Technical Support, Account Management, Architecture Support, Business Development; Professional Services, Network Security, Backup, Storage, Virtualization, Database Administrators, Corporate Security; Data Center Operations]
35. Failover Testing
Companies don’t test their failover plan enough.
• Some replication services charge per test – expensive
• The failover/failback process can be risky in production
• The risk requires extensive planning around every test
36. VM Replication – Testing via Snapshot
No disruption to replication during the test.

[Diagram: VM 1 and VM 2 run on Hypervisor 1 / Host 1 in Rackspace DC 1. Replication of VM 2 continues to Hypervisor 2 / Host 2 in Rackspace DC 2, where the replicated VM 2 remains powered off; a snapshot of the replicated VM 2 is used for testing.]
37. VM Replication Simplifies Failover Testing
Failover testing that’s fast, free, and frequent.
• No charge for failover testing
• It’s quick to set up and doesn’t require planning
• Testing is done in a sandbox environment
38. Rackspace VM Replication
For more information on VM Replication,
please call us at 1-877-934-0409.
Nathan Schmidt, Services Account Manager, Strategic Solutions. Certified in Business Continuity from DRI International.
Here’s what we’ll cover in today’s webinar. Through talking with our customers, we have compiled a list of our top 5 myths related to DR. We’ll review what DR really means, and a good way to approach the subject. Also, we’ll identify the most common challenges that are related to replication and how our new product helps address those problems. Lastly, we’ll open the floor to questions and discussion around resiliency.
Here’s what made it on our list of Top 5 Myths About DR. We start with number 1. Just because your DC isn’t located in a disaster-prone area doesn’t mean that Mother Nature can’t surprise you with a perfect storm. Recent memory reminds us that Florida isn’t the only state that could get hit by a hurricane.
Although natural disasters do happen, they’re not the leading cause of data center outages. Statistically speaking, most self-managed DC outages are attributed to human error. The ’80s band The Human League said it best in their song titled “Human”: we’re “born to make mistakes.”
There will always be inherent risk. You can’t escape it. It’s kind of like juggling a chainsaw... after a while, things are going to get messy. So it’s not a matter of if disaster strikes, but when disaster strikes.
Let’s consider this sentence: if I back up my data, that should be enough. In terms of DR, would you consider this statement true or false?
…It’s a resounding FALSE. Even if you performed full backups weekly and incremental backups daily, this isn’t enough. Let’s take a look at the example schedule. You perform a full backup of your data every Sunday. Do you know the difference between differential and incremental? A differential backup copies the changes made since the last full backup. An incremental backup only copies the changes since the last incremental.
So what if there’s a data center outage on Wednesday? First things first, you need access to the affected servers in order to reconfigure the machines and bring them online so that you can start restoring your data. Let me stop here. If the entire DC is down, for whatever reason, then you don’t currently have access to the onsite servers. For the sake of continuing this example, let’s assume that you have invested in geographic redundancy, have your data stored offsite, and also have additional server capacity in a different DC. The next step would be to restore the most recent full backup from Sunday to the offsite servers. Then you would restore all of the subsequent incremental backups. The restoration process could potentially take several hours or even days depending on the amount of data. While your data may be protected, the restoration process is slow and doesn’t address downtime well.
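The restore-chain logic just described can be sketched in a few lines. This is a hypothetical illustration only (the function and schedule names are ours, not any vendor’s tooling): after an outage, you restore the most recent full backup plus every incremental taken after it.

```python
def restore_chain(backups, outage_day):
    """backups: list of (day, kind) tuples in chronological order,
    where kind is "full" or "incremental". Returns the backups that
    must be restored after the outage, oldest first."""
    before = [b for b in backups if b[0] < outage_day]
    # Find the most recent full backup taken before the outage...
    last_full = max(i for i, (_, kind) in enumerate(before) if kind == "full")
    # ...and restore it plus every incremental after it.
    return before[last_full:]

# Sunday = day 0 (full backup), incrementals Monday (1) through Friday (5).
schedule = [(0, "full"), (1, "incremental"), (2, "incremental"),
            (3, "incremental"), (4, "incremental"), (5, "incremental")]

# Outage on Wednesday (day 3): restore Sunday's full, then Monday's
# and Tuesday's incrementals.
print(restore_chain(schedule, 3))
# → [(0, 'full'), (1, 'incremental'), (2, 'incremental')]
```

Note how the chain grows through the week: the later the outage, the more incrementals must be replayed, which is exactly why restore times can stretch to hours or days.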
Here’s a surprising tidbit. According to the Symantec 2011 SMB Disaster Preparedness Survey, “less than half of SMBs back up their data weekly or more frequently, and only 23% back up daily. 41% of the SMBs surveyed said that putting together a DR plan never occurred to them.”
Not testing the failover enough reduces your chances of a successful failover when you really need it. If you don’t practice your failover plan and test it, then the results may leave you with additional downtime and a bewildered look on your face when the process doesn’t work like it should.
Companies may not know they need a formal DR strategy, or may not fully realize how downtime impacts their business.
I think we can all agree that few things are as scary as a zombie apocalypse. However, this frightful fact may keep you up at night. “93% of companies that lost their data for 10 days or more filed for bankruptcy within one year of the disaster, and 50% filed for bankruptcy immediately.” Scary stuff.
A common theme among Rackspace customers is that they know they need a disaster recovery plan, it’s just that they don’t know what their DR might look like. Many of the customers think that backing up their data would suffice. As we saw in Myth number 3, it’s important to protect your data, but backups do not help you quickly recover your business and keep it running after a major disruption. The goal of a backup is to enable data restoration. It doesn’t cover the operational requirements of a business. You would need a more complex disaster recovery strategy in place with the right mix of resiliency tools. A disaster recovery plan helps you restore operations quickly.
Disaster recovery is a holistic strategy that includes process, policies, people and technology. It focuses on restoring the IT systems that are critical to supporting business functions. In other words, it helps keep the business running after a major disruption occurs. A disruption could be Mother Nature’s wrath, or a guy named Bob, who installed a patch that immediately broke a critical application. If you have a DR plan in place, it’ll help you “keep the lights on,” so to speak, and the company open for business.
There are three basic concepts to consider when thinking about a DR plan. The first one is DR at the data center level. The IT manager is concerned with unplanned downtime or an outage in his primary DC. Second, is the application level. A ton of time is spent getting an app to run just right. If a disruption occurs, then the IT Manager needs to be able to simply restart the app and it should run as if nothing happened. The final concept is DR at the information level. The IT Manager is focused on having all of the critical data that he needs in order to rebuild the business quickly.
The most common challenge that we’ve heard from our customers is that DR isn’t a high enough priority. The best way to convince your boss that DR is a necessity is by showing him the impact of downtime on the business.
The impact of downtime can manifest itself in many forms, such as loss of employee productivity, new sales, existing customers, brand reputation, etc. Cuba Gooding Jr. said it in the movie Jerry Maguire: “Show me the money.” C-level execs need to understand the potential loss in terms of a monetary amount: if our company goes down, we stand to lose “x” number of dollars per hour. Gartner estimates that the average cost of downtime for a small to mid-sized company is $42,000 per hour. For larger companies or companies with an ecommerce business model, that number could easily go north of six figures per hour. Remember that factoid from myth number 5? “93% of companies that lost their data for 10 days or more filed for bankruptcy within one year of the disaster, and 50% filed for bankruptcy immediately.” Your company’s survival depends on quantifying the impact of downtime.
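Putting that “show me the money” argument into numbers is straightforward. A minimal sketch using the Gartner figure cited above ($42,000/hour for a small to mid-sized company); the outage lengths are made-up inputs for illustration:

```python
def downtime_cost(hours_down, cost_per_hour=42_000):
    """Estimated loss for an outage, using the Gartner average
    for a small to mid-sized company as the default rate."""
    return hours_down * cost_per_hour

print(downtime_cost(8))   # an 8-hour outage → 336000
print(downtime_cost(24))  # a full day down → 1008000
```

Even a single working day of downtime at the average rate lands over $300,000, which is usually enough to get a C-level exec’s attention.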
Two important metrics to include in the DR conversation are RPO and RTO. RPO stands for Recovery Point Objective. This represents how old your data is once it’s been restored. I think it’s easier to explain the metrics with a story, so I’ll spin you a yarn about a company called Necktorious, Incorporated. Let’s say Necktorious is a retail store that sells rakish scarves year-round, even during the summer. In this example that I made up, the shop uses a managed hosting provider to fully manage the servers, storage, network, and operating system for their ecomm site. That way they don’t have to worry about the infrastructure and can focus on designing next summer’s nautical-themed scarf. The owner, Fred Jones, has a very low RPO of 15 minutes for his data. Therefore, he utilizes a database replication service to copy his online store’s customer transaction data between two data centers. This resiliency tool provides about a 5-minute RPO, which is within Fred’s data loss tolerance of 15 minutes. Ideally Fred would like a zero RPO, but the DB replication fits within his budget and minimizes data loss after a disaster.
Now let’s examine RTO, or Recovery Time Objective. RTO represents how long it takes until your users are able to continue normal operations. Fred has calculated that if his ecomm site becomes unavailable, he loses $10,000 per hour in sales, or much more if People magazine recently published a photo of a B-list celeb spotted in Venice Beach wearing one of his cashmere scarves. Fred has his ecomm platform running on virtual machines that are replicated between data centers using a replication solution. The business-critical VMs are replicated every 4 hours. Since his shopping cart app and product catalog don’t change often, this interval works well from a data restoration perspective. Replicating the critical ecomm apps helps minimize downtime because there’s no data restoration involved.
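The RPO reasoning in Fred’s story can be reduced to one comparison. With interval-based replication, the worst-case data loss equals the replication interval (data written just after a cycle completes is lost). A sketch, with a function name of our own and the numbers taken from the example:

```python
def meets_rpo(replication_interval_min, rpo_objective_min):
    """Worst case, everything written since the last replication cycle
    is lost, so the achievable RPO equals the replication interval."""
    return replication_interval_min <= rpo_objective_min

# 5-minute database replication vs. Fred's 15-minute RPO: met.
print(meets_rpo(5, 15))    # → True
# 4-hour VM replication alone would not meet a 15-minute RPO,
# which is why Fred pairs it with database replication.
print(meets_rpo(240, 15))  # → False
```

This is also why Fred mixes tools: frequent database replication for the fast-changing transaction data, and less frequent VM replication for the slow-changing app servers.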
Here is a graphical representation of what our ecomm story might look like. You can see the VMware VMs being replicated to a second DC every four hours. The MySQL database is replicated more often – every 5 minutes – because of its higher data change rate. The database stores all of the critical information from online transactions.
Here’s an example of an ecomm reference architecture that incorporates VM Replication to protect two critical VMs.
Rackspace has several resiliency tools to choose from depending on your needs: DNS Failover, Database Replication, Host-Based Replication, Storage-Based Replication, Remote Data Replication, Managed Backup, and Cloud Backup. In this webinar, we’ll focus on achieving geographic redundancy through the VM Replication tool.
VM Replication provides geographic redundancy to help protect business-critical VMs in the event of a data center outage or unplanned downtime. This solution is part of the Rackspace Managed Virtualization portfolio. Managed Virtualization is a dedicated, single-tenant environment that’s based on VMware virtualization technology.
Our VM Replication offering was developed with real customer replication needs in mind. We worked with Virtual, Inc., a leading association management specialist. Virtual was seeking a simple, affordable, host-based replication solution for their client’s fully virtualized, business-critical environment, and decided that VM Replication was a good fit. Virtual engaged with Rackspace during the product development process and also served as the first beta tester. Russell Kuhl, the VP of Technology at Virtual, said, “This beta experience was beneficial for both Rackspace and Virtual. We provided feedback to Rackspace that helped develop a product that solves real-world replication scenarios. In turn, we are very satisfied to have VM Replication as part of our customer’s broader disaster recovery plan, and the peace of mind that we can count on Rackspace to provide the reliable support we’ve come to expect, delivered with a personalized touch.”
By talking with our customers, we uncovered three common hurdles associated with adopting a replication solution: cost, complexity, and making sure that the failover is tested often. I’ll also present how you can overcome each hurdle with VM Replication.
There are two common ways of replicating VMs. #1, replication at the hypervisor layer, also known as host-based replication, and #2, replication at the VM layer, or guest-based replication. Host-based replication is controlled by a virtual appliance that sits on the source and target hypervisors. The VA replicates active VMs on the source site and they’re transferred and stored on the target site in the inactive state. In other words, the replicated VMs on the target site remain powered down, and do not become active unless a failover is initiated. Guest-based replication utilizes the VMs to control the replication process. A VM on the source site would replicate the data changes to an active counterpart on the target site. The replicated VMs on the target site must be powered on because they receive the data and ensure data consistency. The replication method is an important cost consideration because with a host-based solution, you do not have to pay for the replicated VMs on the target site. Those VMs are powered down so you would only pay for them after a failover has occurred. Conversely, guest-based replication has active VMs on the target site. Even though you don’t use them for production, you still have to pay for them.
With guest-based replication, it would be like owning two homes: a main house that you live in 50 weeks out of the year, and a vacation home that you visit for two weeks in the summer. In the guest-based scenario, the lights in your summer home would be turned on all the time. Whether you’re vacationing there or not, you would have to pay for the electricity cost of the summer home, on top of the electric bill for your main home.
VM Replication is a host-based solution. You do not have to pay for the maintenance and licensing costs related to the replicated VMs in the target data center. Like I mentioned before, the replicated VMs remain powered off until a failover has been initiated. Only after a failover would you start paying for the replicated VMs in the secondary DC. Once the primary data center becomes available, then you would failback to the primary DC and power the target VMs back down.
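The billing difference between the two replication methods can be captured in a rough cost model. This is purely illustrative (the rates, hours, and function name are hypothetical, not Rackspace pricing): host-based target VMs bill only during a failover, while guest-based target VMs bill continuously.

```python
def target_vm_cost(hourly_rate, period_hours, failover_hours, host_based):
    """Cost of one replicated VM at the target site over a period.
    Host-based: the VM is powered off except during failover.
    Guest-based: the VM runs (and bills) for the whole period."""
    billable_hours = failover_hours if host_based else period_hours
    return hourly_rate * billable_hours

# One month (720 h) at a hypothetical $0.50/h per replicated VM,
# with a single 12-hour failover during that month:
print(target_vm_cost(0.50, 720, 12, host_based=True))   # → 6.0
print(target_vm_cost(0.50, 720, 12, host_based=False))  # → 360.0
```

Under these made-up numbers the guest-based target VM costs 60× more over the month, which mirrors the two-homes analogy: you pay the summer home’s electric bill year-round whether you visit or not.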
In addition to the replication method, you should also consider the costs related to the additional infrastructure needed for geographic redundancy. As I mentioned before, with host-based replication, the critical VMs that reside in the target DC are powered off. This is important because it enables you to take advantage of the idle resources in the secondary DC by using it for other purposes, like a test or dev environment. Using the redundant hypervisor for non-production workloads helps justify the additional hardware costs; as part of your failover runbook, you could make sure to shut down all apps on the redundant hypervisor before you power on the replicated VMs and fail over. If you only replicate the most critical VMs and not the entire environment, then you probably don’t need to reproduce the exact hardware setup that you use for production. The infrastructure in the target DC could have a smaller footprint than the source DC – fewer servers, less cost. Another option you could consider is using less powerful servers and storage in the secondary data center. The target infrastructure could be designed to use less expensive equipment, so that the backup VMs would run with degraded performance for a short time until you fail back to the primary DC. Depending on your needs, a temporary performance degradation may be acceptable.
Keeping with the dual-home analogy from before, let’s say that your main house is a 4-bed, 3-bath with a 3-car garage. It’s the perfect size for a family with three kids and a dog. When you and the spouse go on vacation to the summer home, you leave the kiddos with grandma at the main house. Let’s be honest, all parents deserve a kid-free vacation once a year. Now let’s imagine that your summer home is a beach bungalow. It has only one bedroom and a thatched roof. There’s no garage, just an open grassy backyard with a mango tree. It’s a perfect fit for just the two of you – no kids. Heterogeneous server infrastructure is like having a 4-bed house for a primary data center, and a bungalow for the secondary DC. Since it’s only you two vacationing at the summer home, you don’t need the extra space and storage for a full family. It’s the same concept for replicating just a handful of critical VMs to the target site.
VM Replication allows for heterogeneous infrastructure between the source data center and the target DC. VM Replication lets you select only the business-critical VMs that you want replicated. And since there is no need to replicate the entire environment, you can have a smaller server footprint in the target DC. For example, you may have a cluster of three servers in the source data center and a single server in the target DC. Unlike array-based replication, you’re not required to have the same expensive storage systems on both ends. For example, you could have a dedicated SAN storage system in the primary DC and simply use the server’s local storage in the secondary data center.
You know that it’s best practice to have a DR plan, but who’s going to develop it? Even if a plan already exists, who’s going to implement, test, and manage it? These represent just some of the questions that you may have around the complexity of disaster recovery. While Rackspace does not sell an end-to-end DR solution, we do offer a wide range of resiliency tools and the deep knowledge and experience to help you design, architect, and configure a solution that will fit into your DR strategy.
Here are the areas where Rackspace can lend a helping hand, and the areas that the customer must own. The top level is the holistic DR strategy. This is owned by the customer. Remember when we defined disaster recovery? It encompasses more than just the technology: also the policies, people, and process. The customer is responsible for creating the DR plan, training the appropriate employees, creating the failover runbook, testing the failover often, making the “go-time” decision to fail over after a disruption occurs, and then deciding to fail back once the primary DC comes online. Rackspace is responsible for failing over to the target DC once authorization has been given by the customer. Rackspace also monitors the VM Replication virtual appliance and alerts the customer when a replication fails to complete. As part of the Managed Virtualization service, Rackspace also manages the VM, guest OS, and hypervisor layers. In addition to the software layers, the dedicated hardware, network, and DC are also covered. Failover is the customer’s responsibility, but we assist and are on call during the process.
Rackspace Fanatical Support is your answer to a lack of resiliency expertise and an overburdened IT department. Our Virtualization team manages one of the largest VMware deployments in the world. We have architects available who are VMware Certified Professionals. They are here to help you design and deploy the right mix of infrastructure and the virtualization layer. The Replication team focuses solely on our replication tools. They have deep experience with building replication solutions for all types of workloads, use cases, performance requirements, and budgets.
Failover testing can be expensive, risky, and planning-intensive. These are the challenges that prevent companies from testing their failover plan enough. How often should the failover be tested? That probably depends on how often the replicated VMs are updated or changed. Introducing a new patch on the guest OS, or changing an app’s configuration, would warrant a failover test to make sure that everything starts up and runs as expected. Unfortunately, this can be an expensive endeavor. Some services charge a fee for every failover test, which can really add up and eat into the DR budget. Testing the failover and failback process often increases your chances of incurring actual downtime because your production environment is involved. What happens if your primary DC does not fail back properly? In order to mitigate the risk of not being able to recover from a failover test, you would need extensive planning and coordination around every test.
Let’s take a look at how Rackspace VM Replication handles failover testing. The risk of not being able to fail back is removed by conducting the test within the target Rackspace data center. A snapshot is created of the replicated VM 2. The snapshot allows you to test the replicated VM without putting your production environment at risk. When you’re done testing, simply trash the snapshot and continue the replication.
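The snapshot-test workflow just described can be modeled in a few lines. This is a toy illustration of the sequence only – the class and method names are ours, not a real Rackspace or VMware API: replication keeps running while a snapshot copy is exercised in the sandbox and then discarded.

```python
class ReplicatedVM:
    """Toy model of a VM replicated to a target DC."""

    def __init__(self, name):
        self.name = name
        self.replicating = True   # replication runs continuously
        self.snapshots = []       # sandbox copies used for testing

    def test_failover(self):
        snap = f"{self.name}-test-snapshot"
        self.snapshots.append(snap)   # create the sandbox copy
        # ...power on the snapshot, verify apps start, run checks...
        self.snapshots.remove(snap)   # trash the snapshot when done
        return self.replicating       # replication was never paused

vm = ReplicatedVM("VM2")
print(vm.test_failover())  # → True: production replication undisturbed
print(vm.snapshots)        # → []: no leftover test artifacts
```

The point of the model: the test never touches the production VM or interrupts the replication stream, which is what makes frequent, low-risk testing possible.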
VM Replication makes it easy to test frequently. There’s no charge to test, it’s as simple as creating a snapshot, and you won’t put your production environment at risk.
Thank you for your time. Please stick around if you have any questions.