Nathan Schmidt
Services Account Manager, Strategic Solutions
Certified in Business Continuity from DRI International
Here’s what we’ll cover in today’s webinar. Through talking with our customers, we’ve compiled a list of our top 5 myths related to DR. We’ll review what DR really means and a good way to approach the subject. We’ll also identify the most common challenges related to replication and how our new product helps address those problems. Lastly, we’ll open the floor to questions and discussion around resiliency.
Here’s what made it on our list of Top 5 Myths About DR. We start with number 1. Just because your DC isn’t located in a disaster-prone area doesn’t mean that Mother Nature can’t surprise you with a perfect storm. Recent memory reminds us that Florida isn’t the only state that can get hit by a hurricane.
click
Although natural disasters do happen, they’re not the leading cause of data center outages. Statistically speaking, most self-managed DC outages are attributed to human error. The ’80s band The Human League said it best in their song titled “Human”…we’re “born to make mistakes.”
There will always be inherent risk. You can’t escape it. It’s kinda like juggling chainsaws…after a while, things are going to get messy. So it’s not a matter of if disaster strikes, but when.
Let’s consider this sentence: “If I back up my data, that should be enough.” In terms of DR, would you consider this statement true or false?
click
…It’s a resounding FALSE. Even if you performed full backups weekly and incremental backups daily, this isn’t enough. Let’s take a look at the example schedule. You perform a full backup of your data every Sunday. Do you know the difference between differential and incremental? A differential backup copies the changes made since the last full backup. An incremental backup only copies the changes since the last incremental.
click
So what if there’s a data center outage on Wednesday? First things first, you need access to the affected servers in order to reconfigure the machines and bring them online so that you can start restoring your data. Let me stop here. If the entire DC is down, for whatever reason, then you don’t currently have access to the onsite servers. For the sake of continuing this example, let’s assume that you have invested in geographic redundancy, have your data stored offsite, and also have additional server capacity in a different DC. The next step would be to restore the most recent full backup from Sunday to the offsite servers. Then you would restore all of the subsequent incremental backups. The restoration process could potentially take several hours or even days depending on the amount of data. While your data may be protected, the restoration process is slow and doesn’t address downtime well.
click
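To make the restore sequence concrete, here’s a minimal sketch of the chain of restore operations the Wednesday-outage example implies. The dates and the helper function are made up for illustration; the point is that every incremental since the last full backup must be replayed in order.

```python
from datetime import date, timedelta

# Hypothetical example: a weekly full backup on Sunday plus daily
# incrementals. After an outage, you restore the full backup first,
# then replay every incremental taken since then, in order.
def restore_chain(full_backup_day: date, outage_day: date) -> list[str]:
    chain = [f"full backup ({full_backup_day})"]
    day = full_backup_day + timedelta(days=1)
    while day < outage_day:
        chain.append(f"incremental ({day})")
        day += timedelta(days=1)
    return chain

# Sunday 2024-06-02 full backup, outage on Wednesday 2024-06-05:
steps = restore_chain(date(2024, 6, 2), date(2024, 6, 5))
print(steps)
# Three restore operations before the data is back: the full backup,
# then Monday's and Tuesday's incrementals.
```

Each step in that chain takes real time against real data volumes, which is why a backup-only strategy protects data but does little for downtime.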
Here’s a surprising tidbit… According to an SMB Disaster Preparedness Survey, “less than half of SMBs back up their data weekly or more frequently, and only 23% back up daily.” 41% of the SMBs surveyed said that putting together a DR plan never occurred to them.
Not testing the failover enough reduces your chances of a successful failover when you really need it. If you don’t practice and test your failover plan, then the results may leave you with additional downtime and a bewildered look on your face when the process doesn’t work like it should.
Companies may not know they need a formal DR strategy, or may not fully realize how downtime impacts their business.
click
I think we can all agree that few things are as scary as a zombie apocalypse. However, this frightful fact may keep you up at night. “93% of companies that lost their data for 10 days or more filed for bankruptcy within one year of the disaster, and 50% filed for bankruptcy immediately.” Scary stuff.
A common theme among Rackspace customers is that they know they need a disaster recovery plan; they just don’t know what their DR plan might look like. Many customers think that backing up their data would suffice. As we saw in Myth number 3, it’s important to protect your data, but backups do not help you quickly recover your business and keep it running after a major disruption. The goal of a backup is to enable data restoration. It doesn’t cover the operational requirements of a business. You would need a more complete disaster recovery strategy in place with the right mix of resiliency tools. A disaster recovery plan helps you restore operations quickly.
Disaster recovery is a holistic strategy that includes process, policies, people and technology. It focuses on restoring the IT systems that are critical to supporting business functions. In other words, it helps keep the business running after a major disruption occurs. A disruption could be Mother Nature’s wrath, or a guy named Bob, who installed a patch that immediately broke a critical application. If you have a DR plan in place, it’ll help you “keep the lights on,” so to speak, and the company open for business.
There are three basic concepts to consider when thinking about a DR plan. The first one is DR at the data center level. The IT manager is concerned with unplanned downtime or an outage in his primary DC. Second, is the application level. A ton of time is spent getting an app to run just right. If a disruption occurs, then the IT Manager needs to be able to simply restart the app and it should run as if nothing happened. The final concept is DR at the information level. The IT Manager is focused on having all of the critical data that he needs in order to rebuild the business quickly.
The most common challenge that we’ve heard from our customers is that DR isn’t a high enough priority. The best way to convince your boss that DR is a necessity is by showing the impact of downtime on the business.
The impact of downtime can manifest itself in many forms, such as loss of employee productivity, new sales, existing customers, brand reputation, etc. As Cuba Gooding Jr. said in the movie Jerry Maguire, “show me the money.” C-level execs need to understand the potential loss in terms of a monetary amount. If our company goes down, we stand to lose “x” number of dollars per hour. Gartner estimates that the average cost of downtime for a small to mid-sized company is $42,000 per hour. For larger companies or companies with an ecommerce business model, that number could easily go north of six figures per hour. Remember that factoid from myth number 5? “93% of companies that lost their data for 10 days or more filed for bankruptcy within one year of the disaster, and 50% filed for bankruptcy immediately.” Your company’s survival depends on quantifying the impact of downtime.
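The “show me the money” math above is simple enough to sketch. This is a minimal illustration using the Gartner average quoted in the talk; the 8-hour outage length is a made-up example, not a figure from the source.

```python
# Translate downtime into dollars. The $42,000/hour default is the
# Gartner SMB average cited above; larger or ecommerce-heavy companies
# would plug in a much bigger number.
def downtime_cost(hours_down: float, cost_per_hour: float = 42_000) -> float:
    return hours_down * cost_per_hour

# A hypothetical 8-hour outage at the average SMB rate:
print(f"${downtime_cost(8):,.0f}")  # → $336,000
```

Framing an outage as a dollars-per-hour figure like this is what turns DR from an IT wish-list item into a budget line execs will approve.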
Two important metrics to include in the DR conversation are RPO and RTO. RPO stands for Recovery Point Objective. This represents how old your data is once it’s been restored. I think it’s easier to explain the metrics with a story. I’ll spin you a yarn about a company called Necktorious, Incorporated. Let’s say Necktorious is a retail store that sells rakish scarves year-round… Even during the summer. In this example that I made up, the shop uses a managed hosting provider to fully manage their servers, storage, network and operating system for their ecomm site. That way they don’t have to worry about the infrastructure and can focus on designing next summer’s nautical-themed scarf. The owner, Fred Jones, has a very low RPO of 15 minutes for his data. Therefore, he utilizes a database replication service to copy his online store’s customer transaction data between two data centers. This resiliency tool provides about a 5-minute RPO, which is within Fred’s data loss tolerance of 15 minutes. Ideally Fred would like a zero RPO, but the DB replication fits within his budget and minimizes data loss after a disaster.
Now let’s examine RTO, or Recovery Time Objective. RTO represents how long it takes until your users are able to continue normal operations. Fred has calculated that if his ecomm site becomes unavailable, he loses $10,000 per hour in sales, or much more if People magazine recently published a photo of a B-list celeb spotted in Venice Beach wearing one of his cashmere scarves. Fred has his ecomm platform running on virtual machines that are replicated between data centers using a replication solution. The business-critical VMs are replicated every 4 hours. Since his shopping cart app and the product catalog display don’t change often, this interval works well from a data restoration perspective. Replicating the critical ecomm apps helps minimize downtime because there’s no data restoration involved.
Here is a graphical representation of what our ecomm story might look like. You can see the VMware VMs being replicated to a second DC every four hours. The MySQL database is replicated more often – every 5 minutes – because of its higher data change rate. The database stores all of the critical information from online transactions.
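The Necktorious story boils down to a simple rule: a resiliency tool meets an RPO target only if its replication interval (the worst-case age of the data at failover) fits within the stated tolerance. Here’s a minimal sketch using the numbers from the story; the 6-hour tolerance for the shopping cart app is a hypothetical value I’ve added for illustration.

```python
# An RPO target is met when the replication interval (worst-case
# data loss) is no larger than the business's tolerance.
def meets_rpo(replication_interval_min: int, rpo_target_min: int) -> bool:
    return replication_interval_min <= rpo_target_min

# Database replication: ~5-minute interval vs. Fred's 15-minute RPO.
print(meets_rpo(5, 15))     # → True
# 4-hour VM replication would blow the database's 15-minute RPO...
print(meets_rpo(240, 15))   # → False
# ...but is fine for the slow-changing shopping cart app, assuming a
# hypothetical 6-hour tolerance for it.
print(meets_rpo(240, 360))  # → True
```

That’s why Fred mixes tools: fast, expensive replication for the high-churn transaction data, and cheaper 4-hour VM replication for everything else.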
Here’s an example of an ecomm reference architecture that incorporates VM Replication to protect two critical VMs.
Rackspace has several resiliency tools to choose from depending on your needs: DNS Failover, Database Replication, Host-Based Replication, Storage-Based Replication, Remote Data Replication, Managed Backup, and Cloud Backup. In this webinar, we’ll focus on achieving geographic redundancy through the VM Replication tool.
VM Replication provides geographic redundancy to help protect business-critical VMs in the event of a data center outage or unplanned downtime. This solution is part of the Rackspace Managed Virtualization portfolio. Managed Virtualization is a dedicated, single-tenant environment that’s based on VMware virtualization technology.
Our VM Replication offering was developed with real customer replication needs in mind. We worked with Virtual, Inc., a leading association management specialist. Virtual was seeking a simple and affordable host-based replication solution for their client’s fully virtualized, business-critical environment, and decided that VM Replication was a good fit. Virtual engaged with Rackspace during the product development process and also served as the first beta tester. Russell Kuhl, the VP of Technology at Virtual, said, “This beta experience was beneficial for both Rackspace and Virtual. We provided feedback to Rackspace that helped develop a product that solves real-world replication scenarios. In turn, we are very satisfied to have VM Replication as part of our customer’s broader disaster recovery plan, and the peace of mind that we can count on Rackspace to provide the reliable support we’ve come to expect, delivered with a personalized touch.”
By talking with our customers, we uncovered three common hurdles associated with adopting a replication solution: cost, complexity, and making sure that the failover is tested often. I’ll also present how you can overcome each hurdle with VM Replication.
There are two common ways of replicating VMs: #1, replication at the hypervisor layer, also known as host-based replication, and #2, replication at the VM layer, or guest-based replication. Host-based replication is controlled by a virtual appliance that sits on the source and target hypervisors. The VA replicates active VMs on the source site, and they’re transferred and stored on the target site in an inactive state. In other words, the replicated VMs on the target site remain powered down and do not become active unless a failover is initiated. Guest-based replication utilizes the VMs themselves to control the replication process. A VM on the source site replicates its data changes to an active counterpart on the target site. The replicated VMs on the target site must be powered on because they receive the data and ensure data consistency. The replication method is an important cost consideration because with a host-based solution, you do not have to pay for the replicated VMs on the target site. Those VMs are powered down, so you would only pay for them after a failover has occurred. Conversely, guest-based replication has active VMs on the target site. Even though you don’t use them for production, you still have to pay for them.
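The cost difference between the two methods can be sketched as a back-of-envelope model. All of the prices here are hypothetical placeholders for illustration, not actual provider rates.

```python
# Assumed per-VM monthly licensing/maintenance cost (hypothetical).
VM_MONTHLY_COST = 200

def monthly_target_cost(num_vms: int, host_based: bool) -> int:
    """Steady-state monthly cost of the replicated VMs in the target DC."""
    if host_based:
        # Host-based: target VMs stay powered off, so there's no
        # per-VM charge until a failover actually happens.
        return 0
    # Guest-based: target VMs must run to receive changes, so you
    # pay for them even though they serve no production traffic.
    return num_vms * VM_MONTHLY_COST

# Five replicated VMs under each method:
print(monthly_target_cost(5, host_based=True))   # → 0
print(monthly_target_cost(5, host_based=False))  # → 1000
```

The steady-state gap grows linearly with the number of protected VMs, which is why the replication method matters so much for the DR budget.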
With guest-based replication, it would be like owning two homes: a main house that you live in 50 weeks out of the year, and a vacation home that you visit for two weeks in the summer. In the guest-based scenario, the lights in your summer home would be turned on all the time. Whether you’re vacationing there or not, you would have to pay for the electricity cost of the summer home, on top of the electric bill for your main home.
VM Replication is a host-based solution. You do not have to pay for the maintenance and licensing costs related to the replicated VMs in the target data center. Like I mentioned before, the replicated VMs remain powered off until a failover has been initiated. Only after a failover would you start paying for the replicated VMs in the secondary DC. Once the primary data center becomes available, then you would failback to the primary DC and power the target VMs back down.
In addition to the replication method, you should also consider the costs related to the additional infrastructure needed for geographic redundancy. As I mentioned before, with host-based replication, the critical VMs that reside in the target DC are powered off. This is important because it enables you to take advantage of the idle resources in the secondary DC for other purposes, like a test or dev environment. Using the redundant hypervisor for non-production workloads helps justify the additional hardware costs. As part of your failover runbook, you could make sure to shut down all apps on the redundant hypervisor before you power on the replicated VMs and fail over. If you only replicate the most critical VMs and not the entire environment, then you probably don’t need to reproduce the exact hardware setup that you use for production. The infrastructure in the target DC could have a smaller footprint than the source DC – fewer servers, less cost. Another option to consider is using less powerful servers and storage in the secondary data center. The target infrastructure could be designed with less expensive equipment, so that the backup VMs run with degraded performance for a short time until you fail back to the primary DC. Depending on your needs, a temporary performance degradation may be acceptable.
Keeping with the dual-home analogy from before, let’s say that your main house is a 4-bed, 3-bath with a 3-car garage. It’s the perfect size for a family with three kids and a dog. When you and the spouse go on vacation to the summer home, you leave the kiddos with grandma at the main house. Let’s be honest, all parents deserve a kid-free vacation once a year. Now let’s imagine that your summer home is a beach bungalow. It has only one bedroom and a thatched roof. There’s no garage, just an open grassy backyard with a mango tree. It’s a perfect fit for just the two of you – no kids. Heterogeneous server infrastructure is like having a 4-bed house for a primary data center, and a bungalow for the secondary DC. Since it’s only you two vacationing at the summer home, you don’t need the extra space and storage for a full family. It’s the same concept for replicating just a handful of critical VMs to the target site.
VM Replication allows for heterogeneous infrastructure between the source data center and the target DC. VM Replication lets you select only the business-critical VMs that you want replicated. And since there is no need to replicate the entire environment, you can have a smaller server footprint in the target DC. For example, you may have a cluster of three servers in the source data center and a single server in the target DC. Unlike array-based replication, you’re not required to have the same expensive storage systems on both ends. For example, you could have a dedicated SAN storage system in the primary DC and simply use the server’s local storage in the secondary data center.
You know that it’s best practice to have a DR plan, but who’s going to develop it? Even if a plan already exists, who’s going to implement, test, and manage it? These represent just some of the questions that you may have around the complexity of disaster recovery. While Rackspace does not sell an end-to-end DR solution, we do offer a wide-range of resiliency tools and the deep knowledge and experience to help you design, architect and configure a solution that will fit into your DR strategy.
Here are the areas where Rackspace can lend a helping hand, and the areas that the customer must own. The top level is the holistic DR strategy. This is owned by the customer. Remember when we defined disaster recovery? It encompasses more than just the technology – also the policies, people, and process. The customer is responsible for creating the DR plan, training the appropriate employees, creating the failover runbook, testing the failover often, making the “go-time” decision to failover after a disruption occurs, and then deciding to failback once the primary DC comes online. Rackspace is responsible for executing the failover to the target DC once authorization has been given by the customer. Rackspace also monitors the VM Replication virtual appliance and alerts the customer when a replication fails to complete. As part of the Managed Virtualization service, Rackspace also manages the VM, guest OS, and hypervisor layers. In addition to the software layers, the dedicated hardware, network and the DC are also covered. The failover decision is the customer’s responsibility, but we assist and are on call during the process.
Rackspace Fanatical Support is your answer to a lack of resiliency expertise and an overburdened IT department. Our Virtualization team manages one of the largest VMware deployments in the world. We have architects available who are VMware Certified Professionals. They are here to help you design and deploy the right mix of infrastructure and the virtualization layer. The Replication team focuses solely on our replication tools. They have deep experience with building replication solutions for all types of workloads, use cases, performance requirements and budgets.
Failover testing can be expensive, risky, and planning intensive. These are the challenges that prevent companies from testing their failover plan enough. How often should the failover be tested? Well, that probably depends on how often the VMs being replicated are updated or changed. Anytime you introduce a new patch on the guest OS or change an app’s configuration, it would warrant a failover test to make sure that everything starts up and runs as expected. Unfortunately, this can be an expensive endeavor. Some services charge a fee for every failover test. This can really add up and eat into the DR budget. Testing the failover and failback process often also increases your chances of incurring actual downtime because your production environment is involved. What happens if your primary DC does not fail back properly? In order to mitigate the risk of not being able to recover from a failover test, you would need extensive planning and coordination around every test.
Let’s take a look at how Rackspace VM Replication handles failover testing. The risk of not being able to fail back is removed by conducting the test within the target Rackspace data center. A snapshot is created of the replicated VM 2. The snapshot allows you to test the replicated VM without putting your production environment at risk. When you’re done testing, simply trash the snapshot and continue the replication.
VM Replication makes it easy to test frequently. There’s no charge to test, it’s as simple as creating a snapshot, and you won’t put your production environment at risk.
Thank you for your time. Please stick around if you have any questions.
click