10. Replication Resiliency
• Resiliency from failures
• Retry and resume semantics
• Resynchronization
• Seamless handling of VM mobility
• No admin intervention required
• Live Migration, Storage Migration and Quick Migration
• Within cluster and across cluster
17. VM Mobility
[Diagram: VM mobility, Site A and Site B]
Prerequisites:
• Primary migration: all possible primary servers must be authorized on the Replica server
• Replica migration: requires the Hyper-V Replica Broker
19. Planned Failover
1. Shut down the primary VM
2. Send the last log
3. Fail over the Replica VM
4. Reverse the replication
• Use cases: testing DR, site maintenance, or an impending disaster
• Zero data loss, but some downtime
• Efficient reverse replication
[Diagram: planned failover, Site A and Site B]
20. Planned Failover
• Started on the primary VM, ended on the Replica VM
• No duplicate VM is created
• Timeframe: up to you
• Recommended frequency: every 6 months
• Replication: continues, in reverse direction
• Data loss: No
• Downtime: Yes (planned)
24. Test Failover
• Started on the Replica VM
• Duplicate VM is created
• Timeframe: short
• Recommended frequency: once a month
• Replication: continues
• Data loss: No
• Downtime: No
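26. Failover
• Used when there is an actual issue at the primary site
• The Replica server uses remote WMI to test whether the primary VM is still running (to prevent split-brain)
• You can fail over to a previous point in time (PIT) if recovery history is used
• If the failover is OK, complete it to merge (commit) the chosen recovery point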
27. Failover
• Started on the Replica VM
• No duplicate VM is created
• Timeframe: depends
• Recommended frequency: never
• Replication: stopped
• Data loss: possible
• Downtime: Yes
32. Network Throttling
• Use Windows Server 2012 QoS to throttle replication traffic
• Throttling based on the destination subnet
• Throttling based on the destination port
• Throttling based on the application name
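The three throttling criteria on this slide can be combined in a single policy. The Python model below shows how a policy might decide whether a connection is throttled. It is a sketch under stated assumptions, not the Windows QoS engine:
- A policy matches when every condition it specifies matches.
- The first matching policy wins.
- Policy names, rates, and the sample application name are hypothetical.

```python
"""Illustrative model of policy-based QoS matching for replication traffic.

Assumptions: each policy may match on destination subnet, destination port,
and/or application name; the first matching policy wins. Names and rates are
hypothetical, and this is not the Windows QoS implementation.
"""
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThrottlePolicy:
    name: str
    rate_bps: int
    dst_subnet: Optional[str] = None  # e.g. "10.20.0.0/16" for the DR site
    dst_port: Optional[int] = None    # e.g. the replication listener port
    app_name: Optional[str] = None    # e.g. the process sending the traffic

    def matches(self, dst_ip: str, dst_port: int, app: str) -> bool:
        # Every condition that is set must match; unset conditions are ignored.
        if self.dst_subnet is not None and (
            ipaddress.ip_address(dst_ip) not in ipaddress.ip_network(self.dst_subnet)
        ):
            return False
        if self.dst_port is not None and dst_port != self.dst_port:
            return False
        if self.app_name is not None and app.lower() != self.app_name.lower():
            return False
        return True


def throttle_rate(policies, dst_ip: str, dst_port: int, app: str) -> Optional[int]:
    """Return the rate of the first matching policy, or None if unthrottled."""
    for policy in policies:
        if policy.matches(dst_ip, dst_port, app):
            return policy.rate_bps
    return None
```

Matching on the destination subnet throttles everything sent to the DR site. Matching on port or application name narrows the policy to replication traffic alone.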
33. Network Utilization
• Replicating multiple VMs in parallel
• Higher concurrency leads to resource contention and latency
• Lower concurrency leads to underutilization of the network
• Manage initial replication through scheduling
• Manage delta replication
Network (bandwidth, latency, packet loss)    Ideal number of parallel transfers
1.5 Mbps, 100 ms, 1% packet loss             3 (default)
300 Mbps, 10 ms, 1% packet loss              10
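Managing delta replication comes down to one question: can the link send one interval's worth of changes before the next interval starts? The back-of-the-envelope check below answers it. It is a sketch under stated assumptions:
- Deltas ship on the five-minute cycle described later in the notes.
- Throughput is the usable Mbps of the link.
- Compression and protocol overhead are ignored.
- The function names and inputs are hypothetical.

```python
"""Back-of-the-envelope check: can a link keep up with delta replication?

Assumptions: deltas ship once per replication interval (300 s by default
here), usable throughput is in megabits per second, and compression and
protocol overhead are ignored. Function names are hypothetical.
"""


def transfer_seconds(churn_mb: float, usable_mbps: float) -> float:
    """Seconds needed to send churn_mb megabytes at usable_mbps megabits/s."""
    return churn_mb * 8 / usable_mbps


def keeps_up(churn_mb_per_interval: float, usable_mbps: float,
             interval_s: int = 300) -> bool:
    """True if one interval's log can be sent before the next one is due."""
    return transfer_seconds(churn_mb_per_interval, usable_mbps) <= interval_s
```

On the 1.5 Mbps link from the table, 50 MB of churn per interval takes about 267 seconds and just fits. At 60 MB the transfer takes 320 seconds, and replication falls further behind every cycle. When several VMs replicate in parallel, pass the sum of their churn, because they share the same link.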
34. Backup Interoperability
• Backup copy to seed Initial Replication
• Backing up the primary VM
• Concurrent backup and replication are handled seamlessly
• Restoring the primary VM requires a resync
• Backing up the Replica VM
• The Replica VM is turned off
• Backup is put on hold while replication is modifying the VHD
• Restoring the Replica VM requires a resync
35. Server Impact
• Impact on primary server
• Storage space: Proportional to writes in the VM
• Storage IOPS: ~1.5 times the write IOPS
• Impact on replica server
• Storage space: proportional to the write churn
• Each additional recovery point: ~10% of the base VHD size
• Storage IOPS:
• Memory ~50MB per replicating VHD
• CPU impact <3%
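The rules of thumb on this slide can be turned into a quick sizing estimate. The sketch below applies three of them:
- Primary IOPS of about 1.5 times the VM's write IOPS.
- About 10% of the base VHD size per additional recovery point.
- About 50 MB of Replica-server memory per replicating VHD.

It is a rough estimate, not a sizing tool. The function name and all example inputs are hypothetical.

```python
"""Rough sizing from the slide's rules of thumb (illustrative, not a sizing tool).

Assumptions: primary storage IOPS ~1.5x the VM's write IOPS, each additional
recovery point ~10% of the base VHD size, ~50 MB of Replica-server memory per
replicating VHD. All names and example inputs are hypothetical.
"""
from dataclasses import dataclass


@dataclass
class ImpactEstimate:
    primary_iops: float
    replica_storage_gb: float
    replica_memory_mb: int


def estimate_impact(write_iops: float, base_vhd_gb: float,
                    extra_recovery_points: int, replicating_vhds: int) -> ImpactEstimate:
    return ImpactEstimate(
        # Rule of thumb: primary storage IOPS ~1.5x the VM's write IOPS.
        primary_iops=1.5 * write_iops,
        # Base copy plus ~10% of the base size per additional recovery point.
        replica_storage_gb=base_vhd_gb * (1 + 0.10 * extra_recovery_points),
        # ~50 MB of memory per replicating VHD on the Replica server.
        replica_memory_mb=50 * replicating_vhds,
    )
```

For instance, take a VM doing 200 write IOPS, with two 100 GB VHDs and four extra recovery points, sized per disk. That gives roughly 300 IOPS on the primary and about 140 GB of replica storage per disk. It also gives 100 MB of Replica-server memory.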
36. PowerShell
• Use PowerShell to manage and automate your replicas
• Get-Command -Module Hyper-V | Where-Object { $_.Name -like "*replication*" }
• Get-Command -Module Hyper-V | Where-Object { $_.Name -like "*failover*" }
37. Tips
• Use bandwidth control!
• Firewall!
• Cluster: Replica Broker role
• Traffic encrypted or not?
• Which VHD(X) files to replicate?
• Watch for resynchronization!
42. Saving Disk Space
• Use dynamic disks on the Replica side
• Enable replication from the customer to the hosting provider using online initial replication (IR) or out-of-band IR.
• The hosting provider waits for the IR to complete.
• The hosting provider can then pause replication on the Replica server at any time. This prevents the Hyper-V Replica Log (HRL) files from being applied to the disk while it is being converted.
• The hosting provider can then convert the disk from fixed to dynamic using the Edit Disk wizard's Convert option.
• The hosting provider then replaces the fixed disk with the dynamic disk at the same path and with the same name.
• The hosting provider resumes replication on the Replica site.
• Convert-VHD -Path C:\FixedDisk.vhdx -DestinationPath F:\FixedDisk.vhdx -VHDType Dynamic
43. Online Resize supported?
• No need for resync
• No need to delete and re-enable replication
• But you need to resize both sides manually
• However: watch out when failing over to older recovery points…
44. Upgrading to R2
• First upgrade the Replica servers
• Or migrate to a new 2012 R2 server
• Then upgrade your primary servers
45. Deduplication on Replica server
• Without recovery points… No problem
• With recovery points:
• 5 to 7 times slower…
• 15 seconds can be a problem… 5 minutes maybe…
• Solution:
• Defragment volume (once every 3 days at least)
• Increase the dedup policy to 1 day instead of 3
47. Best Practices Analyzer
37 A Replica server must be configured to accept replication requests
38 Replica servers should be configured to identify specific primary servers authorized to send replication traffic
39 Compression is recommended for replication traffic
40 Configure guest operating systems for VSS-based backups to enable application-consistent snapshots for Hyper-V Replica
41 Integration services must be installed before primary or Replica virtual machines can use an alternate IP address after a failover
42 Authorization entries should have distinct tags for primary servers with virtual machines that are not part of the same security group.
43 To participate in replication, servers in failover clusters must have a Hyper-V Replica Broker configured
44 Certificate-based authentication is recommended for replication.
45 Virtual hard disks with paging files should be excluded from replication
46 Configure a policy to throttle the replication traffic on the network
47 Configure the Failover TCP/IP settings that you want the Replica virtual machine to use in the event of a failover
48 Resynchronization of replication should be scheduled for off-peak hours
49 Certificate-based authentication is configured, but the specified certificate is not installed on the Replica server or failover cluster nodes
50 Replication is paused for one or more virtual machines on this server
51 Test failover should be attempted after initial replication is complete
52 Test failovers should be carried out at least monthly to verify that failover will succeed and that virtual machine workloads will operate as expected after failover
53 VHDX-format virtual hard disks are recommended for virtual machines that have recovery history enabled in replication settings
54 Recovery snapshots should be removed after failover
49. Site Recovery
[Diagram: Microsoft Azure Site Recovery provides DR orchestration for Windows Server 2012+ Hyper-V, with SCVMM & DRP at each site and an extensible data channel; target: Microsoft Azure]
50. Orchestrated Disaster Recovery
On-premises to on-premises protection (Site-to-Site)
• Primary site (Hyper-V) to recovery site (Hyper-V): orchestration channel via Microsoft Azure Site Recovery; replication channels: Hyper-V Replica, SQL AlwaysOn, SAN
• Primary site (VMware / physical) to recovery site (VMware / physical): Microsoft Azure Site Recovery, with InMage Scout as the orchestration and replication channel
On-premises to Azure protection (Site-to-Azure)
• Primary site (Hyper-V) to Microsoft Azure Site Recovery: orchestration and replication via Hyper-V Replica, SQL AlwaysOn
• Primary site (VMware / physical) to Microsoft Azure Site Recovery: orchestration and replication via InMage Scout (coming soon!)
Key features include:
• Automated VM protection and replication
• Remote health monitoring
• Near-zero RPO
• No-impact recovery plan testing
• Customizable recovery plans
• Minimal RTO: a few minutes to hours
• Orchestrated recovery when needed
• Replicate to, and recover in, Azure
• Heterogeneous physical and virtual support
Download InMage Scout
51. Possibilities
• On-Premises VMM Site to Azure (Hyper-V Replica)
• On-Premises to On-Premises VMM Site (Hyper-V Replica)
• On-Premises to On-Premises VMM Site (SAN Replication)
• On-Premises to On-Premises VMware Site protection
• On-Premises to Azure Hyper-V Site protection
Affordability is the key word here! Working as a consultant on disaster recovery projects has taught me that many customers look into DR projects but fail to carry them out because of the cost. Having replicas in other datacenters can be extremely expensive when you look at the dual SANs you need. Add to that the licenses for this kind of replication, the setup, the training, and all the other factors that come into play.
For many companies it is already very difficult to afford a SAN in the first place, let alone a second one in another office with the same setup as the production site.
And that is also one of the major drawbacks of these kinds of plans: having an exact copy of the production site not only costs a lot of money but also brings a lot of management challenges.
Relevance and Overview of Hyper-V Replica
Capabilities and Value Proposition
Deployment Considerations
Hyper-V Replica provides asynchronous replication of Hyper-V virtual machines between two hosting servers. It is simple to configure and does not require either shared storage or any particular storage hardware. Any server workload that can be virtualized in Hyper-V can be replicated. Replication works over any ordinary IP-based network, and the replicated data can be encrypted during transmission. Hyper-V Replica works with standalone servers, failover clusters, or a mixture of both. The servers can be physically co-located or widely separated geographically. The physical servers do not need to be in the same domain, or even joined to any domain at all.
You get an asynchronously replicated copy of a VM, which enables several DR scenarios.
There is PowerShell support.
Application agnostic: it sits on the host and simply replicates the VM to the other side.
Storage agnostic: it works with whatever storage runs in the back end!
What does Replica actually do? For high availability within a site you can use clustering. Replica is for when the entire site goes down: you can then fail over to the replicated VMs at the other site.
Hyper-V Replica provides a storage-agnostic and workload-agnostic solution that replicates efficiently, periodically, and asynchronously over IP-based networks, typically to a remote site. Hyper-V Replica allows a Hyper-V Administrator, in the event of a failure at a primary site (e.g. fire, natural disaster, power outage, server failure etc…), to execute a failover of production workloads to replica servers at a secondary location within minutes, thus incurring minimal downtime. The configurations at each site do not have to be the same with respect to server or storage hardware. Hyper-V Replica provides a System Administrator the option to restore virtualized workloads to a point in time depending on the Recovery History selections for the virtual machine. Hyper-V Replica provides the necessary management APIs that enable IT management vendors to build an enterprise class Disaster Recovery (DR) solution for their customers. Hyper-V Replica enables Infrastructure as a Service (IaaS) for hosting providers that host dedicated/virtual servers for their customers. With Hyper-V Replica, Hosters can provide solutions that offer DR as a service to their customers (specifically Small and Medium Business (SMB) customers).
Typical deployment for a smaller shop
There are also options for hosting service providers, who can offer DR as a service (DRaaS) to their customers (especially SMBs that have no secondary site!)
How does it work:
Every write to the VHD is also recorded in a log file. When the five-minute timer kicks in, that log file is sent to the destination host, where it is applied to the replica VHD.
Hyper-V replication works a lot like the Cluster Continuous Replication feature found in Exchange Server 2007. One of the Hyper-V servers (or server clusters) is treated as the source and another host server or host server cluster is treated as the destination. Replicating virtual machines from the source to the destination is based on log shipping.
The process starts by keeping track of the write operations that are made against virtual hard disk files on the source server. The server compiles a series of logs that keep track of all of these write operations. On a periodic basis (usually every five minutes) the log files are copied to the destination server and the log file contents are used to update the virtual machines that are stored on the destination host. That is how the replication process works in a nutshell. There are a few other details that you need to be aware of, but I will be covering those details when it comes time to actually configure the replication process.
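The log-shipping cycle described above can be sketched in a few lines of Python. Writes land on the primary disk and in a log, and a periodic call ships the log and replays it on the replica. This is a toy model under stated assumptions:
- A VHD is modelled as a dict of block number to bytes.
- `ship_log()` stands in for the periodic (usually five-minute) transfer.
- The class and method names are hypothetical.

```python
"""Toy log-shipping model of Hyper-V Replica (illustrative only).

Assumptions: a VHD is a dict of block number -> bytes, the log is an ordered
list of (block, data) writes, and ship_log() stands in for the periodic
(usually five-minute) transfer. Class and method names are hypothetical.
"""


class ReplicatedDisk:
    def __init__(self):
        self.primary = {}   # primary VHD: block number -> data
        self.replica = {}   # replica VHD on the destination host
        self.log = []       # writes tracked since the last transfer

    def write(self, block: int, data: bytes) -> None:
        # Every guest write lands on the primary VHD and is tracked in the log.
        self.primary[block] = data
        self.log.append((block, data))

    def ship_log(self) -> int:
        """Send the log to the replica, replay it in order, and start a new log."""
        shipped, self.log = self.log, []
        for block, data in shipped:
            self.replica[block] = data  # later writes to a block overwrite earlier ones
        return len(shipped)
```

Between two calls to `ship_log()` the replica lags the primary. That gap is why the replication is asynchronous, and why an unplanned failover can lose data.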
The WAN can go down at any time
Bad times
Storage failures
Replication automatically follows the VM (Replica Broker…)
Scenario:
Break replication from one host to the other
Rebuild replication from the other host
But: there are prerequisites
Most of the time you are going to use what we call planned DR: testing, doing site maintenance, preparing for disasters waiting to happen, etc.
In the end there are three options: test failover, (unplanned) failover, and planned failover.
No data loss! This is very important, but there is downtime!
Planned failover (PFO) is an operation initiated on the primary VM that allows you to do an end-to-end (e2e) validation of your recovery plan. PFO requires the VM to be shut down to ensure consistency.
PFO is *NOT* a substitute for High Availability which is achieved through clustering. PFO allows you to keep your business running with minimal downtime even during planned downtimes and guarantees zero data loss.
Planned failover is used in the following cases:
I want to perform host maintenance on my primary and would like to run from the replica site.
My primary site is expecting some power outage – I want to move over to the replica site.
There’s an impending typhoon – I want to proactively take action to ensure business continuity.
My compliance requirements mandate that every quarter I run my workloads from the replica site for a week.
What about the network?
Different subnets
You can use VLAN tagging
New IP address: 3 ways of doing that!
Update DNS records! This needs to be done manually…
Network capacity is crucial!
Will the source and destination servers reside behind the same firewall?
The first question that must be considered is whether or not the source and the destination reside behind the same firewall. Your firewall configuration does not dramatically change the configuration process, but you will need to open certain firewall ports if there are firewalls between the source and destination servers. Assuming that you are using the Windows firewall, Microsoft actually provides preconfigured rules that you can use for Hyper-V replication. I will discuss firewall ports in more detail when we begin the configuration process.
Will the source or destination reside on a failover cluster?
The vast majority of the configuration process for Hyper-V replicas is the same regardless of whether or not you are replicating between a cluster and a host or another cluster. However, the use of clusters does change one aspect of the configuration process. If replication will occur to or from a cluster (or both) then you will have to use a component called the Hyper-V Replica Broker. This component makes the replication process aware of the cluster’s NetBIOS name and IP address.
Does replication traffic need to be encrypted?
With Microsoft’s heavy emphasis on security it might seem strange, but virtual machine replication traffic does not get encrypted by default. If you want to encrypt replication traffic, you will have to use certificate-based authentication. Microsoft recommends using certificate-based authentication if you are replicating content between hosts that are geographically separated. For example, if you replicate virtual machines to a standby data center or to the cloud, you should be using certificate-based authentication. On the other hand, if you are simply replicating virtual machines between two hosts that reside within the same data center, you can probably get away with using Kerberos authentication instead.
Configuring virtual machine replication based on Kerberos is a relatively simple process. However, if you plan on using Kerberos authentication, the host servers will need to belong to a common Active Directory domain or to mutually trusted domains. Otherwise you will have to use certificate-based authentication.
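The authentication guidance above reduces to a small decision. Kerberos requires a common or mutually trusted domain. Certificate-based authentication is needed for encryption and is recommended between geographically separated hosts. The sketch below encodes only those two rules. The function name, parameters, and return strings are hypothetical and are not Hyper-V settings.

```python
"""Sketch of the authentication choice described above (illustrative only).

Assumption: the inputs are yes/no answers an administrator would give; the
function name and return strings are hypothetical, not Hyper-V settings.
"""


def choose_authentication(same_or_trusted_domain: bool,
                          geographically_separated: bool) -> str:
    # Kerberos needs both hosts in a common AD domain or mutually trusted domains.
    if not same_or_trusted_domain:
        return "certificate"
    # Replication traffic is only encrypted with certificate-based authentication,
    # which is recommended for standby data centers or the cloud.
    if geographically_separated:
        return "certificate"
    # Same data center and same (or trusted) domain: Kerberos is usually enough.
    return "kerberos"
```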
Which virtual hard disk files need to be replicated?
This question is important because replication occurs at the storage level, not the virtual machine level. This means that you will have to pick and choose which virtual hard disk files need to be replicated.
At first it might seem that the obvious answer to this question is to replicate all of the virtual hard disk files. One thing to keep in mind, however, is that the replication process can be bandwidth intensive. This is especially true for virtual hard disks that incur a lot of write operations. That being the case, you might not want to replicate the virtual hard disks belonging to virtual machines that are relatively unimportant.
A more practical consideration is that there may be certain types of write operations that you don’t want to replicate. For example, there is no benefit to replicating a virtual machine’s pagefile. That being the case, you can conserve bandwidth by redirecting each virtual machine’s pagefile to a dedicated virtual hard disk and then configuring Hyper-V not to replicate that virtual hard disk.
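The disk-selection advice above can be sketched as a planning helper. It splits a VM's disks into replicated and excluded sets and totals the churn that excluding the pagefile disk avoids. It assumes the pagefile has already been moved to its own dedicated disk. Disks are modelled as hypothetical `(path, role, churn_mb_per_interval)` tuples, and the helper name is hypothetical as well.

```python
"""Plan which virtual disks to replicate (illustrative sketch).

Assumptions: each VM's pagefile already lives on a dedicated disk, and disks
are described by hypothetical (path, role, churn_mb_per_interval) tuples.
"""


def plan_replication(disks):
    """Split disks into replicated/excluded lists and total the churn avoided."""
    replicated, excluded, saved_mb = [], [], 0.0
    for path, role, churn_mb in disks:
        if role == "pagefile":
            # Pagefile writes have no value on the replica side, so skip them.
            excluded.append(path)
            saved_mb += churn_mb
        else:
            replicated.append(path)
    return replicated, excluded, saved_mb
```

For a VM whose pagefile disk produces 30 MB of churn per interval, excluding it removes those 30 MB from every transfer. None of the data you actually need is lost.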
Check the VHD characteristics of primary and replica VMs: before resync can be done, these have to match. Hyper-V Replica checks the geometry and size of the disk before starting resync. Top on the list of exceptions to watch out for are size mismatches – caused by resizing either a primary or replica VHD without appropriately resizing the other one.
Start tracking the VHDs:
The guest writes are tracked into the log file, but these changes are not replicated until resync is completed.
It is important to note that if resync takes too long then you might hit the “50% of total VHD size for a VM” condition and end up sending the VM into the “Resynchronization Required” state again.
Event number 29242 is logged that specifies the VM, VHDs, start block, and end block.
Create differencing disks for the replica VHDs: this allows the resync operation to be cancelled without leaving the underlying VHD in an inconsistent state. The differencing disk with all the resynced changes is then merged back into the VHD at the end of the resync operation.
Compare and sync the VHDs: the comparison of the VHDs is done block-by-block and only the blocks that differ are sent across the network. This can reduce the data sent over the network, depending on how different the two VHDs are. While this operation is going on:
Pause Replication will stop the current resync operation. Doing Resume Replication later will continue the resync comparisons from where it left off.
Planned failover or Test failover will not be possible.
At any point the user can always do Unplanned Failover, but this will cancel the resync operation.
Resync can be cancelled at any point. This will keep the VM in the “Resynchronization Required” state, and the next time replication is resumed, it will start from the beginning.
Completion of compare and sync: HVR logs event number 29244 once the compare and sync operation is done, and it specifies the VHD, VM, blocks sent, time taken, and result of the operation.
Merge the resync changes to the VHD: after this operation completes, the resync operation cannot be cancelled or undone.
Delete the recovery points: this is a significant side-effect of resync. The recovery points are built upon the VHD as a baseline. However, resync effectively changes that baseline and makes the data stored in those recovery points invalid. After resync completes, the recovery points are built again over a period of time.
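The compare-and-sync step above can be sketched in Python. It checks that the two VHDs have the same size, compares them block by block, and copies only the differing blocks. The sketch also models the "50% of total VHD size" condition. The assumptions are:
- Blocks are compared by SHA-256 digest; the notes say only that the comparison is block-by-block.
- `BLOCK_SIZE` is arbitrary.
- The exact threshold boundary (greater than half) is a guess.
- All names are hypothetical.

```python
"""Illustrative sketch of resync-style block comparison (not HVR's real code).

Assumptions: both VHDs are byte strings split into fixed-size blocks, blocks
are compared by SHA-256 digest (the real comparison method is not specified),
BLOCK_SIZE is arbitrary, and the 50% threshold boundary is a guess.
"""
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size


def block_digests(vhd: bytes) -> list:
    """Hash each fixed-size block of a disk image."""
    return [hashlib.sha256(vhd[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(vhd), BLOCK_SIZE)]


def resync(primary: bytes, replica: bytes) -> tuple:
    """Return (patched replica, indexes of the blocks that had to be sent)."""
    # The size (geometry) must match before resync can start.
    if len(primary) != len(replica):
        raise ValueError("VHD size mismatch: resize both sides first")
    patched = bytearray(replica)
    sent = []
    for idx, (p, r) in enumerate(zip(block_digests(primary), block_digests(replica))):
        if p != r:
            # Only differing blocks cross the network.
            start = idx * BLOCK_SIZE
            patched[start:start + BLOCK_SIZE] = primary[start:start + BLOCK_SIZE]
            sent.append(idx)
    return bytes(patched), sent


def needs_resync(pending_log_bytes: int, total_vhd_bytes: int) -> bool:
    """Model of the '50% of total VHD size' condition from the notes."""
    return pending_log_bytes > 0.5 * total_vhd_bytes
```

When only one block out of three differs, only that block's index is returned and copied. This matches the point that the amount of data sent depends on how different the two VHDs are.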
Nuances during failover
If you keep additional recovery points for your replicating VM, there are some key points to be noted:
Expanding a virtual disk that is replicating will have no impact on failover. However, the size of the disk will not be reduced if you fail over to an older point that was created before the expand operation.
Shrinking a virtual disk that is replicating will have an impact on failover. Attempting to fail over to an older point that was created before the shrink operation will result in an error.
This behavior occurs because failing over to an older point only changes the content on the disk, not the disk itself. In all cases, however, failing over to the latest point is not affected by resize operations.
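The resize rules above can be captured in one small function. Expanding keeps the current size, while a point created before a shrink is rejected. It rests on the explanation in the notes that a recovery point restores disk content, not the disk itself. The function name and the use of whole gigabytes are hypothetical.

```python
"""Model of failing over to an older recovery point after a disk resize.

Assumption: a recovery point restores only disk content, so the disk keeps
its current size (as the notes explain). Names and units are hypothetical.
"""


def failover_disk_size(current_size_gb: int, point_size_gb: int) -> int:
    """Return the disk size after failing over to a point taken at point_size_gb."""
    if point_size_gb > current_size_gb:
        # The disk was shrunk after this point: its content no longer fits.
        raise ValueError("cannot fail over to a point created before a shrink")
    # Expanded (or unchanged) disk: the content changes, the size does not.
    return current_size_gb
```

Failing over to the latest point always has `point_size_gb == current_size_gb`, so it never hits the error branch.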
At a very high level, if you have a Windows Server 2012 setup containing replicating VMs, we recommend that you use the cross-version live migration feature to migrate your replica VMs first. This is followed by fix-ups in the primary replicating VM (e.g., changing the replica server name). Once replication is back on track, you can migrate your primary VMs from a Windows Server 2012 server to a Windows Server 2012 R2 server without any VM downtime. The authorization table on the replica server may need to be updated once the primary VM migration is complete.
The above approach does not require you to re-IR your VMs, ensures zero downtime for your production VMs and gives you the flexibility to stagger the upgrade process on your replica and primary servers.
After the configuration: One-click failover button (or automated)
Great BC option, and maybe even DR, depending on your scenarios
It is asynchronous replication (you can lose data)
It does not protect you against malware, data corruption…
Automated failover can be dangerous
VMs only
Depending on the chosen option, you will need more or fewer prerequisites. Go over the exact prerequisites
EMC / NetApp / HP also
VMware: InMage!