CloudPlus360: Nutanix Metro Availability (Synchronous Replication)

The term "Synchronous Replication" means the process of copying data over a storage area network, local area network or wide area network so that there are multiple, up-to-date copies of the data.

In other words, when a write happens at the main site, a copy is sent to the DR site, and the main site waits for the DR site to confirm that the data has been written there before committing the data locally. This makes the RPO equal to "0". The round-trip latency between the two sites must be "5 ms" or less, and you need to maintain adequate bandwidth to accommodate peak writes. It is also recommended that you have a redundant physical network between the two sites.
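To make the write ordering concrete, here is a minimal Python sketch of the behaviour described above. It is only a toy model: the `Site` class and the `synchronous_write` function are hypothetical names made up for illustration and are not part of any Nutanix API.

```python
class Site:
    """Toy stand-in for a storage site (hypothetical, not a Nutanix API)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.blocks = {}

    def write(self, key, data):
        # A real site would persist to disk; here we only store in memory.
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.blocks[key] = data
        return True


def synchronous_write(main, dr, key, data):
    """Commit on the main site only after the DR site confirms the write."""
    dr.write(key, data)    # raises if the DR site cannot confirm
    main.write(key, data)  # reached only after the DR confirmation
    return True


main = Site("main")
dr = Site("dr")
synchronous_write(main, dr, "block-1", b"payload")
print(main.blocks == dr.blocks)
```

Because the main site never commits a write the DR site has not confirmed, both copies stay identical, which is what an RPO of zero means in practice. It also shows why latency matters: every write waits for a round trip to the DR site.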

Before you start, make sure that the network connectivity between the two sites is healthy, that each site is reachable from the other, and that your RTT latency is within 5 ms.
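If you want a quick sanity check before opening Prism, the following Python sketch times a TCP connection as a rough estimate of the round-trip time and compares the worst sample against the 5 ms limit. This is an assumption-laden illustration, not a Nutanix tool: the function names, the address and the port in the commented usage are placeholders, and a TCP connect time is only an approximation of the RTT that Prism itself reports later in the wizard.

```python
import socket
import time

RTT_LIMIT_MS = 5.0  # limit stated in this post


def tcp_rtt_ms(host, port, timeout=2.0):
    """Time one TCP connect as a rough RTT estimate, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0


def rtt_within_limit(samples_ms, limit_ms=RTT_LIMIT_MS):
    """Return True if the worst sample is within the limit."""
    return max(samples_ms) <= limit_ms


# Example usage (placeholder address and port for the remote cluster):
# samples = [tcp_rtt_ms("10.0.1.50", 9440) for _ in range(10)]
# print(rtt_within_limit(samples))
```

Checking the worst sample rather than the average is a deliberately strict choice here, since a single slow round trip delays the write that is waiting on it.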

When using Nutanix Metro Availability with VMware ESXi or Microsoft Hyper-V (Nutanix AHV coming soon), Nutanix handles the stretching and replication of the storage between the Main and DR sites.

From the Main site Prism interface, click the main tab, then click Data Protection.

Now we need to configure each site (Main Site and DR Site) as a Remote Site of the other site, then configure Storage Mapping.

For the steps required to finish these configurations, please go back to the previous post, "Async DR - Basic setup and Protection Domain Configuration".

From the Data Protection page, click the Metro Availability tab, then click + Protection Domain, then click Metro Availability.

Enter a name for the protection domain and click Next.

Select the storage container you need to replicate (one per protection domain) and click Next.

Select your remote site and click Next

On the Failure Handling step you have three choices (Witness, Automatic Resume and Manual). Another blog post will go through the Witness and how to configure and use it for DR failure handling.

With Failure Handling set to Automatic Resume, a site loss causes the break replication timeout to expire on the remaining cluster. Metro Availability is then disabled automatically for active protection domains, and writes resume against those containers. Containers in standby protection domains become inactive and must be manually promoted to allow VMs to restart.
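The Automatic Resume behaviour can be pictured as a small state change on each protection domain. The Python sketch below is a simplified model of what this post describes, not Nutanix code: the `ProtectionDomain` class, its fields and both functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProtectionDomain:
    """Simplified model of a Metro Availability protection domain (hypothetical)."""
    name: str
    role: str                      # "active" or "standby"
    metro_enabled: bool = True
    container_active: bool = True


def on_break_timeout_automatic_resume(pd):
    """Apply the Automatic Resume outcome on the surviving cluster."""
    if pd.role == "active":
        # Metro Availability is disabled and writes resume locally.
        pd.metro_enabled = False
        pd.container_active = True
    else:
        # Standby containers go inactive until someone promotes them.
        pd.container_active = False
    return pd


def promote(pd):
    """Manual promotion of a standby protection domain (simplified)."""
    pd.role = "active"
    pd.metro_enabled = False
    pd.container_active = True
    return pd


standby = ProtectionDomain("PD-Metro-01", role="standby")
on_break_timeout_automatic_resume(standby)
print(standby.container_active)  # False: VMs cannot restart yet
promote(standby)
print(standby.container_active)  # True: VMs can now be restarted
```

The point of the model is the asymmetry: the side holding the active protection domain keeps running on its own, while the side holding the standby one always needs a manual Promote.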

Set your schedule for the snapshots (local and remote site) and click Next.

Review all settings (you will notice the RTT latency check result here) and click Next.

The warning is about any data located in the storage container on the DR site: the container will be locked for the replication coming from the container on the Main site, and all data (if any) in this storage container on the DR site will be overwritten. Review and click Next.

Now let's take a look at how our two Nutanix clusters look at the hypervisor level. We will add all nodes (4 Main Site nodes and 4 DR Site nodes) to a single vSphere cluster under a single vCenter. In my design here, the vCenter VM itself runs on the DR site cluster.

On the Nutanix Prism side, this is how it looks on the Data Protection page: on the Main Site the protection domain shows as (Active), and on the DR Site it shows as (Standby).

Now your DR setup is up and running and your workloads are protected (all VMs stored in the protected storage container are protected). Let's do some testing.

On the Main site I will power off the Nutanix nodes to simulate a full site disaster. To do that, I will use the IPMI interface to force the nodes to power off; you need to do the same for all nodes in the cluster.

In vCenter you can see that the Main Site nodes are Not responding because they are powered off, the CVMs are disconnected, and our protected VMs (Test-VM-01 and Test-VM-02) are also disconnected.

From Prism on the DR Site, go to the Data Protection page and open the Metro Availability tab. To activate the protection domain and start the protected VMs, select the protection domain and click Promote.