Disaster recovery (DR) has been challenging from the start, and it certainly isn’t getting any easier. Backup to disk has simplified some aspects of DR, while virtualization helps in some ways and complicates matters in others.
Large systems running mission-critical workloads present a particularly difficult and costly DR challenge. Companies that need to meet very short recovery point objective (RPO) and recovery time objective (RTO) requirements, measured in seconds, have typically had to invest in pairs of systems set up as mirrors and kept in step through synchronous replication. The approach works, but it is costly, and synchronous replication imposes distance constraints.
For mainframes, the Geographically Dispersed Parallel Sysplex (GDPS) has been IBM’s primary DR vehicle. A recent IBM announcement expanded on the GDPS options primarily by adding remote asynchronous replication to greatly extend the distance between the paired systems.
DR at this level revolves around system clustering technology. You set up two systems, one as a mirror of the other, and update the data synchronously or asynchronously. When the primary system fails, you bring up the other and resume working as before. How you define your RPO and RTO determines how quickly you can resume operations following a failure and with how much data lag or loss.
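To make the two objectives concrete, here is a minimal sketch of how the RPO and RTO actually achieved in a failover could be measured from three timestamps. The `FailoverEvent` record, its field names, and the sample times are hypothetical and chosen only for illustration; they do not come from any GDPS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailoverEvent:
    """Hypothetical record of one failover, for illustration only."""
    last_replicated_write: datetime  # last update safely applied on the mirror
    failure_time: datetime           # moment the primary stopped serving
    service_restored: datetime       # moment the mirror began serving


def achieved_rpo(event: FailoverEvent) -> timedelta:
    """Data lag or loss: updates made after the last replicated write."""
    return event.failure_time - event.last_replicated_write


def achieved_rto(event: FailoverEvent) -> timedelta:
    """Downtime: how long it took to resume operations after the failure."""
    return event.service_restored - event.failure_time


def meets_targets(event: FailoverEvent,
                  rpo_target: timedelta,
                  rto_target: timedelta) -> bool:
    """True only if both objectives were met."""
    return (achieved_rpo(event) <= rpo_target
            and achieved_rto(event) <= rto_target)


# Made-up example: the mirror was 5 seconds behind when the primary failed,
# and restarting the workload at the recovery site took 45 minutes.
event = FailoverEvent(
    last_replicated_write=datetime(2024, 1, 1, 11, 59, 55),
    failure_time=datetime(2024, 1, 1, 12, 0, 0),
    service_restored=datetime(2024, 1, 1, 12, 45, 0),
)
print("RPO achieved:", achieved_rpo(event))
print("RTO achieved:", achieved_rto(event))
print("Meets 10 s RPO / 30 s RTO:",
      meets_targets(event, timedelta(seconds=10), timedelta(seconds=30)))
```

In this made-up case the data lag is small enough for a seconds-level RPO, but the restart time misses a seconds-level RTO by a wide margin.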
Until now, synchronous replication was the way to hit the tightest RPO and RTO. It entails distance constraints, however, that make it inappropriate for many organizations. It’s also quite expensive.
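The distance constraint follows from simple arithmetic: a synchronous write is not complete until the remote site acknowledges it, so every write pays at least one round trip between the sites. The sketch below is a rough estimate only. It assumes light travels about 200 km per millisecond in optical fiber (roughly two-thirds of its speed in a vacuum) and ignores switching, protocol, and storage overhead. The function name and the sample distances are illustrative.

```python
# Assumption: light in optical fiber covers roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Minimum latency added to each synchronous write by distance alone.

    Each round trip covers the distance twice (out and back). Real links
    add switching and protocol overhead on top of this lower bound.
    """
    return round_trips * 2 * distance_km / FIBER_KM_PER_MS


for km in (10, 100, 1000, 5000):
    print(f"{km:>5} km: at least {sync_write_penalty_ms(km):.1f} ms per write")
```

Under these assumptions, a mirror 100 km away adds at least a millisecond to every write, and one 1,000 km away adds at least 10 ms. Asynchronous replication avoids the penalty by acknowledging the write locally and shipping the update afterward, at the cost of some data lag.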
Asynchronous replication is not bound by those distance constraints. IBM offers GDPS/XRC and GDPS/GM, both based on asynchronous disk replication over unlimited distance. The current GDPS asynchronous replication products, however, require the failed site’s workload to be restarted at the recovery site, which typically takes 30 to 60 minutes. That will not satisfy organizations that require an RTO of seconds.
In its latest announcement, IBM presents GDPS active/active continuous availability as the next generation of GDPS. This represents a shift from the failover model, in which systems go down and can be brought online at the failover site in a few hours, to a near-continuous availability model, in which the system can be brought back online in an hour or less. IBM describes the latest enhancements as combining the best attributes of the existing suite of GDPS services and expanding them to allow unlimited distances between data center sites, with the RTO measured in minutes. With its new GDPS offerings, IBM promises near-continuous availability, meaning it can meet an RTO of tens of seconds.
Non-mainframe shops generally follow similar DR strategies using mirrored pairs of servers, monitoring and sensing software to detect a system failure, and switchover software. To hit the tightest RTO, you will set up your cluster as an active/active pair.
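The monitoring-and-switchover pattern can be sketched as a heartbeat monitor: the standby expects a periodic signal from the primary and triggers switchover once the signal has been missing for longer than a timeout. This is a simplified illustration, not the behavior of any particular clustering product. The class name, the timeout value, and the injected clock are assumptions made so the example can run without waiting in real time.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical failure detector that triggers switchover exactly once."""

    def __init__(self, timeout_s: float, on_failover: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_failover = on_failover
        self.clock = clock
        self.last_beat = clock()
        self.failed_over = False

    def beat(self) -> None:
        """Record a heartbeat received from the primary."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Trigger switchover if the primary has been silent too long."""
        silent_for = self.clock() - self.last_beat
        if not self.failed_over and silent_for > self.timeout_s:
            self.failed_over = True
            self.on_failover()
        return self.failed_over


# Simulated run with a fake clock so nothing actually sleeps.
now = [0.0]
events = []
monitor = HeartbeatMonitor(timeout_s=3.0,
                           on_failover=lambda: events.append("switchover"),
                           clock=lambda: now[0])

now[0] = 2.0
monitor.beat()                 # primary is alive at t=2
now[0] = 4.0
still_up = monitor.check()     # silent for 2 s, within the timeout
now[0] = 6.0
switched = monitor.check()     # silent for 4 s, switchover fires
print(still_up, switched, events)
```

The timeout sets a floor on the RTO, since the standby cannot react before it has decided the primary is gone. A shorter timeout detects failures faster but raises the risk of switching over on a brief network glitch.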
Of course, not every organization needs a fast RTO. One that doesn’t can dispense with mirrored systems altogether and rely on traditional tape backup and recovery to a standby site.
The concern with RTO usually focuses on the organization’s primary transaction production systems. But with the cloud, organizations might begin to rethink what they deem mission-critical and how it should be backed up. Maybe they don’t have to think about mirrored system clusters at all. Maybe the mission-critical systems to be protected aren’t even production transaction systems.