The decisions you make when selecting a solution to protect your mission-critical data will determine whether your business data is safe and secure. But have you ever asked yourself what service you are protecting? Do SLAs, data availability requirements and data protection strategies all look the same once you consider the service you want to protect?
In this post, I would like to shed some light on an important decision that I sometimes see business owners overlook when they apply a Recovery Point Objective (RPO) and Recovery Time Objective (RTO) to their data backup strategy. Everyone will agree that during any data backup and recovery workshop, the terms RPO and RTO pop up every other minute; but how exactly do they apply to your backup strategy when you are protecting a service?
Before we dive into the challenges and solutions, let’s take a moment to define the “service” we mentioned in our introduction.
A service depends on servers acting as components that work tightly together to deliver a complete solution; for instance, CRM environments, SharePoint farms and the like.
Let’s consider the following pieces of infrastructure:
- An Authentication Server;
- Database server – Oracle Server;
- Windows IIS Server;
- Reporting Server;
- Linux File Server; and
- Application Server.
Since the environment above is likely to be providing a service, you cannot ignore the fact that it needs a comprehensive RTO solution, which in turn means each server needs a backup that meets an RPO. When it comes to the RPO, you have to think about data consistency across the entire environment before you can restore the service. Of course, restoring the entire environment will impact the RTO in the case of total site loss.
So, do you still think your RPO/RTO will be enough?
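To make the consistency point concrete, here is a minimal sketch. It assumes, as a simplification, that the service can only be trusted up to the oldest of the servers' latest backups. The server names, timestamps and function names are hypothetical and only there for illustration.

```python
from datetime import datetime, timedelta


def service_recovery_point(latest_backup):
    """Return the point in time the whole service can be rolled back to.

    `latest_backup` maps a server name to the time its latest backup
    completed. Assumption: the servers must be consistent with each
    other, so the oldest of those backups sets the limit.
    """
    return min(latest_backup.values())


def service_data_loss(latest_backup, failure_time):
    """Return how much data the service loses if it fails at `failure_time`."""
    return failure_time - service_recovery_point(latest_backup)


# Hypothetical backup times for three of the servers in the list above.
latest_backup = {
    "oracle": datetime(2024, 1, 1, 11, 45),
    "application": datetime(2024, 1, 1, 11, 30),
    "linux_file": datetime(2024, 1, 1, 11, 0),
}
failure = datetime(2024, 1, 1, 12, 0)

print(service_data_loss(latest_backup, failure))  # 1:00:00
```

In this made-up example the Oracle backup is only fifteen minutes old, yet the service as a whole loses an hour of data, because the file server backup is the oldest one.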
In a discussion with one of my business prospects, I was surprised to find how ready they were to answer all my questions about their infrastructure: their servers, networking, performance, sizing, etc. My real surprise came when they gave me an RTO for each of the servers in the infrastructure I listed earlier. For example, the RTO for the Linux file server was an hour; for the web and application server, fifteen minutes, and no more. The Oracle server RTO was just three minutes, and not a moment longer.
Excellent work, Mr. Customer, and thank you for all the detailed information. Now, can you please help me understand why you want all of these specific RTO requirements; or, in other words, what exactly is this environment all about?
From these RTO requirements listed by the customer, it was very clear that the customer had missed the importance of treating his environment as a service, and instead, was looking at the functions of the individual servers.
Yes, it is true that by applying these specific RTO requirements he will be able to protect the servers, and will most likely bring the service back up and running; but that will not guarantee that he can restore his “service” within the time he specified in his backup strategy.
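A rough sketch of why per-server RTOs do not add up to a service RTO follows. The restore times are the customer's figures quoted above, with the web and application servers treated as two machines of fifteen minutes each. The dependency map is purely my assumption for illustration, as the customer's real dependencies are not given here. The sketch also assumes the dependencies contain no cycles.

```python
def service_rto(restore_minutes, depends_on):
    """Return the minutes until every server in the service is back.

    Servers are restored in parallel, but a server's restore is assumed
    to start only once all the servers it depends on are running.
    """
    finish = {}

    def finish_time(server):
        # Cache each result so shared dependencies are computed once.
        if server not in finish:
            waits = [finish_time(dep) for dep in depends_on.get(server, [])]
            finish[server] = max(waits, default=0) + restore_minutes[server]
        return finish[server]

    return max(finish_time(server) for server in restore_minutes)


# Per-server RTOs from the customer, in minutes.
restore_minutes = {"linux_file": 60, "web": 15, "application": 15, "oracle": 3}

# Hypothetical dependencies: the application needs the database and the
# file server, and the web tier needs the application.
depends_on = {"application": ["oracle", "linux_file"], "web": ["application"]}

print(service_rto(restore_minutes, depends_on))  # 90
```

Under these assumptions the service takes ninety minutes to return, longer than any single server's RTO, even though every server met its own target.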
Examining the Restore/RTO Process
The RTO is the maximum time after a failure or disaster that a computer, system, network or application can remain down before the business is adversely affected. The RTO is a function of the extent to which the interruption disrupts normal operations, and this includes the revenue lost per unit of time as a result of the disaster. The time lost also depends on the setup of the equipment and the installed application(s). An RTO can be measured in seconds, minutes, hours or even days, and must be considered in disaster recovery planning.
When a disaster strikes, data recovery will not be initiated immediately. There will be several stages of the recovery process that you, as an RPO/RTO planner, must know. I have summarised the three restore stages as follows:
The first stage of the data restoration process starts immediately after the disaster has struck. Usually, the first notification of a server or service disruption will come from the end user; or, if you are lucky, and your company is using a monitoring application, you will be notified of the disruption immediately after it occurs. At this stage, you also need to consider these points:
- Notification time;
- Where you stored your backup;
- Time to find the latest/desired backup; and
- Time to retrieve.
The restore can begin after you have retrieved the desired backup media and nominated the recovery point. It will obviously take some time, and how long depends on the technology you have used to store your backup; for example, if you are backing up to tape, add several hours to your recovery time. There are several backup media options you can consider independently, or in conjunction:
- Data stored on Disk/Tape;
- Cloud recovery/Bandwidth; and
- Cold/warm/hot Disaster Recovery.
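For the cloud recovery/bandwidth option, a back-of-the-envelope estimate is often enough to show whether an RTO is realistic. The sketch below assumes decimal units (1 GB = 8,000 megabits) and a link that sustains its full rated throughput, which real links rarely do. The data size and link speeds are hypothetical.

```python
def transfer_hours(size_gb, throughput_mbps):
    """Return the hours needed to move `size_gb` gigabytes over a link
    that sustains `throughput_mbps` megabits per second."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    seconds = megabits / throughput_mbps
    return seconds / 3600


# Hypothetical example: a 500 GB backup pulled back over two link speeds.
print(round(transfer_hours(500, 100), 1))   # 11.1 hours at 100 Mbps
print(round(transfer_hours(500, 1000), 1))  # 1.1 hours at 1 Gbps
```

If the RTO is fifteen minutes, neither of these hypothetical links would get the data back in time, whatever else goes right.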
The last stage of the restoration process is the recovery. The recovery includes any steps you need to perform, in addition to restoring the backed-up data, to make it available to the end user. These additional steps include service starts, data adjustments, log replays, and so on. Also consider:
- Manual steps needed to be performed after the restoration;
- Services to be started; and
- Service consistency; such as restoring all the servers to full service before starting the applications.
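The three stages above can be added up and compared against the RTO. This is only a sketch: the field names follow the bullet points in this post, and every duration is a made-up figure, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class RestorePlan:
    """Estimated minutes for each step of a restore (all hypothetical)."""
    notification: float    # stage 1: time until someone knows about the failure
    locate_backup: float   # stage 1: time to find the latest/desired backup
    retrieve_media: float  # stage 1: time to retrieve the backup
    restore_data: float    # stage 2: restoring data from disk, tape or cloud
    post_restore: float    # stage 3: manual steps, service starts, consistency

    def total_minutes(self):
        return (self.notification + self.locate_backup + self.retrieve_media
                + self.restore_data + self.post_restore)


def meets_rto(plan, rto_minutes):
    """Return True if the whole plan fits inside the RTO."""
    return plan.total_minutes() <= rto_minutes


plan = RestorePlan(notification=5, locate_backup=10, retrieve_media=20,
                   restore_data=30, post_restore=10)

print(plan.total_minutes())  # 75
print(meets_rto(plan, 3))    # False
print(meets_rto(plan, 90))   # True
```

With these illustrative numbers, the notification time alone already exceeds a three-minute RTO like the one the customer quoted for the Oracle server.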
Your backup routines are often driven by things you might not consider important as single issues in themselves; but together, they will eventually drive your backup strategy.
Consider the many IT functions that make up your daily business activities: Service Level Agreements, data availability, data accessibility, RPO/RTOs, budget constraints, and changing program and project demands. These functions, and more, will drive your backup strategy, and none is less important than the others. Always choose reasonable and achievable RPO/RTO times for your service site.