Taking IT Disaster Recovery Planning to the Next Level of Competence
31 August 2011
Attention all IT Directors
Do you have a documented disaster recovery plan designed to restore IT systems and services, telephony, and data and voice communications within the recovery times required to support continuity of the organization? Yes? Congratulations! But is there anything you can do now to improve your recovery capability and better serve your user community, or are you already doing all you can? The answer is yes, there is more you can do. Teed's IT DR specialist, Brian Davey, gives his top 10 actions to help you improve the current situation.
In too many organizations the traditional IT disaster recovery plan is seen as the “be all and end all” with testing of the plan not always as robust or regular as it should be. The traditional focus has been on ensuring that IT systems and services can be restored as quickly as budget will allow should some disruptive event result in loss of use of the primary service, critical hosting equipment or indeed the data center itself.
So how can you improve the current situation? Here are 10 actions to take.
1. Regularly Test Your Recovery Capability
Well the first, obvious, action has to be testing the current recovery capability to ensure that it would indeed work in a crisis and, what’s more, would recover all affected business critical systems and services within the recovery time needed.
Incidentally, if you cannot get senior management agreement to test the failover of services because of the disruption (real or perceived) this could cause to the business, then make sure the management team is made fully aware of (and formally signs off on) the risk being taken: systems and services may not be recoverable as quickly as needed. If we can’t test the failover, I would contend that we can have no more than 25% confidence that it will work in practice within the time we need it to. That could lead to a serious business disruption.
2. Engage With Your Business Continuity Owner
Make sure that you keep in regular contact with the person in your organization who owns business continuity management (BCM) in order to understand the business needs and expectations in relation to recovery of IT systems and services. Oh dear, are you the business continuity owner? Well best practice says that BCM should not be owned by the IT Department as you already have your hands full addressing IT disaster recovery, so try and move this responsibility elsewhere within the organization.
3. Focus on Data Recovery
A frequent misunderstanding among IT service users relates to the loss of recent data as a result of having to revert to offsite backups: anything created or changed since the last copy went off site is lost. I regularly have to explain this concept to my clients and it usually alarms them when they consider the consequences of recent data loss and how difficult it may be to recreate or work around the lost data. It is essential that your users understand this issue and are happy with the recent data loss exposure resulting from your data backup regime. Incidentally, you do need to ensure that you have offsite copies of data, whether through disk-to-disk mirroring, sending tapes off site or some other means, as destructive incidents do happen and you need to make sure that data can be recovered. Don’t fall into the trap of having all copies of data kept on site, even where you utilise a fireproof safe. A destructive incident can destroy a safe and, even if it survives, how long will it be before you can get access to it when it is lying at the bottom of a pile of rubble and the building around it is in danger of collapsing?
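The recent data loss exposure can be put into numbers for users. The sketch below is illustrative only: the function name and the figures (a nightly backup, with tapes reaching the offsite store a day later) are assumptions, not taken from any particular backup regime. It relies on the observation that, just before the next copy arrives off site, the newest offsite copy is one backup interval plus one transfer delay old.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         offsite_delay: timedelta) -> timedelta:
    """Worst-case age of the newest offsite copy when the primary site is lost.

    backup_interval: time between successive backups (assumed regular).
    offsite_delay:   time for a finished backup to reach the offsite location.
    """
    return backup_interval + offsite_delay


# Hypothetical regime: nightly backup, tapes collected the following day.
exposure = worst_case_data_loss(timedelta(hours=24), timedelta(hours=24))
print(f"Worst-case recent data loss: {exposure.total_seconds() / 3600:.0f} hours")
```

With disk-to-disk mirroring the transfer delay shrinks towards zero, which is why mirroring reduces the exposure so sharply compared with tape collection.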
4. Make Sure Users Understand What Your Recovery Capability Is
As a practising consultant I regularly find major differences between the actual recovery capability for systems and services and what users think is the case. It is essential that there is no such misconception in your organization. The simple way to achieve this is to publish the recovery time objective and recovery point objective for each IT system or service, including telephony and data communications. Ideally have senior managers sign off this “IT statement of recovery” to reflect the fact that they are aware of the recovery capability and, just as important, they accept that it either meets their needs or they accept the residual risk and post incident consequences if it doesn’t.
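One possible shape for such a statement is sketched below. The service names, hour figures and field names are all hypothetical; the point is only to show required and actual recovery time objective (RTO) and recovery point objective (RPO) side by side so that any gap is visible to whoever signs it off.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryEntry:
    service: str
    required_rto_hours: float  # what the business says it needs
    actual_rto_hours: float    # what IT can currently deliver
    required_rpo_hours: float
    actual_rpo_hours: float

    def gaps(self) -> List[str]:
        """Return which objectives the current capability fails to meet."""
        found = []
        if self.actual_rto_hours > self.required_rto_hours:
            found.append("RTO")
        if self.actual_rpo_hours > self.required_rpo_hours:
            found.append("RPO")
        return found


# Hypothetical entries for illustration only.
statement = [
    RecoveryEntry("Email", 4, 4, 1, 1),
    RecoveryEntry("Payroll", 24, 48, 24, 24),
    RecoveryEntry("Telephony", 2, 8, 0, 4),
]

for entry in statement:
    gaps = entry.gaps()
    status = "meets need" if not gaps else "GAP in " + " and ".join(gaps)
    print(f"{entry.service:<10} RTO {entry.actual_rto_hours}h "
          f"RPO {entry.actual_rpo_hours}h  {status}")
```

Entries flagged with a gap are the ones where senior managers either fund an improvement or formally accept the residual risk.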
5. Develop a Continuity Plan for IT People
If you need to recover IT systems and services following a disruptive event, what do you do with your people? Who goes to the designated recovery site? Who should work from home or other agreed locations? What should they do when they get there? Where does the IT Help Desk operate from? Who acts as standby for the primary responders? How soon do your projects need to resume and what about IT support people? Without a continuity plan for your IT people, confusion and delay will inevitably result. So to complement your IT disaster recovery plan, develop a people continuity plan.
6. Document Your Services
As a minimum, every component (hardware and software) used to deliver the IT systems and services end to end should be documented, showing which systems or services it supports. This information can be captured in a simple spreadsheet, and it is invaluable in understanding the impact on systems and services if any component should fail or otherwise become unavailable. It also shows you where single points of failure exist, so that you can consider how to eliminate or reduce the risk associated with each one failing or being put out of use.
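The same spreadsheet rows can be queried with a few lines of code. This is a minimal sketch under stated assumptions: the component and service names are invented, and the "role" column (what the component does for the service) is an addition beyond the basic component-to-service mapping, included here so that a role filled by only one component can be flagged as a single point of failure.

```python
from collections import defaultdict

# Hypothetical spreadsheet rows: (component, role, service supported).
INVENTORY = [
    ("sql-01", "database", "Payroll"),
    ("sql-01", "database", "CRM"),
    ("web-01", "web server", "CRM"),
    ("web-02", "web server", "CRM"),
    ("app-01", "app server", "Payroll"),
]


def services_affected(inventory, component):
    """List the services that depend on the given component."""
    return sorted({svc for comp, _, svc in inventory if comp == component})


def single_points_of_failure(inventory):
    """Return (service, role, component) where only one component fills a role."""
    providers = defaultdict(set)
    for comp, role, svc in inventory:
        providers[(svc, role)].add(comp)
    return sorted(
        (svc, role, next(iter(comps)))
        for (svc, role), comps in providers.items()
        if len(comps) == 1
    )


print(services_affected(INVENTORY, "sql-01"))
for svc, role, comp in single_points_of_failure(INVENTORY):
    print(f"{svc}: only one {role} ({comp})")
```

In this made-up inventory the CRM web tier has two servers and is not flagged, while the shared database server is flagged for both services.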
7. Conduct an IT Risk Assessment
In order to have a service which is resilient to failure, you need to understand the threats to provision of the service and mitigate the risk of those threats materialising. This is achieved by conducting an IT risk assessment with focus given to the most business critical IT systems and services and the threats which could lead to disruption.
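One common way to prioritise the threats found in such an assessment is to score each for likelihood and impact and rank by the product. The article does not prescribe a scoring method, so the 1 to 5 scales, the threat list and the scores below are all assumptions chosen purely for illustration.

```python
# Hypothetical threats: (description, likelihood 1-5, impact 1-5).
THREATS = [
    ("Data center power failure", 2, 5),
    ("Ransomware on file servers", 3, 4),
    ("Single WAN link outage", 4, 4),
    ("Loss of key IT staff", 2, 3),
]


def rank_risks(threats):
    """Score each threat as likelihood x impact and sort highest first."""
    scored = [(name, likelihood * impact) for name, likelihood, impact in threats]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, score in rank_risks(THREATS):
    print(f"{score:>2}  {name}")
```

The highest-scoring threats against the most business critical services are the natural starting point for mitigation.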
8. Run Tabletop Exercises
Tabletop exercises, in which those responsible for responding to an adverse situation are gathered together and asked to respond, in theory, to a specific event scenario, are an excellent way of taking people through the thought process involved. They let you rehearse post-incident roles and responsibilities, validate response plans and recovery strategies, and generally raise awareness of what to do if disaster strikes.
9. Train Your Staff
Ensure that your IT department employees are aware of what to do and where to go if a disruptive event should occur. Also include awareness of relevant standards and guidance, such as BS 25777, ISO/IEC 27031 and ITIL IT Service Continuity Management (ITSCM). Consider formal training for anyone who has responsibility for continuity of services.
10. Document Recovery Procedures
The technical procedures which describe in detail how to recover IT systems and services, for example how to rebuild a specific server, must be written up at a level of detail that would allow any technically competent person to execute them, in case the procedure’s author is unavailable when an event occurs. Recovery procedures are best tested by someone other than the author, as this validates the procedure’s contents.
Taken together, the 10 actions above can not only significantly improve your IT recovery capability but can help prevent the bad things from happening in the first place or at least reduce their resultant impact. If something bad should happen, your senior management team will very much appreciate your preparation in this area as you will help to keep the organization alive! Go on, be a hero.