It's an infrastructure manager's worst nightmare: The computer room is down. There are several events that can make this scenario a reality. A hurricane knocks out power for several days. Building management disrupts power for scheduled maintenance. Construction workers sever an underground power line.
Small and medium-sized organizations may not have adequate UPS and generator systems. In that case, it is only a matter of time before power is disrupted and the computer room must be shut down. Enterprise-class computer rooms with an absolute requirement for 24/7 uptime may still be disrupted by an emergency. For example, a liquid spill such as sprinkler discharge or a glycol leak from a broken AC pipe may necessitate an emergency shutdown.
Every six months, completely power down the computer room to prepare for the inevitable. Hold a meeting to plan the exercise. The goal is to stop and start mission-critical systems as quickly and efficiently as possible. Assign tasks to each team member. The meeting is an ideal time to brainstorm.
Start by documenting the order in which systems will be shut down. Address all critical systems. In addition to servers, this includes networking gear, telecommunications equipment, UPS and AC units. Halt systems in order of criticality. This helps minimize damage in the event that UPS or generator systems fail. Shut down early the systems that are prone to data loss or painful to restore. Also consider dependencies between systems. For example, the infrastructure supporting a three-tiered application must be shut down tier by tier, in order. In most organizations the development systems will be last on the list. Carefully consider the order for starting systems as well. The start order will be slightly different, as there is no concern for failing power systems.
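The ordering rules above (criticality first, but never stopping a system before the systems that rely on it) can be sketched as a small dependency sort. This is only an illustration: the inventory, the system names and the criticality numbers are hypothetical, and a real plan would come from the team's own documentation.

```python
import heapq

# Hypothetical inventory: name -> (criticality, systems it depends on).
# A lower criticality number means "halt earlier when dependencies allow".
SYSTEMS = {
    "web": (2, ["app"]),
    "app": (2, ["db"]),
    "db": (1, ["san"]),
    "san": (1, []),
    "dev": (9, []),
}

def shutdown_order(systems):
    """Return a halt order in which every system stops before the
    systems it depends on, preferring more critical systems first."""
    # Count, for each system, how many still-running systems rely on it.
    blockers = {name: 0 for name in systems}
    for _, deps in systems.values():
        for dep in deps:
            blockers[dep] += 1

    # Systems nothing relies on can be halted right away.
    ready = [(systems[n][0], n) for n, count in blockers.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)  # most critical ready system
        order.append(name)
        for dep in systems[name][1]:
            blockers[dep] -= 1
            if blockers[dep] == 0:
                heapq.heappush(ready, (systems[dep][0], dep))

    if len(order) != len(systems):
        raise ValueError("dependency cycle detected; review the inventory")
    return order

if __name__ == "__main__":
    print(shutdown_order(SYSTEMS))
```

With this sample inventory the three application tiers are halted front to back and the development system comes last. The start order is not simply this list reversed, so it deserves its own documented list.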
Create an operations (OPS) guide for each system. Each OPS guide should be a single point of reference. Detail stop/start procedures, where the system is located and how to confirm it is providing services (versus merely running from an operating system perspective). Keep in mind that the guide may be used by a technologist who has little or no experience with the system. Include a revision date at the bottom of each page.
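As one possible way to document "providing services" rather than "merely running", an OPS guide could include a small check that connects to the port the service listens on. The sketch below is an assumption about how such a check might look; the function name, host and port are hypothetical, and a successful TCP connection only shows that something is listening. An application-level test (a login, a query, a page fetch) is stronger where one is available.

```python
import socket

def service_is_answering(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False

# Example use from an OPS guide (hypothetical host and port):
#   service_is_answering("mail01.example.com", 25)
```

A host that answers ping but fails this check is up from an operating system perspective and down from the user's.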
Policies and procedures should ensure current administrative passwords are available and appropriately safeguarded. Maintain a recall roster so that the infrastructure team can be contacted in the event of an emergency.
Label systems and racks for easy identification (front and back). If a keyboard, video, mouse (KVM) device is in use, label it with the systems it is connected to and the key sequence required to switch between them.
Hardware may fail to restart once powered down. Ensure tech support contracts are current and support phone numbers are documented. Current backups and installation media must also be on hand at the time of the exercise.
Consider whether the computer room UPS system can handle the current load. Have new systems been added in the past six months? It might make sense to have a UPS technician on-site to test UPS capacity and system health.
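A rough first pass at the load question can be done on paper or with a few lines of code before the technician arrives. The sketch below is illustrative only: the wattage figures are invented, the 80 percent headroom threshold is an assumption rather than a recommendation from this article, and it assumes the UPS capacity is expressed in watts (many units are rated in VA, which is not the same number).

```python
def ups_load_report(loads_watts, capacity_watts, max_fraction=0.8):
    """Sum the documented loads and compare them to UPS capacity.

    loads_watts: mapping of system or rack name -> measured draw in watts.
    max_fraction: assumed headroom threshold (hypothetical default).
    """
    if capacity_watts <= 0:
        raise ValueError("capacity_watts must be positive")
    total = sum(loads_watts.values())
    fraction = total / capacity_watts
    return {
        "total_watts": total,
        "load_fraction": fraction,
        "within_headroom": fraction <= max_fraction,
    }

if __name__ == "__main__":
    # Hypothetical measurements taken six months ago.
    loads = {"rack1": 1800, "rack2": 2200, "network": 400}
    print(ups_load_report(loads, capacity_watts=6000))

    # A server added since the last exercise changes the picture.
    loads["new_server"] = 900
    print(ups_load_report(loads, capacity_watts=6000))
```

In this invented example the original load sits at roughly 73 percent of capacity, and the added server pushes it to roughly 88 percent, past the assumed threshold. That is the kind of drift the six-month review is meant to catch.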
Print the OPS guides and staple them separately. Separate guides enable personnel to work without sharing documentation. Upon completion of a task, they can return to the team lead to address any remaining systems. This also helps track progress and makes efficient use of resources.
Meet again before the exercise and conduct a dry run. Take note of any issues and fine-tune the documentation.
A senior team member should direct and monitor the progress of the exercise. Coordinate and reassign resources as they become available. Make use of available personnel and system keyboards. Take note of elapsed time, discrepancies in documentation and issues as they arise.
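Pen and paper work fine for timekeeping, but if the team lead has a laptop on hand, elapsed times and discrepancies can also be captured with a few lines of code. The class and field names below are hypothetical, and the sketch makes one simplifying assumption: the total is the sum of individual task times, which overstates wall-clock time when several people work in parallel.

```python
import time

class ExerciseLog:
    """Record how long each system takes to stop or start, plus notes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the log can be tested
        self._started = {}    # (system, phase) -> start timestamp
        self.entries = []     # (system, phase, seconds, note)

    def begin(self, system, phase):
        self._started[(system, phase)] = self._clock()

    def finish(self, system, phase, note=""):
        elapsed = self._clock() - self._started.pop((system, phase))
        self.entries.append((system, phase, elapsed, note))
        return elapsed

    def total(self, phase):
        # Sum of per-task durations, not wall-clock time.
        return sum(sec for _, p, sec, _ in self.entries if p == phase)

# Example use during the exercise (hypothetical system name):
#   log = ExerciseLog()
#   log.begin("db", "shutdown")
#   ...perform the documented procedure...
#   log.finish("db", "shutdown", note="step 4 of the guide is out of date")
```

The notes column is where documentation discrepancies belong, so that they are corrected in the OPS guides after the exercise.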
Document functionality testing to ensure that once systems are powered up they are providing the services required. Turn off internal monitoring systems as late as possible. A shutdown exercise is the perfect opportunity to test monitoring. Document notification from external monitoring services as well.
At the conclusion of the exercise, the time required to shut down and restart the enterprise systems will be known. The preparation required keeps documentation current. The exercise itself provides valuable on-the-job training. This continuity helps eliminate single points of failure.
Provide senior management with a formal report detailing the results of the exercise. Powering down the computer room is one of the first steps in taking ownership of the organization's infrastructure. In my experience, many issues surface during this exercise. It is better to learn about them during a maintenance window than to have them complicate an emergency.
Copyright © 2005 CyberGuard Corporation All Rights Reserved.
Reprinted with Permission