Computer Room Emergency - Only a Matter of Time
Gideon T. Rasmussen - CISSP, CISM, CFSO, SCSA
It's an infrastructure manager's worst nightmare: The computer
room is down. There are several events that can make this
scenario a reality. A hurricane knocks out power for several
days. Building management disrupts power for scheduled maintenance.
Construction workers sever an underground power line.
Small and medium-sized organizations may not have adequate UPS
and generator systems. In that case, it is only a matter
of time before power is disrupted and the computer room
must be shut down. Enterprise-class computer rooms with
an absolute requirement for 24/7 uptime may still be disrupted
by an emergency. For example, a liquid spill such as a sprinkler
discharge or a glycol leak from a broken AC pipe may necessitate
an emergency shutdown.
Preparation
Every six months, completely power down the computer room to prepare
for the inevitable. Hold a meeting to plan the exercise.
The goal is to stop and start mission-critical systems as
quickly and efficiently as possible. Assign tasks to each team
member. The meeting is also an ideal time to brainstorm.
Start by documenting the order in which systems will be shut down.
Address all critical systems. In addition to servers, this
includes networking gear, telecommunications equipment,
UPS and AC units. Halt systems in order of criticality.
This helps minimize damage in the event that UPS or generator
systems fail. Early on, shut down systems that are prone to
data loss or are painful to restore. Also consider dependencies
between systems. For example, the infrastructure supporting
a three-tiered application must be shut down in a defined
order, so that no tier loses a service it still depends on. In
most organizations the development systems will be last
on the list. Carefully consider the order for starting systems
as well. The start order will be slightly different, as
there is no concern for failing power systems.
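The ordering described above can be worked out on paper, but a short script makes the dependencies explicit. The following is a minimal sketch in Python, not a prescribed tool. It assumes a hypothetical inventory in which each system lists the systems it depends on and a priority number (lower means "handle sooner"). The system names, the priority values and the function names are all illustrative assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory. "depends_on" lists what a system needs in
# order to work; "priority" breaks ties (lower number = handle sooner).
SYSTEMS = {
    "web-frontend": {"depends_on": ["app-server"], "priority": 2},
    "app-server": {"depends_on": ["database"], "priority": 2},
    "database": {"depends_on": ["core-switch"], "priority": 1},
    "dev-server": {"depends_on": ["core-switch"], "priority": 9},
    "core-switch": {"depends_on": [], "priority": 5},
}


def _plan(systems, stopping):
    """Order systems so that dependency rules always win and the
    priority number only decides between systems that are both ready."""
    sorter = TopologicalSorter()
    for name, info in systems.items():
        sorter.add(name)
        for dependency in info["depends_on"]:
            if stopping:
                # A dependent must be stopped before what it relies on.
                sorter.add(dependency, name)
            else:
                # A dependency must be started before its dependents.
                sorter.add(name, dependency)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    ready, order = [], []
    while sorter.is_active():
        ready.extend(sorter.get_ready())
        ready.sort(key=lambda n: (systems[n]["priority"], n))
        current = ready.pop(0)
        order.append(current)
        sorter.done(current)
    return order


def shutdown_order(systems):
    return _plan(systems, stopping=True)


def start_order(systems):
    return _plan(systems, stopping=False)


if __name__ == "__main__":
    print("Stop: ", " -> ".join(shutdown_order(SYSTEMS)))
    print("Start:", " -> ".join(start_order(SYSTEMS)))
```

With this sample inventory the development server is stopped after the production tiers and started last, and the two lists are not simple mirror images of each other. The output is only a starting point for the planning meeting; the team still has to review it against what it knows about each system.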
Create an operations (OPS) guide for each system. Each OPS guide should
be a single point of reference. Detail stop/start procedures,
where the system is located and how to confirm it is providing
services (versus merely running from an operating system
perspective). Keep in mind that the guide may be used by
a technologist who has little or no experience with the
system. Include a revision date at the bottom of each page.
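One way to capture the "providing services" check in an OPS guide is a small script that anyone can run after a restart. The sketch below is written in Python and makes several assumptions: the host names and ports are hypothetical placeholders, and a successful TCP connection is treated as the test. A connection only shows that something is listening on the port, so an application-level test (loading a page, sending a test message) remains the stronger confirmation.

```python
import socket


def service_is_answering(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    the timeout, False if it is refused, unreachable or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical checklist copied from an OPS guide: (label, host, port).
CHECKS = [
    ("intranet web server", "intranet.example.com", 80),
    ("mail relay", "mail.example.com", 25),
]


def run_checks(checks):
    """Return a {label: True/False} result for every checklist entry."""
    return {
        label: service_is_answering(host, port)
        for label, host, port in checks
    }
```

Calling `run_checks(CHECKS)` returns one pass/fail result per entry, which can be pasted into the exercise notes next to the time each system was restarted.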
Policies and procedures should ensure current administrative passwords
are available and appropriately safeguarded. Maintain a
recall roster so that the infrastructure team can be contacted
in the event of an emergency.
Label systems and racks for easy identification (front and back).
If a keyboard, video and mouse (KVM) switch is in use, label
it with the systems it is connected to and the key sequence
required to switch between them.
Hardware may fail to restart once it has been powered down. Ensure
tech support contracts are current and support phone numbers are
documented. Current backups and installation media must also be
on hand at the time of the exercise.
Consider whether the computer room UPS system can handle the current
load. Have new systems been added in the past six months?
It might make sense to have a UPS technician on site to
test UPS capacity and system health.
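A rough load check can be done before the technician arrives. The Python sketch below adds up per-system power draw and compares it to the UPS rating. Every number in it is a hypothetical placeholder, and the 80 percent headroom limit is an assumed planning threshold, not a vendor figure. Use measured loads and the rating from the UPS documentation instead.

```python
# Hypothetical figures for illustration only.
UPS_CAPACITY_WATTS = 3000
HEADROOM_LIMIT = 0.80  # assumed planning threshold

LOADS_WATTS = {
    "database server": 450,
    "application server": 450,
    "storage array": 1200,
    "network gear": 300,
}


def load_fraction(loads_watts, capacity_watts):
    """Return the total load as a fraction of the UPS capacity."""
    if capacity_watts <= 0:
        raise ValueError("capacity_watts must be positive")
    return sum(loads_watts.values()) / capacity_watts


def over_limit(loads_watts, capacity_watts, limit=HEADROOM_LIMIT):
    """Return True if the total load exceeds the chosen headroom limit."""
    return load_fraction(loads_watts, capacity_watts) > limit


if __name__ == "__main__":
    fraction = load_fraction(LOADS_WATTS, UPS_CAPACITY_WATTS)
    print(f"UPS load: {fraction:.0%} of rated capacity")
    if over_limit(LOADS_WATTS, UPS_CAPACITY_WATTS):
        print("Load exceeds the planning limit; review before the exercise.")
```

This only answers whether the rated capacity is being approached. How long the batteries will carry that load depends on the specific UPS and battery condition, which is what the on-site test is for.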
Print the OPS guides and staple them separately. Separate guides
enable personnel to work without sharing documentation.
Upon completion of a task, they can return to the team lead
to address any remaining systems. This also helps track
progress and makes efficient use of resources.
Meet again before the exercise and conduct a dry run-through.
Take note of any issues and fine-tune the documentation.
Plan Execution
A senior team member should direct and monitor the progress of the
exercise. Coordinate and reassign resources as they become
available. Make use of available personnel and system keyboards.
Take note of elapsed time, discrepancies in documentation
and issues as they arise.
Document functionality testing to ensure that once systems are powered
up they are providing the services required. Turn off internal
monitoring systems as late as possible, since a shutdown exercise
is the perfect opportunity to test monitoring. Document
notifications from external monitoring services as well.
Lessons Learned
At the conclusion of the exercise, the time required to shut
down and restart the enterprise systems will be known. The
preparation required keeps documentation current. The exercise
itself provides valuable on-the-job training. Together, current
documentation and trained staff help eliminate single points of failure.
Provide senior management with a formal report detailing the results
of the exercise. Powering down the computer room is one
of the first steps in taking ownership of the organization's
infrastructure. In my experience, many issues surface during
this exercise. It is better to learn about them during a
maintenance window than to have them complicate an emergency.
Copyright © 2005 CyberGuard Corporation All Rights Reserved.
Reprinted with Permission