Disaster Diary
Editor's note: The following is a network engineer's first-person account of a disaster that occurred at his bank. Bank officials agreed to let him write this sensitive story as long as the bank was not identified.
The sixth and top floor of my bank's 70,000-square-foot headquarters had never been level. One row of desks along the outside wall had been placed on top of a two-inch step to raise them to the height of the rest of the floor, and there was a running joke about pencils rolling off desks.
During a remodeling to fix some of these cosmetic problems, workers discovered large cracks that exposed corroded tension cables - the tendons that hold the building and its prestressed concrete together. This meant the structural integrity of the building couldn't be guaranteed. Upper management quickly declared a crisis and ordered everyone to evacuate the building.
Our disaster relocation plan went into effect. The stakes were high: If operations and systems went down, it would cost the bank and its parent company millions of dollars in lost revenue and lost customers, and create systems, data, monetary and regulatory chaos from which the bank might never recover.
Headquarters is the brain and nerve center for a banking empire extending across five states. More than 300 people directly involved in keeping the remote branches in operation, as well as supporting core bank functions, were relocated.
After two weeks, the building was inspected by more qualified engineers and declared safe. We returned to headquarters.
During that two-week period, all essential functions were maintained. Users, systems and applications were successfully supported in the branches. Locally, relocated users had access to the data and applications they needed to fulfill their duties. The disaster recovery effort had been, for the most part, a resounding success.
But behind the scenes, there were plenty of frantic days and hectic nights. Glitches kept popping up, but they were overcome with a lot of overtime and a can-do attitude. In the IS department, we learned some important lessons.
This is my account of "the event," as it's now referred to, along with some disaster planning tips.
Y2K: The end of the world was coming, sort of
As a part of its Y2K preparations, the bank made plans for evacuating headquarters in the event that general mayhem somehow rendered the building useless. Plans included creating and stocking two disaster sites with older computers, monitors and hubs, adding full-time, back-up servers and setting up frame relay links to the WAN.
Disaster Site One (DS1) was located 40 minutes across town in the rear of a branch of our parent bank. We added frame relay WAN connectivity, a back-up link to the company that hosts our mainframes (hereafter known as X-Company), a Novell NetWare server, a Windows NT Server, and a back-up Oracle server. More than 80 computers were stacked in closets, along with monitors, phones, cables, fax machines, printers, hubs, cords, more cables, paper and other supplies.
Optimal plans called for two engineers, the IS assistant manager, four administrators and two support personnel to set up and configure DS1 in several hours.
Disaster Site Two (DS2) was the corporate training center, 10 minutes away from headquarters. More than 20 workstations were already plugged into the network and ready to go. There were two Novell servers and an NT Server, as well as plenty of phones and other paraphernalia stacked in the closet.
Because it was already in a state of semireadiness, DS2 only had two engineers and three administrators scheduled for support. Most personnel would then relocate to DS1, where IS would base its operations, while other employees would move to offices in remote branch sites.
Bank policy called for full server backups five nights per week, and a two-week rotation of tapes. Remote branches mailed one tape per week to headquarters for off-site safekeeping. Servers at headquarters had the same policy, with some additions. Full restores were performed on servers at both DS sites weekly. This kept applications, patches and updates current. In the event of a disaster, only data directories would need to be restored, and in the worst case, only a week of data would be lost, and it wouldn't be user accounts. Finally, all headquarters' tapes were stored in a fireproof safe at DS2.
After the great Y2K nonevent, we took all the plans, erased the "Y2K" from the title, and renamed them Business Resumption Plans (BRP). We had several scenarios, ranging in gravity from "building temporarily uninhabitable, systems up" to "building uninhabitable, systems temporarily down" all the way to "large pile of rubble."
All employees had a copy of their BRP off-site. Because personnel, applications and department needs change over time, all departments occasionally walked through their plans, verbally and on-site with the equipment and IS emergency support.
Aug. 16: D-day minus one
On Wednesday, Aug. 16, we had just such a test. We pulled out some dusty computers, hooked them up and waited. The operations people came, checked their logons, their applications, their access to personal and departmental data and their WAN connectivity, and left. We compiled a small wish list of equipment and supplies, along with a few small configuration changes we needed, but pronounced ourselves ready for anything.
During the debriefing, we were told to prepare for an unplanned rehearsal.
Aug. 17: D-Day, 1800 hours
I was home preparing to go to rugby practice when my pager went off. My boss' number appeared, followed by a 911. This signifies that something bad and important is happening. She said the operations center had been declared uninhabitable, and our crisis plan was in effect.
I was sure it was the test we had been warned about. I threw my gym bag in the car and headed to our disaster site, expecting someone to be waiting there with a stopwatch who would then send me on my merry way. This did not turn out to be the case.
I was the first to arrive. This was a bad sign because beating the timers to the test site was a good hint that it was not actually a test. After trying every key on my ring three times, I finally found one that worked and unlocked the door. Then things slowly began to go south.
I had never been the first one in during our walk-throughs, and had never deactivated the alarm. I didn't even know where it was. I went on a frenzied alarm-box hunt. When I found the box, it only got worse.
The disaster site was located in a branch belonging to a different subsidiary of our parent company. As such, it had a different alarm box than I was accustomed to. I punched in my code. No go. Twice, three times, still no green light. I hit random buttons, and still the nasty red light. I waited for the alarm to sound and the cops to come, and began to practice my story. An administrator arrived, and I didn't feel quite as stupid when he couldn't get his code to work either. We hit more buttons, until I finally stopped, read the buttons and saw the word 'enter' on one. We finally had access to the building, and we avoided an embarrassing encounter with law enforcement.
At our second disaster site, a recently promoted engineer had the keys to the building, but not to the server room where he needed to work.
This crisis hit us at a bad time because we were short-staffed due to vacation and attrition. In addition, we had several new hires who hadn't yet learned our systems and applications. Even worse, the most important person in the department, our office assistant, had recently been promoted to another department. That last issue was quickly fixed: instant demotion. All told, we had three IS staffers at DS2 and five at DS1. DS1 staff were aided by several data center and bank operations people.
All our disaster walk-throughs had been limited in scope, setting up and testing one department at a time. As it turns out, this left several large holes in our "bulletproof" disaster plan. The first problem came when we started setting up the users' workstations and phones: There was nothing to put them on. At DS1, we had always tested in the front room, using the same tables and chairs each time. Now we needed desks and seating for more than 80 employees, and had chairs and tables for only about 20 people.
We also knew from our test on the previous day that we needed more monitors and hubs. We started a shopping list that would grow through the night. Several IS, data center and bank operations personnel went on scrounge missions to our headquarters with orders to grab everything they could before we were completely denied access. This included hubs, monitors, tables, folding chairs and anything else that might prove useful. We spent Thursday night setting up computers, restoring data to the disaster servers, sending people for supplies, eating bank-bought pizza, and calling the wife to say, "No, I'm not out with the boys, I'm working. Go to bed without me."
Disaster plans called for us to be set up in two hours - Hah! At two in the morning, we started the restores at both sites and called it a night. We were 90% ready, but lacked the specific department managers to provide the final pieces to the setup puzzle.
Aug. 18: Day 2, 0745 hours
Disaster plans called for users to show up at 10 a.m. Apparently, not everyone read that part. Starting at 7:45, users began to show up, expecting phones, faxes and computers. This is when the second repercussion of our limited disaster practice sessions came back at us like a large, ugly boomerang. We never checked the licenses on the server. We never needed to - everyone could log on during the tests, and if one person can log on, hundreds, even thousands can, right?
Wrong. As it turns out, our main server, a Novell NetWare 4.11 box, was licensed for only 25 users. When the first 25 users logged on, everything went smoothly. Then Mr. 26 tried to log on, as did Ms. 27 and Mr. 28, the vice president, and none of them could get in. We tried rebooting, but eventually had to call Novell tech support.
In the middle of the call (right after I told the engineer it couldn't be a licensing issue), Mickey, the engineer at DS2, remotely diagnosed the licensing problem and copied a 'magic' 1,000-users-can-play file to our server. That taken care of, the departmental workers could finish their emergency preparation procedures and begin to do something resembling regular office work. (Adding licenses to the server was entirely legal under our agreement with Novell, and we paid for only the extra nodes in use at the end of the quarter.)
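A gap like this can be caught by comparing the expected headcount against every resource a user needs, not just checking that one person can log on. The Python below is only a minimal sketch of that idea: the `Resource` class, the `find_shortfalls` function and the resource names are hypothetical, and the figures are rounded from this account (about 80 users expected at DS1, a 25-user license, seating for about 20).

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """One thing every relocated user needs (hypothetical model)."""
    name: str
    available: int


def find_shortfalls(expected_users, resources):
    """Return {resource name: units short} for each resource below headcount."""
    return {
        r.name: expected_users - r.available
        for r in resources
        if r.available < expected_users
    }


# Illustrative figures only, rounded from the account above.
ds1_resources = [
    Resource("server license seats", 25),
    Resource("desks and chairs", 20),
    Resource("workstations", 80),
]

shortfalls = find_shortfalls(80, ds1_resources)
for name, missing in shortfalls.items():
    print(f"{name}: short by {missing}")
```

Run against the full headcount rather than a single test department, a check of this kind would flag both the license limit and the missing furniture before the first user walks in.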
Limited testing also hurt us because we hadn't configured TCP/IP on either LAN. This resulted in error messages and more user panic. Installing and configuring Dynamic Host Configuration Protocol (DHCP) at both sites quickly took care of this problem. More small fires popped up and were stamped out; new ones came up, more stamping. It was almost business as usual for IS support.
By mid-Friday morning, however, another key (missing) component of disaster planning came up, but this time it wasn't my fault: parking. There was none. Two lifeguards from a nearby swimming pool nicely asked us to remove our cars from their parking lot before they were towed and sold.
We ended up paying the local Safeway to let us use 20 spots in their lot. The rest of us used on-street, two-hour parking spots. After an hour and 58 minutes, we'd run out and switch places to avoid tickets. The powers-that-be then arranged for a shuttle bus from the headquarters parking lot every morning until the disaster was over.
Another missing piece of our plan involved a secure (and essential) download process. The data processing department performs a daily download of data from X-Company. Data processing crunches the numbers, generates reports and uploads the new data back to X-Company. Because this is a secure process, X-Company only allows access to several specific IP addresses on our network. Now that we were on a different part of the WAN, the machines trying to pull the data had "illegal" IP addresses. Calls to the X-Company help desk finally resolved the issue.
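The address problem can be checked ahead of time by comparing the disaster-site machines against the partner's list of permitted addresses. The following Python is a hedged sketch using the standard `ipaddress` module; the `blocked_hosts` function is hypothetical, and every address shown is a placeholder from the ranges reserved for documentation, not a real address from the bank or X-Company.

```python
import ipaddress


def blocked_hosts(allowed, hosts):
    """Return the hosts whose addresses fall outside every allowed network."""
    networks = [ipaddress.ip_network(entry) for entry in allowed]
    return [
        host
        for host in hosts
        if not any(ipaddress.ip_address(host) in net for net in networks)
    ]


# Placeholder allow-list: the specific addresses the partner accepts.
ALLOWED = ["192.0.2.10/32", "192.0.2.11/32"]

# Placeholder addresses the download machines would have at the disaster site.
DISASTER_SITE_HOSTS = ["192.0.2.10", "198.51.100.25"]

blocked = blocked_hosts(ALLOWED, DISASTER_SITE_HOSTS)
print(blocked)
```

Any address the check reports is one to register with the partner's help desk before a disaster, rather than during one.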
At the end of the day, we were tired, but we had survived. The worst had come, and we were still standing.
Days 3-14: The long haul
While disaster plans specified full occupancy of disaster sites for up to 30 days, lack of space quickly changed that to three days. Having a large number of people crammed into the two small buildings with a total of three working toilets bordered on the inhumane.
When it became apparent that these temporary digs might be used for an extended period of time, management and IS began to move the refugees to other local branches. IS gave up one of its two disaster-support sites and crammed administrators into an extra room in yet another branch. Network engineers were eventually told to work from home. By the middle of the first week, essential personnel were given limited access to the building to pick up supplies.
Day 15: The return
Qualified structural engineers, after a lot of expensive drilling, pulling, pounding and measuring, pronounced the building safe. To ease the transition back into our building, we had a phased return, bringing back one or two departments a day. This worked out well: It gave us time to solve the minor issues that came up with each department's return instead of having one great day of chaos.
Most glitches occurred when departments that had been split across multiple locations (and thus multiple servers and databases) needed to share data and applications again. Different copies of databases had to be merged, conflicts fixed, etc. Despite the minor issues, everyone was glad to be back home. After two weeks of cramped, overcrowded conditions, most users' old cubes had never looked so good, and they cheerfully waited for us to fix these minor issues.
The moral of the story
Overall, the biggest lesson for the bank's powers-that-be was that we need more space. It is hard to know ahead of time how cramped conditions may be, but a detailed site map can help. In IS, we learned that we have to make sure our servers and networks can handle the full number of users. We know too that we were lucky we didn't need to implement the "large pile of rubble" scenario.
I am working on a new disaster plan for our database servers for that scenario, and we are going to keep copies of important documentation in locked fireproof safes at the disaster sites. We have also begun research on a large, multisite storage-area network.
I have two final pieces of advice for disaster planners. First, you can't think of everything. Some tasks will take longer than expected, people and departments won't have their bases covered or things will just go wrong. Murphy's Law is stronger than gravity. Relax and cope with it. At worst you will get great story material for your grandchildren or your next job interview.
Second, no matter how well you plan and how many details you cover, it's going to be expensive. This is life. Get a bean counter to set up a special disaster account, and if you need something - monitors, tables, a large tent or sunscreen - don't be afraid to buy it. Your primary thought should be survival, then economy. Keep records and receipts to sort things out later. We did this and it made life a lot easier. I hope you won't have to go through what we did, but if you ever do, I hope the advice here can prevent some headaches.
The author can be reached at mlane@myrealbox.com