Network-upgrade horror story

IT executive learns key lessons during four-year effort to get revamp off the ground

Figure: Next-generation metropolitan area network topology

Clear sailing in the design phase

In the summer of 2003, a design team of network technologists from campus IT, several campus departments and the medical center began to think about a new network. We considered what technologies offered the best mix of price and performance and which offered the greatest capability for expansion and the lowest risk of downtime.

Dense wavelength division multiplexing (DWDM) quickly became the front-runner among the candidate technologies. It can scale over time from eight lambdas (light-wave channels) all the way to 32 protected lambdas or 64 unprotected lambdas.

DWDM would provide a graceful evolution for the network's ever-increasing demands for capacity and capability. Each individual lambda running as fast as 2.5Gbps can carry a different service. For example, we could run the production Ethernet network over one lambda and a high-definition video feed over another. Or we could choose to provide a secure second Ethernet network for the medical center to connect the university's hospital facilities. This would let secure, electronic, protected health information move across the medical center's clinical network without coming in contact with student and faculty traffic on the campus network.
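The scaling arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a statement about any vendor's gear: it assumes every lambda runs at the 2.5Gbps ceiling mentioned above, and the function name `aggregate_capacity_gbps` is purely illustrative.

```python
# Rough aggregate-capacity estimate for a DWDM ring.
# Assumption: every lambda runs at the same rate (2.5 Gbps, the ceiling
# quoted in the text); real deployments can mix rates per lambda.

def aggregate_capacity_gbps(lambdas: int, gbps_per_lambda: float = 2.5) -> float:
    """Return total capacity in Gbps for a given number of lambdas."""
    if lambdas < 0:
        raise ValueError("lambda count cannot be negative")
    return lambdas * gbps_per_lambda


if __name__ == "__main__":
    # Lambda counts taken from the text: 8 initially, then 32 protected
    # or 64 unprotected when the system is fully built out.
    for label, count in [("initial build", 8),
                         ("fully built, protected", 32),
                         ("fully built, unprotected", 64)]:
        print(f"{label}: {count} lambdas -> {aggregate_capacity_gbps(count):g} Gbps")
```

Under those assumptions the ring grows from 20Gbps at eight lambdas to 80Gbps with 32 protected lambdas or 160Gbps with 64 unprotected ones.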

Then there is the matter of protected and unprotected lambdas. The bane of any optical-fiber-based network is the feared fiber cut. DWDM offers the option of protected lambdas, which run in one direction in the DWDM ring, while working lambdas run in the other direction.

Most DWDM gear has protection switching that senses the loss of signal from the failed working lambdas and switches to the protected lambdas in less than 50 milliseconds. There are few if any network applications that would notice that short an outage.

To add even more resiliency, we engineered in topology reliability. The new network was designed with diversely routed, dual-concentric rings at the main sites. Thus, a fiber cut or optical failure would have to take out both rings to cause a network failure. Even then, protected lambdas would take over.
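The benefit of requiring two failures instead of one can be illustrated with the standard formula for redundant paths. The sketch below assumes the two rings fail independently, which diverse routing approximates but never guarantees, and the 99.9% per-ring figure is a hypothetical value chosen for illustration, not a UCSF measurement.

```python
# Availability of redundant paths, assuming failures are independent.
# The 99.9% per-ring figure used below is a made-up illustration.

def combined_availability(path_availabilities):
    """Service is up unless every redundant path is down at the same time."""
    p_all_down = 1.0
    for availability in path_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down


if __name__ == "__main__":
    single_ring = 0.999  # hypothetical availability of one ring
    dual_ring = combined_availability([single_ring, single_ring])
    print(f"one ring: {single_ring:.4%}  two diverse rings: {dual_ring:.4%}")
```

With these illustrative inputs, two rings that are each down 0.1% of the time are both down only about 0.0001% of the time, which is why the design layers rings on top of protected lambdas.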

Now we had the basis for the new network, which we christened UCSF's Next Generation Metropolitan Area Network (NGMAN).

NGMAN is made up of core and secondary sites. The core consists of the two main campuses and a central administrative building. San Francisco General Hospital, Mount Zion Medical Complex, Laurel Heights Conference Center and the Veterans Administration Medical Center are secondary sites.

Core sites are the locations with the heaviest traffic demands. They also are the sites with the most users. Therefore, they have the highest bandwidth (10Gbps) and the most resiliency. Most secondary sites connect to the core in a point-to-point fashion using unprotected lambdas running at 1Gbps or 10Gbps, depending on their traffic requirements.

The product of building reliability on top of reliability was a resilient, redundant and self-healing network that could survive such events as earthquakes and bioterrorism -- not an unimportant consideration for a patient care network in a seismically active area. In fact, NGMAN was designed to achieve five nines (99.999%) of availability -- no more than 5.26 minutes of downtime a year.
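The 5.26-minute figure follows directly from the availability percentage. A minimal sketch of the conversion, assuming an average year of 365.25 days (the helper name `max_downtime_minutes` is illustrative):

```python
# Convert an availability target expressed in "nines" into the
# maximum downtime it allows per year.
# Assumption: an average year of 365.25 days (525,960 minutes).

MINUTES_PER_YEAR = 365.25 * 24 * 60


def max_downtime_minutes(nines: int) -> float:
    """Return allowed downtime in minutes per year for N nines of availability."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {max_downtime_minutes(n):.2f} minutes of downtime per year")
```

Each additional nine cuts the downtime budget by a factor of 10, so three nines allow roughly 526 minutes a year while five nines allow about 5.26.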

UCSF has a "build it and they will come" philosophy. We don't build things frivolously, but we do build them on faith. The university built an entirely new campus at Mission Bay hoping to attract top medical researchers from around the world. A number of educators and researchers in fact made their way to UCSF and wound up doing their research in the new state-of-the-art Mission Bay buildings, which were outfitted with high-performance networks.

There was an element of "build it and they will come" in the NGMAN project as well. The network was built to support future medical applications. It needed to be high-performance and support QoS and multicast. It had to support high-definition video distribution, IP telephony and real-time medical imaging. And it had to be scalable.

We chose a modular approach to minimize forklift upgrades. Modularity extended to more than just the equipment. We intended the modular concept to allow for adding and deleting secondary sites easily. If a site didn't need the full capabilities of DWDM, we could bring it online via alternative technologies, such as optical metropolitan Ethernet service or leased services.

11th-hour snag

Now that we had the design characteristics down, it was time to bid the project. This is where we hit our first speed bump. What is the best way to procure a project of this size and complexity? Should it be an RFP? An RFQ? Or some sort of hybrid of the two? Should we create one large master RFP covering every aspect of the project or divide up the bid into several smaller procurement vehicles?

As we thought this through, it became obvious that the project was so complex that we weren't sure a single procurement vehicle could cover everything.

We realized there were three elements in the project, which corresponded closely to the layers of the OSI model. One was the fiber infrastructure -- often referred to as Layer 0. The second element was the optical DWDM gear, or Layer 1. Finally, there was the Ethernet element, or the Layer 2 and Layer 3 equipment. We could therefore issue separate bids for the fiber infrastructure and the optical equipment, plus another bid for the Ethernet equipment.

Breaking the RFP into components offered the additional advantage of making the procurement documents easier to write and evaluate. Plus, this offered an advantage to vendors, who might be unable to bid on all components of the network.

In July 2004, as we were analyzing the fiber-infrastructure responses, the California Public Utilities Commission issued a landmark decision permitting PG&E and other utilities to sell unused strands of dark fiber to third parties. This created alternatives that became known as "managed" and "integrated" services.

With dark fiber we would own and operate the fiber network. Managed fiber would let the service provider own and operate the fiber network while UCSF owned and operated the optical DWDM gear. The integrated option offered a turnkey package that provided multiple lambdas through the utility-owned fiber. The carrier would both provide the fiber infrastructure and own and operate the DWDM gear.

The cost for the integrated solution was expected to be a fraction of the cost to lease dark fiber. Of course, we learned all this at the 11th hour, just before we were to award the fiber-infrastructure RFP.

We had two choices: award the project to the winning vendor, or pull the fiber RFP and rewrite it to let the carriers bid on the new options. That would delay the project, but in the end the estimated 33% cost reduction prevailed, and we elected to pull the RFP.

Back to the drawing board

We were convinced that we had done everything possible to educate and win over upper management and the campus constituency. We had prepared PowerPoint presentations to explain our technical and business cases. We had discussed NGMAN with a number of campus committees. We had talked one-on-one with administrative decision makers. We even hired consultants in early 2005 to assist us in the NGMAN vendor-selection process. We had covered all our bases -- or so we thought.

The consultants traveled around campus interviewing key researchers and faculty about NGMAN. You can imagine our shock when they reported they had heard three key questions repeatedly during their interviews: What exactly was NGMAN? Why did the university need a new network? Wasn't there a better use for the money?

Clearly, whatever venues we had used, whatever discussions we had held and however we attempted to sell the NGMAN project, we had missed the target.

We realized we had to start the process all over again. We had to go back and meet with the key researchers, staff and faculty. We needed to explain why the current campus network needed to be replaced, describe the original design assumptions that had created NGMAN and justify why this project was an appropriate use of campus resources.

This had to be done at the same time as we continued to move the NGMAN procurement process along -- except now we faced new uncertainties. We didn't know how much of NGMAN would be built, how much funding we could count on or how much campus support we could build for the project.

Finally, after much discussion, and more than a little gray hair on the design team's part, the university approved building the core of the network in June 2006. The funding commitment for the core was now certain.

However, the secondary sites were open to discussion -- and there was considerable dialogue as to which secondary sites would be connected directly to NGMAN. For example, it was decided for cost reasons not to link the Mount Zion and VA Medical Center sites.

Law & order: UCSF

There was one final twist. In October 2006, a federal grand jury in San Francisco indicted a former contract employee in the procurement office of UCSF on public-corruption charges.

Steven Donnelly was charged with selling confidential information about AT&T's bid to a Verizon employee in exchange for a BMW automobile. Donnelly later met with the employee and agreed to provide the employee with a thumb drive containing the confidential material for $5,000 rather than the BMW. Verizon fully cooperated in this investigation, according to the U.S. Attorney's Office.

Of course, this set the process back again, because UCSF officials had to decide how to proceed. Finally, UCSF awarded the bid in May to AT&T's Healthcare Markets Group. In the end, we decided to go with the integrated, turnkey option. AT&T is expected to take six months to construct the network and one month to test it. UCSF then will have an additional month for our own acceptance testing.

NGMAN will go live in early 2008. Afterward, there will be a nine- to 12-month migration period when we will move users from the ATM-SONET network to NGMAN. After that, the ATM-SONET network will be sunset.

It has been painful to run a first-rate medical institution on a 1990s network, but UCSF has managed to work around the current network's limitations by accepting network capabilities that are significantly less than state of the art.

This has not been a problem for users who mainly do e-mail and database lookups. It has been more problematic for users who wish to do video distribution, high-bandwidth medical imaging, and enhanced-performance and security-based applications.

While the delays have been frustrating, NGMAN will be here shortly to remove many of these network-related limitations and let the university move into 21st-century network technologies.

Lessons learned

The NGMAN project taught us important lessons about selling major projects to a large and diverse university. Many of these lessons contain principles that can be used for any enterprise network project.

* Establish a strong business case at the beginning of the project: It is important that the proposed network make financial and business, as well as technical, sense. The days of implementing new IT projects simply because they look technically promising are gone. Today IT is all about business benefits.

* Sell the project: It is essential that you obtain buy-in from key decision makers. This extends far past technical decision makers. The key influencers need to be consulted, no matter where they work in the organization.

* Understand the project scope and funding from the beginning: It sounds obvious, but setting a scope with which all parties agree and obtaining a reliable funding source aren't as straightforward as they once were. Today, more homework is needed to assure that upper management, the budget office and the network designers are all on the same page.

Fritz is director of enterprise network services for the University of California, San Francisco, and a member of the Network World Lab Alliance. He is the author of Remote LAN Access: A Guide for Networkers and the Rest of Us, and Sensible ISDN Data Applications. He can be reached at jnfritz07@yahoo.com.
