Plan for the worst, hope for the best

You can build an e-commerce site that never fails. Here's how others have done it.

By Jason Meserve
Network World, 02/26/01

On the busiest shopping day of the year, your Web site - the company's lifeblood - goes down. How did it happen? Will customers return?

Last Thanksgiving weekend, Amazon.com lived this nightmare. The popular e-tailer's site was down for 30 minutes on Nov. 24 and for another 15 minutes on Nov. 30; the company cited software problems. Still smarting from the massive denial-of-service attacks last February, Amazon.com didn't need another service black eye.

A better response
In the Web world, much is made out of average response times. But they can't be the only measure of a site's performance.

"When a system is under serious stress, you will get weird behaviors," says Alberto Savoia, chief technologist in Keynote Systems' KeyReadiness Services group. "Averages can hide those problems."

No 24-7 global e-commerce company does. An unresponsive site could mean long-term lost customers because "people remember their negative experiences," says Alberto Savoia, chief technologist for KeyReadiness Services at Keynote Systems, a Web monitoring and testing firm in San Mateo, Calif.

Even occasional slow performance is a threat. "If a Web site is slow, a vendor might be losing customers and not even know it," says Savoia, who likens the experience to not complaining to the chef after a bad meal but never returning to the restaurant.

So how do you build a complex e-commerce site that never fails? IT executives running successful Web sites for AmericanGreetings.com, EOS Bank, Monster.com, Penn State University, United Parcel Service and the World Wrestling Federation (WWF) say that redundancy in core systems is the secret weapon.

"The infrastructure should be 'fail-safe' at any critical point," stresses Bruce Petro, CIO of AmericanGreetings.com in Cleveland. AmericanGreetings.com is one of the 50 most-visited Internet sites, with more than 8.5 million unique visitors in October 2000, according to Media Metrix, a Web-traffic measurement company. Petro says the site serves about six million pages per day during nonholiday weeks and double that amount during a holiday period.

AmericanGreetings.com has had a few growing pains since launching in May 1995, Petro says, but unscheduled downtime has been minimal. For Petro, fail-safe means building in redundancy at the firewall, load-balancing devices, major network switches and its database. If any of these fail, traffic is automatically shuttled to a backup. Other systems, such as servers, send alerts when they approach peak loads.

"We have 250 application and Web servers, which mitigates the impact of any one being lost," Petro says.

Robert O'Connor, supervisor of network architecture research and development at Penn State, says a comprehensive risk analysis "will help you decide where money should be spent on redundancy."

For instance, Penn State uses two internal power supplies but only one uninterruptible power supply for each server. He admits that the choice gives Penn State a possible single point of failure, but adds that those devices are reliable.

A data center or two

Monster.com, UPS and the WWF employ multiple data centers to handle their massive traffic loads. If one goes down, the other takes over. Employment finder Monster.com owns two data centers on separate coasts to handle the roughly 390 million page views it gets per month across the 15 Web sites it operates around the globe, says Brian Farrey, Monster.com's CTO.

Monster.com built its own data centers because most collocation facilities could not meet the company's need for space. It uses 300 Windows NT-based Dell servers and an untold number of Cisco switches and routers. "We're too big for the [collocation facilities]. We scare them away with our size," Farrey says.

Traffic loads are balanced between the two sites, based on geography. Each is serviced by multiple ISP trunks and power supplies. If one goes down, the other can handle the global traffic on its own, Farrey says. Currently, the company uses load-balancing devices from HydraWeb Technologies that check for server availability before sending packets, Farrey says.

In addition to distributing load between the two facilities, Monster.com places its applications across server groups in what Farrey calls "functional clustering." For example, the job search application is spread across 50 servers, 25 in each data center. If one fails, 24 others in the data center carry its load. If a whole cluster fails, the traffic goes to the mirror cluster at the other data center. If both clusters fail, the site's other applications would remain active, he says.
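The failover order Farrey describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Monster.com's actual implementation: the Cluster class, the route function and the server names are all hypothetical, and a real load balancer would make this decision per request using live health checks.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """All servers running one application inside one data center."""
    servers: Dict[str, bool] = field(default_factory=dict)  # server name -> healthy?

    def healthy(self) -> List[str]:
        return [name for name, up in self.servers.items() if up]


def route(local: Cluster, mirror: Cluster) -> List[str]:
    """Pick the servers that should carry this application's traffic.

    Prefer healthy servers in the local data center; fall back to the
    mirror cluster only when the whole local cluster is down. An empty
    list means the application is unavailable in both data centers.
    """
    survivors = local.healthy()
    return survivors if survivors else mirror.healthy()


# Hypothetical job-search application: 25 servers per data center.
east = Cluster({f"east-{i}": True for i in range(25)})
west = Cluster({f"west-{i}": True for i in range(25)})

east.servers["east-0"] = False          # one server fails
print(len(route(east, west)))           # 24 local servers share the load

for name in east.servers:               # the whole east cluster fails
    east.servers[name] = False
print(route(east, west)[0])             # traffic moves to the west mirror
```

Because each application has its own pair of clusters, an empty result for one application says nothing about the others, which is the point of functional clustering.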

Wrestling with traffic loads

The WWF airs approximately nine hours of television programming per week. If an announcer mentions WWF.com during a broadcast, 100,000 users can hit the site minutes later. In November 2000, the company had 239 million page views across the many Web sites it operates.

"We're running at a very small percentage of capacity, but we have to be ready for the peaks," says Gerry Louw, CTO at the WWF. Louw's approach to balancing load across its 100 Web and streaming media servers is similar to Farrey's, although the network spans less physical distance and uses a collocation service. WWF runs its Web operation out of two Level 3 Communications facilities in New York. The WWF plans to collocate a third data center at a West Coast Level 3 facility later this year, Louw says.

The two main WWF sites - wwf.com and wwfsuperstars.com - are split 50-50 between the New York data centers. Several servers in Virginia (from an earlier outsourcing deal) handle 52 smaller Web sites for individual pay-per-view events and individual stars. Those will be moved to the new West Coast facility.

Within Level 3's racks, the WWF uses 60 Compaq DL360 servers running Red Hat Linux and Squid Web Proxy Cache software for serving HTML content. Six servers deliver streaming media clips with another 34 machines on hot standby for live events. The company uses Cisco routers, switches and load-balancing devices for managing traffic coming to the two main sites.

Louw's rule of thumb is that peak load should never be more than 60% of capacity. If load exceeds that threshold, he brings up more servers. He measures throughput via custom scripting and from the baseline data Level 3 provides.
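As a rough illustration of that rule of thumb, the Python sketch below checks whether a measured peak breaches the 60% ceiling and estimates how many servers would keep the peak under it. The function names and the traffic figures are assumptions made for the example; the article does not describe the WWF's scripts or its per-server capacity.

```python
import math


def over_threshold(peak_load: float, capacity: float, ceiling: float = 0.60) -> bool:
    """True when the measured peak exceeds the allowed share of total capacity."""
    return peak_load > capacity * ceiling


def servers_needed(peak_load: float, per_server_capacity: float,
                   ceiling: float = 0.60) -> int:
    """Smallest server count that keeps the peak at or below the ceiling."""
    if peak_load < 0 or per_server_capacity <= 0 or not 0 < ceiling <= 1:
        raise ValueError("loads must be non-negative, capacity positive, ceiling in (0, 1]")
    return math.ceil(peak_load / (per_server_capacity * ceiling))


# Hypothetical figures: a 100,000-user surge, 500 users per server.
print(servers_needed(100_000, 500))   # 334
print(over_threshold(61, 100))        # True: 61% of capacity is past the ceiling
```

The load and capacity can be in any unit (requests per second, concurrent users, megabits), as long as both use the same one.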

While the site has never gone down because of traffic loads, the company occasionally has to limit the number of people accessing some of the streaming media content to ensure high-quality viewing experiences. The cap varies depending on the amount of bandwidth being consumed by the total number of users.

Delivering Web content

At UPS, daily hits mount to an estimated 100 million during the holidays. From Nov. 15 to Jan. 15, the company's Internet reliability team meets daily to review performance issues. During such meetings, the team will cover specific trouble tickets, such as a 5-minute slowdown on the West Coast, says John Nallin, vice president of IS at UPS. "On some of these, the problem is resolved before we hear about it."

UPS supports worldwide Web operations with two nearly identical data centers in New Jersey and Georgia. It aims for capacity usage at each location to hover at around 40%, so if one site fails the other could handle the overflow, Nallin says. Factoring in capacity for data center communications is the trick.

"It used to be that you would buy a box to get a little extra capacity, but now you have to buy two boxes for each data center and a box for talking to both data centers," he adds.
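The arithmetic behind the 40% target is simple to check. The Python sketch below assumes equally sized sites and that a failed site's traffic is spread evenly over the survivors; it deliberately ignores the extra capacity for communication between data centers that Nallin mentions. The function name is hypothetical.

```python
def utilization_after_failure(per_site_utilization: float,
                              sites: int = 2, failed: int = 1) -> float:
    """Utilization of each surviving site once the failed sites' load is absorbed."""
    survivors = sites - failed
    if survivors <= 0:
        raise ValueError("at least one site must survive")
    return per_site_utilization * sites / survivors


# Two sites at 40% each: the survivor ends up at 80%, leaving headroom.
print(utilization_after_failure(0.40))

# Two sites at 60% each would need 120% of one site: not survivable.
print(utilization_after_failure(0.60) > 1.0)   # True
```

Under these assumptions, any per-site utilization above 50% in a two-site design means a single failure overloads the remaining site.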

The company uses a virtual Domain Name System scheme to direct traffic between the data centers. Load balancing is handled at the ISP and again before the firewall. The Web servers run mainly on Sun Solaris machines, but also on a few NT servers. Applications such as package tracking and signature capturing run on IBM AS/400 servers and 15 IBM 3090 mainframes at two locations. "With 100 terabytes of data, you don't run on dinky Unix boxes," Nallin says.

One shift per day, UPS operates one data center from the other for practice. This way, if a snowstorm in New Jersey keeps workers at home, any problems at that data center can be handled by the engineers in Georgia, for example, Nallin says.

Keeping ahead of capacity

To measure capacity and monitor performance, UPS uses a combination of reports from Keynote Systems and data from a custom-built application that tracks application and proxy server performance.

"We traditionally see a growth in activities of about 100% per year," Nallin says. "We see it incrementally as you go from quarter to quarter and half year to half year. We always have staging and provisioning equipment come in [throughout the year]."
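To see what 100% annual growth implies for provisioning, the Python sketch below projects load forward and converts the annual rate into an equivalent quarterly rate. It assumes smooth compound growth, which is a simplification; Nallin says only that the increase shows up incrementally through the year. The function names and the starting figure are hypothetical.

```python
def projected_load(current_load: float, annual_growth: float = 1.0,
                   years: float = 1.0) -> float:
    """Load after the given number of years of compound growth."""
    return current_load * (1 + annual_growth) ** years


def quarterly_rate(annual_growth: float = 1.0) -> float:
    """Quarterly growth rate that compounds to the given annual rate."""
    return (1 + annual_growth) ** 0.25 - 1


# Hypothetical baseline of 100 units of traffic today.
print(projected_load(100, years=3))      # 800.0: doubling three times
print(round(quarterly_rate(), 3))        # 0.189: about 19% more each quarter
```

Seen this way, doubling every year means adding roughly a fifth more capacity each quarter, which fits a practice of staging equipment throughout the year.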

EOS Bank, an online-only bank, plans capacity based on business growth projections. The bank builds for triple its customers, which it expects to number 300,000 within the next three years. "We won't run out of capacity before we can add more," says Roy Henderson, president and CEO.

Outsourcing was cheapest for EOS. It houses its back-end servers at Exodus Communications' El Segundo, Calif., facilities. The Web server applications are operated by Home Accounts, a spinoff of credit card processor FDR. Its servers are connected via multiple T-1 lines to prevent any single point of failure.

But the best plans can still result in problems. When Penn State brought its grade-checking system online about three years ago, it had a vendor build load-balancing software to equalize traffic across multiple servers. The custom code worked poorly, and the servers began crashing. "We had two students sit in front of the servers to restart them if they crashed," O'Connor says.

Needless to say, Penn State threw out the third-party application. Now it uses IBM's Enterprise Network Dispatcher.

Planning for performance

Many bottlenecks can be attributed to bad system configurations between products from different vendors. Keynote's Savoia says a company could spend $10 million on a system that acts like a 500-MHz Pentium connected to the Internet via a 28.8K bit/sec modem. "The solution could be as simple as making a configuration change in a Cisco router and getting back up to full power," he says.

For Monster.com, this means testing applications with Segue Software's SilkTest regression testing software before anything goes live. "We like to do a basic performance smoke test before we put an application into production," Farrey says. "It helps to get a warm-and-fuzzy feeling that the application can handle the traffic."

Still, try as you might, you can't control all load-related problems. "You don't have control over all the components that control your service levels," UPS's Nallin says. "AT&T and UUNET could be having problems, but to the end user it looks like you. There's no control over that."

But if you've built for plenty of capacity and have redundancy at the critical junctures, then you've done more than simply hope for the best.

Contact Multimedia Editor Jason Meserve at jmeserve@nww.com.

  Copyright, 1995-2002 Network World, Inc. All rights reserved.