You can build an e-commerce site that never fails. Here's how others have
done it.
By Jason Meserve
Network World, 02/26/01
On
the busiest shopping day of the year your Web site - the company's
lifeblood - goes down. How did it happen? Will customers return?
Last
Thanksgiving weekend, Amazon.com lived this nightmare. The popular
e-tailer's site was down for 30 minutes on Nov. 24 and for
another 15 minutes on Nov. 30; the company cited software problems.
Still smarting from the massive denial-of-service attacks last
February, Amazon.com didn't need another service black eye.
A better response
In the Web world, much is made of average response times. But they can't be the only measure of a site's performance.
"When a system is under serious stress, you will get weird behaviors," says Alberto Savoia, chief technologist in Keynote Systems' KeyReadiness Services group. "Averages can hide those problems."
No
24-7 global e-commerce company does. An unresponsive site could
mean long-term lost customers because "people remember their
negative experiences," says Alberto Savoia, chief technologist
for KeyReadiness Services at Keynote Systems, a Web monitoring
and testing firm in San Mateo, Calif.
Even
occasional slow performance is a threat. "If a Web site is
slow, a vendor might be losing customers and not even know it,"
says Savoia, who likens the experience to not complaining to the
chef after a bad meal but never returning to the restaurant.
So
how do you build a complex e-commerce site that never fails? IT
executives running successful Web sites for AmericanGreetings.com,
EOS Bank, Monster.com, Penn State University, United Parcel Service
and the World Wrestling Federation (WWF) say that redundancy in
core systems is the secret weapon.
"The
infrastructure should be 'fail-safe' at any critical
point," stresses Bruce Petro, CIO of AmericanGreetings.com
in Cleveland. AmericanGreetings.com is one of the 50 most-visited
Internet sites, with more than 8.5 million unique visitors in
October 2000, according to Media Metrix, a Web-traffic measurement
company. Petro says the site serves about six million pages per
day during nonholiday weeks and double that amount during a holiday
period.
AmericanGreetings.com
has had a few growing pains since launching in May 1995, Petro
says, but unscheduled downtime has been minimal. For Petro, fail-safe
means building in redundancy at the firewall, load-balancing devices,
major network switches and its database. If any of these fail,
traffic is automatically shuttled to a backup. Other systems,
such as servers, send alerts when they approach peak loads.
"We
have 250 application and Web servers, which mitigates the impact
of any one being lost," Petro says.
Robert
O'Connor, supervisor of network architecture research and
development at Penn State, says a comprehensive risk analysis
"will help you decide where money should be spent on redundancy."
For
instance, Penn State uses two internal power supplies but only
one uninterruptible power supply for each server. He admits that
the choice gives Penn State a possible single point of failure,
but adds that those devices are reliable.
A data center or two
Monster.com,
UPS and the WWF employ multiple data centers to handle their massive
traffic loads. If one goes down, the other takes over. Employment
finder Monster.com owns two data centers on separate coasts to
handle the roughly 390 million page views it gets per month across
the 15 Web sites it operates around the globe, says Brian Farrey,
Monster.com's CTO.
Monster.com
built its own data centers because most collocation facilities
could not meet the company's need for space. It uses 300
Windows NT-based Dell servers and an untold number of Cisco switches
and routers. "We're too big for the [collocation facilities].
We scare them away with our size," Farrey says.
Traffic
loads are balanced between the two sites, based on geography.
Each is serviced by multiple ISP trunks and power supplies. If
one goes down, the other can handle the global traffic on its
own, Farrey says. Currently, the company uses load-balancing devices
from HydraWeb Technologies that check for server availability
before sending packets, Farrey says.
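The availability check Farrey describes can be illustrated with a short Python sketch. It is not HydraWeb's implementation; the server names and the `is_up` probe are hypothetical stand-ins for whatever health check a real load balancer performs.

```python
from typing import Callable, Iterator, List


def healthy_round_robin(servers: List[str],
                        is_up: Callable[[str], bool]) -> Iterator[str]:
    """Rotate through servers, skipping any that fail the availability check."""
    while True:
        found_one = False
        for server in servers:
            if is_up(server):          # probe before sending any traffic
                found_one = True
                yield server
        if not found_one:
            raise RuntimeError("no servers available")


# Hypothetical pool with one machine down.
down = {"web-2"}
picker = healthy_round_robin(["web-1", "web-2", "web-3"],
                             lambda name: name not in down)
first_four = [next(picker) for _ in range(4)]  # web-2 is never chosen
```

Because the probe runs on every pass, a server that recovers rejoins the rotation without any extra bookkeeping.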
In
addition to distributing load between the two facilities, Monster.com
places its applications across server groups in what Farrey calls
"functional clustering." For example, the job search
application is spread across 50 servers, 25 in each data center.
If one fails, 24 others in the data center carry its load. If
a whole cluster fails, the traffic goes to the mirror cluster
at the other data center. If both clusters fail, the site's
other applications would remain active, he says.
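The failover order in functional clustering can be sketched in a few lines of Python. This is an illustration of the idea as Farrey describes it, not Monster.com's code; the `Cluster` class, the `route` function and the server names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Cluster:
    """Servers running one application in one data center."""
    site: str
    healthy: Set[str] = field(default_factory=set)


def route(local: Cluster, mirror: Cluster) -> Optional[Cluster]:
    """Pick where an application's traffic should go."""
    if local.healthy:
        return local       # surviving local servers share the load
    if mirror.healthy:
        return mirror      # whole cluster lost: use the other data center
    return None            # this application is down; others keep running


# Hypothetical job-search clusters, 25 servers per data center.
east = Cluster("east", {f"east-{n}" for n in range(25)})
west = Cluster("west", {f"west-{n}" for n in range(25)})

east.healthy.discard("east-7")      # one server fails
after_one_failure = route(east, west)
east.healthy.clear()                # the whole east cluster fails
after_cluster_failure = route(east, west)
```

Each application gets its own pair of clusters, so a `None` result for one application says nothing about the others.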
Wrestling with traffic loads
The
WWF airs approximately nine hours of television programming per
week. If an announcer mentions WWF.com during a broadcast, 100,000
users can hit the site minutes later. In November 2000, the company
had 239 million page views across the many Web sites it operates.
"We're
running at a very small percentage of capacity, but we have to
be ready for the peaks," says Gerry Louw, CTO at the WWF.
Louw's approach to balancing load across its 100 Web and
streaming media servers is similar to Farrey's, although
the network spans less physical distance and uses a collocation
service. WWF runs its Web operation out of two Level 3 Communications
facilities in New York. The WWF plans to collocate a third data
center at a West Coast Level 3 facility later this year, Louw
says.
The
two main WWF sites - wwf.com and wwfsuperstars.com - are split
50-50 between the New York data centers. Several servers in Virginia
(from an earlier outsourcing deal) handle 52 smaller Web sites
for individual pay-per-view events and individual stars. Those
will be moved to the new West Coast facility.
Within
Level 3's racks, the WWF uses 60 Compaq DL360 servers running
Red Hat Linux and Squid Web Proxy Cache software for serving HTML
content. Six servers deliver streaming media clips with another
34 machines on hot standby for live events. The company uses Cisco
routers, switches and load-balancing devices for managing traffic
coming to the two main sites.
Louw's
rule of thumb is that peak load should never be more than 60%
of capacity. If load exceeds that threshold, he brings up more
servers. He measures throughput via custom scripting and from
the baseline data Level 3 provides.
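Louw's 60% rule translates into simple arithmetic. The Python sketch below is an illustration, not the WWF's custom scripting; the per-server capacity and the peak figure are made up for the example.

```python
def servers_needed(peak_load: int, per_server_capacity: int,
                   max_utilization_pct: int = 60) -> int:
    """Smallest server count that keeps peak load at or below the target."""
    if peak_load < 0 or per_server_capacity <= 0:
        raise ValueError("load must be >= 0 and capacity > 0")
    if not 0 < max_utilization_pct <= 100:
        raise ValueError("utilization must be between 1 and 100")
    usable = per_server_capacity * max_utilization_pct
    # Ceiling division in integers avoids floating-point rounding surprises.
    return -(-peak_load * 100 // usable)


# Hypothetical numbers: 20,000 requests/sec at peak, 500 requests/sec per server.
needed = servers_needed(20_000, 500)   # 67 servers keep the peak under 60%
```

With 60 servers already in service, that result would mean bringing up seven more.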
While
the site has never gone down because of traffic loads, the company
occasionally has to limit the number of people accessing some
of the streaming media content to ensure high-quality viewing
experiences. The cap varies depending on the amount of bandwidth
being consumed by the total number of users.
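One way to read the bandwidth-dependent cap described above is as leftover capacity divided by the bit rate of a stream. The following Python sketch assumes that reading; the link size, background traffic and 300K bit/sec stream rate are hypothetical.

```python
def stream_slots(link_kbps: int, other_traffic_kbps: int,
                 stream_kbps: int) -> int:
    """Number of streams that fit in the bandwidth other traffic leaves free."""
    if stream_kbps <= 0:
        raise ValueError("stream bit rate must be positive")
    free_kbps = max(link_kbps - other_traffic_kbps, 0)
    return free_kbps // stream_kbps


# Hypothetical: 1G bit/sec link, 400M bit/sec already in use, 300K bit/sec streams.
cap = stream_slots(1_000_000, 400_000, 300)   # 2000 concurrent viewers
```

As other traffic rises, the cap falls, which matches the varying limit the article describes.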
Delivering Web content
At
UPS, daily hits mount to an estimated 100 million during the holidays.
From Nov. 15 to Jan. 15, the company's Internet reliability
team meets daily to review performance issues. During such meetings,
the team will cover specific trouble tickets, such as a 5-minute
slowdown on the West Coast, says John Nallin, vice president of
IS at UPS. "On some of these, the problem is resolved before
we hear about it."
UPS
supports worldwide Web operations with two nearly identical data
centers in New Jersey and Georgia. It aims for capacity usage
at each location to hover at around 40%, so if one site fails
the other could handle the overflow, Nallin says. Factoring in
capacity for data center communications is the trick.
"It
used to be that you would buy a box to get a little extra capacity,
but now you have to buy two boxes for each data center and a box
for talking to both data centers," he adds.
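The reasoning behind a 40% target can be checked with a small calculation. This Python sketch assumes traffic from a failed site spreads evenly across the survivors, and it ignores the inter-site communication overhead Nallin mentions; the function name is hypothetical.

```python
def load_after_failure(per_site_pct: float, sites: int,
                       failed: int = 1) -> float:
    """Utilization (%) of each surviving site after failed sites' traffic moves over."""
    survivors = sites - failed
    if survivors <= 0:
        raise ValueError("at least one site must survive")
    return per_site_pct * sites / survivors


two_sites_at_40 = load_after_failure(40, sites=2)   # 80.0, still under 100%
two_sites_at_60 = load_after_failure(60, sites=2)   # 120.0, survivor overloaded
```

Running two sites at 40% leaves the survivor at 80%, with the remaining margin available for the extra communication load.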
The
company uses a virtual Domain Name System scheme to direct traffic
between the data centers. Load balancing is handled at the ISP
and again before the firewall. The Web servers run mainly on Sun
Solaris machines, but also on a few NT servers. Applications such
as package tracking and signature capturing run on IBM AS/400
servers and 15 IBM 3090 mainframes at two locations. "With
100 terabytes of data, you don't run on dinky Unix boxes,"
Nallin says.
One
shift per day, UPS operates one data center from the other for
practice. This way, if a snowstorm in New Jersey keeps workers
at home, any problems at that data center can be handled by the
engineers in Georgia, for example, Nallin says.
Keeping ahead of capacity
To
measure capacity and monitor performance, UPS uses a combination
of reports from Keynote Systems and data from a custom-built application
that tracks application and proxy server performance.
"We
traditionally see a growth in activities of about 100% per year,"
Nallin says. "We see it incrementally as you go from quarter
to quarter and half year to half year. We always have staging
and provisioning equipment come in [throughout the year]."
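Growth of about 100% per year implies a noticeable step every quarter. The Python sketch below assumes the growth compounds at a steady rate, which is a simplification of what Nallin describes; the starting load is in arbitrary, hypothetical units.

```python
def projected_load(current: float, annual_growth: float,
                   years: float) -> float:
    """Load after `years`, assuming growth compounds at a steady annual rate."""
    return current * (1.0 + annual_growth) ** years


today = 100.0                                      # arbitrary units
next_quarter = projected_load(today, 1.0, 0.25)    # about 118.9
next_year = projected_load(today, 1.0, 1.0)        # 200.0
```

Under that assumption, capacity has to grow by roughly 19% each quarter just to keep utilization flat.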
EOS
Bank, an online-only bank, plans capacity based on business growth
projections. The bank builds for triple its customer count, which it
expects to reach 300,000 within the next three years. "We
won't run out of capacity before we can add more," says
Roy Henderson, president and CEO.
Outsourcing
was cheapest for EOS. It houses its back-end servers at Exodus Communications'
El Segundo, Calif., facilities. The Web server applications are
operated by Home Accounts, a spinoff of credit card processor FDR.
Its servers are connected via multiple T-1 lines to prevent any single point of failure.
But
the best plans can still result in problems. When Penn State brought
its grade-checking system online about three years ago, it had
a vendor build load-balancing software to equalize traffic across
multiple servers. The custom code worked poorly, and the servers
began crashing. "We had two students sit in front of the
servers to restart them if they crashed," O'Connor says.
Needless
to say, Penn State threw out the third-party application. Now
it uses IBM's Enterprise Network Dispatcher.
Planning for performance
Many
bottlenecks can be attributed to bad system configurations between
products from different vendors. Keynote's Savoia says a
company could spend $10 million on a system that acts like a 500-MHz
Pentium connected to the Internet via a 28.8K bit/sec modem. "The
solution could be as simple as making a configuration change in
a Cisco router and getting back up to full power," he says.
For
Monster.com, this means testing applications with Segue Software's
SilkTest regression testing software before anything goes live.
"We like to do a basic performance smoke test before we put
an application into production," Farrey says. "It helps
to get a warm-and-fuzzy feeling that the application can handle
the traffic."
Still,
try as you might, you can't control all load-related problems.
"You don't have control over all the components that
control your service levels," UPS's Nallin says. "AT&T
and UUNET could be having problems, but to the end user it looks
like you. There's no control over that."
But
if you've built for plenty of capacity and have redundancy
at the critical junctures, then you've done more than simply
hope for the best.