
What makes Harvard’s net tick

Mar 06, 2006 | 13 mins
Data Center | Security

An interview with Jay Tumas, who oversees the network at Harvard University.


Harvard’s data network supports 125,000-plus users, its Border Gateway Complex routes about a half-million IP addresses and the network carts around 150TB to 200TB of data per day. Jay Tumas, who oversees the operations center at the heart of the network, recently gave Executive News Editor Bob Brown a peek behind the scenes.

Give me a thumbnail sketch of Harvard’s network.

The Harvard Core Network (HCN) serves an extremely diverse user population in metro Boston and beyond. We have everything from dual Gigabit Ethernet feeds serving the entire Harvard College network with tens of thousands of clients and a Class B chunk of address space, to a channelized T-3 circuit serving remote affiliates in Washington, D.C., or a T-1 serving a remote library repository in central Massachusetts. The [University Information Systems] NOC [network operations center] is the primary maintenance organization for the Northern Crossroads (NoX), New England’s Internet2 aggregation point, which serves 1 million-plus users.

With a scope encompassing close to 1,000 buildings, we solicit advice from all connecting members to solidify customer demarcs, network ownership and funding models. The 120-plus connecting members may manage their own LANs and data centers, or they may have outsourced everything from network maintenance to Windows client updates to us.

I sometimes hear people refer to organizations such as Harvard as having networks that are like phone company networks. Given your background at New England Telephone, is Harvard’s network really like a phone company’s?

Data networks in the ’90s were notoriously undocumented, with physical plants that looked more like spaghetti than anything an institution would want to trust with its critical data. Harvard and other research and medical institutions began to realize that this network, which was quickly becoming part of their critical infrastructure, was largely an unknown quantity, and that this had to change. So it did.

You will now find that many institutions have carefully documented their physical plants, with tools ranging from GIS systems linking underground conduits to fiber inventories, to tags on each end of every fiber and copper cable in their production networks. These cycles can be expensive, especially when faced with the daunting challenge of documenting and inventorying a large-scale production network such as Harvard’s, but they are cycles well spent that prove invaluable as we all strive to make our networks as physically robust as our routing protocols are logically robust.

Harvard has lately been exhibiting another telco trait – considering the network part of the university’s critical infrastructure.

As such, its construction is considered during the initial planning phases of building renovations, new construction and campus-expansion projects. The data networks being built today, at Harvard and similar institutions, are built to host a variety of IP-based traffic. Nearly every physical-plant control device, whether it be a security camera, chilled-water valve actuator or parking-garage card reader, is being designed to work with the IP network. There is no better way for the network to provide ROI to the university than to provide a robust, high-availability piece of physical infrastructure that not only supports the data communications requirements of the research and academic communities but also serves as a platform that fosters convergence of other plants’ control and communications requirements.

What lessons did you take from your time at New England Telephone that you’ve been able to apply at Harvard?

I learned how to maintain a robust network. Here are a few concepts that I brought with me:

  • A test lab. The telcos had Bellcore (now Telcordia) to ensure their rollout of critical infrastructure went smoothly. You need a lab, too. There is no better way to ensure your architecture or code upgrades proceed smoothly than to have your own lab environment to test your future configurations. It’s best not to cheap out when selecting lab equipment either. You should build a lab that mirrors your production environment to ensure you are comparing apples to apples. A great way to accomplish this is to use your network spares in your lab. This keeps your spare chassis and blades hot, so you know they are good, and ensures that you are testing with configurations compatible with your production environment.
  • Document everything. This includes assets, processes and procedures. The telcos realized this early on and documented everything from proper office etiquette of the day to power-plant maintenance in a voluminous set of manuals called the Bell System Practices. You don’t need to go to those extremes, but a document containing current architecture descriptions, maintenance procedures, hardware inventory and access procedures is a good start. We started the NOC document about nine years ago. While its roughly 160 pages cover the bulk of our operational processes, vendor contacts and other information vital to supporting the HCN both on and off hours, there is always more that can be added. You’ve got to match the sheer size of the document with what your staff can keep current.
  • Organize your plant. No one did this like Ma Bell. Through the thousands of COs [central offices], tens of thousands of frames and cross-connect systems, and probably millions of miles of cross-wire, a trained CO technician can go to any CO and put his finger on any circuit in the building. This feat was brought to us by an inventory system called TIRKS (Trunk Inventory Record Keeping System). In data networking there is little opportunity to keep a system that complex. However, you should demand that everything is inventoried and labeled. I have moved the NOC data center twice in 10 years. The first move we performed out of the back of our pickup trucks, so we skimped in that realm; however, we were sure to improve our plant structure by installing overhead cable trays, well-designed data cabinets and cable-management systems. The last move improved our data-center plant organization even more with multilevel, under-the-floor cable trays and strict cable-installation, tie-down and tagging requirements. We even invested in glass 2-by-2 floor tiles so we can display the results.
  • Exercise your DR architecture. Perform real-world power-failure scenarios to test your power backup, whether it is emergency power supplied by the building infrastructure or a room or rack UPS system. Disconnect the commercial power and allow the emergency power source to handle the production load as it would in the event of an emergency. Make sure you know how long your emergency sources will supply power to your network equipment, and keep in mind that as you add blades to those chassis the amount of time an [uninterruptible power supply] will be able to power the attached gear can significantly decrease. Also, if you have a DR plan for your data center that includes a remote data center linked back to campus, ensure that you simulate or estimate actual server loads on your connecting infrastructure.
  • Keep your customers informed. Come up with agreed-upon notification procedures for your internal and external customers in the event of a network outage, or if emergency maintenance is required and the network will be unstable during a particular window. If you have a customer portal, archive the events so they can be accessed by all who may need to correlate some sort of local failure or access problem to a core network outage. 
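The runtime arithmetic behind the DR advice above is simple enough to sketch. Here is a rough estimate with hypothetical battery and load figures – real UPS runtime curves are nonlinear, so treat this as a planning sanity check, not a substitute for the live power-failure test described:

```python
# Back-of-the-envelope UPS runtime estimate. All figures are hypothetical;
# real runtime curves are nonlinear, so this only approximates the trend.

def runtime_minutes(battery_wh, load_watts, efficiency=0.9, derate=0.8):
    """Estimate minutes of runtime: usable stored energy over drawn power."""
    usable_wh = battery_wh * derate    # aged batteries hold less charge
    draw = load_watts / efficiency     # inverter losses raise the draw
    return usable_wh / draw * 60

# A 1,440 Wh UPS carrying a 400 W rack, before and after adding two blades
# (each assumed to add ~150 W):
before = runtime_minutes(1440, 400)
after = runtime_minutes(1440, 700)
print(f"{before:.0f} min -> {after:.0f} min")  # -> 156 min -> 89 min
```

The point of the exercise matches the bullet: adding blades to a chassis can nearly halve the runtime the same UPS delivers.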

How do you gain visibility into what’s going on in a network of this size?

We have long used SNMP to poll network interfaces, counting the octets crossing them to build real-time bandwidth-capacity graphs that serve as a baseline for measuring our overall network use.

This data serves as an auditing tool every time we bring in a vendor with the latest and greatest network accounting suite, because if the application can’t detail actual network resource usage, then the rest of its space-age graphics and modeling capabilities are useless.
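The counter arithmetic behind such baseline graphs can be sketched in a few lines. The sample values and 300-second interval below are illustrative, not Harvard’s data:

```python
# Sketch: deriving a bandwidth figure from two SNMP ifInOctets samples.
# ifInOctets is a 32-bit counter (RFC 1213 / IF-MIB), so it wraps at 2^32;
# the modulo handles a single wrap between polls.

COUNTER_MAX = 2**32

def bits_per_second(octets_t0, octets_t1, interval_s):
    """Convert two octet-counter readings into an average bps figure."""
    delta = (octets_t1 - octets_t0) % COUNTER_MAX  # octets since last poll
    return delta * 8 / interval_s                  # 8 bits per octet

# Two samples taken 300 seconds apart (a common polling interval):
rate = bits_per_second(1_000_000, 76_000_000, 300)
print(f"{rate / 1e6:.1f} Mbit/s")  # -> 2.0 Mbit/s
```

Graphing tools of the era (MRTG and its descendants) do essentially this on every poll, which is why raw octet counters make such a reliable audit baseline.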

To complement our locally developed, SNMP-based tool kit, we use commercial applications that rely on other data sources to get at our overall network usage:

  • QRadar from Q1 Labs – It serves as our primary network-traffic anomaly-detection system. It uses flow-based knowledge gained from live traffic surveillance performed out-of-band and presents a real-time analysis of active threats on the network. It’s also intelligent enough to interface with our NOC Portal: when a network administrator logs into the portal and sees that our infrastructure indicates possible compromised systems on a local network, he can log into QRadar and observe all network traffic specific to his address space. QRadar also presents anomalous data in reference to total traffic, so it can double as a traffic-accounting system displaying utilized resources across the network.
  • Peakflow SP from Arbor Networks – Our primary traffic-capacity planning tool, it derives its information from NetFlow traffic data generated by the University Border Gateway Complex. I look to this app for customer bandwidth statistics across my border. It does an outstanding job of slicing Layer 3-7 traffic data, which assists greatly when customers wonder, ‘What does my network’s traffic profile look like?’ Its traffic-engineering capabilities are enhanced by the fact that the application acts as a [BGP] peer to the university border. This allows for target [autonomous system] analysis, so when it comes time to look at commercial ISPs, we can make sure we are selecting a carrier that best serves Harvard’s network community.
  • Orion from SolarWinds – This Web-based, network fault-management system collects data from SNMP-enabled devices across our network and provides an accurate, low-cost view into it. It pairs nicely with our SNMP-generated traffic graphs and presents us with a wealth of vital info, such as CPU and memory use, configuration info and interface-specific traffic stats.

We gain all this visibility with out-of-band management architectures, using a variety of vehicles to get at the traffic data. Nothing should be placed in the packet’s path that’s not absolutely necessary.

How much of what you’re using to manage and secure the network is built in-house vs. bought from vendors?

About 50/50.

Give me a few examples of homegrown tools, how you’re using them and why they beat what’s available commercially.

SNMPoll is our primary network-monitoring and alerting system. It’s a simple Perl program that uses topology-aware SNMP polling for ifOperStatus and sysUptime from more than 450 network devices and 1,500 interfaces every minute. If an anomaly is discovered, the appropriate engineers are alerted via an e-mail to their Treo 650s. The alerting e-mail contains a secure Web link, allowing engineers to quickly request additional information related to the event. The alerts also contain a live link to an application called MobileNOC, a Treo-fied version of the NOC Portal specifically for [speeding] information queries and remote troubleshooting. SNMPoll relies on another program, SNMProwl, to do core networkwide topology discovery. A variety of shell scripts and applications use SNMProwl’s data for other purposes, such as automatically building a private DNS zone for easy management of all core router and switch interfaces. Another Perl program, d3m0n, monitors other SNMP objects of particular interest. They include UPSs, environmental probes, BGP sessions, critical routes, data-center content switches; power, fan and temperature in our chassis; interface errors and anything else we feel the need to poke at to improve service delivery.
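The polling loop that this description implies can be sketched roughly as follows. The `snmp_get()` stub stands in for a real SNMP library (Net-SNMP bindings or similar), and the device names and statuses are invented for illustration:

```python
# Minimal sketch of an SNMPoll-style check: compare each interface's
# ifOperStatus against the previous poll and flag up->down transitions.
# snmp_get is injected so a real SNMP library (or a test stub) can supply it.

UP, DOWN = 1, 2  # ifOperStatus values from the IF-MIB

def poll_once(devices, snmp_get, last_state):
    """Run one polling pass; return (new_state, alerts)."""
    alerts = []
    new_state = {}
    for device, ifindex in devices:
        status = snmp_get(device, f"ifOperStatus.{ifindex}")
        key = (device, ifindex)
        new_state[key] = status
        if last_state.get(key) == UP and status == DOWN:
            alerts.append(f"{device} if{ifindex} went DOWN")  # page/e-mail here
    return new_state, alerts

# Simulated pass: interface 3 on an invented router drops between polls.
fake_oids = {("core-gw", "ifOperStatus.3"): DOWN}
state = {("core-gw", 3): UP}
state, alerts = poll_once([("core-gw", 3)],
                          lambda d, oid: fake_oids[(d, oid)], state)
print(alerts)  # -> ['core-gw if3 went DOWN']
```

Carrying the previous poll’s state forward is what keeps the alerting edge-triggered: a second pass over the same down interface produces no duplicate page.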

PacketFence is an open source, network-based solution to the problems posed by open academic networks. It provides passive or in-line operation, network registration, worm/bot detection/isolation, user-directed mitigation and proactive vulnerability scans. Its lineage can be traced to another utility called Mousetrap, a set of Perl scripts developed by the UIS Network Security Team to trap users via scope manipulation. The scripts worked quite well until the summer of 2003. As the Blaster and Nachi worms rampaged through the residential networks of academic institutions around the world, and infection rates within many residential networks approached 80%, we realized something more was necessary. In September 2003, PacketFence was born. After one year of continuous development, it was recently open-sourced and is in production on several large academic networks. PacketFence operates by manipulating the address resolution protocol cache of client systems.
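The ARP-cache technique described above can be illustrated by constructing – not sending – a spoofed ARP reply, the kind of frame that rebinds a gateway IP to a quarantine host’s MAC in a client’s cache. All addresses here are invented, and transmitting frames like this belongs only on a network you operate:

```python
import struct

# Sketch of the ARP-manipulation mechanism: build a raw Ethernet II + ARP
# "is-at" reply (RFC 826 layout). This only assembles the 42 frame bytes;
# it deliberately does not put them on the wire.

def arp_reply(sender_mac, sender_ip, target_mac, target_ip):
    """Build an unsolicited ARP reply frame as bytes."""
    eth = target_mac + sender_mac + b"\x08\x06"      # dst, src, EtherType=ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)  # Ethernet/IPv4, op=reply
    arp += sender_mac + sender_ip + target_mac + target_ip
    return eth + arp

quarantine_mac = bytes.fromhex("020000000001")  # locally administered MAC
gateway_ip = bytes([192, 168, 1, 1])            # IP being impersonated
client_mac = bytes.fromhex("020000000002")
client_ip = bytes([192, 168, 1, 50])

frame = arp_reply(quarantine_mac, gateway_ip, client_mac, client_ip)
print(len(frame))  # -> 42
```

Once a client’s cache holds the bogus mapping, its gateway-bound traffic lands on the quarantine host instead – which is what lets a tool in this family isolate an infected machine without touching switch configuration.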

Our Critical Alerts DashBoard Security Event Manager gives local network security administrators better overall visibility by delivering archived and real-time security data from the core network [intrusion-detection system], border anomaly-detection systems and centralized infrastructures. The admin sees a graphical representation of the subdomain address space that changes dynamically with the “temperature” of the security environment. Just like at a telco – red is bad, and green is good. There’s also a recent-alerts listing and an interactive graph displaying overall alert volume for your networks.

Finally, our NOC portal was developed primarily to streamline customer-service delivery and enhance the information-sharing capabilities of all these other management and accounting tools. Customers use their university logons to access the portal. Depending on who they are, they see a unique view giving them access to the tools and information they need to manage their organizations’ network presence. Everything from current network-equipment installation standards to an access-control-list/firewall-ruleset maintenance interface is available for their use. All of our vendor-supported network management systems are portalized.

Getting personal: Jay Tumas
Organization: Harvard University
Title/job responsibilities: Hired in 1996 as network operations manager for the University Information Systems (UIS) Network Operations Center (NOC), the ISP for the university’s 100-plus departments, faculties and affiliates. Primarily responsible for management of the design, around-the-clock maintenance and operation of the Harvard Border Gateway Complex and Core Data Network. He also serves as the Network Security and Incident Response Team manager and the Longwood Medical Area technical sub-committee chair.
IT staff size: Manages 18 reporting network staff across five network operations center groups: Network Engineering and Planning, Network Security and Incident Response, Systems and Services, Triage and Converged Services.
IT budget: Undisclosed
Previous jobs: From 1984 to 1986 he worked for Aritech, a maker of motion-detection/security systems, as an electro-discharge technician. In 1986, he worked for Bose as an automation engineer. In 1987, he started at New England Telephone as a central office technician and finished as an operations manager in technical support.
If I wasn’t in IT, I’d probably be…: Into robotics and artificial intelligence.
Last good business book read: Dan Brown’s “Digital Fortress”.
Fun facts: He’s a private pilot (in training) and enjoys skiing; lives with his wife and two daughters on a lake in New Hampshire; has a 240-mile round-trip commute to work; his father worked as a telephone switchman.