Five years ago Clemson University named James Bottum chief information officer and gave him the mandate to overhaul the school's IT infrastructure and build out a high-performance computing environment. The goal: catapult the school into a leading research university and help attract faculty and students.
"Last year the Clemson president told us our best years of public sector funding from the state were most likely behind us because of the financial crisis, and we needed to rethink our business model," Bottum says. "The encouragement was to become entrepreneurial."
Fortunately, many of the changes Bottum's team made properly positioned Clemson for the new normal. The university has seen 180% growth in revenue from external sources, which helps supplement the school's IT budget, and a 250% increase in federal grants, part of which helps offset IT costs.
"The main goal is to continue to run and support a robust set of services and infrastructure for Clemson University," Bottum says, "but do it in a way where we can grow and leverage what we're doing and create a stronger set of infrastructure and services that also contributes to the state economic development."
Bottum has unique qualifications that are helping get it all done. He spent 20-plus years in the research sector, including a stint at the National Science Foundation, then 15 years at the National Center for Supercomputing Applications, and for the last 10 years he has been a CIO (at Purdue before this).
Bottum's team at Clemson has a lot of recent achievements to be proud of, but they also get to investigate leading-edge stuff, everything from the huge HPC grid to new OpenFlow tools and the school's own Orange File System. It's a rich environment.
When Bottum arrived at Clemson, the school had 48 IT groups, each of which had its own servers and storage and many of which ran their own networks.
"I saw a departmental IT person in a room with fans blowing on a server," he says. "All of the high-performance computing was in a little data center in the engineering science college. They had about six or seven clusters but didn't have enough juice to power them all up at the same time. It was a real belt and suspenders kind of operation, a cluster in the closet model."
A couple of other surprises: The university was buying commodity 100Mbps Internet service at a much-inflated price from local telecom companies, and the school had a large data center 10 miles off campus with expansion potential to 30,000 square feet. The former meant the university could make a big leap forward by joining Internet2, and the latter was going to make it easier to aggregate the IT operations and modernize.
While the initial funding for the overhaul would come from the school itself, the new HPC capabilities attracted new monies along the way and Clemson won many grants, including an NSF Research Infrastructure Improvement Award.
Job one was rehabbing the data center and the Information Technology Center, and aggregating most of the IT groups and resources. The building was 20-plus years old and was upgraded in two phases.
"We had 7,000 or 8,000 square feet of space, half a megawatt, and 20-something-year-old power and air conditioning when I got here," says CTO Jim Pepin, who came over from the University of Southern California (USC). "We went up to 2 megawatts and filled that up in less than two years as we consolidated operations and started to build our HPC cluster."
From left to right in front of the HPC cluster: Jay Harris, director of operations; Boyd Wilson, executive director of computing, systems and operations; Mike Cannon (front), data storage architect; Jim Pepin (back), CTO; Lanae Neild, HPC administrator; Becky Ligon, file system developer. (Photo by Zac Wilson)
The first phase ended in December 2007, and in the second phase, which was completed in December 2010, the data center space was built out to 16,000 square feet and split between two environments, one for enterprise gear -- everything from email and student systems to a mainframe to support the state's Medicaid system -- and the other for the HPC system, a 1,629-node Linux cluster. "So now we have two physically separate rooms with different air conditioning profiles and 4.5 megawatts," Pepin says.
Connectivity was increased from the 100Mbps connection serving the university to multiple 10G fiber wavelengths to Charlotte, N.C., and Atlanta, which are used to access Internet2 and link to partners and other universities. "We're also building out multiple 10G wavelengths around the state," Pepin says. Together these links -- and access to the National LambdaRail -- enable Clemson to connect to national infrastructure, allow other state institutions to access Internet2 through Clemson, and provide nationwide access to the Clemson HPC cluster and other collaborative resources.
The school also now has two gigabit connections on the National Higher Education Network to Pepin's former employer, USC, where Clemson has three racks of backup gear for disaster recovery. "No money changes hands, but I have rack space in California and they have rack space here and it makes their data center look like an extension of mine and vice versa," Pepin says. "That's the model we're looking at building, where the network is the basic building block of how we can connect these things together."
Demand for HPC
The cluster -- what the group sometimes refers to as a cloud -- is one of the crown jewels.
"We're not building some generic Joni Mitchell cloud," Pepin says. "Not some vanilla, virtualized, blah, blah, blah. There's all of that stuff inside, but it's much more comprehensive, it's a much richer texture than that. We're building a cloud that is really infrastructure and services so we can actually do science with national labs and other people in the state."
The massive 1,629-node cluster is a combination of Dell, IBM, HP and Sun gear (mostly four FLOPs Intel/AMD architecture). Each node is a physical server with two sockets holding quad core processors, meaning eight cores per device and a total count of 14,304 server cores.
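As a rough check on these figures, the arithmetic can be sketched in Python. The node, socket and core counts come from the text; the 2.33GHz clock rate is an assumed placeholder, not a figure from Clemson, and "four FLOPs" is read here as four floating point operations per core per clock cycle.

```python
# Figures from the article.
NODES = 1629
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
FLOPS_PER_CYCLE = 4  # reading "four FLOPs" as per core, per clock cycle

# Assumption for illustration only: the article gives no clock rate.
ASSUMED_CLOCK_HZ = 2.33e9


def total_cores(nodes: int, sockets: int, cores_per_socket: int) -> int:
    """Core count if every node had the same socket and core layout."""
    return nodes * sockets * cores_per_socket


def peak_teraflops(cores: int, flops_per_cycle: int, clock_hz: float) -> float:
    """Theoretical peak, in trillions of floating point operations per second."""
    return cores * flops_per_cycle * clock_hz / 1e12


cores = total_cores(NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET)
peak = peak_teraflops(cores, FLOPS_PER_CYCLE, ASSUMED_CLOCK_HZ)
print(cores)  # 13032
print(round(peak, 1))
```

A uniform eight-core layout gives 13,032 cores, fewer than the 14,304 quoted, which may mean some nodes carry more than eight cores; the article does not say. The peak figure is a theoretical ceiling under the assumed clock rate, not a benchmark result.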
Nodes are interconnected using a combination of 88 10G Ethernet ports from Arista and Cisco, and 3,008 ports of low-latency 10G Myrinet network technology from Myricom. Four 16-port, 4Gbps QLogic Fibre Channel switches are used to support storage needs.
The servers aren't virtualized because the jobs supported are typically numerically intensive and very high performance. "So this is more of a grid than a cloud," Pepin says. "We call it a cloud because it's the shared resources model, but we run it like a grid you would see at one of the national labs."
All told, the cluster, with its latest nodes, will benchmark above 100 trillion floating point operations per second, making it about 90th on the list of the fastest supercomputers in the world.
The open source Maui Cluster Scheduler is used to allocate cluster resources -- which are allotted by the cores required -- but some users are guaranteed access to specific resources at specific times in condominium fashion.
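The allocation model described here, with resources counted in cores and some users holding guaranteed shares, can be sketched roughly in Python. This is a toy illustration, not Maui's configuration or API; the `Job` class, the `allocate` function, the owner names and all core counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    owner: str
    cores: int


def allocate(total_cores, reservations, jobs):
    """Start jobs in arrival order, counting resources in cores.

    reservations maps an owner to cores held back for that owner
    (the "condominium" share). All other demand draws on the shared pool.
    Returns the list of jobs that could be started.
    """
    reserved_left = dict(reservations)
    shared_left = total_cores - sum(reservations.values())
    started = []
    for job in jobs:
        own = reserved_left.get(job.owner, 0)
        from_own = min(own, job.cores)       # use the owner's share first
        from_shared = job.cores - from_own   # remainder comes from the pool
        if from_shared <= shared_left:
            if job.owner in reserved_left:
                reserved_left[job.owner] = own - from_own
            shared_left -= from_shared
            started.append(job)
    return started


# Hypothetical workload on a 100-core cluster with 32 cores reserved for lab_a.
jobs = [Job("guest", 60), Job("lab_a", 40), Job("guest", 30)]
started = allocate(100, {"lab_a": 32}, jobs)
print([(j.owner, j.cores) for j in started])
```

In this example the second guest job is not started because the shared pool is exhausted, while lab_a's job runs by combining its reserved cores with what remains of the pool. A real scheduler would also handle time windows, priorities and queuing, none of which is modeled here.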
Cluster usage has been tremendous, but Bottum had some trepidation going in. "One of the things I was afraid of was, if we spent this money and put up these capabilities, that nobody would come and use it," Bottum says.
Turns out he didn't need to worry. "In a state like South Carolina where no public institutions were on Internet2, if you build something like this you start attracting attention," Bottum says. "The one thing I did that you could construe as marketing was speak at a South Carolina IT Directors meeting in Charleston. They wanted to know what we were doing, so I threw out the idea of building a South Carolina cloud, an environment for shared services, and told them if they were interested to sign up at the door."
Half a dozen signed up. "We then went and we got some capital from various sources, including private and federal, and tried to stand this HPC thing up under the rubric of what we call the Cyber Institute. And that allowed us to have a neutral ground for bringing in researchers and other parties and not run this out of the IT organization. We were bootstrapping it out of IT but it gave us a way to think about it and not just break the backs of people who had more than full-time jobs to do. We now have about a dozen universities -- and even a high school -- that have allocations on high-performance computing."
Since then Clemson has held high-performance computing workshops around the state, many of which attract 70 or more people. "There's this sort of pent-up demand," Bottum says.
Today cluster utilization rates run at 80%-85% and often peak above 90%. "In the cluster world, this is incredible," Bottum says.
Clemson NOC: Used to monitor and control the local and wide area networks and the research, education and business computing systems, including the cluster. (Photo by Zac Wilson)
OrangeFS and OpenFlow
Of course the cluster is also core to a lot of work the university is doing, including development of a parallel virtual file system and work on OpenFlow, one of the highest-level projects to come out of the Global Environment for Network Innovations (GENI).
After trying several popular file systems for Clemson's cluster, researchers determined they needed higher performance and greater reliability, says Boyd Wilson, executive director of computing, systems and operations. The result: revival of development work on the open source Parallel Virtual File System (PVFS), now called OrangeFS, with the original architect, Clemson faculty member Walt Ligon. Ligon is working with a Clemson spin-off company called Omnibond that is providing commercial services for the file system.
In the Clemson cluster, OrangeFS is used to virtualize 32 commodity Dell storage servers while providing a single name space for the cluster nodes, Wilson says. Directory and file metadata are distributed on 1.6TB of solid state drives across the 32 storage nodes and there is a total of 256TB of raw rotational disk storage.
Unlike other high-performance file systems such as Lustre, which can only have a single metadata server, OrangeFS' distributed metadata approach and unified name space enable the file system to scale nicely while also simplifying operations, Wilson says.
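The idea of spreading metadata across servers can be illustrated with a small Python sketch. It is not OrangeFS's actual placement algorithm, which the article does not describe; the hashing scheme, the `metadata_server_for` function and the file paths are assumptions chosen for illustration. Only the count of 32 storage servers comes from the text.

```python
import hashlib
from collections import Counter

NUM_SERVERS = 32  # storage servers in the Clemson cluster, per the article


def metadata_server_for(path: str, num_servers: int = NUM_SERVERS) -> int:
    """Map a file path to the index of the server holding its metadata.

    A stable hash lets every client compute the same answer on its own,
    without asking a single central metadata server.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


# Hypothetical paths: see how 10,000 files spread over the servers.
paths = [f"/scratch/project/run{i}/output.dat" for i in range(10_000)]
load = Counter(metadata_server_for(p) for p in paths)
print(len(load), min(load.values()), max(load.values()))
```

A plain modulo scheme like this one would remap most files whenever a server is added or removed, so a file system that supports resizing would need a more careful placement strategy.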
These capabilities may ultimately benefit enterprise computing environments. "With a unified name space across potentially hundreds of storage nodes, you can add and remove nodes as needed and customers won't notice their files moving or ever have to be pointed to a new storage location," Wilson says. "Your unstructured data stores can grow and resize and be redundant and you won't have all of these different little silos of data. So it holds some potential to become an enterprise computing solution a couple of years down the road."
One Clemson researcher, Sebastien Goasguen, is using OrangeFS to develop a cloud-based infrastructure that can launch and work with tens of thousands of cluster-based virtual machines at once. "It leverages OrangeFS by enabling you to have a shared high-performing file system between all cluster nodes," Wilson says.
Goasguen is collaborating with KC (Kuang-Ching) Wang to build software-defined networks between VMs and client machines using OpenFlow, "which represents a nice convergence point with the university's work on OpenFlow," he says.
Clemson is one of seven collaborators with Stanford on the initial OpenFlow deployment. OpenFlow started out as a tool to facilitate network research by adding an open, centralized, software-defined layer of network routing, but it now promises to "change the whole way we think about networking," Wilson says. "A lot of people are realizing they would like more software-based control over their network infrastructure. ... You can do some really neat stuff."
For example, while it isn't too painful for Clemson to shift IP addresses from its main data center to a smaller center on campus because they share subnets, doing that over long distances and across multiple locations becomes extremely difficult, Wilson says. OpenFlow should vastly simplify the task by allowing dynamic networks to be created and changed not only at the infrastructure level but also at the application level, opening up significant opportunities for improvement in network flexibility and security.
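The match-and-action idea behind software-defined forwarding can be sketched in Python. This is a toy model, not the OpenFlow protocol or any controller's API; the `FlowTable` class, its methods, the port names and the IP addresses are all hypothetical.

```python
from typing import Dict, Optional


class FlowTable:
    """Toy switch table mapping a destination IP to an output port."""

    def __init__(self) -> None:
        self._rules: Dict[str, str] = {}

    def install(self, dst_ip: str, out_port: str) -> None:
        """Add or change a rule, as an imagined central controller would."""
        self._rules[dst_ip] = out_port

    def forward(self, dst_ip: str) -> Optional[str]:
        """Return the output port for a packet, or None if no rule matches."""
        return self._rules.get(dst_ip)


table = FlowTable()
service_ip = "192.0.2.10"  # documentation-range address, made up for this sketch

# The service starts in the main data center.
table.install(service_ip, "port-to-main-datacenter")
before = table.forward(service_ip)

# The controller redirects the same IP to another site by rewriting one rule;
# the address that clients use does not change.
table.install(service_ip, "port-to-remote-site")
after = table.forward(service_ip)

print(before, "->", after)
```

The point of the sketch is that the move is expressed as a software update to forwarding rules rather than as a renumbering of hosts. Real deployments match on many more packet fields and push rules to many switches at once.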
While it is unclear when and if Clemson will be able to profit from work on OpenFlow, it is already profiting from OrangeFS and other software that is licensed through Omnibond Systems, Wilson says. For example, companies interested in OrangeFS can purchase a 10-server bundle from Omnibond with support for $45,000.
Other Clemson work that Omnibond licenses includes identity management tools (including drivers for Novell's Identity Manager) and even traffic vision technology that state transportation departments can use to help turn roadside video feeds into sensors.
While the license fees help offset Clemson IT costs, the work also helps attract and keep really good people, Wilson says.
As important as the HPC cluster is, if it goes down, "researchers understand that's the way life goes," says CTO Pepin. "If the enterprise side goes down, we get fired. It's a smaller portion of the computer electrical power but 90% of the pain, so we care deeply about it."