Bank scores with server virtualization

They say old habits die hard. It's an adage that's certainly true for ICICI Bank's senior GM and Group CTO, Pravir Vohra. As a man who was part of the team that popularized online banking and helped create a new revenue stream for ICICI Bank, Vohra is already known as an IT leader who can make a difference. He's also celebrated as a CIO who not only leverages new concepts and technologies to create first-mover advantages for his organization, but also adopts solutions at such an unprecedented rate and scale that it advances his bank beyond the reach of its peers. Even solution and service providers have found it hard to keep up.

About four years ago, for instance, ICICI Bank was one of the first, at the scale it operates, to successfully leverage enterprise-wide data warehousing and business intelligence. Now the second largest bank in the country has scored another first with technology, this time with server virtualization.

ICICI Bank's IT team, led by Vohra, has used virtualization to arrest an electronic infrastructure spill-over at its datacenters. They consolidated 230 physical servers to just five, running a little under 650 applications on a virtualized environment. It required them to develop an unparalleled technical ability to run 60 virtual machines on a single server, but it saved the bank over a crore annually in power, cooling and space.

The result? While the server count of its closest competitors runs into four or five digits, ICICI Bank services its customers with just a fraction of that. That's incredibly low for a bank of its size with assets amounting to Rs 384,970 crore (US$7899), and with 1,400 branches and 4,530 ATMs across the country.

Big, Real Big

The business problem ICICI Bank forever grapples with lies at the core of its standardized Windows NT architecture. Any application typically requires a Web tier, an application tier and a database tier -- it's a necessary evil. "Now if somebody asks for a development environment, add three more. Move onto a testing environment, add another three servers. So even if you are deploying something as simple as a library management system, you have to take nine servers into account. At ICICI Bank, we run about 650 applications. Go figure," says Vohra.
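Vohra's arithmetic can be sketched in a few lines. The snippet below is an illustration only: it assumes the worst case he describes, in which every application gets a dedicated physical server for each tier in each of three environments. Calling the first environment "production" is an assumption, and the function and constant names are hypothetical. It is not a count of the bank's actual servers.

```python
# Illustrative worst-case server count, assuming one dedicated physical
# server per tier per environment (an assumption, not the bank's inventory).
TIERS = ("web", "application", "database")
ENVIRONMENTS = ("production", "development", "testing")


def servers_needed(applications: int) -> int:
    """Physical servers required if nothing is shared or virtualized."""
    return applications * len(TIERS) * len(ENVIRONMENTS)


print(servers_needed(1))    # one library management system: 9
print(servers_needed(650))  # all 650 applications: 5850
```

Even if only a fraction of the applications had full development and testing environments, the count grows with every new application, which is the sprawl Vohra describes.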

Running that many applications has a domino effect. It demands an ongoing investment in servers, power consumption, rack space and switching gear, because all these servers need to be interconnected to storage and networking sub-systems for management, availability and recoverability. "We were actually worried that we were ending in a server or an electronic sprawl," he says. "It is a kind of an exponential problem. We were not utilizing our servers properly but had to keep them because some development or some testing could happen. Let's say that without virtualization, I'll provide a server to run a library, holiday home and collection applications on the same server. But if you run a user acceptance testing (UAT) environment at the same time, you'll have problems. The world has found a way of consuming the manufacturing it manufactures," Vohra says.

The problem wasn't new. Though the problem piled up over time, the bank's IT team had only experimented with different technologies from time to time to seek an effective solution. But a couple of years ago, they started looking at a solution in earnest. "We found an embryo of a solution that we believed could work and improve over time to adequately arrest server, rack and power sprawl. We considered it to be workable enough to start dabbling with. It was a struggle for us," recalls Vohra.

Vohra refers to a two-year-old initiative that was fundamentally concentrated on server consolidation. Over the last year, the scope of the project has expanded to include other infrastructure consolidation, and an overall focus on reducing the bank's carbon footprint. But it's been a journey of discovery, he admits. "We can't take credit for scripting a story to a design principle. We found a way as we discovered new things and worked with different technologies. The idea was to improve our IT management capabilities and to reduce power and cooling consumption. It's about working around a theory of constraints," he says.

School of Hard Knocks

Vohra formed a core team of 12 who were part of the NT admin team in the shared services vertical that takes care of the bank's datacenters. The team ran a few proof-of-concepts, and started by virtualizing environments that were lower on the showstopper scale.

Vohra points out that of the 650 applications, there are about 200 that nobody would even notice if they were shut down for a day. For example, a one-day outage of applications such as ATM cash analysis or dead-stock inventory MIS generation would not raise any eyebrows.

But as the team started testing in live and more critical environments, they set high-water marks for the thresholds of running applications in a virtualized ecosystem. About 14 months ago, the team managed to run about 51 virtual machines on a single physical server. "We were trying to figure out what we were running out of: compute resources, I/O bandwidth or memory? We'd take say a server of 4-CPUs with 8-cores, running Windows and run a mixed load of applications on 51 virtual machines. Not only did we break Sun Microsystems' record of running 50 VMs on a server but we topped it. We touched a figure of 60 virtual machines on a single server. Of course, we later determined the optimal threshold at about 35 virtual machines," says Vohra.

As the proof-of-concepts succeeded, turnaround times for the project were defined, allotted and rolled out. The turnaround times for the identification and redressal of problems were monitored closely. "With technology, it is very easy to say that something doesn't work. It is much harder to make it work. Obviously, it takes effort to make something hard work. But the problem in such cases, is that you don't know what you are going to do but you discover what you need to do. You do it and take the next step," he says.

As the team scrambled forward with its server virtualization push, it had to pick its way through numerous technical challenges that surfaced. High CPU and memory utilization led to frequent performance degradation, which was in turn compounded by network bottlenecks. This resource issue was addressed by using dynamic memory and CPU allocation to avoid creating performance chokepoints. Patching and upgrading to higher versions were also undertaken to overcome various technical limitations.

"You run into a choke and after some analysis you realize that the internal disks are not good enough or you need a higher I/O bandwidth pipe. Or you might find that the machine is running out of memory for no logical reason. The physical machines you're virtualizing may add up to only 32GB of RAM, while on a target machine you have 64. Since we were pioneers in implementing such a solution at this scale, there were no easy answers available. Not even with our solution providers. We understood the theoretical concepts well, but we became experts by living through all the live classrooms," he recalls.

The Smaller They Are, the Rarer They Fall

Today, ICICI Bank runs about 40 virtual machines on a server, with VMware virtualizing the environments of database servers running SQL instances; application servers such as WebSphere, Pramati and Oracle; and Web servers. Vohra explains that, as a strategy, the current implementation has been executed only on 8-CPU dual-core, 64GB RAM servers so that over-commitment of memory and CPU resources can be leveraged and VMware is able to scale up instead of scale out, taking full advantage of the bank's licenses.

To decrease the use of multiple network cards, the servers have been moved to the same subnet as the NAS storage. This way, the same network card can be virtualized and deployed. This also ensures that connectivity to the storage through iSCSI is consistent and there are not too many hops.

"You can now over-commit resources. If I really needed 24 cores to do something spread across 30 applications I can now give them two cores each. That is a total of 60 cores but physically I have only 24," says Vohra. The logic is that not all the applications peak at the same time. Some of these systems allow resources to be over-committed beyond the boundaries of the physical box.
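The over-commitment in Vohra's example can be expressed as a ratio of allocated virtual cores to physical cores. The following is a minimal sketch that uses only the numbers from his quote; the function name is hypothetical and no VMware API is involved.

```python
def overcommit_ratio(vms: int, vcpus_per_vm: int, physical_cores: int) -> float:
    """Ratio of virtual cores handed out to physical cores available."""
    return (vms * vcpus_per_vm) / physical_cores


allocated = 30 * 2                   # 30 applications, two cores each
ratio = overcommit_ratio(30, 2, 24)  # only 24 physical cores in the box
print(allocated, ratio)              # 60 2.5
```

A ratio above 1.0 is sustainable only because the applications do not all peak at the same time.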

The required disk space on the home server has been provisioned on the connected iSCSI and Fiber Channel-based storage to meet the requirements of hosted VMs. I/O bottlenecks have been avoided by segregating storage connectivity on different network interfaces, says Vohra. This requires separate network cards for individual storage connectivity.

The virtualization effort forced various processes to be re-examined and improved. It has translated into speedy provisioning that takes no more than two minutes. This has directly reduced the average downtime of all the virtualized applications. Earlier, though the bank's IT team could provision five servers as standby for 30 servers, it took three hours to bring up those servers in case of a failure. Each server had to be manually configured, loaded and restored. And if the incident occurred at 2AM, it could take as much as five hours to bring them up.

Now, with automatic provisioning and over-commitment in place, running applications can fail over seamlessly and automatically. Features such as V-Motion have been employed to transfer applications to a higher-configuration slice. Such innovative online fallback mechanisms have led to zero downtime.

Virtual machine slices with the requisite operating system configurations have been created on virtualized disk space. The cloning feature of such VM slices helps in the rapid provisioning of resources when they are required. Downtime has been minimized by provisioning alternate servers with the V-Motion feature for auto failover of the entire system to another base server or for individual virtual machine failover.

Though the business is exposed to all 650 applications, not all the applications have been virtualized, says Vohra. There are a few applications (running on 900 servers) that are too critical and too monolithic to be put on a virtualized environment. Applications such as the core banking system and credit card applications would show no advantage even if they were virtualized, as they need power-packed servers to run in any case. "You don't do it for religion. You do it only if it makes business sense. Anything that doesn't require super-sized servers has been virtualized. All the new applications also are being virtualized. Only about 20-odd applications are running on very old servers. We will either retire them or have them virtualized eventually. They are part of the last mile of the journey," he says.

Such technological advancements have made an impact on the resources and skill sets in ICICI Bank's shared services team. They need to stay abreast of new technologies. It doesn't, however, affect the application development team. As long as they see a server name and an IP address, and have local admin rights to the server, they don't know whether that server translates into a pizza box or waferware, Vohra says.

Vohra maintains that given the amount of money a CIO needs to sink into a project like this, it had better make sense and a CIO had better believe in what he or she is doing. At ICICI Bank, once the proof-of-concepts were successfully executed, there was no doubt over what would work and what would not.

But Vohra warns of peripheral things a CIO can never test, unless they get their feet wet. "We took a considered view. If some of these don't run, we were comfortable that we had the ability to work with our partners to get upgrades or patches to make them run. When you are a pioneer, you are bound to trip up. But if your relationships are strong, then your partner will also work with you and solve your problems," says Vohra. ICICI Bank had quarterly targets of how many net servers were de-inducted. Payback was how many physical servers were sold for scrap or sent for recycling.

"In the end, we saw clear business payback. Business may or may not see it because for them it is just an event. They will see results only when an application goes down. In the life of a business manager, it will happen only three times. If it is 4AM, he is not bothered. But if it is at 10AM, and if it is a trading application, he would kill you for even 10 minutes of downtime. When an incident happens, a 3-hour or a 30-minute outage hurts business equally. Applications ran reasonably smoothly earlier but now they don't see any outage at all," he points out.

Pocket Power
