Riddle this: All your core performance metrics are glowing green, but customers on the other end of the network are still cursing your service. How can IT get to the root of the problem? Network World Editor in Chief John Dix put the question to two experts: Tony Davis, vice president solution strategy at CA Technologies, who once faced that challenge working for FedEx on the company’s web site and now shares what he learned with CA customers, and Jimmy Cunningham, senior manager tech support enterprise monitoring, BlueCross BlueShield of South Carolina.
We’re here to talk about improving customer experience so Tony, why don’t we start with you because you talk a lot about the concept of Business Service Reliability. Explain what that’s all about.
DAVIS: Business Service Reliability is a top-down approach to how IT should operate compared to traditional models. It is an actual execution strategy that emphasizes the reliable creation, production care and feeding, and mathematical measurement of a business service. It then takes that business service and translates it into real time customer experience. We provide solutions to automate this transformation strategy, but the secret sauce is in the methodology used to implement the strategy and the automation and measurements used to verify success.
How did you get involved with BlueCross?
DAVIS: The CA Account Executive for BlueCross BlueShield South Carolina had heard about the concepts of Business Service Reliability and stepped up to provide the program at no charge as a way to ensure long term success with the investments they had made in CA solutions. I first met with the executive in charge of the enterprise strategy for reliability, and he introduced me to Jimmy who was leading the initiative and we have been partnering ever since.
So Jimmy, when Tony plugged in you were already neck deep in an effort to improve customer experience by getting a better handle on the performance of core services. Explain that effort.
CUNNINGHAM: Five years ago we didn’t have a unified approach to monitoring system performance or system availability. Our CIO, Steve Wiggins, used to say we were flying blind in that we would deploy our applications and often find out from customers that we had pieces and parts that were broken. He was tired of customers telling us we had something wrong before we knew it was wrong, so he decided to form the Enterprise Monitoring System (EMS) group to rectify that, pulling in people from around the company. One of the first things our group did was take control of the main monitoring tools that existed inside BlueCross. Prior to that every group would buy and deploy tools as they saw fit, meaning another group could buy a similar tool and deploy it.
We were tasked with consolidating the tools and, at the same time, we invested in CA’s Customer Experience Manager (CEM) and Introscope, both of which are now part of the company’s Application Performance Management product, to augment and replace some existing tools. CEM does customer experience monitoring and allows us to monitor HTTP(s) traffic and see the elapsed time and experience our customers have as they use our websites and desktops. Introscope is a Java deep dive analysis tool that works in tandem with CEM to provide detailed metrics on the services and programs supporting our applications and desktops. We also use Introscope to monitor MQ, MQ Broker, DB2 calls to the host, and more. When we had all of that in place our group built standards around the tools and worked with various levels of management to figure out how to deploy the tools in a holistic fashion to help monitor applications.
What was the stated goal?
CUNNINGHAM: We had two main goals. One was to improve our mean time to resolution (MTTR). We wanted to be able to find problems within our applications or our infrastructure and be already working on them when the customer called so we could tell the customer we’re already working on it and have an estimated completion time. And whether we were minutes ahead of the customer call or hours ahead of the customer call, the important thing was being ahead.
Then at the same time, we planned to marry the monitoring group with capacity planning so we could improve our mean time between failures (MTBF). As the monitoring team gathered data and fed it to the capacity team, we could start doing predictive troubleshooting, by saying, “OK, you’re starting to have a problem here. If you address it now you might not fail.” And that way we can increase the time between when our applications failed.
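To make the two metrics concrete, here is a minimal Python sketch of how MTTR and MTBF could be computed from an incident log. The incident timestamps and the function names `mttr` and `mtbf` are illustrative assumptions, not BlueCross data or part of any CA product.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 1, 6, 9, 0), datetime(2024, 1, 6, 10, 30)),
]


def mttr(incidents):
    """Mean time to resolution: average of (restored - detected)."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents):
    """Mean time between failures: average uptime from one restore
    to the next detected failure."""
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

Shorter resolution times pull MTTR down, while catching problems before they become failures stretches the gaps that feed MTBF.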
When you started to dig in and pursue that first goal to identify problems early, did you find about what you expected to find?
CUNNINGHAM: One of the things we discovered was, in areas that had monitoring tools they owned, they would do what we call focused monitoring. For example, Server A would have a problem so they’d put a monitor on Server A, but Server B wasn’t having the problem so they didn’t put it on B. They focused the monitoring where something had occurred to try to stop it from occurring again.
And when we came in, one of the things we said was, “If you’re going to have it on A, you might as well have it on B. They’re mirrors of each other. If it’s going to happen to A, it could happen to B, so let’s get the big picture.”
At BlueCross BlueShield of South Carolina we have a fairly healthy virtual machine environment. We’re one of the biggest zLinux [Linux compiled to run on IBM mainframes] shops in the world (top 1% in the world according to IBM), so we have tons of guests running on mainframes and all of our data lives on the mainframe. Anything that starts off on a webpage or a desktop has to go to the host to get its data. So we cross a lot of infrastructure.
And if that’s a 10-step process, what we found was people had deployed monitors on three of the steps, and the other seven were, “Well, they work so we don’t really need to know about them right now.” We came in and said, “OK, tell us the 10 important steps and we’ll watch all 10 steps. That way we’ll let you know as soon as something happens.” Again, my group is building automated monitors, we’re not actually sitting at our desks watching things. We’re building the automated monitoring solutions to feed our support people.
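The coverage gap described above can be sketched in a few lines of Python. The step names and the set of monitored steps are hypothetical placeholders. The interview does not list the actual steps in any BlueCross transaction path.

```python
# Hypothetical 10-step transaction path, from web page to host and back.
steps = [
    "web_server", "load_balancer", "app_server", "auth_service",
    "message_queue", "message_broker", "host_gateway", "db2_query",
    "response_assembly", "page_render",
]

# Hypothetical starting point: monitors exist on only three steps.
monitored = {"web_server", "app_server", "db2_query"}


def coverage_gaps(steps, monitored):
    """Return the steps in the path that have no monitor, in path order."""
    return [step for step in steps if step not in monitored]


gaps = coverage_gaps(steps, monitored)
print(f"{len(gaps)} of {len(steps)} steps unmonitored: {gaps}")
```

Listing the full path first and then checking each step against the deployed monitors is what turns focused monitoring into end-to-end coverage.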
How many people are in the group and where did they come from?
CUNNINGHAM: We have 10 in Monitoring and five in Capacity Planning. We got a couple of people from the infrastructure group and we got some of our team from what we call LCAS (Leveraged Core Application Systems), and they’re the ones that code and maintain the apps on the non-host part of the environment. And we got a couple of people from the host side. We were trying to get some experience from the different silos so everybody would be represented.
Was it clear how you were going to reach your goals?
CUNNINGHAM: It was a staged process. The first stage was to get visibility into our big apps. Our CIO went through our app list and said, “This is your A priority, this is your B priority, this is your C priority.” And we had just gotten CEM in, so one of the first things we did was start instrumenting CEM to watch the A Priority apps as they came into our system. So CEM started to give us visibility into those apps. Then we started to work with our customers to say, “OK, what do you want to know about this app?” That was stage one. That took about a year.
Stage two was happening in the background as we were developing our standards for holistic monitoring. We went to internal IS customers, and asked, “Tell me about your app, tell me the important pieces. Tell me where they live, tell me how you use them, so we can deploy our entire toolset and watch your app as holistically as possible.”
We started monitoring the heck out of stuff, generating thousands of tickets that went to the support areas. We went from flying blind to flying in a snowstorm. That was our “let’s monitor everything” phase. We were ticketing everything. I mean anytime the system hiccupped, we’d ticket. So if you could picture it, we had a little ball of monitoring, it mushroomed up huge, and then we settled back down to somewhere in between where we can say, “Now we’re monitoring your important pieces. We know which domino is the main one and if it falls something has happened,” and we are continuing to refine that process today.
Tony, is that common when customers add a lot of instrumentation, they initially get buried?
DAVIS: I would say so. And that’s sort of my mission. Instead of going down a path that puts you into a snow blind situation, what if we design your monitoring around the core business services? That will inherently cut out some of that noise.
So Jimmy, how long did it take to tone down the noise level so you could actually make some progress on the important stuff?
CUNNINGHAM: The first year we were deploying monitors like crazy, so we were constantly adding to the snowstorm. We were killing our support people, and we realized we couldn’t keep that up. So we brought together our top app development managers and our app support guys and a couple of infrastructure support guys and said, “OK, how do we make this better? What do we do to give you on-time, relevant information that helps you put your finger on problems and send your guys off to the right spot to fix them?” And we started refining our overall process of how we gathered requirements.
How did you achieve it?
CUNNINGHAM: We went to the systems experts and the top support guys for each app, and between the three of us, figured out how to refine the requirements gathering process so the monitoring data and output allows them to jump in there and fix something before it actually stops working, or as quickly as possible after it stops. The EMS team works with those two areas to set what specifically should be monitored and the monitoring thresholds.
Is everything fully instrumented at this point?
CUNNINGHAM: No, never is. We’re constantly modifying and growing as apps are developed. Five years ago we broke our list down into A, B and C, and we’re in the B's now.
How far along are you in terms of the monitoring system integration processes? Is that done?
CUNNINGHAM: Yes. We originally had CA’s CEM and Introscope and we upgraded to CA APM to ensure the performance and availability of business-critical applications, transactions and services, as well as the end-user experience of customers that access our online services. At the same time, we bought CA Cross-Enterprise APM to gain 24/7 monitoring of business transactions on the mainframe. By providing CA APM with this data on a single pane of glass we now have true cross-platform APM monitoring, and that was the final piece of the puzzle that stitched everything together. Because prior to that, we had a lot of tools in the non-host world, and once you hit the host monitoring sort of disappeared. You threw it over to the host and you know stuff happened on the host and you know you got an answer back, but that was about it. The host itself is well monitored, but there was no integration between what was happening there and what was happening in the non-host world. So getting that cross-platform APM monitoring set up gives us that bridge, and that’s been huge for us.
Tony talks a lot about user experience. Do you look at it from that end as well?
CUNNINGHAM: Absolutely. Internally people were saying, “Nobody’s complaining.” And our CIO would come back, “Well, just because they are not complaining doesn’t mean everything is great, we have to find out what their experience is like, and measure it so we can figure out how to improve it.” And that was one of the things he wanted to fix. He wanted to know what their experience was, so we could make the experience better. So when they do call and complain, we know it’s a legitimate complaint because generally they’re happy with us.
What goes into that calculation of user experience?
CUNNINGHAM: Primarily it’s response time. From the time they click until they get their answer back, what was that time frame? But we also divide our tickets into three categories: We generate availability tickets – how much were we available? Reliability tickets -- was the whole app available and responsive? And capacity tickets. So in generating those tickets and gathering the metrics, we can determine how available we were, how reliable we were, and did we have enough capacity to meet demand?
Capacity performance is a key metric because BlueCross is a low-cost claims processor so we don’t have a whole lot of extra MIPS and I/O lying around. We try to run as lean as possible and we’re always asking, “Are we meeting our requirements without having a whole lot of resources just sitting around idling?”
Tony, coming back to you -- given you work with companies in different industries, can you compare and contrast what Jimmy is talking about here to what you’re seeing in other shops?
DAVIS: When I was onsite doing some work with Jimmy and his team, one thing we stressed was the need to identify the important business services -- which, by the way, are always in the language of your customer. So for one of Jimmy’s online presences, we identified things such as claims as a business service, and then under claims, checking your eligibility, and under eligibility, checking the service you need. Those are business services. So Jimmy is actually tying all of the technical jargon about transactions moving through the enterprise back up into the business service it impacts.
My role at CA is to move all of our customers in that direction, because that’s where you get value out of paying all these millions for software. It’s not just by looking at lights. It’s by understanding how you’re impacting a business service. So as I look across my clients, I would put Jimmy and what he’s doing at BlueCross BlueShield of South Carolina at the upper end of the maturity scale. He’s approaching a high maturity phase. The majority of my customers are not there yet. The majority are clamoring to understand customer experience, understand the impact to business services.
Jimmy, you mentioned connecting this monitoring effort to capacity planning. Can you expand on that?
CUNNINGHAM: One of the things we did early on when we brought the tools together under the EMS umbrella was figure out how to extract data, because each of the monitoring tools stored their own data. Since we have a host that is really good at crunching numbers, we said, “OK, let’s figure out how we can pull data out of these tools and feed them into DB2 tables on the host.”
So we started pulling data from the monitoring tools once a week. We would pull key metrics out and work with App Dev, App Support and Capacity Planning to say, “OK, if you were going to capacity plan this app, what would you need to know? What data is important?” So you pull pieces out of each of the tools, feed it up to the host, Capacity Planning would load their DB2 tables and then they’d use business objects to start tracking how the app is performing. They would look at the number of users, how fast it responded, the CPU it took to run on this server, on that server. Here’s how much memory it took. Here’s how much storage in the background was spent doing this eligibility work as an example.
So with this more holistic view of monitoring they would see a jump from 100 users a day using an app to 150 users a day, and eligibility going from 90 users to 140 users, which means maybe we need to start to plan an increase in resources, storage, etc. So it’s that feeding back of data that helps improve the MTBF statistic, helps keep your apps up and happy for longer periods of time.
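The kind of trend check described here can be sketched as a simple growth calculation in Python. The user counts come from the example above. The function names and the 25% review threshold are assumptions chosen for illustration, not a rule BlueCross is known to use.

```python
def growth_pct(previous, current):
    """Percentage growth from the previous value to the current value."""
    return (current - previous) / previous * 100


def needs_capacity_review(previous, current, threshold_pct=25.0):
    """Flag a metric whose growth meets or exceeds the (assumed) threshold."""
    return growth_pct(previous, current) >= threshold_pct


# Daily user counts from the example: the app overall, then eligibility.
app_growth = growth_pct(100, 150)
eligibility_growth = growth_pct(90, 140)

print(f"App users grew {app_growth:.1f}%")
print(f"Eligibility users grew {eligibility_growth:.1f}%")
print("Review capacity:", needs_capacity_review(100, 150))
```

Both figures in the example represent growth of 50% or more, which is the sort of jump that would prompt planning for additional resources before the app starts to fail.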
Was that something you recognized you would be able to do when you started this up, or was it something that came to you after you’d been at it for a while?
CUNNINGHAM: When he put the two groups together, our CIO Steve Wiggins said, “I want you to lay the groundwork today because you will mature to the point where we can actually use the capacity data to feed back into monitoring.” And as we grew into it and as we got more experience in the monitoring team, we looked at it and said, “The guy knew what he was doing, because now this stuff is valuable.”
We haven’t hit on cloud yet. Are you using cloud today or will you?
CUNNINGHAM: Yes, we are. We’ve already embraced virtualization, so now it’s a matter of putting the right pieces in place to be able to truly call it a cloud. So yes, we are working towards the cloud and monitoring is an integral part because, if you’re going to allow customers to auto-deploy resources, you’ve definitely got to be able to report on what they’re doing and how they’re using it, and keep your finger on the pulse of your system and the resources that are available. For the past year we’ve been working diligently to formalize and put in place all the processes and procedures needed to truly be a cloud.
How will that complicate what you have achieved through this EMS effort?
CUNNINGHAM: There’s going to be a little bit of complication application-wise, but not so much. If our application deployment on any particular app is fairly mature, then it stays in place cloud-wise. Now our infrastructure monitoring has to be accurate. It has to be on time, and it has to be automated so that, as infrastructure is deployed within the cloud the monitoring is auto-deployed and auto-reporting back, doing the feeds to the monitoring system and the feeds to capacity planning as we deploy.
How about your take on that, Tony? As companies stitch in cloud resources, how does that change the whole effort here?
DAVIS: I hate to sound like a banging drum, but as long as all the technical information is based upon the business services delivery, it seems to work out. I’m not trying to minimize the technical effort, but in my case at FedEx.com about five years ago, we went from a non-virtualized environment to a fully virtualized environment, and even though there was technical effort involved, it did not pose problems that were daunting. And I think the reason that was the case is we already had instantiated our business service program. So the movement to cloud, the movement to total virtualized and hybrid-type environments, will take some technical effort to continue the momentum that folks like Jimmy are making, but I do not see it as a major challenge or a daunting task.
All right. Well, anything else that we didn’t touch on that you think is important to get out there?
DAVIS: A little shout-out here to Jimmy. Where I see failure in implementing business services, there’s no leadership support. In other words, leadership is just, “Oh, yeah, we want customer experience.” But they don’t understand how to drive to business services. In this case company leadership has made this a top priority that we’re not going to just focus on monitoring. We’re going to do everything based on business services.