As noted in the last newsletter, application management is notably more difficult than fault management. That follows in part because when a router fails it is usually pretty obvious. When an application degrades, however, it is usually not at all obvious. Our research indicates that in roughly three-quarters of the instances in which an application is degrading, the degradation is noticed first by the end user and not by the IT organization.
One of the reasons that application management is so difficult is that when the performance of an application begins to degrade, any component of IT could be the cause. This includes the network, the servers, the database and the application itself. Unlike fault management, which tends to focus on one technology and one organization, diagnosing the cause of application degradation therefore crosses multiple technology and organizational boundaries. In general, most IT organizations do not have a track record of efficiently solving problems that cross those boundaries.
The problems cited in the preceding paragraphs lead to an MTTR (Mean Time to Repair) for application management that is often measured in days or weeks. This is in sharp contrast to the MTTR associated with traditional fault management, which is typically a few hours or less.
We discussed the issue of MTTR with a network engineer for a financial services organization. He stated that when a user calls in and complains about the performance of an application, a trouble ticket is opened. He said: “The [MTTR] clock starts ticking when the ticket is opened and keeps ticking until the problem is resolved.” In his organization the phrase “the problem is resolved” has a couple of meanings. One is that the user is no longer impacted. Another is that the source of the problem has been determined to be an issue with the application. In the latter case, the trouble ticket is closed and the organization opens what it refers to as a bug ticket.
The network engineer added that in some cases, “The MTTR can get pretty large.” He estimated that roughly 60% of application performance issues take more than a day to resolve. In cases where the MTTR is getting large, his organization forms a group that is referred to as a Critical Action Team (CAT). The CAT is composed of technical leads from multiple disciplines who come together to resolve difficult technical problems.
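To make the engineer's definition concrete, the following is a minimal Python sketch of how MTTR could be computed when the clock runs from the moment a ticket is opened until it is resolved. The ticket timestamps, the function names and the one-day threshold are illustrative assumptions, not data or tooling from the organization described above.

```python
from datetime import datetime, timedelta

# Hypothetical trouble tickets as (opened, resolved) pairs.
# These timestamps are made up purely for illustration.
TICKETS = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 13, 0)),  # 4 hours
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 4, 9, 0)),   # 2 days
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 8, 21, 0)),  # 3.5 days
]


def repair_times(tickets):
    """Elapsed time per ticket: the clock starts at open and stops at resolution."""
    return [resolved - opened for opened, resolved in tickets]


def mean_time_to_repair(tickets):
    """Average elapsed time across all tickets."""
    durations = repair_times(tickets)
    return sum(durations, timedelta()) / len(durations)


def share_over(tickets, threshold=timedelta(days=1)):
    """Fraction of tickets whose repair time exceeds the threshold."""
    durations = repair_times(tickets)
    return sum(d > threshold for d in durations) / len(durations)


if __name__ == "__main__":
    print("MTTR:", mean_time_to_repair(TICKETS))
    print("Share taking more than a day:", round(share_over(TICKETS), 2))
```

Note that the result depends on which meaning of “resolved” stops the clock. If a ticket is closed once the problem is handed off as a bug ticket, the measured MTTR will be shorter than the time the user actually waits for a fix.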
In the next newsletter we will continue to discuss the factors that drive the MTTR for application performance issues to be so large, and suggest some steps that IT organizations can take to reduce it. In the meantime, more information can be found here.