Should IT operations be event-driven or data-driven?

What is the right path to cost-effective service quality?

An overview of events

The essence of IT Operations Management for the last 30 years (ever since the advent of distributed systems in the mid-1980s) has been to understand what is “normal” and what is “abnormal,” and then to alert on the anomalies. Events are anomalies.

Events come from an incredible variety of sources. Every element of the stack (disks, storage arrays, network devices, servers, load balancers, firewalls, systems software, middleware, services, and applications) is capable of sending events.

Events tend to come in two broad forms: hard faults and alarms related to failures in the environment (this disk drive has failed, this port on this switch has failed, this database server is down), and alerts that come from violations of thresholds set by humans on various monitoring systems.

An overview of metrics (data)

The operation of any IT environment can also be characterized by metrics or data. There are thousands of metrics across any kind of complex hardware and software stack, and the important ones can be boiled down into the following categories:

  • Capacity – how much capacity of each type exists. This covers free storage capacity, free network bandwidth, available memory on servers, and available CPU resources across the environment.
  • Utilization – how much of your capacity of each type is being used at each point in time. Trends in utilization are important for understanding when you will run out of each kind of capacity.
  • Contention – for which key resources are applications and processes “waiting in line.” CPU Ready in a VMware environment tells you what the contention is for virtual and physical CPU resources. Memory swapping can indicate contention for memory resources. I/O queues at the storage layer indicate that the storage devices may be saturated.
  • Performance – this is a crucial point. Performance in abstracted environments (virtualized and cloud-based environments) is NOT resource utilization. Performance is how long it takes to get things done. So performance equals response time at the transaction level and latency at the infrastructure level.
  • Throughput – these metrics measure how much work is being done per unit of time. Transactions per second at the transaction layer, and reads/writes per second at the network and storage layers are good examples of throughput metrics.
  • Error rate – these metrics measure things like failed transactions and dropped network packets.
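As a rough illustration of how a utilization trend can indicate when capacity will run out, here is a minimal sketch that fits a straight line to daily utilization samples. The function name `days_until_full`, the daily sampling interval, and the linear-growth assumption are all illustrative choices, not something any particular monitoring product prescribes.

```python
from typing import Optional, Sequence


def days_until_full(used_pct: Sequence[float],
                    limit_pct: float = 100.0) -> Optional[float]:
    """Estimate days until utilization reaches limit_pct.

    Assumes one sample per day and roughly linear growth (both are
    simplifying assumptions). Returns None when no estimate is possible.
    """
    n = len(used_pct)
    if n < 2:
        return None  # not enough history to fit a line

    # Ordinary least-squares fit of utilization against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(used_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking usage: no exhaustion predicted

    intercept = mean_y - slope * mean_x
    day_at_limit = (limit_pct - intercept) / slope
    # Days remaining, counted from the most recent sample.
    return max(0.0, day_at_limit - (n - 1))


# Storage that grows two points per day from 60% has 16 days left.
print(days_until_full([60, 62, 64, 66, 68]))  # 16.0
```

Real utilization is rarely this linear, so a forecast like this is best treated as an early warning rather than a precise date.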

Today’s state of affairs

Most enterprise IT Operations teams today find themselves in the event-driven camp. Many teams are stuck with legacy event management systems that were invented in the era of the mainframe. Modern teams are now evaluating a new generation of event management systems that use natural language processing, advanced machine learning techniques, or AI. But no matter how sophisticated your event management system is, you will still face the following issues:

  1. All of the events that come from the myriad of monitoring tools are based upon manually set thresholds. The problem with this is that those thresholds are set differently by different humans, making these alerts a very inconsistent source of data.
  2. There is nothing that relates these events to each other before they are sent to the event management system. That leaves the event management system to try to correlate what is related to what after the fact.
  3. The entire event management process is reactive and after the fact. By its very nature it does not start until after an alarm has been received, which means that it does not start until after the problem has started to occur.
  4. Tuning the thresholds to not miss anything (no false negatives) and to not get overwhelmed with false alerts (false positives) is a massive challenge.
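The tuning problem in the last point can be made concrete with a small sketch. It compares a manually chosen static threshold with a baseline learned from recent data. The function names, the five-sample window, and the three-standard-deviation rule are illustrative assumptions, not the method of any specific product.

```python
from statistics import mean, stdev
from typing import List, Sequence


def static_alerts(samples: Sequence[float], threshold: float) -> List[int]:
    """Indexes of samples that exceed a manually chosen threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def baseline_alerts(samples: Sequence[float], window: int = 5,
                    k: float = 3.0) -> List[int]:
    """Indexes of samples more than k standard deviations away from
    the mean of the preceding `window` samples."""
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma > 0 and abs(samples[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds with one spike.
latency_ms = [20, 21, 19, 20, 22, 21, 60, 20, 21]

print(static_alerts(latency_ms, 100))   # [] : threshold too loose, spike missed
print(static_alerts(latency_ms, 20.5))  # [1, 4, 5, 6, 8] : too tight, mostly noise
print(baseline_alerts(latency_ms))      # [6] : only the spike
```

The baseline version still has parameters to choose, but they describe how unusual a value must be rather than a fixed number that each administrator sets differently.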

How can metrics (data) help?

In this era of big data, it is possible to combine and mine the data that measures the performance, throughput, contention, utilization, and error rate across the stack and get the following types of insights:

  • Where are the current hotspots in the environment? Where are the sources of contention for key resources that are likely impacting transaction and application performance?
  • What are the trends in contention? Where will there likely be an issue in the near future, and how can that issue be proactively avoided?
  • Can relationships between metrics help with root cause analysis? Advanced big data systems for IT Operations do not just capture metrics, but also capture the relationships between transactions and applications and where they run in the virtual and physical infrastructure.
  • Identifying zombie VMs and cloud images that are costing you money but not doing any useful work.
  • Communicating the service level status of crucial transactions and their supporting infrastructure to business constituents and application owners.
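To give one of these insights a concrete shape, the sketch below flags likely zombie VMs from averaged usage metrics. The `VmUsage` record, its field names, and the cutoff values are hypothetical; a real environment would pick its own metrics, observation period, and limits.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class VmUsage:
    """Average usage of one VM over an observation period (hypothetical)."""
    name: str
    avg_cpu_pct: float
    avg_net_kbps: float
    avg_disk_iops: float


def find_zombies(vms: Iterable[VmUsage], cpu_pct: float = 2.0,
                 net_kbps: float = 5.0, disk_iops: float = 1.0) -> List[str]:
    """Return names of VMs that are idle on CPU, network, and disk at once.

    The default cutoffs are illustrative assumptions only.
    """
    return [
        vm.name
        for vm in vms
        if vm.avg_cpu_pct < cpu_pct
        and vm.avg_net_kbps < net_kbps
        and vm.avg_disk_iops < disk_iops
    ]


fleet = [
    VmUsage("web-01", avg_cpu_pct=35.0, avg_net_kbps=900.0, avg_disk_iops=40.0),
    VmUsage("test-old", avg_cpu_pct=0.5, avg_net_kbps=1.0, avg_disk_iops=0.2),
    VmUsage("batch-02", avg_cpu_pct=1.0, avg_net_kbps=2.0, avg_disk_iops=15.0),
]
print(find_zombies(fleet))  # ['test-old']
```

Requiring all three metrics to be low avoids flagging a VM such as `batch-02`, which uses little CPU but is still doing disk work.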

Summary recommendation

Hard faults regarding the availability (or lack thereof) of critical elements of the hardware and software stack should clearly be sent directly to a modern event management system. However, for the crucial performance and throughput related metrics, a modern big data back end will allow those metrics to be analyzed in relation to one another, and ultimately help the event management system become much more accurate.
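The split recommended above could be sketched as a simple router. The class names `HardFault`, `MetricSample`, and `Router`, and the in-memory lists standing in for the event management system and the big data back end, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class HardFault:
    """An availability failure, e.g. a failed disk or a down server."""
    source: str
    message: str


@dataclass
class MetricSample:
    """A performance or throughput measurement."""
    source: str
    name: str
    value: float


@dataclass
class Router:
    # Plain lists stand in for the two destination systems (assumption).
    event_manager: List[HardFault] = field(default_factory=list)
    metrics_backend: List[MetricSample] = field(default_factory=list)

    def route(self, item: Union[HardFault, MetricSample]) -> str:
        """Send hard faults straight to event management and everything
        else to the metrics back end; return the chosen destination."""
        if isinstance(item, HardFault):
            self.event_manager.append(item)
            return "event_manager"
        self.metrics_backend.append(item)
        return "metrics_backend"


router = Router()
print(router.route(HardFault("db-01", "database server is down")))
print(router.route(MetricSample("db-01", "response_time_ms", 42.0)))
```

In this arrangement, any alerts derived from the metrics would be produced by the back end after analysis and then forwarded to the event management system, rather than being raised by per-tool thresholds.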

This article is published as part of the IDG Contributor Network.
