An overview of events
The essence of IT Operations Management for the last 30 years (ever since the advent of distributed systems in the mid-1980s) has been to understand what is "normal" and what is "abnormal," and then to alert on the anomalies. Events are anomalies.
Events now come from an incredible variety of sources. Every element of the entire stack (disks, storage arrays, network devices, servers, load balancers, firewalls, systems software, middleware, services, and applications) is capable of sending events.
Events tend to come in two broad forms: hard faults and alarms related to failures in the environment (this disk drive has failed, this port on this switch has failed, this database server is down), and alerts that come from violations of thresholds set by humans on various monitoring systems.

An overview of metrics (data)
The operation of any IT environment can also be characterized by metrics, or data. There are thousands of metrics across any complex hardware and software stack, and the important ones can be boiled down into the following categories:

Capacity – how much capacity of each type exists. This covers free storage capacity, free network bandwidth, available memory on servers, and available CPU resources across the environment.
Utilization – how much of your capacity of each type is being used at each point in time. Trends in utilization are important for understanding when you will run out of each kind of capacity.
Contention – for which key resources are applications and processes "waiting in line." CPU Ready in a VMware environment tells you what the contention is for virtual and physical CPU resources. Memory swapping can indicate contention for memory resources. I/O queues at the storage layer indicate that the storage devices may be saturated.
Performance – this is a crucial point. Performance in abstracted environments (virtualized and cloud-based environments) is NOT resource utilization. Performance is how long it takes to get things done. So performance equals response time at the transaction level and latency at the infrastructure level.
Throughput – these metrics measure how much work is being done per unit of time. Transactions per second at the transaction layer, and reads/writes per second at the network and storage layers, are good examples of throughput metrics.
Error rate – these metrics measure things like failed transactions and dropped network packets.

Today's state of affairs
Most enterprise IT Operations teams today find themselves in the event-driven camp. Many teams are stuck with legacy event management systems that were invented in the era of the mainframe. Modern teams are now evaluating a new generation of event management systems that use natural language processing, advanced machine learning techniques, or AI. But no matter how sophisticated your event management system is, you will still face the following issues:

All of the events that come from the myriad of monitoring tools are based upon manually set thresholds. The problem is that those thresholds are set differently by different humans, making these alerts a very inconsistent source of data.
Nothing relates these events to each other before they are sent to the event management system. That leaves the event management system to try to correlate what is related to what after the fact.
The entire event management process is reactive and after the fact. By its very nature it does not start until an alarm has been received, which means it does not start until after the problem has begun to occur.
Tuning the thresholds so that nothing is missed (no false negatives) and so that the team is not overwhelmed with false alerts (false positives) is a massive challenge.

How can metrics (data) help?
In this era of big data, it is possible to combine and mine the data that measures performance, throughput, contention, utilization, and error rate across the stack, and get the following types of insights:

Where are the current hotspots in the environment? Where are the sources of contention in key resources that are likely impacting transaction and application performance?
What are the trends in contention? Where will there likely be an issue in the near future, and how can that issue be proactively avoided?
Can relationships between metrics help with root cause? Advanced big data systems for IT Operations do not just capture metrics; they also capture the relationships between transactions and applications and where they run in the virtual and physical infrastructure.
Identifying zombie VMs and cloud images that are costing you money but not doing any useful work.
Communicating the service-level status of crucial transactions and their supporting infrastructure to business constituents and application owners.

Summary recommendation
Hard faults regarding the availability (or lack thereof) of critical elements of the hardware and software stack should clearly be sent directly to a modern event management system. However, for the crucial performance and throughput related metrics, a modern big data back end will allow those metrics to be analyzed in a related manner, and ultimately help the event management system become much more accurate.
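To make the "trends in utilization" idea concrete, here is a minimal sketch of how a trend on utilization samples can answer "when will we run out of this kind of capacity?" The function name, the linear-fit approach, and the sample data are all illustrative assumptions, not a specific product's method:

```python
def days_until_exhaustion(samples, capacity_pct=100.0):
    """Fit a linear trend to (day, utilization_pct) samples and
    estimate how many days remain until utilization reaches
    capacity_pct. Returns None if utilization is flat or shrinking.
    (Illustrative sketch: real capacity planning would account for
    seasonality and noise, not just a straight line.)"""
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    num = sum((d - mean_x) * (u - mean_y) for d, u in samples)
    den = sum((d - mean_x) ** 2 for d, _ in samples)
    slope = num / den                      # percentage points per day
    if slope <= 0:
        return None                        # no growth trend to project
    intercept = mean_y - slope * mean_x
    last_day = max(d for d, _ in samples)
    exhaustion_day = (capacity_pct - intercept) / slope
    return exhaustion_day - last_day

# Hypothetical weekly storage-utilization samples: (day, percent used)
history = [(0, 60.0), (7, 62.0), (14, 64.0), (21, 66.0)]
print(days_until_exhaustion(history))  # about 119 days of headroom left
```

The same shape of calculation applies to any of the capacity and utilization metrics above: memory, network bandwidth, or CPU headroom.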
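The inconsistent-thresholds problem described above has a common alternative worth sketching: instead of a hand-set static threshold, derive the "normal" band from the metric's own recent history and flag deviations from it. This is one generic approach (a mean/standard-deviation baseline), not the method of any particular event management product, and the names and data are hypothetical:

```python
from statistics import mean, stdev

def is_anomalous(history, sample, k=3.0):
    """Flag `sample` if it falls more than k standard deviations from
    the mean of recent history, rather than comparing it against a
    manually chosen static threshold. Requires >= 2 history points."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return sample != mu        # perfectly flat baseline
    return abs(sample - mu) > k * sigma

# Hypothetical response-time history in milliseconds
baseline = [102, 98, 101, 99, 100, 103, 97, 100]
print(is_anomalous(baseline, 101))  # False: within normal variation
print(is_anomalous(baseline, 250))  # True: well outside the baseline
```

Because the baseline is computed per metric, two teams monitoring different services no longer need to agree on a single hand-tuned number, which directly addresses the consistency problem raised above.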
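Finally, the point that "nothing relates these events to each other before they are sent" can be illustrated with a toy correlation step: group raw events whose sources support the same application and whose timestamps fall close together, so downstream tooling receives one related group instead of many isolated alarms. The topology map, event shape, and time-bucket approach are simplifying assumptions for illustration:

```python
from collections import defaultdict

def correlate(events, topology, window=60):
    """Group (timestamp, source) events by the application their
    source supports and by a coarse time bucket of `window` seconds.
    `topology` maps each infrastructure source to an application;
    unknown sources form their own group."""
    groups = defaultdict(list)
    for ts, source in sorted(events):
        app = topology.get(source, source)
        bucket = ts // window              # coarse time bucketing
        groups[(app, bucket)].append((ts, source))
    return dict(groups)

# Hypothetical events and a tiny topology map
events = [(10, "disk-7"), (25, "db-server-2"), (400, "switch-3")]
topology = {"disk-7": "billing-app", "db-server-2": "billing-app",
            "switch-3": "web-app"}
print(correlate(events, topology))
```

Here the disk failure and the database alarm land in one "billing-app" group because the topology relates them, which is exactly the relationship information the article notes is missing when events are correlated only after the fact.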