You’re supposed to meet someone for coffee. If they’re three minutes late, no problem, but if they’re thirty minutes late, it’s rude. Was the change from “no problem” to “rude” a straight line, or were there steps of increasing rudeness? Do we care why? A good reason certainly increases our tolerance. Someone who is always late reduces it.

Network performance follows many of the same dynamics. We used to talk about outages, but they have become less frequent. “Slow” is the new “out.” But how slow is slow? Do we try to understand the user experience and adjust our performance monitoring to reflect it? Or is the only practical answer to wait until someone complains?

A recent study by Enterprise Management Associates surveyed 250 network professionals. One of the questions asked what percentage of network performance issues were first reported by end users, rather than discovered by network operations professionals. The average answer was 39 percent, and the median was 35 percent. So, more than a third of the time (and much more often in some organizations), we don’t know about an issue until a user complains? We must do better!

The problem isn’t that we don’t get enough reports. Network operations teams are flooded with information, but too much information is little better than noise. We need to condense insight from the vapor of data (to paraphrase Neal Stephenson). But how do we do that?

The place to start is by defining network performance in terms that matter to the end user. The focus on end-user experience follows the old “tree falling in the forest” argument: if a problem has absolutely no impact on the end-user experience, now or later, is it still a problem? Unless we’re talking about IoT or specialized systems, the answer is no.

Once we know what matters, we can start filtering out what doesn’t.
A great resource for determining what matters is Google’s Site Reliability Engineering (SRE) team. This group has written a (free) book called “Site Reliability Engineering,” edited by Betsy Beyer et al. The book questions some of our traditional thinking about IT. When it comes to monitoring, one of its key concepts is what the team calls “The Four Golden Signals”: latency, traffic, errors, and saturation. (Other well-known approaches include Brendan Gregg’s USE Method and Tom Wilkie’s RED Method.)

Why do these Golden Signals matter for network performance? And how can you use them to guide your network performance monitoring strategy? Let’s dive into each one.

Latency

“Latency,” the delay in meeting requests, may be the most useful signal, if only because end users experience it so often. The user makes a request of a remote application. Nothing happens. Just when they’re about to retry, they get a response. They keep experiencing this latency for minutes at a time, but then it goes away and the application responds normally. Then it comes back. And goes away. How much of this mildly painful experience do they tolerate before they decide to create a trouble ticket?

If the network operations team can monitor latency, they can see the issue while the user is first experiencing it. But just seeing that latency is occurring isn’t enough. They must determine whether the latency occurs because the network is introducing delays or because the application server is responding slowly. Or are both happening at the same time? (A not infrequent occurrence.) Once that is determined, where exactly is the problem located?
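As a rough illustration of that split, a probe can time the TCP handshake separately from the full request/response cycle: handshake time mostly reflects the network path, while the remainder reflects the application server. A minimal Python sketch, using only the standard library (the host name in the example comment is a hypothetical placeholder):

```python
import socket
import time
import urllib.request

def tcp_connect_ms(host, port, timeout=5):
    """Network-side delay: time to complete only the TCP handshake."""
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.close()
    return (time.perf_counter() - start) * 1000

def full_response_ms(url, timeout=10):
    """End-to-end delay: time for a complete request/response cycle."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

# If the handshake is quick but the full response is slow, suspect the
# application server; if the handshake itself is slow, suspect the
# network path. Example (hypothetical monitored host):
#   print(tcp_connect_ms("app.example.com", 443))
#   print(full_response_ms("https://app.example.com/"))
```

A real monitoring system would take these measurements continuously from multiple vantage points, but even this crude split answers the first triage question: network or server?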
Knowing the answer to that is often enough to solve the problem.

Traffic

The next Golden Signal is “traffic,” defined by the Google SRE team as how many requests are occurring. A good way to monitor traffic on a network is to count the number of network conversations.

I know of a large enterprise that had a periodic problem on a network segment. Strangely, though, it didn’t correlate with any of the metrics they monitored. There was some alignment, but, frustratingly, not enough to establish the root cause. Traffic volume (in Gbps) would go up and the issue would occur more often, but not always. Time of day. Kinds of traffic. The most active servers. All of these corresponded only loosely to the issue. Finally, they started measuring the number of network conversations and found that as soon as it hit about 750,000 on a 10G link, a piece of their infrastructure hit the wall, no matter the type or amount of traffic. Knowing that, they solved the problem quickly.

Errors

Then there is the “errors” signal. Errors are more than just failed requests; think of the signal as standing in for the quality of the user experience. If you’ve ever been on a VoIP call that was very responsive but you still couldn’t easily understand the words being spoken, you’ve experienced low quality firsthand. But quality isn’t just an RTP (Real-time Transport Protocol) issue, even if that is where it is most obvious. Although we seldom see persistent data corruption, where some bit gets flipped in the payload, poor TCP (Transmission Control Protocol) quality can cause a host of problems: retransmits, dropped frames, even latency. And perhaps most importantly, errors are often a warning sign of an impending larger problem.

Saturation

The last Golden Signal is “saturation,” the amount of traffic (as opposed to the number of transactions).
Clearly, we want to take advantage of our network capacity, but we also need to allow for spikes in utilization. A saturated network can cascade into bizarre failure modes, where the error and retry messages add to the traffic, making the situation worse. The cycle escalates until enough transactions fail that the segment goes back to functioning again, until the pattern repeats.

As you can see, when evaluating how to manage network performance – both to support ongoing operations and to prepare for future digital transformation – the four Golden Signals can play a significant role. They allow us to get ahead of the cycle of waiting for trouble tickets and start managing the network proactively.

What challenges do you and your organization face when setting performance standards? Do you rely on other “signals”? If so, share them in the comments section so we can have an open dialogue.