You’re supposed to meet someone for coffee. If they’re three minutes late, no problem, but if they’re thirty minutes late, it’s rude. Was the change from “no problem” to “rude” a straight line, or were there steps of increasing rudeness? Do we care why? A good reason certainly increases our tolerance. Someone who is always late reduces it.
Network performance follows many of the same dynamics. We used to talk about outages, but they have become less frequent. “Slow” is the new “out.” But how slow is slow? Do we try to understand the user experience and adjust our performance monitoring to reflect it? Or is the only practical answer to just wait until someone complains?
A recent study by Enterprise Management Associates queried 250 network professionals. One of the questions asked, “What percentage of network performance issues were first reported by end users, rather than discovered by the network operations professionals?” The average answer was 39 percent, and the median answer was 35 percent. So, more than a third of the time (and much more often in some organizations) we don’t know about an issue until a user complains? We must do better!
The problem isn’t that we don’t get enough reports. Network operations teams are flooded with information, but too much information is little better than noise. We need to be able to condense insight from the vapor of data (to paraphrase Neal Stephenson). But how do we do that?
The place to start is by defining network performance in terms that matter to the end user. The focus on end-user experience follows the old “tree falling in the forest” argument: if there is a problem that has absolutely no impact on the end-user experience, now or later, is it still a problem? Unless we’re talking about IoT or specialized systems, the answer is no.
Once we know what matters, we can start filtering out what doesn’t. A great resource for determining what matters is Google’s Site Reliability Engineering (SRE) team. This group has written a (free) book called “Site Reliability Engineering,” edited by Betsy Beyer et al. The book questions some of our traditional thinking about IT. When it comes to monitoring, one of the key concepts it describes is what the team calls “The Four Golden Signals”: latency, traffic, errors, and saturation. (Other well-known approaches include Brendan Gregg’s USE Method and Tom Wilkie’s RED Method.)
Why do these “Golden” signals matter for network performance? And how can you use this information to guide your network performance monitoring strategy? Let’s look at each signal in turn.
“Latency,” delays in meeting requests, may be the most useful signal, if for no other reason than that end users so often experience it. The user makes a request of a remote application. Nothing happens. Just when they’re about to try their request again, they get a response. They keep experiencing this latency for minutes at a time, but then it goes away and the application is responding normally. Then it comes back. And goes away. How much of this mildly painful experience do they tolerate before they decide to create a trouble ticket?
If the network operations team can monitor latency, they can see the issue while the user is first experiencing it. But just seeing that latency is occurring isn’t enough. They must determine whether the latency is occurring because the network is introducing delays or because the application server is responding slowly. Or are both happening at the same time? (A not infrequent occurrence.) Once that is determined, where exactly is the problem located? Knowing the answer to that is often enough to solve the problem.
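One way to picture that split between network delay and server delay is the minimal Python sketch below. It assumes you can capture four timestamps per transaction near the client (for example from a packet capture). The names `TransactionTimes`, `split_latency`, and `dominant_cause`, and the 0.1-second threshold, are illustrative assumptions rather than anything a particular monitoring product provides. The idea is that the TCP handshake involves almost no application work, so its round trip approximates the network's share of the delay, and whatever remains of the time to first byte is attributed to the server.

```python
from dataclasses import dataclass


@dataclass
class TransactionTimes:
    """Timestamps in seconds for one request, captured near the client."""
    syn_sent: float
    syn_ack_received: float
    request_sent: float
    first_response_byte: float


def split_latency(t: TransactionTimes) -> dict:
    # The handshake round trip needs almost no server work,
    # so it approximates the delay added by the network.
    network_rtt = t.syn_ack_received - t.syn_sent
    # Time to first byte = one network round trip + server processing.
    time_to_first_byte = t.first_response_byte - t.request_sent
    server_time = max(time_to_first_byte - network_rtt, 0.0)
    return {
        "network_rtt": network_rtt,
        "server_time": server_time,
        "time_to_first_byte": time_to_first_byte,
    }


def dominant_cause(t: TransactionTimes, threshold: float = 0.1) -> str:
    # threshold is a hypothetical "too slow" limit; tune it per application.
    parts = split_latency(t)
    network_slow = parts["network_rtt"] > threshold
    server_slow = parts["server_time"] > threshold
    if network_slow and server_slow:
        return "both"
    if network_slow:
        return "network"
    if server_slow:
        return "server"
    return "neither"


if __name__ == "__main__":
    sample = TransactionTimes(0.0, 0.02, 0.03, 0.55)
    print(split_latency(sample), dominant_cause(sample))
```

This is only a first-order estimate: it treats the handshake round trip as representative of the whole transaction, which will not hold if congestion comes and goes between the handshake and the request.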
The next Golden Signal is “traffic,” defined by the Google SRE team as monitoring how many requests are occurring. A good way to monitor traffic is to view the number of network conversations.
I know of a large enterprise that had a periodic problem on a network segment. Strangely, though, it didn’t correlate with any of the metrics they monitored. There was some alignment, but, frustratingly, not enough to establish the root cause. Volume of network traffic (in Gbps) would go up and the issue would occur more often, but not always. Time of day. Which kinds of traffic. The most active servers. All of these only loosely corresponded to the issue. Finally, they started measuring the number of network conversations, and found that as soon as it hit about 750,000 on a 10G link, a piece of their infrastructure hit the wall, no matter the type or amount of traffic. Knowing that, the problem was solved quickly.
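Counting conversations rather than bits can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the input is a list of packet or flow records shaped as `(timestamp, src_ip, src_port, dst_ip, dst_port, protocol)`, a "conversation" is a pair of endpoints regardless of direction, and the function names and the limit passed to `intervals_over_limit` are hypothetical. Real tools would read these records from a capture or a flow exporter.

```python
from collections import defaultdict


def conversation_key(src_ip, src_port, dst_ip, dst_port, proto):
    # Order the two endpoints so both directions map to the same key.
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    first, second = (a, b) if a <= b else (b, a)
    return (proto,) + first + second


def conversations_per_interval(records, interval=60):
    """Count distinct conversations seen in each time bucket."""
    seen = defaultdict(set)
    for ts, src_ip, src_port, dst_ip, dst_port, proto in records:
        bucket = int(ts // interval) * interval
        seen[bucket].add(
            conversation_key(src_ip, src_port, dst_ip, dst_port, proto)
        )
    return {bucket: len(keys) for bucket, keys in sorted(seen.items())}


def intervals_over_limit(counts, limit):
    # limit is site-specific; it has to be found empirically.
    return [bucket for bucket, n in counts.items() if n >= limit]


if __name__ == "__main__":
    records = [
        (0, "10.0.0.1", 5000, "10.0.0.2", 443, "tcp"),
        (1, "10.0.0.2", 443, "10.0.0.1", 5000, "tcp"),  # reply, same conversation
        (2, "10.0.0.3", 5001, "10.0.0.2", 443, "tcp"),
        (61, "10.0.0.1", 5000, "10.0.0.2", 443, "tcp"),
    ]
    counts = conversations_per_interval(records)
    print(counts, intervals_over_limit(counts, limit=2))
```

Tracking this count alongside throughput makes it possible to spot a ceiling that depends on the number of conversations rather than on bandwidth.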
Then there is the “errors” signal. Errors are more than just failed requests. Think of this signal as standing in for the quality of the user experience. If you’ve ever been on a VoIP call that was very responsive, but you still couldn’t easily understand the words being spoken, you’ve obviously experienced low quality. But quality isn’t just an RTP (Real-time Transport Protocol) issue, even if that is where it is most obvious. Even though we seldom see persistent data corruption, where some bit gets flipped in the payload, for example, poor TCP (Transmission Control Protocol) quality can cause a host of problems: retransmits, dropped frames, even latency. And perhaps most importantly, errors are often a warning sign of an impending larger problem.
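A simple way to put a number on TCP quality is the retransmission rate. The Python sketch below is a deliberately simplified illustration: it assumes each observed segment is reduced to a hypothetical `(flow_id, sequence_number, payload_length)` tuple and counts a segment as a retransmit when the same flow sends the same data twice. Real analyzers also handle partial overlaps, keepalives, and sequence-number wraparound, none of which is attempted here.

```python
def retransmission_rate(segments):
    """Fraction of data-carrying segments that repeat earlier data.

    segments: iterable of (flow_id, seq, payload_len) tuples.
    """
    seen = set()
    data_segments = 0
    retransmits = 0
    for flow_id, seq, payload_len in segments:
        if payload_len == 0:
            continue  # pure ACKs carry no data and cannot be retransmits
        data_segments += 1
        key = (flow_id, seq, payload_len)
        if key in seen:
            retransmits += 1
        else:
            seen.add(key)
    return retransmits / data_segments if data_segments else 0.0


if __name__ == "__main__":
    observed = [
        ("flow-a", 1, 100),
        ("flow-a", 101, 100),
        ("flow-a", 101, 100),  # same bytes sent again
        ("flow-a", 201, 0),    # pure ACK, ignored
    ]
    print(f"retransmission rate: {retransmission_rate(observed):.1%}")
```

A rate that creeps upward over days is the kind of early warning described above, even while requests are still succeeding.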
The last golden signal is “saturation,” the amount of traffic (as opposed to the number of transactions). Clearly, we want to take advantage of our network capacity, but we also need to allow for spikes in utilization. A saturated network can cascade into very bizarre failure modes, where the error and retry messages add to the traffic, making the situation worse. This cycle escalates until it is so bad that enough transactions fail, and the segment goes back to functioning again until the pattern repeats.
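Saturation is usually tracked as link utilization: the bits moved during a polling interval divided by what the link could have carried. The Python sketch below assumes two readings of an interface byte counter (such as an SNMP octet counter) taken a known number of seconds apart. The function names and the 80 percent headroom threshold are illustrative assumptions, not recommendations from the SRE book.

```python
def utilization(prev_octets, curr_octets, interval_s, link_bps, counter_bits=64):
    """Fraction of link capacity used between two counter readings."""
    # The modulo keeps the delta correct if the counter wrapped once.
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    return (delta_octets * 8) / (interval_s * link_bps)


def lacks_headroom(util, threshold=0.8):
    # threshold is a hypothetical limit that leaves room for spikes.
    return util >= threshold


if __name__ == "__main__":
    # 37.5 GB moved in 60 seconds on a 10 Gbps link.
    u = utilization(0, 37_500_000_000, 60, 10_000_000_000)
    print(f"utilization: {u:.0%}, lacks headroom: {lacks_headroom(u)}")
```

Because an average over a long interval hides short bursts, shorter polling intervals give a more honest picture of how close a link comes to saturation.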
As you can see, when evaluating how to manage network performance – both to support ongoing operations and to prepare for future digital transformation – the four “Golden Signals” can play a significant role. They allow us to get in front of the cycle of waiting for trouble tickets and start managing the network proactively.
What challenges do you and your organization face when setting performance standards? Do you rely on other “signals?” If so, share them in the comments section so we can have an open dialog.
This article is published as part of the IDG Contributor Network.