It seems that “everyone” wants metrics, sometimes just for the sake of having pretty graphs to put up on those giant monitors they insist on having everywhere. Sometimes those metrics and KPIs mean something, sometimes they are massaged to show only how great the organization is, no matter how badly things are going at the moment. Those types of metrics don’t mean much to me.
What I want to focus on are metrics that actually mean something to people in the trenches. Those who toil day after day trying to keep systems running and their phones from ringing. For those of us like that, metrics are usually classified as just metrics (from now on I will refer to those as hard metrics) and the elusive “soft” metrics.
The “Hard” Metric
Hard metrics, in my mind, are those things we stare at that give us actual numbers that mean something. How many APs do we have on a WLC, how many radios are turned up, how many are turned off. These numbers don’t fluctuate much, and at a quick glance, can give you the overall “health” of a system. Other metrics that we can look at will show us clients, how many are on each band, which APs are the most “popular” at the moment, and how much data they are transporting.
Above and to the right are examples of these metrics. These are the type of metrics that C-Level execs like to see because it helps them understand, and then point to and gloat, when their peers start to try and one-up each other. Unfortunately, for the day-to-day babysitters of these systems, this doesn’t really help us as we deal with the ever-constant murmur of “the Wi-Fi sucks!”
Don’t get me wrong, hard metrics can be useful, if used in the correct context, but in my experience, they aren’t as useful as the other metric I want to talk about.
The “Soft” Metric
This question comes up every once in a while as someone hears the term “soft metric” and has no idea what that means. Then the question starts to be bandied about in different online forums, never with a great answer. The answer isn’t the same for everyone, thus making it hard to quantify in some clean document. I’m going to use a recent experience and some of my own graphs to explain what I consider as soft metrics and how I use them.
That picture above, the one showing the active client count, can be both a hard and a soft metric! It’s a hard metric because it’s a simple fact that can be recorded and displayed, and people can oooh and awww over it while drinking coffee. For me, the soft metric side of that picture is the ratio between those two numbers: 7,187 clients on 5 GHz vs 1,097 clients on 2.4 GHz. About 86.8% of the clients on 5 GHz is a little lower than I would like, but I’ll accept it. It was early.
At the time of day that I took that snapshot, that is what I have come to know as a “normal” ratio. In the middle of the night when the only devices on the network are the cleaning crew and the IoT devices, that ratio skews more to even. During the height of traffic during the day, it skews even more towards 5 GHz. If I see that ratio start to even out in the middle of the day, I know I have a problem. Hard metrics don’t tell you that, soft metrics can.
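As a rough illustration of that idea, here is a minimal Python sketch that turns the two client counts into a band share and compares it to whatever you have learned is “normal” for that time of day. The function names, the expected share, and the tolerance are all made up for the example; your own baseline has to come from watching your own system.

```python
def band_share(clients_5ghz: int, clients_24ghz: int) -> float:
    """Return the fraction of associated clients that are on 5 GHz."""
    total = clients_5ghz + clients_24ghz
    if total == 0:
        raise ValueError("no clients associated")
    return clients_5ghz / total


def ratio_looks_off(share: float, expected: float, tolerance: float = 0.05) -> bool:
    """True when the 5 GHz share drifts further from 'normal' than we tolerate.

    Both `expected` and `tolerance` are assumptions you tune per system
    and per time of day; they are not values from any vendor.
    """
    return abs(share - expected) > tolerance


# Client counts from the snapshot discussed above.
share = band_share(7187, 1097)
print(f"5 GHz share: {share:.1%}")

# Hypothetical early-morning baseline of 87% on 5 GHz.
print("worth a look:", ratio_looks_off(share, expected=0.87))
```

The hard metric is the pair of counts. The soft metric is the comparison against a baseline that only makes sense for this system at this hour.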
Next, let’s take a look at some graphs that I used to solve an issue that we were having recently. This graph shows the number of clients on the system broken out by client SNR as reported by the WLC. For most C-Level execs, this chart means nothing. It looks cool, it updates by itself every couple of minutes, but it will cause more trouble than it’s worth thanks to the red, yellow, and green numbers on the bottom. Remove the colors and I’m fine with it.
This is also one of the charts I used to determine that we had a problem with the system recently. The graph I saw had a different “shape” to the red line and most of the clients had a really poor SNR. After some checking, I discovered that my Dynamic Channel Assignment (DCA) had set the majority of my 5 GHz radios to Ch 36. More noise, less SNR, more clients skewed lower on the chart, and the hump “moved” to the left. A manual restart of the DCA process corrected a lot of the channel issues, and the hump “moved” back to the right.
Bar graphs = hard metric. Red line above the vertical bars = soft metric.
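If you wanted a script to notice the hump moving, one simple approach is to track which SNR bucket holds the most clients and what share of clients sit below some threshold. This is only a sketch: the bucket boundaries and client counts below are invented for illustration, not pulled from my WLC, and the 20 dB threshold is just an example.

```python
def peak_bucket(histogram: dict[int, int]) -> int:
    """Return the SNR bucket (lower bound in dB) holding the most clients."""
    return max(histogram, key=histogram.get)


def share_below(histogram: dict[int, int], threshold_db: int) -> float:
    """Return the fraction of clients in buckets below threshold_db."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram has no clients")
    low = sum(count for snr, count in histogram.items() if snr < threshold_db)
    return low / total


# Hypothetical client counts per SNR bucket (key = bucket lower bound in dB).
normal = {10: 300, 15: 1200, 20: 2600, 25: 2100, 30: 1400, 35: 600}
shifted = {10: 1100, 15: 2700, 20: 2200, 25: 1300, 30: 600, 35: 300}

for label, hist in (("normal", normal), ("shifted", shifted)):
    print(label, "peak at", peak_bucket(hist), "dB,",
          f"{share_below(hist, 20):.0%} of clients under 20 dB")
```

The counts in each bucket are the hard metric. Where the peak sits compared to where it usually sits is the soft one.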
I’m sure the question is going to come up about the 20 dB vs 25 dB. Would I like to see that hump moved further to the right? Of course. Is it worth the price I would pay in other parts of the system to make that happen? No. For this system, this soft metric has become the normal. Also, I don’t know why DCA did what it did, and I didn’t have time to suffer through a TAC case to figure it out. It’s easier to fix it and stop the phone from ringing than deal with that mess!
Fixing that issue led to a new one. The chart below is the one that I looked at after I found the DCA issue. What you see here is the proof that the issue was fixed. What’s the proof you might ask? Good question. The proof is the sawtooth pattern of the throughput line graph. In a system like mine, predominantly guest clients, when functioning correctly, the throughput should be very jagged. Up and down, very bursty, shouldn’t be smooth at all. The demand on the system is very random, so the throughput should be as well.
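I judge “jagged” by eye, but if you wanted a number for it, one rough option is the coefficient of variation (standard deviation divided by the mean) of recent throughput samples. This is my own illustration rather than something a WLC reports, and the sample values and function name below are invented.

```python
from statistics import mean, pstdev


def burstiness(samples: list[float]) -> float:
    """Coefficient of variation of throughput samples (higher = more jagged).

    Expects at least one sample; statistics.mean raises on an empty list.
    """
    avg = mean(samples)
    if avg == 0:
        return 0.0  # no traffic at all, so nothing to call bursty
    return pstdev(samples) / avg


# Hypothetical throughput samples in Mbps, one per polling interval.
jagged = [120, 640, 210, 880, 150, 720, 90, 560]
smooth = [400, 410, 395, 405, 398, 402, 407, 399]

print(f"jagged: {burstiness(jagged):.2f}")
print(f"smooth: {burstiness(smooth):.2f}")
```

What counts as “too smooth” is something you would have to learn from your own system on a healthy day. There is no universal threshold.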
What I saw was a line that was fairly smooth. A “hard” metric wouldn’t catch anything because there was still throughput, things were still functioning. Soft metrics told me there was a problem, and as I started some initial troubleshooting, I found that my Internet speeds weren’t living up to expectations.
In fact, I was complaining that the Wi-Fi was slow.
Now based on the fact that the SNR graph was back to my normal, and my tools told me that my RF was pretty good for my device, I started to dig further. One of our wired network guys actually found an interface that was part of a port channel that was misbehaving between our Internet edge and the AP. Further troubleshooting from the fiber folks discovered that there was a faulty patch in the fiber line. Fix the fiber, add it back to the port channel, Wi-Fi was fixed!
No one else caught it because all the metrics they look at showed that traffic was flowing. Interface was up, connection was alive, it just wasn’t happy. If your packets ended up on that fiber, it was a problem but if they didn’t, no problem. Soft metrics are literally understanding what your system looks like when things are healthy so when things go bad, you can spot them before they go REALLY bad.
Great, what does this mean to me?
For people that get to sit in front of a system day after day and learn the quirks and anomalies that make up a specific system, this starts to become second nature. If you are new to a system, learning what soft metrics to pay attention to is something that can be done, but realize it’s going to take time. One thing you should realize by this point is that being really good at this requires some effort and some skin in the game. Maybe you don’t know what the normal is for a system, but with time you should be able to answer that question within minutes.
Of course, the good news is with advances in Artificial Intelligence and Machine Learning, soft metrics will soon come standard with all the advances in technology that are coming out every day. The even better news is there will always be a need for someone to program and run all the bots that are watching all your hard and soft boiled eggs. I mean metrics.
Leave me a comment and let me know what “soft” metrics you like to watch on your system.
Photo of the eggs is courtesy of The Food Network. Do you think I could actually soft boil an egg?