Every once in a while I hear someone throw out the term “over-subscription” or “over-subscribed.” It dawned on me at some point that there could be a big segment of the community that has heard the term and has an idea of what it means, but maybe doesn’t have a firm grasp on what it actually means, or how and when they need to be worried about it.
Over-subscription is something that I have dealt with almost my entire professional career, and as an old radio guy, it makes a lot of sense. Trunked radio systems, like the ones used in public safety radio, are designed and built on the idea of over-subscription. Getting licensing from the FCC for trunked radio frequencies is expensive, and licenses are difficult to even find in some areas. Just like in Wi-Fi, spectrum (or channels) is limited, so making the best use of what you have is increasingly important. Understanding why we got to where we are is just as important so that, as things progress, you don’t get lost.
SIDEBAR: I was going to call this “Over-subscription for Dummies,” like the books, but thought that was a little disingenuous so I changed it.
In The Beginning
Way back in the day, the phone companies started to install telephones in businesses. It was easy to predict the type of load because only one person could use the phone at any given time, and since only a couple of businesses in a given area had phones, the formula for calculating demand, and the resources to accommodate said demand, was pretty simple.
Count the phones.
Over time, the phone companies realized that, for a price, they could extend the amazing phone service into more businesses and homes and then charge for the service. That was pretty cool until people realized they could also call someone from a different town, not just someone in their town. Phone companies then needed to install trunks that would connect their main phone switch to the next town’s phone switch to handle the load. All of a sudden, the price to provide a service that allowed every house in Town A to call every house in Town B became an issue. Eventually someone at the phone company called a meeting and invited an MBA.
SIDEBAR: I’m not sure if this is really how this happened, but after spending time around marketing folks, I’m pretty sure this is how it went down. For clarification, what you are about to read I made up sitting on my couch. It’s a hunch. A pretty good hunch, but a hunch nonetheless.
A newly minted MBA walks into a sales meeting with the president of the phone company and says “Hey, this phone thing is pretty cool. What would be even cooler is if you could sell it to EVERY house but not invest any additional money into research or infrastructure. The ROI on that type of model would be awesome and I should get a huge bonus for thinking of it!” It’s at this point I believe the president of the phone company said “Cool! I love the model of selling a ton without actually having to do anything. Maybe we bring in some math smarty pants to figure out how much we can oversell and under-deliver before someone notices. Either way, MBA guy, here’s your wheelbarrow of cash, thanks for coming!”
What the math smarty pants guy (his name was Agner Krarup Erlang) (seriously, check the link in a few words) came up with is what are known as the Erlang-B and Erlang-C formulas. They are a couple of nasty formulas, as pictured below, that I one hundred percent don’t understand. Basically, what they predict is the optimal number of resources actually needed under normal conditions to provide service over a random-access medium. (I think.)
Erlang B Formula aka Erlang Loss Formula
Erlang C Formula aka Erlang Probability Formula
Look, any mathematical calculation that uses “birth rate” as one of its inputs, I am not even going to attempt. You think I’m joking? Click on the link for the formulas and look it up. It’s there. Some crazy symbol in the pictures above is a reference to birth rate. I’m not even going to try.
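The good news for the rest of us is that you never have to touch the factorial-laden closed forms: both formulas collapse into a few lines of code. Here’s a minimal Python sketch of the standard Erlang-B recurrence and the Erlang-C relation built on top of it (my own illustration, not anything from the post’s linked pages):

```python
def erlang_b(traffic: float, servers: int) -> float:
    """Erlang-B: probability a new call finds every trunk busy (and is lost).

    traffic -- offered load in erlangs (call rate x call duration)
    servers -- number of trunks/channels/resources available
    Uses the numerically stable recurrence instead of raw factorials.
    """
    b = 1.0  # with zero servers, every call is blocked
    for m in range(1, servers + 1):
        b = (traffic * b) / (m + traffic * b)
    return b


def erlang_c(traffic: float, servers: int) -> float:
    """Erlang-C: probability a caller has to queue (needs traffic < servers)."""
    b = erlang_b(traffic, servers)
    rho = traffic / servers  # per-server utilization
    return b / (1.0 - rho * (1.0 - b))


# 2 erlangs of offered load on only 2 trunks: 40% of calls get blocked.
print(round(erlang_b(2.0, 2), 3))  # 0.4
# 1 erlang on 2 trunks: roughly 1 caller in 3 has to wait.
print(round(erlang_c(1.0, 2), 3))  # 0.333
```

No birth rates required; that part only shows up when you derive these from the underlying birth-death process.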
OK, back on track.
At the heart of this is the idea that not everyone will require access to the service at the exact same time, but if they do, someone will have to queue, or wait, until the resource becomes available. What they also figured out was that this concept and formula were not only useful in telephony systems, but in any system that needs to understand load when the demand from users is random.
Sounds pretty familiar, doesn’t it?
In More Recent Times
So where this formula breaks down is in the event of what is termed “re-entrant traffic.” When this happens it’s called a “high-loss system,” where congestion leads to more congestion. For example, a TV show might give out a number for viewers to call during a short window at a specific time. If the phone provider doesn’t anticipate this demand, and viewers start to call the same number over and over again, it will crash the system until the demand goes away and service can be restored. In a Wi-Fi environment, I equate this to an instructor in a classroom standing up and saying “go watch this video on the count of three. Ready? 1-2-3 Go!”
In one heated debate I had with a really smart RF guy, he pointed out that it specifically states that Erlang’s models don’t apply to “circuits carrying data traffic.” He is correct, and I agree with him. CSMA/CD is a mechanism that anticipates that the demand for the medium (a switched, wired network) will always be there and can deal with peaks in demand. The basis of my argument is that Erlang’s model does apply to Wi-Fi because even though the payload is a data packet, Wi-Fi uses a radio transmitter and receiver. Two-way radio uses Erlang’s formula to calculate the resources needed in a radio system; the only difference is the payload is voice, and in newer systems, voice that has been digitized.
In my mind, a radio is a radio.
Just like everything with 802.11, “it depends” is a response to pretty much any question. In my example above of an instructor in a classroom telling everyone to watch a video at the same time, that demand can be anticipated ahead of time. The folks that work in K-12 or EDU have a good idea of what the demand for a given space is going to be by counting chairs, assuming that at some point the instructor is going to tell everyone to go watch a video, and making their calculations from there. For environments like that, over-subscription isn’t something you even consider. It’s still good to understand the concept, though, because there are spaces on campus where there isn’t an instructor giving the “go” signal, so access to the medium in that space is now randomized.
Wi-Fi, by design, is a re-entrant traffic system. Re-entrant traffic is when the person, or device, that is trying to access the medium continues to attempt to access the resource until they are successful or kicked off for good. When a system goes “high-loss” it means a spike in demand due to re-entrant traffic that is above the designed capacity. In the phone world, a spike in re-entrant traffic used to occur during the holidays. I remember being a kid and wanting to call relatives on Christmas and all the lines would be busy. We would hang up, wait some random time (usually governed by food readiness or the current TV program) and then try again. Keep doing that and the phone system goes into a high-loss state.
Sounds familiar, doesn’t it?
In the trunked radio world, this “high-loss system” occurs right before lunch (when all the users are coordinating where they are going to meet for lunch), then rapidly drops off during lunch, only to resume normal system loads after lunch, tapering off until the end of the normal working day. Trunked radio folks look at these numbers and determine how much more capacity is needed to support the current users (or how many users need to go away) to keep queueing and wait times down to a minimum. For trunked radio systems, they use a formula similar to the one below.
SYSTEM LOAD = ACTIVE USERS x CALL RATE x CALL DURATION
Divide that by 3600 and you get the estimated system load for an hour.
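In code form, that back-of-envelope formula is one line. Here’s a quick Python sketch; the example numbers (100 users, 2 calls an hour, 90-second calls) are made up purely for illustration:

```python
def system_load_erlangs(active_users: int, calls_per_hour: float,
                        call_duration_sec: float) -> float:
    """Estimated offered load, in erlangs, for one hour.

    active users x call rate x call duration gives busy-seconds per hour;
    dividing by 3600 converts that into erlangs, i.e. the average number
    of simultaneous "calls" the system has to carry.
    """
    return active_users * calls_per_hour * call_duration_sec / 3600.0


# 100 users, each making 2 calls an hour, each call lasting 90 seconds:
# 100 * 2 * 90 / 3600 = 5 erlangs of load on the system.
print(system_load_erlangs(100, 2, 90))  # 5.0
```

Feed that erlang figure into Erlang-B or Erlang-C and you get the blocking or queueing probability for a given number of channels.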
In each example (phone, radio, and Wi-Fi) the end result is the same. An unusual load on the system for a short period of time causes high re-entrant traffic, with none of the users wanting, or able, to back off like I did as a kid during the holidays. With phone systems, after a couple of times getting a busy signal or a recording saying all circuits are busy, most people will back off and try a different approach. In radio, if they get a deny tone long enough they simply give up. In Wi-Fi, there isn’t much human response to this scenario from the end user, other than simply whipping out their personal hot spot and going “rogue.”
Sidebar: The term rogue is a debate for another day, I just used it as an example here. Don’t worry, much more heated material to debate on this topic coming up!
The Simple Formula
SYSTEM LOAD = ACTIVE USERS x CALL RATE x CALL DURATION
In order to prevent a high-loss system due to re-entrant traffic, system designers need to understand Erlang’s formula, or at least what it boils down to. For the rest of this discussion, I will be referring to the simple formula used by trunked radio designers: system load is equal to the number of active users times call rate times call duration. At first glance, what becomes obvious in this equation is that, for the most part, these are all variables that are outside our control. Or are they? Let’s break it down and see what these three variables mean and whether they really are unpredictable and outside of our control.
Active users are just that: the number of users that are active at any given point. This figure is indeed random, and hard to predict. Radio and phone engineers know roughly how many new devices will be added in any given area, and know when to raise the red flag. As Wi-Fi designers / engineers, we can look at a given space and estimate how many people can physically cram into the space before a fight breaks out, but that’s about it. We can guess based on the current figure of 2.5 devices per person and then assume that all of them are going to try to work at the same time, but I think that number is a little skewed. I will admit that 2.5 is a good number, but on my last flight where I had 2 laptops, 2 tablets, and 1 phone (5 Wi-Fi enabled devices) I was only trying to use one at any given time. I might have used 2 IP addresses from the DHCP lease pool, and used some ports in the NAT pool, but I was really only trying to use one device at a time.
The other part of this variable I urge you to consider is what the user is actually doing. In an LPV (Large Public Venue, I had to ask as well), the users are mostly uploading content to social media, or watching instant replays of what they just saw in real life. When comparing the download data amounts to the upload data amounts for events in these areas, you can see that the active user population has a different agenda than they would at, say, an EDU or a coffee shop. EDU and K-12 users will have different “profiles” of user types even between areas on campus. Classrooms will get one type of demand while cafeterias or libraries will get a different type of user profile.
Knowing why your active users are accessing the system (mostly at random times, I would point out) is just as important as knowing how many active users there are. Again, “it depends” creeps up here as well. In classrooms or lecture halls, this isn’t a wide variable any longer, so paying attention is crucial. Know your user population!
Call rate is the rate at which the active user population tries to access the system. Is it once every second, once every 10 seconds, or once every minute? Of the three variables, this is the one that we have the least control over, and the most difficult to predict or even guess at. For my money, I estimate on the high side, and fall back on the active user variable to help with my estimations. Once I’ve figured out what the user population will be doing in the room, estimating the call rate falls in line with that, but it’s still hard to predict.
Of the variables, this one is the most prone to outside influences. Any unpredictable event that can influence the active users has to be taken into account, and it’s at this point a decision has to be made about how much, and for how long, management will accept a high-loss system with lots of re-entrant traffic that can’t be handled. I’ve never seen a Wi-Fi system “crash” under a heavy load, but I have seen one come to a screeching halt. Risk aversion and risk acceptance also come into play with this variable; a decision that has more political issues than technical ones.
Call duration refers to how long an active user, after placing a “call,” tries to retain that resource for exclusive use. Again, this is a variable that, as designers or engineers, we don’t have much control over. This variable has the biggest impact on the load of a system. Harken back to the days when you only had one phone in the house and an older sibling wouldn’t get off it long enough for you to call your friends and talk forever. In the trunked radio business, this variable is the key to their success.
During my heated debate with the really smart RF guy, he informed me that as of Q4 of 2017 the average call duration for a group call (think multicast) on a public safety trunked radio system across the nation was 2.5 seconds. That’s 2.5 seconds to gain access to the channel, transmit your “message,” and then get off the air, thereby freeing up the resource for the next individual to use. When you think about that, it’s actually really quick. Maybe not quick when compared to a trace file of Wi-Fi traffic that’s measured in microseconds, but for a human being that is quick! As a follow-up to that stat, the average individual call (think unicast) is around 14.1 seconds.
Let that sink in for a second.
Fourteen point one seconds for an individual to key up, gain access to the resource, send his “private” message to a second individual, and then release the resource for the next call. This is such an inefficient use of the resource that most trunked radio system administrators limit that type of call to only one at a time. The reason is that it is so hard, and costly, to add another frequency, or channel, to the system that they don’t allow users to waste the resource. During that 14.1 seconds, 4.6 additional calls could have been made (the initial call plus 4.6 more), but 4 users had to wait, or be pushed to another resource, because someone was hogging the airtime for a message type that is very inefficient compared to group calls.
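The arithmetic behind that claim is simple enough to check on a napkin, or in two lines of Python (using the 2.5 s and 14.1 s averages quoted above):

```python
individual_call = 14.1  # average individual (unicast-style) call, seconds
group_call = 2.5        # average group (multicast-style) call, seconds

# How many group-call slots does one individual call consume?
calls_displaced = individual_call / group_call
print(round(calls_displaced, 2))  # 5.64: the original call plus ~4.6 more
```

In other words, every 14.1-second private call burns the airtime of roughly five and a half group calls, which is exactly why system administrators ration them.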
Now you’re thinking “Jim, I’m glad I’ve wasted 2,500 words and some crazy formulas learning about trunked radio call loading. What in the name of all that is holy and good does this have to do with Wi-Fi?” Remember when I said I wasn’t going to debate whether your mobile hotspot is a rogue or not, and that there was going to be much more to debate later on?
Well here it is, get out your wrasslin’ pants because it’s about to get heated in here!
ALL OF THIS APPLIES TO Wi-Fi!
Erlang’s formulas are all about random access timing to a resource that is limited and required to transmit information intended for a distant end. Phones, radios, and Wi-Fi. Of the three, phones are the only one that has a wire involved; the other two are wireless! Remember when I said that call duration was a variable that we don’t have much control over?
What if that isn’t true?
We might not have much control over what the user wants to send, or receive, but we have an inordinate amount of say on how FAST they can send or receive!
40 MHz channels? BRING ‘EM ON!
80 MHz channels? If you can, DO IT!
160 MHz channels? Now we’re just being silly. I’m crazy, not insane.
Rate limiting? GET RID OF IT!
50 clients per AP? WHY SO LITTLE?!?!?!
120 clients per AP? NOW WE ARE TALKING!
Jimmy has not lost his mind
Anything that slows the user down, and in turn raises their “call duration,” messes with the entire system. If we look back at the formula for system loading, it is:
SYSTEM LOAD = ACTIVE USERS x CALL RATE x CALL DURATION
What if we can shrink the call duration term of the equation? Since call rate is the one variable we really have no control over, simple math tells me that if we REDUCE the value of the call duration term, the value of the active users term can INCREASE while the actual product of the equation, system load, stays the same!
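That rebalancing is easy to demonstrate by solving the load formula for active users. A quick Python sketch, with the load budget, call rate, and durations all invented purely for illustration:

```python
def max_users(target_load_erlangs: float, calls_per_hour: float,
              call_duration_sec: float) -> float:
    """Solve SYSTEM LOAD = USERS x RATE x DURATION for USERS,
    holding the load the system can absorb constant."""
    return target_load_erlangs * 3600.0 / (calls_per_hour * call_duration_sec)


# Same 10-erlang load budget, same call rate of 4 "calls" per hour.
# Halve the call duration (e.g. a faster data rate finishes the transfer
# in half the time) and the supportable user count doubles.
print(max_users(10, 4, 60))  # 150.0 users at 60-second calls
print(max_users(10, 4, 30))  # 300.0 users at 30-second calls
```

Users scale as 1/duration: every second shaved off the average “call” comes straight back as headroom for more active users at the same system load.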
I know, Airtime Fairness makes sure that the slow talker in the back of the room doesn’t hog the resource, and that’s not where I’m going here. Frame size and packet size are standardized, and it’s going to take a predetermined number of frames to get my data across. My contention is that if we speed up the ability of the devices to transmit or receive those frames, the client devices will demand LESS time on the system, reducing the re-entrant traffic we discussed earlier, and freeing up the resource for someone else to talk.
If you recall, re-entrant traffic is the key driver of a system into a “high-loss” state. I have been to two Wi-Fi specific conferences, listened in on webcasts, and read papers, and the one thing that I do know is that loss of any kind is bad. Why is it not OK to have loss on the RF spectrum, but perfectly fine to induce a high-loss system because we are trying to strangle the system?
Radio is radio, loss is loss.
During WLPC 2018, Joel Crane stood up and did a fantastic presentation called “Look Into My Eye P.A.” Joel is a fantastic presenter, and very knowledgeable about his topics. If you haven’t seen it, I highly recommend watching it, I included the link for your convenience. He is one of the few presenters that I will stop and pay attention to because I know it’s going to be good. I am also buttering him up because I am going to refer to a couple of his slides below and I hope he doesn’t mind!
Wi-Fi is Half-Duplex
Half-Duplex I Say!
Now Joel doesn’t know what I’m about to say, and he in no way endorses my viewpoint, so don’t go after him. I am using this because I was watching his presentation yesterday and was reminded of what he said, and it’s a great visual that I don’t have to reproduce.
Part of his presentation, as seen above, is that Wi-Fi is half-duplex, which means that only one device can talk at a time. Pretty sure we are all in agreement on this one. In fact, the second picture is from when Joel was talking about the fact that when one device is transmitting, all the other Wi-Fi devices in the BSS are in listen mode.
Using this theory, why not give the one device that is allowed to talk at any given time the best chance to get as much across as it can while it has the medium? I know the growing consensus is to use 20 MHz channels and nothing else, but my argument is: why start off constrained? Go as big as you can until you simply can’t. I have seen arguments that 2 devices can transmit more data using 2 separate 20 MHz wide channels coming off 2 AP’s than those same 2 devices can transmit on a single AP with a 40 MHz wide channel. I have seen the math and I can’t argue with the math, but I wonder what would happen if you expanded that test out to a user base that is really random?
Disclaimer: Again, my theory starts to break down when you get into K-12 and EDU classrooms, but I think there is some benefit here, so hang with me.
Given that Wi-Fi is half duplex, and only one station can talk at any given time, do you really only get 3 Mbps per 1 spatial stream device when you reach 10 stations on a BSS? If only one station can transmit, or receive at a time, wouldn’t they get all the bandwidth for that given slot?
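One way to frame that question: the small per-station number only appears when you average over everyone’s airtime; during its own slot, a station really does get the whole channel. A toy model in Python (the 65 Mbps PHY rate and the 0.6 MAC-efficiency fudge factor are both invented for illustration, not measured values):

```python
def avg_throughput_per_station(phy_rate_mbps: float, stations: int,
                               mac_efficiency: float = 0.6) -> float:
    """Long-run average throughput per station on a half-duplex channel.

    With airtime shared equally, each station bursts at the full PHY rate
    during its own slot but only holds the medium 1/stations of the time.
    mac_efficiency is a rough fudge factor for contention/ACK overhead.
    """
    return phy_rate_mbps * mac_efficiency / stations


# A 1-spatial-stream station syncing at 65 Mbps, sharing with 9 others:
# the long-run average works out to a few Mbps...
print(round(avg_throughput_per_station(65, 10), 1))
# ...but during its own airtime slot it still moves data at the full
# PHY rate; the "3 Mbps" is an average, not a speed limit per slot.
```

So both statements can be true at once: the station owns all the bandwidth during its slot, and its long-run average is a fraction of the channel.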
If we all agree that Wi-Fi is half duplex, doesn’t that simple factor alone mean that the airtime slots for the stations actually align all the data in a neat row? In an LPV where users are generally uploading data, the data leaving the AP on the wire HAS to be in a neat row because that’s how the AP received it. Since the AP isn’t the destination MAC address, it doesn’t examine the payload; it automatically sends it up the wire. When the traffic destined for the client device from elsewhere in the world hits the AP at random times, there is really no predictability as to whether one client station will receive its packets first or not. It’s actually at the discretion of the source device at the distant end and the rest of the goo between the source and destination devices, over which we generally have no control at all.
In a unicast scenario, every packet or frame is going to have a destination address and a source address. Whether that is coming from multiple client devices destined for one server or from one server headed to multiple clients, they will all still be in a row. I don’t have any packet captures or trace files showing the opposite, but I have seen a lot of wired and wireless captures and they are all in a row, with a neat little time stamp.
The best I can do is show that client load on an AP has no bearing on the actual throughput of the AP in a scenario where Erlang C is in full effect; a guest Wi-Fi system running full open with 40 MHz channels:
I have been watching this chart for the past 14 months, and haven’t been able to find any correlation between number of clients and throughput. On this graph specifically, which I grabbed on 4 March 2018 at roughly 5:30 pm, look at the last three AP’s on the right. Pretty similar client counts; in fact, the second from the right has the highest quantity of 2.4 GHz devices of the three, yet its throughput for this particular instant is well above the far right, and almost double that of the one third from the right. I have even watched the AP with the highest number of clients have the most throughput and then immediately drop to one of the lowest. The one thing I have picked up is that the throughput, per AP and globally, is much more “spikey” and random. For our money, that means we are working more efficiently, and the customer complaints have all but vanished.
More of Jim’s Arguments
Another part of my argument comes down to the question of “what is noise?” There are a lot of opinions on this, and a lot of people can agree on what noise is, but I take it one step further and agree with Keith Parsons in a presentation he gave at Wi-Fi Trek 2017 in Orlando, Florida. During his presentation (just before the 11 minute mark) he reminded us that noise is noise, that my signal is your noise, and in some cases, my signal is my noise. The more signal you jam into a space, the more noise there is. I don’t care if it’s on frequency or not; there is a reason the hardware tries to measure what the noise floor actually is rather than using a calculation based on what it thinks is in the space.
Radio is radio, loss is loss, noise is noise.
By going with 20 MHz wide channels in order to “increase capacity” in a space, all that ends up happening is you add noise into a space that is already noisy. My signal is my noise, and your noise. My signal becomes everyone’s noise. Instead of adding more AP’s, if the designer spends a little bit of time understanding what is going on in the space (active users and what they are doing), why not use a 40 MHz wide channel off a single AP? Fewer transmitters, less noise, better performance. If you think your cheap phone has a hard time working in a given space, try listening on channel 40 while sitting next to a laptop with high transmit power and a decent antenna on channel 36. He’s on a different channel, you say, so no problem? Think so?
Radio is radio, loss is loss, noise is noise, power is power.
Energy generated by a device, be it an 802.11 modulated signal or a dreaded microwave, still raises the noise floor in an environment. If it weren’t an issue, we wouldn’t spend so much time and money trying to eliminate those sources of noise. If there are three different tools developed to tell me whether it’s a video camera or a microwave, then knowing that I am generating my own noise should be a thing as well. Fewer transmitters, especially high-power transmitters with great antennas, should be a goal we all strive for. Since one single AP transmitting at 10,000 watts isn’t going to happen, we need to keep working on a better solution.
A wider channel gives a station more capacity to transmit its data. I recently saw a flowchart from Peter MacKenzie, during his CWAP class, on what it takes to actually transmit a frame on an 802.11 channel. My first reaction was “how in the hell does it ever work?” I think I even tweeted that remark out. He then followed it up with a video showing a traffic intersection somewhere in Ethiopia. The fact that this works at all still astounds me. Why not, when the opportunity presents itself, allow the station to speak as much of its piece as it can? It sure earned that right after all the junk it had to do to get the mic.
Are We Done Yet?
Admittedly, I went off the rails from talking about “over-subscription” because I have heard the term used without any real consideration of what it means. Sure, we can over-subscribe the equipment, and accept it, but why not use that to our advantage if we are going to “accept some over-subscription” on the hardware?
My contention, in a nutshell, is this. What some people call “over-subscription” I call using the hardware that I purchased to its full potential. Don’t buy a supercar if you are only going to drive through an active school zone on the way to the grocery store. Don’t discount 40 MHz wide channels until you really prove that you CAN’T support 40 MHz wide channels. Get the station the information it’s looking for as fast as possible, drive down re-entrant traffic, and improve system performance. Over-subscription? Only if you can balance the equation and drive down those other two numbers.
Radio is radio, loss is loss, noise is noise, load is load.
Understanding how subscription and over-subscription work, and how it all ties together, will allow you to “subscribe” your AP’s to a station count they can actually support, not what’s on the sales sheet.
Where does this leave us? I will tell you: in the exact same place as we started. I am not going to stand up and say that everyone who preaches 20 MHz wide channels is wrong, or that everyone who preaches 40 MHz channels is right. I am here to tell you that:
At its heart, 802.11 is a medium where we constantly have to play the balance game. We are CONSTANTLY robbing Peter to pay Paul. (Not Peter MacKenzie, a hypothetical Peter.) Every time we turn the power up in one place, we have to turn it down in a different part of the system. Changing bandwidth means either more or fewer available channels. Directional antennas mean signal isn’t going someplace it used to, but is going farther in a direction we might not have intended. More AP’s means “more capacity,” but it also means “more noise.” Fewer AP’s mean wider channels and less noise, but “less capacity.” Wi-Fi is half duplex, but we still run capacity equations based on all the clients in the space transmitting at one time, even though all but one should be listening. Someone once asked if an antenna is tuned to handle 1,000 clients, even though Joel tells us in his presentation that only the one client matters.
We have jobs because we have a deep understanding of technology and how to deploy it in the wild, and then go back behind and “correct” someone else’s design. One of the things I marvel at is how much I have learned about human behavior while designing Wi-Fi, not just technology. There is a reason why things were done the way they were. It is incumbent on us as a community to ask the question “why?” If someone designed a system to be over-subscribed, I will ask why. Not to second-guess them, but to understand the thought process behind the design they deployed. Due to all of this ambiguity, almost every day on Twitter I see a point where I want to jump in and say:
**I welcome any and all responses to my theory and ideas, and would love to hear any ideas to the contrary. My only ask is please don’t try to debate me on Twitter. This wasn’t over 5,000 words long simply to resort to a couple-hundred-word arguments. Leave a response, ask a question, and I will get to them as fast as I can. Thanks for taking the time to get this far!**
**Update 8 March 2018 – Changed “A wider channel gives a station more time to transmit his data” to “A wider channel gives a station more capacity to transmit his data.” The time slot doesn’t increase with the wider channel, but the bandwidth does.**