Wi-Fi

You Should Care About DHCP Option 51

Edit 4 Jan 2019 – It has been pointed out to me that instead of packets, these are frames.  You can read more about how I was mistaken here.  The link to RFC 2131, Dynamic Host Configuration Protocol, can be found here.  I’m not going to go through and change all the words, just replace them in your head as you read through!

Dynamic Host Configuration Protocol, or DHCP, is one of the first things you learn about with IP devices and the super basics of how they work; even before you learn binary and MAC addresses and layer 3’s and LAN vs WAN and, and, and, and…

Hopefully you get where I am going with this.  DHCP is one of the building blocks of IP networking, and most people know just enough to survive.  As WLAN Professionals maybe you have heard of DHCP Option 43 (Vendor Specific information) or Option 60 (Class ID) but did you know about Option 51?  It’s OK, neither did I. 

The point of this diatribe is to point out how much most people, including me, don’t know about DHCP and the different options that are defined within the protocol itself, and how this can affect clients on a WLAN.  My interest in this was piqued by a conversation on the WLAN Pro’s Slack site (more on that at the end) and how Apple devices negotiated their DHCP lease.  Didn’t know that there was a negotiation during the DHCP process?  Neither did I.  Let’s get at it.

Basic Options

Some of the basic options that most people already know about, but probably didn’t know their defined option numbers are:

  • Option 3 – Router
  • Option 1 – Subnet Mask.
  • Option 4 – Time Server.
  • Option 6 – Domain Server.
  • Option 15 – Domain Name.
  • Option 51 – Address Time.
  • Option 53 – DHCP Message Type.
  • Option 138 – CAPWAP Access Controller Address
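To make those option numbers a bit more concrete, here is a minimal sketch of how the options field of a DHCP message could be walked in Python. Each option is a one-byte code, a one-byte length, and then the value. Option 51 carries a 4-byte count of seconds. The function names and the hand-built sample bytes are my own assumptions for illustration; they do not come from any capture in this post.

```python
import struct

# Option codes and names taken from the list above.
OPTION_NAMES = {
    1: "Subnet Mask",
    3: "Router",
    4: "Time Server",
    6: "Domain Server",
    15: "Domain Name",
    51: "Address Time",
    53: "DHCP Message Type",
    138: "CAPWAP Access Controller Address",
}

PAD, END = 0, 255  # single-byte options that carry no length field


def parse_options(raw: bytes) -> dict:
    """Walk a DHCP options field (the bytes after the magic cookie)."""
    options = {}
    i = 0
    while i < len(raw):
        code = raw[i]
        if code == END:
            break
        if code == PAD:
            i += 1
            continue
        length = raw[i + 1]
        options[code] = raw[i + 2 : i + 2 + length]
        i += 2 + length
    return options


def lease_seconds(options):
    """Return the Option 51 value in seconds, or None if the option is absent."""
    value = options.get(51)
    if value is None:
        return None
    return struct.unpack("!I", value)[0]  # 4 bytes, network byte order


# Hypothetical sample: Option 53 = 1 (Discover), Option 51 = 7,776,000 s, End.
sample = bytes([53, 1, 1]) + bytes([51, 4]) + struct.pack("!I", 7_776_000) + bytes([END])
opts = parse_options(sample)
print(OPTION_NAMES[51], lease_seconds(opts))
```

This is a sketch, not a hardened parser; it does no bounds checking on truncated options.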

Honestly, the list of options that are available, and what they do, is pretty astounding when you start to dig into it.  It makes me think about some of the different things I could configure on my DHCP servers to try and help my clients negotiate the network easier and faster, improving customer experience.  At least that’s the “pie in the sky” thought.  The honest answer is the ever classic and pervasive It Depends™.  Hence the conversation on Slack about Apple devices and how they “negotiate” their DHCP lease from the DHCP server.

There are a myriad of things that happen within the DHCP process, starting with the typical 4 way exchange that most are aware of:

  1. DHCP Discover (A client transmitting an initial BootP packet)
  2. DHCP Offer (The initial response from the DHCP server)
  3. DHCP Request (The client requesting the IP offered in Step 2)
  4. DHCP Ack (the DHCP server confirming the IP as being assigned to that client)
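Each of those four messages is identified by the value carried in Option 53. As a rough sketch, the values defined in RFC 2132 can be put in a lookup table and used to label what shows up in a capture. The helper names `classify` and `sender` are hypothetical, made up for this example.

```python
# Option 53 message type values as defined in RFC 2132.
MESSAGE_TYPES = {
    1: "DHCPDISCOVER",
    2: "DHCPOFFER",
    3: "DHCPREQUEST",
    4: "DHCPDECLINE",
    5: "DHCPACK",
    6: "DHCPNAK",
    7: "DHCPRELEASE",
    8: "DHCPINFORM",
}

INITIAL_EXCHANGE = [1, 2, 3, 5]  # Discover, Offer, Request, Ack
RENEWAL = [3, 5]                 # Request, Ack

SERVER_TYPES = {2, 5, 6}         # Offer, Ack, Nak come from the server


def classify(type_codes):
    """Label a short run of Option 53 values as seen in a capture."""
    codes = list(type_codes)
    if codes == INITIAL_EXCHANGE:
        return "initial exchange"
    if codes == RENEWAL:
        return "renewal"
    return "other"


def sender(code):
    """Say which side of the conversation sends a given message type."""
    return "server" if code in SERVER_TYPES else "client"


print(classify([1, 2, 3, 5]), [sender(c) for c in INITIAL_EXCHANGE])
```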

What surprised me was the number of options and things that happen within those 4 “simple” packets and how they differed between vendors.  Now as anyone who has dealt with client devices can attest, different vendors can wreak havoc within an infrastructure, but did you realize how much it can do just to a DHCP scope?

Option 51

With just a little bit of experience, and some time staring at packet captures, this is a pretty easy exchange to watch, and then take for granted.  For the sake of this conversation, I performed a bunch of captures on my test DHCP server (pfSense – I should probably do a post about that) and focused on Option 51.  This is the part where an Apple device will “negotiate” its lease time.  On the right is a look at my new iPad as it starts the DHCP process with a DHCP Discover message (Option 53).  Notice the time in Option 51?  This is in the initial packet of the process and this Apple device is requesting 90 DAYS (!) for its lease.  I tested this with multiple Apple products and found this to be the same across the board.  Every time an Apple product that I tested sent a discover packet to the DHCP server, it asked for 90 days.

Greedy buggers!

Conversely, the Android devices I tested asked for what the lease time was as part of the Option 55 section (Parameter Request List), but never asked for a time as a specific option.  Windows devices, pictured here, never even inquire about Option 51, either as a standalone option request or as part of the Option 55 Parameter Request List.  This becomes critical when we get to the “DHCP Request” packet or #3 in the process.

Now while a lot of the parameters requested in this DHCP Request packet on the right aren’t configured on my test DHCP server, the device is still asking for them.  That’s fine, the server will only respond with what it knows.  The cool thing about a pfSense DHCP server is that it knows about TWO different timers for the DHCP Lease Time: a “default time” and a “maximum time.”  These are configurable through the GUI and until recently, I never knew why this was such an important thing.

Apple Devices and Option 51

Enter the Apple discussion on Slack that I referred to earlier.  The conversation centered around the fact that Apple devices didn’t like having short lease times for their DHCP leases.  I don’t have the conversation to post, but I need to give credit to Kristian Roberts for originally bringing light to this subject.  I couldn’t find him on Twitter but he is on Slack (more at the end.)

What Kristian discovered, and I confirmed, is that Apple products will always request 90 days.  What gets weird is when they request 90 days.  As part of the standard 4 packet exchange, the discover and request (1 & 3) come from the client, and the offer and acknowledge (2 & 4) come from the server.  An Apple device will only request the 90 days in the discover packet of an initial DHCP process.  In the request packet, it doesn’t include Option 51 for the initial request.  Where this changes is the renewal that happens at the half life of the lease time as defined in the last packet of the exchange, the acknowledge packet.

In my scenario with 2 different lease times defined, this is what an Apple Mac Book Pro looks like from a DHCP scenario:

Mac Book Pro DHCP Summary

I did the math for you, and 7,776,000 really is 90 days.  3,600 seconds is an hour that the server responds with, which in pfSense is defined as the maximum lease time.  Notice that the request packet (#3) has no value?  Apple devices don’t request a time in their initial  DHCP request packet so the server responds with 1,800 seconds, or the default time of 30 minutes.
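If you want to check the math yourself, here is a small sketch of the lease selection as it appears from my captures: no requested time gets the default, and a requested time gets capped at the maximum. This is my reading of the behavior I observed, not pfSense’s actual code, and the names `granted_lease` and `seconds_to_days` are made up for the example.

```python
DEFAULT_LEASE = 1_800  # seconds, the 30 minute default on my test server
MAX_LEASE = 3_600      # seconds, the 60 minute maximum on my test server


def granted_lease(requested=None, default=DEFAULT_LEASE, maximum=MAX_LEASE):
    """Assumed model: default when Option 51 is absent, else cap at the maximum."""
    if requested is None:
        return default
    return min(requested, maximum)


def seconds_to_days(seconds):
    """Convert a lease time in seconds to days (86,400 seconds per day)."""
    return seconds / 86_400


apple_request = 7_776_000
print(seconds_to_days(apple_request))  # the 90 day request
print(granted_lease(apple_request))    # request present, capped at the maximum
print(granted_lease())                 # no Option 51, server falls back to default
```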

900 seconds after the first DHCP ACK from the server, the client sends a DHCP Request packet when it starts the renewal process; half life of the 30 minute lease that both the server and client respect from the first ACK packet.  If the client didn’t respect that value, the renewal wouldn’t have come in until later.  What I want to call your attention to is the IP Address Lease Time for packet numbers 5 through 10.  In the initial request packet (line 3) the Mac Book Pro didn’t ask for a time but in every subsequent request it asks for 90 days.  The server, programmed with a maximum lease time of 60 minutes, keeps offering it.  The other oddity happens at line 7.  The request comes 900 seconds after an ACK with a 3,600 second (1 hour) lease.  The source and destination IP address revert to a broadcast like the initial request, but it’s a renewal.  This time, however, the client “accepts” the 3,600 second lease because the renewal at line 9 happens 1,800 seconds (30 minutes) later.
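The “half life” renewal is the T1 timer from RFC 2131, which defaults to 50% of the lease time (the rebinding timer, T2, defaults to 87.5%). A quick sketch of that arithmetic, assuming the server does not override the timers with its own values, is below. It does not try to model the odd broadcast at line 7.

```python
def renewal_timers(lease_seconds):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    t1 = lease_seconds * 0.5    # client starts renewing at half the lease
    t2 = lease_seconds * 0.875  # client starts rebinding if renewal failed
    return t1, t2


for lease in (1_800, 3_600):
    t1, t2 = renewal_timers(lease)
    print(f"lease {lease}s -> renew at {t1:.0f}s, rebind at {t2:.0f}s")
```

For the 1,800 second lease that gives a renewal at 900 seconds, and for the 3,600 second lease a renewal at 1,800 seconds, which lines up with the timing in the summary.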

The one thing that I can state is that Apple definitely has some “negotiation” happening within their DHCP process.  What I saw above I saw on multiple Apple products so it’s not just a one off.  When the same type of test is compared to a Windows laptop, it’s easy to see the similarities, and the differences.

Windows and Option 51

After examining the Apple products in depth, I wanted to contrast that to other devices.  In environments that support a mix of devices and can’t just focus on a single vendor, this might come in handy in the future.  What I learned, and have alluded to earlier, is that Windows and Android devices just don’t care about Option 51.  This is why, in my opinion, Windows DHCP servers don’t offer a second lease time in the normal configuration.  We are still playing around with the Windows server to see if we can add a max lease time, but for now I can’t find it.

Windows Laptop DHCP Summary

What we have here is the same test as before, but this time with a Dell laptop running Windows 10.  The only things that changed were the client device and the time of day.  The general look is the same, starting with the standard 4 packets and then going into the renewal process.  The first thing that jumps out is there isn’t 90 days anywhere in the IP Address Lease Time (Option 51), so even though the server will allow a lease of 3,600 seconds, it only ever offers the 1,800 second lease.  At no point in this test does the laptop EVER request a time.  The only time values come from the server.  The test I did with an Android device looks identical to the summary above.  The only way to tell the difference is to look into the request packet and see that the Android device included Option 51 in the Option 55 Parameter Request List.

That’s it.  Windows and Android devices just don’t care to use Option 51 the way that Apple does.

The last thing that I can draw from comparing the 2 summaries above is why the DHCP Request during the renewal process every once in a while comes as a broadcast from 0.0.0.0 to 255.255.255.255.  For Apple, Windows, and Android alike, even though the packet is a DHCP “renewal”, the device still remembers the initial lease time.  When you see the request from 0.0.0.0 without a preceding offer packet, it means that it is the end of the initial lease period.  I’m pretty sure that if I went and read an RFC it would explain that, but I learn so much better doing it this way!

Conclusion

So what does all of this mean?  That’s an easy answer!

It Depends™

I spent a bunch of time digging into what was really happening with Apple devices on my network, and made an adjustment to allow those devices to eventually gain a longer lease.  I don’t have empirical evidence it made a difference, but I feel like it did.  I still have some more work to do, but one thing I can tell you is I have a much better idea of what happens during this process than I did two weeks ago.  All it took was some free software, some time, and a bunch of different wireless devices to play around with.

In an environment with almost all Apple, I could see some benefit to having 2 different lease times, and adjust those to find a sweet spot based on how long the clients stay in the environment.  If you only have Windows or Android, this won’t help you.  What I hope it does is to prod you to do your own testing and see what options you can use for your environment.  It’s not difficult, just takes some dedication and some curiosity.

If you aren’t on the Wi-Fi Pros Slack and want to be, contact the infamous Sam Clements and he can hook you up.  The conversations are more detailed and thorough, and you can even meet Kristian Roberts!

My site won’t let me upload the actual packet captures I collected during my research, but if you want them send me a message and I will work to get them to you.

Thanks for reading!

Long Live the Controller!

Lately, it appears that every time I turn around, I read somewhere where everything, and I mean EVERYTHING is moving to the cloud.  Maybe I am an “old geezer” in this respect, but I believe that not everything belongs in “The Cloud.”

In this particular post I want to focus on the heart of WLAN infrastructure, the venerable WLC.  Now granted, there are situations and the always present “It Depends” that can call for a controller in the cloud, or offsite controller, or controller-less, or mesh, or whatever the vendor is calling it this week, but sometimes, in some situations, having a physical, on-site good old fashioned controller just can’t be beat.

In my current employment, I work at a facility that covers 53 square miles.  Granted, not all of that space is covered in buildings and facilities that have Wi-Fi, or network connectivity (although we have received that request more than once) but we do have facilities that are pretty well spread out.  While I don’t want to spell out all the details, we also have a massive fiber infrastructure that allows us to do some pretty cool things all in house, and we don’t rely on leased lines, or ISP’s, for anything other than our internet connectivity.

Hopefully, at this point, you get the idea of where I am coming from when I say that in an environment like mine, having a centralized, on-premises, good old fashioned chunk of metal and electronics programmed to be a Wireless LAN Controller is a great thing!

Look, I get it.  Not every customer is going to be in my position.  Not every customer can provide their own dedicated fiber between buildings miles apart to get sub-millisecond latency between hardware, but I can.  Not every customer benefits from centralized forwarding, and that’s fine.  I’m not saying that all of the other solutions are not warranted, and don’t have their advantages; they really do.  I can think of a myriad of customers and/or situations where either fully cloud based or a hybrid solution is definitely the way to go.  Companies that have a large central office with branch offices spread across the country immediately spring to mind as a situation where either a fully cloud based or hybrid solution would be, and should be, the solution of choice.

Everybody can agree that when it comes to RF coverage, AP placement and AP count, that it all depends on the requirements of the space.  The same thing applies to selecting how the WLAN will be managed and controlled and which type of solution is eventually installed.  Requirements should be the first decision, then cost.  Whether or not your chosen vendor has just rolled out a new shiny cloud based solution should NEVER factor into that decision making process.  I get that sometimes cost will over-ride everything, I’ve been on that side of the fence before, but please don’t immediately jump there, give hardware a chance!

Let me give you some examples in my argument for centralized forwarding to an on-site controller.  Sorry, I can’t bring myself to call it “on prem” or “on premises” or whatever marketing calls it this year.

  1. Configuration of my access layer switch ports has been standardized to a single configuration.  Since I only need an access port with a single VLAN, the wired network team now knows how to configure a switch port where an AP is being installed without the wireless “team” getting involved.  You would be surprised how confusing WLAN technology can be to wired guys who have never dealt with it in the past.  If I need to do a flex connect type scenario, it’s rare enough that I don’t mind dealing with it personally.
  2. VLAN segmentation is much, MUCH easier.  I currently have 28 active VLAN’s off of my WLC’s, and only having to deal with them on a couple of switches relieves a lot of stress, questions and mis-configurations from the wired team.
  3. Security is easier to implement.  I run a Cisco WLAN, so there is an encapsulated (not encrypted) CAPWAP tunnel between the AP and the WLC.  In my environment we added an additional routing “feature” around the CAPWAP to keep it locked down.  That was a one-time configuration challenge that we haven’t had to go back and touch, no matter how many VLAN’s I have added to the WLC.
  4. Using the CAPWAP functionality allows me to “get around” network segmentation on the logical network.  In certain circumstances, it can be very advantageous to have 2 devices 10 miles apart but on the same subnet since they both terminate at the same location.  Yes, concentrators can be used to achieve the same thing but if I have to add hardware onsite, why add just that?  A concentrator will add complexity and another point of failure to deal with, so now I need to add in redundancy.
  5. I have full control over when and how my upgrades are done.  Yes, in theory this shouldn’t be an argument since it is your cloud instance, but how many times have you had a service in the cloud have an update or reboot done simply by accident?  As the engineer/architect on record, I am always the first one blamed.  This leads to the next point.
  6. Troubleshooting during outages is frustrating.  Even when things are in the cloud we are blamed for outages, and in our group alone we have spent countless hours trying to show that issues with reaching an offsite service is an ISP problem, not ours or the cloud data center’s fault.  What ends up happening is we point the finger at the cloud provider, the cloud provider points the finger at us.  Eventually we point a finger at an ISP.  Ever try to get two different ISP’s working together to solve a problem?  It’s bad enough when you are paying them for service and you need them to work for you, let alone work with a different ISP to figure out routing problems between themselves.  It’s a nightmare, and as the customers technical people we are always left holding the bag.

I could go on, but I think you get the point.  Keep in mind, I am not here to say that cloud based controller solutions are the devil or should go away.  On the contrary, I think in the correct situation, cloud based is 100% the way to go, and all vendors should be able to support that model.  I am just here to argue that in that same vein of thought, in the correct situation, physical, on-site, metal chassis based controllers are still very pertinent and need to be considered as a viable, if not the correct, solution for some situations.  And just like with cloud based controllers, all vendors should be able to support that model.  If not, in my mind, they will always be a second tier vendor since they can’t support ALL possible solutions needed for any given customer.

As Lee Badman reminded us in the #WIFIQ for 8/21/18, try and take emotions out of the discussion.  Emotion should never be part of the conversation when designing the correct WLAN solution for any customer.  Define the requirements and design the solution based on those requirements.  The solution will change based on other factors but to say that I won’t recommend a physical controller no matter what just isn’t fair, and isn’t in keeping with the spirit of designing the best Wi-Fi for any given scenario.

Let me know your thoughts on the subject.  Sometimes 280 characters just isn’t enough to make your argument.

P.S. – I also don’t think 2.4 GHz is dead and will argue that one until the end of time!  Maybe I am the old geezer who won’t change!

NDP and You; A Continuing Saga

So a couple of months ago I wrote a blog about how I came to discover the Wi-Fi community of CWNP, Wireless LAN Professionals, and WLPC. In that blog I discussed a different blog that I found about Cisco NDP that was written by Rowell Dionicio from Packet 6. That blog was the start of a journey that led me to CWNP and my 9 month struggle with Cisco TAC. Since I know enough to not say this is the conclusion of that journey, I will just say this is the next part of my NDP journey.

For starters, read this blog by Rowell. It’s what started my journey and it is the basis for the experience I went through over the past 9 months, fighting with TAC and learning more about the background processes that happen in a Cisco Wireless LAN deployment. Read this, then come back and I’ll attempt to put a bow on my story and lessons learned, all while trying to keep things professional.

https://www.packet6.com/cisco-ndp-neighbor-discovery-protocol/

In this story, it all stemmed from our RRM not working at all. Like not even close to acting like it was working. We would come in on a Monday morning and every 5 GHz radio would be on the same channel, and all at the highest transmit power. Looking back on it, I can now explain it as if all the AP’s appeared to be on an island; isolated and all alone. On the contrary they should have had multiple neighbors, instead they had none. Oddly enough, when looking at the 2.4 GHz channels, they had more neighbors than they should have, and at one point TAC even suggested the problem was too many neighbors.

About 6 weeks after opening up our case, we had some time in the office so we decided to get back to basics and do some super basic troubleshooting. We took 2 AP’s and turned them up in our office, and started running some debugs and packet captures from the 2 AP’s. While watching the active debug’s, I noticed that one of the AP’s showed 2 different lines, while the other one only had one line that kept repeating. The values changed, but the “header” info stayed the same. The 2 lines are:

LWAPP NEIGHBOR and CAPWAP_RM

Doing what I do best, I went to the Google to look for those terms and lo and behold, Rowell’s blog post came up first for the LWAPP entry. What happened after that was a blur, but what I can tell you now is my world changed. The reason why RRM didn’t work? NDP wasn’t happy. Why didn’t anything work? NDP wasn’t happy. Why didn’t our super-duper expensive hyper-location service work? NDP wasn’t happy. I think you see where I am going with this.

Why wasn’t RRM happy? NDP wasn’t working. The special commands to look at these packets are listed in Rowell’s blog, but I will list them here. Both are run from the CLI on the AP itself, and you better be saving the output because it comes at you fast and furious.

debug capwap rm measurements and debug capwap rm neighbor

I will tell you, looking at these at first can be intimidating, and for a while they were for me as well, but then I found this guide. When I first found it Cisco had it listed as a White Paper, it has now progressed to being an actual guide. Before I allow anyone outside our group to touch our system or work on it, they have to read this guide. It’s so important that I have 2 hard copies printed out; one for me and one marked as a guest copy. If you are going to do Cisco RRM, don’t do anything without reading this document. Rowell’s blog is a condensed version of this guide, but this is worth every page.

https://www.cisco.com/c/en/us/td/docs/wireless/controller/technotes/8-3/b_RRM_White_Paper.html

After reading the first couple of sections, I was able to go back in to the debugs and read them like they were a novel. My whole world opened up. Everything I thought I knew about Wireless LAN changed, and I realized I didn’t know even a fraction of what I thought I knew. I also realized that Cisco, at least Cisco TAC, didn’t know about this information. It was at this point I knew I needed to find a group that could teach me the IEEE standards of Wi-Fi, not just what button to click on the WLC. That’s an earlier blog post you can find here.

The first thing we noticed was NDP simply wasn’t being transmitted on the 5 GHz channel. That was simple, we did a wireless capture and sure enough, no NDP being sent out on any channel, by any AP. By reading the guide and Rowell’s post, you should realize that NDP is supposed to go out on EVERY channel, not just its assigned channel. When we sat in an area surrounded by 6 AP’s and didn’t see one NDP packet on a given channel in 10 minutes, it was an easy problem to present to TAC. That resulted in our second code upgrade to deal with this problem. (For those with any experience, you know as soon as you open a ticket you plan on doing a code upgrade, rather annoying actually.)

So we finish our first code upgrade and things are working OK, but after about 48 hours we realize that there actually isn’t much change. Turns out when you bounce the entire system, which happens in a code upgrade, it magically fixes everything for a short period of time. For more on that check out this rant. What we found is even though the NDP packet was being sent out, we had a problem with the radio actually listening for the packet. Using the debug capwap rm measurements command we found this:

CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status TUNED
CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status SUCCESS
CAPWAP_RM: noise measurement channel 8 noise 81
CAPWAP_RM: Rx Timer expiry
CAPWAP_RM: Neighbor Interval timer(slot 0) expired
CAPWAP_RM: Generating aggregated neighbor report for slot 0
CAPWAP_RM: RRM measurement completed. Request 2017, slot 1 status TIMEOUT

For reference, on a Cisco AP, “slot 0” is the 2.4 GHz radio and “slot 1” is the 5 GHz radio. Hours of debugs showed the same thing, the radio would never successfully tune on the 5 GHz band, it would always return the status of “TIMEOUT”. The good news is on our “new” code we did see the following output:
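When you have hours of this output saved off, counting the statuses per radio is a lot easier than reading them. Here is a rough Python sketch that tallies the “RRM measurement completed” lines by slot. The regular expression is based only on the log lines shown in this post, so treat the format as an assumption; other code versions may print something different.

```python
import re
from collections import Counter

# Pattern built from the debug lines quoted in this post (an assumption,
# not a documented format).
STATUS_RE = re.compile(
    r"CAPWAP_RM: RRM measurement completed\. Request (\d+), slot (\d+) status (\w+)"
)
SLOT_BAND = {0: "2.4 GHz", 1: "5 GHz"}


def count_statuses(lines):
    """Count measurement results per radio band."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            slot = int(match.group(2))
            band = SLOT_BAND.get(slot, f"slot {slot}")
            counts[(band, match.group(3))] += 1
    return counts


sample = """\
CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status TUNED
CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status SUCCESS
CAPWAP_RM: noise measurement channel 8 noise 81
CAPWAP_RM: Rx Timer expiry
CAPWAP_RM: RRM measurement completed. Request 2017, slot 1 status TIMEOUT
""".splitlines()

for (band, status), total in sorted(count_statuses(sample).items()):
    print(band, status, total)
```

A 5 GHz radio that only ever shows TIMEOUT, with no TUNED or SUCCESS, is the pattern described above.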

CAPWAP_RM: Timer expiry
CAPWAP_RM: Neighbor interval timer expired, slot 1, band 0
CAPWAP_RM: Triggering neighbor request on ch index: 4
CAPWAP_RM: Sending neighbor packet #4 on channel 149 with power 1 slot 1
CAPWAP_RM: Scheduling next neighbor request on ch index: 5

This particular AP was assigned a UNII-2 channel so we can confirm that it is indeed sending out NDP on every channel at the highest power level as seen on line 4 (bold is my enhancement). As a sidebar, the first line of this example shows the timer expiring. That timer is configured using the Wireless > 802.11a/n/ac > General tab, Monitor Intervals section on the RF Group leader WLC. What’s an RF Group leader, you ask? Read the guide. We run what has been called an “Ultra Redundancy” configuration, and when doing that the section on RF Grouping is critical. Turns out the configurations for all of this stuff have to match EXACTLY on all WLC’s when running in ultra redundancy mode. If they don’t, the whole TIMEOUT line starts to show up. That’s not in any guide; we learned that one the hard way. Moving on to our troubleshooting after our second code upgrade and configuration change.

Things had really broken down between Cisco TAC and me so they ended up bringing in a mediator to try and keep things civil and keep the case moving. I had been so far in to the weeds on this one that I needed a fresh set of eyes just to verify we were following sound troubleshooting techniques. I didn’t need a wireless guy, I just needed a tech guy. Using the fresh set of eyes, we were able to determine that NDP was being transmitted. We had an AP that was set to monitor mode and that guy could see EVERYTHING. However, the nearby AP had varying levels of success. When the operating channel was UNII-2, it was bad news for NDP and RRM. When the operating channel was UNII-1 or UNII-3, it worked fine. It was almost like an AP assigned to a UNII-2 channel as its primary channel stopped listening for NDP messages. For reasons that were wrong, but led to the correct answer in the end, search the Cisco guide and read the small paragraph about “NDP and DFS.” It’s under the chapter about RF Grouping, and turns out it’s pretty critical. Not in the way TAC wanted it to be but in how the system operates and how I was able to prove to TAC their stuff was broken. (Spoiler alert: They already knew it was broken at this point, but it was still nice to find it on my own.)

DFS poses a unique issue when it comes to NDP, and normal operation in general. In order for any Wi-Fi compliant device to transmit on a UNII-2 channel, it has to first hear a beacon frame from a master AP, or a directed probe from a client that is associated to a master AP. An AP becomes a master AP by being assigned an operating channel that is in the UNII-2 band, and then, following a set monitoring protocol, deeming itself a master AP. For Cisco, that monitoring protocol is to listen on the channel for 60 seconds for radar, and if hearing no radar, assume the master AP status and start beaconing using the normal protocols. From the AP CLI, issue the command show interfaces dot11Radio 1 dfs to get a report from the AP on what it thinks its DFS events are.

In this case, what I found using debug capwap rm measurements was the following log:

17:08:32.634: CAPWAP_RM: Timer expiry
17:08:32.634: CAPWAP_RM: Interference onchannel timer expired, slot 1, band 0
17:08:32.634: CAPWAP_RM: Starting rx activity timer slot 1 band 0
17:08:32.918: CAPWAP_RM: RRM measurement completed. Request 2008, slot 1 status TUNED
17:08:32.966: CAPWAP_RM: RRM measurement completed. Request 2008, slot 1 status SUCCESS
17:08:32.966: CAPWAP_RM: noise measurement channel 100 noise 97
17:08:32.966: CAPWAP_RM: Enabling signal seen on DFS ch 100, triggering neighbor packet
17:08:32.966: CAPWAP_RM: [On-demand] Neighbor packet request channel 100
17:08:32.966: CAPWAP_RM: Skipping chan 100; Radar detected
17:08:33.714: CAPWAP_RM: Timer expiry
17:08:33.714: CAPWAP_RM: Neighbor interval timer expired, slot 1, band 0
17:08:33.714: CAPWAP_RM: Skipping neighor request chan 132; DFS channel
17:08:33.714: CAPWAP_RM: Scheduling next neighbor request on ch index: 14

For this debug, I left the time stamp in to show how fast this stuff is happening. A couple of things learned from this capture: slot 1 is now tuning and reporting, so that’s good. The next hurdle is in the middle of the capture. Notice that at 17:08:32.918, it starts a RRM measurement. At 17:08:32.966 it completes the measurement. In sequence, with the same time stamp, we see a noise measurement on channel 100 (-97), an enabling signal (a beacon from a master AP or directed probe) which in turn triggers an “[On-demand] Neighbor packet” for that channel, and then within the same millisecond, skips the NDP packet on channel 100 because “Radar detected.” The next nugget is the second line from the bottom. The AP simply skips the NDP packet on channel 132 because it’s a DFS channel. From my perspective, it didn’t even try.
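Spotting that pattern by eye gets old quickly, so here is a small Python sketch that flags any “Radar detected” skip that carries the same time stamp as an enabling signal on the same channel. The patterns are built only from the log lines shown above and the function name `suspicious_radar_skips` is made up, so take it as an illustration under those assumptions and adjust it to whatever your code version prints.

```python
import re

# Patterns built from the debug lines quoted in this post (assumed format).
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}): CAPWAP_RM: (.*)$")
ENABLING_RE = re.compile(r"Enabling signal seen on DFS ch (\d+)")
RADAR_SKIP_RE = re.compile(r"Skipping chan (\d+); Radar detected")


def suspicious_radar_skips(lines):
    """Return (time stamp, channel) for radar skips that share a time stamp
    with an enabling signal on the same channel."""
    last_enabling = {}  # channel -> time stamp of the latest enabling signal
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, message = match.groups()
        enabling = ENABLING_RE.search(message)
        if enabling:
            last_enabling[int(enabling.group(1))] = stamp
            continue
        skipped = RADAR_SKIP_RE.search(message)
        if skipped:
            channel = int(skipped.group(1))
            if last_enabling.get(channel) == stamp:
                hits.append((stamp, channel))
    return hits


sample = """\
17:08:32.966: CAPWAP_RM: Enabling signal seen on DFS ch 100, triggering neighbor packet
17:08:32.966: CAPWAP_RM: [On-demand] Neighbor packet request channel 100
17:08:32.966: CAPWAP_RM: Skipping chan 100; Radar detected
17:08:33.714: CAPWAP_RM: Skipping neighor request chan 132; DFS channel
""".splitlines()

print(suspicious_radar_skips(sample))
```

The channel 132 line is deliberately not flagged, since it is skipped for being a DFS channel rather than for a radar detection.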

Some more information here, before moving on. While setting up the neighbor intervals in the Wireless > 802.11a/n/ac > General tab, Monitor Intervals section on the WLC, the timer is set for how often the NDP packet is transmitted. Spend some time reading this section in the guide because it determines how often you see the lines above. Default is set to once every 3 minutes; we currently run once every 1 minute. It’s a balancing act of how much time you want your system to devote to keeping the neighbor lists alive, giving the system a better chance to run a successful RRM cycle. Once you realize that the system will attempt to send NDP only AFTER it sees an enabling signal, it becomes critical that there is a Master AP in the area, operating on that UNII-2 channel. We fought about the section on Master AP’s for a while; Cisco arguing that there wasn’t a master AP, I was arguing that there was one. While important, it wasn’t the linchpin to the case.

As part of this exercise, I learned that Cisco LOVES to use the WLC config analyzer. I really hadn’t played with it much, but it can give you some good information. Cisco TAC loves it so much they stop paying attention to the physical distances between AP’s. TAC never thought there was a problem because the WLCCA showed all the AP’s having neighbors; no problem. What I realized is that when I took this line and looked at the map of where the AP that reported it was located, I found a problem.

Skipping chan 100; Radar detected

On the surface, very innocuous. In the WLCCA, never even considered. When looking at a map or standing in the location, turns out there was an AP 30 feet away from the AP we collected this log from.

The AP was a Master AP. On channel 100. And had been for at least 18 straight hours.

Further digging revealed that every 60 seconds, this AP was skipping on demand NDP messages being transmitted because it kept seeing radar in the same millisecond that it saw an enabling event. One step further; this same scenario was happening on THREE UNII-2 channels surrounding this AP. In each scenario, the adjacent Master AP on a UNII-2 channel had been on that channel for at least 18 hours – WITHOUT DETECTING ANY RADAR! Our guy in the middle of this mess, on channel 149, was detecting radar once a minute, every minute, for 18 hours, and therefore never sending out an NDP message on that channel. Due to the amount of time a Master AP has to spend watching for radar to appear on its channel, it doesn’t have much time to scan other channels looking for other AP’s NDP messages. With the code we were running, it wasn’t even possible to make this work.

Bottom line – to use RRM in a high-density deployment scenario and use UNII-2 channels to get the number of channels needed to accomplish this, be very careful of the code you are using. I can attest to the 8.2 train, but nothing else.

After taking all this evidence and reporting it to the Cisco Mobility Business Unit (BU), they came back and said they had some new code for us to try. The difference between the new code and the code we were running is 10,000 lines long. They knew they had a problem, they just never told anyone. The shortened version of what we were told is there are different chips in the AP used to detect radar. In the first attempt they used a single chip to detect radar, and it didn’t work. In the second attempt, they used a different chip, and it didn’t work. In the latest attempt, they are comparing the output from both chips and will only trigger if both report radar. While it isn’t perfect, I can report that it has resolved about 99% of the issues we were seeing. Now when I run a debug on the AP, after the enabling event, the NDP packet IS transmitted. It still doesn’t transmit on the posted schedule, but I can deal with that.
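For the coders in the room, here is a rough sketch of that decision logic as I understand it from the description we were given. This is not Cisco's code. The function names and the idea of each chip handing back a simple true/false are my own assumptions for illustration; the real detection lives in firmware I have never seen.

```python
def radar_detected(chip_a_reports_radar: bool, chip_b_reports_radar: bool) -> bool:
    """Hypothetical model of the newer code: flag radar only when BOTH chips agree."""
    return chip_a_reports_radar and chip_b_reports_radar


def should_send_ndp(enabling_signal_seen: bool,
                    chip_a_reports_radar: bool,
                    chip_b_reports_radar: bool) -> bool:
    """NDP goes out only after an enabling signal, and only if radar is not flagged."""
    if not enabling_signal_seen:
        # No Master AP heard on the channel, so the AP stays quiet.
        return False
    return not radar_detected(chip_a_reports_radar, chip_b_reports_radar)
```

In this model, a false positive from a single chip no longer blocks the NDP transmission, which lines up with the behavior we saw after the upgrade.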

My NDP is now happy. My RRM is now happier to the point we can actually start to tune it. The super-duper hyper-location system still isn’t happy, but it’s no longer the fault of the NDP packet. That’s another case for another story time.

To sum it all up, follow these steps:

  1. Read the guide, read the guide, read the guide.
  2. Follow basic troubleshooting steps. Just because it’s wireless doesn’t mean troubleshooting rules change.
  3. Do over the air packet captures. It’s the only way to confirm what you think is being transmitted is actually being transmitted.
  4. Use the AP CLI commands. debug capwap rm measurements, debug capwap rm neighbor, show interfaces dot11Radio 1 dfs
  5. Understand that while the WLCCA is good, it’s not foolproof. Use the correct tool for the job at hand.
  6. Use Cisco Prime Infrastructure (CPI). I was able to walk out and stand in the space and understand the RF in person. If you are remote, CPI, especially the new version, is a life saver.
  7. Don’t be afraid to push back on TAC. If the answer you are getting doesn’t jibe with what you are seeing, call them on it. If the answer violates IEEE protocol, CALL THEM ON IT! TAC can have a bad day, just like us.
  8. Don’t be afraid to use UNII-2. We use it and according to the guide, we are the one place you CAN’T use it.

When you put all this together, and really understand what is happening in the environment, it’s like pulling the cover off the matrix, if only a little bit. Hope this helps!

I Am Mildly Indifferent To Captive Portals

I know this has been a topic in the past: why captive portals exist, whether they should be there, what purpose they serve, and why companies want to monetize a service that most people believe is as crucial to running a facility as indoor plumbing and running water.

At the recent CWNP Wi-Fi Trek in Orlando, we had many discussions about captive portals that followed in this same train of thought.  What I noticed, and had the exact same conversation about twice, is no one knows what to do when traveling and you find yourself stuck behind one of these monstrosities.  What I have found in my professional life is the executives of my company complaining that the Wi-Fi in the hotel they were staying in while traveling “didn’t work.”  Of course the Wi-Fi didn’t work, I DIDN’T DESIGN IT!  Sad part is there might be a very qualified Wi-Fi professional on the other end of that design, very well could be one with more certifications and experience than I have, but some mid-level manager horked up their Wi-Fi system with a captive portal and now users complain.  The point of this discussion is captive portals are a way of life for the foreseeable future, and this is how to deal with them.

Back to the poor Wi-Fi professional who put tons of time and effort into designing a system that has perfect RF, great data flow, everything that a great Wi-Fi architect/engineer has dedicated years of their professional life training to do, only to hand it off to a server guy to mess up our work.  It’s always the Wi-Fi system’s fault so, as a guy in the know, I came up with a plan to deal with this bane of our existence.

In the spirit of full disclosure, I have been responsible for deploying captive portals in the past, and I know of at least 3 that are still in operation to this day.  Guess what, I’m a MUCH better RF guy than I am an HTML coder, so in my example, I’m the server guy that messed up my Wi-Fi system.  That’s how I know these tricks.  There are 3 steps, and I will list them off and then explain why and how it works.

  1. Never, and I mean NEVER, click on the little window that pops up that says “Click here to access the internet.”
  2. After NOT clicking the window or doohickey pop-up thing discussed in step 1, launch the browser of choice on your given device.
  3. Browse to a website of your choosing that is http and not https.

Explanation time, so hold on tight.  For those that want to bail out, now is the time to do it.  Spoilers do come next.  As with all rules, and these in particular, you can break the rule, but understand why you are breaking it, and if things go sideways realize you might have to come back to step one, possibly even a “reset the TCP/IP stack” kind of step one.  If you are reading this, after you get some idea of the mechanisms behind a captive portal, you will be able to tell rather quickly when you can break the process and when you have to go through the steps.  My wife, who makes me do all this for her anyway, is never allowed to break this process.

Step 1 – Never click the little window or pop up.  This ties back to the server guy who configured the captive portal and wrote the HTML script.  Sometimes that guy is awesome; sometimes he still lives in his mother’s basement.  This also ties in to step three, so we will refer back to this.  The pop up windows can be defeated if the server is programmed to pass the URLs that common devices check after they establish a connection to tell if they are connected to the internet.  Can I reach apple.com?  I’m on the Internet and there is no portal.  I can’t – must be a captive portal, let me show my user the pop up window.  The issue with this is the browser you get after clicking on that window isn’t always a full-blown browser.  In my experience, iPads are the worst.  The Safari browser that launches on an iPad doesn’t support Bluetooth keyboards.  Try explaining to your CEO why she has to turn off her Bluetooth keyboard, get the onscreen keyboard to launch, type her room number in, press enter on the screen and then turn her keyboard back on, right after she got off a 15-hour flight to Asia.  Step 1 was a painful lesson to learn.  Step 1 also allows you to fulfill Step 3.  If you skip Step 1, you might get stuck on Step 3.

Step 2 – Open up a browser session.  I like this because I have taken control back from the robots that inhabit my devices and make them do crazy things when I don’t want them to.  This will also allow me to proceed to Step 3, the most important step actually.  Different browsers will behave differently, and I can pick which browser I want to be in.  It doesn’t really matter, but the CEO of my company, and my wife (not the same person in this case, I really do have a separate CEO of the company) don’t get this and think they are stuck using whatever browser pops up.  Having something familiar when navigating the coding of some unknown person after a long day of traveling always helps.

Step 3 – Browse to an http website.  This is crucial, and I can’t stress this step enough.  Recent security concerns have prompted device, OS, and application folks to really lock down what your browser will allow you to do.  Hijacking an https session, which is what a captive portal is trying to do, makes the previously mentioned folks unhappy, which in turn makes you unhappy.  Fortunately for us, there are some websites that are still http and I keep a list of them handy and distributed throughout our company for this purpose.  Entering an http website allows a couple of things to happen, and understanding them is REALLY helpful.  When someone hits enter, their device tries to reach the Internet.  DNS servers “should” be whitelisted, so your device tries to browse to the remote server, and since it is an unsecured connection, your device will allow the captive portal to hijack the session and return its own HTML page in its place.  Someone at WiFiTrek said they enter 1.1.1.1, bypassing any possible DNS issues, to trigger the captive portal and it does work.  Depending on how savvy the end user is, this is a possibility.  Chris Reed recently suggested using http://neverssl.com which oddly enough was built for this purpose specifically.  I do take umbrage with them attacking the Wi-Fi system but that’s another battle for another day.
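If you would rather script Step 3 than type it, here is a minimal sketch of the same idea in Python: ask for a plain http page, refuse to follow redirects, and look at what comes back. The names `classify_response` and `probe` are mine, and the assumption that the real page contains the text "NeverSSL" is exactly that, an assumption; swap in any http site and marker text you trust. Treat the result as a hint, not a verdict.

```python
import urllib.error
import urllib.request
from typing import Optional

REDIRECT_CODES = (301, 302, 303, 307, 308)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so a portal's hijack stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def classify_response(status: int, location: Optional[str],
                      body: str, expected_marker: str) -> str:
    """Guess what answered a plain-http request: 'online', 'portal', or 'unknown'."""
    if status in REDIRECT_CODES:
        return "portal"   # something is steering us elsewhere, likely a login page
    if status == 200 and expected_marker in body:
        return "online"   # we got the page we actually asked for
    if status == 200:
        return "portal"   # a page came back, but not ours: hijacked in place
    return "unknown"


def probe(url: str = "http://neverssl.com/",
          expected_marker: str = "NeverSSL",   # assumed page text, adjust as needed
          timeout: float = 5.0) -> str:
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = resp.read(65536).decode("utf-8", errors="replace")
            return classify_response(resp.status, resp.headers.get("Location"),
                                     body, expected_marker)
    except urllib.error.HTTPError as err:
        return classify_response(err.code, err.headers.get("Location"),
                                 "", expected_marker)
    except (urllib.error.URLError, OSError):
        return "unknown"  # no answer at all: DNS blocked, timeout, and so on


if __name__ == "__main__":
    print(probe())
```

The redirect is deliberately not followed because the redirect itself is the evidence. A browser would chase it and show you the login page; the script just reports that someone other than your intended website answered.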

Back to my wife; she gets the http URL. There are ways to make it happen when accessing an https site but it takes a lot of time babysitting the server to keep the security certificates up to date and everything kosher on the back end.  Large organizations with dedicated IT can pull this off somewhat successfully.  This process is really for the Days Inn you find yourself staying in while visiting the scenic town that is Rawlins, Wyoming.

Anyway, you will finally be seeing the captive portal in a browser you know and one that has all the functionality you expect to have.  Now you can actually interact with the page and enter your name, room number, place of birth, blood type, your first neighbor’s pet’s middle name, and what you ate for dinner exactly 259 days ago; you know, the normal stuff.  Hit enter and then drop to your knees and pray.  Not really, but it can’t hurt.

Assuming the information you entered is correct, you will be allowed past the captive part of the captive portal.  Remember, a negative answer from a server isn’t a malfunction, it’s simply a negative answer, but it’s still an answer!  A negative answer means you now need to contact someone onsite to verify what you entered is correct.  In some instances, staff have to manually enter your credentials and the problem might reside there.  If you are validated on the captive part of the system, you are now in a grey area that can change based on any number of factors.  This is where following the steps reduces the amount of grey area you encounter and gives you the best shot of not being confused.  To clarify this, we now need to talk about server programming that will “break” the Wi-Fi system.

Captive Portal systems have a “feature” known as “post-authentication redirection.”  What this means is the server has the ability, if enabled, to send you to a predetermined URL that is entered by the programmer.  This is used to send you to the homepage of the location you are at or a different URL if the system owner decides that.  Either way, it’s simply a URL that’s entered on a single line in the server.  If the portal you are navigating has this enabled, and the URL is still valid, you will see a web page.  This is great because it means you have completed the process.  The portal will log your Wi-Fi MAC address for a predetermined time, like a DHCP lease timer, and you are now free to surf the internet on your preferred browser; the one from Step 2.  The process is done until the timer runs out and you must repeat the process.  The issues come when this post-authentication redirection isn’t enabled or they are trying to be fancy and the redirection gets lost and never sent to you.  This scenario is why this process even exists.
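To make the moving parts concrete, here is a toy model of what the portal is doing behind the scenes: remember a MAC address for a fixed time and optionally hand back a redirect URL. The class `PortalSessions` and everything in it is hypothetical, written only to illustrate the description above; real portals store and enforce this in their own ways.

```python
from typing import Dict, Optional


class PortalSessions:
    """Toy model: authorized MACs expire after a fixed session time."""

    def __init__(self, session_seconds: float, redirect_url: Optional[str] = None):
        self.session_seconds = session_seconds
        self.redirect_url = redirect_url      # the "single line" in the server, if set
        self._expiry: Dict[str, float] = {}   # MAC address -> time the session ends

    def authorize(self, mac: str, now: float) -> Optional[str]:
        """Log the MAC after a successful login; return the redirect URL, if any."""
        self._expiry[mac.lower()] = now + self.session_seconds
        return self.redirect_url

    def is_authorized(self, mac: str, now: float) -> bool:
        """True while the timer for this MAC has not run out."""
        expiry = self._expiry.get(mac.lower())
        return expiry is not None and now < expiry
```

Notice that `authorize` can return nothing at all while the MAC is still logged. That is the case where you are online but the browser has nothing new to show you.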

Possibility 1 – Post-authentication redirection isn’t set up.  This is the most common and easiest to diagnose / solve.  If Step One isn’t followed, this becomes an issue.  When clicking on the little pop-up window, you are opening up a browser with the sole purpose of loading an HTML page.  You never actually tried to go anywhere.  Without the redirect, you now have nothing to show.  Depending on the device, browser, OS, personal settings, etc., you may get something, a blank screen, or possibly even the log in page again.  It was the last HTML page it displayed, so it just shows it again.  If you don’t know any better, you start cussing the Wi-Fi and throw your device against the opposite wall.  Funny thing is you are actually online and just don’t know it.  Close the browser and continue on your way.  If you followed these steps, you entered a website in Step 3 that will now appear.  This is a visual indicator to you, your spouse, and your boss that they are now online because they got where they were headed.  Wi-Fi obviously works and no devices are injured in this experience.  Wi-Fi designer is AWESOME (of course we are) and they continue on with their life and go drink cognac wearing a smoking jacket next to the fire, or whatever normal people do in hotels.

Possibility 2 – Post-authentication redirection is set up but they tried to be fancy and it’s broken.  This is harder to diagnose and solve because now it’s not just one designer that failed, it’s a whole team of them.  There is a certain airport Wi-Fi provider that is pervasive around the world and they have this problem all the time.  They don’t admit it, but they do.  Based on where you are and how you logged in and how much you pay they will show you a different experience.  Sometimes they get too fancy and you end up being shown a dead-end road.  From my experience this is a white screen with a banner at the top.  Also in my experience, you may or may not be online; the end user has to attempt to browse to a new website to see if it works.  This scenario is harder for the end user to overcome because what you see seems to contradict what you think you know.  If your intended web page shows up, you are online and can close the browser and go get a stiff drink as a reward for successfully completing this gauntlet of terror.

If attempting to browse to your website doesn’t work you will see the home page of the portal again.  Try to navigate it again and if you get the dead end again, you are now at the mercy of the portal operator.  If the name of that operator starts with a “B” and sounds like something you would hear at the local Bingo Parlor on a Thursday night, give up and go hide your head.  If it’s a different provider, you might be able to contact them and convince them to white list your Wi-Fi MAC address through the captive portal, preventing you from needing to navigate the portal altogether.  It’s an outside shot, but it can be done.  The unfortunate truth is the client is ALWAYS the one who suffers in this outdated attempt to monetize a service that should be free.  If you aren’t going to plan a guest Wi-Fi that is fast, free, and frictionless, just don’t do it at all.  End of story.  If your real-world experience ends here, I offer my heartfelt apologies.  Wish there was more I could do for you.

There it is.  The end of our journey.  All I can say is unless something drastically changes in the industry very soon, captive portals are going to be a part of our lives for a long time to come.  While dealing with them isn’t pleasant, I hope this helps you to at least reduce some of the friction when trying to explain something that shouldn’t exist to someone who doesn’t understand it.  I would love to hear others’ experiences and their tips for dealing with captive portals so please share your story!

Wi-Fi Trek

Orlando is hot and muggy, even in October.

I am at the airport, waiting for my flight back to the crisp, cold weather of Colorado and leaving this stuff behind.  I won’t miss the weather, but I will miss the people.

I met some amazing people who actually accepted a nerdy radio guy in to the nerdy Wi-Fi club.  The list is too long to name, but if you were around me this week, then you know who you are.  I took a design class for 3 days, 10 hours a day and got to talk and share stories about nothing but Wi-Fi.  I got to hang out with professionals from all over the world and talk Wi-Fi and technology in general.

I took the Certified Wireless Design Professional certification test on the second to last day.  Even though I disagree with at least 2 of the questions on my test, and I KNOW that one of them is total garbage, I still passed.  Go me!

What do I take away from this week?  There are some crazy smart guys in this world that can talk about wireless processes that are measured in nano-seconds, NANO-SECONDS, for 20 minutes and keep me enthralled!  I thought I was pretty good but compared to these guys, I’m a monkey who can ask for grapes.  (Yes, I stole that line.  I like to steal obscure social references and incorporate them into conversations.  If you get them, you’re welcome.)  Listening to these people talk is inspiring.  Interframe spacing times, frame and packet analysis, and general philosophies about how things can be done are refreshing to hear from a room full of people.  Overall, it makes me want to be a better person so next year I feel like I have joined their world and am not an interloper.  Hell, I started this blog to try and contribute what I can to the community.  I might be the dancing clown in the corner, but at least I feel like I brought something to the table to make up for all the stuff I’m stealing from it in the meantime.

What next?  No one really cares but me, but I do have some ideas for blog posts.  The antenna theory in the community needs a shot in the arm.  We call it theory but in reality there isn’t much theory, it’s all practical.  So some blogging, waiting for the podcast so I can be formally introduced into the community, and waiting for the WLPC conference in February.  I will be working on a 10 minute presentation for that to get some speaking skills honed up.  Training, certifications, and general learning are also in my future.  What really inspired me was hearing that CWNEs number 2 and 3 not only taught or took classes at Wi-Fi Trek this year, but also took tests again.  In so many other technology disciplines, the “experts” sit back on their laurels and never keep up with technology and trends.  The fact that they still care is the biggest takeaway from my week.

Experts who care, and care about the monkeys and their grapes.