NDP and You; A Continuing Saga

So a couple of months ago I wrote a blog about how I came to discover the Wi-Fi community of CWNP, Wireless LAN Professionals, and WLPC.  In that blog I discussed a different blog that I found about Cisco NDP that was written by Rowell Dionicio from Packet 6.  That blog was the start of a journey that led me to CWNP and my 9 month struggle with Cisco TAC.  Since I know enough to not say this is the conclusion of that journey, I will just say this is the next part of my NDP journey.

For starters, read this blog by Rowell.  It’s what started my journey and it is the basis for the experience I went through over the past 9 months, fighting with TAC and learning more about the background processes that happen in a Cisco Wireless LAN deployment.  Read this, then come back and I’ll attempt to put a bow on my story and lessons learned, all while trying to keep things professional.

https://www.packet6.com/cisco-ndp-neighbor-discovery-protocol/

In this story, it all stemmed from our RRM not working at all.  Like not even close to acting like it was working.  We would come in on a Monday morning and every 5 GHz radio would be on the same channel, and all at the highest transmit power.  Looking back on it, I can now explain it as if all the AP’s appeared to be on an island; isolated and all alone.  They should have had multiple neighbors; instead they had none.  Oddly enough, when looking at the 2.4 GHz channels, they had more neighbors than they should have, and at one point TAC even suggested the problem was too many neighbors.

About 6 weeks after opening up our case, we had some time in the office so we decided to get back to basics and do some super basic troubleshooting.  We took 2 AP’s and turned them up in our office, and started running some debugs and packet captures from the 2 AP’s.  While watching the active debugs, I noticed that one of the AP’s showed 2 different lines, while the other one only had one line that kept repeating.  The values changed, but the “header” info stayed the same.  The 2 lines are:

LWAPP NEIGHBOR and CAPWAP_RM

Doing what I do best, I went to the Google to look for those terms and lo and behold, Rowell’s blog post came up first for the LWAPP entry.  What happened after that was a blur, but what I can tell you now is my world changed.  The reason why RRM didn’t work? NDP wasn’t happy.  Why didn’t anything work? NDP wasn’t happy.  Why didn’t our super-duper expensive hyper-location service work?  NDP wasn’t happy.  I think you see where I am going with this.

Why wasn’t RRM happy?  NDP wasn’t working.  The special commands to look at these packets are listed in Rowell’s blog, but I will list them here.  Both are run from the CLI on the AP itself, and you better be saving the output because it comes at you fast and furious.

debug capwap rm measurements and debug capwap rm neighbor

I will tell you, looking at these at first can be intimidating, and for a while they were for me as well, but then I found this guide.  When I first found it Cisco had it listed as a White Paper, it has now progressed to being an actual guide.  Before I allow anyone outside our group to touch our system or work on it, they have to read this guide.  It’s so important that I have 2 hard copies printed out; one for me and one marked as a guest copy.  If you are going to do Cisco RRM, don’t do anything without reading this document.  Rowell’s blog is a condensed version of this guide, but this is worth every page.

https://www.cisco.com/c/en/us/td/docs/wireless/controller/technotes/8-3/b_RRM_White_Paper.html

After reading the first couple of sections, I was able to go back into the debugs and read them like they were a novel.  My whole world opened up.  Everything I thought I knew about Wireless LAN changed, and I realized I didn’t know even a fraction of what I thought I knew.  I also realized that Cisco, at least Cisco TAC, didn’t know about this information.  It was at this point I knew I needed to find a group that could teach me the IEEE standards of Wi-Fi, not just what button to click on the WLC.  That’s an earlier blog post you can find here.  The first thing we noticed was NDP simply wasn’t being transmitted on the 5 GHz band.  That was simple to check: we did a wireless capture and sure enough, no NDP being sent out on any channel, by any AP.  By reading the guide and Rowell’s post, you should realize that NDP is supposed to go out on EVERY channel, not just its assigned channel.  When we sat in an area surrounded by 6 AP’s and didn’t see one NDP packet on a given channel in 10 minutes, it was an easy problem to present to TAC.  That resulted in our second code upgrade to deal with this problem.  (For those with any experience, you know as soon as you open a ticket you plan on doing a code upgrade, rather annoying actually.)

So we finish our first code upgrade and things are working OK, but after about 48 hours we realize that there actually isn’t much change.  Turns out when you bounce the entire system, which happens in a code upgrade, it magically fixes everything for a short period of time.  For more on that check out this rant.  What we found is even though the NDP packet was being sent out, we had a problem with the radio actually listening for the packet.  Using the debug capwap rm measurements command we found this:

CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status TUNED
CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status SUCCESS
CAPWAP_RM: noise measurement channel 8 noise 81
CAPWAP_RM: Rx Timer expiry
CAPWAP_RM: Neighbor Interval timer(slot 0) expired
CAPWAP_RM: Generating aggregated neighbor report for slot 0
CAPWAP_RM: RRM measurement completed. Request 2017, slot 1 status TIMEOUT
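
With hours of this output scrolling past, it can help to tally the measurement results per radio slot instead of reading line by line.  The following is a minimal Python sketch, assuming the debug output has been saved as text and that the lines look like the sample above; the function name tally_measurements is made up for illustration and is not part of any Cisco tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   CAPWAP_RM: RRM measurement completed. Request 2017, slot 1 status TIMEOUT
# The pattern is inferred from the sample output only (an assumption).
MEASUREMENT = re.compile(
    r"RRM measurement completed\. Request \d+, slot (\d+) status (\w+)"
)


def tally_measurements(lines):
    """Count measurement results per (slot, status) pair."""
    counts = Counter()
    for line in lines:
        match = MEASUREMENT.search(line)
        if match:
            slot = int(match.group(1))
            status = match.group(2)
            counts[(slot, status)] += 1
    return counts


# Sample lines copied from the debug output above.
sample = [
    "CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status TUNED",
    "CAPWAP_RM: RRM measurement completed. Request 2007, slot 0 status SUCCESS",
    "CAPWAP_RM: noise measurement channel 8 noise 81",
    "CAPWAP_RM: Rx Timer expiry",
    "CAPWAP_RM: RRM measurement completed. Request 2017, slot 1 status TIMEOUT",
]

counts = tally_measurements(sample)
for (slot, status), n in sorted(counts.items()):
    print(f"slot {slot}: {status} x{n}")
```

A slot that only ever shows TIMEOUT over a long capture, with no TUNED or SUCCESS entries, is the pattern to look for.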

For reference, on a Cisco AP, “slot 0” is the 2.4 GHz radio and “slot 1” is the 5 GHz radio.  Hours of debugs showed the same thing, the radio would never successfully tune on the 5 GHz band, it would always return the status of “TIMEOUT”.  The good news is on our “new” code we did see the following output:

CAPWAP_RM: Timer expiry
CAPWAP_RM: Neighbor interval timer expired, slot 1, band 0
CAPWAP_RM: Triggering neighbor request on ch index: 4
CAPWAP_RM: Sending neighbor packet #4 on channel 149 with power 1 slot 1
CAPWAP_RM: Scheduling next neighbor request on ch index: 5

This particular AP was assigned a UNII-2 channel so we can confirm that it is indeed sending out NDP on every channel at the highest power level, as seen on the fourth line of the output (channel 149, power 1).  As a sidebar, the first line of this example shows the timer expiring.  That timer is configured using the Wireless > 802.11a/n/ac > General tab, Monitor Intervals section on the RF Group leader WLC.  What’s an RF Group leader, you ask?  Read the guide.  We run what has been called an “Ultra Redundancy” configuration, and when doing that the section on RF Grouping is critical.  Turns out the configurations for all of this stuff have to match EXACTLY on all WLC’s when running in ultra redundancy mode.  If they don’t, the whole TIMEOUT line starts to show up.  That’s not in any guide; we learned that one the hard way.  Moving on to our troubleshooting after our second code upgrade and configuration change.

Things had really broken down between Cisco TAC and me so they ended up bringing in a mediator to try and keep things civil and keep the case moving.  I had been so far into the weeds on this one that I needed a fresh set of eyes just to verify we were following sound troubleshooting techniques.  I didn’t need a wireless guy, I just needed a tech guy.  Using the fresh set of eyes, we were able to determine that NDP was being transmitted.  We had an AP that was set to monitor mode and that guy could see EVERYTHING.  However, the nearby AP had varying levels of success.  When the operating channel was UNII-2, it was bad news for NDP and RRM.  When the operating channel was UNII-1 or UNII-3, it worked fine.  It was almost like an AP assigned to a UNII-2 channel as its primary channel stopped listening for NDP messages.  For reasons that were wrong, but led to the correct answer in the end, search the Cisco guide and read the small paragraph about “NDP and DFS.”  It’s under the chapter about RF Grouping, and turns out it’s pretty critical.  Not in the way TAC wanted it to be but in how the system operates and how I was able to prove to TAC their stuff was broken.  (Spoiler alert: They already knew it was broken at this point, but it was still nice to find it on my own.)

DFS poses a unique issue when it comes to NDP, and normal operation in general.  In order for any Wi-Fi compliant device to transmit on a UNII-2 channel, it has to first hear a beacon frame from a master AP, or a directed probe from a client that is associated to a master AP.  An AP becomes a master AP by being assigned an operating channel that is in the UNII-2 band and then, after following a set monitoring protocol, deeming itself a master AP.  For Cisco, that monitoring protocol is to listen on the channel for 60 seconds for radar, and if hearing no radar, assume the master AP status and start beaconing using the normal protocols.  From the AP CLI, issue the command show interfaces dot11Radio 1 dfs to get a report from the AP on what it thinks its DFS events are.

In this case, what I found using debug capwap rm measurements was the following log:

17:08:32.634: CAPWAP_RM: Timer expiry
17:08:32.634: CAPWAP_RM: Interference onchannel timer expired, slot 1, band 0
17:08:32.634: CAPWAP_RM: Starting rx activity timer slot 1 band 0
17:08:32.918: CAPWAP_RM: RRM measurement completed. Request 2008, slot 1 status TUNED
17:08:32.966: CAPWAP_RM: RRM measurement completed. Request 2008, slot 1 status SUCCESS
17:08:32.966: CAPWAP_RM: noise measurement channel 100 noise 97
17:08:32.966: CAPWAP_RM: Enabling signal seen on DFS ch 100, triggering neighbor packet
17:08:32.966: CAPWAP_RM: [On-demand] Neighbor packet request channel 100
17:08:32.966: CAPWAP_RM: Skipping chan 100; Radar detected
17:08:33.714: CAPWAP_RM: Timer expiry
17:08:33.714: CAPWAP_RM: Neighbor interval timer expired, slot 1, band 0
17:08:33.714: CAPWAP_RM: Skipping neighor request chan 132; DFS channel
17:08:33.714: CAPWAP_RM: Scheduling next neighbor request on ch index: 14

For this debug, I left the time stamp in to show how fast this stuff is happening.  The first thing learned from this capture is that slot 1 is now tuning and reporting, so that’s good.  The next hurdle is in the middle of the capture.  Notice that at 17:08:32.918, it starts an RRM measurement.  At 17:08:32.966 it completes the measurement.  In sequence, with the same time stamp, we see a noise measurement on channel 100 (-97), an enabling signal (a beacon from a master AP or directed probe) which in turn triggers an “[On-demand] Neighbor packet” for that channel, and then within the same millisecond, the AP skips the NDP packet on channel 100 because “Radar detected.”  The next nugget is the second line from the bottom.  The AP simply skips the NDP packet on channel 132 because it’s a DFS channel.  From my perspective, it didn’t even try.
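
Spotting that enabling-signal-then-radar pattern by eye gets old quickly in a long capture.  Here is a rough Python sketch of one way to flag it, assuming timestamped debug lines in the same format as the log above; the regular expressions and the function name suspicious_radar_skips are illustrative guesses based on this one sample, not an official log format.

```python
import re

# Patterns inferred from the timestamped sample above (assumption).
ENABLING = re.compile(
    r"^(\S+): CAPWAP_RM: Enabling signal seen on DFS ch (\d+)"
)
RADAR = re.compile(
    r"^(\S+): CAPWAP_RM: Skipping chan (\d+); Radar detected"
)


def suspicious_radar_skips(lines):
    """Return (timestamp, channel) pairs where an enabling signal and a
    radar skip were logged for the same channel with the same timestamp."""
    enabled = set()
    hits = []
    for line in lines:
        line = line.strip()
        m = ENABLING.match(line)
        if m:
            enabled.add((m.group(1), int(m.group(2))))
            continue
        m = RADAR.match(line)
        if m:
            key = (m.group(1), int(m.group(2)))
            if key in enabled:
                hits.append(key)
    return hits


# Lines copied from the capture above.
sample = [
    "17:08:32.966: CAPWAP_RM: noise measurement channel 100 noise 97",
    "17:08:32.966: CAPWAP_RM: Enabling signal seen on DFS ch 100, triggering neighbor packet",
    "17:08:32.966: CAPWAP_RM: [On-demand] Neighbor packet request channel 100",
    "17:08:32.966: CAPWAP_RM: Skipping chan 100; Radar detected",
    "17:08:33.714: CAPWAP_RM: Skipping neighor request chan 132; DFS channel",
]

hits = suspicious_radar_skips(sample)
print(hits)
```

On the sample lines this flags channel 100 at 17:08:32.966 and ignores the channel 132 line, which is skipped for being a DFS channel rather than for radar.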

Some more information here, before moving on.  While setting up the neighbor intervals in the Wireless > 802.11a/n/ac > General tab, Monitor Intervals section on the WLC, the timer is set for how often the NDP packet is transmitted.  Spend some time reading this section in the guide because it determines how often you see the lines above.  Default is set to once every 3 minutes; we currently run once every 1 minute.  It’s a balancing act of how much time you want your system to devote to keeping the neighbor lists alive, giving the system a better chance to run a successful RRM cycle.  Once you realize that the system will attempt to send NDP only AFTER it sees an enabling signal, it becomes critical that there is a Master AP in the area, operating on that UNII-2 channel.  We fought about the section on Master AP’s for a while; Cisco arguing that there wasn’t a master AP, I was arguing that there was one.  While important, it wasn’t the linchpin to the case.

As part of this exercise, I learned that Cisco LOVES to use the WLC config analyzer (WLCCA).  I really hadn’t played with it much, but it can give you some good information.  Cisco TAC loves it so much they stop paying attention to the physical distances between AP’s.  TAC never thought there was a problem because the WLCCA showed all the AP’s having neighbors; no problem.  When I took the line below and looked at the map to see where the AP that reported it was located, I found a problem.

Skipping chan 100; Radar detected

On the surface, very innocuous.  In the WLCCA, never even considered.  When looking at a map or standing in the location, turns out there was an AP 30 feet away from the AP we collected this log from.

The AP was a Master AP.  On channel 100.  And had been for at least 18 straight hours.

Further digging revealed that every 60 seconds, this AP was skipping its on-demand NDP transmissions because it kept seeing radar in the same millisecond that it saw an enabling event.  One step further; this same scenario was happening on THREE UNII-2 channels surrounding this AP.  In each scenario, the adjacent Master AP on a UNII-2 channel had been on that channel for at least 18 hours WITHOUT DETECTING ANY RADAR!  Our guy in the middle of this mess, on channel 149, was detecting radar once a minute, every minute, for 18 hours, and therefore never sending out an NDP message on that channel.  Due to the amount of time a Master AP has to spend watching for radar to appear on its channel, it doesn’t have much time to scan other channels looking for other AP’s NDP messages.  With the code we were running, it wasn’t even possible to make this work.
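
To put numbers on "never sending," one option is to count, per channel, how many neighbor packets were sent versus skipped, and why.  Below is a small Python sketch under the assumption that the "Sending neighbor packet" and "Skipping" lines always look like the ones quoted in this post; the function name ndp_summary is hypothetical, and the misspelled "neighor" is matched loosely because that is how it appears in the debug.

```python
import re
from collections import Counter

# Patterns inferred from the debug lines quoted in this post (assumption).
SENT = re.compile(r"Sending neighbor packet #\d+ on channel (\d+)")
SKIPPED = re.compile(r"Skipping (?:neigh\w* request )?chan (\d+); (.+)")


def ndp_summary(lines):
    """Return two counters: packets sent per channel, and packets
    skipped per (channel, reason)."""
    sent = Counter()
    skipped = Counter()
    for line in lines:
        line = line.strip()
        m = SENT.search(line)
        if m:
            sent[int(m.group(1))] += 1
            continue
        m = SKIPPED.search(line)
        if m:
            skipped[(int(m.group(1)), m.group(2).strip())] += 1
    return sent, skipped


# Sample built from lines quoted earlier in this post.
sample = [
    "CAPWAP_RM: Sending neighbor packet #4 on channel 149 with power 1 slot 1",
    "CAPWAP_RM: Skipping chan 100; Radar detected",
    "CAPWAP_RM: Skipping neighor request chan 132; DFS channel",
]

sent, skipped = ndp_summary(sample)
for channel, n in sorted(sent.items()):
    print(f"channel {channel}: sent {n}")
for (channel, reason), n in sorted(skipped.items()):
    print(f"channel {channel}: skipped {n} ({reason})")
```

A channel that shows up only in the skipped counter over many hours, next to a Master AP that has been sitting on it radar-free, is the kind of evidence that is easy to hand to TAC.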

Bottom line: to use RRM in a high-density deployment and use UNII-2 channels to get the number of channels needed to accomplish this, be very careful of the code you are using.  I can attest to the 8.2 train, but nothing else.

After taking all this evidence and reporting it to the Cisco Mobility Business Unit (BU), they came back and said they had some new code for us to try.  The difference between the new code and the code we were running is 10,000 lines long.  They knew they had a problem, they just never told anyone.  The shortened version of what we were told is there are different chips in the AP used to detect radar.  In the first attempt they used a single chip to detect radar, and it didn’t work.  In the second attempt, they used a different chip, and it didn’t work.  In the latest attempt, they are comparing the output from both chips and will only trigger if both report radar.  While it isn’t perfect, I can report that it has resolved about 99% of the issues we were seeing.  Now when I run a debug on the AP, after the enabling event, the NDP packet IS transmitted.  It still doesn’t transmit on the posted schedule, but I can deal with that.

My NDP is now happy.  My RRM is now happier to the point we can actually start to tune it.  The super-duper hyper-location system still isn’t happy, but it’s no longer the fault of the NDP packet.  That’s another case for another story time.

To sum it all up, follow these steps:

  1. Read the guide, read the guide, read the guide.
  2. Follow basic troubleshooting steps.  Just because it’s wireless doesn’t mean troubleshooting rules change.
  3. Do over the air packet captures.  It’s the only way to confirm what you think is being transmitted is actually being transmitted.
  4. Use the AP CLI commands.  debug capwap rm measurements, debug capwap rm neighbor, show interfaces dot11Radio 1 dfs
  5. Understand that while the WLCCA is good, it’s not foolproof.  Use the correct tool for the job at hand.
  6. Use Cisco Prime Infrastructure (CPI).  I was able to walk out and stand in the space and understand the RF in person.  If you are remote, CPI, especially the new version, is a life saver.
  7. Don’t be afraid to push back on TAC.  If the answer you are getting doesn’t jibe with what you are seeing, call them on it.  If the answer violates IEEE protocol, CALL THEM ON IT!  TAC can have a bad day, just like us.
  8. Don’t be afraid to use UNII-2.  We use it and according to the guide, we are the one place you CAN’T use it.

When you put all this together, and really understand what is happening in the environment, it’s like pulling the cover off the matrix, if only a little bit.  Hope this helps!
