Odd AD issue, opendirectoryd using 100% CPU

raymondap
Contributor

We've been struggling with an odd AD issue at one site for days now, and I'm running out of ideas to try. I'm hoping someone has seen something like this in the past and can point me in a direction. I suspect DNS, but haven't been able to find anything definitive.

We have one subnet on which no Mac client can look up AD accounts, even though the machines appear to be bound to AD. Affected machines are both MacBook Airs and Mac minis, running either 10.12.3 or 10.12.4. If these users take their computers to another subnet, they have no issues, and anyone who visits the site with this subnet has the problem while there.

Time sync is not an issue, and our usual tricks of restarting opendirectoryd and/or flushing directory and DNS caches don't work. Unbinding and rebinding also does not resolve the issue; the machines unbind and rebind fine, but are unable to authenticate AD users or even look up AD accounts. The machines also appear to be fine in Active Directory, with the proper machine record in the proper OU, and appear correctly in Windows DNS. We use Infoblox (managed by another State agency) for DHCP, and our own Windows DNS for DNS at all our sites. This subnet appears to be configured identically to all our other (40-ish) subnets.

I can sometimes work around the issue by manually configuring DNS servers on the client, but I'm not convinced that that hasn't been a coincidence, since I can't recreate that 100% of the time.

On the clients, aside from not being able to communicate with AD:

  • opendirectoryd is approx 100% CPU
  • I get "Invalid Path" and "DS Error: -14009 (eDSUnknownNodeName)" when trying to use dscl to browse AD
  • User lookups with id result in "unknown user"
  • If I crank up odutil logging, I see "Module: ActiveDirectory - DDNS update - failure -- '10.118.185.201' - exit status 2 -- ; TSIG error with server: tsig verify failure" in the logs
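For anyone who wants to check a client for the same set of symptoms, here is a minimal Python sketch. It is not something from the original troubleshooting; the function names, the 90% CPU threshold, and the matched strings are assumptions based only on the messages quoted in the list above.

```python
import subprocess


def run(cmd):
    """Run a command and return (exit_code, stdout + stderr) without raising."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 1, str(exc)
    return proc.returncode, proc.stdout + proc.stderr


def opendirectoryd_cpu(ps_output):
    """Sum the %CPU column for opendirectoryd in `ps -A -o %cpu,comm` output."""
    total = 0.0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip().endswith("opendirectoryd"):
            try:
                total += float(parts[0])
            except ValueError:
                pass  # ignore rows whose CPU field is not a plain number
    return total


def classify(id_output, dscl_output, od_cpu_percent, cpu_threshold=90.0):
    """Map raw tool output to the symptoms described in the post.

    The matched strings and the threshold are assumptions, not an
    exhaustive list of what these tools can print.
    """
    symptoms = []
    lowered = id_output.lower()
    if "no such user" in lowered or "unknown user" in lowered:
        symptoms.append("user lookup failed")
    if "eDSUnknownNodeName" in dscl_output or "Invalid Path" in dscl_output:
        symptoms.append("AD node unreachable in dscl")
    if od_cpu_percent >= cpu_threshold:
        symptoms.append("opendirectoryd CPU high")
    return symptoms
```

On an affected Mac you might feed it something like `run(["id", "some_ad_user"])`, `run(["dscl", "localhost", "-list", "/Active Directory"])`, and `run(["ps", "-A", "-o", "%cpu,comm"])`; the account name and the dscl path are placeholders to adjust for your own domain.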

We've pretty much exhausted internal resources, as well as help we've brought in from the group that handles Infoblox. We've also tagged in a Microsoft AD/DNS consultant who is also at a loss.

Has anyone here seen behavior like this in the past? I've pasted a log file [LINK REMOVED] that shows me rejoining the domain, then errors.

7 REPLIES

The_Lapin
New Contributor III

I haven't seen this for a few years but a Mavericks update caused similar behavior a while back.

OS X: If the opendirectoryd process CPU utilization is high after updating to OS X v10.9.5

Might be worth a look

Look
Valued Contributor III

Do you run ESET antivirus?
We are still on an older version because the last few versions have caused this on our devices.
It looks like it might finally have been addressed in the very latest version.

raymondap
Contributor

@The_Lapin and @Look, thank you both for the responses. The Apple article about search policy in Mavericks was one of the first roads we went down, but it didn't help. As for antivirus, at the time the problem started (about April 11), one client at this site used McAfee, and the others had no antivirus client. Since that time, we've removed McAfee from all clients and pushed out Microsoft SCEP on April 25. Removing antivirus doesn't seem to make a difference, but it was worth a try.

Some more info: if I connect any of these clients to our VPN, the issues vanish (just as they do at another site), so it's definitely something network-related on that subnet, but I have yet to find anything that I can point the network and/or Infoblox folks to. The other problem this is starting to cause: because opendirectoryd is spinning out of control, we're having to go in and clear out /var/log/DiagnosticMessages, since / is running out of space from the volume of log files being generated, even at the default logging level.
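To keep an eye on how much of the boot volume the logs are eating, something along these lines could help. It is only a sketch: the helper names are made up for illustration, and it measures apparent file sizes rather than blocks actually allocated on disk.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total apparent size of all files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file rotated away or unreadable; skip it
    return total


def log_share_of_disk(log_dir, mount_point="/"):
    """Return (log bytes, free bytes on the volume, logs as a fraction of the volume)."""
    usage = shutil.disk_usage(mount_point)
    size = dir_size_bytes(log_dir)
    return size, usage.free, size / usage.total
```

For the situation described above, the call would be `log_share_of_disk("/var/log/DiagnosticMessages")`, most likely run as root so every file is readable.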

mickgrant
Contributor III

Do you happen to use Apple Remote Desktop in your environment?

We had a very similar issue and traced it back to Apple Remote Desktop's reporting functions. Once we turned them off, we stopped seeing the errors.

raymondap
Contributor

@mickgrant do you mean turning off reports under the client settings for Remote Management?

Oddly no one at this site is having an issue today, which is the first time in a while. I'm not sure if the issue will return, but if it does, I will keep your suggestion in mind.

raymondap
Contributor

Wanted to give an update on this...

Unfortunately the issue returned, so I made a site visit this week. As soon as I plugged in my laptop (which has a completely different image than the clients at the site) to the network at this location, it was like the kiss of death. I have honestly never seen anything like this. opendirectoryd spun out of control to the point that I couldn't even log into a local account. The only way to get my computer back to a usable state was to unplug it from the network. Once I logged in and plugged back in, things went out of control again. Even though I was logged in, I couldn't even open a new terminal window; it would hang at a blank screen for a while before saying "invalid account". This was while I was logged into a local account.

I suspected that some device on the local network was causing the issues, but ruled that out by first unplugging every other device except my test laptop, and then plugging my test laptop directly into the router. In both cases, I saw the issues immediately. I collected packet captures from the site and from another working site and sent them off to our network folks, but so far, no one has found the smoking gun.

The other cool part... since I started aggressively monitoring this problem, I've noticed that it goes away for brief periods, then returns. I haven't noticed a pattern about times.
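One way to hunt for a time pattern is to record a health sample at a fixed interval (for example, whether an AD lookup succeeded) and collapse the samples into outage windows afterwards. The sketch below shows only the collapsing step; the sample format and the function name are assumptions, not something we actually ran.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Collapse (timestamp, healthy) samples into (start, end) outage windows.

    samples must be in chronological order. A window starts at the first
    unhealthy sample and ends at the next healthy one; a window still open
    at the end of the data ends at the last unhealthy sample seen.
    """
    windows = []
    start = None
    last_bad = None
    for ts, healthy in samples:
        if not healthy:
            if start is None:
                start = ts
            last_bad = ts
        elif start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, last_bad))
    return windows


# Made-up samples, one per minute, purely for illustration.
t0 = datetime(2000, 1, 1, 9, 0)
demo = [(t0 + timedelta(minutes=i), ok)
        for i, ok in enumerate([True, False, False, True, False, False])]
demo_windows = outage_windows(demo)
```

Comparing the resulting windows across several days would show whether the outages line up with anything scheduled on the network side.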

At the moment, we're working with our network group, but they're as stumped as we are. My group thinks the problem is with the router, but since this problem is so weird, the network folks understandably aren't convinced. They see nothing weird on their side, so they're reluctant to swap out hardware.

raymondap
Contributor

I wanted to give an update just in case someone else runs into this. The problem ended up being the MTU setting on our network provider's end for this site. After discovering this, I was able to work around the issue by manually setting the MTU to 1458 for the Ethernet adapters on the clients at this location. Once the network provider updated their config, I set the MTU back to automatic and we haven't had issues since.
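For anyone who suspects the same thing, a rough way to probe the usable MTU toward a server is to send pings with the Don't Fragment bit set and search for the largest size that gets through. The sketch below is not the method we used to find the problem. It assumes the macOS ping flags (`-D` for Don't Fragment, `-s` for payload size, `-c` for count), IPv4 with no IP options, and a search range of 576 to 1500; the hostname in the usage note is a placeholder.

```python
import subprocess

# 20-byte IPv4 header (no options) + 8-byte ICMP header
IPV4_ICMP_OVERHEAD = 28


def ping_df(host, payload_bytes):
    """Send one ping with Don't Fragment set (macOS flags); True if a reply came back."""
    cmd = ["ping", "-D", "-c", "1", "-s", str(payload_bytes), host]
    try:
        return subprocess.run(cmd, capture_output=True, timeout=5).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def find_path_mtu(fits, low=576, high=1500):
    """Binary-search the largest MTU in [low, high] for which fits(mtu) is True.

    Assumes that if a given size fits, every smaller size fits too.
    Returns None if even `low` does not get through.
    """
    if not fits(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Usage would look like `find_path_mtu(lambda mtu: ping_df("dc01.example.com", mtu - IPV4_ICMP_OVERHEAD))`. A result below 1500 only hints at an MTU problem, since ICMP can be filtered along the path. On macOS, `networksetup -setMTU` is one way to apply a manual value like the 1458 mentioned above; the device or port name depends on the machine.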