JAMF recon and inventory update failing - Unknown Error. Listed Services over 15000

Qwheel
Contributor II

Hello all,

I currently have a support call open with JAMF about this and thought I'd put the feelers out to see if anyone else has encountered this problem.

Essentially, after a period of time, part of our iMac fleet started falling out of touch with the JSS. I noticed that 'Inventory update' and the 'JAMF recon' commands were failing with an 'unknown error'. Further inspection of affected devices shows that there's been some recursive duplication of 'services', and the device then breaks its connection with the server.
To give you an idea, normally a device has around 300-400 services listed. I have devices with between 15,000 and 20,000 at their last point of communication.

Curiously, running 'launchctl list' on an affected device doesn't show the large list found on the JSS (maybe this isn't related). I haven't been able to catch a device as it's spiralling out of control.

Anyone encountered this issue before?
So far I've only witnessed it on Big Sur 11.4.0 - 11.6.1.

Currently have a metric butt-tonne of unique services like this:
com.apple.loginwindow.1902C8C2-2135-440E-9420-CB77D9773FB5 - etc
com.apple.security.agent.login.00000000-0000-0000-0000-0000000186A6 - etc

12 REPLIES

kissland
New Contributor III

Same issue here, amongst others; recon is failing to work...

jbisgett
Contributor II

@Qwheel I just noticed this issue. It seems to be limited to a few devices at the moment, but exactly the symptoms you describe. Were you able to get a resolution? Is there a way to restore communication without resetting the device?

Qwheel
Contributor II

Not sure it's entirely fixed.

It seems that there may still be an underlying issue as to why the services are building up. JAMF support alluded to the cause being policy or profile related, but it's difficult to determine which one is causing the problem.

As an observation, it appears the JSS will accept a recon containing thousands of listed services, but once the total is above a certain amount, it then rejects any further submissions from the client... probably as a protection mechanism.

The initial fix was to locate the device and click 'remove MDM' from the management tab. When complete, delete the device from the JSS and then re-enroll by hand with UIE (user-initiated enrollment).

You can get a list of running services by entering 'sudo launchctl list' into terminal.
Also, when you restart the device, the list is refreshed and the items reset. (Hence why I suggest the JSS rejects the device after receiving 2000+ listed services.)

As the list clears after a restart, I've configured inventory updates and recons to go out ASAP per device. 

We have apps that run a quick recon after installation because the JSS needs to know the application was installed so it can update smart groups and apply a licensing script for users who are logging in. To prevent a device from borking its entry in the JSS, I've added an if statement to our basic 'sudo jamf recon' script.

#!/bin/bash
# Only run a recon if the count of launchd services looks sane.
# Getting count of services
count=$(sudo launchctl list | wc -l)
# Removing white space from the variable (wc pads its output with spaces)
count=$(echo $count | sed 's/ *$//g')
echo "Checking count of launch services. Device has $count running."
ref="1000"
# Checking count against reference
if [[ $count -le $ref ]]
then
    echo "$count is less than 1000. Running recon."
    sudo jamf recon
    exit 0
else
    echo "$count exceeds 1000. Skipping recon. Will attempt the following morning."
    exit 1
fi

Using ARD, I'm running a check periodically to get a general count of services on machines. I'm hoping to catch a device in the act so I can prod it a little. I tried going through the various JSS maintenance payloads, but they haven't cleared the services away (ByHost files/flush System/UserCache/Verify disk). The only thing that definitely clears them is a system restart.
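
For anyone wanting to do the same, below is a minimal sketch of the sort of check you could push out via ARD's Send UNIX Command. It's illustrative only: the 1000 threshold is just borrowed from the recon guard above, and bear in mind that earlier in the thread the local launchctl list looked normal even when the JSS record was flooded, so a low local count isn't conclusive.

#!/bin/bash
# Sketch of a periodic health check (e.g. sent via ARD's Send UNIX Command).
# Counts everything launchd has registered, plus the two service types that
# have been seen duplicating in inventory in this thread.
total=$(launchctl list | wc -l | tr -d ' ')
suspect=$(launchctl list | grep -c -E 'com\.apple\.(loginwindow|security\.agent\.login)\.')
echo "$(hostname): ${total} services registered, ${suspect} loginwindow/security.agent instances"
if [ "${total}" -gt 1000 ]; then
    echo "WARNING: service count is high enough that the next recon may be rejected"
fi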

I've noticed devices that are having the issue show the 'Optimising your mac' notification when logging in.

If you have a large amount of affected devices, the cloud team can schedule a SQL query to empty table entries from affected devices. They'll then start playing along again.

You'd think it would be good if they put such checks into the recon command itself, but I guess you do want to know when something is broken 😄

jbisgett
Contributor II

Thanks @Qwheel, that's great info.

In looking at my devices with the issue, it appears to be headless servers running Jamf Connect, the duplicating services being com.apple.loginwindow and com.apple.security.agent.login.

I've opened a ticket with support to see if we can see what is happening. 

@jbisgett Just curious if you ever sorted this out? I'm seeing exactly this issue on a few stations (out of 500+). I see approximately 7,800 'com.apple.loginwindow' and 'com.apple.security.agent.login' instances showing in the Jamf Services records (but only one of each locally via launchctl list). Flushing caches and rebooting makes no difference; recon repeatedly fails.
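
If it helps anyone compare what the server has recorded against what the Mac reports locally, here is a rough sketch using the Classic API. Treat it as an assumption-laden illustration: the URL, account and computer ID are placeholders, basic auth is used for brevity (newer Jamf Pro versions prefer bearer tokens), and it assumes the computer record's Software subset still exposes a running_services node - check that against your Jamf Pro version.

#!/bin/bash
# Illustrative only - placeholders throughout, and the running_services node
# under the Software subset is an assumption to verify for your Jamf Pro version.
JSS_URL="https://your.jamf.example.com"   # placeholder
API_AUTH="apiuser:apipass"                # placeholder; prefer a bearer token in practice
COMPUTER_ID="123"                         # placeholder
server_count=$(curl -s -u "${API_AUTH}" -H "Accept: text/xml" \
    "${JSS_URL}/JSSResource/computers/id/${COMPUTER_ID}/subset/Software" \
    | xmllint --xpath 'count(//running_services/*)' - 2>/dev/null)
local_count=$(launchctl list | wc -l | tr -d ' ')
echo "Jamf record lists ${server_count} services; launchctl sees ${local_count} locally."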

jbisgett
Contributor II

I had to open a ticket with support, who provided an SQL command that I could run on our local database server to clear the services list for the affected devices; they would then update inventory again. We did not identify a cause of the issue, although I believe they were going to investigate it internally. Since the machines generating the errors were always-on servers that mostly just sit at the login screen, I implemented a nightly reboot policy to keep the services from climbing above the ~1,500 limit.

Cool thanks!

Qwheel (Accepted Solution)
Contributor II

Further to @jbisgett, I did a similar thing.
Most lab devices should be getting startup/shutdown scheduling, but if for whatever reason a device is still awake at 1am, it'll continue to be awake the following day too.
I run this on all lab devices daily with client-side limitations, so if a machine is awake in the middle of the night, it'll run the policy.
I run a restart because I questioned whether a shutdown would start the machine back up the next day (in my own mind, I imagine a manual shutdown ignores the start-up scheduling).
The delayed shutdown command is there to allow the script to finish and submit an exit code 1 before restarting. I used exit 1 so I could see at a glance in the dashboard which machines were awake, and as you can sort the logs by date, you can see whether it's the same old machines staying awake.

I also did a massive overhaul of config profiles.
I created a slew of 'Devices with X application' smart groups and assigned the relevant config profiles that way. I then went to the related app install policy and threw in a 'jamf policy -trigger recon' one-liner, so after installation the profile comes down.
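
To spell out that flow with hypothetical names (none of these are from the thread, so adjust to your own naming):

1. Policy 'Install AppX' deploys the package, then runs the recon one-liner (e.g. from the Files and Processes payload).
2. Smart group 'Devices with AppX' uses a criterion along the lines of Application Title | is | AppX.app, so the fresh inventory pulls the machine into the group.
3. The config profile for AppX is scoped to that smart group, so it only lands after the app is installed.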

Another 'in my own mind' thing: KEXTs etc. need to be loaded in advance of application installation, so those still go out ahead of time to devices that 'could' install the affected apps if they choose to.

I haven't seen the issue in some time, so I think a mixture of both things has resolved the problem - unless somebody stealth-fixed it elsewhere.

#!/bin/bash
echo "Restarting device if still awake at silly o'clock in the morning..."
echo "Restarting..."
# Schedule the restart four minutes out and background it, so the policy can
# finish logging and submit its exit code before the machine goes down.
sudo shutdown -r +4 &
echo "exit 1 to determine if there is a repeat offender..."
exit 1



@Qwheel Thanks for the tips. I already do something similar, and it's always cool to see what others are doing!

I think I've narrowed my issue down to just a few stations that had likely crashed at the login window or had an improper shutdown, which in turn triggered the flood of 'duplicate' services being reported. Once this hit the magic number of (>1500?) services, it eventually broke future recons. At a minimum we reboot bi-weekly, so this illustrates just how quickly the data collection can get bogged down.

As all other aspects of the Jamf binary are working fine, including daily recon attempts that simply weren't reporting this failure, I implemented an alert to report when this occurs (probably should have done this sooner, but ehh, who knew).

Realistically I think this is a problem the data collection process should address (not us) and I've submitted a ticket.

jp2019
New Contributor III

Thanks for the share. Can you share the alert you implemented that reports when this occurs?

Thanks in advance.

No problem, we recon at least once a day, so this works well for lab computers that are 'supposed' to be on 24/7. Just create a smart group that sends an email notification on membership change. Also, FWIW, after submitting a support ticket, Jamf performed maintenance on our Jamf Cloud DB, which resolved the issue.

(Screenshot attached in the original post: Screen Shot 2022-06-28 at 12.54.39 PM.png)
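
If the screenshot doesn't come through, a group along these lines would do the same job (the name and threshold are illustrative and may differ from what's shown above), given the once-a-day recon cadence:

Smart group: 'No inventory in 24+ hours' (hypothetical name)
    Criteria: Last Inventory Update | more than x days ago | 1
    Group settings: 'Send email notification on membership change' enabled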

Qwheel
Contributor II

We don't use Jamf Connect over here, so perhaps it isn't related to that?
We still use NoMAD/NoLOAD for shared device login.