Restarting the Jamf binary when Macs stop checking in

jamesandre
Contributor

Howdy... I've detected an issue where the Jamf binary stops checking in, sometimes for hours, days, weeks or even months. It shows up as Macs whose last Inventory Update was "x days" ago. It appears that the Jamf binary begins its check-in process but never completes it. Because that process is still running, the binary won't start any further check-ins until the original one has finished. If you attempt a manual check-in, you will get an error similar to:

This policy trigger is already being run: root 88591 0.0 0.0 34245156 1048 ?? Ss 21Jul22 0:03.85 /usr/local/jamf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300

I suspect this is caused by a network interruption while a policy is running, or by a script within a policy that cannot complete (the softwareupdated process has been hanging on some versions of macOS Big Sur). There does not appear to be a time-out for the Jamf binary.

Neither CasperCheck nor the new Jamf-Management-Framework-Redeploy API function resolves this issue. Killing the Jamf binary does resolve it, and the Mac can then check in with the Jamf server again. Since getting access to a Mac that isn't checking in can be difficult, I have made something I am calling "Jamf Restart", which borrows ideas and code from CasperCheck and AppProcessKiller.

Jamf Restart consists of:

  • A LaunchDaemon (/Library/LaunchDaemons/com.example.jamfRestart.plist) that runs the script once a day.
  • A script (/Library/Scripts/jamfRestart.sh) that checks whether the Jamf binary's process has been running for 1 day or more.
  • A log file (/var/log/jamfRestart.log) that captures the output of the script.
  • An Extension Attribute (Jamf Restart) that reads the log and displays whether the Jamf binary has been killed.

The LaunchDaemon and script can be packaged and deployed via Jamf (a lot easier to do when all Macs are checking in). You can load the LaunchDaemon with the command:

/bin/launchctl load /Library/LaunchDaemons/com.example.jamfRestart.plist

Once Jamf Restart has been deployed to a Mac, it checks whether the Jamf binary's process has been running for 1 day or more and, if so, kills it. After the Jamf binary has been killed, the next scheduled check-in will run correctly. Any policy run from Jamf can reasonably be expected to complete within 1 day, so killing a process that has been running that long will not interrupt a policy that had any realistic chance of completing.
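The Jamf Restart script treats a "-" in ps's elapsed-time (etime) column as the 1-day signal, because ps prints etime as [[dd-]hh:]mm:ss. If you ever want a threshold other than whole days (say 12 hours), one option is to convert etime to seconds first. This is a minimal sketch; the function name etimeToSeconds is hypothetical and is not part of Jamf Restart:

```bash
#!/bin/bash

# Hypothetical helper: convert a ps etime value ([[dd-]hh:]mm:ss) to seconds.
etimeToSeconds() {
    # ps pads etime with spaces, so strip all whitespace first
    local etime="${1//[[:space:]]/}"
    local days=0 hours=0 minutes=0 seconds=0
    local parts

    # Split off the optional "dd-" day prefix
    if [[ "$etime" == *-* ]]; then
        days="${etime%%-*}"
        etime="${etime#*-}"
    fi

    # Split the remainder on ":" into hh:mm:ss or mm:ss
    IFS=: read -r -a parts <<< "$etime"
    case ${#parts[@]} in
        3) hours=${parts[0]}; minutes=${parts[1]}; seconds=${parts[2]} ;;
        2) minutes=${parts[0]}; seconds=${parts[1]} ;;
        *) return 1 ;;
    esac

    # 10# forces base 10 so values like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

# Example usage (12 hours is only an illustrative threshold):
#   runtime=$(etimeToSeconds "$(ps -o etime= -p "$pid")")
#   [ "$runtime" -ge $(( 12 * 3600 )) ] && echo "running for 12 hours or more"
```

The day-count check in the original script is simpler and is enough for a 1-day threshold; converting to seconds only pays off if you want something finer.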

Jamf Restart will not fix the "Device Signature Error", which stops the Jamf binary from running; I have been testing the Jamf-Management-Framework-Redeploy API function for that.

Jamf Restart should ensure Macs keep checking into Jamf, and will allow you to identify which Macs have had issues checking in, so they can be investigated further.

The LaunchDaemon:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>com.example.jamfrestart</string>
	<key>ProgramArguments</key>
	<array>
		<string>/bin/bash</string>
		<string>/Library/Scripts/jamfRestart.sh</string>
	</array>
	<key>RunAtLoad</key>
	<false/>
	<key>StartInterval</key>
	<integer>86400</integer>
</dict>
</plist>

The Script:

#!/bin/bash

# ps prints elapsed time (etime) as [[dd-]hh:]mm:ss, so a "-" in that column means
# the process has been running for 1 day or more. Only those processes are matched.
processRuntime=$(ps -ax -o user,pid,etime,args | grep "/usr/local/[j]amf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300" | awk '$3 ~ /-/ { split($3, t, "-"); print t[1]; exit }')
processCheck=$(ps -ax -o user,pid,etime,args | grep "/usr/local/[j]amf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300" | awk '$3 ~ /-/ { print $2; exit }')

logLocation="/var/log/jamfRestart.log"
scriptLogging(){

    DATE=$(date +%Y-%m-%d\ %H:%M:%S)
    LOG="$logLocation"

    echo "$DATE" " $1" >> "$LOG"
}

if [ "${processRuntime}" = "" ]; then
    scriptLogging "JamfBinary has not run for more than 1 day"
else
    scriptLogging "JamfBinary has run for ${processRuntime} days"
    scriptLogging "JamfBinary Process ID: ${processCheck}"
    scriptLogging "Quitting JamfBinary..."
    # Runs as root from the LaunchDaemon, so sudo is not needed
    kill -9 "${processCheck}"
fi

exit 0

The Extension Attribute:

#!/bin/bash

jamfRestart=`/usr/bin/tail -10 /var/log/jamfRestart.log | grep "JamfBinary has run for"`

if [ "$jamfRestart" != "" ]; then
   echo "<result>$jamfRestart</result>"
else
   echo "<result>No Restarts</result>"
fi

Hopefully this is of use to someone.

AVmcclint
Honored Contributor

I've been experiencing this for at least 5 months.  I can't wait to give this a try!

AJPinto
Honored Contributor II

I may be totally glossing over something but, why are we not just rebooting the devices regularly? Set a FV auth reboot policy to run daily or weekly if the devices are unattended. Or make an EA to read up time and reboot devices that have not rebooted in XYZ days. It just seems like a lot of hoops to just bounce daemons and agents which restart when you reboot a Mac.

AVmcclint
Honored Contributor

The problem is that if the jamf process is stuck, then it won't ever check in to run any policies. Currently I have been reaching out to affected users (sometimes over 100 of them) and politely asking them to restart their computers. I only get about a 20% response rate with that. I like what @jamesandre has done here because it makes the computer fix itself. If we see repeated instances of it happening on a computer, then we can focus our support efforts only on those.

AJPinto
Honored Contributor II

Seems like a long way around to avoid forcing Macs to reboot preemptively on a schedule, before any problems occur. I absolutely see the usefulness of something like this as a fallback, but the main solution should simply be: don't let the Macs get an uptime of more than a week or so.

On a side note, if SSH is enabled and you have a local admin account on the Macs that you have access to, you can just SSH into the devices to do whatever you need to. That may be helpful for the users who don't respond: just run sudo shutdown -r now on their devices. Not the nicest of solutions, but it's also not nice to ignore emails.

AVmcclint
Honored Contributor

The SSH solution is blocked by the fact that many users aren't on the local network. They are at home behind their NAT'd home routers. And if they are on the company LAN, they aren't checking in with Jamf to update their IP address. Getting users to restart their Macs is a never-ending battle. It doesn't help that the jamf process gets stuck from time to time. Implementing this will not be as simple as pushing the policy out to all computers. We will still need to reach out to users who are already in this non-check-in state and get them to restart. But once that's done and they have the jamfRestart LaunchDaemon and script, my Forced Reboot policy can play a more active role in keeping all the systems functioning properly.

AVmcclint
Honored Contributor

I can say that most of the computers I am discovering this problem on have uptimes in excess of 30 days - some as high as 100 or more. I think the lowest uptime I've seen this affect was about a week.  I am proposing to my management that I implement a forced reboot policy that will tell (not ask) the user that the computer will restart in 2 hours when it is detected that their uptime is greater than [TBD] days. The weak link in that is if the jamf process is stuck, then that policy will never run. Automating the restart of the jamf process is a great idea that I wish Jamf would implement on their own.

SCCM
Contributor III

Why not just create an uptime script that runs locally? If the device has been up for x number of days, force a restart (which should fix your issue).
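The detection half of that suggestion can be sketched in a few lines. This assumes macOS, where sysctl -n kern.boottime prints something like { sec = 1690000000, usec = 123456 } Sat Jul 22 04:26:40 2023. The function name uptimeDays is hypothetical, and the restart itself (and the threshold) is left out:

```bash
#!/bin/bash

# Hypothetical helper: whole days since boot.
#   $1 = output of `sysctl -n kern.boottime` (macOS)
#   $2 = current time in epoch seconds (defaults to now)
uptimeDays() {
    local boottime="$1"
    local now="${2:-$(date +%s)}"
    local bootSec

    # Pull N out of "{ sec = N, usec = M } ..."
    bootSec=$(printf '%s\n' "$boottime" | sed -n 's/.*{ sec = \([0-9][0-9]*\),.*/\1/p')
    [ -n "$bootSec" ] || return 1

    echo $(( (now - bootSec) / 86400 ))
}

# Example usage on a Mac (the threshold is up to you):
#   maxDays=30
#   if [ "$(uptimeDays "$(sysctl -n kern.boottime)")" -ge "$maxDays" ]; then
#       echo "Uptime is ${maxDays} days or more"
#   fi
```

Keeping the threshold in a variable (or a Jamf script parameter) is one way to address the concern that it may change from month to month.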

AVmcclint
Honored Contributor

I've considered that as well, but the threshold may not be etched in stone. This month, management wants to limit it to 30 days, next month they may change their minds and want to limit it to 14 days. It's a lot easier to change the script variable than it is to push out a new  script. And there's no easy way to know which version of the script a Mac may have on it at any given moment.  

scottb
Honored Contributor

This is a bigger problem with "clients" not enforcing some sort of reboot/update schedule with hard dates.  I have Macs that have not rebooted for MONTHS!  Can we get client reps to let us do something?  No.  So then I just say "it's broken until you let us manage these Macs with common sense or you learn your peeps on proper computer use."

Nobody has the sack to say OK to any of it.  

jamesandre
Contributor

Most of the Macs won't be reachable via SSH. I'm not going to regularly check a Smart Group for Macs that haven't checked in, then try and connect to them via SSH, then kill the Jamf binary. That sounds like too much work, I'm going to automate it.

Forcing everyone to restart after an arbitrary amount of days is going to create an increase in the number of calls to the Helpdesk. I do not want people contacting the Helpdesk because they have a deadline or presentation and they want to cancel a forced restart, that's not fair on the Helpdesk or person using the Mac and I want to ensure a good relationship between the two.

The Mac is working fine, the binary is the part that is causing the issue for me (and not the person using the Mac). I'm also not going to force everyone to restart when it may only be 10% of Macs affected, I want to keep people happy.

By implementing this method rather than waiting to restart every 7 or 30 days, I can ensure the Jamf binary checks in at least every day. If I have a patch for a zero-day vulnerability, then I want the Macs to be checking in at least every day. As long as security updates are getting applied, I do not have an issue with long uptimes. It also gives me visibility over which Macs are experiencing an issue with the Jamf binary.

jamesandre
Contributor

And if the Jamf binary would just time out when a process runs too long, that would be great. Maybe it could self-heal when it gets a Device Signature Error too. Then I could get on with managing Macs, rather than managing the thing that manages the Macs. 🤠

AVmcclint
Honored Contributor

@jamesandre I was going over the Extension Attribute code. Shouldn't the tail command look at the very last line instead of the last 10 lines?

tail -1 

instead of

tail -10

jamesandre
Contributor

I'm looking for the log entry "JamfBinary has run for X days" which won't be the last line in the log file. It should be in the last 10 lines of the log file depending on how often you run an Inventory Update. You could increase it if you want a better idea of history. 
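If you want more history than the last 10 lines, one option is to count every kill in the whole log instead of tailing it. Below is a sketch of an alternative Extension Attribute, assuming the log format written by the Jamf Restart script. The function name jamfRestartSummary and the "No Log" result are hypothetical:

```bash
#!/bin/bash

# Hypothetical alternative EA: report how many times Jamf Restart has killed
# the binary, plus the most recent kill, by reading the whole log.
jamfRestartSummary() {
    local log="${1:-/var/log/jamfRestart.log}"
    local count last

    if [ ! -f "$log" ]; then
        echo "<result>No Log</result>"
        return 0
    fi

    # grep -c prints 0 (and exits non-zero) when nothing matches
    count=$(grep -c "JamfBinary has run for" "$log")
    if [ "$count" -eq 0 ]; then
        echo "<result>No Restarts</result>"
    else
        last=$(grep "JamfBinary has run for" "$log" | tail -1)
        echo "<result>${count} restart(s), last: ${last}</result>"
    fi
}

jamfRestartSummary
```

A Smart Group using "like" "JamfBinary has run for" would still match, because the last entry is included in the result.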

I also have a Smart Group with the criterion Jamf Restart | like | "JamfBinary has run for" on the Extension Attribute.

martindulguerov
New Contributor

Thanks for sharing, @jamesandre !

I'm just testing it in our environment, and will deploy it wider soon.

Just a heads up that there's a typo in your post:

This line:

A log file (/Library/Scripts/jamfRestart.sh) that captures that output of the script.

should be:

A log file (/var/log/jamfRestart.log) that captures that output of the script.

thanks again!

Thank you! I read it over a million times and never spotted it. 😿

christian_ortiz
New Contributor

This seems like a great idea
@jamesandre 
I'm trying to wrap my head around how this will all be doable in my environment. 
From the looks of it, we'll have to use Jamf Policy to push the launchdaemon using:
/bin/launchctl load /Library/LaunchDaemons/com.example.jamfRestart.plist

But, the problem is that anyone who's currently affected by a stuck jamf policy won't get this new policy. 
In terms of deployment, is there something simpler I'm not seeing here?

You are correct. That's the Catch-22 situation. To work around this, I've been reaching out to Macs that I can identify as not having checked in for a while. I let the users know there's a technical glitch on their Mac that needs to be addressed by a restart. Once they restart, then the policy runs to install the LaunchDaemon. So far it seems to work ok. 

Yeah, not the easiest. But generally a restart will allow the policy to install. I also set the policy to install on "Network State Change", as this seems to work as a separate process.

cc_rider
New Contributor III

hi @jamesandre,

"Jamf Restart will not fix the "Device Signature Error” which stops the Jamf binary from running, I have been testing the Jamf-Management-Framework-Redeploy API function for that."

Did the API function fix the error?

As long as MDM commands are still working on the Mac, then yes it will repair the jamf Binary and Device Signature Error. I'm using the Jamf Heal Script for this.

RobinJJ
New Contributor III

I added a signed configuration profile, created in iMazing Profile Editor, so that users can't disable the LaunchDaemon. Thought I'd share.

Open iMazing: Search for "Service Management - Managed Login Items" 

Add label: com.example.jamfrestart

Save profile and sign it.

Upload to jamf and deploy. 

This makes sure users can't disable it.

MatG
Contributor III

@jamesandre 
is there a way to test this out on a Mac that is working fine?

In the script, you could swap the Jamf binary's command line for that of an app you've had open for more than a day; for example, replace

"/usr/local/[j]amf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300"

with

"/Applications/Microsoft Outlook.app/Contents/MacOS/Microsoft Outlook"

Then run the script. Bear in mind that it will force-quit (kill -9) Outlook if it matches.
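Another way to test the matching logic without killing anything is to feed it canned ps output. This is a minimal sketch; findLongRunning is a hypothetical helper that mirrors the script's approach (match the command line, then require a day count in etime):

```bash
#!/bin/bash

# Hypothetical helper: read ps-style lines (user pid etime args...) on stdin and
# print "PID DAYS" for each line that contains the given text and whose etime
# includes a "dd-" day count.
findLongRunning() {
    awk -v pat="$1" 'index($0, pat) && $3 ~ /-/ { split($3, t, "-"); print $2, t[1] }'
}

# Example usage against the live process list (prints only, kills nothing):
#   ps -ax -o user,pid,etime,args | findLongRunning "/Applications/Microsoft Outlook.app/Contents/MacOS/Microsoft Outlook"
```

Because it only prints, you can point it at any long-running app to confirm the day-count logic before trusting it with kill -9.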

cc_rider
New Contributor III

Did anyone test this solution somehow?

@jamesandre, also, in your LaunchDaemon the RunAtLoad key is set to false... Wasn't it supposed to be true?

<key>RunAtLoad</key>
	<false/>
	<key>StartInterval</key>

And the EA, what is it for? Is it just a sort of sanity check, to see if the whole solution worked?

You can set RunAtLoad to true if you want.

The EA can be used to identify which Macs have had the Jamf binary running for more than 1 day. Use a Smart Group with Jamf Restart | like | JamfBinary has run for, and then you can investigate why it is getting stuck... probably softwareupdated.

itinspectorio
New Contributor II

I have a question, if you please. Do we upload the plist via a Profile and create a policy with


/bin/launchctl load /Library/LaunchDaemons/com.example.jamfRestart.plist


But where do we put the script? Also in this policy?

You install the script (say via a package) to the Mac in this location:

/Library/Scripts/jamfRestart.sh

It has to run locally on the Mac, as the Jamf binary may be stuck and not checking in.

I mashed together a single script that should create the files in the proper locations and start the LaunchDaemon. This makes it easier to install via a Jamf policy for newbies like me. You still need to set up the computer Extension Attribute for the reporting.

#!/bin/bash

cat << 'EOF' > /Library/Scripts/jamfRestart.sh
#!/bin/bash

# ps prints elapsed time (etime) as [[dd-]hh:]mm:ss, so a "-" in that column means
# the process has been running for 1 day or more. Only those processes are matched.
processRuntime=$(ps -ax -o user,pid,etime,args | grep "/usr/local/[j]amf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300" | awk '$3 ~ /-/ { split($3, t, "-"); print t[1]; exit }')
processCheck=$(ps -ax -o user,pid,etime,args | grep "/usr/local/[j]amf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300" | awk '$3 ~ /-/ { print $2; exit }')

logLocation="/var/log/jamfRestart.log"
scriptLogging(){

    DATE=$(date +%Y-%m-%d\ %H:%M:%S)
    LOG="$logLocation"

    echo "$DATE" " $1" >> "$LOG"
}

if [ "${processRuntime}" = "" ]; then
    scriptLogging "JamfBinary has not run for more than 1 day"
else
    scriptLogging "JamfBinary has run for ${processRuntime} days"
    scriptLogging "JamfBinary Process ID: ${processCheck}"
    scriptLogging "Quitting JamfBinary..."
    # Runs as root from the LaunchDaemon, so sudo is not needed
    kill -9 "${processCheck}"
fi

exit 0
EOF

chmod 644 /Library/Scripts/jamfRestart.sh
chown root:wheel /Library/Scripts/jamfRestart.sh

cat << EOF > /Library/LaunchDaemons/com.example.jamfRestart.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>com.example.jamfrestart</string>
	<key>ProgramArguments</key>
	<array>
		<string>/bin/bash</string>
		<string>/Library/Scripts/jamfRestart.sh</string>
	</array>
	<key>RunAtLoad</key>
	<false/>
	<key>StartInterval</key>
	<integer>86400</integer>
</dict>
</plist>
EOF

chmod 644 /Library/LaunchDaemons/com.example.jamfRestart.plist
chown root:wheel /Library/LaunchDaemons/com.example.jamfRestart.plist

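# Unload any copy that is already loaded (e.g. if this policy runs twice), then load it
/bin/launchctl unload /Library/LaunchDaemons/com.example.jamfRestart.plist 2>/dev/null
/bin/launchctl load /Library/LaunchDaemons/com.example.jamfRestart.plist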

MatG
Contributor III

@MCfreiz Nice...just out of interest how can this be tested?

connorb
New Contributor

Hi @jamesandre, do you think it's necessary at all to call a jamf recon and/or jamf policy once the binary has been killed?

I'm thinking of implementing this into my organization but just want to be sure there isn't anything additional I should throw in to get the devices talking back immediately.

You can add that if you want, should not cause any harm.
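A sketch of what that could look like as an extra step after the kill -9 in the script. It uses the Jamf binary's recon and policy verbs; the function name checkInAgain and the JAMF_BIN override (added so the step can be dry-run) are hypothetical. One caveat worth weighing: these calls run inside the LaunchDaemon's job, and launchd will not start a second copy of the job while the first is still running, so if the new policy run hangs too, Jamf Restart will not run again until it finishes.

```bash
#!/bin/bash

# Path to the Jamf binary; override JAMF_BIN (e.g. JAMF_BIN=echo) for a dry run.
JAMF_BIN="${JAMF_BIN:-/usr/local/jamf/bin/jamf}"

# Hypothetical follow-up: once the stuck process is gone, submit inventory and
# check for policies straight away instead of waiting for the next check-in.
checkInAgain() {
    "$JAMF_BIN" recon
    "$JAMF_BIN" policy
}
```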

chelm
New Contributor III

@jamesandre Just curious why you are choosing to look for:
/usr/local/jamf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300

instead of just

/usr/local/jamf/bin/jamf policy

I get this message if I try to interrupt:
/usr/local/jamf/bin/jamf policy -event CLIENT_CHECKIN -stopConsoleLogs

Would it be better to look for just "jamf policy" so you catch more? Is there a downside? What if the binary gets hung during an inventory cycle? Should we also be looking for other actions like recon?

Testing my memory here... I think "/usr/local/jamf/bin/jamf policy -stopConsoleLogs -randomDelaySeconds 300" was what I was seeing every time we had an issue. 

Looks like it has changed to "/usr/local/jamf/bin/jamf policy -stopConsoleLogs -runOnQueue -randomDelaySeconds 300", so looking for "/usr/local/jamf/bin/jamf policy" might be a better option now.
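Matching on the binary path and verb, rather than the full command line, would catch both variants, and recon as well, as @chelm asked. This sketch reads ps output on stdin. The function name findStuckJamfActions is hypothetical, and whether you also want to kill a long-running recon is a judgment call for your environment:

```bash
#!/bin/bash

# Hypothetical helper: read `ps -ax -o user,pid,etime,args` output on stdin and
# print "PID VERB ETIME" for every Jamf binary process running "policy" or
# "recon" (with any flags) whose elapsed time includes a day count.
findStuckJamfActions() {
    awk '$4 == "/usr/local/jamf/bin/jamf" && ($5 == "policy" || $5 == "recon") && $3 ~ /-/ {
        print $2, $5, $3
    }'
}

# Example usage:
#   ps -ax -o user,pid,etime,args | findStuckJamfActions
```

Comparing whole fields means /usr/local/jamf/bin/jamfAgent and the ps header line are never matched, which a plain substring grep on "jamf" would not guarantee.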

It doesn't seem to be such a big issue now, I'm not seeing issues with softwareupdated anymore.

bsmithAP
New Contributor

@jamesandre newbie here, thank you for this. I have been searching for a solution to this problem for a while now. I am trying to implement your solution, as we have been running into this issue with about 30 Macs not checking in, and with the number of developers we have, scheduled restarts are not a viable option.

During testing I am failing to get the LaunchDaemon to load. When using the command sudo /bin/launchctl load /Library/LaunchDaemons/com.example.jamfRestart.plist I am getting the error: Load failed: 5: Input/output error. 

I have verified that the .plist file and script are in the right places. Any ideas on what could be causing this?

BoscoATX
New Contributor III

@jamesandre I'm seeing the same error as @bsmithAP : Load failed: 5: Input/output error 

Any solutions for this?

mm2270
Legendary Contributor III

I'm just coming across this, because I'm also seeing a fair number of devices not checking in, even though the Macs are confirmed to be online.

I noticed a typo in your script. You define the variable for the log location as logLocation, but then in the function you are echoing out to a variable labeled $LOG

echo "$DATE" " $1" >> $LOG

Other than that, good work on this. I plan on doing some testing with it to see if it improves our situation.

Never mind! I see now that you define LOG by assigning it to $logLocation. All good!

As for those saying just reboot your Macs: well, yeah, that is ideal, but it's much harder to enforce in some environments than you might think. Things just aren't as cut and dried as that, so for now, this might help us out. Thanks again for posting it @jamesandre

I hope it helps. I'm not able to edit the original post to update the script. I should have put it on GitHub... maybe one day.

itinspectorio
New Contributor II

I don't know why, but I am also getting the same error as @bsmithAP
Load failed: 5: Input/output error

itinspectorio
New Contributor II

Expecting a LaunchAgents path since the command was ran as user. Got LaunchDaemons instead.
`launchctl bootstrap` is a recommended alternative.
Load failed: 5: Input/output error
Try running `launchctl bootstrap` as root for richer errors.
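That error output points at one cause: launchctl was run as a normal user, so it expected a LaunchAgent. LaunchDaemons have to be loaded as root, into the system domain. The same "Load failed: 5" message can also appear when the job is already loaded. Below is a sketch of a reload using the bootout/bootstrap subcommands the message recommends. The function name reloadJamfRestart and the LAUNCHCTL override (for a dry run) are hypothetical; the label and path match the Jamf Restart LaunchDaemon:

```bash
#!/bin/bash

# Label and path from the Jamf Restart LaunchDaemon
LABEL="com.example.jamfrestart"
PLIST="/Library/LaunchDaemons/com.example.jamfRestart.plist"
# Override LAUNCHCTL (e.g. LAUNCHCTL=echo) for a dry run
LAUNCHCTL="${LAUNCHCTL:-/bin/launchctl}"

# Hypothetical helper: (re)load the LaunchDaemon into the system domain.
reloadJamfRestart() {
    # LaunchDaemons belong to the system domain, which needs root
    if [ "$(id -u)" -ne 0 ]; then
        echo "Run this as root (e.g. with sudo)" >&2
        return 1
    fi

    # Remove any copy that is already loaded; ignore the error if it isn't
    "$LAUNCHCTL" bootout "system/${LABEL}" 2>/dev/null
    "$LAUNCHCTL" bootstrap system "$PLIST"
}
```

Run it with sudo when testing by hand; a Jamf policy script already runs as root.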