Stuck on startup

Chuey
Contributor III

Hello all,

I'm still seeing some issues with 10.10.5 and 10.11.4 that won't fully load and get stuck on startup. It will load either halfway or even 95% but never past that. Sometimes I'll turn them off, let them go over night, come back and they work next day? Sometimes I'll boot to single user mode and run fsck -fy and reboot and it works. Just so hit and miss. These machines are bound to AD. Is anyone still seeing these issues in their environment?


russeller
Contributor III

@Rocky Thanks! I'm going to test this out. Some of our Macs are full to the brim with student accounts. It's nice that this will expire old, stale accounts while leaving the fresher ones on the Macs, instead of just wiping all students at startup.

SGill
Contributor III

@Rocky I think they are saying that you have to both clear out old mobilized AD accounts and de-mobilize new ones in the AD Bind settings (on desktops only) to see relief from the random startup failures. Many of the computers seeing the issue will show a "kauth" error at startup.

I'll be testing turning off Mobile accounts in the AD-plugin on desktops soon, but haven't yet. In my testing, it only becomes an issue on Macs with a large buildup of mobilized AD accounts.

allanp81
Valued Contributor

I've tested turning off mobile accounts, but unfortunately it seems to stop user-level config profiles from working. I've logged a ticket with Jamf about this but haven't heard anything back since sending screenshots of the issue.

The main question is still what exactly gets clogged when a load of mobile accounts have logged into a Mac. Clearing the accounts using the dscl command as above doesn't fix it, as we're routinely doing that anyway. If you run "sudo profiles -P" from Terminal you'll see that all of the config profiles are still there for every user that has used the machine, whether the account still exists or not. There's no simple way to clear these, but even when you do clear them it doesn't change anything.

What other caches could there be that need clearing or is it just something we'll never get to the bottom of without Apple's assistance? (fat chance of Apple helping as they'll just say upgrade to Sierra).

SGill
Contributor III

If upgrading to Sierra fixes this...I'm in! :)

amiller6
New Contributor

Upgrading to Sierra does not fix this issue. I hadn't seen the problem in a while, but just had to re-load a Sierra test machine to resolve the issue.

SGill
Contributor III

@amiller6 Were you able to retest with "Create Mobile Accounts" set to Off ?

amiller6
New Contributor

I just reloaded the machine and am using that configuration now. I see that has caused some issues for others though, so not sure how that will work. Unfortunately, the machine went a couple of weeks without showing the issue, so it may be some time before I would see the issue again.

allanp81
Valued Contributor

I think it's when you get around 100+ accounts. That seems to be when we start to see the issue. I could in theory make 100 temporary guest accounts and start going through the laborious process of logging in and out with each of them.

davidhiggs
Contributor III

when creating mobile accounts, i'm not sure where Apple keeps the user cached credentials. it's possible they're separated and might not be cleaned up when the dscl delete command is used, which sounds like what @allanp81 is experiencing.

to be sure, i would re-image the machine if possible. when i made this AD config change, i didn't find it necessary to delete/reset the machines in the JSS.

Kong
New Contributor

Hello, we have the same issues as described in this thread. We have re-imaged our affected Macs and rolled back the OS, and we have different models of Macs; after a few weeks of heavy usage, the same issues come back.

David Higgs was right: having the Create Mobile Accounts setting enabled is what causes this issue.
If you do not wish to rebuild/re-image your affected Macs, you can manually apply the fixes below, which I have applied and tested on all our affected Macs.

  1. Disable Create Mobile Accounts.

  2. Enable the root account.

  3. Log on as root and delete the cached AD user files below:

    In /var/db/dslocal/nodes/Default/groups, delete every com.apple.sharepoint.group.N.plist, where N is a number (one per user).
    (You can spot these easily as they are listed in numerical order.)

    In /var/db/dslocal/nodes/Default/sharepoints, delete every "USERNAME's Public Folder.plist" (your AD users' actual names will be listed here).

    In /var/db/dslocal/nodes/Default/users, delete the USERNAME.plist of each AD user.
    (Which names you see depends on how you have assigned them, e.g. our students are registered by the year they start, so 16000000 etc. Deleting an affected user's plist here also automatically removes them from the Users & Groups section of System Preferences.)

  4. Empty the Trash.

  5. Restart the Mac. The first boot will take a while as it needs to rebuild the databases.

We have a script that deletes users' local home folders and their folders in /Library/Managed Preferences. If you do not allow users to save work on the local Mac, you can also delete these manually if you wish, so everything is nice and clean.

If you get a Mac that refuses to boot, or takes ages to boot, even after resetting the SMC and PRAM, trying Safe Mode, etc., connect it to another Mac in Target Disk Mode. Make sure hidden/system files are visible so that you can see the affected files (I found a nice app on the web that does this). Then follow the steps above.

Hope this helps.
Frank

allanp81
Valued Contributor

@Kong I'm assuming if you regularly cleaned up those locations you could leave mobile accounts enabled?

Kong
New Contributor

Hi allanp81. If you leave Create Mobile Accounts enabled after my cleanup, you will get the same issues again after heavy AD usage, unless you know how to write a script that cleans up these hidden cached AD user credentials at startup, or maybe once a week. In our case we have no need for Create Mobile Accounts, as these are student Macs and most of them are in open-access areas. We had never had this option enabled over the last few years; it was enabled by mistake, and that's when we started to get the stuck-at-boot issues.

allanp81
Valued Contributor

@Kong we run a script on startup to clear the accounts so in theory I could just clear those locations at the same time

Chriskmpruitt
Contributor

@allanp81 do you mind sharing that script?

We have been using Rocky's script to delete accounts on startup. We are deleting all accounts that have not been modified in 5 days. This machine is still locking up with only 8 accounts on it.

I just manually deleted all the com.apple.sharepoint.group.N.plist files (N being a number) in /var/db/dslocal/nodes/default/groups, and my test machine has now booted 10 times out of 10 with no lockups.

Some of our machines go home with students so we need Managed mobile accounts turned on.

allanp81
Valued Contributor

Interestingly I have noticed that if you remove a mobile account via the gui, it removes those references under the /var/db/dslocal/nodes/Default location...

amosdeane
New Contributor III

Hi, can I just ask who uses Autodmg when building your base image? Has anyone had this issue when thin imaging?

SGill
Contributor III

Yes, I'm seeing it in thin deployments...no Autodmg in the loop.

Happens on busy lab macs with about 100 or more AD-mobilized accounts.

I think I'll be moving forward with turning off mobilization soon (via a Configuration Profile--the Mobilize choice is also in the Directory payload).

Still not sure why there seems to be an upper limit here other than local storage--it could be an Apple bug that began around 10.10.3. This one is difficult to recreate due to the conditions that must be present in order to see it.

PeterClarke
Contributor II

Yes, we do. Although I did wonder about that: we were seeing this issue, less often, before we started using AutoDMG-built images.

But previously we had an old-account removal script running, whereas presently we don't.
That script had a bug that caused some accounts not to be removed.
I was going to rewrite it without the bug. (The original version was copied from elsewhere.)
The 'bug', by the way, was technically a 'feature': the original script made use of the Unix mtime function, which actually works differently from the way everyone expects it to work!

I am thinking that we rarely saw this issue because the number of (mobile) accounts was mostly kept limited (though in busy areas, even with the account-culling script, we sometimes got to over 300 accounts on a library computer).

It seems to be happening more since we moved to OS X 10.11 (currently OS X 10.11.6).
We also saw it, more rarely, in OS X 10.10.x, and it almost never happened in OS X 10.9.x.
But that's just my observation.

Busy areas with lots of account churn - such as in library areas, seem most prone to this problem
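
For anyone wondering what the mtime surprise could be: the post above doesn't say which behaviour tripped up the old script, so this is only a sketch of two well-known ones, not a diagnosis. First, find's -mtime +N counts age in whole 24-hour periods, so +5 only matches things at least six full days old. Second, a folder's own modification time changes only when entries are added or removed directly inside it, not when a user saves files deeper in the tree. The home_state helper and the student1 folder are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): report whether a home folder
# LOOKS stale when judged only by the folder's own modification time.
home_state() {
    dir=$1
    days=$2
    # -maxdepth 0 tests the folder itself, not its contents.
    # -mtime +N is true when the age, in whole 24-hour periods, exceeds N.
    if [ -n "$(find "$dir" -maxdepth 0 -mtime +"$days")" ]; then
        echo stale
    else
        echo fresh
    fi
}

# Demo in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
mkdir -p "$demo/student1/Documents"
touch -t 202001010000 "$demo/student1"       # pretend the last change was in 2020

home_state "$demo/student1" 5                # stale

touch "$demo/student1/Documents/essay.txt"   # user saves work deeper in the tree
home_state "$demo/student1" 5                # still stale: only Documents changed

touch "$demo/student1/newfile"               # new entry directly in the home folder
home_state "$demo/student1" 5                # fresh
rm -rf "$demo"
```

If a culling script relies on the home folder's own timestamp, an account that is in daily use can therefore still look old, and the opposite can happen too.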

allanp81
Valued Contributor

We cleared out the groups and sharepoints directories on over 15 Macs today and it fixed all of them instantly. We've added it to our cleanup scripts, so we'll see what happens from this point. Fingers crossed.

I'm also going to use composer to do a snapshot to see what deleting a mobile account via the gui actually does.

Zeek
Contributor

I use the Autorun Data to re-image a computer with a different name, and for some reason every time I change the name it goes back to the old one. Any idea how to fix it?

I also went to the Sharing option on the device and changed the name, but when I run sudo jamf recon it changes it back to the previous name.

Rocky
New Contributor III

Incomplete solution deleted.

allanp81
Valued Contributor

It appears that this has now fixed our issue based on our testing today. It has fixed all machines with the issue.

We are just running the following script on each boot of a student machine:

#!/bin/bash
# Deletes every account except the ones named below, plus the sharepoint
# records that dscl leaves behind. Uses arrays, so it needs bash rather than sh.

UserList=$(ls /Users | grep -v "Shared" | grep -v ".localized")

Dansarray=( $UserList )
#printf "%s\n" "${Dansarray[@]}"

for u in "${Dansarray[@]}" ; do
    if [ "$u" = "administrator" ] || [ "$u" = "admin" ] || [ "$u" = "kingston" ] || [ "$u" = "Administrator" ] || [ "$u" = "arduser" ] ;
    then
        echo "$u -- detected skipping..."
    else
        echo "$u -- Deleting..."
        # Remove the directory record, then the home folder
        /usr/bin/dscl . delete "/Users/$u" && /bin/rm -rf "/Users/$u"
        # Remove the leftover sharepoint records
        find /private/var/db/dslocal/nodes/Default/sharepoints -name "*" -type f -delete
        find /private/var/db/dslocal/nodes/Default/groups -name "com.apple.sharepoint*" -type f -delete
    fi
done

This was written by a colleague about 18 months ago so I have just appended the 2 lines to remove the references to sharepoint.

amosdeane
New Contributor III

Interesting to see that it's occurring without Autodmg in thin images. We currently have a support case with Apple and they are insisting that we build an image without any third-party tools, to remove them from the equation. We are doing this, but it sounds like this is a red herring.

amiller6
New Contributor

Details are fuzzy, but I recall using internet restore on a machine and still seeing the issue. I don't believe this is an Autodmg issue(although that is what I'm using to create my base image).

allanp81
Valued Contributor

@amosdeane @amiller6 look above, it's essentially been solved by Kong and from my testing looks like a fix has been found. No need to disable mobile accounts etc, just do a proper cleanup of them and bob's your mother's husband's brother.

amosdeane
New Contributor III

Ok, that sounds very positive. I'm just slightly cautious as we've thought that we've fixed this one a few times and then it's come back! We're going to test this out. Thanks all for the suggestion!

allanp81
Valued Contributor

@amosdeane It is looking promising. Our main affected room has 47 macs in it and so far we've never had a morning where all of them would power on successfully.

This morning I watched using our custom availability tool and all 47 came on first time without any hitches so it really is looking good so far.

Usage will drop off in the run up to Christmas so I'm not going to call it properly until we reconvene after the break in January but so far every mac that wouldn't boot worked fine after clearing those obsolete plists.

I find it ridiculous that a buildup of just ~100 plists can stop a whole OS from booting, if this does turn out to be the fix.

amosdeane
New Contributor III

allanp81 that sounds encouraging. If we could finally crack this problem I feel like just starting the christmas festivities right away....

allanp81
Valued Contributor

I did a quick comparison of the file system before and after deleting a mobile account via the GUI, versus deleting an account using the dscl command. The main difference appears to be that deleting via the GUI removes the following (along with the /Users/account directory):

/private/var/db/dslocal/nodes/Default/groups/com.apple.sharepoint.group.1.plist (this increments for each new user)
/private/var/db/dslocal/nodes/Default/sharepoints/user, name's Public Folder.plist
/private/var/db/dslocal/nodes/Default/users/username.plist

Using the dscl command only removes the last of those 3 lines and leaves the other 2 files. Over time you could end up with 100s of these and this seems to be what causes the intermittent boot issues. All of our machines now appear to be working fine since adding those 2 lines to our account cleanup script.

I don't know if this is a bug in the way the dscl command works but you can manually clear these easy enough by removing them all or making something more complicated if you wanted to.

The com.apple.sharepoint and public folder.plist are to do with user shares and the public directory that exists within a user's home directory. If you are deleting all local mobile accounts then there's no harm in doing this.
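
To check how many of these leftovers a Mac is carrying before deleting anything, something like the following could be used. It is a read-only sketch: the count_leftovers name and the optional volume-root argument (for a disk mounted from another boot volume) are my own assumptions, while the two locations are the ones listed above. The users directory is left out because it also holds the system accounts. It would need to run as root to read these folders.

```shell
#!/bin/sh
# Read-only sketch: count the leftover sharepoint records described above.
# $1 is an optional volume root (hypothetical convenience), for example
# "/Volumes/Macintosh HD"; leave it empty to inspect the boot volume.
count_leftovers() {
    node="${1:-}/private/var/db/dslocal/nodes/Default"
    # tr strips the padding that some wc implementations add
    group_count=$(find "$node/groups" -type f -name 'com.apple.sharepoint.group.*.plist' 2>/dev/null | wc -l | tr -d ' ')
    share_count=$(find "$node/sharepoints" -type f -name '*.plist' 2>/dev/null | wc -l | tr -d ' ')
    echo "sharepoint group records: $group_count"
    echo "sharepoint records: $share_count"
}
```

Calling count_leftovers with no argument reports on the running system. Comparing the two numbers with the number of accounts actually present gives a rough idea of how much has been left behind.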

apizz
Valued Contributor

Just to be clear @allanp81 , you have not modified your script as posted here in this thread?

Chuey
Contributor III

@allanp81 @amosdeane We had 4 MacBook Airs that were stuck on startup this morning. We booted from an external hard drive, browsed to the troubled computer's partition, and ran these two commands suggested by allanp81 above:

find /private/var/db/dslocal/nodes/Default/sharepoints -name "*" -type f -delete
find /private/var/db/dslocal/nodes/Default/groups -name "com.apple.sharepoint*" -type f -delete

Once we rebooted the machine, BOOM, it worked and even seemed to boot faster. I think this is definitely the fix. We did not delete the users' home folders or the dscl records. All we did was delete those plist files and it instantly booted. Thanks so much. I think we are going to create a script with those commands and run it as a LaunchDaemon.
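
When working from an external boot drive like this, the two find commands have to point at the internal disk rather than the drive you booted from. A minimal sketch of one way to make that explicit is below. The function name and the volume name "Macintosh HD" are assumptions for illustration, so substitute whatever the internal volume is actually called. Run as root.

```shell
#!/bin/sh
# Sketch: remove the leftover sharepoint records on a chosen volume.
# $1 = volume root, e.g. "/Volumes/Macintosh HD" (example name only);
# pass "" to act on the volume you are currently booted from.
clear_sharepoint_records() {
    node="${1:-}/private/var/db/dslocal/nodes/Default"
    # Refuse to run if the expected folders are not there, so a mistyped
    # volume name fails loudly instead of silently doing nothing.
    if [ ! -d "$node/groups" ] || [ ! -d "$node/sharepoints" ]; then
        echo "dslocal node not found under: ${1:-/}" >&2
        return 1
    fi
    # Same two deletions as the commands above
    find "$node/sharepoints" -type f -delete
    find "$node/groups" -type f -name 'com.apple.sharepoint*' -delete
}
```

For example: clear_sharepoint_records "/Volumes/Macintosh HD". Other group records in the groups folder are left alone, since only names starting with com.apple.sharepoint are matched.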

allanp81
Valued Contributor

@aporlebeke Yes that is correct, we've been running that script now for over a year and I just added the 2 find commands to the script. We have always cleared local mobile accounts on our student machines to prevent build ups.

@Chuey I'm assuming you put in the path to the local disk though and not the external hard drive that you booted from?

Chuey
Contributor III

@allanp81 Correct, we made sure we deleted them on the local hard drive and not our external drive.

allanp81
Valued Contributor

@Chuey Excellent, well it's looking good then. I might see if I can try and streamline it a bit as those 2 commands will also delete the files associated to the admin account, although I'm not sure it matters unless you've changed any sharing settings to do with that user.

Chuey
Contributor III

@allanp81 I noticed if you do not delete home folders associated to users that it will not re-build those files in private/var/folders. I was able to login and browse mounted shares no problem.

Not sure if that is an issue or what ?

allanp81
Valued Contributor

Yes not sure. We're only going to clear those sharepoint files at the same time as deleting the users.

draeconis
New Contributor II

From what we can tell, it looks like an area that handles the user's local 'Public Folders' and sharing rights for these folders specifically.

It seems to reliably resolve this issue, although we've seen many things fix it, only for the issue to come back without reason, so we'll keep testing it for now.

Since you're deleting these files without using $u in your script, you could always put this outside the do loop, since after the first time it runs it'll be redundant :).
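
A sketch of how that reshuffle might look, for anyone who wants it. It is intended to behave like the script posted earlier, with the two find commands moved after the loop so they run once per boot. The account names are the ones from that script; the is_protected and cleanup_accounts function names are my own, and nothing here has been tested on a production lab.

```shell
#!/bin/sh
# Sketch only: account cleanup with the sharepoint deletions hoisted
# out of the per-user loop.

# Accounts and folders to keep (names taken from the script posted earlier).
is_protected() {
    case "$1" in
        administrator|Administrator|admin|kingston|arduser|Shared) return 0 ;;
        *) return 1 ;;
    esac
}

cleanup_accounts() {
    for home in /Users/*; do
        [ -d "$home" ] || continue      # skip plain files
        u=${home##*/}                   # folder name = account short name
        if is_protected "$u"; then
            echo "$u -- detected skipping..."
        else
            echo "$u -- Deleting..."
            /usr/bin/dscl . delete "/Users/$u" && /bin/rm -rf "/Users/$u"
        fi
    done
    # These do not depend on the user, so once after the loop is enough.
    node=/private/var/db/dslocal/nodes/Default
    find "$node/sharepoints" -type f -delete
    find "$node/groups" -name "com.apple.sharepoint*" -type f -delete
}

# cleanup_accounts is deliberately not called here; call it from the
# startup script once the protected list has been checked.
```

One difference from the original: the finds now run on every boot, even if no accounts were deleted that time.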

allanp81
Valued Contributor

@draeconis haha, yes, good point. Will update it once we roll it out properly.

allanp81
Valued Contributor

How are people getting on with this now? Does it look like it's fixed it for everyone?

Rocky
New Contributor III

It's been a week now since the first machines I did (end of last week) and about 4 days for the others (beginning of this week), and I have seen no recurrence thus far. It's finals week at the university I'm at, so there was pretty heavy use at the beginning of the week, tapering off towards the end. I'm very optimistic this is working.