Timing logins and application startup

Not applicable

Greetings.

At my site, we are experiencing a maddening issue wherein 1) (network/mobile) users are sometimes unable to log in (or they log in but can't access their home folder, or logging in is really slow), and 2) Office 2008 applications crash. Both issues are alleviated temporarily when we reboot our Open Directory master, despite the fact that our client machines are all bound to replicas. [The Office crash only started happening after we used WGM to redirect the user's cache folder from the network to the local computer to reduce the ridiculous startup times.]

In order to understand the extent of the problem and to see if any changes we apply alleviate it (and believe me, I would sooner just fix the problem if I could!), I would like to know if it is possible to deploy a script that will time how long it takes for a user to log in, and another one to time how long it takes for certain applications to start (or see if they crash when a user tries to start them.) Does anyone know how I might do this?

Thank you,
Clinton Blackmore


Bukira
Contributor

As for Office, it seems to rely on the local cache: I had the crashing Office problem with redirected cache folders, and once I stopped that procedure Office worked fine.

criss

Criss Myers
Senior Customer Support Analyst (Mac Services)
Apple Certified Technical Coordinator v10.5
LIS Business Support Team
Library 301
University of Central Lancashire
Preston PR1 2HE
Ex 5054
01772 895054

tlarkin
Honored Contributor

Are you by chance running 10.5.3 or 10.5.4? There were known bugs that caused all sorts of sync and login issues, and I saw them myself: it would take literally 2 minutes just to log in with a network account.

Also, how many clients are bound to your Directory Servers?

Not applicable

(In reply to Thomas Larkin, 12-Feb-09, 8:13 AM.)

Most of our clients are running 10.5.4. A handful go back as far as 10.5.2, and some are up-to-date. [This is not counting our older machines that are running Tiger, but they aren't a concern right now.] Most of our 12 directory replicas are running 10.5.5, although the master is running 10.5.6.

For number of clients, I ran:

dscl /LDAPv3/[IP of ODM] list Computers | wc -l
dscl /LDAPv3/[IP of ODM] list Users | wc -l

We currently have 1085 computers in our directory, and 4463 users.

We had a similar login-failure issue three or four months ago, and, after trawling through the logs availed us nothing, we instated a new Open Directory master. [One of my co-workers did it; I think he imaged a server, made it a replica, and then promoted it and made all the other replicas use it as the master.] Things worked great after we did that, until the day that I tried to give a user lesser directory administration privileges, at which point slapd on the master went off the rails and the CPU usage was at 100% for hours at a time. I revoked the privileges, but we have been having problems since then. [Further, we don't recall exactly, but our first master may have started acting up when we gave a user sub-diradmin privileges.] I cannot fathom why this would cause the issue, but it is our best suspicion.

Another symptom is that sometimes a machine will show that network users are available, but they cannot authenticate. On such a machine, dscl sees the LDAP server and the Users directory, but listing said directory brings up zero results. Rebooting or rebinding to the directory often fixes this. So far as we can tell, there is no pattern involving which users or machines will have problems. Just yesterday I saw a user take over 5 minutes to log in to a 2008 iMac connected via a 100 Mb/s (or maybe even gigabit) network, while two-thirds of his class logged in without a problem [except for Word crashing for some of them].

While I am on the topic, can anyone recommend tools for merging or correlating log files?
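For the narrower case of syslog-style files (such as system.log pulled from several machines), one rough way to interleave them by time is plain sort. This is only a sketch: merge_syslogs is a made-up name, it assumes every line starts with the usual "Mon DD HH:MM:SS" stamp, and it will misorder logs that span a year boundary, since that stamp carries no year.

```shell
#!/bin/sh
# Sketch only: interleave syslog-style logs by their leading timestamp.
# merge_syslogs is a hypothetical helper name, not an existing tool.
merge_syslogs() {
    # -k1,1M sorts field 1 as a month name, -k2,2n the day numerically,
    # -k3,3 the HH:MM:SS time as text; -s keeps ties in input order.
    LC_ALL=C sort -s -b -k1,1M -k2,2n -k3,3 "$@"
}

# Example (file names are placeholders):
#   merge_syslogs client-system.log server-system.log | less
```

Logs that use a different stamp (the Password Service and Directory Services logs do) would need their timestamps normalized first.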

Cheers,
Clinton Blackmore

Eyoung
Contributor

Some Office 2008 best practices I gleaned while getting it to run here with our user population.

Here's our setup: about 400 users. 10.5 client and server. All OD network users (laptops are mobile sync'd users). Xserve RAID storage. WGM for server-side caching, and we use WGM for our application whitelist.

The to-do list:

Kill server-side Spotlight indexing for the users' homes. At the root of the directory where your users' homes are located, put the following file (I just did a touch): .metadata_never_index. This came from Apple; it seems the feature is not working as expected under 10.5 Server. This will fix the looooong login times, and it seriously reduced the server processor load.

Make sure a ".TemporaryItems" folder exists and is 777. This will fix the save issues and errors from MS apps about temp items and the renaming of files to .tmp files. This is also from Apple, care of MS. It seems the save calls look for that directory first; they are supposed to be able to look elsewhere for the temp directory, but that does not seem to work right for networked users.

For WGM application filtering: never use the application lists for Office. Always use the folders option, and be sure to add the items in the Application Support folder. This will stop the crashes from MS Office Setup, MS update, and other random launch crashes.

If you get auto-recover issues for a particular user, you will have to define a location in that user's Word prefs. It seems that some, not all, installs of Office go funny when trying to save to the user's home for auto-recovery: Office sees the home as "removable media" and refuses to save. Setting a location seems to fix it... at least the errors go away. I am not convinced auto-recovery works correctly once that bug occurs, though.
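The first two items could be sketched roughly as below. prepare_homes_root is a made-up name and the example path is only a placeholder; the sketch assumes it is run on the server by a user allowed to write at the top of the homes directory.

```shell
#!/bin/sh
# Sketch only: create the two entries described above at the top of the
# directory that holds the users' homes. prepare_homes_root is a
# hypothetical helper name.
prepare_homes_root() {
    root="$1"
    [ -d "$root" ] || { echo "not a directory: $root" >&2; return 1; }
    # An empty marker file asking Spotlight not to index this tree.
    touch "$root/.metadata_never_index" || return 1
    # World-writable scratch folder the Office save calls look for.
    mkdir -p "$root/.TemporaryItems" || return 1
    chmod 777 "$root/.TemporaryItems"
}

# Example (placeholder path):
#   prepare_homes_root /Volumes/HOMES/STUDENTHOME
```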

All in all, the rollout of 2008 was a nightmare. Many weeks of work went into the simple list above. I am happy to share it to hopefully spare others that :-)

tlarkin
Honored Contributor

Are there any errors with the logins? For example, if you ssh into a client and watch its system log while a user tries to log in, does it produce any errors? If all your servers are 10.5.5 you should be in good shape. I did notice vast improvements when we ditched 10.5.3 and 10.5.4 on our servers. 10.5.4 was a pile of dung if you ask me. Also, make sure you are using the correct version of the server tools, as mismatched versions can also cause issues.

I would suggest you watch a client log in and see what happens by ssh'ing into it and watching the system log while it tries to log in.
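Something along these lines would cut the noise while watching. It is only a sketch: login_lines is a made-up helper, and the default process list is a guess at the names that matter in a Leopard system.log.

```shell
#!/bin/sh
# Sketch only: keep just the lines from processes involved in a login.
# login_lines is a hypothetical helper name.
login_lines() {
    # $1 (optional) = extended regex of process names to keep
    grep -E "${1:-loginwindow|SecurityAgent}"
}

# Example (host and account are placeholders):
#   ssh admin@client.example.com 'tail -f /var/log/system.log' | login_lines
```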

Also, have there been any changes to your servers? I assume that at one point in time this was all working great?

When we had our similar problems I got an Apple engineer involved, and they pretty much told me that OD masters and replicas are kind of built around the idea of having no more than 1,000 simultaneous connections.

Also, if you do folder syncing you may want to look at your AFP data throughput charts in Server Admin and see if they fall way off for any reason, then also check your server's CPU usage history.



Thomas Larkin
TIS Department
KCKPS USD500
tlarki at kckps.org
blackberry: 913-449-7589
office: 913-627-0351

Not applicable

Thank you, I've been needing a to-do list. I've got a few questions to be sure I understand correctly.

On 12-Feb-09, at 9:04 AM, Eric Young wrote:

> Some office 2008 best practices I gleaned while getting it to run here with our user population. Here's our setup: about 400 users. 10.5 client and server. All OD network users (laptops are mobile sync'd users). Xserve raid storage. WGM for server-side caching and we use WGM for our application white list.

What server-side caching are you doing? Did you do the cache-folder redirection on the clients (as in http://www.afp548.com/article.php?story=MCXRedirector )?

> the ToDo list: kill server side spotlight indexing for the users homes at the root of the directory your user's home are located put the following file (i just did a touch): .metadata_never_index This came from Apple. seems the feature is not working as expected under 10.5 server. This will fix the looooong login times, and seriously reduced the server processor load.

To disable spotlight indexing on the sharepoint, I connect to the server using Server Admin, find the sharepoint, and ensure the "Enable Spotlight searching" (and the magnifying glass icon) are turned off (not displayed). I take it that helps but does not completely solve the problem. The .metadata_never_index file needs to go at the root of the sharepoint (and each and every sharepoint that I don't want indexed), yes? I suppose that .metadata_never_index at the root of the drive doesn't cut it.

> make sure a ".TemporaryItems" folder exists. and is 777. This will fix the save issues and errors from MS apps about temp items and the renaming of files to .tmp files. This is also from Apple care of MS. It seems the save calls look for that directory first, they are supposed to be able to look elsewhere for the temp directory but it does not seem to work right for networked users.

I saw that Office created these folders, but not with the proper permissions. It seemed to be working, although I remember that on occasion users have had problems re-saving a file, so I'd better fix that.

> For WGM application filtering. Make sure to never use the application lists for office. always use the folders option and be sure to add the items in the application support folder. this will stop the crashes from MS Office setup, MS update, and other random launch crashes.

Does this only apply if you are whitelisting applications?

> If you get auto-recover issues for a particular user you will have to define a location in that user's Word prefs. Seems that some, not all, installs of office go funny when trying to save to the user's home for auto-recovery. it sees the home as "removable media" and refuses to save. Setting a location seem to fix it... at least the errors go away.. I am not convinced auto-recovery works correctly once that bug occurs though.

I don't believe our issues (right now) are related to auto-recovery. Users will be told that there is a problem with the Office database (and if you try to run the repair tool, it will tell you there is no database), or sometimes they will be told "normal.dotx is in use by 'another user'. Would you like to make a copy of it?"

The other message we sometimes see is

"The home folder for user is not located in the usual place or cannot be accessed.

The home or Users folder may have been moved or deleted. If the home folder is located on the network, the server may be unavailable temporarily. If you continue to have problems, see your system administrator."

which indicates to me that the user authenticated but could not mount the file share (while users all around had no problem.) There isn't any chance that they can mount their home folder but not access the normal.dotx file, is there? [Where does that file live, anyways?]

> all in all the roll out of 2008 was a nightmare.. many weeks of work went into the simple list above.. I am happy to share to hopefully avoid that for others :-)

I appreciate that. One of the teachers was extolling the virtues of Google Docs, and I would be quite happy not to deal with Office again!

Cheers,
Clinton Blackmore

Eyoung
Contributor

replies in line.

~~~~~~~~~~
A cynic is a man who, when he smells flowers, looks around for a coffin. --H. L. Mencken

Eric Young
eyoung at thayer.org

On Feb 12, 2009, at 11:56 AM, Clinton Blackmore wrote:

> What server-side caching are you doing? Did you do the cache-folder redirection on the clients (as in http://www.afp548.com/article.php?story=MCXRedirector )?

-- Yes.

> To disable spotlight indexing on the sharepoint, I connect to the server using Server Admin, find the sharepoint, and ensure the "Enable Spotlight searching" (and the magnifying glass icon) are turned off (not displayed). I take it that helps but does not completely solve the problem. The .metadata_never_index file needs to go at the root of the sharepoint (and each and every sharepoint that I don't want indexed), yes? I suppose that .metadata_never_index at the root of the drive doesn't cut it.

-- The file needs to go at the root of the directory where your users' home directories are stored. For example, on my faculty volume I have: /Volumes/FACHOMEDIR/FACULTYHOME/.metadata_never_index. Just the one file at the root does it for the whole directory.

> I saw that Office created these folders, but not with the proper permissions. It seemed to be working, although I remember that on occasion users have had problems re-saving a file, so I'd better fix that.

-- Ugh. I should have been WAY more specific. the .TemporaryItems folder needs to be at the root of the directory just like the metadata file is.

> Does this only apply if you are whitelisting applications?

-- Yes... I think we are saying the same thing but we might have a semantics issue. We use an inclusive apps list: if it's not on the list, it will not run.

> I don't believe our issues (right now) are related to auto-recovery. Users will be told that the there is a problem with the Office database (and if you try to run the repair tool, it will tell you there is no database), or sometimes they will be told "normal.dotx is in use by 'another user'. Would you like to make a copy of it?"
>
> The other message we sometimes see is "The home folder for user is not located in the usual place or cannot be accessed. The home or Users folder may have been moved or deleted. If the home folder is located on the network, the server may be unavailable temporarily. If you continue to have problems, see your system administrator." which indicates to me that the user authenticated but could not mount the file share (while users all around had no problem.) There isn't any chance that they can mount their home folder but not access the normal.dotx file, is there? [Where does that file live, anyways?]

--this sounds like it might be fixed with the .TemporaryItems directory and/or be issues with Office components missing from the WGM approved app list.


Not applicable

Replies inline:

On 12-Feb-09, at 9:11 AM, Thomas Larkin wrote:

> Are there any errors with the log ins? Like if you ssh into a client and watch its system log while a user tries to log in, does it produce any errors? If all your servers are 10.5.5 you should be in good shape. I did notice vast amounts of improvements when we ditched 10.5.3 and 10.5.4 on our servers. 10.5.4 was a pile of dung if you ask me. Also, make sure you are using the correct version of server tools, as this can also cause issues if you are using mismatched versions. I would suggest you watch a client log in and see what happens by ssh into it and watching the system log while it tries to log in.

Sigh. No users experienced this problem when I was nearby today, so I was unable to do that. I did look through the log files on a unit that exhibited problems, and have a long section at the end of this where I have annotated a log. Incidentally, I found that colorizing the log made it less mind-numbing to trawl through. A perl script called loco http://www.zjuul.net/~jules/loco/ , when used like "./loco CJHS-iMacLab-22/system.log | less -R" worked nicely. [I wonder if TextWrangler's syntax highlighting could be (ab)used to do this.]

I'm now using the 10.5.6 server admin tools -- does that cause problems with previous versions of the OS (or are problems more likely when using outdated tools?)

> Also, have there been any changes to your servers and I assume that at one point in time this was all working great?

I don't believe so. Indeed, we seldom even touch our servers to upgrade them.

> When we had our similar problems I got an Apple engineer involved and they pretty much told me that OD Masters and Replicas are kind of built around the idea of having no more than 1,000 simultaneous connections at once.

Huh. Well, this is the first year we've used replicas -- previously every site had its own master and was a universe unto itself.

> Also, if you do folder syncing you may want to look at your AFP data throughput charts in Server Admin and see if they fall way below for any reason, then also check out your servers CPU usage history as well.

The CPU graph on the server at a school called CJHS -- where, in particular, I was having problems -- is at a constant 75% -- which is about 10 times what I would expect. [I wish I had proper monitoring in place and could go back further than seven days. It was hovering at a constant 60% almost a week ago, and then jumped up to 75% and remained there.] Running top on the machine, I see that AFP has gone off the deep end -- using 599.9% of the available CPU time. Time to reboot that box. [Only one of our other servers was misbehaving in the same way.] I had turned on all the AFP logging features on that machine, and now, when they could be useful, the access log starts at Jan 29 and ends on Feb 5th. It was too verbose, so I have turned off many of the logging features.

Seeing the AFP problem, I've changed my mind about putting up long traces from the current log. This does rather explain why rebooting the open directory master didn't help this particular student much.

I am suspecting that this instance of the user (repeatedly) being unable to log in is attributable to the AFP service on the school's server being too busy. The log has lines like:

Feb 11 13:52:31 CJHS-iMacLab-22 com.apple.loginwindow[1964]: Checking for policies triggered by "login" for user "Nels288"...
Feb 11 13:52:31 CJHS-iMacLab-22 com.apple.loginwindow[1964]: Gathering Policy Information from https://192.168.65.185:8443/...
Feb 11 13:52:31 CJHS-iMacLab-22 com.apple.loginwindow[1964]: The disk you specified could not be found.
Feb 11 13:52:31 CJHS-iMacLab-22 loginwindow[1964]: USER_PROCESS: 1964 console
Feb 11 13:52:33 CJHS-iMacLab-22 loginwindow[1964]: Couldn't create temp file /Network/Servers/cjhs.wwsd.net/Volumes/DataHD/CJHSstudents/CJHS_Grade_07/[full name redacted]/Library/Keychains/~hQVDheOBTW3E955I: Unknown error: 118
Feb 11 13:52:33 CJHS-iMacLab-22 loginwindow[1964]: ERROR | -[Login1 setupEnvironment] | Unable to unlock the keychain, SecKeychainLogin returned 100118
Feb 11 13:52:33 CJHS-iMacLab-22 com.apple.launchd[1] (com.apple.UserEventAgent-LoginWindow[1970]): Exited: Terminated
Feb 11 13:52:33 CJHS-iMacLab-22 com.apple.launchd[2026] (0x103c30.zombie[1972]): Failed to add kevent for PID 1972. Will unload at MIG return
Feb 11 13:52:36 CJHS-iMacLab-22 SecurityAgent[1974]: NSExceptionHandler has recorded the following exception: NSUncaughtSystemExceptionException -- Uncaught system exception: signal 11 Stack trace: 0x3721e 0x9183309b 0xffffffff 0x61466 0x6e84d 0x6491d 0x62067 0x907ba95e 0x92e0cb45 0x92e0ccf8 0x962ebda4 0x962ebbbd 0x962eba31 0x91ca6505 0x91ca5db8 0x91c9edf3 0x10fc7 0x202a 0x1
Feb 11 13:52:37 CJHS-iMacLab-22 ReportCrash[2192]: Formulating crash report for process SecurityAgent[1974]
Feb 11 13:52:38 CJHS-iMacLab-22 ReportCrash[2192]: Saved crashreport to /Library/Logs/CrashReporter/SecurityAgent_2009-02-11-135236_CJHS-iMacLab-22.crash using uid: 0 gid: 0, euid: 0 egid: 0
Feb 11 13:52:42 CJHS-iMacLab-22 ARDAgent [2162]: *ARDAgent Launched*
Feb 11 13:52:42 CJHS-iMacLab-22 blued[46]: [setUserPreference] syncs returns false
Feb 11 13:52:43 CJHS-iMacLab-22 ARDAgent [2162]: *ARDAgent Ready*
Feb 11 13:52:43 CJHS-iMacLab-22 blued[46]: [_setUserPreference] syncs returns false

and lots and lots of lines like:

Feb 11 13:53:38 CJHS-iMacLab-22 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer[4176]: FolderManager: Failed looking up user domain root; url='file://localhost/Network/Servers/cjhs.wwsd.net/Volumes/DataHD/CJHSstudents/CJHS_Grade_07/[full name redacted]/' path=/Network/Servers/cjhs.wwsd.net/Volumes/DataHD/CJHSstudents/CJHS_Grade_07/[full name redacted]/ err=-120 uid=7100 euid=7100
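Since the loginwindow lines carry timestamps and a PID, a crude proxy for "how long did this login attempt take" is the gap between the first and last line logged by that loginwindow process. The sketch below assumes exactly that; login_span_seconds is a made-up name, it ignores the date (so a login that crosses midnight gives nonsense), and the span of loginwindow chatter is not necessarily the time the user actually waited.

```shell
#!/bin/sh
# Sketch only: seconds between the first and last system.log line written
# by a given loginwindow PID. login_span_seconds is a hypothetical helper.
login_span_seconds() {
    # $1 = log file, $2 = loginwindow PID
    awk -v pid="$2" '
        index($0, "loginwindow[" pid "]") {
            split($3, t, ":")                  # field 3 is HH:MM:SS
            s = t[1] * 3600 + t[2] * 60 + t[3]
            if (first == "") first = s
            last = s
        }
        END { if (first != "") print last - first }
    ' "$1"
}

# Example (PID taken from the excerpt above):
#   login_span_seconds /var/log/system.log 1964
```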

Thanks for your time. I will see if I am able to get a proper trace of what is going on, especially if I can attribute it to something other than AFP.

Cheers,
Clinton Blackmore

tlarkin
Honored Contributor

Replies in bold...



Thomas Larkin
TIS Department
KCKPS USD500
tlarki at kckps.org
blackberry: 913-449-7589
office: 913-627-0351

Not applicable

Thanks for the help! Replies in teal.

On 13-Feb-09, at 7:58 AM, Thomas Larkin wrote:

>> I'm now using the 10.5.6 server admin tools -- does that cause problems with previous versions of the OS (or are problems more likely when using outdated tools?)
>
> Yes, using the wrong server tools versions can cause issues, especially with Workgroup Manager: it can cause BSD database corruption, which will hose your LDAP. Ever see a new user generate a negative UID number?

Gee! I would not have expected that. Is it possible to tell if the BSD database is corrupt or not?

If it is corrupt, is there a way to recover? (Googling shows me http://sdb.open-xchange.com/node/29 , and I imagine something similar might work. Oh, hey, http://www.barbariangroup.com/posts/1668-fixing_b0rked_open_directory_ldap_databases has steps similar to what we took when the first master failed.) If the master's LDAP DB is corrupt, would I expect all the replicas to have the same corruption? Would fixing the master cause the fix to replicate?

>> Huh. Well, this is the first year we've used replicas -- previously every site had its own master and was a universe unto itself.
>
> We had a small problem. In the master image the client is bound to the ODM. I then have Casper policies that change bindings via a shell script. Well, for some reason they weren't running, and all 6,000 clients were bound to the ODM; they proceeded to bend over my Xserve and throw it to its knees. Everything ran slow. That is remedied now: the Casper policy is running and all client machines get bound to the ODR in their building.

Ouch!

>> The CPU graph on the server at a school called CJHS -- where, in particular, I was having problems -- is at a constant 75% -- which is about 10 times what I would expect. [I wish I had proper monitoring in place and could go back further than seven days. It was hovering at a constant 60% almost a week ago, and then jumped up to 75% and remained there.] Running top on the machine, I see that AFP has gone off the deep end -- using 599.9% of the available CPU time. Time to reboot that box. [Only one of our other servers was misbehaving in the same way.] I had turned on all the AFP logging features on that machine, and now, when they could be useful, the access log starts at Jan 29 and ends on Feb 5th. It was too verbose, so I have turned off many of the logging features.
>
> How many connections are you seeing on AFP? I assume that all home folders are on AFP? Do you do portable home directories?

Looking at the graphs, there are peaks and plateaus. The last plateau (before I rebooted) was at ~70 connections. The last peak was double that. 70 connections is largely accounted for by our two desktop labs, which use network home folders. Our two laptop labs are using portable home directories, and may explain the peak.

>> ... [The log file has] lots and lots of lines like:
>>
>> Feb 11 13:53:38 CJHS-iMacLab-22 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer[4176]: FolderManager: Failed looking up user domain root; url='file://localhost/Network/Servers/cjhs.wwsd.net/Volumes/DataHD/CJHSstudents/CJHS_Grade_07/[full name redacted]/' path=/Network/Servers/cjhs.wwsd.net/Volumes/DataHD/CJHSstudents/CJHS_Grade_07/[full name redacted]/ err=-120 uid=7100 euid=7100
>
> That last line, where it can't look up the home folder path, kind of makes me think DNS issue. Is your DNS fully resolved both forwards and backwards? In OS X Server the changeip command is what is used to check this, and of course to set it. I have had my share of small DNS issues, and they will always come back to bite your leg off. So make sure you get your DNS in order. You can ssh into your server and run this command:
>
> xs106-a:~ root# changeip -checkhostname
> Primary address = 10.160.3.30
> Current HostName = xs106-a.kckps.org
> DNS HostName = xs106-a.kckps.org
> The names match. There is nothing to change.

The results came back as expected ("the names match. There is nothing to change") on our master, former master, and all but two of the replicas. Those two came back with "The DNS hostname is not available, please repair DNS and re-run this tool." I'll look into that, but problems have been occurring at sites where this is not an issue.
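With a dozen replicas this check is tedious by hand, so it could be looped. A rough sketch follows: classify_changeip is a made-up helper that only looks for the two messages quoted in this thread, and the ssh loop in the comment assumes root ssh access to each server.

```shell
#!/bin/sh
# Sketch only: reduce `changeip -checkhostname` output (read on stdin) to a
# one-word verdict. classify_changeip is a hypothetical helper; it matches
# only the two messages quoted in this thread.
classify_changeip() {
    out=$(cat)
    case "$out" in
        *"The names match"*)               echo ok ;;
        *"DNS hostname is not available"*) echo dns-unavailable ;;
        *)                                 echo unknown ;;
    esac
}

# Example (server names are placeholders):
#   for h in odm.example.com replica1.example.com; do
#       printf '%s: ' "$h"
#       ssh "root@$h" changeip -checkhostname | classify_changeip
#   done
```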

Just trawling through the logs. On the CJHS school server, the Password Service Error Log shows this line quite frequently:

Feb 13 2009 07:40:22 DoSyncWithServerChangeList: "Parent" has a transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:00:44 DoSyncWithServerChangeList: "Parent" has a transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:10:17 DoSyncWithServerChangeList: "Parent" has a transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:20:55 DoSyncWithServerChangeList: "Parent" has a transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:30:28 DoSyncWithServerChangeList: "Parent" has a transaction ID beyond the current value, resetting to 0.
Feb 13 2009 09:00:30 DoSyncWithServerChangeList: "Parent" has a transaction ID beyond the current value, resetting to 0.

On our ODM, I see some lines like this in the Directory Services Error Log:

2009-02-06 14:20:44 MST - T[0xB05A6000] - dsDoReleaseContinueData - PID 0 error -14071 while checking if reference <16777292> is a node
2009-02-11 06:01:37 MST - T[0xB0699000] - dsDoReleaseContinueData - PID 0 error -14071 while checking if reference <16777276> is a node

The Kerberos Administration Log shows lots of entries like:

Feb 13 09:56:03 odm.wwsd.net kadmin.local[6683](info): No dictionary file specified, continuing without one.
Feb 13 09:56:03 odm.wwsd.net kadmin.local[6683](info): No dictionary file specified, continuing without one.

Well, I'm going to continue to look at the logs and see if I see anything more.

Cheers,
Clinton

tlarkin
Honored Contributor

My email is ghetto here, so I don't have a lot of options; I will just answer the needed items from previous emails in sections. I don't get any fancy colored text options.....

Yes, WGM can cause all sorts of issues if you aren't using the proper version. This came straight to me from an Apple engineer and from official Apple server books (the ACSA books). Also, if you are seeing LDAP and BSD database corruption you should first dsexport your users and groups to plain text immediately. This will preserve their account information and UIDs, but not their passwords.

In the worst-case scenario, you may have to rebuild LDAP from scratch. It sounds horrid, I know, because I had to do it once, but I did it in one day (one 13-hour work day). All you have to do is demote everything to standalone. Then wipe out the LDAP from your ODM (demoting it first), reimport everything, then go back and promote all your standalones to replicas so they get a fresh sync of LDAP.

10.5.4 client and server were a headache here; we bumped everything up to 10.5.5 and a lot of our problems disappeared.

If your replicas are returning DNS errors and you map home directories by FQDN, that too can cause problems. We have a legacy DNS that some of the older PCs use, and a server or two picked up our old DNS and it screwed lots of things up, so now our DNS database points all Mac servers to the proper DNS and specifically omits them from the other DNS.

In the ACSA books Apple says they do not recommend netbooting more than 50 clients for imaging purposes. Imaging is done over AFP, and I have examples of how flaky AFP is. I took screenshots of AFP throughput when we were imaging this summer. If we did not kick off the file transfer at the same time on all clients, AFP would flake out trying to load-balance the connections, and data throughput would halve.

As for your specific problem, I would try to figure out what accounts have problems, watch the logs as they log in and see what specific errors you get.

FYI, when I had the LDAP corruption I was getting PasswordService failures on my replicas as well, and Kerberos wasn't working properly either. It is hard to tell what your exact problem is. As a first step I would try to first demote your replicas to stand alone configuration, then promote them back to replicas. This will force down a fresh copy of your LDAP to them.

Good luck!



Thomas Larkin
TIS Department
KCKPS USD500
tlarki at kckps.org
blackberry: 913-449-7589
office: 913-627-0351