9.96 Upgrade Issues

mahughe
Contributor

Has anyone who has upgraded to 9.96 seen any bizarre behaviors in their clustered environment?

81 REPLIES

cbrewer
Valued Contributor II

Had some performance issues this morning after going from 9.93 to 9.96. Against the recommendation of Jamf support I increased my connection pools by large margins. It seems to have helped, but I'll be scaling it back down tonight.

What are you seeing @mahughe ?

mahughe
Contributor

All of my webapps behind the load balancer lose their ability to communicate about every thirty minutes, meaning you can't ping them or remote into them, and they can't ping out either. A reboot solves the issue and the timer begins again. I upgraded on Friday and went down yesterday with this going on.

mike_graham
New Contributor II

The only issue I ran into was a corrupted log_actions table, which took forever to fix via the MyISAM command-line tools. After that, things started up fine. We have several Ubuntu-based webservers behind a load balancer and I haven't run into any problems with them since the upgrade on Sunday.

jchurch
Contributor II

What did you upgrade from? I had a similar issue when I installed 9.93. It turned out the 9.93 installer reset the Tomcat memory settings, so my JSS, running on a VM with 16 GB of RAM, was only able to use 256 MB. Look into your Catalina logs; that's where I found the out-of-memory error that pointed me in the right direction.
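For anyone checking the same thing, here's a minimal sketch of where a stock Tomcat picks up its heap settings. The file name and values below are assumptions for illustration; the exact location varies by platform and by how the JSS installer manages Tomcat, so confirm against your own install.

# Hypothetical setenv.sh (or equivalent CATALINA_OPTS setting) giving Tomcat an 8 GB heap ceiling
CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx8g"

If the heap has been reset too low, the Catalina logs will show java.lang.OutOfMemoryError entries, which is what to grep for after an upgrade.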

mahughe
Contributor

We upgraded from 9.93 to 9.96. Checked the memory settings first thing; no adjustments needed. Today we did adjust the Max Pool Size, which had been reset back to 90, and raised it to 1000. OS X devices can recon, but the inventory info for the device is not displaying correctly. We can enroll iOS fine, but don't get profiles, the Self Service web clip, or any of the apps set to drop down on enrollment. Repaired the database; optimizing it now.

cpdecker
Contributor III

Bumping to see if anyone else is having problems. We will be forced onto JSS 9.96 really soon, as users keep moving up to 10 while iPads are off-site.

We are coming up from the 9.92.xxxxxxxxx maintenance release.

Thanks!

CasperSally
Valued Contributor II

I went from the 9.92 MR to 9.96. Everything was fine until I added a few patch reporting titles. Then connections to my cluster went out of control and we saw random JSS slowness. We removed the patch titles and we're back to a known good state. Ticket in with JAMF.

cpdecker
Contributor III

Thanks for the info! I will approach patch reporting cautiously, although I was looking forward to trying it out.

mahughe
Contributor

We had success getting our environment back into good working order last Thursday evening by making several adjustments to pool sizes, MySQL, and more. At about 2 AM last night the webapps started dropping again, and now they are bouncing about every 30 minutes again.

cbrewer
Valued Contributor II

Ironically, we also had performance issues today. Things were fine for a week, then today we saw lots of table-level locks and SQL queries stacking up. I increased connections on each JSS from 50 to 1000 and we're back in business. We'll see what JAMF ends up recommending I do.

mahughe
Contributor

@cbrewer In our own troubleshooting we also bumped them up to 1000, and they still crashed, though not as quickly as at the default of 90. While working with a JAMF engineer to get things back up, I learned there is a direct correlation between the webapp connections (threads) and the SQL connections, and how they impact each other. The total number of threads cannot exceed your SQL connections without crashing SQL, which is what we experienced when we bumped up from 90 to 1000.

Currently we are at 80, which we stress tested last Thursday evening and all seemed well; everything was fine Friday and Monday. Early Tuesday morning something caused Tomcat on one of the webapps to crash, and the roller coaster was back on again today. Today was 7 days from when the initial issue began, so I started thinking about what might be running at that interval that could cause this, and at this point I'm still looking. I'll have another call with our STAM in the morning.
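For anyone comparing their own numbers, these are standard MySQL statements (nothing JSS-specific) that show the connection ceiling and how close the webapps are getting to it. If Max_used_connections is bumping up against max_connections, the combined pools are larger than the database will accept:

-- Current ceiling on client connections
SHOW VARIABLES LIKE 'max_connections';
-- Connections open right now, and the high-water mark since the last MySQL restart
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';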

CasperSally
Valued Contributor II

For anyone else having performance issues with 9.96 in a clustered environment: are you not using patch reporting, or have you tried deleting the patch reporting titles and noticed improvement?

Our issues went away after removing the 8 titles I had added. Wondering if that helps others (paging @cbrewer and @mahughe).

jross
New Contributor

We also had major performance issues yesterday afternoon, and it cleared up after disabling the patch reporting. We'd also worked with JAMF over the past weeks cleaning up some complex smart group criteria, which they told us was exacerbating the problem. The improvement from the smart group cleanup wasn't anywhere near as dramatic as disabling patch reporting.

Chris_Hafner
Valued Contributor II

@mahughe As a general rule, if one is to modify those settings one should probably modify both MaxPoolSize and maxThreads.

maxThreads can be found in server.xml and should be 2.5 times the size of MaxPoolSize (so engineers have told me in the past). I used to use 400 for my MaxPoolSize and 1000 for maxThreads. MySQL itself should allow a maximum of one more connection than the MaxPoolSize; so in this instance it should allow 401 connections, as MySQL needs to be able to connect to itself even when running at the maximum number of external connections. How you change this depends on which OS you are hosting MySQL on.
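To make the two knobs concrete, here is a minimal sketch using the numbers above. The connector and port shown are placeholders (a real JSS server.xml has its own connector definitions), and in a clustered setup the MySQL limit has to cover the pools of all webapps combined.

In Tomcat's server.xml:
<Connector port="8443" maxThreads="1000" ... />

In MySQL's my.cnf (my.ini on Windows), under [mysqld]:
max_connections = 401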

P.S. I was also notified by JAMF Software a few years back that Apple warned them against this practice for reliability reasons. In the end, working is working, but here's hoping the issue gets resolved.

jrwilcox
Contributor

We are cloud hosted and had to remove our patch reporting to get the JSS to remain functional.

Kennedy
New Contributor II

Patch reporting took down our JSS, too.

Disabled Patch Reporting and all good.

jhalvorson
Valued Contributor

JSS (Tomcat and DB) hosted on a single Mac mini running 10.11.6, with Patch Reporting enabled for 5 applications. At various times, when I've adjusted the criteria of some smart groups based on Patch Reporting, the mysqld process will spike to between 200 and 300%. It lasts from 1.5 to 3 hours. The JSS web interface is slow to load many of its pages, and sometimes newly imaged computers are not able to enroll during the spike. It resolves itself until I mess with the smart groups again. This happened with 9.93, but I haven't tried to reproduce it with 9.96. I have an open case with JAMF Support.
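For what it's worth, when mysqld pegs the CPU like that, a standard way to see which queries are responsible (plain MySQL, nothing JSS-specific) is:

-- Lists every open connection along with the statement it is currently executing and how long it has been running
SHOW FULL PROCESSLIST;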

Nick_Gooch
Contributor III

9.93 upgraded Tomcat to 8 by default, at least on the Windows side. I wonder if that is part of the issue. We ran into several issues after updating to 9.93 once school started and it was getting heavy use. So far things seem to be running better after many hours on the phone with JAMF support. We made so many tweaks and changes that I really don't know what fixed our issue, or if it is really fixed. We have had SQL crash occasionally as well. No patch reporting being used at all here.

mahughe
Contributor

@Chris_Hafner I'm at 80 on the pool, 300 on the threads, and 1010 on SQL; these were all modified on a call with a JAMF engineer last Friday. I just got everything back up, drove back to the office, and it went down again in 30 minutes or close to it. We currently do not have Patch Reporting enabled.

Chris_Hafner
Valued Contributor II

@mahughe Interesting. I'm hoping that you're not saying that your JSS is down now?

mahughe
Contributor

@Chris_Hafner That's what I'm saying... down. Just got back onsite to put Humpty back on the wall after making a few suggested changes.

CasperSally
Valued Contributor II

I turned on a policy for patching (not JAMF's patch stuff; I'm using AutoPkgr) before I left yesterday, and it seemed to kill MySQL overnight at a random time after midnight. I woke up to 150+ outage emails from the JSS going down/up behind the load balancer. Under Settings > Clustering, the connection counts were crazy even after I rebooted the 3 JSS servers this morning. I disabled the patching policy for now.

When I got in this morning, I rebooted the SQL server and restarted Tomcat on the 3 JSS servers again, and now the connection counts seem normal again.

Chris_Hafner
Valued Contributor II

Wow! What a pain! I hadn't made it around to really testing this one and was planning to do that next week. I guess I'll be watching this thread carefully! @mahughe what did you end up changing to get back up and running?

mahughe
Contributor

This morning when I arrived, one of the webapps was down, but without it killing the NIC, so I didn't get a notification of the port being down from missed monitoring pings. @Chris_Hafner Late in the day, based on input from my STAM, I changed a couple of policies that were running as ongoing and disabled one completely. Imaging worked through the evening and has still been working to this point this morning. All webapps are up and seem to be happy at this time. This morning my colleague and I also found some remnant VPP user associations that needed to be removed, and we removed them.

Will keep this updated as new events arise..

andyinindy
Contributor II

@mahughe: What platform are your JSS servers on? Same question for MySQL. Also, are you on physical hardware or VMs?

mahughe
Contributor

@andyinindy The 3 webapps run on Mac minis with 1 TB of space and 16 GB of RAM (8 GB for Tomcat). The Tomcat master runs on a quad-core Xserve with 32 GB of RAM (half of it to Tomcat). MySQL runs on a quad-core Xserve with 32 GB of RAM, and the master distribution point is now a Mac mini with the same specs as the webapp minis.

cvgs
Contributor II

Patch reporting seems to be checking for new titles quite often (as seen in the JAMFSoftwareServer.log). If you have large databases, there might be issues if the queries are not answered fast enough and connections could pile up.

Did you check the mysql-slow.log and look for queries which take a long time or are actually blocking the whole server while executing?
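If the slow log isn't already on, it can be enabled at runtime on any reasonably recent MySQL without a restart. A sketch (the 5-second threshold is just an example, and the settings revert at the next MySQL restart unless they are also added to my.cnf):

-- Log any statement that runs longer than 5 seconds
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 5;
-- Confirm where the log file is being written
SHOW VARIABLES LIKE 'slow_query_log_file';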

I had a lockup problem on 9.81, analyzed the SQL logs, and added new SQL indexes; now everything is running very smoothly. The only problem is that you have to recreate the indexes every time the Master JSS application restarts and does its database check, but they really help if you have a lot of extension attributes.

For reference, those are the indexes for 9.81 (and the execution times for creating them, which are quite revealing).

Add MySQL indexes to the Casper database. This has to be done after every restart of the JSS master node:

CREATE INDEX `computer_group_name` ON `computer_groups` ( `computer_group_name` )
52ms

CREATE INDEX `display_name` ON `extension_attributes` ( `display_name` )
173ms

CREATE INDEX `extension_attribute_id` ON `extension_attribute_values` ( `extension_attribute_id` )
11.6s

CREATE INDEX `is_managed` ON `computers_denormalized` ( `is_managed` )
506ms

CREATE INDEX `last_contact_time_epoch` ON `computers_denormalized` ( `last_contact_time_epoch` )
536ms

CREATE INDEX `last_report_date_epoch` ON `computers_denormalized` ( `last_report_date_epoch` )
525ms

CREATE INDEX `operating_system_build` ON `computers_denormalized` ( `operating_system_build` )
542ms

CREATE INDEX `operating_system_name` ON `computers_denormalized` ( `operating_system_name` )
565ms

CREATE INDEX `operating_system_version` ON `computers_denormalized` ( `operating_system_version` )
592ms

CREATE INDEX `package_file_name` ON `cached_packages` ( `package_file_name` )
103ms

CREATE INDEX `package_name` ON `package_receipts` ( `package_name` )
60s

CREATE INDEX `type` ON `package_receipts` ( `type` )
73.9s

Really, why on earth would you not have an index on extension_attribute_values.extension_attribute_id, package_receipts.package_name, and package_receipts.type? The creation of the index takes almost 3 minutes; imagine how long the queries take without them!

zinkotheclown
Contributor II

Patch reporting did me in too. It was the first thing I enabled after I upgraded from 9.92 supplemental to 9.96. The next day, I hit max database connections on my three nodes, which were slamming our poor database server.
I should have keyed in on that detail as I was troubleshooting this issue with our TAM, who finally found a bug report regarding patch reporting on large databases.
If you had to deal with this for a few days, I highly recommend repairing your database after removing patch management. One of my policies got corrupted and started removing the Office suite from all of our 7000 Macs.

mahughe
Contributor

Today's update: so far so good since about 9:30 AM yesterday. Tuesday seems to be a trigger for some reason, so I'm hoping this weekend is good and that we make it past Tuesday. If something changes, I'll post what has taken place.

mahughe
Contributor

We survived the weekend and today without any issues with the JSS or its webapps. Tuesday has been the trigger day for its instability; here's hoping we can get past it. Either way, I'll update this thread.

CasperSally
Valued Contributor II

How are everyone's clustered environments doing?

Our issues got better when we turned off patch reporting, but it still goes down consistently if I push out a config profile to as few as ~900 devices.

It also goes down randomly a few times a week overnight around 12:08-12:13am. My guess is it's when iOS apps are updating.

It's getting really old trying to restart MySQL/Tomcat on the servers to get everything back up.

Anyone else still running into this? JAMF may next want to move our DB from the D drive over to the C drive; I hate to do this if it's not going to help.

Nick_Gooch
Contributor III

@CasperSally MySQL for us is stopping about two to three times a week. We aren't finding anything as to why; it just stops. We've reinstalled MySQL twice now. We are clustered, but I don't think that matters. I left the DMZ server off for a week and it still went down twice. Ours goes down at random times (sometimes 8 AM, 3 PM, 1 AM), not while stressed or updating apps. Patch reporting has never been turned on.

CasperSally
Valued Contributor II

@Nick_Gooch Is MySQL hosted on Windows? If so, on the C drive? Thanks, this sounds very much like our issue, so I want to send it to our TAM in case they don't realize our cases are so similar.

Emmert
Valued Contributor

I'm seeing something similar. Both the JSS and MySQL are running on a Windows server on the C drive, and I had to restart the services twice this weekend since upgrading to 9.96.

I tried doing a full inventory request on all of our devices and that seemed to kill it this morning, which was never a problem in the past.

CasperSally
Valued Contributor II

thanks @Emmert - this is great info.

Our cluster just crashed again, the 2nd time today. I get alerts from the load balancer, then check JSS Settings > Clustering, and the connection counts are always high; usually at least one of the JSSs behind the load balancer won't load its website at all.

This is getting really old.

Nick_Gooch
Contributor III

@CasperSally Windows, C drive for both the JSS and SQL. Does SQL just stop running for you when it goes down? That's what we are seeing. And it ALWAYS happens when I am out of the office.

CasperSally
Valued Contributor II

@Nick_Gooch SQL has never stopped for me. For us, I get an email from the load balancer that one of the child JSSs is down. I check connections in Settings and they're out of whack, and usually one or both of the child servers' web interfaces is down.

To get everything back up, I usually have to restart MySQL on our DB server and Tomcat on our parent and 2 child servers. About half the time restarting Tomcat doesn't work on one of the child servers and I have to reboot it.

For us, 75+% of the time it happens overnight around 12 AM. Our logs flush at 1 AM and our backup happens at 2 AM, so it's not either of those. The other 25% of the time it happens if I try to do anything that involves APNS (pushing a config profile to 900 devices crashes it every time, a few minutes after the push), or, like today, it just went down and I suspect it was because our iOS tech was changing settings on apps so they don't all try updating at midnight.

lehmanp00
Contributor III

We are not a clustered environment; we have a single Ubuntu JSS server with a separate MySQL Ubuntu DB server. We are also seeing some of these issues since 9.96: very slow JSS web performance, and if we make a change to a config profile, the server load (per top) spikes to 50 or more and the JSS stops responding. I have to wait until the load drops to under 20, and then things get back to normal.

Have NOT seen the JSS just crash though.

thoule
Valued Contributor II

My windows JSS's (two behind LB) started going down 2-3x a week (tomcat hanging), starting a couple weeks ago. Just upgraded to 9.96 and changed some memory settings last week and its no better. Our windows admin is started some process trace work to figure out why. Is everyone having issues on Windows? I'm wondering if it was a MS patch causing the problems... I'm working on a plan to convert these boxes to linux...