All of my webapps behind the load balancer lose the ability to communicate about every thirty minutes: you can't ping them or remote into them, and they can't ping out either. A reboot resolves the issue and the timer starts again. I upgraded on Friday; things went down yesterday and this has been going on since.
The only issue I ran into was a corrupted log_actions table, which took forever to fix with the MyISAM command-line tools. After that, things started up fine. We have several Ubuntu-based webservers behind a load balancer and I haven't run into any problems with them since the upgrade on Sunday.
What did you upgrade from? I had a similar issue when I installed 9.93. It turned out the 9.93 installer reset the Tomcat memory settings, so my JSS, running on a VM with 16 GB of RAM, could only use 256 MB. Look in your Catalina logs; that's where I found the out-of-memory error that pointed me in the right direction.
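For reference, this is roughly what I put back afterwards. A sketch only: the path and values are assumptions for a default Linux Tomcat layout, and the JSS installer may manage memory elsewhere (e.g. the service wrapper on Windows), so check your own install.

```shell
# Sketch: restore the Tomcat heap in $CATALINA_HOME/bin/setenv.sh
# (path is an assumption; 256m is what the installer fell back to,
# so size -Xmx for your box's RAM instead).
CATALINA_OPTS="-Xms1024m -Xmx8192m"
export CATALINA_OPTS
echo "$CATALINA_OPTS"
```

After editing, restart Tomcat and watch catalina.out for `java.lang.OutOfMemoryError` lines to confirm the old limit really was the culprit.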
Upgraded from 9.93 to 9.96. Checked the memory settings first thing; no adjustments needed. Today we did adjust the Max Pool Size, which had been reset back to 90, bumping it to 1000. OS X devices can run recon, but the inventory info for them isn't displaying correctly. iOS devices can enroll fine, but they don't get profiles, the Self Service web clip, or any of the apps set to install at enrollment. Repaired the database; optimizing it now.
Bumping to see if anyone else is having problems. We will be forced onto JSS 9.96 very soon, as users keep moving up to iOS 10 while iPads are off-site.
We are coming up from 9.92.xxxxxxxxx maintenance release.
I went from the 9.92 MR to 9.96. Everything was fine until I added a few patch reporting titles. Then connections to my cluster went out of control, with random JSS slowness. Removed the patch titles and we're back to a known-good state. Ticket is in with JAMF.
We got our environment back into good working order last Thursday evening by making several adjustments to pool sizes, MySQL, and more. At about 2 AM last night the webapps started dropping again, and now they are bouncing about every 30 minutes again.
Ironically, we also had performance issues today. Things were fine for a week, then today we saw lots of table-level locks and SQL queries stacking up. I increased connections on each JSS from 50 to 1000 and we're back in business. We'll see what JAMF ends up recommending I do.
@cbrewer In our own troubleshooting we also bumped them up to 1000, and they still crashed, though not as quickly as at the default of 90. While working with a JAMF engineer to get things back up, I learned there is a direct correlation between webapp connections (threads) and SQL connections: the total number of threads cannot exceed your SQL connection limit without crashing SQL, which is what we experienced when we bumped up from 90 to 1000. Currently we are at 80, which we stress-tested last Thursday evening and all seemed well; everything was fine Friday and Monday. Early Tuesday morning, something caused Tomcat on one of the webapps to crash, and the roller coaster was back on again today. Today was 7 days from when the issue first began, so I've started thinking about what might run at that interval that could cause this; at this point I'm still looking. I'll have another call with our STAM in the morning.
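To put that correlation into numbers, here's a back-of-the-envelope check. The helper and figures are purely illustrative (three nodes, the old default pool of 90, and a MySQL limit of MaxPoolSize + 1 per the advice elsewhere in this thread), not anything from JAMF:

```shell
# Hypothetical sizing check: total webapp threads across the cluster
# must stay below MySQL's max_connections, or MySQL falls over as
# described above. Usage: check <nodes> <pool_size> <mysql_max_conns>
check() {
  if [ $(( $1 * $2 )) -lt "$3" ]; then echo safe; else echo unsafe; fi
}
check 3 90 301     # three pools at the old default of 90 fit
check 3 1000 301   # three pools bumped to 1000 blow far past the limit
```

Which is consistent with what we saw: the 90-to-1000 bump crashed SQL, while small pools held.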
Anyone else having performance issues with 9.96 in a clustered environment who either isn't using patch reporting, or has tried deleting the patch reporting titles and noticed improvement?
We also had major performance issues yesterday afternoon, and it cleared up after disabling the patch reporting. We'd also worked with JAMF over the past weeks cleaning up some complex smart group criteria, which they told us was exacerbating the problem. The improvement from the smart group cleanup wasn't anywhere near as dramatic as disabling patch reporting.
@mahughe As a general rule, if one is to modify those settings one should probably modify both MaxPoolSize and maxThreads.
maxThreads can be found in server.xml and should be 2.5 times the size of MaxPoolSize (so engineers have told me in the past). I used to use 400 for MaxPoolSize and 1000 for maxThreads. MySQL itself should allow a maximum of one more than MaxPoolSize; in this instance it should allow 401 connections, since MySQL needs to be able to connect to itself even when running at the maximum number of external connections. How you change this depends on which OS you are hosting MySQL from.
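For anyone hunting for where that lives, a sketch of the connector (the path and the surrounding attributes are assumptions for a default Tomcat layout; only maxThreads is the point here):

```xml
<!-- $CATALINA_HOME/conf/server.xml (path assumed): maxThreads on the
     HTTPS connector, ~2.5x the JSS MaxPoolSize (400 -> 1000 per the
     rule of thumb above). -->
<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="1000" SSLEnabled="true" scheme="https" secure="true" />
```

And on the MySQL side, `max_connections = 401` in my.cnf (MaxPoolSize + 1).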
P.S. I was also told by JAMF Software a few years back that Apple warned them against this practice for reliability reasons. In the end, working is working, but here's hoping the issue gets resolved.
JSS (Tomcat and DB) hosted on a single Mac mini running 10.11.6, with Patch Reporting enabled for 5 applications. At various times, when I've adjusted the criteria of smart groups based on Patch Reporting, the mysqld process will spike to between 200% and 300% CPU. It lasts from 1.5 to 3 hours. The JSS web interface is slow to load many pages, and sometimes newly imaged computers are not able to enroll during the spike. It resolves itself until I touch the smart groups again. This happened with 9.93; I haven't tried to reproduce it with 9.96. I have an open case with JAMF Support.
9.93 upgraded Tomcat to 8 by default, at least on the Windows side; I wonder if that is part of the issue. We ran into several problems after updating to 9.93 once school started and it was getting heavy use. So far things seem to be running better after many hours on the phone with JAMF support. We made so many tweaks and changes that I really don't know what fixed our issue, or whether it's really fixed. We have had SQL crash occasionally as well. No patch reporting in use here at all.
I turned on a patching policy (not JAMF's patch stuff; I'm using AutoPkgr) before I left yesterday, and it seemed to kill MySQL overnight at some random time after midnight. I woke up to 150+ outage emails from the JSS going down/up behind the load balancer. Under Settings > Clustering, the connection counts were crazy even after I rebooted the 3 JSS servers this morning. I've disabled the patching policy for now.
When I got in this morning, I rebooted the SQL server and restarted Tomcat on the 3 JSS servers again, and now the connection counts look normal again.
This morning when I arrived, one of the webapps was down, but without it killing the NIC, so I didn't get a port-down notification from missed monitoring pings. @Chris_Hafner Late in the day, based on input from my STAM, I changed a couple of policies that had been running as Ongoing and disabled one completely. Imaging worked through the evening and has still been working as of this morning. All webapps are up and seem happy at this time. This morning a colleague and I also found some leftover VPP user associations that needed to be removed, and we removed them.
Will keep this updated as new events arise..
Patch reporting seems to check for new titles quite often (as seen in JAMFSoftwareServer.log). If you have a large database, queries may not be answered fast enough, and connections can pile up.
Did you check mysql-slow.log for queries that take a long time, or that actually block the whole server while executing?
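If the slow log isn't enabled yet, something like this in my.cnf turns it on. Option names are standard for MySQL 5.6+; the file location and log path vary by OS, so treat them as placeholders:

```ini
# my.cnf sketch (MySQL 5.6+ option names; paths are assumptions)
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2   # seconds; lower it to catch more queries
log_queries_not_using_indexes = 1
```

Restart mysqld after the change, then watch the log during one of the slowdowns.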
I had a lockup problem on 9.81, analyzed the SQL logs, and added new SQL indexes; now everything runs very smoothly. The only catch is that you have to recreate the indexes every time the master JSS application restarts and does its database check, but they really help if you have a lot of extension attributes.
For reference, those are the indexes for 9.81 (and the execution times for creating them, which are quite revealing).
Add MySQL indexes to the Casper database. This has to be done after every restart of the JSS master node (creation time shown after each statement):

CREATE INDEX `computer_group_name` ON `computer_groups` ( `computer_group_name` ); -- 52 ms
CREATE INDEX `display_name` ON `extension_attributes` ( `display_name` ); -- 173 ms
CREATE INDEX `extension_attribute_id` ON `extension_attribute_values` ( `extension_attribute_id` ); -- 11.6 s
CREATE INDEX `is_managed` ON `computers_denormalized` ( `is_managed` ); -- 506 ms
CREATE INDEX `last_contact_time_epoch` ON `computers_denormalized` ( `last_contact_time_epoch` ); -- 536 ms
CREATE INDEX `last_report_date_epoch` ON `computers_denormalized` ( `last_report_date_epoch` ); -- 525 ms
CREATE INDEX `operating_system_build` ON `computers_denormalized` ( `operating_system_build` ); -- 542 ms
CREATE INDEX `operating_system_name` ON `computers_denormalized` ( `operating_system_name` ); -- 565 ms
CREATE INDEX `operating_system_version` ON `computers_denormalized` ( `operating_system_version` ); -- 592 ms
CREATE INDEX `package_file_name` ON `cached_packages` ( `package_file_name` ); -- 103 ms
CREATE INDEX `package_name` ON `package_receipts` ( `package_name` ); -- 60 s
CREATE INDEX `type` ON `package_receipts` ( `type` ); -- 73.9 s
Really, why on earth would you not have an index on extension_attribute_values.extension_attribute_id, package_receipts.package_name, and package_receipts.type? Creating those indexes takes almost 3 minutes; imagine how long the queries take without them!
Patch reporting did me in too. It was the first thing I enabled after I upgraded from 9.92 supplemental to 9.96.
The next day, I got max database connections on my three nodes which were slamming our poor database server.
I should have keyed on that detail as I was troubleshooting this issue with our TAM who finally found a bug report regarding patch reporting on large databases.
If you had to deal with this for a few days, I highly recommend repairing your database after removing patch management. One of my policies got corrupted and started removing the Office suite from all of our 7000 Macs.
Today's update: so far so good since about 9:30 AM yesterday. Tuesday seems to be a trigger for some reason, so I'm hoping this weekend is good and that we make it past Tuesday. If something changes, I'll post what happened.
Survived the weekend and today without any issues with the JSS or its webapps. Tuesday has been the trigger day for its instability; here's hoping we can get past it. Either way, I'll update this thread.
How are everyone's clustered environments doing?
Our issues got better when we turned off patch reporting, but the JSS still goes down consistently if I push a config profile to as few as ~900 devices.
It also goes down randomly a few times a week overnight around 12:08-12:13am. My guess is it's when iOS apps are updating.
It's getting really old trying to restart mysql/tomcat on the servers to get everything back up.
Anyone else still running into this? JAMF may want to move our DB from the D drive over to the C drive next; I hate to do that if it's not going to help.
@CasperSally MySQL is stopping for us about two to three times a week. We aren't finding anything as to why; it just stops. I've reinstalled MySQL twice now. We are clustered, but I don't think that matters; I left the DMZ server off for a week and it still went down twice. Ours goes down at random times (sometimes 8 AM, 3 PM, 1 AM), not while stressed or updating apps. Patch reporting has never been turned on.
I'm seeing something similar. Both the JSS and MySQL are running on a Windows server on the C drive, and I had to restart the services twice this weekend since upgrading to 9.96.
I tried doing a full inventory request on all of our devices and that seemed to kill it this morning, which was never a problem in the past.
thanks @Emmert - this is great info.
Our cluster just crashed again, the 2nd time today. I get alerts from the load balancer and then check JSS Settings > Clustering; the connection counts are always high, and usually at least one of the JSSs behind the load balancer won't load its web interface.
This is getting really old.
@Nick_Gooch SQL has never stopped for me. For us, I get an email from the load balancer that one of the child JSSs is down. I check the connections in Settings and they're out of whack, and usually one or both of the child servers' web interfaces is down.
To get everything back up, I usually have to restart mysql on our db server and tomcat on our parent and 2 child servers. About half the time restarting tomcat doesn't work on one of the child servers and I have to reboot it.
For us, 75+% of the time it happens overnight around 12 AM. Our logs flush at 1 AM and our backup runs at 2 AM, so it's not either of those. The other 25% of the time, it happens when I do anything involving APNs (pushing a config profile to 900 devices crashes it every time, a few minutes after the push). It just went down again, and I suspect it was because our iOS tech was changing app settings so they don't all try updating at midnight.
We are not a clustered environment: a single Ubuntu JSS server with a separate MySQL DB server, also on Ubuntu. We are also seeing some of these issues since 9.96: very slow JSS web performance, and if we make a change to a config profile, the server load (per top) spikes to 50 or more and the JSS stops responding. I have to wait until the load drops under 20, and then things get back to normal.
Have NOT seen the JSS just crash though.
My Windows JSSs (two behind a LB) started going down 2-3x a week (Tomcat hanging) starting a couple of weeks ago. I just upgraded to 9.96 and changed some memory settings last week, and it's no better. Our Windows admin has started some process-trace work to figure out why. Is everyone having issues on Windows? I'm wondering if an MS patch caused the problems... I'm working on a plan to convert these boxes to Linux...