As noted above, this really sounds like a resource management issue, either with SQL or with Tomcat. In my experience, the issue is often with SQL - especially when you have a ton of clients submitting data at once. One quick way to help is to increase the check-in interval (15 minutes is standard, but 30 minutes or 1 hour may be helpful on a temporary basis), and to reduce the amount of data being reported - do you really need application logs, for example?
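To put rough numbers on the check-in interval suggestion, here is a back-of-the-envelope sketch. It assumes check-ins are spread evenly across the interval (real traffic is burstier) and uses the ~10,000 clients mentioned in this thread; the function name is just for illustration.

```python
def checkins_per_second(clients: int, interval_minutes: float) -> float:
    """Average check-ins per second, assuming clients are spread evenly over the interval."""
    return clients / (interval_minutes * 60)


if __name__ == "__main__":
    # Compare the standard 15-minute interval with the longer temporary settings.
    for minutes in (15, 30, 60):
        rate = checkins_per_second(10_000, minutes)
        print(f"{minutes:>2} min interval: ~{rate:.1f} check-ins/sec")
    # 15 min interval: ~11.1 check-ins/sec
    # 30 min interval: ~5.6 check-ins/sec
    # 60 min interval: ~2.8 check-ins/sec
```

Doubling the interval halves the average request rate, which is why it works as a quick temporary relief valve.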
Some questions that would help clarify things:
Assuming there isn't a major issue with your JSS server, like having no disk space, or a corrupted ROOT.war, it should be straightforward to adjust things to improve performance.
We have ~10,000 clients with a mandatory inventory once a week.
Both Tomcat and MySQL are on the same Mac Pro.
Max DB connections is 901, max Tomcat threads is 1500. MySQL averages ~90 connections. Tomcat has 8GB allocated to it and is using ~6GB.
We're not collecting application logs, running services, or usage logs.
Mac Pro mid-2013, 2.7GHz 12-core Xeon E5, 32GB RAM, 512GB SSD with 400GB free.
Mac OS X 10.11, JSS 9.82
Java JDK 1.8.0_66
H'okay, yeah, that is way too many MySQL connections - assuming the MySQL server's max_connections variable allows that many in the first place.
As a rule of thumb, you really only need 10-25 MySQL connections, except at the largest of sites. Even if having fewer connections causes some requests to queue, the reduction in parallel operations (especially on a system that doesn't have RAID) ultimately allows MySQL to work much faster. It's a tradeoff: you can work really slowly on a ton of requests at once, or really fast on a few requests at a time. Generally, the latter is better - you just want to keep an eye on the responsiveness of the JSS to ensure it doesn't get bogged down by queued requests.
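If the "fewer connections, more queueing" idea is hard to picture, here is a minimal sketch of the mechanism. It is not JSS or MySQL code: a semaphore stands in for the connection cap, `time.sleep` stands in for a query, and all names and numbers are made up for illustration. It only shows that extra requests wait their turn instead of piling onto the database; it does not measure the speedup itself.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_requests(total_requests: int, max_connections: int,
                 work_seconds: float = 0.01) -> int:
    """Push requests through a capped 'pool' and return the peak number in flight."""
    pool = threading.BoundedSemaphore(max_connections)  # hypothetical connection cap
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def handle_request(_):
        nonlocal in_flight, peak
        with pool:  # blocks (queues) while every connection is in use
            with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(work_seconds)  # stand-in for a database query
            with lock:
                in_flight -= 1

    # One thread per request, so any waiting happens at the semaphore.
    with ThreadPoolExecutor(max_workers=total_requests) as executor:
        list(executor.map(handle_request, range(total_requests)))
    return peak


if __name__ == "__main__":
    peak = run_requests(total_requests=50, max_connections=10)
    print(f"peak concurrent 'connections': {peak}")  # never more than 10
```

However many requests arrive at once, the number actually hitting the "database" never exceeds the cap; the rest queue briefly, which is the behavior to watch for on the JSS side.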
In addition, at 10k clients, you really should be looking at separating out the SQL server from the JSS, and ideally be running two JSSes behind a load balancer. A single JSS can handle 10k clients if it's on sufficiently powerful hardware - but a load-balanced setup helps deal with sudden spikes of activity (like a policy with a recon), and gives you some redundancy. Ideally, you'd throw in a third "Admin" JSS on a VM somewhere that acts as the Cluster Master, and ensures you have access to the admin interface if something goes wrong.