I already saw that and there were 3 cases where this happened
- All your Work-Item-Threads are in the "STUCK" state (see the AppServer details). This means that at some point they were used to communicate with agents in a state where a socket was opened, and the thread now waits forever for an answer.
- If you have more than one AppServer configured on your DB, jobs can by default be split across several AppServers, which means that some tasks will be performed by remote Work Item Threads. This mechanism uses special threads called "RMI threads" (basically, these permit Remote Method Invocations from one JVM to another). These threads can in some cases also end up stuck (although this is less obvious in the AppServer details than the first case), in which case the job stays in the "waiting to run" state until another job that's already running terminates.
- The third case is a variant of the second: the same "RMI threads" are involved in communication with the Process Spawner, and the same kind of problem can happen there.
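All three cases come down to an RMI call blocking with no timeout. As a rough illustration only (these are generic Sun/Oracle JVM RMI properties, not a documented BladeLogic fix, and the `JAVA_OPTS` variable name is an assumption; check your AppServer startup scripts), the JVM can be told to give up on unresponsive RMI peers instead of waiting forever:

```shell
# Sketch only: generic JVM-level RMI timeouts, assuming the AppServer JVM
# picks up extra flags from a JAVA_OPTS-style variable in its startup script.
JAVA_OPTS="${JAVA_OPTS:-}"
# Give up on RMI connection attempts after 30 seconds
JAVA_OPTS="$JAVA_OPTS -Dsun.rmi.transport.proxy.connectTimeout=30000"
# Fail an in-flight RMI call if no response arrives within 10 minutes
JAVA_OPTS="$JAVA_OPTS -Dsun.rmi.transport.tcp.responseTimeout=600000"
echo "$JAVA_OPTS"
```

Whether the AppServer honours such flags depends on how its launcher is built, so treat this as a direction to investigate rather than a verified workaround.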
There are several tickets and defects open for this problem (not yet solved), but you should probably open a ticket as well...
The workarounds are for now:
1. restart the Process Spawner (if used)
2. restart the AppServers
3. reconfigure the AppServers to avoid using Job splitting capabilities and external spawning:
Note: you have to do this by editing the "Config.xml" file, as some keywords aren't available through blasadmin.
Does the process spawner work? I've had problems getting it to function properly in versions earlier than 7.3; I get RMI timeouts or errors.
... well, that's a question I ask myself, actually. What I know is that I used it for a couple of days without problems before running into this RMI issue again.
At first I thought that disabling inter-AppServer communication would be sufficient, but now I'm in a situation where I basically use my AppServers the way I did back in version 6, so I believe the right answer is no, although it seems to be a generic RMI problem.
Oliver, thanks for the information. I updated some of the items you suggested in blasadmin, and now I am no longer seeing the issue.
MaxConcurrentRemoteWorkItemRequests:5 (default = 5)
MaxTimeForCancelToFinish:5 (Changed, default = 10)
PropagateWorkItemTimeout:false (Changed, default = true)
IdleConnectionPruneTime:30 (Changed, default = empty)
RegistryPort:9836 (default = 9836)
ComplianceResultMaxNumberOfAssets: (default = empty)
RemoteServerTimeout:60 (default = 60)
ServerMonitorInterval:10 (default = 10)
SocketConnectTimeout:30 (default = 30)
SocketTimeout:600 (default = 600)
UseSSLSockets:no (default = no)
RequireClientAuthentication:yes (default = yes)
MaxJobTimeInSchedulerQ:60 (default = 60)
MinJobExecutionConnections:0 (default = 0)
MaxJobExecutionConnections:100 (default = 100)
MinGeneralConnections:0 (default = 0)
MaxGeneralConnections:20 (default = 20)
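For reference, settings like these are normally applied through blasadmin. The invocation below is an assumption based on the usual "set <module> <property> <value>" pattern; the module name and session style may differ, so check it against your BladeLogic version's documentation before running:

```shell
# Assumed blasadmin syntax; the "AppServer" module name and the
# heredoc-driven session are illustrative, not verified.
blasadmin -s default <<'EOF'
set AppServer MaxTimeForCancelToFinish 5
set AppServer PropagateWorkItemTimeout false
set AppServer IdleConnectionPruneTime 30
quit
EOF
# Restart the AppServer afterwards so the new values take effect.
```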
... Did you experience your problem as soon as the AppServer was up?
I faced this issue of jobs being stuck in the 'waiting to run' state forever. To fix it, I had to go back to basics and sync the clocks of the participating app servers... It turned out my app servers were set to different time zones. I was able to run the jobs after this.
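A quick way to rule out the clock-skew cause described above is to compare the UTC epoch time reported by each app server. This sketch assumes the hostname `appserver2` and uses a local stand-in for the remote value; in practice the remote timestamp would come from something like `ssh appserver2 'date -u +%s'`:

```shell
# Hypothetical clock-drift check between two app servers.
local_epoch=$(date -u +%s)
remote_epoch=$local_epoch   # stand-in; really: ssh appserver2 'date -u +%s'
drift=$((local_epoch - remote_epoch))
if [ "${drift#-}" -le 5 ]; then
  echo "clocks in sync"
else
  echo "drift of ${drift}s: resync both servers against the same NTP source"
fi
```

Comparing in UTC epoch seconds sidesteps the time-zone mismatch entirely, since zone settings only affect how the same instant is displayed.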