are you using the process spawner?
Yes we are...
We also have multiple job servers running (configured in the application server launchers...). This example used the 4th one, whms11915_job4
I'd make sure the process spawner is running on each job server. Then I'd look in the spawner.log to see if there are any errors there. I'd also look for ulimit-related errors (too many open files, etc.).
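A minimal sketch of that log check, assuming the spawner log is plain text and that the strings below are the ones worth flagging. The pattern list and the `scan_file` helper are illustrative assumptions, not something shipped with BladeLogic:

```python
import re

# Illustrative patterns only; extend with whatever your environment logs.
PATTERNS = [
    r"too many open files",
    r"connection refused",
    r"no buffer space available",
    r"\[ERROR\]",
]
_MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)


def suspicious_lines(lines):
    """Yield (line_number, text) for every line matching a pattern."""
    for number, text in enumerate(lines, start=1):
        if _MATCHER.search(text):
            yield number, text.rstrip("\n")


def scan_file(path):
    """Scan one log file; the path is whatever your install uses."""
    with open(path, errors="replace") as handle:
        return list(suspicious_lines(handle))
```

Running `scan_file` against the spawner log on each job server gives a quick list of lines to read first. It does not replace reading the log around those lines.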
what version of solaris is this and what version of bladelogic ?
Not sure if this was ever resolved, but I'm experiencing a similar issue running file deploys in a batch. The source of the files is a Linux server, and the targets (~5200) are Linux. The appserver is win2k8 running BSA 8.3.03. The spawners are running (turned on yesterday) but might be crashing, given some of the errors I'm seeing.
These are the job errors I'm seeing:
1) Error in master script on server target-host1
Error Jul 9, 2014 11:02:36 PM Caught exception running command - "C:/Program Files/BMC Software/BladeLogic/NSH/bin/nsh" --norc -c "/C/Program Files/BMC Software/BladeLogic/NSH/tmp/appserver_job_4/1389b78b-d985-462a-8691-dd41c6b2b538/master_b506228a-af13-45dc-bce9-bb89782d55f4" target-host2
Error: There was an error connecting to the Process Spawner: Connection refused to host: appserver; nested exception is:
java.net.ConnectException: Connection refused: connect. Please confirm that the Process Spawner is running.
Error Jul 9, 2014 11:02:57 PM Caught exception running command - "C:/Program Files/BMC Software/BladeLogic/NSH/bin/nsh" --norc -c "/C/Program Files/BMC Software/BladeLogic/NSH/tmp/appserver_job_3/49bb4bbc-4a0a-4520-8c3d-e4e7a7f1f509/master_b506228a-af13-45dc-bce9-bb89782d55f4" target-host3
Error: Connection refused: connect
Info Jul 9, 2014 11:02:54 PM -w Preserve file times only
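"Connection refused" usually means nothing is listening on the port at that moment. A rough way to confirm that from the appserver is a plain TCP connect test. This is only a sketch: the host name comes from the error above, and the port is an assumption you have to fill in from your own spawner configuration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("appserver", spawner_port)`, where `spawner_port` is a hypothetical variable holding whatever port your process spawner is configured to listen on. A `False` result while jobs are failing would support the theory that the spawner is down or crashing.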
Yesterday when the spawners were off I was also seeing the 'error in master script' issue but was also seeing nsh fork issues.
I opened a ticket with BMC about a month ago. They said to try replacing cygwin1.dll with version 1.7.30 on the appservers. We tried this, but it caused some major issues, so we reverted to cygwin1.dll version 1.7.17.
I don’t think the spawner is running. what’s in the spawner log ?
Most of it looks like normal processing.. here is the only error I was able to find from this AM
(This error is repeated every hour on all appservers)
[10 Jul 2014 03:52:51,039] [Scheduled-System-Tasks-Thread-2] [WARN] [System:System:] [Mnt Win Opening Update] Waiting for application_server_code_lock lockid: 6 to be released.
[10 Jul 2014 03:53:00,793] [Scheduled-System-Tasks-Thread-3] [ERROR] [System:System:] [Mnt Win Opening Update] Service bladelogic.service.AppServerService is not available.
com.bladelogic.om.infra.app.service.ServiceNotAvailableException: Service bladelogic.service.AppServerService is not available.
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
that’s in the spawner log ? if you aren’t using the maintenance windows in the acl policies you can ignore that.
can you zip up the appserver and spawner logs covering the time of your problem and attach ?
I think you'll get this error if one or more of your app servers are down. The app servers talk to each other periodically over the RMI port to check which servers are available for process spawning and such. Have you deleted any deployment (possibly incorrectly) recently, by any chance?
Also, what value do you have in blasadmin for the setting below? Is it set to 60, by any chance?
To specify an interval between heartbeats for an Application Server, enter the following:
set appserver ServerMonitorInterval #
where # is the frequency with which an Application Server updates its own time stamp (that is, its heartbeat). When an Application Server updates its heartbeat, it also checks for the heartbeat of any remote Application Servers.
Bill Robinson - since we just enabled the spawners yesterday, I'm going to let it cook for another day. It seems that not all appservices restarted yesterday due to some hung jobs in the queue. I'm trying to get those queues cleaned out today and will see what the same job run does this evening. If the same issue shows up in the job logs, I will zip up the appserver and spawner logs.
Yanick Girouard - it's set to 20.
actually this isn't an appserver communication issue. it's for the 'maintenance window' feature. you can get rid of the message by disabling that service w/in the appserver, and i think it's fixed in a later version.
Ah didn't catch that sorry.
So I believe all the appservices were successfully restarted yesterday, which enabled the process spawner. While I did see better overall job completion, 3 of my jobs from the batch reported errors in the thousands (4000+, 2000+, 1250+).
Here are the three main concerns:
Error Jul 11, 2014 1:20:07 AM Caught exception running command - "C:/Program Files/BMC Software/BladeLogic/NSH/bin/nsh" --norc -c "/C/Program Files/BMC Software/BladeLogic/NSH/tmp/appserver1_job_2/baafcf66-1c2e-4a7a-8e8d-ac9353af4e60/master_5d7a5f32-65ab-485b-8e87-c8b75fa3a30a" targethost1
Error: java.rmi.ConnectIOException: Exception creating connection to: appserver1; nested exception is:
java.net.SocketException: No buffer space available (maximum connections reached?): connect
Error Jul 11, 2014 1:20:38 AM nsh:1: no such file or directory: /C/Program
Error Jul 11, 2014 1:20:39 AM Error running master script on server targethost2
Error Jul 11, 2014 1:25:39 AM Error running master script on server targethost3
attached zip of logs - msg me on communities for password
appserver01-logs.zip 9.9 MB
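#2 looks like your script is missing double-quotes around a command call somewhere. There are spaces in "Program Files", so an unquoted path gets cut off at the first space, which matches the "no such file or directory: /C/Program" message.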
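To illustrate the quoting problem, here is a small sketch using Python's `shlex` module, which splits a command line with POSIX shell rules. It is not NSH itself, and the script file name in the path is hypothetical, but the word splitting on spaces works the same way:

```python
import shlex

# Hypothetical script path; only the "Program Files" part matters here.
SCRIPT = "/C/Program Files/BMC Software/BladeLogic/NSH/tmp/script"


def first_word(command_line):
    """Return what a POSIX-style shell would treat as the command name."""
    return shlex.split(command_line)[0]


unquoted = SCRIPT + " targethost2"
quoted = '"' + SCRIPT + '"' + " targethost2"

# Without quotes the command name stops at the first space.
print(first_word(unquoted))
# With double quotes the full path survives as one word.
print(first_word(quoted))
# shlex.quote builds a safely quoted form when assembling commands in code.
print(shlex.quote(SCRIPT))
```

The first call returns `/C/Program`, the same truncated path that shows up in the job log, while the quoted form returns the full path.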