27 Replies. Latest reply on Jul 16, 2014 2:08 PM by Yanick Girouard

    Intermittent "Caught exception running command - /opt/bladelogic/bin/nsh --norc -c ..." errors in type 1 NSH Script jobs running against multiple targets in parallel

    Yanick Girouard

      I have opened an official support ticket for this, but I was wondering if the community could help as well, so here goes.


      We have several NSH Script jobs (type 1: "Execute script against each host (using "runscript")") that intermittently fail on some targets with the following error, while running flawlessly on the other targets:


      Caught exception running command - /opt/bladelogic/bin/nsh --norc -c /opt/bladelogic/tmp/whms11915_job4/scripts/job__bf2c8598-3d0a-44fd-b0ef-899be50413dc/master_d7003ccf-0a4c-4d46-9048-e3785d3331b3
      Error: null


      The path after "--norc -c" changes with each job run, but the rest of the message is always the same. In the job logs for the affected targets, this error is the first and only log message, meaning the app server fails to launch the script at all.


      However, when we rerun the job against a single target that failed, it completes without any error. We can also run other jobs against the same targets and browse them with the Live Browse functions of the OM.


      The targets run different operating systems on different platforms and have nothing in common (different versions, patch levels, VLANs, etc.).


      All of those jobs seem to have one thing in common: they all have "Number of Targets to Process in Parallel" set to a number higher than 10 (such as 25).
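One way to check whether parallelism alone can trigger the failure is to launch the same kind of command concurrently outside the job engine. The Python sketch below is a hypothetical diagnostic, not a BladeLogic tool: it assumes Python 3.7+ is available on the application server, and the test script path `/tmp/test_script.nsh` and the helpers `run_once` and `stress` are made up for illustration. It only reuses the `nsh --norc -c` invocation shown in the error above. Since the job engine launches nsh from within the application server process, a clean result here would not rule out a server-side cause.

```python
"""Hypothetical stress harness: launch the same command many times with a
bounded worker pool, loosely mimicking "Number of Targets to Process in Parallel"."""
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, timeout=300):
    """Run cmd once and return (returncode, message).

    returncode is None when the process could not be started or timed out.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return proc.returncode, proc.stderr.strip()
    except (OSError, subprocess.SubprocessError) as exc:
        return None, repr(exc)


def stress(cmd, runs, parallel):
    """Run cmd `runs` times with at most `parallel` concurrent launches.

    Returns a list of (run_index, returncode, message) for every failed run.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(lambda _: run_once(cmd), range(runs)))
    return [(i, rc, msg) for i, (rc, msg) in enumerate(results) if rc != 0]


if __name__ == "__main__":
    # Hypothetical test script; the real job script paths are generated per run.
    nsh_cmd = ["/opt/bladelogic/bin/nsh", "--norc", "-c", "/tmp/test_script.nsh"]
    for parallel in (1, 10, 25):
        failed = stress(nsh_cmd, runs=100, parallel=parallel)
        print(f"parallel={parallel}: {len(failed)} of 100 runs failed")
```

Comparing the failure counts at parallel=1 and parallel=25 gives a rough idea of whether concurrent launches fail on this host even without the job engine involved.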


      Rerunning the jobs against all targets does not always produce failures on the same targets, which leads me to believe this may be a threading issue.
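To make the "different targets fail each time" observation concrete, the failed-target lists from several runs of the same job can be compared. This hypothetical Python sketch assumes each run's failures have been copied out of the job logs into a set of hostnames; `classify_failures` and the hostnames are illustrative only. Hosts that fail in every run would point at the target itself, while hosts that fail only in some runs fit the threading theory better.

```python
"""Compare failed-target lists from several runs of the same job."""
from collections import Counter


def classify_failures(runs):
    """runs: list of sets of hostnames that failed in each run.

    Returns (always_failed, sometimes_failed, fail_counts).
    """
    counts = Counter(host for failed in runs for host in failed)
    always = {host for host, n in counts.items() if n == len(runs)}
    sometimes = set(counts) - always
    return always, sometimes, counts


if __name__ == "__main__":
    # Hypothetical hostnames; replace with the targets that failed in each run.
    runs = [{"hostA", "hostB"}, {"hostC"}, {"hostA", "hostD"}]
    always, sometimes, counts = classify_failures(runs)
    print("failed in every run:", sorted(always))
    print("failed in some runs:", sorted(sometimes))
```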


      What could explain this error, and what can we do to prevent it completely?


      We're running BladeLogic Server Automation 7.6 on Solaris 10.
