what specifically is the script doing ? can you attach the script?
is there anything common about the boxes it hangs on, eg same vlan, same os? does it hang on the same 15 servers all the time ?
more details about your env - how many appservers, max heap settings, etc ?
have you tried running the hanging agents w/ debug logging enabled ?
can you easily reproduce this ?
I've attached the script:
Unfortunatley there appears to nothing linking the servers, happening on different OS/kernel versions and accross multiple vlans.
Yes it does hang on the same server each time, and easy to produce as it happens everytime i run the script.
Can you tell me how to enable debugging?
I assume you have already tried running the script directly against the problematic target via. NSH instead of a job.
This way we could rule out anything from the script to be responsible for the hang.
What version are the targets where teh hang is being seen? Is it different than the ones where its works?
By debug logging I think Bill might be referring to enabling -x in your script, which you already have, and enabling debug logging for rscd and a couple of other things on the target.
On Unix, you would do this by modifying the /usr/lib/rsc/log4crc.txt and using the logging priority level of "debug" in following lines:
<category name="rscd" priority="info1" appender="/opt/bmc/bladelogic/NSH/log/rscd.log" debugappender="/opt/bmc/bladelogic/NSH/log/rscd.log"/>
<category name="bldeploy" priority="info"/>
yes - in the top of the script put a 'set -x', after the shebang.
another idea is to put in some 'date' statements or hellos or something to see where the hang might be happening as it could be a problem w/ one of the commands you are nexec'ing.