Did you resolve this issue in the end? I'm seeing the same thing running a compliance job that calls an nsh script.
I 'solved' my problem by limiting the concurrency. Seems to be a performance issue.
I have generally found that this is connected to having MaxNshProxyContexts set too low. This is very important if you are using the NSH proxy and should be kept fairly high to allow sufficient NSH connections to process all the jobs and/or client requests. In my experience, you always end up with more NSH connections than you would expect from running jobs.
You also need to keep an eye on MaxNshProxyThreads, but this does not need to be as high as the proxy multiplexes a number of contexts on each thread.
Thanks, that's very useful. I've added it to my list of items to investigate. Could you give any idea of the sorts of numbers that are suitable for MaxNshProxyContexts and MaxNshProxyThreads?
My experience here is from 8.0, and this area has changed from 7.6, so almost certainly does not all apply on 7.x. This is also a complex topic and so I cannot cover everything here.
The numbers depend on the type of jobs being run and how many NSH connections they open. It also depends on whether you have a single NSH proxy for everything or whether the Job Servers are also configured as NSH proxies (this is often the case with 8.0, especially if you are using automation principals or SOCKS proxies).
MaxNshProxyThreads controls the number of threads used to handle NSH connections. MaxNshProxyContexts controls the maximum number of NSH connections from both users with NSH and for jobs running NSH commands. Normally multiple NSH connections are handled by a single thread and so MaxNshProxyContexts should always be higher than MaxNshProxyThreads (generally a multiple of it).
The only exception to this is if NshProxyMaxThreadIdleTime is set to -1. In this case, each NSH connection is allocated a specific thread and so there is no point allocating more contexts than there are threads.
So the right values depend on your particular use case, but you might start out with a modest baseline and tune upward once you see how many connections your jobs actually generate.
If you are also connecting NSH clients to this NSH proxy (as well as just running NSH jobs through it), you may want to increase MaxNshProxyContexts even more.
The main challenge is working out how many NSH connections your jobs generate, because each job may create several: for example, "ls -la | grep dir" uses 2 NSH connections, and a single File Deploy job also appears to use more than one.
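To put some arithmetic behind that, here is a back-of-the-envelope sizing sketch. Every number in it is illustrative (the job count, connections per job, headroom factor, and the 5-contexts-per-thread ratio are all assumptions to replace with figures from your own environment):

```shell
# Back-of-the-envelope sizing; every number here is illustrative, not a recommendation.
JOBS=30            # concurrent jobs that run NSH commands
CONNS_PER_JOB=2    # e.g. "ls -la | grep dir" opens 2 NSH connections
HEADROOM=3         # slack for interactive NSH users, retries, etc.
CONTEXTS=$((JOBS * CONNS_PER_JOB * HEADROOM))
THREADS=$((CONTEXTS / 5))    # assuming roughly 5 contexts multiplexed per thread
echo "MaxNshProxyContexts >= $CONTEXTS, MaxNshProxyThreads >= $THREADS"
```

The point is only that contexts scale with jobs times connections per job, plus headroom, and threads follow from the contexts-to-threads ratio.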
We've come across this problem as well, in an effort to enable the use of automation principals on a limited number of servers. After setting up NSH proxies on our appservers and making them the default, a number of jobs (File Deploys) started failing with the above error, presumably because of a lack of available NSH proxy contexts. I've since increased this from the default (or previously set) value of 20 to 200, but I'm not sure whether I need to change anything else as well to avoid this error.
Will I see jobs fail because MaxNshProxyThreads is too low, or just poor performance? This is currently set to 3.
What is the relation of MaxWorkItemThreads to MaxNshProxyThreads? We've recently set this to 50 for unrelated reasons, but I'm unsure if a higher number of WorkItemThreads would help or exacerbate the problem of running out of NshProxyContexts/NshProxyThreads (i.e. is it possible that the number of required NshProxyContexts/Threads could increase as the number of WorkItemThreads increases).
Would setting NshProxyMaxThreadIdleTime to 0 cause any nsh thread to immediately time out? This doesn't seem like it would be very good for performance. This is set to 500 in our (hopefully default) environment, so moving it to 0 seems like it would be a precipitous change. Also, do NshProxySocketOperationTimeout (set to 7200) or NshProxySocketConnectTimeout (set to 60) need to be changed?
Other than the potential to overwhelm the appservers, would there be any overt problems caused by just setting the NSH contexts and/or threads to an extremely high number while we work out more appropriate values? We've never used an NSH proxy before, so we have no idea what kind of numbers we need, except that they need to be large enough to absolutely avoid the "SSO Error".
I'll try and answer some of these, but I don't have all the answers here.
On MaxNshProxyThreads, this should be increased. A general guideline is a ratio of about 5:1, so if MaxNshProxyContexts is set to 200, you should have MaxNshProxyThreads set to about 40. A higher ratio degrades performance first, and at the extreme may result in timeouts and failures.
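If it helps, these can be changed with blasadmin; I believe the settings live under the AppServer module, but verify the module and setting names on your version before applying (and restart the Application Server afterwards):

```shell
# Assumed blasadmin invocation -- confirm the module and setting names on
# your version (e.g. with "blasadmin show AppServer all") before applying.
blasadmin set AppServer MaxNshProxyContexts 200
blasadmin set AppServer MaxNshProxyThreads 40
# Restart the Application Server for the change to take effect.
```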
You should keep WorkItemThreads at 50 as I believe this is unrelated to MaxNshProxyThreads (not 100% sure on this though).
Setting NshProxyMaxThreadIdleTime to 0 actually sets the timeout to never expire rather than immediately expire. In most cases 500 should be fine, we dropped it to zero during testing as we thought that some SOCKS proxy challenges were related to this. In retrospect, this may have been a red herring.
Not sure on the NSH Proxy Socket timeout parameters, but we have left them at the defaults.
As you say, setting the NSH threads too high (these are the ones that consume the resources) will just eat app server resources, so you may need to revisit the JVM memory parameters as well.
This is interesting. We get errors like that in what seems to be completely different circumstances than most of the posts above.
Here goes: BL 8.0 SP5 Patch 1 (also seen in SP1 and even earlier, under BL 7.4.6), everything running on Solaris 10 servers.
3 job appservers, none set to be NSH proxies, all on separate servers
1 configuration manager appserver also configured to be an NSH proxy, another separate server
We don't allow users on our appserver hosts, nor do we allow them to run NSH from their PCs where the clients are installed, so we provide another Solaris 10 host with an appserver installed but not running - I suppose we could have just installed NSH...
When we get "SSO error: Error reading server greeting", so far we have always tracked it back to the shared memory on the NSH host. Basically, every NSH session allocates a block of shared memory. If the session is not properly closed out (for example, by typing "exit" at the prompt of an interactive session), that block of shared memory remains allocated. Eventually, the server runs out of blocks it can allocate, and every attempt to start an NSH session generates this error or other similar errors. Note: other programs besides NSH use shared memory as well, so care must be taken when cleaning up this resource...
Clearing the no longer needed shared memory blocks on the NSH host fixes the problem. We actually needed to create a NSH script job to do this on a scheduled basis, due to the frequency of occurrence...
command to view the shared memory - "ipcs -ma"
command to clean up shared memory - "ipcrm -m $id" (and we are removing the shared memory blocks only of a small size, found experimentally)
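For anyone wanting to script that cleanup, here is a hedged sketch of the selection step. The size threshold and the column positions ($1 type flag, $2 ID, $10 SEGSZ, per Solaris 10 "ipcs -ma" output) are assumptions to verify against the header on your own host before feeding any IDs to ipcrm, since other programs use shared memory too:

```shell
# Sketch: print the IDs of shared-memory segments at or below a size
# threshold, from "ipcs -ma"-style input. Column positions are an assumption
# based on Solaris 10 output -- check them on your host before removing anything.
select_small_segs() {
    awk -v max="$1" '$1 == "m" && $10+0 > 0 && $10+0 <= max { print $2 }'
}

# Review the candidates first, then remove:
#   ipcs -ma | select_small_segs 4096
#   ipcs -ma | select_small_segs 4096 | while read id; do ipcrm -m "$id"; done
```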
Note: In my case, we are mostly talking about interactive terminal sessions running NSH, or NSH scripts invoked directly on this machine via a Web interface, NOT NSH script jobs running via the BL interface. As a matter of fact, I don't believe we have ever seen this error in an NSH script job invoked from within Bladelogic. It also does not appear to be related to the NSH proxy in any way.
Sometimes, if no cleanup is done, and no system reboot occurs, these shared memory blocks have been observed to hang around for months - the first time a complete cleanup was done there were blocks allocated by IDs that had not worked here for more than 5 months!
Hope this helps!
I simply can't seem to banish this problem. I've set nsh proxy threads to 50, contexts to 200 on 2 appservers. I restrict nsh jobs to running 30 at a time and I'm STILL getting this error.
I would usually run NSH script jobs centrally and pass the server list as a parameter. This was causing problems with the proxy getting overwhelmed and requests being denied. Now I've resorted to rewriting it to runscript on individual servers and outputting the results to a central file; even that is causing problems. I'm not sure whether setting up a dedicated NSH proxy would help, and if we did that we'd need it to have at least 150 threads and 750 contexts. Does that seem like a reasonable number for an environment with around 5000 total servers?
Were you able to find a solution for this problem?
We have also set up the automation principal in our environment recently. NSH Script Jobs and File Deploy Jobs, when run against a larger subset of servers, fail on certain servers with the same error message reported in this thread. I can set a limit for parallel processing on these kinds of jobs, but that will not help if multiple users run these jobs simultaneously. These are the values that I have set:
ideally you should have a single job instance talking to each proxy.
if that job instance has MaxWorkItemThreads = 100
For a job server w/ maxworkitemthreads = 50
Restricting File Deploys to running concurrently on only about half as many servers as there are NSH proxy threads seems to work without issue. I have not run into any situation where I needed to run a File Deploy on huge numbers of servers at the same time, so I usually just restrict to 90 targets at a time, which even for a job running on a few thousand servers usually doesn't take that long. As you say, this wouldn't help if multiple users ran such jobs simultaneously, but that is not a situation I have seen.
The thing that fixed this issue the most for me was adding an explicit disconnect to any NSH script that iterates through a list of servers. The job will retain the connections and hold on to the NSH proxy thread until explicitly directed to disconnect. So if your NSH job looks like:
for server in $1
If you add a 'disconnect $server' into the loop, it prevents the problems we used to get with such NSH jobs.
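To make that concrete, here is a minimal sketch of the loop with the explicit disconnect. The helper name and the `2>/dev/null || true` guard are mine, only so the sketch also runs outside NSH where the `disconnect` built-in does not exist; in a real NSH script you would simply call `disconnect $server` after your per-host commands (e.g. nexec):

```shell
# Sketch only: iterate a space-separated server list, run a command per host,
# and explicitly release the NSH proxy connection after each one.
run_on_servers() {
    for server in $1; do
        "$2" "$server"      # per-host work; in a real NSH script, e.g. nexec
        # Without this, the job holds every connection until it finishes:
        disconnect "$server" 2>/dev/null || true   # NSH built-in; no-op elsewhere
    done
}

run_on_servers "web1 web2" echo    # echo stands in for real per-host work
```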
Thank you Bill and Jim for the solutions. Changing the blasadmin settings really helped, and now I can run the File Deploy job against around 1000 servers at a time. But it is good to keep a limit (say 100 or so) so that the app server can handle multiple such jobs running at once without memory or SSO-greeting-related issues.
But I am still facing problems with NSH Script Jobs. For example, I created a Type 3 job to execute certain batch/Windows commands (just to test) on the target servers.
iisapp /a test /r
If I add the first 3 commands, it works on all servers and the job completes. But when I add the last command, the BL window reports 'success' on 1 server and 'running' on the rest. And when I click on an individual server in the window, the logs suggest that all the commands executed, but the job keeps on running. Is there something that needs to be changed for Type 3 jobs? Type 4 jobs work fine.
Note: I have not created any new job. I am using the same jobs which were working before setting up the AP.
What does "iisapp /a test /r" do? Perhaps it's waiting for a return that never comes back.