8 Replies Latest reply on May 18, 2015 5:02 AM by Clement BARRET

    [ NSH ] NSH Shell is crashing silently...

    Clement BARRET

      Hi,

       

      I have a major issue I'd like to talk about here.

       

      Please tell me whether this has already been resolved in newer BladeLogic versions.

       

      Our platform version : BladeLogic Network Shell 8.3.02.332

       

      Case 1 :

       

      So, let's say you're running an NSH Script Job.

       

      You have only one target (you can have more; that is unrelated).

       

      Your NSH script is very simple, but it also implies a target server reboot and a "disconnect".

       

      The calling NSH script (from the Job item) is sometimes "killed" silently (or crashes silently), so the end of the script (after the reboot/disconnect) is never reached or executed.

       

      There is no error message either...

       

      Case 2 :

       

      I also have another case which triggers the "NSH sudden death":

       

      Simply load a Job context using blcli in an NSH script that is running "as a daemon" (that is, an NSH process that waits for input in a given format and then launches the given Job).

       

      Run it using the blcli. If the Job lasts more than about 1 hour (that is my estimate; I'm not sure exactly how long it must be), the next command in your NSH script after the blcli Job "executeAndWait" call (of any kind) will just crash the NSH script silently... Let's just say it is REALLY hard to work around such a "sudden death", and it is causing me a lot of trouble...

       

      I really wish you could help with these or tell me if they are already known (and patched) issues.

       

      Best regards.

       

       

       

                 Clément BARRET

       

       

       

      PS: I don't know why, but no matter what I do, I can't get rid of the "disconnect" output (I tried all possible redirections... it's still flushed and printed at the end of the NSH script). Any idea?

        • 1. Re: [ NSH ] NSH Shell is crashing silently...
          Bill Robinson

          1 - do you have any logs of this ? including the appserver log that ran the WIT for the job ?  and only in the case where you do a reboot of the target and/or (?) a 'cd //@;disconnect target' ?  and type 1, type  2 ?

           

          2 - i haven't seen nsh crash but if the blcli -> appserver connection is disrupted i've seen that the next blcli command will fail. that's usually an issue if the blcli is connecting through a load balancer or some other network device to the appserver and that connection is timed out.  and you'd need the 'appserviceurls' set on that instance to the vip or whatever.  also - any logs of this ?

           

          what 'disconnect' output are you referring to ?

           

          also - were there other nsh jobs running during this time ?  there was a defect where when the nsh job was stopping it would kill the appserver and/or spawner randomly, i'm not sure if that would affect other pids on the box or not (eg, another nsh process for another nsh job)

          • 2. Re: [ NSH ] NSH Shell is crashing silently...
            Clement BARRET

            Yes I do,

             

            (I've not checked the appserver's log and I guess WIT stands for Worker Item Thread right ?)

             

            For your information : we use NSH proxy.

             

            Case 1)

             

            Basically, the logs are the ones accessible from the "BladeLogic console" on the given Job Item. After a script executes (it includes a call to my own safe reboot.nsh script), a number of file copies etc. should be made (each with its own printed message), and none of them happen.

             

            I got some "SSL_Disconnect" or "SSL error" messages, etc.

             

            Case 2)

             

            It's really easy to make NSH "silently segfault" (as I call it) or have a "sudden death" (which shouldn't ever happen, in my opinion). There is no reboot involved in this scenario.

             

            The easiest way is this: load a Job context (a job with a single sleep 9000 should be good enough) and start it in your NSH script with blcli_execute Job executeJobAndWait. You can be sure that the next command in the NSH script that requires the NSH context (another blcli call, for example) will silently crash NSH.
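
The reproduction steps above could be scripted roughly as follows. This is only a sketch under assumptions: `repro_long_job` is a hypothetical helper name, the job is assumed to be an NSH Script Job that can be looked up by group and name, and the function is meant to be called from an NSH session where `blcli_connect` has already succeeded. It has not been run against a real application server.

```shell
# Hypothetical reproduction helper (assumes an established blcli connection).
#   $1 = job group path, $2 = job name (a job that runs for well over an hour)
repro_long_job() {
  # Look up the job key and keep it in JOB_KEY.
  blcli_execute NSHScriptJob getDBKeyByGroupAndName "$1" "$2" || return 1
  blcli_storeenv JOB_KEY
  echo "[INFO] starting job, this blocks until the job ends"
  blcli_execute Job executeJobAndWait "$JOB_KEY"
  echo "[INFO] executeJobAndWait returned $?"
  # Per the report, the shell dies on the next blcli call, so the marker
  # below is expected to be missing from the output when the problem occurs.
  blcli_execute NSHScriptJob getDBKeyByGroupAndName "$1" "$2"
  echo "[INFO] still alive after follow-up blcli call ($?)"
}
```

If the last `[INFO]` line never appears and no error is printed either, that would match the "sudden death" described in this post.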

             

            The logs I can provide look as if I had trimmed the end... but I haven't :X

             

            That's really problematic, but I've found a way to work around this issue: call "executeJobAndWaitForRunID", then stay in a loop that queries the JobRun state every 10 seconds (I guess that keeps the connection up). NB: I have to redo the findByJobRunKey on each pass, otherwise it will crash at some point...

             

            The loop I use looks like :

             

              SECONDS=0;
              while [[ "$job_is_running" != "false" && $SECONDS -lt $JOB_MAX_EXECUTION_TIME ]]; do
                [ -f "$GENERES_DAEMON_SHUTDOWN_FILE" ] && STATE="shutdown" && break;
                sleep $JOB_POLLING_INTERVAL;
                printf "${(e)LNOW} [INFO] Checking the job running state (after %ds)...\n" "${SECONDS}";
                blcli_exec_silent JobRun findByJobRunKey "${JOB_RUN_DBKEY}"; RC=$?;
                (( RC )) &&  echo "${(e)LNOW} [ERROR] JobRun findByJobRunKey ($RC)" && return $RC;
                blcli_exec_silent JobRun getIsRunning; RC=$?;
                (( RC )) &&  echo "${(e)LNOW} [ERROR] JobRun getIsRunning ($RC)" && return $RC;
                job_is_running="$BLCLI_OUTPUT";
              done
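
The workaround above boils down to a bounded polling loop. A minimal, self-contained sketch of that pattern follows; it is not the script from this post. `poll_until_done` and the `poll_*` variables are hypothetical names, and the check command is assumed to set `JOB_IS_RUNNING` to `true` or `false` in the current shell (in the real script that role is played by the `JobRun findByJobRunKey` / `getIsRunning` pair). It is written in plain POSIX shell, so it should also run under NSH, though that has not been verified here.

```shell
# Poll a status command until it reports the job as finished, it fails,
# or a time budget is exhausted.
#   $1 = max seconds to wait, $2 = seconds between polls,
#   $3... = command that sets JOB_IS_RUNNING to "true" or "false"
# Returns 0 when finished, 124 on timeout, or the check's own non-zero code.
poll_until_done() {
  poll_max=$1
  poll_interval=$2
  shift 2
  poll_elapsed=0
  while [ "$poll_elapsed" -lt "$poll_max" ]; do
    # Run the check in the current shell (no subshell), so it can keep
    # whatever session state it depends on.
    "$@"
    poll_rc=$?
    if [ "$poll_rc" -ne 0 ]; then
      echo "[ERROR] status check failed ($poll_rc)" >&2
      return "$poll_rc"
    fi
    [ "$JOB_IS_RUNNING" = "false" ] && return 0
    sleep "$poll_interval"
    poll_elapsed=$((poll_elapsed + poll_interval))
  done
  echo "[ERROR] gave up after ${poll_elapsed}s" >&2
  return 124
}
```

The elapsed time is estimated from the sleep interval only, so long-running checks make the real wait longer than the budget; the original loop avoids that by reading `$SECONDS`.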

             

             

            So... I found it very handy to just use "executeJobAndWait", but since my script was killed right afterwards I had to find this workaround... I just hope you can tell me this has already been fixed.

             

             

            Oh and for the disconnect message.

             

            I tried every redirection I could think of, such as:

             

            disconnect 2>/dev/null 1>/dev/null

             

            disconnect > /dev/null 2>&1

             

            disconnect 2>/dev/null 1>&2

             

            etc., and I could never get rid of the

             

            "Disconnecting from vuhplabstdev001 ...done"

             

            line which always appears at one point...... which is really annoying tbh but still not "a big deal".

             

            Thanks for your help.

             

            Regards,

             

             

                       Clément BARRET

            • 3. Re: [ NSH ] NSH Shell is crashing silently...
              Clement BARRET

              Well, any update ?

               

              I would be glad at least to be able to get rid of the "Disconnecting from..." message...

               

              Regards

              • 4. Re: [ NSH ] NSH Shell is crashing silently...
                richard mcleod

                Note: You should always use disconnect in your NSH sessions to avoid NSH processes/resources remaining in use after your action has completed. I use this to ensure we exit cleanly

                 

                cd //@;disconnect ${HOST}||true

                • 5. Re: [ NSH ] NSH Shell is crashing silently...
                  Clement BARRET

                  Richard McLeod wrote:

                   

                  Note: You should always use disconnect in your NSH sessions to avoid NSH processes/resources remaining in use after your action has completed. I use this to ensure we exit cleanly

                   

                  cd //@;disconnect ${HOST}||true

                   

                  Well, this doesn't help me at all...

                   

                  My point is not "using disconnect or not using it"; it's getting rid of its output...

                   

                  I'll quote parts of my previous messages :

                   

                  "PS : I duno why but no matter what I do, I can't get rid of the "disconnect" output (I tried all possible redirections... it's still flushed and printed at the end of the NSH script). Any idea ?"

                   

                  "

                  Oh and for the disconnect message.

                   

                  I tried any redirection like

                   

                  disconnect 2>/dev/null 1>/dev/null

                   

                  disconnect > /dev/null 2>&1

                   

                  disconnect 2>/dev/null 1>&2

                   

                  etc etc and  I never could get rid of the

                   

                  "Disconnecting from vuhplabstdev001 ...done"

                   

                  line which always appears at one point...... which is really annoying tbh but still not "a big deal".

                   

                  Thanks for your help."

                   

                  I guess I'll eventually open a support case; that might be easier.

                   

                  Best regards,

                   

                   

                  CB

                  • 6. Re: [ NSH ] NSH Shell is crashing silently...
                    Bill Robinson

                    this works fine:

                    % disconnect red5-85 2>/dev/null

                    % version

                    BladeLogic Server Automation RSCD Agent 8.5.01.231 (Release) [Oct  1 2014 17:07:54]

                    • 7. Re: [ NSH ] NSH Shell is crashing silently...
                      Bill Robinson

                      what is 'blcli_exec_silent' ?

                       

                      i have similar loops and i don't have any issues w/ nsh silently dying.

                      • 8. Re: [ NSH ] NSH Shell is crashing silently...
                        Clement BARRET

                        Bill Robinson :

                         

                        Hi, blcli_exec_silent is just a custom function wrapping blcli_execute that handles error output and regular output, so that it's easier for me to control what I'm displaying.
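
For readers wondering what such a wrapper might look like, here is a generic sketch. It is not the `blcli_exec_silent` function from this thread, whose source was not posted: `run_silent`, `SILENT_OUTPUT`, and `SILENT_ERROR` are hypothetical names, and the sketch assumes `mktemp` is available in the environment.

```shell
# Run a command in the current shell while keeping its stdout and stderr
# off the terminal. Afterwards SILENT_OUTPUT holds stdout, SILENT_ERROR
# holds stderr, and the function returns the command's exit code.
run_silent() {
  silent_out=$(mktemp) || return 1
  silent_err=$(mktemp) || { rm -f "$silent_out"; return 1; }
  # Redirect to files rather than using $(...), so the wrapped command
  # runs in the current shell and keeps any state it relies on.
  "$@" >"$silent_out" 2>"$silent_err"
  silent_rc=$?
  SILENT_OUTPUT=$(cat "$silent_out")
  SILENT_ERROR=$(cat "$silent_err")
  rm -f "$silent_out" "$silent_err"
  return "$silent_rc"
}
```

A call could then look like `run_silent blcli_execute JobRun getIsRunning || echo "failed: $SILENT_ERROR"`, with the caller deciding what, if anything, gets printed.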

                         

                        Anyway, the disconnect "bug" I got is real.

                         

                        One might think the output is not displayed, but it's tricky. In fact, you can't rely on what you see.

                         

                        Basically, the setup is a bit more complex than running "disconnect" from a terminal and saying "et voilà, there's no output, as expected".

                         

                        It's a bit more like this :

                         

                        NSH (executes) => NSH_SCRIPT_JOB (executes in the context of the targets) => NSH_SCRIPT (using nsh -c "my_nsh").

                         

                        If I use disconnect (redirected to null, whatever the "formula") in the NSH_SCRIPT_JOB, the "Disconnecting from" message is printed out at the end of the job, even after my last "exit" message...

                         

                        If I use "disconnect" from inside the NSH_SCRIPT, it will still print this message at the end of the script execution.
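
Given the layering described above, one thing that could be tried for the inner layer is to filter the output of the `nsh -c` call instead of redirecting `disconnect` itself. This is an untested sketch: `strip_disconnect_lines` is a hypothetical helper, the pattern is taken from the message quoted earlier in the thread, and it can only help if the message actually travels through the stream being piped. It would not help if an outer layer prints the line after the script has exited.

```shell
# Drop lines of the form "Disconnecting from <host> ...done" from stdin
# and pass everything else through unchanged.
strip_disconnect_lines() {
  sed '/^Disconnecting from .* \.\.\.done$/d'
}
```

In the NSH_SCRIPT_JOB this might be used as `nsh -c "my_nsh" 2>&1 | strip_disconnect_lines`, at the cost of merging stderr into stdout.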

                         

                        I know our BladeLogic version is not the latest one, but those bugs are real (as are the other, acknowledged ones I've discovered, which led your team to work on fixes). I wouldn't be bothering you about it if I hadn't been able to reproduce this bug every time...

                         

                        About the initial topic discussion : (NSH sudden death)

                         

                        For information, after digging a bit here and there in the community discussion threads, I found that my "nsh sudden death" might have been caused by a "user process number limit", so I've increased it in the /etc/security/limits.conf file (to 300 for the user bladmin).

                         

                        I still need to run more thorough tests to confirm that it has definitively fixed our issue, but I'm quite confident it did.
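
To double-check which process-count limits are configured for a user, a small helper like the one below can list the relevant entries. This is a sketch: `nproc_limits_for` is a hypothetical name, and it assumes the limit was set through an `nproc` entry in the usual `<domain> <type> <item> <value>` format (for example something like `bladmin hard nproc 300`; the post does not say whether a soft or hard limit was used). It ignores group (`@group`) entries and files under `/etc/security/limits.d/`.

```shell
# Print the nproc entries that apply to one user in a limits.conf-style file.
#   $1 = user name, $2 = file (defaults to /etc/security/limits.conf)
# Output: one "<domain> <type> <value>" line per matching entry.
nproc_limits_for() {
  awk -v user="$1" '
    $1 ~ /^#/ { next }    # skip comment lines
    NF >= 4 && $3 == "nproc" && ($1 == user || $1 == "*") {
      print $1, $2, $4
    }
  ' "${2:-/etc/security/limits.conf}"
}
```

Running `nproc_limits_for bladmin` on the application server shows what is configured in that file; the limit in effect for a running session can be checked separately with the shell's `ulimit` builtin.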