27 Replies. Latest reply on May 19, 2016 3:53 PM by Bill Robinson

    JOB timeout properties not killing script in blpackage

    Neal Meagher

      I have a simple script running inside a blpackage that is part of a batch job.

       

      JOB_PART_TIMEOUT on the BLPackage is set to 70 minutes, and JOB_TIMEOUT on the batch is set to 140. When the script runs beyond 70 minutes, the log shows that a kill command is sent, but the script is still running when I grep for it.

        • 1. Re: JOB timeout properties not killing script in blpackage
          Bill Robinson

          the timeouts are set on the job, not the blpackage. 

           

          so on the deploy job for the blpackage you have JOB_TIMEOUT = 140 and JOB_PART_TIMEOUT=70 ?  or you have a deploy job in a batch job and the timeout properties are set on the batch ? what are the values of these properties on the deploy ?

           

          so what do you see in the bldeploy log for this job and in the job run log ? do you see the job cancel initiated ?

           

          what is the script that the blpackage is running ?

           

          if this is indeed in a batch job have you read:

          Defining timeouts for jobs - BMC Server Automation 8.7 - BMC Documentation

           

          "For a Batch Job, the job part timeout is not relevant at the Batch Job level. Only the job part timeouts defined in the member jobs are taken into account."

           

          so if you have jpto set on the batch job, that will not cancel the member job once that is crossed.

          • 2. Re: JOB timeout properties not killing script in blpackage
            Neal Meagher

            For the deploy job, job_blpkg_wwf_run_scripts, I have JOB_PART_TIMEOUT set to 70 minutes, and JOB_TIMEOUT set to 240 on the BATCH job itself.

             

            This is what I see when the kill command is sent.

             

            Here's the output from the agent log on the target:

            ulvuidt01 10.208.98.96 05/09/16 19:08:30.384 23464 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: > [Deploy] Job 'job_blpkg_wwf_run_scripts' is executing a dry run

            ulvuidt01 10.208.98.107 05/09/16 19:21:17.986 24705 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: > [Deploy] Retrieving the root filesystem

            ulvuidt01 10.208.98.107 05/09/16 19:21:18.192 24705 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: > [Deploy] Copying '//127.0.0.1/opt/bmc/fileserver/imported/1455634911710/blpackages/0e0315f8-2b2d-40f8-85e2-b53f019605c1' to '//ulvuidt01.devfg.rbc.com/var/tmp/stage/9e9af91752c734188645cd5beceae4e3'

            ulvuidt01 10.208.98.108 05/09/16 19:21:22.799 24711 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: > [Deploy] Job 'job_blpkg_wwf_run_scripts' is applying

            ulvuidt01 10.208.98.108 05/09/16 20:31:25.573 24711 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: Killing process tree with parent id = 24714

            ulvuidt01 10.208.98.107 05/09/16 20:31:27.703 31816 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: > [Deploy] Job 'job_blpkg_wwf_run_scripts' is being cancelled

            ulvuidt01 10.208.98.107 05/09/16 20:31:37.723 31831 0/0 (SubAdmin:ryank@oak.fg.rbc.com): CM: > [null] Deleting //ulvuidt01.devfg.rbc.com/var/tmp/stage/9e9af91752c734188645cd5beceae4e3

             

             

            So from the above, we can see that the job was started at 19:21:22 and 70 minutes later, it attempted to kill the PID/cancel the job at 20:31:25.
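The 70-minute gap can be checked from the two agent log timestamps quoted above. This is only a sketch: `to_secs` is a hypothetical helper name, fractional seconds are dropped, and both timestamps are assumed to fall on the same day.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight (hypothetical helper).
# awk is used so that leading zeros are not treated as octal.
to_secs() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

start=$(to_secs 19:21:22)     # "is applying" line in the agent log
kill_at=$(to_secs 20:31:25)   # "Killing process tree" line in the agent log
elapsed=$((kill_at - start))

echo "elapsed: $((elapsed / 60)) min $((elapsed % 60)) s"
```

This prints `elapsed: 70 min 3 s`, which matches a JOB_PART_TIMEOUT of 70 minutes firing shortly after the limit was crossed.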

            The script that was executing appears to have continued to run... I can validate this as the script captures the start time and its end time:

             

            I can still see it running when I grep for it.

            • 3. Re: JOB timeout properties not killing script in blpackage
              Bill Robinson

              ok, so what exactly is in your blpackage ?  what commands are being run?  how ?  what commands are being left running on the target?  the bldeploy ?  a command your blpackage calls ?

              • 4. Re: JOB timeout properties not killing script in blpackage
                Neal Meagher

                Here are the details below.

                 

                 

                We have the following BATCH job:

                /RBC_Administrative_Regions/Canada_Enterprise/ServerEnvironment/PROD/BSA_Automation/WWF/CAE_job_batch_wwf_run which contains the following jobs:

                 

                 

                 

                 

                For job_blpkg_wwf_run_scripts,  I have a JOB_PART_TIMEOUT set to 70 minutes and a JOB_TIMEOUT set to 240 on the BATCH job itself.

                 

                 

                A brief description of each job:

                *       job_copy_wwf_scripts - copies 2 shell scripts to each target.

                *       job_blpkg_wwf_run_scripts - executes the 2 scripts copied to each target

                *       job_nsh_wwf_copy_xml - copies the XML files created by the scripts to a central location

                *       job_blpkg_wwf_move_home - moves the XML files to their final destination and sets ownership

                 

                 

                A brief description of the 2 scripts which are copied and executed on each target:

                *       Script 1 - Executes the UNIX "find" command on each local file system on a server

                *       Script 2 - Converts the output generated from the find command to XML

                 

                 

                 

                 

                This is the script:

                 

                 

                #!/bin/sh
                LOG="/tmp/dummy.log"

                logIt() {
                    msg="$1"
                    echo "${msg}" | tee -a $LOG
                    test -f /tmp/dummy.out && rm /tmp/dummy.out
                    exit 1
                }

                sigquit() {
                   logIt "signal QUIT received on `date`"
                }

                sigint() {
                   logIt "signal INT received on `date`"
                }

                sigabrt() {
                   logIt "signal ABRT received on `date`"
                }

                sigterm() {
                   logIt "signal TERM received on `date`"
                }

                trap 'sigquit' QUIT
                trap 'sigint'  INT
                trap 'sigabrt' ABRT
                trap 'sigterm' TERM

                echo "My PID is: $$"
                touch /tmp/dummy.out

                while [ 1 ]; do
                    sleep 1
                done

                exit 0
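One way to separate "the script ignores the signal" from "the signal never reaches the script" is to signal it by hand, outside of any deploy job. The sketch below is an illustration only: it uses a cut-down stand-in for the script above that keeps just the TERM trap, written to a temporary file, and it assumes the job sends SIGTERM, which the thread does not confirm.

```shell
#!/bin/sh
# Cut-down stand-in for the test script: TERM trap plus the sleep loop.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
trap 'echo "signal TERM received"; exit 1' TERM
while :; do sleep 1; done
EOF

sh "$script" &
pid=$!
sleep 2                 # give the script time to install its trap
kill -TERM "$pid"       # signal the script's own PID directly

status=0
wait "$pid" || status=$?
echo "script exited with status $status"
rm -f "$script"
```

The trap runs once the current `sleep 1` returns, so the script should exit with status 1 within about a second. If the real script behaves the same way when signalled directly, the problem is more likely in which PID the kill is aimed at than in the script's signal handling.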

                • 5. Re: JOB timeout properties not killing script in blpackage
                  Neal Meagher

                  When the timeout parameter is set, does it run the zsh functions _kill and _killall listed under NSH\share\zsh\4.3.4\functions?

                  • 6. Re: JOB timeout properties not killing script in blpackage
                    Bill Robinson

                    you said there are two scripts in the blpackage the job_blpkg_wwf_run_scripts deploys.   i only see the contents of one.

                     

                     

                    can you attach the two scripts that are in that blpackage.

                     

                     

                    also - as i previously asked - what is still running on the target after the job cancel ?

                     

                     

                    i'm not sure why nsh settings are relevant here - do you have nsh installed on the target of job_blpkg_wwf_run_scripts ? and your script is running under nsh and not bash or sh ?

                    • 7. Re: JOB timeout properties not killing script in blpackage
                      Bill Robinson

                      also - just like your other package issue - why are you doing file copies in a blpackage ?  why don't you do the whole thing in nsh ??

                       

                       

                      all of those items you list can be done directly in nsh in a single job. no batch job.

                      • 8. Re: JOB timeout properties not killing script in blpackage
                        Neal Meagher

                        The attached script is the test script we are using, which the kill command does not end. The kill command gets sent, as shown in both the app server and RSCD logs, and BladeLogic sees the job as cancelled, but the attached script keeps running regardless. I want to know where the kill command is stored in BladeLogic, and what is triggered when you set the timeout parameters of the job. It isn't under NSH/BIN. When I search for kill, I see _kill and _killall under NSH\share\zsh\4.3.4\functions.

                        • 9. Re: JOB timeout properties not killing script in blpackage
                          Neal Meagher

                          At this point it doesn't matter what script I am running. The timeout parameter set on the deploy job sends the kill command, but the script keeps running.

                          • 10. Re: JOB timeout properties not killing script in blpackage
                            Bill Robinson

                            your blpackage has two scripts.  you said it has a find command (just a command or that's in a script) and then presumably what you pasted into the thread.  so those are two separate files that are copied to the target and then you have a blpackage that runs them.  how are those executed by the blpackage ?  what is in the external command that runs them ?

                             

                             

                            and what exactly do you see running on the target after the cancel ?  eg the ps output ?

                            • 11. Re: JOB timeout properties not killing script in blpackage
                              Bill Robinson

                              it matters how you are starting the script and possibly what is in it.  eg, if you background and nohup it, killing the original parent won't matter.
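The effect of backgrounding with nohup can be reproduced without any deploy job. In this sketch a wrapper shell stands in for whatever the external command starts (an assumption, since the thread does not show that command), and `sleep 30` stands in for the real script.

```shell
#!/bin/sh
# The wrapper backgrounds a nohup'd worker, records its PID, then exits.
pidfile=$(mktemp)
sh -c 'nohup sleep 30 >/dev/null 2>&1 & echo $! > "$0"' "$pidfile" &
wrapper=$!
wait "$wrapper"          # the wrapper returns immediately

child=$(cat "$pidfile")
if kill -0 "$child" 2>/dev/null; then
    survived=yes
else
    survived=no
fi
echo "wrapper $wrapper is gone, worker $child still running: $survived"

# Clean up the demo worker.
kill "$child" 2>/dev/null
rm -f "$pidfile"
```

The worker is still alive after the wrapper has finished, so a kill aimed at the wrapper's PID would have nothing left to act on.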

                              • 12. Re: JOB timeout properties not killing script in blpackage
                                Yanick Girouard

                                If the job kills the PID that the RSCD agent started, and that process forked another process that is not linked to it (i.e. the other process is not a child of the PID launched by the RSCD agent), then killing that initial PID won't kill the other process.

                                 

                                This is why Bill asked those questions and why it's important to know what is called in the BLPackage and how it's called.
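To see which processes are actually linked to a given PID, the process table can be walked downwards from it. This is a hedged sketch, not the RSCD agent's implementation: `descendants` is a hypothetical helper, and the `sh -c` wrapper with `sleep 30` stands in for the process the agent launches.

```shell
#!/bin/sh
# Print every descendant PID of $1, one per line (hypothetical helper).
# It follows parent/child links only, so a process that has been
# re-parented away from this tree will not be listed.
descendants() {
    for kid in $(ps -e -o pid= -o ppid= | awk -v p="$1" '$2 == p { print $1 }'); do
        echo "$kid"
        descendants "$kid"
    done
}

# Stand-in for the PID the agent launched: a shell with one child.
sh -c 'sleep 30 & wait' &
top=$!
sleep 1

found=$(descendants "$top")
echo "descendants of $top: $found"

# Clean up the demo processes.
kill $found "$top" 2>/dev/null
```

Running `descendants` against the parent id from the agent log (24714 in the output earlier in the thread) while the job is still applying would show whether the script is inside that tree at all.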

                                • 13. Re: JOB timeout properties not killing script in blpackage
                                  Neal Meagher

                                  I tested it with just the one script, and the timeout does not stop the script.

                                   

                                  #!/bin/sh
                                  LOG="/tmp/dummy.log"

                                  logIt() {
                                      msg="$1"
                                      echo "${msg}" | tee -a $LOG
                                      test -f /tmp/dummy.out && rm /tmp/dummy.out
                                      exit 1
                                  }

                                  sigquit() {
                                    logIt "signal QUIT received on `date`"
                                  }

                                  sigint() {
                                    logIt "signal INT received on `date`"
                                  }

                                  sigabrt() {
                                    logIt "signal ABRT received on `date`"
                                  }

                                  sigterm() {
                                    logIt "signal TERM received on `date`"
                                  }

                                  trap 'sigquit' QUIT
                                  trap 'sigint'  INT
                                  trap 'sigabrt' ABRT
                                  trap 'sigterm' TERM

                                  echo "My PID is: $$"
                                  touch /tmp/dummy.out

                                  while [ 1 ]; do
                                      sleep 1
                                  done

                                  exit 0

                                   

                                  I added this script as a file in the depot and call it in a BLPackage as per the image.

                                  • 14. Re: JOB timeout properties not killing script in blpackage
                                    Bill Robinson

                                    so in the 03 screenshot the parent of the script is pid 4233.  what is that pid ?

                                     

                                    next time you run this, look through the process list and trace up the process ownership.
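Tracing the ownership upwards can be done with plain `ps`. A minimal sketch follows, assuming a `ps` that accepts the POSIX `-o` and `-p` options; `trace_up` is a hypothetical helper name.

```shell
#!/bin/sh
# Print PID, PPID and command line for a process and each of its
# ancestors, stopping at PID 1 or when a parent can no longer be found.
trace_up() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        ps -o pid= -o ppid= -o args= -p "$pid" || break
        [ "$pid" -eq 1 ] && break
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}

# Demo on the current shell; on the target, pass the script's PID instead.
trace_up $$
```

On the target, `trace_up 4233` (the parent PID from the screenshot) would show whether that process leads back to the PID the agent tried to kill or straight to PID 1.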
