7 Replies Latest reply on Dec 18, 2014 2:12 PM by Bill Robinson

    Odd Batch Job Behavior

      Seeing some odd behavior in Batch jobs and am looking for input.


      I have a batch job, filled with ~20 child jobs.  The Batch job is set to fail and stop if an individual child job fails.


      In the job run log, I'm seeing individual child jobs fail with a red X, yet the remaining child jobs keep running.  If you look at the "status" of the red-X jobs, they show as "Canceled" even though no one canceled them.  If you drill into one of those jobs, you'll notice that all three phases completed with a green check mark.


      The first time a child job fails in this manner, the Batch job disappears from the "Tasks in Progress" pane, yet the child jobs continue to run.  When all of the child jobs complete, the overall batch ends with a red X, even though every child job has finished.  (Some of them show a red X with status "Canceled", but the actual job worked just fine on the target server.)  On subsequent Batch runs this happens to random child jobs; it does not happen to the same jobs over and over.


      We're seeing this in a test environment on a regular basis.  We've seen this behavior *occasionally* in our Prod environment, but not often enough to really dig into.


      Has anyone seen this behavior before, or have any insights on where to start looking for a root cause for failure?

        • 1. Re: Odd Batch Job Behavior
          Raja Mohan

          Christopher Morris, what OS is the batch job running against?  I did see some weird, unexplained behavior against Solaris 10.

          • 2. Re: Odd Batch Job Behavior
            Bill Robinson

            can you paste the batch job options you have set - is it 'by server' or 'in parallel' or what?


            is there a job_part_timeout or job_timeout set on the batch job and child jobs?  if so, to what, and when does the cancel happen?


            can you export the job run logs for the batch and the child jobs and attach them ?

            • 3. Re: Odd Batch Job Behavior

              The problem is happening across several Batch Jobs, which run against a mix of Windows and Linux servers.  I just didn't want to clutter my initial post with too much information. 

              • 4. Re: Re: Odd Batch Job Behavior

                Batch Job Options

                • NOT CHECKED: "Continue executing batch when individual jobs return non-zero exit code"
                • IS CHECKED:  "Execute by Server"


                Timeout Settings

                • Batch job is set to 0 & 0 for job_part_timeout & job_timeout
                • Individual jobs may have (reasonable and tested) timeouts, but the error does not happen on the same job over and over.  I think we can safely rule out child job timeouts as the core issue.


                Exported Job Logs are attached.  In this job run, we have a batch that contains only 3 child jobs.  The first job, "Remove Root SSH Access", is the one that failed with the "canceled" status (but actually completed successfully against the target).  The last two jobs, "Reboot Any Server" and "BB - MM....", completed with green check marks after the first job ended with "canceled" status.


                • BATCH_Log.csv = the batch job log
                • BB_Job_Log.csv = job #3 completed successfully
                • REBOOT_Job_Log.csv = job #2 completed successfully
                • REMOVE_Job_Log = job #1, failed with "canceled" status <-- demonstrates odd behavior
                • 5. Re: Re: Odd Batch Job Behavior
                  Bill Robinson

                  when you check the 'by server' option it creates a batch job run per target, and then in each batch job run it will run the member job against each target.  so what should happen is if one of the member jobs fails against one target, then no other job for that target should run, but other targets would not be affected by that.
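The per-target semantics described above can be sketched as a small illustrative model (plain Python, not BladeLogic code; the function and job names are hypothetical):

```python
# Illustrative model of "Execute by Server" semantics (not BladeLogic code):
# one pass per target; member jobs run in order, and a failure stops the
# remaining jobs for that target only -- other targets are unaffected.

def run_batch_by_server(targets, member_jobs, run_job):
    """run_job(job, target) -> True on success, False on failure."""
    results = {}
    for target in targets:
        results[target] = []
        for job in member_jobs:
            ok = run_job(job, target)
            results[target].append((job, ok))
            if not ok:
                break  # skip remaining member jobs for this target only
    return results

# Example: "job2" fails on host1; host2 still runs all three jobs.
results = run_batch_by_server(
    ["host1", "host2"],
    ["job1", "job2", "job3"],
    lambda job, target: not (job == "job2" and target == "host1"),
)
```

The reported behavior deviates from this model: a member job run flips to "Canceled" yet later member jobs still execute.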


                  so i tried a couple things.  a 'by server' batch job that runs a couple member jobs where they fail - no timeouts, and that fails and doesn't progress.  then i added a sleep job that sleeps for 10 min, w/ the job part timeout on that job set to 2 min and the job timeout set to 3.  that also failed to progress past that job.  in that case though it shows as WIT timed out.


                  so i ran that one again and manually cancelled one of the job runs (cancel not abort) and i saw the same thing.


                  so, we probably need to see why the jobs are cancelled - maybe there is some kind of appserver communication issue when this is happening?  i mean, is this always happening ?


                  also this REMOVE job, what is doing the reboot ?

                  • 6. Re: Re: Re: Odd Batch Job Behavior

                    maybe there is some kind of appserver communication issue when this is happening?

                    Duplicating this error at will is difficult.  What would be the best method of detecting appserver communication issues in real time?  Setting app server log levels to debug and tailing the log?
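Besides raising the log level, one low-effort starting point is to grep the appserver log for cancel/timeout/connection errors around the time the run flipped to "Canceled". A sketch follows; the log path, message formats, and sample lines are made up for illustration, so adjust them to your appserver install:

```shell
# Sketch: filter an appserver log for cancel/timeout/connection noise.
# Demonstrated against an inline sample; point LOG at the real appserver.log.
LOG=${LOG:-sample_appserver.log}
cat > "$LOG" <<'EOF'
2014-12-18 10:01:02 INFO  job run 2001234 started
2014-12-18 10:03:09 WARN  connection reset by peer for agent host1
2014-12-18 10:03:10 INFO  job run 2001234 state changed to CANCELED
EOF
grep -icE 'cancel|timed out|connection (reset|refused)' "$LOG"
```

The `-c` flag prints a match count; drop it to see the matching lines themselves and correlate their timestamps with the job run log.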


                    i mean, is this always happening ?

                    On certain Batch jobs that have been ported through the move, it happens fairly regularly.  Newly created Batch jobs do not seem to have the issue.  When it does happen, it happens on random child jobs.  It's not the same child job every time. 


                    also this REMOVE job, what is doing the reboot ?

                    This is a Deploy job.  Inside the BLPkg, we alter "sshd_config" as an Object.  The next line is an External Command which echoes to the job log that the server is being rebooted.  That EC is marked as "Reboot: After Item Deployment".  When the server comes back online, the next EC echoes to the job log that the server is back online with the new configuration.  BLPkg is complete.


                    The Job itself is marked as "Use item defined reboot setting".

                    • 7. Re: Re: Re: Odd Batch Job Behavior
                      Bill Robinson

                      do you really need to do a reboot just to update ssh ?  can't you just restart the service ?
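A service restart in place of the reboot could look like the sketch below. The service name `sshd` is an assumption (it is `ssh` on Debian/Ubuntu), and `sshd -t` validates the new config before restarting; `DRY_RUN=1` just prints the command for inspection:

```shell
# Sketch: validate sshd_config, then restart sshd instead of rebooting.
# Assumption: the service is named "sshd" (on Debian/Ubuntu it is "ssh").
# DRY_RUN=1 prints the restart command instead of running it.
restart_sshd() {
  if command -v systemctl >/dev/null 2>&1; then
    cmd="systemctl restart sshd"
  else
    cmd="service sshd restart"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
  else
    sshd -t && $cmd   # only restart if the new config parses cleanly
  fi
}

DRY_RUN=1 restart_sshd
```

Restarting the daemon does not drop established SSH sessions, so this is generally much less disruptive than a reboot mid-deploy.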


                      but you are saying that this issue happens w/ different jobs as members of the batch, so it may happen w/ jobs that don't do a reboot ?


                      if you know when it's happened you can grab the appserver logs from the appservers and the job run log and we can try and correlate what is happening on the appservers.  you can also bump up the number of kept rolled logs in the log4j.properties on the appservers.
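For reference, the number of kept rolled logs Bill mentions is controlled by the RollingFileAppender settings in log4j.properties. A sketch of the relevant lines, log4j 1.x syntax; the appender name `R` is an assumption, so match whatever name your file actually defines:

```properties
# Keep more rolled copies of appserver.log (log4j 1.x syntax).
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.MaxFileSize=10MB
log4j.appender.R.MaxBackupIndex=20
```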