7 Replies Latest reply on Sep 23, 2016 7:04 AM by Bill Robinson

    Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.

    Ashish Vijay

      Hello

       

      From last few days, we notice below error in our job server logs. We have two app servers in our infra and this error is coming only on one server.

       

      [22 Sep 2016 11:32:27,153] [Scheduled-System-Tasks-Thread-12] [INFO] [System:System:] [Memory Monitor] Total JVM (B): 754778112,Free JVM (B): 510578880,Used JVM (B): 244199232,VSize (B): 8260378624,RSS (B): 987885568,Used File Descriptors: 283

      [22 Sep 2016 11:33:27,149] [Scheduled-System-Tasks-Thread-5] [INFO] [System:System:] [Memory Monitor] Total JVM (B): 728039424,Free JVM (B): 528074336,Used JVM (B): 199965088,VSize (B): 8250904576,RSS (B): 1007075328,Used File Descriptors: 289

      [22 Sep 2016 11:33:27,505] [Job-Execution-3] [ERROR] [BLAdmin:BLAdmins:] [NSHScript] Work item '' has been aborted. Reason: Application server: blasb02-job1 went down during work item execution.

      [22 Sep 2016 11:33:27,511] [Job-Execution-3] [ERROR] [BLAdmin:BLAdmins:] [NSHScript] Work item '' has been aborted. Reason: Application server: blasb02-job1 went down during work item execution.

      [22 Sep 2016 11:33:27,516] [Job-Execution-3] [ERROR] [BLAdmin:BLAdmins:] [NSHScript] The job 'GENERES - 842 - EXECUTE - Worker 002' has failed

      [22 Sep 2016 11:34:27,149] [Scheduled-System-Tasks-Thread-4] [INFO] [System:System:] [Memory Monitor] Total JVM (B): 728039424,Free JVM (B): 492122048,Used JVM (B): 235917376,VSize (B): 8250904576,RSS (B): 1007845376,Used File Descriptors: 273

      [22 Sep 2016 11:35:27,149] [Scheduled-System-Tasks-Thread-16] [INFO] [System:System:] [Memory Monitor] Total JVM (B): 728039424,Free JVM (B): 479189280,Used JVM (B): 248850144,VSize (B): 8250904576,RSS (B): 1008234496,Used File Descriptors: 276

       

      There was no change done in infra during this time. We have BSA 8.3.02 App Server running on RHEL 6.2. This is intermittent error which sometime comes regularly and some time at a big gap.

       

      Anyone have any solution for this problem. If you need any other information then let me know.

       

      Bill Robinson  Sean Berry Your help is needed on urgent.

       

      Thanks,

      Ashish Vijay

        • 1. Re: Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.
          Bill Robinson

          well, why did blasb02-job1 go down ?

           

          what's in the appserver logs for that instance during the same time ?

           

          is there a high load on either system when this happens ?

          • 2. Re: Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.
            Ashish Vijay

            Hello Bill,

             

            Not only job server1 goes down, we have 3 job servers and all has same problem randomly.

             

            I do not see any load on cpu and ram at that time. I checked app server log as well but there also i did not find any suspicious activity during that time. Even if you check logs i pasted above, we have enough free jvm available on system.

             

            Maximum time this error come with NSH job execution (type 1) but some time it comes with deploy job as well. Can you suggest anything i should check in system or in app server config when this issue occur.

            • 3. Re: Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.
              Santhosh Kurimilla

              The job server is going out of no of threads allocated?

               

              It is generally - if a JOB server/instance picks up a job, it will try to allocate most possible threads from the same Appserver/instance first then, it will go for other instances. So, if your NSH Type1 job has high parallelism setup and executing against high no of targets, chances to consume all the threads.

               

              Please check:

              - NSh type1 job that are running with High parallelism

              - Review no of available worker thread items on all the JOB server instances in the env

              - Review JMX Mon for threads usage during the suspected jobs execution or issue occurance

              • 4. Re: Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.
                Ashish Vijay

                My NSH job is running with 30 parallel targets only. I checked thread count consumed and it is only 3 to 4 at the time of job run and i have 100 WIT on my app server configured.

                 

                I took thread dump of job server which is attached here. Can you get some information from it.

                 

                Approval-2 (State = WAITING)

                ------------------------------------------------------

                Blocking on monitor: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@40411cf0

                Stack Trace:

                    sun.misc.Unsafe.park(Native Method)

                    java.util.concurrent.locks.LockSupport.park(Unknown Source)

                    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)

                    com.bladelogic.om.infra.shared.collections.BlBlockingQueue.take(BlBlockingQueue.java:52)

                    com.bladelogic.om.infra.app.service.approval.ApprovalProcessQueue.take(ApprovalProcessQueue.java:49)

                    com.bladelogic.om.infra.app.service.thread.BlBlockingThread.getProcessObject(BlBlockingThread.java:68)

                    com.bladelogic.om.infra.app.service.thread.BlBlockingThread.run(BlBlockingThread.java:83)

                 

                Bl-Exec-0 (State = WAITING)

                ------------------------------------------------------

                Blocking on monitor: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@76ee4c00

                Stack Trace:

                    sun.misc.Unsafe.park(Native Method)

                    java.util.concurrent.locks.LockSupport.park(Unknown Source)

                    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)

                    com.bladelogic.om.infra.shared.collections.BlBlockingQueue.take(BlBlockingQueue.java:52)

                    com.bladelogic.om.infra.app.service.thread.BlBlockingThread.getProcessObject(BlBlockingThread.java:68)

                    com.bladelogic.om.infra.app.service.thread.BlBlockingThread.run(BlBlockingThread.java:83)

                • 5. Re: Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.
                  Santhosh Kurimilla

                  From the logs, below results seem to be suspicious. Almost all the WITs are in waiting state for 32 hours. Ref: java - Difference between WAIT and BLOCKED thread states - Stack Overflow

                   

                  -Check if you have any long running stale job runs and kill them if possible.

                  - Also, check any blocking on DB.

                   

                  Work Item Thread status:

                  =========================

                  Name,State,Time in state (ms),Additional Info

                  --------------------------------------------------

                  WorkItem-Thread-0,WAITING,2h:23m:4s818ms,

                  WorkItem-Thread-1,WAITING,32h:6m:32s968ms,

                  WorkItem-Thread-2,WAITING,32h:7m:32s365ms,

                  WorkItem-Thread-3,WAITING,32h:7m:28s550ms,

                  WorkItem-Thread-4,WAITING,32h:7m:16s546ms,

                  WorkItem-Thread-5,WAITING,34h:56m:32s53ms,

                  WorkItem-Thread-6,WAITING,34h:56m:32s33ms,

                  WorkItem-Thread-7,WAITING,17h:0m:28s281ms,

                  WorkItem-Thread-8,WAITING,16h:29m:9s902ms,

                  WorkItem-Thread-9,WAITING,2h:37m:42s564ms,

                  WorkItem-Thread-10,WAITING,34h:42m:13s474ms,

                  WorkItem-Thread-11,WAITING,34h:56m:32s29ms,

                  WorkItem-Thread-12,WAITING,32h:7m:21s437ms,

                  WorkItem-Thread-13,WAITING,9h:25m:17s819ms,

                  WorkItem-Thread-14,WAITING,23h:25m:16s312ms,

                  WorkItem-Thread-15,WAITING,15h:43m:25s512ms,

                  WorkItem-Thread-16,WAITING,16h:56m:56s250ms,

                  WorkItem-Thread-17,WAITING,15h:59m:21s988ms,

                  WorkItem-Thread-18,WAITING,32h:7m:8s909ms,

                  WorkItem-Thread-19,WAITING,34h:56m:31s907ms,

                  WorkItem-Thread-20,WAITING,34h:56m:31s907ms,

                  WorkItem-Thread-21,WAITING,16h:56m:53s8ms,

                  WorkItem-Thread-22,WAITING,15h:50m:9s222ms,

                  WorkItem-Thread-23,WAITING,34h:56m:31s905ms,

                  WorkItem-Thread-24,WAITING,34h:56m:31s902ms,

                  WorkItem-Thread-25,WAITING,34h:56m:31s901ms,

                  WorkItem-Thread-26,WAITING,9h:25m:19s612ms,

                  WorkItem-Thread-27,WAITING,2h:20m:49s645ms,

                  WorkItem-Thread-28,WAITING,16h:56m:50s492ms,

                  WorkItem-Thread-29,WAITING,23h:25m:19s840ms,

                  WorkItem-Thread-30,WAITING,9h:25m:55s715ms,

                  WorkItem-Thread-31,WAITING,34h:56m:31s884ms,

                  WorkItem-Thread-32,WAITING,23h:25m:15s142ms,

                  WorkItem-Thread-33,WAITING,34h:56m:31s884ms,

                  WorkItem-Thread-34,WAITING,2h:21m:16s296ms,

                  WorkItem-Thread-35,WAITING,34h:56m:31s884ms,

                  WorkItem-Thread-36,WAITING,18h:55m:38s617ms,

                  WorkItem-Thread-37,WAITING,23h:25m:23s264ms,

                  WorkItem-Thread-38,WAITING,19h:30m:16s89ms,

                  WorkItem-Thread-39,WAITING,34h:56m:31s868ms,

                  WorkItem-Thread-40,WAITING,23h:25m:12s263ms,

                  WorkItem-Thread-41,WAITING,34h:56m:31s868ms,

                  WorkItem-Thread-42,WAITING,34h:56m:31s852ms,

                  WorkItem-Thread-43,WAITING,34h:56m:31s851ms,

                  WorkItem-Thread-44,WAITING,34h:56m:31s868ms,

                  WorkItem-Thread-45,WAITING,9h:25m:25s647ms,

                  WorkItem-Thread-46,WAITING,34h:56m:31s851ms,

                  WorkItem-Thread-47,WAITING,18h:55m:39s861ms,

                  WorkItem-Thread-48,WAITING,23h:25m:23s78ms,

                  WorkItem-Thread-49,WAITING,34h:56m:31s851ms,

                  WorkItem-Thread-50,WAITING,34h:56m:31s852ms,

                  WorkItem-Thread-51,WAITING,16h:59m:47s399ms,

                  WorkItem-Thread-52,WAITING,34h:56m:31s852ms,

                  WorkItem-Thread-53,WAITING,34h:56m:31s852ms,

                  WorkItem-Thread-54,WAITING,34h:56m:31s849ms,

                  WorkItem-Thread-55,WAITING,9h:25m:24s381ms,

                  WorkItem-Thread-56,WAITING,34h:56m:31s849ms,

                  WorkItem-Thread-57,WAITING,34h:56m:31s848ms,

                  WorkItem-Thread-58,WAITING,34h:56m:31s848ms,

                  WorkItem-Thread-59,WAITING,2h:37m:5s435ms,

                  WorkItem-Thread-60,WAITING,34h:56m:31s847ms,

                  WorkItem-Thread-61,WAITING,34h:56m:31s847ms,

                  WorkItem-Thread-62,WAITING,16h:1m:15s809ms,

                  WorkItem-Thread-63,WAITING,16h:59m:48s103ms,

                  WorkItem-Thread-64,WAITING,34h:56m:31s845ms,

                  WorkItem-Thread-65,WAITING,34h:56m:31s846ms,

                  WorkItem-Thread-66,WAITING,34h:56m:31s846ms,

                  WorkItem-Thread-67,WAITING,34h:56m:31s845ms,

                  WorkItem-Thread-68,WAITING,34h:56m:31s845ms,

                  WorkItem-Thread-69,WAITING,2h:37m:5s209ms,

                  WorkItem-Thread-70,WAITING,34h:56m:31s844ms,

                  WorkItem-Thread-71,WAITING,34h:56m:31s842ms,

                  WorkItem-Thread-72,WAITING,17h:0m:28s244ms,

                  WorkItem-Thread-73,WAITING,34h:56m:31s842ms,

                  WorkItem-Thread-74,WAITING,34h:56m:31s842ms,

                  WorkItem-Thread-75,WAITING,17h:45m:0s24ms,

                  WorkItem-Thread-76,WAITING,34h:56m:31s841ms,

                  WorkItem-Thread-77,WAITING,34h:56m:31s841ms,

                  WorkItem-Thread-78,WAITING,34h:56m:31s841ms,

                  WorkItem-Thread-79,WAITING,34h:56m:31s840ms,

                  WorkItem-Thread-80,WAITING,34h:56m:31s840ms,

                  WorkItem-Thread-81,WAITING,17h:0m:28s242ms,

                  WorkItem-Thread-82,WAITING,34h:56m:31s838ms,

                  WorkItem-Thread-83,WAITING,34h:56m:31s838ms,

                  WorkItem-Thread-84,WAITING,34h:56m:31s839ms,

                  WorkItem-Thread-85,WAITING,2h:37m:2s848ms,

                  WorkItem-Thread-86,WAITING,16h:1m:6s244ms,

                  WorkItem-Thread-87,WAITING,34h:56m:31s838ms,

                  WorkItem-Thread-88,WAITING,34h:56m:31s835ms,

                  WorkItem-Thread-89,WAITING,34h:56m:31s833ms,

                  WorkItem-Thread-90,WAITING,34h:56m:31s836ms,

                  WorkItem-Thread-91,WAITING,34h:56m:31s835ms,

                  WorkItem-Thread-92,WAITING,34h:56m:31s835ms,

                  WorkItem-Thread-93,WAITING,8h:55m:29s411ms,

                  WorkItem-Thread-94,WAITING,34h:56m:31s833ms,

                  WorkItem-Thread-95,WAITING,34h:56m:31s834ms,

                  WorkItem-Thread-96,WAITING,0h:5m:16s0ms,

                  WorkItem-Thread-97,WAITING,34h:56m:31s831ms,

                  WorkItem-Thread-98,WAITING,34h:56m:31s832ms,

                  WorkItem-Thread-99,WAITING,34h:56m:31s830ms,

                  • 6. Re: Repeated error in job server logs Work item '' has been aborted. Reason: Application se rver: blasb02-job1 went down during work item execution.
                    Santhosh Kurimilla

                    Threads in Waiting state in the above logs: They must be from a stale long running job, identify and kill it.

                          1 0h:

                          3 15h

                          8 16h

                          4 17h

                          2 18h

                          1 19h

                          6 23h

                          7 2h:

                          6 32h

                         56 34h

                          1 8h:

                          5 9h: