9 Replies Latest reply on Sep 3, 2014 9:17 AM by Simon Wardley

    RSCD agent hanging mid-job

      Hi,

       

      I have an RSCD agent that seems to keep hanging on 5 servers (out of 2000+).

       

      The RSCD Agent log states:

       

      INFO     rscd -  gbahevl452.gb.tntpost.com 21474 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:21:29.411 INFO     rscd -  gbahevl452.gb.tntpost.com 21475 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:21:29.459 INFO1    rscd -  10.210.137.5 21475 0/0 (L3_IMI_Global_Full-v2:D060AHE@GB898.TPGMS.COM): CM: > [Deploy] Job 'IMIDEPGENj482ahe' is executing a dry run

      09/02/14 14:28:56.069 INFO     rscd -  gbahevl452.gb.tntpost.com 21669 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:28:56.163 INFO1    rscd -  10.210.137.5 21669 0/0 (L3_IMI_Global_Full-v2:D060AHE@GB898.TPGMS.COM): CM: > [Deploy] Deleting //gbahevl452.gb.tntpost.com/usr/nsh/NSH/Transactions/log/tmp/bldeploy-2d997d9474333b598483a3d02d49aaea.log

      09/02/14 14:29:04.203 INFO     rscd -  gbahevl452.gb.tntpost.com 22048 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:29:04.292 INFO1    rscd -  10.210.137.2 22048 0/0 (L3_IMI_Global_Full-v2:D060AHE@GB898.TPGMS.COM): CM: > [Deploy] Retrieving the root filesystem

       

      The Job Logs state the following:

       

      INFO     rscd -  gbahevl452.gb.tntpost.com 21474 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:21:29.411 INFO     rscd -  gbahevl452.gb.tntpost.com 21475 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:21:29.459 INFO1    rscd -  10.210.137.5 21475 0/0 (L3_IMI_Global_Full-v2:D060AHE@GB898.TPGMS.COM): CM: > [Deploy] Job 'IMIDEPGENj482ahe' is executing a dry run

      09/02/14 14:28:56.069 INFO     rscd -  gbahevl452.gb.tntpost.com 21669 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:28:56.163 INFO1    rscd -  10.210.137.5 21669 0/0 (L3_IMI_Global_Full-v2:D060AHE@GB898.TPGMS.COM): CM: > [Deploy] Deleting //gbahevl452.gb.tntpost.com/usr/nsh/NSH/Transactions/log/tmp/bldeploy-2d997d9474333b598483a3d02d49aaea.log

      09/02/14 14:29:04.203 INFO     rscd -  gbahevl452.gb.tntpost.com 22048 -1/-1 (Not_available): (Not_available): FIPS already enabled

      09/02/14 14:29:04.292 INFO1    rscd -  10.210.137.2 22048 0/0 (L3_IMI_Global_Full-v2:D060AHE@GB898.TPGMS.COM): CM: > [Deploy] Retrieving the root filesystem

       

       

      However, no update has occurred since (a good 45 minutes).

      This is the 2nd job that seems to fail against these 5 servers. Would I be better off trying a re-install of the agent?

       

      Any ideas?

       

      Simon

        • 1. Re: RSCD agent hanging mid-job

          Hi Simon,

           

          If your agent hangs during a DeployJob, before re-installing the agent, I would first look at the Transaction logs on the agent. These logs contain details about the execution of the DeployJob and might help you understand what is going on on the server.

           

          You can also look at the rscd/rscw processes on the server and their children. This might also help you understand what keeps the DeployJob from finishing.

           

          Olivier.

          • 2. Re: RSCD agent hanging mid-job
            Bill Robinson

            what's different about these 5 boxes?

            • 3. Re: RSCD agent hanging mid-job

              Hi Olivier,

               

              The same job is hanging on those boxes again today.

              There's nothing in the transactions folder with the same date as the job.

               

              As for the rscd/rscw processes, here is the list:

              root     25386     1  0 Sep02 ?        00:00:00 bin/rscw

              root     25387 25386  0 Sep02 ?        00:00:00 bin/rscd

              root     25388 25386  0 Sep02 ?        00:00:00 bin/rscd

              root     26171 25388  0 00:00 ?        00:00:00 bin/rscd

              root     26172 25388  0 00:00 ?        00:00:00 bin/rscd

              root     26177 25388  0 00:00 ?        00:00:00 bin/rscd

              root     26178 25388  0 00:00 ?        00:00:00 bin/rscd

              root     26179 25388  0 00:00 ?        00:00:00 bin/rscd

              root     27464 25388  0 Sep02 ?        00:00:00 bin/rscd

              Not sure if this helps?

               

              Hi Bill,

              I presume you mean the config files within /usr/lib/rsc and potentially the program binaries?

               

              I'll do an md5 checksum and get back to you.

               

              Cheers, Simon

              • 4. Re: RSCD agent hanging mid-job

                Hi Simon,

                 

                you need to go one step deeper in the process tree: I'd like to see the children of the rscd processes if any.

                 

                O.

                • 5. Re: RSCD agent hanging mid-job

                  Hi Olivier,

                   

                  Any idea how to step deeper into the process tree?

                   

                  I ran a "ps -ef | grep -i rsc" command....

                   

                  Simon

                  • 6. Re: RSCD agent hanging mid-job

                    Hi Bill,

                     

                    I have grabbed a sample server where the job ran successfully for comparison.

                     

                    The hanging server has more entries in its exports file, which tells me it's not access based.

                    Both servers' users files have "nouser" entries and "BLAdmins:H174AHE     rw,map=root" (H174AHE being the job owner).

                    Users.local on both boxes contains "BLAdmins:*  rw,map=root".

                    The md5sum of secure matches on both servers.

                    Both also respond to agentinfo and confirm a licensed agent, even though the job is still hanging.

                     

                    Thanks Simon

                    • 7. Re: RSCD agent hanging mid-job
                      Davorin Stevanovic

                      I don't see any bldeploy<uuid>.log file. Could you please share the bldeploy log from RSCD_DIR/Transactions/logs/bldeploy<uuid>.log (should be bldeploy-2d997d9474333b598483a3d02d49aaea.log)?

                       

                      In bldeploy we should see why it is hanging.
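For anyone following along, here is a minimal sketch of one way to list recent bldeploy logs on the target. The function name `recent_bldeploy_logs` is hypothetical, and the directory in the usage note is only an assumption based on the path that appears in the agent log earlier in this thread (/usr/nsh/NSH/Transactions); adjust it to your own install.

```shell
#!/bin/sh
# List bldeploy logs under a Transactions directory, with timestamps.
# $1: Transactions directory to search (assumption: your agent's install path)
# $2: how many days back to look (default: 2)
recent_bldeploy_logs() {
    dir=$1
    days=${2:-2}
    # -mtime "-N" matches files modified less than N days ago.
    # ls -lt prints each match with its modification time, newest first.
    find "$dir" -type f -name 'bldeploy*.log' -mtime "-$days" -exec ls -lt {} +
}
```

Usage would be something like `recent_bldeploy_logs /usr/nsh/NSH/Transactions 2`. If nothing comes back for the day of the job, that in itself is a hint that the job never got as far as writing its log.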

                      • 8. Re: RSCD agent hanging mid-job
                        Bill Robinson

                        are they on a different network location than the working servers?

                        do they have a different HIPS or HIDS or AV installed?

                        different OS?

                         

                         

                        from the log it looks like the simulate phase completed..

                         

                        you can do a pstree -p <parent rscd pid>
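If pstree is not installed on the box, the same information can be pulled from ps. The following is only a sketch, assuming a procps-style `ps` that accepts `-e -o pid= -o ppid= -o stat= -o args=`; `show_tree` is a made-up helper name, not part of the agent.

```shell
#!/bin/sh
# Print every descendant of a PID, indented by depth, with its state and command line.
# The body runs in a subshell "( ... )" so each recursion level keeps its own variables.
show_tree() (
    parent=$1
    indent=${2:-}
    # One line per process: pid, parent pid, state, full command line.
    ps -e -o pid= -o ppid= -o stat= -o args= |
    awk -v p="$parent" '$2 == p { print }' |
    while read -r pid ppid stat args; do
        printf '%s%s [%s] %s\n' "$indent" "$pid" "$stat" "$args"
        # Recurse into this child's own children, indented one level deeper.
        show_tree "$pid" "$indent  "
    done
)
```

Running `show_tree 25388` (the parent rscd PID from the listing earlier in the thread) would print whatever that agent process has spawned. A child sitting in state `D` (uninterruptible sleep) is typically blocked on I/O, which is worth checking when a job appears hung.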

                        • 9. Re: RSCD agent hanging mid-job

                          I believe we may have resolved the issue.

                           

                          The script was running a "nover -u -c | head -2 | tail -1" command.

                           

                          After I spoke to our UNIX team about your HIPS etc. comment, one of them found that an NFS mount pointing to our BL 7.4.4 datastore, which we decommissioned recently, was causing the nover command to hang.
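A stale mount like this can be spotted before a job runs. Below is a rough sketch, assuming a Linux host with /proc/mounts and the coreutils (or busybox) `timeout` command; `probe_mount` and `probe_nfs_mounts` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Probe one mount point with a hard time limit, so a dead NFS server
# cannot hang the check itself.
# $1: mount point, $2: time limit in seconds (default: 5)
probe_mount() {
    mp=$1
    limit=${2:-5}
    # SIGKILL is used instead of the default SIGTERM as the safer choice
    # for a process that may be stuck in NFS I/O.
    if timeout -s KILL "$limit" stat "$mp" >/dev/null 2>&1; then
        echo "ok $mp"
    else
        # stat failed or timed out: the path is missing or not answering.
        echo "unresponsive-or-missing $mp"
    fi
}

# Probe every NFS mount listed in a mounts table.
# $1: mounts table to read (default: /proc/mounts)
probe_nfs_mounts() {
    table=${1:-/proc/mounts}
    # Field 3 is the filesystem type (nfs, nfs4, ...), field 2 the mount point.
    awk '$3 ~ /^nfs/ { print $2 }' "$table" |
    while read -r mp; do
        probe_mount "$mp"
    done
}
```

Calling `probe_nfs_mounts` on one of the affected servers should flag the decommissioned datastore mount as unresponsive while the healthy mounts report ok. Mount points containing spaces are escaped in /proc/mounts and are not handled by this sketch.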

                           

                          I'll confirm this tomorrow when the job re-runs.

                           

                          Thanks for everyone's help.

                           

                          Simon