8 Replies Latest reply on Jun 7, 2012 8:15 AM by Lazar NameToUpdate

    Server rebooting (properly) after job but logs not showing it

      Running BSA 8.0 SP11, I created a Compliance and Remediation Job that checks to see if IE8 is installed on a server and, if it is not installed, installs it. The job runs successfully, no problem, but there is an issue with the results log.

       

      Installing IE8 requires a server reboot, which the Job performs. However, about half the time (or for about half the servers when I run the Job against multiple servers) the results log for the Job shows a failure on the Remediation because the server did not reboot. But the servers did reboot!

       

      If you are logged into a server when the Job is being run, you can see the server reboot. If you are pinging the server while the server is rebooting, you can see the server reboot. And when you log into the server after it has come back online from the reboot, you see that IE8 is installed. But again, half the time the results logs say that the remediation part of the Job fails because the server does not reboot.

       

      I have the reboot setting set in the BladeLogic Job, as opposed to using a command line switch ("/forcerestart") in the installation script of the Microsoft IE8 executables. Of note, to try to troubleshoot this issue I did try to use the /forcerestart command line switch, too, but the results were the same: about half the time the job results show that the remediation part failed even though it was successful.

       

      I want to emphasize that the Job itself has never failed: only the results log and then only about half the time. The annoyance is that if I am running this Job against many (potentially several hundred) servers at a time, about half of them will show a failure when they have not failed. Except for logging into the individual servers themselves, there is no way to prove to a client, customer or colleague that the Job was successful on every server.

       

      But again, despite what the Job results log shows, the Job itself has never failed. Any suggestions on how to get the Job results log to be accurate?

        • 1. Server rebooting (properly) after job but logs not showing it

          This is most likely because of a known issue on the agent side that is resolved in 8.1.2.

           

          Issue being:

          bldeploy initiates a reboot, and the server starts to gracefully stop all its services. bldeploy will wait for 60 seconds, and if the actual reboot did not happen within 60 seconds, bldeploy assumes that the reboot has failed and reports so in the job, when in reality it just took a bit longer to stop all the services and reboot the server.

           

          The fix was to extend the 60-second grace period to 15 minutes.

           

          You can validate this in the bldeploy log (//target/<agent>/Transactions/log/bldeploy-xxx.log) as follows:

          Find the line that initiates a reboot, and then expect to see the next entry exactly 1 minute later saying that the reboot failed (or something similar).
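          As a quick way to check many logs, here is a minimal Python sketch of that validation. The log-line format is assumed from the bldeploy excerpt shown later in this thread; adjust the pattern and message text to match your own log.

          ```python
          import re
          from datetime import datetime

          # Assumed bldeploy log format: "MM/DD/YY HH:MM:SS.mmm LEVEL bldeploy - message"
          TS_FORMAT = "%m/%d/%y %H:%M:%S.%f"
          LINE_RE = re.compile(
              r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+\w+\s+bldeploy - (.*)$"
          )

          def reboot_gap_seconds(log_lines):
              """Return the seconds between the reboot-initiation entry and the
              reboot-failure warning, or None if either entry is missing."""
              start = end = None
              for line in log_lines:
                  m = LINE_RE.match(line)
                  if not m:
                      continue
                  ts, msg = m.groups()
                  if "Attempting shutdown REBOOT=true" in msg:
                      start = datetime.strptime(ts, TS_FORMAT)
                  elif "Reboot required but did not occur" in msg and start:
                      end = datetime.strptime(ts, TS_FORMAT)
                      break
              if start and end:
                  return (end - start).total_seconds()
              return None

          # Sample lines taken from the excerpt later in this thread
          sample = [
              "05/10/12 01:23:52.886 INFO     bldeploy - Attempting shutdown REBOOT=true TIMEOUT=0 MSG=System is rebooting",
              "05/10/12 01:24:52.917 WARN     bldeploy - Reboot required but did not occur. Manual reboot needed to complete operation.",
          ]
          print(reboot_gap_seconds(sample))  # → 60.031
          ```

          A gap of almost exactly 60 seconds between the two entries is the signature of the agent-side timeout described above.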

           

          This is fixed in 8.0.11 and 8.1.2 (I believe)

           

          If I am wrong here, please attach the bldeploy log.

           

          Lazar

          • 2. Server rebooting (properly) after job but logs not showing it

            The timing was about 1:43 later in the logs but I think you are correct. The only thing is that the version of BSA is 8.0.11.

            • 3. Server rebooting (properly) after job but logs not showing it

              In that case I may not be correct. Attach the bldeploy log, please.

              • 4. Server rebooting (properly) after job but logs not showing it

                I meant this is fixed on the agent side, not the appserver (in case what you mentioned was the appserver version).

                 

                This is what I expect to see in the bldeploy log:

                 

                ...

                05/10/12 01:23:52.886 INFO     bldeploy - Attempting shutdown REBOOT=true TIMEOUT=0 MSG=System is rebooting

                05/10/12 01:24:52.902 DEBUG    bldeploy - Metabase initialization skipped because it was not needed

                05/10/12 01:24:52.902 INFO     bldeploy - Apply Succeeded

                05/10/12 01:24:52.917 WARN     bldeploy - Reboot required but did not occur. Manual reboot needed to complete operation.

                 

                 

                From the first to the second line, there is a 60-second wait. But like I said, if this is not it, the bldeploy log should help us proceed further.

                 

                Lazar

                • 5. Re: Server rebooting (properly) after job but logs not showing it

                  Not exactly what I am seeing in the logs. I attached the bldeploy.log from the server where it last happened. But as you can see, although the App Server is 8.0.11, the agent version is 8.0.0.422.

                  • 6. Server rebooting (properly) after job but logs not showing it

                    This job rebooted correctly, and it did not complain that the job failed because the server did not reboot. The job complained because not every item in the deploy job completed successfully. For instance, item 3 did not complete successfully:

                     

                    06/05/12 18:19:55.341 INFO     bldeploy - [3][] [stdout: 3] 

                    C:\temp\stage\a7399e3a059530cb8e758831efb2e888>chcp 1252  1>NUL

                    06/05/12 18:19:56.888 INFO     bldeploy - [3][] [stdout: 3] 

                    C:\temp\stage\a7399e3a059530cb8e758831efb2e888>change user /execute

                    06/05/12 18:19:59.607 INFO     bldeploy - [3][] [stdout: 3]  Install mode does not apply to a Terminal server configured for remote administration.

                    06/05/12 18:20:00.341 DEBUG    bldeploy - [3][] In RunProcess: exitCode = 1
                    06/05/12 18:20:00.544 ERROR    bldeploy - [3][] Command returned non-zero exit code: 1

                     

                    When at least one item fails in a deploy job, expect the job to fail with exit code -4001, just like in your log. To illustrate how to deal with -4001, I wrote the following guide:

                     

                    KA354095 - BBSA Deploy Job error: APPLY failed for server [target]. Exit code = -4001

                    https://kb.bmc.com/infocenter/index?page=content&id=KA354095
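
                    The aggregation described above can be sketched roughly as follows. This is a hypothetical illustration, not BSA internals; the -4001 value is the aggregate APPLY failure code from the log, and the function name is illustrative only.

                    ```python
                    # Illustration of the behavior described above: if any item in the
                    # deploy job exits non-zero, the job as a whole reports the
                    # aggregate APPLY failure code -4001. Names here are hypothetical.
                    APPLY_FAILED = -4001

                    def overall_exit_code(item_exit_codes):
                        """Return 0 if every item succeeded, otherwise -4001."""
                        return 0 if all(code == 0 for code in item_exit_codes) else APPLY_FAILED

                    print(overall_exit_code([0, 0, 0]))  # → 0
                    print(overall_exit_code([0, 0, 1]))  # → -4001 (item 3 returned exit code 1)
                    ```

                    The point is that -4001 is an aggregate: to find the real cause, look for the individual item that returned a non-zero exit code, as in the excerpt above.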

                     

                    Lazar

                    • 7. Server rebooting (properly) after job but logs not showing it

                      This actually makes sense. I conflated two issues, one of them the result of a change that I recently had to make in the Job.

                       

                      The error you see is because of the "change user /mode" commands that I put in the Jobs. The last few collections of servers that I had to run the Job against had Terminal Servers in them. Unfortunately, I only found out about the terminal servers after I had created the job and it was being run in production.

                       

                      For administrative reasons, the selection of servers to have IE8 installed at a particular time is not based on technology (i.e., their version of Windows, whether they are terminal servers, etc.), nor can I create Smart Groups to separate or classify them. Therefore, I created one BladeLogic Compliance and Remediation Job to determine the operating system version and run the appropriate IE8 executable. But when I saw that I had several terminal servers in several of the groups, because I had to resolve that issue immediately, I just added "change user /mode" commands. When the Job runs against a terminal server, it just works. And when the Job is not run against a terminal server, it produces trivial errors and continues on to execute successfully.

                       

                      As a solution, I could modify the Compliance and Remediation Job to determine whether the target server is a terminal server or not, but, although that is a relatively trivial change, it would probably require the entire Job to go through the client's code review process again. If I had known about the terminal servers when I originally created the Job, I would have done this but, again, I only found out about them after the Job was in production, and it was much easier to add two quick lines to work around an issue I was encountering immediately. Longer term, I am working on correcting that situation, but in the meantime I still need to continue deploying IE8 throughout the enterprise.

                       

                      However, that does not mean that I have not seen an error message pop up in the results log saying that the server did not reboot when in fact it did. For those situations, I think your first analysis was correct. Although the App Server is 8.0 SP11, I would guess that close to half of the agents are still 8.0.0.422. If I understand you correctly, this is what causes the "false failures" when they occur in those cases. And again, they don't always occur.

                       

                      Finally note that the client is planning to upgrade to 8.2 SP1 this summer and, when we do, we will also upgrade all the agents to the current version. By that time I also hope to have separated my terminal servers in BladeLogic.

                       

                      But in the meantime, the IE8 upgrade itself is working. The only issue is the "false failures" in the results log: annoying and definitely not optimal but, from everything you are telling me combined with our current administrative constraints, not a "showstopper" at this point.

                       

                      Thank you!

                      • 8. Server rebooting (properly) after job but logs not showing it

                        you're welcome.

                         

                        >>I would guess that close to half of the agents are still 8.0.0.422. If I understand you correctly, this is causing the "false failures" when they occur in those cases. And again, they don't always occur.

                         

                        Yes, if your agent is on 8.0.0.422 (SP4, I believe) and the system requires more than 1 minute to stop all the services, then you will see the false negative with regard to the reboot. And yes, this does not always occur, because the state of the system can differ: what today took 2 minutes to stop could take 30 seconds tomorrow, and the reboot false failure would not have happened.

                         

                        Once you upgrade the agents, technically you should only see the same false failure if the server takes longer than 15 minutes to stop all services prior to reboot. If so, then there is a bigger issue to investigate.

                         

                        Lazar
