This is most likely because of a known issue on the agent side that is resolved in 8.1.2.
bldeploy initiates a reboot, and the server starts to gracefully stop all the services. bldeploy then waits 60 seconds, and if the reboot has not actually happened within that time, it assumes that the reboot failed and reports so in the job, when in reality it just took a bit longer to stop all the services and reboot the server.
The fix was to extend the 60-second grace period to 15 minutes.
You can validate this in the bldeploy log (//target/<agent>/Transactions/log/bldeploy-xxx.log) as follows:
Find the line that initiates a reboot, then look for an entry almost exactly 1 minute later saying that the reboot failed (or something like that).
This is fixed in 8.0.11 and 8.1.2 (I believe).
If I am wrong here, please attach the bldeploy log.
The timing was about 1:43 later in the logs but I think you are correct. The only thing is that the version of BSA is 8.0.11.
In that case I may not be correct. Please attach the bldeploy log.
I meant this is fixed on the agent side, not the appserver (in case what you mentioned is the appserver version).
This is what I expect to see in the bldeploy log:
05/10/12 01:23:52.886 INFO bldeploy - Attempting shutdown REBOOT=true TIMEOUT=0 MSG=System is rebooting
05/10/12 01:24:52.902 DEBUG bldeploy - Metabase initialization skipped because it was not needed
05/10/12 01:24:52.902 INFO bldeploy - Apply Succeeded
05/10/12 01:24:52.917 WARN bldeploy - Reboot required but did not occur. Manual reboot needed to complete operation.
From the first line to the second, there is a 60-second wait. But like I said, if this is not it, the bldeploy log should help us proceed further.
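To check for this pattern without reading the log by eye, here is a minimal Python sketch. It assumes every entry follows the layout of the excerpt above (MM/DD/YY timestamp, level, "bldeploy -", message) and matches on the two message strings shown there. The function name and the SAMPLE list are just for illustration and are not part of any BladeLogic tooling.

```python
import re
from datetime import datetime

# Assumed entry layout, taken from the excerpt above:
# "MM/DD/YY HH:MM:SS.mmm LEVEL bldeploy - message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+bldeploy\s+-\s+(.*)$"
)

def reboot_wait_seconds(lines):
    """Seconds between the reboot request and the 'did not occur' warning.

    Returns None if the log does not contain both entries in that order.
    """
    started = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # continuation lines (command output) carry no timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S.%f")
        message = match.group(3)
        if "Attempting shutdown REBOOT=true" in message:
            started = stamp
        elif started is not None and "Reboot required but did not occur" in message:
            return (stamp - started).total_seconds()
    return None

# The four entries quoted above, used as sample input.
SAMPLE = [
    "05/10/12 01:23:52.886 INFO bldeploy - Attempting shutdown REBOOT=true TIMEOUT=0 MSG=System is rebooting",
    "05/10/12 01:24:52.902 DEBUG bldeploy - Metabase initialization skipped because it was not needed",
    "05/10/12 01:24:52.902 INFO bldeploy - Apply Succeeded",
    "05/10/12 01:24:52.917 WARN bldeploy - Reboot required but did not occur. Manual reboot needed to complete operation.",
]

if __name__ == "__main__":
    print(reboot_wait_seconds(SAMPLE))
```

A result of about 60 seconds points at the grace period issue. No result, or a very different gap, means something else is going on.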
Not exactly what I am seeing in the logs. I attached the bldeploy.log from the server where it last happened. But as you can see, although the App Server is 8.0.11, the agent version is 22.214.171.1242.
This job rebooted correctly, and it did not complain that the job failed because the server did not reboot. The job complained because not every item in the deploy job completed successfully. For instance item 3 did not complete successfully:
06/05/12 18:19:55.341 INFO bldeploy -  [stdout: 3]
C:\temp\stage\a7399e3a059530cb8e758831efb2e888>chcp 1252 1>NUL
06/05/12 18:19:56.888 INFO bldeploy -  [stdout: 3]
C:\temp\stage\a7399e3a059530cb8e758831efb2e888>change user /execute
06/05/12 18:19:59.607 INFO bldeploy -  [stdout: 3] Install mode does not apply to a Terminal server configured for remote administration.
06/05/12 18:20:00.341 DEBUG bldeploy -  In RunProcess: exitCode = 1
06/05/12 18:20:00.544 ERROR bldeploy -  Command returned non-zero exit code: 1
When at least one item fails in a deploy job, expect the job to fail with exit code -4001, just like in your log. To illustrate how to deal with -4001, I wrote the following guide:
KA354095 - BBSA Deploy Job error: APPLY failed for server [target]. Exit code = -4001
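For a longer log, a rough Python sketch like the one below can list which items returned a non-zero exit code. It assumes the number in "[stdout: N]" identifies the item currently running, which matches the excerpt above but is not something I have confirmed for every log. The function name is made up for this example.

```python
import re

# Patterns copied from the log excerpt above; other message formats are not handled.
ITEM_RE = re.compile(r"\[stdout: (\d+)\]")
EXIT_RE = re.compile(r"ERROR bldeploy -\s+Command returned non-zero exit code: (-?\d+)")

def failed_items(lines):
    """Return (item_number, exit_code) pairs for every non-zero exit in the log.

    item_number is the most recent "[stdout: N]" value seen before the error,
    or None if no such marker appeared first.
    """
    last_item = None
    failures = []
    for line in lines:
        item = ITEM_RE.search(line)
        if item:
            last_item = int(item.group(1))
        error = EXIT_RE.search(line)
        if error:
            failures.append((last_item, int(error.group(1))))
    return failures

# The entries quoted above, used as sample input.
SAMPLE_LOG = [
    r"06/05/12 18:19:55.341 INFO bldeploy -  [stdout: 3]",
    r"C:\temp\stage\a7399e3a059530cb8e758831efb2e888>chcp 1252 1>NUL",
    r"06/05/12 18:19:56.888 INFO bldeploy -  [stdout: 3]",
    r"C:\temp\stage\a7399e3a059530cb8e758831efb2e888>change user /execute",
    r"06/05/12 18:19:59.607 INFO bldeploy -  [stdout: 3] Install mode does not apply to a Terminal server configured for remote administration.",
    r"06/05/12 18:20:00.341 DEBUG bldeploy -  In RunProcess: exitCode = 1",
    r"06/05/12 18:20:00.544 ERROR bldeploy -  Command returned non-zero exit code: 1",
]

if __name__ == "__main__":
    for item, code in failed_items(SAMPLE_LOG):
        print(f"item {item} exited with code {code}")
```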
This actually makes sense. I conflated two issues, one of which is the result of a change that I recently had to make in the Job.
The error you see is because of the "change user /mode" commands that I put in the Jobs. The last few collections of servers that I had to run the Job against had Terminal Servers in them. Unfortunately, I only found out about the terminal servers after I had created the job and it was being run in production.
For administrative reasons, the selection of servers to have IE8 installed at a particular time is not based on technology (i.e. their version of Windows, whether they are terminal servers, etc.), nor can I create Smart Groups to separate or classify them. Therefore, I created one BladeLogic Compliance and Remediation job to determine the operating system version and run the appropriate IE8 executable. But when I saw that I had several terminal servers in several of the groups, and because I had to resolve that issue immediately, I just added "change user /mode" commands. When the Job runs against a terminal server, it just works. When the Job is not run against a terminal server, it produces trivial errors and continues on to execute successfully.
As a solution, I could modify the Compliance and Remediation Job to determine whether the target server is a terminal server or not. Although that is a relatively trivial change, it would probably require the entire Job to go through the client's code review process again. If I had known about the terminal servers when I originally created the job, I would have done this, but again, I only found out about them after the job was in production, and it was much easier to add two quick lines to work around an issue that I was encountering immediately. Longer term, I am working on correcting that situation, but in the meantime I still need to continue deploying IE8 throughout the enterprise.
However, that does not mean that I have not seen an error message pop up in the results log saying that the server did not reboot when in fact it did. For those situations, I think your first analysis was correct. Although the App Server is 8.0 SP11, I would guess that close to half of the agents are still 126.96.36.1992. If I understand you correctly, this is causing the "false failures" when they occur in those cases. And again, they don't always occur.
Finally note that the client is planning to upgrade to 8.2 SP1 this summer and, when we do, we will also upgrade all the agents to the current version. By that time I also hope to have separated my terminal servers in BladeLogic.
But in the meantime, the IE8 upgrade itself is working. The only issue is the "false failures" in the results log: annoying and definitely not optimal, but from everything you are telling me, combined with our current administrative constraints, not a "showstopper" at this point.
>>I would guess that close to half of the agents are still 188.8.131.522. If I understand you correctly, this is causing the "false failures" when they occur in those cases. And again, they don't always occur.
Yes, if your agent is on 184.108.40.2062 (SP4, I believe), and the system requires more than 1 minute to stop all the services, then you will see the false negative with regard to the reboot. And yes, this does not always occur, because the state of the system can differ: what took 2 minutes to stop today could take 30 seconds tomorrow, and then the false reboot failure would not happen.
Once you upgrade the agents, technically you would still see the same false failure if the server took longer than 15 minutes to stop all services prior to reboot. If so, then there is a bigger issue to investigate.
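Put as a tiny Python model (only an illustration of the timing described in this thread, not the agent's actual code; the names are made up):

```python
# Grace periods as described in this thread.
OLD_GRACE_SECONDS = 60        # agents without the fix
NEW_GRACE_SECONDS = 15 * 60   # agents with the fix

def reports_false_failure(shutdown_seconds, grace_seconds):
    """True if stopping services takes longer than bldeploy is willing to wait.

    shutdown_seconds is how long the server needs before it actually reboots.
    """
    return shutdown_seconds > grace_seconds

if __name__ == "__main__":
    # The same server on two different days, against an agent without the fix.
    print(reports_false_failure(120, OLD_GRACE_SECONDS))  # services took 2 minutes to stop
    print(reports_false_failure(30, OLD_GRACE_SECONDS))   # services took 30 seconds to stop
    # After the agent upgrade, a 2-minute shutdown fits inside the grace period.
    print(reports_false_failure(120, NEW_GRACE_SECONDS))
```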