9 Replies. Latest reply on Mar 19, 2013 4:47 PM by afurman

    BBSA 8.2 appserver threads shutdown due to lost connection to SQL

      BBSA 8.2 x86 running on Windows Server 2003 SP2 x86, using a remote, clustered SQL Server 2005 SP4 Enterprise instance.

       

      We are experiencing some sporadic network connectivity issues with the app server computer (or the network segment it is connected to).

      As a result, the appserver logs "Message:connection reset SQL State:08S01 ErrorCode:0" and other SQL-related errors.

       

      The real problem is that immediately following these errors the appserver starts to shut down all of its processes, and even though the network issues seem to go away quickly, the appserver is not available until we manually restart all services in Windows. (The service is still running in Windows, but according to appserver.log all threads are shut down.)

       

      We waited at least 30 minutes for appserver to restart its threads on its own but after that just did it manually.

       

      Question: I seem to remember that in the past I saw the appserver automatically restart its threads in similar circumstances, so no manual intervention was required (maybe that was prior to version 8.2, though).

       

      Can anybody tell me whether such an automatic restart exists and, if so, how long we have to wait for it to happen?

        • 1. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL
          Barry McQuillan

          When you say all threads are shut down, are you referring to database connection threads or application server threads?

          • 2. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL

            Barry,

            Basically all appserver processes are being shut down (appserver, authentication, etc.). That probably includes database connection threads as well, but I'm not sure.

            • 3. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL
              Barry McQuillan

              I'm not aware of any specific functionality that restarts an appserver if the processes die.

              However I can think of 2 ways that this could be done.

              1. Using BPPM or some other monitoring tool, set up a trigger for the appserver process: when it dies, have BPPM restart it, and if the restart still fails, send an alert for manual intervention.
              2. Create a script that does the above and schedule it to run regularly.
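A minimal sketch of the scripted option, in Python. It is an illustration under assumptions, not a BMC-provided tool: the service name is a placeholder (check the real one in services.msc), and the health check is left to the caller because, as noted in the original post, the Windows service can still report "running" while the appserver threads are shut down. The restart uses the standard Windows `net stop` / `net start` commands.

```python
import subprocess

# Assumption: placeholder service name; replace with the name shown in services.msc.
SERVICE_NAME = "BladeLogic Application Server"


def decide_action(healthy: bool, restarts_attempted: int, max_restarts: int = 1) -> str:
    """Return 'none', 'restart' or 'alert' for one watchdog cycle.

    healthy           -- result of whatever health probe you trust (log check,
                         port check, monitoring tool); not defined here.
    restarts_attempted -- restarts already tried since the last healthy cycle.
    """
    if healthy:
        return "none"
    if restarts_attempted < max_restarts:
        return "restart"
    # A restart was already tried and the appserver is still unhealthy.
    return "alert"


def restart_service(name: str = SERVICE_NAME) -> bool:
    """Stop and start a Windows service; returns True if the start succeeded."""
    # 'net stop' may fail if the service is already stopped; that is fine here.
    subprocess.run(["net", "stop", name], check=False)
    started = subprocess.run(["net", "start", name], check=False)
    return started.returncode == 0
```

Scheduled to run regularly (for example from Task Scheduler), a wrapper would call `decide_action` each cycle, call `restart_service` on "restart", and page someone on "alert".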

               

              I should point out that this is not a BSA issue; most applications dislike having their databases removed mid-transaction. The above should only be considered a temporary solution and should be removed once the network issues have been resolved.

               

              You should also raise an issue with BMC Support for this problem.

              • 4. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL
                Bill Robinson

                What is the network path between the appserver and the DB? Firewall, router, etc.? Any timeouts there?

                 

                Can you run 'blasadmin -a show database all' and post the result?

                • 5. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL
                  Sean Berry

                  If the database becomes unresponsive, it wouldn’t surprise me that BSA eventually becomes unresponsive.  Make sure to size the DB environment for the size of the appserver load.  If the DB needs to be restarted, plan to restart the appserver.

                  • 6. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL

                    Bill,

                    This is not related to my questions in the other thread about setting up a BCP with the database across the WAN.

                    DB and app server in this case are in the same datacenter and the same LAN.

                     

                    99% of the time there are no issues. This situation happens when the network link goes down randomly, very briefly, and not too frequently (it has happened 3 times in the last 5 months, for example). It has to do with either the NICs on the server or the switch. This is not a "BladeLogic" problem; there's nothing wrong with the way the appserver and its database are configured.

                     

                    My question is about BBSA's handling of such events. What I see in appserver.log is that the appserver immediately starts shutting itself down and stays down for at least 30 minutes (then we manually restart the Windows service).

                     

                    What I seem to remember from the past (maybe it was still version 8.0) is that the appserver was able to "recover" on its own after detecting that the DB was offline: it retried periodically and started the processes back up once the DB was detected online.

                     

                    I believe I saw it when our DBAs did some maintenance work on the SQL server at night, and in the morning I did not have to do anything: the appservers were running fine.

                     

                    I don't see BladeLogic recover the same way in our current situation, so I'm trying to find out what I should expect. Is it going to recover if I give it more time, for example? If so, how much time?

                     

                    Was hoping maybe someone had similar experience or knows the internals of error handling in this process.

                     

                    I do realize that losing the connection to the DB is a bad thing, no argument here :-)

                    • 7. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL

                      Barry,

                      Yes, we're thinking about a scripted solution that restarts the app server services, just like you said.

                      And yes, I'll open a support ticket; I just wanted to see if I could find some answers in the community first.

                      We'll also have to fix the actual hardware problem with either a NIC or a switch, which causes this situation in the first place :-)

                      Thanks

                       

                      alex.

                      • 8. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL
                        Bill Robinson

                        I was more wondering what's causing the problem, because the best solution is going to be to fix that.

                         

                        I haven't seen DB connectivity issues cause an appserver shutdown before, though I'm probably more familiar with Oracle. What should happen, IMO, is that the connection would be disconnected, we'd eventually catch that, and we'd try to open a new connection. If the NIC is getting disconnected, it's possible that all connections are dead, right? So maybe in that case we initiate a shutdown.

                         

                        Like Barry said, you'd need to implement a monitoring agent that could initiate the restart.

                        • 9. Re: BBSA 8.2 appserver threads shutdown due to lost connection to SQL

                          Well, for what it's worth, I simulated a few failure scenarios in the lab and watched bladelogic's response to each one:

                           

                          Scenario #1: The NIC on the application server is disabled for about a minute and re-enabled (brief network outage). appserver.log records errors, but the appserver does not shut down its processes; it is ABLE to continue on its own without a restart once the connection is restored.

                           

                          Scenario #2: The NIC on the application server is disabled for about 15 minutes and re-enabled (long-lasting network outage). The appserver records errors and then abruptly shuts down:

                          [19 Mar 2013 12:32:17,200] [System-In-Thread] [INFO] [::] [] Stopping appserver due to failure to read launcher socket

                          [19 Mar 2013 12:32:17,200] [main] [INFO] [::] [] Shutdown requested

                          [19 Mar 2013 12:32:17,200] [System-In-Thread] [INFO] [::] [] Shutdown processes complete

                          [19 Mar 2013 12:32:17,200] [main] [INFO] [::] [] Shutdown processes complete

                          [19 Mar 2013 12:32:17,200] [Thread-1] [INFO] [::] [] Undeploying

                          Nothing is logged after that: the appserver is UNABLE to continue on its own without a restart once the connection is restored.

                           

                          Scenario #3: The SQL Server instance hosting the BladeLogic database is stopped.

                          The appserver records errors but does not shut down its processes. It keeps retrying about every 30 seconds (I gave it about 15 minutes) and is then ABLE to continue on its own without a service restart once SQL Server becomes available.

                          I believe this is the scenario I observed in the past, which led me to believe that there's some kind of general auto-repair/retry mechanism in BladeLogic. I guess there is, but only for certain kinds of errors.
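The retry-until-available pattern seen in Scenario #3 can be sketched as below. This is only an illustration of the behaviour observed from the outside (a check roughly every 30 seconds for about 15 minutes), not BladeLogic's actual implementation; the function name, the `check` callable and the default values are assumptions chosen to mirror the numbers in this post.

```python
import time
from typing import Callable


def wait_until_available(
    check: Callable[[], bool],
    interval: float = 30.0,   # seconds between attempts (assumed from the ~30 s observed)
    timeout: float = 900.0,   # give up after ~15 minutes (assumed)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True as soon as check() succeeds, False if the deadline passes.
    check() is whatever probe you choose, e.g. opening a DB connection.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A restart script could use something like this to wait for the database to come back before restarting the appserver services, rather than restarting into a database that is still unreachable.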

                           

                          Our recent outage, based on the data captured in appserver.log, falls mostly into the Scenario #2 category: it was some kind of network outage, and the BladeLogic processes shut down right away and required manual intervention (a service restart) to continue. The differences are that the system log does not show the outage lasting a long time, and the shutdown is much more detailed and orderly:

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Finished shutting down thread pool Support-Thread-

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Support Service stopped.

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Stopping Cleanup Service...

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Cleanup Service stopped.

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Stopping Grammar Service...

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Grammar Service stopped.

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] File Manager Service stopping...

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] File Manager Service stopped.

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Stopping App Server Service...

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] App Server Service stopped.

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Stopping Authentication Service...

                          [14 Mar 2013 16:58:46,935] [Thread-1] [INFO] [::] [] Authentication-Service is shutting down the acceptor

                           

                          But the end result is the same: it shuts everything down and you need to manually restart.
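Since the Windows service keeps reporting "running" in both cases, one way to detect this state is to scan appserver.log for the shutdown messages quoted above. The following is a rough sketch under assumptions: the line layout and marker strings are taken only from the excerpts in this thread, other versions may log different text, and the month abbreviation is parsed with `%b`, which assumes an English locale.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Layout as seen in the excerpts above:
# [timestamp] [thread] [level] [::] [] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<thread>[^\]]*)\] \[(?P<level>[^\]]*)\] "
    r"\[[^\]]*\] \[[^\]]*\] (?P<msg>.*)$"
)

# Messages copied from the two shutdown sequences posted in this thread.
SHUTDOWN_MARKERS = (
    "Shutdown requested",
    "Stopping appserver due to failure to read launcher socket",
    "Stopping App Server Service...",
)


def first_shutdown(lines: Iterable[str]) -> Optional[datetime]:
    """Return the timestamp of the first shutdown marker, or None if absent."""
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines that do not follow the expected layout
        if any(marker in match.group("msg") for marker in SHUTDOWN_MARKERS):
            # Example timestamp: "19 Mar 2013 12:32:17,200"
            return datetime.strptime(match.group("ts"), "%d %b %Y %H:%M:%S,%f")
    return None
```

A monitoring script could feed this the tail of appserver.log and treat a shutdown marker with no later startup activity as the signal to restart the service.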