11 Replies Latest reply on Jul 21, 2015 9:39 AM by Bill Robinson

    java.io.SyncFailedException: lost message framing

    Yanick Girouard

      (Using BSA 8.5.01.304 on Windows/Oracle, with NSH Proxy enabled.)

       

      Once in a while, always at what seems to be the same time (4:00 AM), we see this error in a type 2 NSH Script Job...

       

      Error20-Jul-2015 4:00:20 AMcom.bladelogic.om.infra.session.client.SessionLoginException: Your session credential does not contain a usable Application Server URL.
      Error20-Jul-2015 4:00:20 AMErrors were:
      Error20-Jul-2015 4:00:20 AMCannot connect to "service:appsvc.bladelogic:blsess://APP_SERVICE_HOSTNAME:9841" - java.io.SyncFailedException: lost message framing

       

      With the following matching the timestamp in the app server log:

       

      "20 Jul 2015 04:00:20,336",Accept-Client-Connections-Thread,INFO,::,Appserver,Connection disconnecting: id = 910179

      "20 Jul 2015 04:00:20,570",Client-Connections-Thread-9,WARN,Anonymous:Anonymous:IP_OF_APPSERVER,Client,java.net.SocketException: Connection reset

      com.bladelogic.sso.engine.ClientDisconnectedException: java.net.SocketException: Connection reset

        at com.bladelogic.sso.engine.StreamMessaging.receiveMessage(StreamMessaging.java:123)

        at com.bladelogic.sso.engine.StreamMessaging.receiveToken(StreamMessaging.java:203)

        at com.bladelogic.om.infra.mfw.net.BlSessionServerConnection.authenticate(BlSessionServerConnection.java:186)

        at com.bladelogic.om.infra.mfw.net.BlSessionServerConnection.doHandshake(BlSessionServerConnection.java:116)

        at com.bladelogic.om.infra.mfw.net.ClientWorkerThread.handleNewConnection(ClientWorkerThread.java:144)

        at com.bladelogic.om.infra.mfw.net.ClientWorkerThread.execute(ClientWorkerThread.java:97)

        at com.bladelogic.om.infra.mfw.net.ClientWorkerThread.execute(ClientWorkerThread.java:27)

        at com.bladelogic.om.infra.app.service.thread.BlBlockingThread.run(BlBlockingThread.java:95)

      Caused by: java.net.SocketException: Connection reset

        at java.net.SocketInputStream.read(Unknown Source)

        at java.net.SocketInputStream.read(Unknown Source)

        at sun.security.ssl.InputRecord.readFully(Unknown Source)

        at sun.security.ssl.InputRecord.read(Unknown Source)

        at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

        at sun.security.ssl.SSLSocketImpl.readDataRecord(Unknown Source)

        at sun.security.ssl.AppInputStream.read(Unknown Source)

        at java.io.BufferedInputStream.fill(Unknown Source)

        at java.io.BufferedInputStream.read(Unknown Source)

        at com.bladelogic.sso.engine.StreamMessaging.receiveMessage(StreamMessaging.java:119)

        ... 7 more

      "20 Jul 2015 04:00:20,570",Client-Connections-Thread-9,WARN,Anonymous:Anonymous:IP_OF_APPSERVER,Client,Error authorizing the connection

      "20 Jul 2015 04:00:20,570",Client-Connections-Thread-1,INFO,USERNAME@DOMAIN:Anonymous:IP_OF_APPSERVER,Client,Connection disconnecting: id = 910851

       

      We can't find anything such as VMware events or database issues, or other logged activity matching this time that could identify the potential cause... Subsequent runs on the same app servers are all fine, and there are no other errors in the logs of the app server. The app server services are all green and running.


      A search in the job_run_event table for the same error returned only this occurrence, and we keep 180 days worth of historical data.


      We do have a monitoring probe telneting the port of the app service every minute, so I don't know if this could have caused a glitch, but then why has it not happened at every other probe check as well? We have NSH Script Jobs scheduled to run every 30 minutes, at the 00 and 30 minute mark.


      Would anyone have any idea what could be causing this?

        • 1. Re: java.io.SyncFailedException: lost message framing

          Is it possible for you to reproduce this without nsh proxy?

           

          Thanks,

          -Saif.

           

          -


          Yanick Girouard wrote -


           

          >                                                                             

          >                                                                                                                                                                                       BMC Communities

          >                                        

          >java.io.SyncFailedException: lost message framing

          >created by Yanick Girouard in Server Automation - View the full discussion

          >(Using BSA 8.5.01.304 on Windows/Oracle, with NSH Proxy enabled.)

          >Once in a while, always at what seems to be the same time (4:00 AM), we see this error in a type 2 NSH Script Job...

          >Error20-Jul-2015 4:00:20 AMcom.bladelogic.om.infra.session.client.SessionLoginException: Your session credential does not contain a usable Application Server URL.Error20-Jul-2015 4:00:20 AMErrors were:Error20-Jul-2015 4:00:20 AMCannot connect to "service:appsvc.bladelogic:blsess://APP_SERVICE_HOSTNAME:9841" - java.io.SyncFailedException: lost message framing

          >With the following matching the timestamp in the app server log:

          >"20 Jul 2015 04:00:20,336",Accept-Client-Connections-Thread,INFO,::,Appserver,Connection disconnecting: id = 910179

          >"20 Jul 2015 04:00:20,570",Client-Connections-Thread-9,WARN,Anonymous:Anonymous:IP_OF_APPSERVER,Client,java.net.SocketException: Connection reset

          >com.bladelogic.sso.engine.ClientDisconnectedException: java.net.SocketException: Connection reset

          >  at com.bladelogic.sso.engine.StreamMessaging.receiveMessage(StreamMessaging.java:123)

          >  at com.bladelogic.sso.engine.StreamMessaging.receiveToken(StreamMessaging.java:203)

          >  at com.bladelogic.om.infra.mfw.net.BlSessionServerConnection.authenticate(BlSessionServerConnection.java:186)

          >  at com.bladelogic.om.infra.mfw.net.BlSessionServerConnection.doHandshake(BlSessionServerConnection.java:116)

          >  at com.bladelogic.om.infra.mfw.net.ClientWorkerThread.handleNewConnection(ClientWorkerThread.java:144)

          >  at com.bladelogic.om.infra.mfw.net.ClientWorkerThread.execute(ClientWorkerThread.java:97)

          >  at com.bladelogic.om.infra.mfw.net.ClientWorkerThread.execute(ClientWorkerThread.java:27)

          >  at com.bladelogic.om.infra.app.service.thread.BlBlockingThread.run(BlBlockingThread.java:95)

          >Caused by: java.net.SocketException: Connection reset

          >  at java.net.SocketInputStream.read(Unknown Source)

          >  at java.net.SocketInputStream.read(Unknown Source)

          >  at sun.security.ssl.InputRecord.readFully(Unknown Source)

          >  at sun.security.ssl.InputRecord.read(Unknown Source)

          >  at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

          >  at sun.security.ssl.SSLSocketImpl.readDataRecord(Unknown Source)

          >  at sun.security.ssl.AppInputStream.read(Unknown Source)

          >  at java.io.BufferedInputStream.fill(Unknown Source)

          >  at java.io.BufferedInputStream.read(Unknown Source)

          >  at com.bladelogic.sso.engine.StreamMessaging.receiveMessage(StreamMessaging.java:119)

          >  ... 7 more

          >"20 Jul 2015 04:00:20,570",Client-Connections-Thread-9,WARN,Anonymous:Anonymous:IP_OF_APPSERVER,Client,Error authorizing the connection

          >"20 Jul 2015 04:00:20,570",Client-Connections-Thread-1,INFO,USERNAME@DOMAIN:Anonymous:IP_OF_APPSERVER,Client,Connection disconnecting: id = 910851

          >We can't find anything such as VMware events or database issues, or other logged activity matching this time that could identify the potential cause... Subsequent runs on the same app servers are all fine, and there are no other errors in the logs of the app server. The app server services are all green and running.

          >

          >A search in the job_run_event table for the same error returned only this occurrence, and we keep 180 days worth of historical data.

          >

          >We do have a monitoring probe telneting the port of the app service every minute, so I don't know if this could have caused a glitch, but then why has it not happened at every other probe check as well? We have NSH Script Jobs scheduled to run every 30 minutes, at the 00 and 30 minute mark.

          >

          >Would anyone have any idea what could be causing this?

          >     Reply to this message by replying to this email, or go to the message on BMC Communities      Start a new discussion in Server Automation by email or at BMC Communities            Following Server Automation in these streams: Inbox

          >

          >Thank you for your participation in BMC Communities.

          • 2. Re: java.io.SyncFailedException: lost message framing
            Yanick Girouard

            I can't reproduce it on demand, can't disable the NSH Proxy (many jobs depend on it). Sorry.

            • 3. Re: java.io.SyncFailedException: lost message framing

              I mean in your test environment. Sometimes appserver process reaching the total open file descriptors can cause such connection reset type of failures.

               

              You may want to reduce the number of work item threads or increase the limit on total number of open file descriptors on your appserver process to see if it helps.

              • 4. Re: java.io.SyncFailedException: lost message framing
                Yanick Girouard

                Our app servers is running on Windows, not Unix... Windows doesn't have a limit other than memory and memory was pretty much dormant at the time of the issue, I checked the performance stats of the VM (it's a virtual machine).

                • 5. Re: java.io.SyncFailedException: lost message framing
                  Bill Robinson

                  multiple appservers here?  this should be a message from the blcli failing to connect to the appserver url it has that's in the session.

                   

                  how is the blcli called in this script ?

                   

                  you can also dump the CLI_SSO_CREDS variable - that will contain the connection info that's being set in the job.

                  • 6. Re: java.io.SyncFailedException: lost message framing
                    Yanick Girouard

                    I can add some debug info and check that variable in case it happens again, but it's rare so far. We have 4 app servers yes, and we are simply doing a blcli_execute in the script do do what we need, there's no blcli_connect. This is a type 2 NSH script job.

                    • 7. Re: java.io.SyncFailedException: lost message framing
                      Bill Robinson

                      The blcli_execute will run the connect if the session isn’t connected.

                      • 8. Re: java.io.SyncFailedException: lost message framing
                        Yanick Girouard

                        So something else fluked then?

                         

                        This is an old script that wasn't yet optimized. It used to be all blcli calls without the _execute, which was killing the app severs. I had them all changed, but the owner also needs to change all the instances of VAR=`blcli_execute ...` by just the blcli_execute and blcli_storeenv VAR instead of calling it with backticks. Maybe that's part of the issue...

                         

                        The script basically empties a job of targets and target groups, to add a new target group in it, calls the job and then clears the targets again.

                         

                        I wrote a new version that is more optimized and uses executeAgainstAndWaitForRunId instead, which I run in parallel to this one with a fake job (this is for server reboots), and mine works perfectly, so maybe that's just it.

                        • 9. Re: java.io.SyncFailedException: lost message framing
                          Bill Robinson

                          Hmm – so the VAR=`blcli_execute ..` spawns a new process and it might not get the connection settings, so that could throw the error.

                          • 10. Re: java.io.SyncFailedException: lost message framing
                            Yanick Girouard

                            Yeah that's what I think... but why is it working 99.9% of the time though, that's the weird part. Could it be a capacity issue with the NSH Proxy? I don't see any errors along those lines in the app server logs though...

                            • 11. Re: java.io.SyncFailedException: lost message framing
                              Bill Robinson

                              Blcli won’t connect to the proxy.