5 Replies Latest reply on Mar 19, 2012 10:37 PM by Ranganath Samudrala

    increasing maxThreads

    Ross Williamson

      All,

       

      We have a situation where AO takes inbound web service requests, performs some actions and then returns a result.  Under high load we are seeing that the AO processes complete in around 500 ms, but the web service calls take around 30 seconds to complete (30 seconds to return from the viewpoint of the caller).

       

      I am seeing NUMBER_OF_UNANSWERED_REQUESTS floating around the 80 mark, but the grid itself is healthy.  Now the obvious answer is to add a second node (HACDP or AP) and balance the requests, but we are doing failure scenario testing and this behaviour is seriously killing overall service transaction performance.  Does anyone have any experience with increasing the Tomcat maxThreads parameter?

       

      The maxThreads reached warning appears in the catalina.out log, so I know that we are reaching this limit.  I cannot find any guidance on this in the KB or documentation.

       

      Any comments/thoughts?

        • 1. increasing maxThreads
          Ranganath Samudrala

          What version of AO platform?

           

          NUMBER_OF_UNANSWERED_REQUESTS is just the number of requests sent out to other peers in the grid for which no response has been received yet. Once a response is received, that number is decremented. If you are looking at grid performance, some of the numbers you want to look at are "COUNT_OF_COMPLETED_PROCESSES", "COUNT_OF_FAILED_PROCESSES", "COUNT_OF_STARTED_PROCESSES" and "NUMBER_OF_RUNNING_PROCESSES", and how they are trending over a period of time.

           

          If the platform version is anything but 7.6.02.sp4, I highly recommend upgrading to SP4, since there are important fixes/changes related to peer and grid performance.

           

          It is interesting you mention that processes complete in 500 msecs, but web service calls are taking 30 seconds to complete. Can you elaborate on this? How are the processes that complete in 500 msecs started: via schedules, rules, web service (orca web service or legacy web service) or OCP?

           

          "maxThreads reached" - what is the system architecture - 32bit or 64bit? What value did you see for NUMBER_OF_THREADS ?

           

          Ranga

          • 2. Re: increasing maxThreads
            Ross Williamson

            Ranga,

             

            The version is 7.6.02.02 (there is no intention in the short term to go to SP4, as we missed our upgrade window when SP3 was pulled).

             

            Here are the values you refer to

             

            Scope      Counter                         Value
            G          COUNT_OF_COMPLETED_PROCESSES    156926
            G          COUNT_OF_FAILED_PROCESSES       3
            G          COUNT_OF_STARTED_PROCESSES      156929
            G          NUMBER_OF_RUNNING_PROCESSES     0
            P:CDP1     COUNT_OF_COMPLETED_PROCESSES    143290
            P:CDP1     COUNT_OF_FAILED_PROCESSES       3
            P:CDP1     COUNT_OF_STARTED_PROCESSES      143293
            P:CDP1     NUMBER_OF_RUNNING_PROCESSES     0
            P:HACDP    COUNT_OF_COMPLETED_PROCESSES    13636
            P:HACDP    COUNT_OF_FAILED_PROCESSES       0
            P:HACDP    COUNT_OF_STARTED_PROCESSES      13636
            P:HACDP    NUMBER_OF_RUNNING_PROCESSES     0
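
            As a quick sanity check on these counters, here is a minimal Python sketch. It assumes (from the counter names only, not from any AO documentation) that every started process ends up completed, failed or still running, and that the grid-level (G) figures are the sum of the peer-level (P:) figures. The dictionary keys are illustrative names, not AO identifiers.

```python
# Counter values copied from the health stats posted in this thread.
stats = {
    "G":       {"started": 156929, "completed": 156926, "failed": 3, "running": 0},
    "P:CDP1":  {"started": 143293, "completed": 143290, "failed": 3, "running": 0},
    "P:HACDP": {"started": 13636,  "completed": 13636,  "failed": 0, "running": 0},
}


def unaccounted(s):
    """Started processes that are neither completed, failed, nor running.

    Assumption: started = completed + failed + running for a healthy grid.
    """
    return s["started"] - s["completed"] - s["failed"] - s["running"]


for scope, s in stats.items():
    print(scope, "unaccounted:", unaccounted(s))

# Assumption: the grid total is the sum of the two peers.
peers_started = stats["P:CDP1"]["started"] + stats["P:HACDP"]["started"]
print("peer total matches grid total:", peers_started == stats["G"]["started"])
```

            If the assumption holds, a non-zero "unaccounted" value that grows between samples would be the kind of trend worth watching during a load test.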

             

            I know what the failed processes are and they are unrelated to this performance issue.

             

            Number of threads was 432 and it's on a 64bit platform.

             

            The process:

            It's quite a simple process that is started by a legacy web services call, takes a single parameter,

            performs a small amount of manipulation and then writes a single row to a Remedy database.

            The total process run time is around 500 ms, with the Remedy write taking the lion's share of the time.

             

            We are doing load testing at the moment and finding some interesting results.

             

            For the results below I have used a different test environment to the one above, one which doesn't

            have a HACDP to eliminate grid traffic.

             

            A basic load test is performed using SOAPUI with a simple load test.

            20 Threads, 100 ms delay, 0.5 random spread, over 600 minutes.

            This gives the following results.

             

            VIA SSL CDP:  Max Response Time = 13.8 sec

                             Avg Response Time = 2.5 sec

             

            VIA CDP (NO SSL) : Max Response Time = 10.9 sec

                             Avg Response Time = 2.7 sec
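
            To put these averages next to a request rate, here is a rough Python sketch. It assumes the load test behaves as a closed loop, i.e. each of the 20 threads sends its next request only after the previous response arrives plus the configured delay, and it ignores the random spread. Under that assumption throughput is roughly threads / (response time + delay). The function name is illustrative.

```python
def closed_loop_throughput(threads, avg_response_s, delay_s):
    """Approximate requests per second for a closed-loop load test.

    Assumption: every thread cycles through "send, wait for response,
    pause for delay_s" and the random spread on the delay is ignored.
    """
    return threads / (avg_response_s + delay_s)


# 20 threads and 100 ms delay, with the average response times quoted above.
ssl_rate = closed_loop_throughput(20, 2.5, 0.1)
no_ssl_rate = closed_loop_throughput(20, 2.7, 0.1)

print(f"SSL:    about {ssl_rate:.2f} requests/sec")
print(f"No SSL: about {no_ssl_rate:.2f} requests/sec")
```

            This is only an estimate; the load tool's own transactions-per-second figure, if it was recorded, is the better number.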

             

            So statistically the SSL is not really making much of a difference.  The Remedy system was isolated (no other

            users) so we can assume that the Remedy response time will be similar for all transactions.

             

            So with the assumption that the process takes around

            Max : 1.2 sec (This is the longest I saw in the logs)

            Average : 0.5 sec

             

            The web services management layer is adding the following (using the no-SSL scenario):

            Max : 9.7 sec

            Avg : 2.2 sec
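
            The subtraction behind those two figures, as a small Python sketch using only the numbers quoted in this post. Note the max figure is an upper bound rather than a measurement, since the slowest response did not necessarily belong to the slowest process run.

```python
# Figures from the no-SSL load test and the process logs quoted above (seconds).
response = {"max": 10.9, "avg": 2.7}
process = {"max": 1.2, "avg": 0.5}

# Time not explained by the process itself.
overhead = {k: round(response[k] - process[k], 1) for k in response}

# Fraction of the caller-visible response time spent outside the process.
share = {k: round(overhead[k] / response[k], 2) for k in response}

print("overhead (s):", overhead)
print("share of response time:", share)
```

            On these numbers, most of the time the caller waits is spent outside the process itself.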

             

            Worth noting that a single isolated run of the webservice responds in around 600ms,

            this issue seems to only come up when the web services layer is put under load.

             

             

            In an effort to isolate network latency and remedy/process overhead I created a single

            process whose fingerprint is the same but performs no processing and just returns, on a

            NON SSL CDP.

            This was to establish the network latency and the overall performance of the webservices layer

             

            A single run of the process is around 40ms

            A run using the same parameters as above (20 threads, 100ms delay, 0.5 spread, 600 seconds)

            Gives

            Max Response: 1.9sec

            Avg : 0.366 sec

             

            (whilst the Avg is smaller than the single isolated test I just chalk this up to network, since we

            are only talking about 0.04 sec)

             

             

            So statistically, network latency isn't really adding much to the mix.

             

            It also isolates the cause of the delay to the consequences of longer-running processes returning

            values to a 'waiting' webservice.

             

            Whilst the values I cite above are not in the 30 second range, the response times to the test

            steadily get worse over the run time.  We are going to start another run today, which is a 24 hour soak

            test.  I will grab some process stats during that run to see what's happening.

             

            If it follows previous tests then I would expect that the response time will degrade over the

            period of the test.

             

             

            Ranga, does that give a better view of what I am seeing?  This is a general OOTB install, nothing exciting,

            and I am at a loss as to what is causing the significant delay between process completion and subsequent

            response to the SOAP consumer.

             

            Interestingly, in the past as the load has built up we have seen the 150 threads reached error in the

            grid.log file, once we get to that point the entire system starts to collapse.

             

            As an indication, we are only trying to do 1 transaction per second; this is not extreme (nowhere

            near my soak tests above) but over time the SOAP performance degrades whilst the process

            performance stays roughly the same.
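
            One way to relate the request rate, the degrading response time and a thread limit is Little's law (mean requests in flight = arrival rate x mean response time). The Python sketch below applies it to the figures in this thread. It assumes each in-flight web service request holds one thread for its whole duration and that the 150 limit applies to those threads; neither is confirmed here.

```python
def in_flight(arrival_rate_per_s, response_time_s):
    """Little's law: mean concurrent requests = arrival rate x mean time in system."""
    return arrival_rate_per_s * response_time_s


def saturating_response_time(thread_limit, arrival_rate_per_s):
    """Mean response time at which in-flight requests equal the thread limit.

    Assumption: one thread is held per in-flight request.
    """
    return thread_limit / arrival_rate_per_s


# 1 transaction per second, at the response times mentioned in this thread.
for w in (0.6, 2.7, 30.0):
    print(f"{w:>5.1f} s response -> {in_flight(1.0, w):.1f} requests in flight")

print("response time that fills 150 threads at 1 tps:",
      saturating_response_time(150, 1.0), "s")
```

            Under these assumptions a steady 1 transaction per second would only fill 150 threads if responses were taking minutes, so bursts, or threads held for something other than the request itself, would also be worth checking.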

             

            Any thoughts?

            • 3. Re: increasing maxThreads
              Ranganath Samudrala

              Can you send the exact error message you see in the grid.log for the "150 threads reached" error?

               

              The health stats numbers look normal and nothing untoward is happening there. A thread count of 432 is very low and under load this can go up to 900, 1500 and beyond and it is quite normal.

               

              We need to run some internal tests to verify the scenario where, under load, legacy web service requests take time to return responses even though the associated jobs seem to complete normally without delay.

               

              Ranga

              • 4. increasing maxThreads
                Ross Williamson

                Ranga,

                 

                Is there any recommendation for memory size on a 64bit architecture?  -Xmx et al?

                 

                Ross

                • 5. Re: increasing maxThreads
                  Ranganath Samudrala

                  Depends - anywhere from 2G to 8G. I have seen 16G being used as well, but it depends on the use case

                   

                  Number of jobs per second- burst and sustained

                  Data size being passed between call processes

                  Etc.

                   

                  Please send an email to ranga_samudrala@bmc.com with your use case scenarios so that we can help you with any issues you may be facing.

                   

                  Regards

                  Ranga

                   

