What version of AO platform?
NUMBER_OF_UNANSWERED_REQUESTS is just the number of requests sent out to other peers in the grid for which no response has been received yet. Once a response is received, that number is decremented. If you are looking at grid performance, some of the numbers you want to watch are "COUNT_OF_COMPLETED_PROCESSES", "COUNT_OF_FAILED_PROCESSES", "COUNT_OF_STARTED_PROCESSES" and "NUMBER_OF_RUNNING_PROCESSES", and how they trend over a period of time.
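For what it's worth, trending those counters is just a matter of sampling them periodically and taking deltas; a minimal sketch, with hard-coded snapshots standing in for values you would read from the health stats:

```python
# Sketch: trend grid process counters by sampling them at an interval and
# computing per-second rates. The snapshots below are hard-coded stand-ins
# for values read from the platform's health stats.

def process_rates(before, after, interval_secs):
    """Return per-second rates for each counter between two snapshots."""
    return {name: (after[name] - before[name]) / interval_secs
            for name in before}

snap_t0 = {"COUNT_OF_STARTED_PROCESSES": 156929,
           "COUNT_OF_COMPLETED_PROCESSES": 156926,
           "COUNT_OF_FAILED_PROCESSES": 3}
snap_t1 = {"COUNT_OF_STARTED_PROCESSES": 157529,
           "COUNT_OF_COMPLETED_PROCESSES": 157520,
           "COUNT_OF_FAILED_PROCESSES": 3}

rates = process_rates(snap_t0, snap_t1, interval_secs=600)
# Started and completed rates should track each other; a growing gap
# (or a rising failed rate) is the thing to investigate.
print(rates)
```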
If the platform version is anything but 7.6.02.sp4, I highly recommend upgrading to SP4, since there are important fixes/changes related to peer and grid performance.
It is interesting you mention that processes complete in 500 msecs, but web service calls are taking 30 seconds to complete. Can you elaborate on this? For the processes that complete in 500 msecs, how are they started: via schedules, rules, web service (ORCA web service or legacy web service) or OCP?
"maxThreads reached" - what is the system architecture: 32-bit or 64-bit? What value did you see for NUMBER_OF_THREADS?
The version is 7.6.02.02 (there is no intention in the short term to go to SP4, as we missed our upgrade window when SP3 was pulled).
Here are the values you refer to:

Scope     Counter                         Value
G         COUNT_OF_COMPLETED_PROCESSES    156926
G         COUNT_OF_FAILED_PROCESSES       3
G         COUNT_OF_STARTED_PROCESSES      156929
G         NUMBER_OF_RUNNING_PROCESSES     0
P:CDP1    COUNT_OF_COMPLETED_PROCESSES    143290
P:CDP1    COUNT_OF_FAILED_PROCESSES       3
P:CDP1    COUNT_OF_STARTED_PROCESSES      143293
P:CDP1    NUMBER_OF_RUNNING_PROCESSES     0
P:HACDP   COUNT_OF_COMPLETED_PROCESSES    13636
P:HACDP   COUNT_OF_FAILED_PROCESSES       0
P:HACDP   COUNT_OF_STARTED_PROCESSES      13636
P:HACDP   NUMBER_OF_RUNNING_PROCESSES     0
I know what the failed processes are and they are unrelated to this performance issue.
Number of threads was 432, and it's on a 64-bit platform.
It's quite a simple process, started by a legacy web services call: it takes a single parameter,
performs a small amount of manipulation and then writes a single row to a Remedy database.
The total process run time is around 500 ms, with the Remedy write taking the lion's share of the time.
We are doing load testing at the moment and finding some interesting results.
For the results below I have used a different test environment to the one above, which doesn't
have an HACDP, to eliminate grid traffic.
A basic load test is performed using SOAPUI with a simple load strategy:
20 threads, 100 ms delay, 0.5 random spread, over 600 minutes.
This gives the following results.
VIA SSL CDP:      Max Response Time = 13.8 sec
                  Avg Response Time = 2.5 sec
VIA CDP (no SSL): Max Response Time = 10.9 sec
                  Avg Response Time = 2.7 sec
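For reference, the load pattern described above (20 threads, 100 ms base delay, 0.5 random spread) can be sketched roughly in Python. Here `call_service()` is a hypothetical stand-in for the real SOAP POST to the CDP endpoint, and the spread is interpreted as a uniform variation around the base delay (SOAPUI's exact strategy may differ):

```python
# Rough sketch of the SOAPUI-style load test: N worker threads, each timing
# one request per iteration, pausing ~100 ms (with random spread) between
# calls, and recording response times for max/avg statistics.
import random
import threading
import time

def call_service():
    time.sleep(0.005)  # stand-in for the real SOAP round trip

def worker(duration_secs, results, lock):
    deadline = time.monotonic() + duration_secs
    while time.monotonic() < deadline:
        start = time.monotonic()
        call_service()
        elapsed = time.monotonic() - start
        with lock:
            results.append(elapsed)
        # 0.5 random spread read as +/-50% around the 100 ms base delay
        time.sleep(0.1 * (1 + random.uniform(-0.5, 0.5)))

def run_load_test(threads=20, duration_secs=1.0):
    results, lock = [], threading.Lock()
    pool = [threading.Thread(target=worker, args=(duration_secs, results, lock))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return max(results), sum(results) / len(results)

max_rt, avg_rt = run_load_test()
print(f"max={max_rt:.3f}s avg={avg_rt:.3f}s")
```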
So statistically the SSL is not really making much of a difference. The Remedy system was isolated (no other
users), so we can assume that the Remedy response time will be similar for all transactions.
So, assuming that the process itself takes around:
Max: 1.2 sec (the longest I saw in the logs)
Avg: 0.5 sec
the web services management layer is adding (using the no-SSL scenario):
Max: 9.7 sec
Avg: 2.2 sec
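Spelling out the arithmetic behind those overhead figures (observed response time minus process run time, using the non-SSL numbers):

```python
# Overhead attributed to the web services layer = response time - process time.
max_response, avg_response = 10.9, 2.7  # seconds, non-SSL CDP load test
max_process, avg_process = 1.2, 0.5     # seconds, from the process logs

max_overhead = max_response - max_process
avg_overhead = avg_response - avg_process
print(round(max_overhead, 1), round(avg_overhead, 1))  # 9.7 2.2
```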
Worth noting that a single isolated run of the web service responds in around 600 ms;
this issue only seems to come up when the web services layer is put under load.
In an effort to isolate network latency and Remedy/process overhead, I created a single
process whose fingerprint is the same but which performs no processing and just returns, on a
non-SSL CDP.
This was to establish the network latency and the overall performance of the web services layer.
A single run of the process is around 40 ms.
A run using the same parameters as above (20 threads, 100 ms delay, 0.5 spread, 600 seconds) gives:
Max Response: 1.9 sec
Avg: 0.366 sec
(Whilst the Avg is smaller than in the single isolated test, I just chalk this up to network, since we
are only talking about 0.04 sec.)
So statistically, network latency isn't really adding much to the mix.
It also isolates the cause of the delay to the consequences of longer-running processes returning
values to a 'waiting' web service.
Whilst the values I cite above are not in the 30-second range, the response times in the test
steadily get worse over the run time. We are going to start another run today, a 24-hour soak
test, and I will grab some process stats during that run to see what's happening.
If it follows previous tests, I would expect the response time to degrade over the
period of the test.
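One way to quantify that degradation during the soak test is to bucket response times into fixed windows and compare the window averages over the run; a sketch, with fabricated sample data standing in for real measurements:

```python
# Sketch: detect response-time degradation by bucketing (time-into-run,
# response-time) samples into fixed windows and comparing window averages.

def windowed_averages(samples, window):
    """samples: list of (secs_into_run, response_time_secs) tuples."""
    buckets = {}
    for t, rt in samples:
        buckets.setdefault(int(t // window), []).append(rt)
    return [sum(v) / len(v) for _, v in sorted(buckets.items())]

# Fabricated soak data: one sample per minute for 3 hours, with response
# times creeping upward over the run (the pattern described above).
samples = [(t, 0.6 + 0.0004 * t) for t in range(0, 10800, 60)]
averages = windowed_averages(samples, window=3600)
degrading = all(a < b for a, b in zip(averages, averages[1:]))
print(averages, degrading)
```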
Ranga, does that give a better view of what I am seeing? This is a general OOTB install, nothing exciting,
and I am at a loss as to what is causing the significant delay between process completion and the subsequent
response to the SOAP consumer.
Interestingly, in the past as the load has built up we have seen the "150 threads reached" error in the
grid.log file; once we get to that point the entire system starts to collapse.
As an indication, we are only trying to do 1 transaction per second, which is not extreme (nowhere
near my soak tests above), but over time the SOAP performance degrades whilst the process
performance stays roughly the same.
Can you send the exact error message you see in grid.log for the "150 threads reached" error?
The health stats numbers look normal and nothing untoward is happening there. A thread count of 432 is very low; under load this can go up to 900, 1,500 and beyond, and that is quite normal.
We need to run some internal tests to verify the scenario where, under load, legacy web service requests take time to return responses even though the associated jobs seem to complete normally without delay.
Is there any recommendation for memory size on a 64-bit architecture? -Xmx et al.?
It depends; anywhere from 2 GB to 8 GB. I have seen 16 GB being used as well, but it depends on the use case:
- Number of jobs per second (burst and sustained)
- Data size being passed between called processes
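For reference, the flags in question are the standard JVM heap switches (-Xms for initial heap, -Xmx for maximum heap). Illustrative values only; where they are set depends on the install:

```
-Xms2048m    # initial heap: 2 GB
-Xmx4096m    # maximum heap: 4 GB (raise toward 8 GB+ for heavy sustained load)
```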