Share:|

This is my first blog post and the first blog post in the series of AR System Rocks articles.


In this article I will give you some tips that will help you interpreting pyARLogAnalyzer results.

 

Let's start with the first essential question.

 

Why analyzing API / SQL / ESCL logs?

In general, we have to analyze AR System server side logs in the following use cases:

  • Troubleshooting an issue: In that case, you may have to analyze other logs depending of the issue (CMDB, Approval, Plugins, Filter, etc..). When it's related to AR System workflow, we used to enable Filter logs too.
  • Troubleshooting performance issue: This is the specific use case for which we always enable at least API / SQL logs. But as performance issues may be caused by Escalations or workflow, we can also enable those logs.

 

I am using pyARLogAnalyzer in both use cases, but principally for the second use case. This article will cover only the second use case, the first use case will be covered in another future article AR System Rocks - Using pyARLogFilter for troubleshooting.

 

Logging options

Enabling AR System server side logging such as API / SQL / ESCL / FLTR impacts the performances of the solution (generally between 20% and 30%), I always use the following options in order to limit as far as possible the impact on the performances.

 

 

  • Buffer Logged Lines: This is where I am lying. I am not always enabling this option, I disable this option only when I have to troubleshot AR System server crash or AR System server thread crash. Indeed, in case of a crash of the process or of the thread, if the log lines are buffered in memory, you don't have the information in your log files of the root cause of the crash. In all other case, buffering log lines in memory have fewer impact on the performances of the system as the memory I/O is much faster than disk I/O.
  • Log Per Thread: Using that option, one log file will be generated for each AR System thread. Using that option other thread don't have to wait for the write permission in the log file when another thread is already writing to that log file. So the impacts on performances are fewer with that option. Furthermore, I prefer using that option as the logs generated are easier to read, so it's easier to troubleshot an issue using that option. This will be explained in another future article AR System Rocks - Using pyARLogFilter for troubleshooting.

 

Combined logs or not combined logs?

First a combined log is when you enable multiple logging areas (API / SQL  / FLTR for example) in the same log file. It can be useful for troubleshooting, but generally, I prefer having separate log files for each logging area. This will be explained in another future article AR System Rocks - Using pyARLogFilter for troubleshooting.

 

The good answer to that question is it depends what you want to do with those logs. If it's for manual troubleshooting, combined logs can help you a lot. If it's just for performance analysis, you don't need combined logs.

 

When enabling AR System server side logs?

The good answer for that question is never on a production system.

 

But, you can enable logs for a short period on production only for performance analysis or troubleshooting.

 

"A short period" may vary depending of what you want to do with those logs:

  • Troubleshooting an issue: Enable logs, reproduce the issue, disable logs. That's simple.
  • Troubleshooting an issue caused by a scheduled escalation during the night: Use an escalation on the same pool that runs 1 minute before to enable logs and use an escalation on the same pool that runs 1 minute after to disable logs. The form Configuration ARDBC is your friend. That's simple too.
  • Troubleshooting general performance issue: Wait for a peak of load (generally Monday morning between 10:30 - 11:30) and enable logs for at least 30 minutes of activity, then disable the logs.
  • Troubleshooting performance issue that occurs at specific periods: Enable logs before this period, and disable logs after this period.
  • Troubleshooting scheduled integrations performance issue: Enable logs before this integration starts, and disable logs after this integration finishes.

 

What's pyARLogAnalyzer?

pyARLogAnalyzer is a program that I develop and maintain which helps you analyzing AR System server side API and/or SQL and/or ESCL logs. You can analyze one or more log files in one time.

 

It's very useful to find performance issues as it gives you a lot of indicators. A lot yes, but for newbies, too much indicators (even if the results are highlighted in green, orange or red) that they are unable to understand for taking the good corrective action(s).

 

In the results:

  • Green indicators means: OK, no issue.
  • Orange indicators means: There is an issue, but that's not your priority in the investigations.
  • Red indicators means: There is an issue, fix it urgently.

 

That's why I am giving you in this article some tips that will guide you interpreting main indicators.

 

Interpreting pyARlogAnalyzer results

Top API / Top SQL / Top Queued API

First a quick introduction to those sections:

  • Top API: By default, it's the 100 longest API calls detected in the logs.
  • Top SQL: By default, it's the 100 longest SQL statements detected in the logs.
  • Top Queued API: By default, it's the 100 API calls that have been queued the longest time detected in the logs.

 

The description of each section is generally enough to know what you have to do with that, but let's see one use case that requires a deeper analysis. In that use case, you have log files for a 1 hour period and you know there is only 10 minutes during this hour when the performances goes bad.

 

Let's analyze the logs again using the --displaytime option. (This example is easy)

 

 

And look into the results.

 

 

 

 

We can see very long API Calls, but the longest SQL statement are not related to those calls. Let's look into the Top Queued API.

 

 

We clearly see a correlation between the longest API call that generate other API waiting in the queue. So that's quite easy to find the slow down period in that use case. But if it's not so easy, Excel is your friend as it have some sorting and filtering features. Let's copy/paste the Top API into Excel and sort by Start Time.

 

 

 

We can see in the results clearly what we found previously, in the Top API, there is no long API calls between the 2 dates in red.

 

Perfect!!! Now we found the period which cause the performance issue, we can refine the analysis for that period only and find the root cause.

 

SQL Regex Matching

This feature helps you to find if it's necessary spending time for tuning a specific SQL statement compared to another. If a SQL statement matches to the defined RegEx in the configuration file, specific statistics are calculated for this kind of SQL Statement.

 

The following screenshot shows you that the first SQL statement runs 18 times and the second runs 14566 times. So if we can optimize the second SQL statement (that's not the case), the gain will be higher than if we spend our time optimizing the first one.

 

 

 

Threads Statistics

This section presents a lot of indicators concerning AR System threads usage.

 

 

It's the time for a presentation of the main indicators and the tips related:

IndicatorQuick DefinitionRelated Tips
% Queued APIRepresents the percentage of API calls that have been queued for that queue and thread.

The short and wrong answer is: "Hey man, it's easy, you have just to increase the number of threads for that queue!"

The good answer is: "A deeper analysis is necessary to find the root cause. It the root cause, is not enough threads for that queue, increase it. Other wise fix the root cause!"

In general, don't increase the number of threads if that's not required. In general, that's just a workaround for lazy guys. That's said.

You may have to make a deeper analysis using the following sections:

  • Top API
  • Top SQL
  • Top Queued API
  • API Thread Idle Time Statistics
% Queued API TimeRepresents the percentage of queued time compared to the total execution time for that queue and thread.
Avg Time APIRepresents the average execution time of API calls for that queue and thread.

If average times are not good enough, you may have to make a deeper analysis using the following sections:

  • Top API
  • Top SQL
  • API Thread Idle Time Statistics
  • SQL Thread Idle Time Statistics
  • API Form Statistics
  • SQL Table Statistics
  • API User Form Statistics (available only with a detailed analysis: --detailed option used in the command line)
  • SQL User Table Statistics (available only with a detailed analysis: --detailed option used in the command line)
Avg Time SQLRepresents the average execution time of SQL statements for that queue and thread.

 

API Thread Idle Time Statistics / SQL Threads Idle Time Statistics

The API Thread Idle Time section is the most useful section to fine tune the number of threads of each queue in AR System.

The SQL Thread Idle Time section can be also used in the same way, but it's more for advanced users, experts and geeks. That's said.

 

 

 

First start a short description of what we are talking here. We are talking about idle time also known as gaps in other log analysis tools. As a schema is better than a long and boring discussion, let's just look the following schema. No other explanation is required.

 

 

Now let's talk about the main indicators.

IndicatorQuick DefinitionRelated Tips
% Idle Time > 1sRepresents the percentage of Idle Time that are longer than 1 second.If this indicator is > 70-80% for all threads in a queue, consider decreasing the number of threads for that queue.
% Idle Time = 0sRepresents  the percentage of times there is no idle time.
  • Be careful of the count before doing bad things if it's equal to 100%.
  • If more than 10% of threads in a queue have this indicator >= 90%, consider increasing the number of threads for that queue.

 

WARNING The previous tips are very good for fine tuning of the AR System threads configuration, but in order to do that you must first tune correctly all other areas.

 

API Form Statistics / SQL Table Statistics / API User Form Statistics / SQL User Table Statistics

Those sections are useful if you want to detect an issue related to a specific Form, Table and/or user activity.

 

 

 

 

 

In order to find which area you have to optimize first in order to gain the more, you can copy past those results into Excel and use the sorting and filtering capabilities as shown bellow.

 

 

 

Escalation analysis

Quick introduction

Escalations are executed on Pools. There is one thread and only one thread associated to one Pool. Escalations are usually (except using Developer Studio) executed on a time frame basis (scheduled or by interval) and are not related to the user activity.

 

So all escalations on the same pool are queued in the same thread. As the minimum interval you can use is one minute, you must ensure that all escalations in the same pool are processed in less than one minute. Otherwise the escalations are queuing, escalations are delayed and not processed on time. And, in the worst case, if escalations on the same pool are never executed in less than one minute, you will have constantly high load on the system generated by the escalation processing.

 

The following sections in the pyARLogAnalyzer results helps you finding where the issue is. I know the sections in the results file are in the opposite order, but I generally have a look in that order.

 

Escalation Pool Summary

This section presents the summary by escalation Pool.

 

 

Tips:

  • Consider that escalation are queuing and have a big impact on the load of the system when the Avg Time is closed to 60 seconds or higher.
  • Consider that escalations are queuing and have a medium impact on the load of the system when the Avg Time is closed to 40 seconds or higher and Max Time is closed to 60 seconds or higher.

 

Escalation Summary

This section helps you finding which escalation causes queuing or may cause queuing.

 

Escalation Statistics

This section helps you finding on which form causes queuing or may cause queuing. It's very useful in case of shared escalations across multiple forms.

 

 

SQL Table Not Found

This section presents the SQL statements for which no table have been found.

 

 

If you can see here SQL statements generated by AR System, consider adding the following option in your ar.conf file:

Enable-Unlimited-Log-Line-Length: T



 

Working with RPC ID's

Here I will just do a quick explanation of what a RPC ID is. RPC ID identifies a unique API call in the entire system. The RPC ID will be the same in all AR System log files for the same API Call.

 

I will talk about that deeper in another future article AR System Rocks - Using pyARLogFilter for troubleshooting.

 

References

I am sure you will envoy using pyARLogAnalyzer 5.1.2.

If you're too impatient to wait for the next article, you can trypyARLogFilter 2.0.5 too.

Here is the link to the documentation of Enable-Unlimited-Log-line-Lenght parameter https://docs.bmc.com/docs/display/ars81/ar.cfg+or+ar.conf+options+E-M.

Here is the next article in this serie AR System Rocks - Understanding RPC ID's for troubleshooting

 

Just one last tip... As I like the Ubuntu Font Familly because I find that the rendering is very smart for reading documents especially web pages, If this font is installed on your system, the results html file will use it by default.

 

I hope you appreciated that article and it will be useful for you.

 

Vincent.