On a Friday night, there’s nothing I like better than to curl up with a good glass of red wine, a selection of fine cheeses and the latest set of log files which, especially for the occasion, I have printed out and bound. I start at page one, and I quickly get caught up in the various adventures of the session id, the interplay of the threads with the HTTP requests, and of course I always finish with the latest collection of memory dumps. The perfect way to start my weekend.
No really, I’m only joking. I don’t even like wine. And as for the log files, well, it doesn’t really work like that. You see, you need to know what you’re looking for in those log files. You can’t just stare at them for 10 minutes and hope to understand what happened on an application level. At their best, log files offer an incomplete picture of what happened.
But is that a bad thing? I don’t think it is. But here’s the thing, you really need to understand what these files are trying to tell you. You need to approach them with a specific idea in mind of what you want to achieve, what you are hoping to find. Sure, if your aim is to get an overview of the general performance of an application (think slowest SQL queries or the average processing time of HTTP requests), put your logs through a log analyser and use the reports. That’s generally a good approach, but sometimes there’s a tendency to over-rely on these sort of reports. If it doesn’t resolve your problem, you need to use a more forensic approach; you need to know how to read the logs.
Let’s have a look at the various logs we have on offer for Mid-Tier and AR server and let’s try to follow a complete conversation from start to finish. I’ll use version 8.1 for now, but as the log format changed somewhat in version 9, I’ll write a follow-up on this next month.
So here’s my problem statement: at certain times of the day users complain that the system slows down. We’ve collected set after set of AR server logs, but the log analyser hasn’t come up with anything conclusive. There are no obvious problems on the database, the network team can’t find anything, and we’re a bit stuck. So where do we go next?
Well, you need to get specific, really look at how requests are handled in the log files, follow a complete conversation from start to finish and see how this is handled. In other words, you need to adopt a more forensic approach.
A conversation is spread across multiple log files, let’s first remind ourselves what AR System looks like:
So the client (usually the browser) communicates with the web server, which in turn communicates with the JSP engine. Mid-Tier communicates via API calls with AR server which in turn communicates via SQL with the database. Here are the logs I feel help the most in logging a complete conversation:
First stop, the access logs. These are the web server logs that record all the requests that are received. You can of course use Fiddler as well, but at this stage I’m more interested in the bigger picture. What sort of requests are received, what answer goes back and crucially (at least in this case) how long do these requests take?
Next, the Mid-Tier logs. Mid-Tier doesn’t log everything at the same level of detail, but what we’re primarily interested in are the backchannel requests, which are logged with enough detail to work with. I need to know if they arrive safely on Mid-Tier and what Mid-Tier does with them. I included an API trace as well, more about this later.
From AR server we collect the API/SQL logs, the first part to know what sort of API calls the request results into, and the SQL calls to know what we actually send to the database. I like to use the combined format, with everything in one file it makes for easier reading.
Good, let’s get started. It’s all about following conversations in logs, you need to know where it starts and you need to know how to follow the conversation to the end. But you need employ a more forensic method, what I mean by that you need to prove or disprove something. That might sound a bit theoretical, but let’s have a look.
The access logs are a great start for any log analysis. It’s a log file that’s easily overlooked, and although it doesn’t provide the same amount of information as a proper HTTP trace (think Fiddler), it captures all the requests that come in. You get a much better picture of how a server is performing and you don’t have to capture any logs on the client side. With a little tweaking (see my article here for more information) this is the information you can get out of this:
The IP data tells us where the request comes from (as in the browser), we have the timestamp, the URL, the HTTP status code which tells us what the outcome was, the amount of bytes that are returned and the total time it took for the request to get processed.
I like to convert them to Excel via a tool I wrote. What I’m looking for are requests that are slow. Not necessarily the slowest, but requests that I’d expect to go through relatively quickly. The slowest requests are not necessarily a problem, I’d expect a request for an attachment of a report to take a lot of time. But check the requests I highlighted, I sorted these by URL to get to see how they perform. They are identical, same URL, same amount of bytes (469). They get small amounts of data. It’s not that they take exceptionally long, but sometimes the requests go from a few milliseconds to 4 seconds. That shouldn’t happen.
So let’s follow two requests, one that works okay and another one that takes too long and see what happens. We now have the start of the conversation, let’s see where Mid-Tier picks it up. That’s quite easy to do. I know the timestamp (that’s the time it sends the response, not the time it arrives) and the URL so I can just look that up in the Mid-Tier logs. I went for the categories Servlet, performance and Internal, which give me the best information. Let’s have a look at what’s logged:
I used the thread ID to sort everything, every request uses the same thread so you need to use this ID to string everything together, I left it out to make it a bit more readable. In both cases Mid-Tier picks up the requests at the correct time, they are extracted and sent to AR server. But then Mid-Tier starts waiting until it gets an answer from AR server. In the first case it only has to wait a few milliseconds, in the second case it waits seconds.
What Mid-Tier is doing is translating the backchannel request to a series of API calls and sending these to AR server. The API calls are specific requests or instructions using a protocol which both Mid-Tier and AR server understand. The Mid-Tier logs don’t actually record the exact moment these calls leave the server. It’s probably the last line of the mapProperties, but to be absolutely sure you need to check the API trace.
It’s not a log I use very often, but if I want to know exactly how a request travels across the various servers, that’s the log I use on Mid-Tier. There are a lot of logs on ARS that are called API logs and it can get a bit confusing. The log on Mid-Tier is the one that records the API traffic from the Mid-Tier perspective, it record everything that is sent and received. It’s not the same as the API log on AR server or the API log which you can enable via the Mid-Tier Configuration tool. This is something you need to enable via a specific XML file. I’ve included instructions below.
API trace logs record all the API traffic from all the users on that Mid-Tier, logs tend to grow fairly quickly so you need to be careful. There’s a lot of data recorded in these logs so it can be a bit tricky to get the bit you need. I usually use the timestamp as recorded by Mid-Tier and some details of the request, a form name, value, those sort of things. This is what I find:
What we’re looking at are the representations of two individual calls that Mid-Tier sends to AR server. Mid-Tier converted the request to specific calls, the values are all recorded and we now know exactly at what moment the request leaves the system and at what time the answer comes back.
So in our test case, we now know the delay is not on the web server side or the Mid-Tier side. The request leaves the server on time and the Mid-Tier server is just waiting for the answer.
So what’s next? Well, you need to look at the next piece of the conversation, AR server. The best log for this is the combined API/SQL log. Since we’re now communicating via API calls between Mid-Tier and AR server and we already know what the API call looks like that Mid-Tier sends off, we can just check for these calls as they arrive on AR server.
There’s no unique ID we can use to link the logs (I’ve raised an Idea for this), but you can use details like the sort of request, timestamp and field names, values, etc. to find a match. Similar to the Thread ID in the Mid-Tier logs you can use the RPC ID in the API/SQL logs to string everything together. This is what I came up with:
Notice that the name of the API call is different from the one used by Mid-Tier. So what we have now is the API call sent by Mid-Tier and the same API call recorded by AR server. But there’s something odd here. Notice the timestamp. That’s the moment AR server is first aware of the request. When everything is okay that’s just milliseconds later than the timestamp Mid-Tier recorded, but check the timestamp of the second API call, there’s a big difference here It’s recorded much later. When it’s received it’s processed fairly quickly, and the answer goes back within milliseconds to Mid-Tier. But why it received so late?
If you ask a Mid-Tier engineer he will tell you the timestamp recorded in the API trace is the moment the request leaves the server and it’s an AR server delay. But if you ask an AR server engineer he will tell you that the timestamp recorded in the server API log is the moment the AR server is first aware of the request and it’s a Mid-Tier delay. You basically have two parts of a conversation and there’s a gap of a few seconds in the middle which is unaccounted for.
At this stage we know it’s not a web server delay, we know it’s not delaying on the Mid-Tier server and we know it’s not delaying on the AR server, you now need to prove (or disprove) it’s the network. It’s the only part we haven’t looked at yet.
So how you do this? Well, that’s last piece of the puzzle and it’s a difficult one. All the logs we’ve looked at so far are logs on application level, but now we need to check what happens between the two applications, but there’s no neat log file for this. Instead we need to check the network traffic. The way to do this is by using network traces. You basically record all the traffic that flows over the network card. If you do this on both the machines that host Mid-Tier and AR server you should get a good picture of what’s happening. How to do this depends on the OS, in this case it’s Linux which means we need to use tcpdump: tcpdump –I <interface> -s 65535 –w <file>
I’ve included an article with more information, but the basic idea is to capture the network traffic from two sides with the API logs (also on two sides) to get the correct details. The application to read network data is Wireshark, so once you have the logs, open them via Wireshark and have a look.
Not as easy to read as the other logs. What you’re looking at is basically raw data, so you need to look for specific pieces of traffic here. The first thing I do is filter by TCP traffic. There’s a lot of stuff logged here, but I know the API traffic between the two servers are sent via TCP.
Next I use the same method I used for the other logs. Using a combination of timestamps, field IDs, field values I scan through the logs for any requests that match. I also keep an eye on the IP addresses here since I’m looking for traffic between the Mid-Tier and AR server. I know both IPs so I know which is the sender and the receiver. Here’s what the traffic looks like that I found:
From a Mid-Tier perspective it looks like it leaves the server on time, but it doesn’t get an answer back until four minutes later. The network from the AR server machine confirms this, the traffic comes in too late so there’s nothing else AR server can do. You see, the network is delaying this here, something goes wrong when the packages are sent between the two servers. It’s a network problem and the network team can probably answer where this is going wrong.
I know this all quite detailed, but my point is that none of these problems would show up if you only look at the logs using reports that record the slowest call, highest threads, etc. You can only find this out if you actually look at the traffic. If you follow a conversation in the log files you can see exactly where what happens. You can prove it’s not the web server, then you check if it’s the Mid-Tier server that’s causing the problem. If not, follow the conversation to AR server, the database, etc. I often just take a few similar requests over a certain period and check how they perform. If there are any differences, I check where it spends most of its time and try to find out why.
That said, your first port of call when dealing with performance problems are probably still the log analyser and the access logs, but if that doesn’t help and the circumstances are suspicious, you’ve got it to get down to the details.