14 Replies Latest reply: Oct 29, 2002 9:43 AM by ghalied

Agent Collector Check Script

rwetjen

Soliciting some help here. I'm writing a script to help our SAs check the status of data collection and troubleshoot the client installations (Unix). Most of it's pretty straightforward except checking whether a running collector is actually doing its thing. The script starts out by checking that the bgsagent and bgscollect processes are running. Easy enough, but that's no guarantee the collector is actually gathering data.

I'm looking for a clean, definitive way to check that the collector is working in all the key metric groups. I suppose I could check for the metric group directories in the dated collect directory and then check for a udr file. The only problem I see with doing it that way (other than it being clunky) is that the udr file is padded when it's created, so you could have an empty one and not know it. I'm checking semaphore and shared memory resources in the next section, but that's not definitive either, especially when the collector's already running. So, anyone have any ideas? Rich
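That first step, checking for the two processes, might be sketched like this. The process names bgsagent and bgscollect come from the post; the function name and messages are just illustrative, not part of the product.

```shell
#!/bin/sh
# Sketch of the first check described above: are the bgsagent and
# bgscollect processes running? Process names come from the post;
# everything else here is illustrative.

check_proc() {
    # Return 0 if at least one process line matches the given name.
    ps -ef | awk -v p="$1" '$0 ~ p && $0 !~ /awk/ { found = 1 }
                            END { exit !found }'
}

for proc in bgsagent bgscollect; do
    if check_proc "$proc"; then
        echo "$proc: running"
    else
        echo "$proc: NOT running"
    fi
done
```

As the post says, this only proves the processes exist, not that data is being gathered; the later checks in this thread cover that.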

  • 1. Just a thought
    ghalied
    find . -name '*.udr' -mtime 0 -size +1 -exec ls -l {} \;


    seems lean enough for some simple checking.

    I started doing some stuff with the output to check for files modified in the last 15 minutes, but it got 'clunky' as well. If no one else puts something up, I'll post that code.

    Thanx,
    Ghalied
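Where find(1) supports -mmin (GNU find and most current systems; many 2002-era Unix finds only had -mtime, in days), the "modified in the last 15 minutes" idea can be written directly. The collect path is a placeholder taken from later posts in this thread.

```shell
# Sketch of ghalied's 15-minute freshness check. COLLECT_DIR is a
# placeholder; -mmin is assumed to be available on your find(1).

COLLECT_DIR=${COLLECT_DIR:-/usr/adm/best1_default/collect}

if [ -d "$COLLECT_DIR" ]; then
    # udr files bigger than one block and written in the last 15 minutes
    find "$COLLECT_DIR" -name '*.udr' -mmin -15 -size +1 -exec ls -l {} \;
fi
```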

  • 2.
    rwetjen

    Ghalied, Thanks, this will work for what I'm trying to do. I've been noodling it around a bit, and the only thing it misses is the validity of the data. One example I've run into was an AIX collector that would write out over a GB of disk config data in the first interval and then very small spills for the rest of the run. Of course Analyze didn't understand any of it. Patches fixed it, but we've had enough of these oddities that I'd like to be able to test for valid data. Short of running it through Analyze, though, I can't think of a way. Rich

  • 3.
    Perry Stupp

    Rich,

    While I applaud your efforts to test that things are working 100% correctly, particularly given that the installations are in the hands of SAs who may not be as concerned about the data as you are, I'd like to encourage some discussion around installation best practices in this thread. I have spoken with a reasonable number of customers, all with large-scale implementations, who seem to have tamed the installation process. I can't speak for them and, being on the pre-sales side of things, I don't have first-hand experience with it myself, but I'm hoping some of them will come forward and discuss what they do to streamline and ensure the successful rollout of large numbers of new systems. I know there can be issues with the installation, but ensuring proper system configuration and patching should address the majority of those concerns out of the gate.

    To address your specific concerns, start by checking the agent and collect logs ($BEST1_HOME/bgs/monitor/log and $BEST1_HOME/bgs/log) for any obvious errors. Look for shmget and semctl errors, which are the best indicators of shared memory or semaphore identifier problems.
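A quick scan for those errors is easy to script. The log directories are the ones given above; the loop assumes BEST1_HOME is already set in the environment.

```shell
# Scan the agent and collector log directories named above for the
# shmget/semctl errors that indicate shared memory or semaphore trouble.
# Assumes BEST1_HOME is set; directories that don't exist are skipped.

for dir in "$BEST1_HOME/bgs/monitor/log" "$BEST1_HOME/bgs/log"; do
    [ -d "$dir" ] || continue
    # -l lists only the matching file names; -i catches either case
    grep -E -il 'shmget|semctl' "$dir"/* 2>/dev/null
done
```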

    An excellent test of both the collector and the agent is printudr (or Investigate, if you're at the console). Generally speaking, if printudr is working for each of the core metric groups then you will most likely be OK throughout the entire collection/processing cycle. See the thread http://communities.bmc.com/communities/thread/18475 for more details about printudr and thread http://communities.bmc.com/communities/thread/18077 for a list of core metric groups for both Unix and NT. I will try to post a more comprehensive policy that tests each of the core metric groups, but right now I don't have access to my machine. When run without the -r (repository for spilled data) option, printudr will validate the collector and the agent's ability to transfer data, but not necessarily the agent's ability to correctly spill the data for processing by Analyze.

    You'll also want to check that data is being spilled correctly by the agent. Based on the scenario you described, you'll want to wait a few spill intervals before doing your testing to ensure that things are working consistently. To expedite this, you might want to run a test with a shorter spill interval, although you're still looking at anywhere from a 3 to 15 minute delay before you can be 100% certain. I'll often use three five-minute spills and then delete the data. Provided the system is not enormous (hundreds of disks or tens of thousands of active processes), you can probably take your spill interval down as low as one minute (-i 1) without issue. To do this with 5-minute intervals, issue the following command locally:

    best1collect -e15 -i5 -d /usr/adm/best1_default/collect    (Unix)
    best1collect -e15 -i5 -d "C:\temp"                         (NT)

    For testing the quality/integrity of the spilled data you can use UDRviewer or printudr; printudr is probably cleaner and easier. Using the same aforementioned policy (yet to be posted) we can test the spilled data by using the -r (repository) option against the collect directory. I'll post the specific syntax for printudr when I post the policy.

    You should also be comfortable with the UDRviewer command, which is handy for a variety of things involving data. The simplest use of UDRviewer is:

    UDRviewer -r /usr/adm/best1_default/collect -t

    although using the syntax:

    UDRviewer -u .../collect/node/noInstance/DS/MetricGroup/*.udr -m

    (where ... is the path to the collect data, node is the name of the machine you are on, DS is a date-stamped directory created with your collection request, and MetricGroup is a specific metric group such as System_Configuration) you can see specific configuration data and collected values. (Note: on NT, aside from the different paths, you must explicitly name the udr file, as the command shell does not expand the wildcarded filename for UDRviewer.)

    That's all I can think of off the top of my head. Hope that helps.
    Regards,

    Perry.

  • 4.
    rwetjen

    Perry, Thanks, hadn't considered UDRviewer; I've never had occasion to use it. Invoking it with the -t option gives me just what I was after. What I'll do is check for a write to the udr file within spill + 5. At this point I have the full path to the specific udr file, so I can test each one specifically. I'm assuming that the response "Memory test PASSED" with the -u and -t options means the data isn't corrupted? If both the write and UDRviewer tests pass, I'll call it a successful udr file. Rich
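Rich's two-part test could be sketched roughly like this. The spill-interval value and the exact "Memory test PASSED" string are taken from this thread, not from documentation, so treat both as assumptions to verify against real UDRviewer output.

```shell
# Sketch of the check Rich describes: a udr file passes if it was written
# within (spill interval + 5) minutes AND UDRviewer's -t memory test
# reports PASSED. SPILL_MIN and the PASSED string are assumptions from
# the posts in this thread.

SPILL_MIN=${SPILL_MIN:-15}

check_udr() {
    # $1 = full path to a specific .udr file
    if ! find "$1" -mmin -$((SPILL_MIN + 5)) | grep -q .; then
        echo "$1: stale - no write in the last $((SPILL_MIN + 5)) minutes"
        return 1
    fi
    if ! UDRviewer -u "$1" -t | grep -q 'Memory test PASSED'; then
        echo "$1: failed UDRviewer memory test"
        return 1
    fi
    echo "$1: OK"
}
```

Call it once per udr file found under the dated collect directory; any non-zero return marks that metric group as suspect.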

  • 5. Checking from Console
    notnavi

    Expanding upon this thread....
    I am working on a similar script that will run from the console to check the status of the data collectors, and I am trying to determine a method to check whether a data collector is hung without using rsh.
    Any ideas?
    JD

  • 6.
    Perry Stupp

    John,

    If you're running the Unix console then this functionality more or less already exists in the manager run. The "Agent Restart" option (Options -> Collect -> Restart Options) will check the status of each node in the domain and attempt to restart collection if it has stopped. The restart option will send an e-mail if it detects that collection has stopped on any node for 2 periods (30 minutes by default). If you want to do something like this yourself, take a look at http://communities.bmc.com/communities/message/55670&highlight=#6510 for more details. Neither of these approaches, however, tests the quality of the data as Rich is trying to do above. To do that, you need something a little more sophisticated that can be run on the remote machine. If you have it available, you can always use PATROL to monitor things and handle error messages through any number of mechanisms. Alternatively, you could use UMX for this type of thing. Neither approach requires anything like rsh.

    Regards,

    Perry.

  • 7. Hung collectors
    notnavi

    Perry,
    I guess I was not clear in my original posting. The restart is not the problem; the problem I am trying to monitor for is when the collector is hung, i.e. the collector is running but no data is being collected.
    We have encountered this quite a bit, mostly with NT, but a few times with UNIX clients.
    JD

  • 8.
    Perry Stupp

    John,

    I'll be honest with you: I'm not really much of a "support" person. My work normally begins when things are working correctly, so although I'll certainly look into what can be done to help, I just don't run into enough problems to know everything that can go wrong. I would start by contacting support to ensure there are no known problems that have already been addressed by patches, or perhaps by the new 6.6.10 release that just went GA. You should also try to rule out things like running out of disk space on the collection filesystem. Once you get beyond that, you can use any of the tools at your disposal to try to identify the possible problems. These tools can be scheduled through PATROL or initiated through UMX (via Investigate).
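The disk-space item above is the easiest to automate. A minimal sketch, where the collect path and the 90% threshold are illustrative choices, not product defaults:

```shell
# Rule out a full collection filesystem, as suggested above. The path
# and the 90% threshold are illustrative, not product defaults.

COLLECT_DIR=${COLLECT_DIR:-/usr/adm/best1_default/collect}
THRESHOLD=90

# df -P guarantees one record per filesystem; field 5 is "Use%"
pct=$(df -P "$COLLECT_DIR" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')

if [ -n "$pct" ] && [ "$pct" -ge "$THRESHOLD" ]; then
    echo "WARNING: filesystem holding $COLLECT_DIR is ${pct}% full"
fi
```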

    I'm going to be attending a product planning meeting next week, and it's clear that we need more tools around this type of thing (or perhaps to expose more of these tools to end users via KMs, etc.). If anyone out there has any specific comments or concerns that you would like me to present to PM&D, please send them along to me at mailto:perry_stupp@bmc.com. Otherwise, I'll do my best to incorporate the comments that I'm seeing here on devcon.

    Thanks,

    Perry.

  • 9.
    rwetjen

    John, This is very similar to what I'm working on, although I'm focusing solely on Unix at this point. Rather than looking for a hung collector, I'm attacking it from the other direction: checking that the collector's running, kernel resources are adequate, it's writing to file within the last spill, and the data isn't corrupted. You'd think that would uncover an errant collector?

    Did run into a really interesting one yesterday. I'm not opening a support case on it because it's an older 6.5.10 collector on Solaris scheduled for upgrade soon. The collectors are running (both system and Sybase) and collecting all metrics/groups, but all values (and I mean every one) are zero. This is a new one on me; wondering if anyone else has seen it. I've encountered output with one or two metric groups zeroed, but never before all of them. I'm reasonably sure that upgrading to the newer patched collectors will correct the problem, but I'm wondering if I should add a test to the script to look for this as well. Rich

  • 10.
    Perry Stupp

    Rich,

    This looks like a good opportunity for me to get back on track with our earlier thread. Sorry I couldn't be more definitive on the UDRviewer tests; to be perfectly honest, not having any more documentation than you do, I'm not 100% certain what constitutes VALID and when you'll see otherwise. I know, however, that I can count on you to post whatever you find in your testing. I'll do my best to look into this next week while I'm in Waltham, and I'll post an update.

    Keep up the vigilance and rest assured that this will be the source of much discussion at the PM&D meeting next week!

    Regards,

    Perry.

  • 11.
    rwetjen

    Perry, This one with the zero values has really got my attention. It passes the UDRviewer test because zero is a valid value. But I'd say there's a problem when every metric for both the system and db collectors over 96 intervals is zero. The only way I can think to trap this error is to check values in some selected metrics that logically can't be zero. Ah, the pursuit of certainty in an uncertain world. Rich
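One way to trap the all-zero case is to dump a metric group that logically cannot be all zero and fail when no non-zero value appears anywhere. The exact layout of UDRviewer output isn't documented in this thread, so the filter below just scans every whitespace-separated numeric field; treat it as a sketch to tighten against real output.

```shell
# Fail when a dump contains no non-zero numeric field at all. Intended
# usage would be something like:
#     UDRviewer -u <file>.udr -m | all_zero && echo "suspect: all zeros"
# The parsing is an assumption; UDRviewer's real output layout may need
# a more careful filter.

all_zero() {
    awk '{ for (i = 1; i <= NF; i++)
               if ($i + 0 != 0) nz = 1 }
         END { exit nz ? 1 : 0 }'   # exit 0 = every numeric field was zero
}
```

Non-numeric fields (metric names, headers) coerce to zero in awk, so only genuinely numeric values can trip the check; a field like "15min" would coerce to 15, which is why the output layout matters.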

  • 12. After
    ghalied

    And the next step...
    After finding the metrics that are not recording properly, the next step is to troubleshoot them. For metrics whose values are zero, where would I start?
    I know this is a somewhat broad question, but I'm sure those with experience have guidelines to follow, pitfalls to look out for, etc.

  • 13.
    Perry Stupp

    Ghalied,

    I would definitely encourage you to engage customer support in situations like that, if for no other reason than to ensure that the problem is logged.

    Rich,
    I saw some things yesterday that will be available in 7.1.01 that I definitely think will help with this effort. The best1collect -Q status query will include additional information regarding the number of metric groups that are collecting data, serious problems with data collection, and some of the agent error messages that were previously available only in logs on the remote machine. It's hard to say whether this will address all of the problems you are encountering, but it definitely looks like a step in the right direction. I have been pushing your concerns forward, and PM&D have heard them loud and clear. I suspect there will be a lag before everything is addressed, but I'll continue to help where I can to make this process more reliable.

    Regards,

    Perry.

    "Definitely", is definitely the word of the day of the day today - no doubt!

  • 14.
    ghalied

    "Perry, This one with the zero values has really got my attention. It passes the UDRviewer test because zero is a valid value. But I'd say there's a problem when every metric for both the sysem and db collector over 96 intervals is zero. The only way I can think to trap this error is to check values in some selected metrics that logically can't be zero. Ah, the persuit of certainty in an uncertain world. Rich "

    I'm interested to know if anyone has thought of a way to test for the above situation yet.

    Thanx,
    Ghalied