Looking for comprehensive documentation for BMC TrueSight self-monitoring. Comprehensive self-monitoring is an inherently complex topic...
There are a lot of moving parts in the BMC TrueSight architecture. All of the components need to be configured, running, and connected in order for the product to function. Self-monitoring is essential in order to keep an eye on the entire monitoring enterprise and make sure everything is functioning normally. For self-monitoring to be completely effective, it has to cover the case when things are up and connected as well as the case when essential components are down or not working properly. The tool can’t monitor itself if components are not running or not working. I believe self-monitoring warrants a body of documentation to address this. I have not found anything in the BMC documentation that covers self-monitoring completely.
I have been told that BMC has beefed up the methodology and documentation in 11.0 regarding self-monitoring and that the documentation and methodology are compatible with 10.7. However, I have not been able to find anything related to this. Does anyone have documentation links regarding self-monitoring? Essentially, I need to know as much as possible regarding monitoring for things that can go wrong. I understand that there is a self-monitoring KM that monitors the TSPS components and maybe other things, but I haven't found much documentation that provides details. While the native monitors are still available, they should probably only be used as a last resort. In order to ensure that monitoring is functioning, we need to monitor the infrastructure components (Presentation Server, SSO, etc.). We need to be able to monitor the TSIM Server (the other day, the main cell crashed, rendering the TSIM Server pretty much defunct; how do we monitor that?). We need to be able to monitor PATROL Agent connectivity through the Integration Service as well as data and event flow (when we hit the maximum number of attributes, which caused subsequent PATROL Agents to block, the Agent Error log showed the PATROL Agent connected to the Integration Service but nothing else). We need to know as much as possible about how to monitor outside the tool (the tool can’t monitor itself if it’s down or not working). We need diagnostic methods to make command-line queries in case the console GUI is inaccessible or not functioning.
I understand that there is documentation in 11.0 that is relevant to 10.7, but I have not been able to find it.
Here are the kinds of things that are needed for Self-Monitoring:
TSIM Server is down or is not functioning.
Main presentation cell is down or not responding (this would adversely affect the TSIM Server)
Integration Service down or not functioning.
PATROL Agent(s) down or not functioning.
PATROL Agent unable to connect to Integration Service
PATROL Agent loses connection to Integration Service
PATROL Agent data not flowing into TrueSight (the gaps in data prior to 9.5 for which there were no events or notifications was insidious)
PATROL Agent events not flowing into TrueSight (could be related to Integration Service or cell)
TrueSight events not getting sent to Presentation Server
Some of these can be monitored by the product itself. However, others require external monitoring (the tool can't monitor itself if it's down).
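To illustrate the external-monitoring idea, here is a minimal sketch of a reachability check that runs on a host outside the TrueSight stack. It only tests whether a TCP port accepts connections, so it catches "process is down" but not "process is up and hung". The host names and port numbers below are placeholders I made up, not values from the BMC documentation; substitute whatever your environment actually uses.

```python
import socket

# Hypothetical endpoints: host names and port numbers are placeholders,
# not values taken from BMC documentation. Substitute your own.
ENDPOINTS = {
    "Presentation Server": ("tsps.example.com", 443),
    "TSIM Server": ("tsim.example.com", 443),
    "Integration Service": ("is01.example.com", 3183),
}


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_endpoints(endpoints, probe=port_is_open):
    """Return a dict mapping each component name to True (reachable) or False."""
    return {name: probe(host, port) for name, (host, port) in endpoints.items()}


def unreachable(results):
    """Return the sorted names of components whose probe failed."""
    return sorted(name for name, ok in results.items() if not ok)
```

Run from cron or a scheduler on an independent host, something like `unreachable(check_endpoints(ENDPOINTS))` returns the list of components to alert on, and the alert has to go out through a channel that does not depend on TrueSight itself.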
There is the legacy monitoring from the Self-Monitoring Integration Service (using native monitors). However, this is not state-of-the-art monitoring, and it focuses primarily on the TSIM Server.
From the BMC Product Documentation, BMC PATROL for TrueSight Self-Monitoring provides monitoring for the following:
> Presentation Server
> TSIM Server
> Cell components that handle events
> PATROL Agents
> Provides alerts and annotation
> Leverage existing KMs to monitor processes and databases
> Templates for configuration of KMs
> Agent actions for health reports
> Remote monitoring of an entire environment using a single PATROL Agent
> Receive data using a REST API
There are settings in the various PNet configuration files to enable a JMX interface in order to facilitate self-monitoring. However, I haven't seen anything that talks about these settings and how to use JMX for making queries related to troubleshooting. I don't know whether this opens up methodology in the Self-Monitoring KM or if it's related to ad hoc queries. Again, documentation would be useful.
When possible, I believe it is best to use a PATROL Agent for self-monitoring and set up Agent-based thresholds to get events in case the TSIM Server is down, cells are down, PATROL Agents are not working, etc. However (unless you set the PATROL Agent up to work autonomously without connectivity to TrueSight), this only works if the product is functioning: the PATROL Agent has to be running, connected to the Integration Service, and configured to do the monitoring; events and/or data need to be flowing through the Integration Service; and the alert methodology has to be functioning. If this is not the case, then self-monitoring falls short.
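One way to catch the "Agent is connected but nothing is flowing" case from outside the product is a staleness check on the newest data point per Agent. The sketch below assumes you already have some way to obtain a last-seen timestamp per Agent (for example, an export or a database query); that input, the function name, and the 15-minute default are all my own assumptions, not anything BMC provides.

```python
from datetime import datetime, timedelta


def stale_agents(last_seen, now, max_age=timedelta(minutes=15)):
    """Return sorted agent names whose newest data point is older than max_age.

    last_seen maps agent name -> datetime of the most recent data point,
    or None if no data has ever been received from that agent.
    """
    stale = []
    for agent, timestamp in last_seen.items():
        # An agent that never reported counts as stale, same as one that went quiet.
        if timestamp is None or now - timestamp > max_age:
            stale.append(agent)
    return sorted(stale)
```

Passing `now` in explicitly keeps the function easy to test; in a scheduled job it would simply be `datetime.now()`.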
It would be useful to know what kinds of notification e-mails come through the TSIM administrative e-mail and whether there are associated events. Notifications sent through the TSIM administrative e-mail should properly fall under self-monitoring. I believe there are notifications through this e-mail that are not provided as events/alerts. Is there any documentation about what kinds of notifications are sent to this e-mail address? The one that I’m aware of is that disk space on the TSIM Server is below 80%. I’m guessing that there are others. Is there anything that documents this?
In order to prevent overloading an Integration Service, it would be useful to be able to query the Integration Service for PATROL Agent connections, how many instances and attributes are being passed through the Integration Service, etc. The pproxcli was a great tool. It is unfortunate (I would say a serious oversight) that this functionality was not included with the updated Integration Service architecture. Without this methodology, it's difficult (maybe not possible) to make command-line queries to ensure that we're not overloading the Integration Service. We've recently hit a built-in limit, which disrupted our deployment efforts. The rule of thumb is that you shouldn't have more than around 900 or so PATROL Agents connected to an Integration Service. But this is really not a good rule of thumb because it doesn't take into account what kind of data and events are being delivered by PATROL Agents through the Integration Service, nor does it factor in that certain KMs (remote monitoring, WebSphere, etc.) contain a lot of instances and attributes. Essentially, PATROL Agents that monitor lots of devices should probably be weighted more heavily than normal. For example, a PATROL Agent that does remote monitoring for hundreds of devices should probably be weighted according to the number (or perhaps a percentage) of the devices that are being monitored.
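To make the weighting idea concrete, here is a rough sketch of how a weighted count could be compared against the rule-of-thumb figure of about 900. The formula (one unit per Agent plus a configurable factor per remotely monitored device) is purely my own illustration, not a BMC sizing method, and the function and parameter names are made up.

```python
def agent_weight(monitored_devices, device_factor=1.0):
    """Weight of one agent: 1 for the agent itself plus device_factor per remote device."""
    return 1.0 + device_factor * monitored_devices


def integration_service_load(agents, device_factor=1.0):
    """Sum the weights of all agents.

    agents maps agent name -> number of remotely monitored devices (0 for a plain agent).
    """
    return sum(agent_weight(count, device_factor) for count in agents.values())


def over_budget(agents, budget=900.0, device_factor=1.0):
    """Return True if the weighted load exceeds the budget (default: the ~900 rule of thumb)."""
    return integration_service_load(agents, device_factor) > budget
```

With this weighting, 500 ordinary Agents plus two remote-monitoring Agents that each cover 300 devices come to a weighted load of 1102 at `device_factor=1.0`, which is over the 900 budget even though only 502 Agents are connected. Dropping `device_factor` to 0.5 (the "perhaps a percentage" variant) gives 802.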
I think it would be useful to have self-monitoring covered as an all-encompassing topic in the BMC TrueSight documentation. What documentation exists is sprinkled across the landscape, and I haven't found anything to tie it all together.
If anyone has anything to share regarding this, that would be great. I would love to hear your ideas about self-monitoring in BMC TrueSight Operations Management.