Sounds like you are using global thresholds (server thresholds). In that case you are dealing with server events generated by the BPPM server, as opposed to agent events generated by PATROL agents or 3rd-party tools. You are right: for server events, there is just one event from open to close, and it can change severity between minor, major, and critical as thresholds are crossed. That is why the 'occurred' timestamp still refers to the time when the event was first raised, no matter how many times the severity has changed since. (For agent events, there is a separate event for every severity change.)
I include the 'modified' time column (mc_date_modification) in the operator view next to the 'occurred' time so that operators know when the event last changed its severity or status. This would help your operators see when your event changed to its latest severity (critical in your case). You can even customize which event slot changes update this 'modified' timestamp by editing pw\server\etc\mcell.modify. By default, changes to status, severity, mc_priority, repeat_count, and CLASS update the 'modified' timestamp.
Hi, thanks for the advice about the right slot to make visible in the console, and yes, it's about global thresholds.
What do you think about my assumption on threshold crossings? I'm seeing a lot of major/critical entries in the event history, but the values collected by the monitor were not below 15% for three hours, so I don't really think that the event actually changed severity.
My guess is that BPPM doesn't consider previous severity changes when evaluating a new data point for event severity. It simply checks which is the highest severity criterion this data point meets at that moment. In your case, your data value was indeed below 15% for 3 hours, because those 3 hours include the time period when your data was below 5% (with event severity at critical).
I don't have an example handy to validate my guess. But if you can find an example to either validate or invalidate it, please let us know.
I'm still not sure how to interpret this case:
This is the alarm in console
This is the history
At 02/17 6:32pm it hits the major threshold (for 3 hours, I want to believe) and I get the first event.
Then available space decreases again (12:43), the alarm changes severity to critical, and on the console I see a red event with the original timestamp (after your advice I'm also showing the modified date, even though I fear that other operations on the event might change that timestamp).
2 minutes later, at 12:45, for some reason some space became available.
Now, in order to change severity/status, the available space should have been <15% for 3 hours, which obviously it wasn't, as only 2 minutes had passed! Therefore I expect that the 12:45 event never really existed, and I wonder if, how, and when this "intelligent" event history is really reliable.
The same goes for all the major events up to the last one.
The event is still in the console and shows a measured metric value of 4.25%, which is the one that triggered the critical threshold the first time:
even though the metric is currently 7%
and, btw, it has been in the major threshold range for days now, but it's still showing critical:
I am a little confused about all this information..
As you may know, two major components in BPPM, PNET and Cell, communicate with each other through APIs but they can't read each other's memory due to their architecture difference. PNET is Java based and Cell is Prolog based.
I can't explain why your cell stopped updating the events sent from PNET. I wonder if it is just these particular Logical Disk Free Space events. Do other server events (new or updated) look OK? If all events seem to be stuck, you may want to bounce the cell with 'mcontrol reload'.
For your major event raised at 12:45, just two minutes after raising a critical event, I said in my previous post: 'My guess is that BPPM doesn't consider previous severity changes when evaluating a new data point for event severity.' What I meant was:
At 12:45, when Logical Disk Free Space was between 5% and 15%, PNET didn't consider that a critical event had been raised at 12:43; it didn't even consider that the first major event had been raised 10 days ago. All it did was look at your thresholds, starting from the highest severity setting. It asked: has this data point value been below 5% for the last 0 minutes? If so, set severity to critical. If not, has this data point value been below 15% for the last 3 hours? If so, set severity to major. In your case at 12:45, the data point value had indeed been below 15% for the last 3 hours. Even at 12:43, when the critical event was raised, the data point value was below 15%: if it was below 5%, it was obviously below 15%.
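To make that guess concrete, here is a minimal Python sketch of a stateless evaluation like the one described above. It is only an illustration of the reasoning, not BPPM's actual implementation: the function name `evaluate`, the `THRESHOLDS` table, and the sample data are all hypothetical, and the 5%/0 minutes and 15%/3 hours settings are taken from this thread.

```python
from datetime import datetime, timedelta

# Hypothetical threshold table, highest severity first:
# (severity, limit in % free space, how long the value must stay below the limit)
THRESHOLDS = [
    ("CRITICAL", 5.0, timedelta(0)),
    ("MAJOR", 15.0, timedelta(hours=3)),
]

def evaluate(samples, now):
    """Return the severity for the data point at `now`.

    `samples` is a time-sorted list of (timestamp, value) pairs.
    The previous severity of the event is deliberately NOT an input:
    only the raw values inside each threshold's duration window matter.
    """
    history = [(t, v) for t, v in samples if t <= now]
    if not history:
        return "OK"
    for severity, limit, duration in THRESHOLDS:
        window_start = now - duration
        if history[0][0] > window_start:
            continue  # not enough collected data to cover the whole window
        window = [v for t, v in history if t >= window_start]
        if window and all(v < limit for v in window):
            return severity
    return "OK"

# Made-up data: 12% free for hours, a dip to 4.25%, then back up to 7%.
start = datetime(2024, 1, 1, 9, 0)
samples = [(start + timedelta(minutes=m), 12.0) for m in range(0, 222, 2)]
samples.append((start + timedelta(minutes=223), 4.25))  # 12:43
samples.append((start + timedelta(minutes=225), 7.0))   # 12:45

print(evaluate(samples, start + timedelta(minutes=223)))  # CRITICAL
print(evaluate(samples, start + timedelta(minutes=225)))  # MAJOR
```

The second call returns MAJOR only two minutes after CRITICAL because the 3-hour window is measured over the data values, not from the last severity change, and the 4.25% sample also counts as "below 15%".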
Ok, now I see your point!
I was expecting severity changes to reset the 3-hour duration count. Thanks for the hint!