6 Replies Latest reply on Mar 6, 2017 12:38 AM by EDOARDO SPELTA

    Intelligent event history..is something wrong with it ?

    EDOARDO SPELTA

      I was reviewing a case for a customer where logical disk free space on a Windows agent was flapping above and below a critical threshold several times.

       

      Critical was set for free space <5%

      Major was set for free space <15% for 3 hours

       

      In the BPPM 9.6 OM console I only had one critical event, whose time (received, I guess) was the time when the major threshold was crossed for the first time.

      I selected the Intelligent Event History from that event and got a sequence of critical/major entries, as the free space was going below and above the critical threshold.

       

      First of all: the oldest event in the event history was a major (below 15% for 3 hours), so why do I have a critical in the console reporting the timestamp of that first major event?

      I assume that critical is the actual latest status of the monitor, but why is the timestamp still referencing the first event in the chain? In these cases event consoles should report the latest timestamp, as operators have to know when the latest status change happened (and if they really need it, they can look into the history to see the timestamp of the first one when investigating further).

       

      I assume that the events shown in the event history are actually only one event which changes severity and message text each time the critical threshold is crossed, but again, why do I get the first event timestamp in the console? This is misleading.

       

      Besides, the event history itself seems wrong to me because it does not take into consideration that my major threshold has a duration. So it is not true that all those major events in the history were shown in the console. They are shown in the history just because the free space dropped below 15%, but since it didn't stay there for 3 hours, no status change was expected to happen (how do I know if it actually did?), so in my opinion the history is reporting wrong information.

       

      What do you think of it? Any past experience or similar cases?

        • 1. Re: Intelligent event history..is something wrong with it ?
          Willa Ou

          Edoardo,

           

          Sounds like you are using global thresholds/server thresholds.  Then you are dealing with server events generated by the BPPM server, as opposed to agent events generated by PATROL agents or 3rd-party tools.  You are right: for server events, there is just one event from open to close, and it can change severity between minor, major, and critical when thresholds are crossed.  That is why the 'occurred' timestamp still refers to the time when the event was first raised, no matter how many times the severity has changed later.  (For agent events, there is a separate event for every severity change.)

           

          I include the 'modified' time column (mc_date_modification) in the operator view next to the 'occurred' time so that the operators know when the event changed its severity or status.  This would help your operators know when your event changed to its latest severity (critical in your case).  You can even customize which slot changes update this 'modified' timestamp by editing pw\server\etc\mcell.modify.  By default, changes to status, severity, mc_priority, repeat_count, and CLASS update the 'modified' timestamp.
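
          A toy model of the behaviour described above, only to make the 'occurred' vs 'modified' distinction concrete. This is not how the cell is implemented: the ServerEvent class, its update method, and the integer timestamps are hypothetical; only the slot names come from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Slots whose change refreshes the 'modified' timestamp by default,
# as listed in the post above. Everything else here is illustrative.
TRACKED_SLOTS = {"status", "severity", "mc_priority", "repeat_count", "CLASS"}


@dataclass
class ServerEvent:
    """Hypothetical stand-in for a single server event that lives from open to close."""
    occurred: int                                  # set once, when the event is first raised
    slots: Dict[str, Any] = field(default_factory=dict)
    mc_date_modification: int = 0

    def __post_init__(self) -> None:
        # A freshly raised event has never been modified after creation.
        self.mc_date_modification = self.occurred

    def update(self, slot: str, value: Any, now: int) -> None:
        """Change one slot; refresh the modified time only for tracked slots."""
        changed = self.slots.get(slot) != value
        self.slots[slot] = value
        if changed and slot in TRACKED_SLOTS:
            self.mc_date_modification = now


event = ServerEvent(occurred=100, slots={"severity": "MAJOR"})
event.update("severity", "CRITICAL", now=200)   # tracked slot: modified time moves
event.update("msg", "free space low", now=300)  # untracked slot: modified time stays
```

          In this model 'occurred' never moves, while the modified time follows only changes to the tracked slots, which is why trimming that list limits what else can touch the timestamp.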

           

          Best Regards,

          Willa Ou

          willa_ou@worldopus.com

          • 2. Re: Intelligent event history..is something wrong with it ?
            EDOARDO SPELTA

            Hi, thanks for the advice about the right slot to make visible in the console, and yes, it's about global thresholds.

             

            What do you think about my assumption on threshold crossings? I'm seeing a lot of major/critical entries in the event history, but the values collected by the monitor were not below 15% for three hours, so I don't really think that the event actually changed severity.

            • 3. Re: Intelligent event history..is something wrong with it ?
              Willa Ou

              Edoardo,

               

              My guess is that BPPM doesn't consider previous severity changes when evaluating a new data point for event severity.  It simply checks which is the highest-severity criterion this data point meets at that moment.  In your case, your data value was indeed below 15% for 3 hours, because these 3 hours include the time period when your data was below 5% (with event severity at critical).

               

              I don't have an example handy to validate my guess.  But if you can find an example to either validate or invalidate it, please let us know.

               

              Best Regards,

              Willa Ou

              willa_ou@worldopus.com

              • 4. Re: Intelligent event history..is something wrong with it ?
                EDOARDO SPELTA

                Hello,

                I'm still not sure how to interpret this case:

                 

                This is the alarm in console

                This is the history

                On 02/17 at 6:32 pm it hits the major threshold (for 3 hours, I want to believe) and I get the first event.

                Then available space decreases again (12:43), the alarm changes severity to critical, and in the console I see a red event with the original timestamp (after your advice I'm also showing the modified date, even though I fear that other operations on the event might change that timestamp).

                Two minutes later, at 12:45, for some reason some space became available.

                Now, in order to change severity/status, the available space should have been <15% for 3 hours, which obviously it wasn't, as only 2 minutes had passed! Therefore I expect that the 12:45 event never really existed, and I wonder if, how, and when this "intelligent" event history is really reliable.

                The same goes for all the majors up to the last one.

                 

                The event is still in the console and shows a measured metric value of 4.25%, which is the one that triggered the critical threshold the first time:

                even though the metric is currently 7%

                 

                And, by the way, it has been in the major threshold range for days now, but it's still showing critical:

                 

                I am a little confused by all this information.

                • 5. Re: Intelligent event history..is something wrong with it ?
                  Willa Ou

                  Edoardo,

                   

                  As you may know, two major components in BPPM, PNET and the cell, communicate with each other through APIs, but they can't read each other's memory because of their architectural differences. PNET is Java based and the cell is Prolog based.

                   

                  I can't explain why your cell stopped updating the events sent from PNET. I wonder if it only affects these particular Logical Disk Free Space events.  Do other server events (new or updated) look OK?  If all events seem to be stuck, you may want to bounce the cell with 'mcontrol reload'.

                   

                  For your major event raised at 12:45, just two minutes after raising a critical event, I said in my previous post: 'My guess is that BPPM doesn't consider previous severity changes when evaluating a new data point for event severity.'  What I meant was:

                   

                  At 12:45, when Logical Disk Free Space was between 5% and 15%, PNET didn't consider that a critical event had been raised at 12:43; it didn't even consider that the first major event had been raised 10 days earlier.  All it did was look at your thresholds, starting from the highest severity setting.  It asked: has this data point value been below 5% for the last 0 minutes?  If so, set severity to critical. If not, has this data point value been below 15% for the last 3 hours? If so, set severity to major.  In your case at 12:45, the data point value had indeed been below 15% for the last 3 hours.  Even at 12:43, when the critical event was raised, the data point value was below 15%: if it was below 5%, it was obviously below 15%.
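
                  The evaluation described above can be sketched roughly as follows. This only illustrates my guess, not PNET's actual code: the Threshold class and the held_below and evaluate helpers are hypothetical names, and the sample history is invented to resemble the 12:43/12:45 case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (collection time, free space in percent)


@dataclass(frozen=True)
class Threshold:
    severity: str
    below: float          # condition holds while the value is under this percentage
    duration: timedelta   # how long the condition must already have held


# Ordered from highest to lowest severity, matching the thresholds in this thread.
THRESHOLDS = [
    Threshold("CRITICAL", 5.0, timedelta(0)),
    Threshold("MAJOR", 15.0, timedelta(hours=3)),
]


def held_below(samples: List[Sample], now: datetime,
               limit: float, duration: timedelta) -> bool:
    """True if the history covers [now - duration, now] and every sample in it is below limit."""
    start = now - duration
    seen = [(t, v) for t, v in samples if t <= now]
    if not seen or min(t for t, _ in seen) > start:
        return False  # not enough history to cover the whole window
    window = [v for t, v in seen if t >= start]
    return bool(window) and all(v < limit for v in window)


def evaluate(samples: List[Sample], now: datetime) -> Optional[str]:
    """Return the highest severity whose condition holds now.

    The current or previous severity of the event is deliberately not an input.
    """
    for th in THRESHOLDS:
        if held_below(samples, now, th.below, th.duration):
            return th.severity
    return None


day = datetime(2017, 2, 27)  # arbitrary date, for illustration only
history = [
    (day.replace(hour=9), 12.0),
    (day.replace(hour=10), 12.0),
    (day.replace(hour=11), 12.0),
    (day.replace(hour=12), 12.0),
    (day.replace(hour=12, minute=43), 4.25),
    (day.replace(hour=12, minute=45), 7.0),
]
print(evaluate(history, day.replace(hour=12, minute=43)))
print(evaluate(history, day.replace(hour=12, minute=45)))
```

                  With this invented history the 12:43 sample evaluates to critical, and the 12:45 sample evaluates to major, because every sample in the preceding 3 hours, including the 4.25% one, is below 15%. The severity change at 12:43 does not restart the 3-hour window.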

                   

                  Best Regards,

                  Willa Ou

                  willa_ou@worldopus.com

                  • 6. Re: Intelligent event history..is something wrong with it ?
                    EDOARDO SPELTA

                    OK, now I see your point!

                     

                     

                    I was expecting that severity changes would reset the 3-hour duration count. Thanks for the hint!

                     

                    Regards,

                    Edoardo