5 Replies Latest reply on Nov 30, 2018 12:51 PM by Michael Evans

    test servers vs production

    Steve Robinson

      When a PATROL agent triggers an event that warrants notifying someone, we have a person who contacts the appropriate support group.

       

      But for all of our test environments those events are currently being ignored after hours.  

       

      But come the next morning, there may still be PATROL agents in warning or alarm state, and the events that triggered overnight were ignored.

       

      How do I let our monitoring person know that there are leftover events from the previous evening?

       

      One thought was to bounce all patrol agents on the test servers, but that would cause other issues.

       

      Steve

        • 1. Re: test servers vs production
          Garland Smith

          Why are the events being ignored after hours?

          Doesn’t PKM for Event Management have a “resend alerts” option?

           

          https://docs.bmc.com/docs/display/public/esgkm2900/SETEVENTMANAGEMENTVARIABLE_alertResenddialogbox

           

          I hope this helps.

          Garland Smith

          • 2. Re: test servers vs production
            Michael Evans

            We have created a script that will close certain alarms/events on different time intervals.  If the alarm condition is still true we get a new event.

            This is important to us because we don't have a centralized team looking at all open events and watching things slide down / get buried in the event view.  Instead we send event alerts directly to the app DevOps teams.

            We don't use the event KM because all of our thresholds are server-side; we don't use any agent-created events.

             

            A couple of use cases

            1. Critical or other important issues (web url offline, disk space, memory pressure) we close every 15 minutes to 24 hours depending on needs.  TSIM will then recreate an event if applicable.
            2. Closure of all old events (not alarms) that would otherwise need manual closure: done every 24 hours.  This keeps the console clear of clutter.
            3. Closure of all events for a device group after its maintenance period.  This forces any events that would have been missed during a 'silent time' to retrigger.  (We don't use blackouts because we want all of their app data in case they need to start troubleshooting during maintenance; however, we do not send alerts during maintenance.)

             

             

            Here is a sample of the script...

            Query first

             

            :: Query for events that are just excess noise

            mquery -n %tsimcell% -a PATROL_EV -w "status: == OPEN AND severity: == OK" -q -s event_handle

            mquery -n %tsimcell% -a PATROL_EV -w "p_class: == Disconnect AND status: == OPEN AND severity: == INFO" -q -s event_handle

             

            :: Query for events to be closed on scheduled task trigger (every 15 minutes)

            mquery -n %tsimcell% -w "mc_object_class: == NT_LOGICAL_DISKS AND severity: == CRITICAL AND status: == OPEN" -q -s pn_alarm_id

            mquery -n %tsimcell% -w "mc_object_class: == NUK_FileSystem AND severity: == CRITICAL AND status: == OPEN" -q -s pn_alarm_id

             

            :: Query for events older than a day

            mquery -n %tsimcell% -w "mc_object_class: == NUK_FileSystem AND severity: == MINOR AND status: == OPEN AND mc_arrival_time: < %epochtimeADayAgo%"  -q -s pn_alarm_id

            mquery -n %tsimcell% -w "mc_object_class: == NUK_Process AND severity: == CRITICAL AND status: == OPEN AND mc_arrival_time: < %epochtimeADayAgo%"  -q -s pn_alarm_id

            mquery -n %tsimcell% -w "mc_object_class: == NUK_Process AND severity: == MAJOR AND status: == OPEN AND mc_arrival_time: < %epochtimeADayAgo%"  -q -s pn_alarm_id
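The queries above rely on %epochtimeADayAgo% without showing how it is set. Here is a minimal sketch in Python of one way to compute it and build the matching where clause. It assumes mc_arrival_time is compared as whole epoch seconds; the function names are made up for illustration and are not part of any BMC tool.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def epoch_a_day_ago(now=None):
    """Return the epoch timestamp (whole seconds) for 24 hours before `now`."""
    if now is None:
        now = time.time()
    return int(now) - SECONDS_PER_DAY


def old_event_clause(object_class, severity, cutoff):
    """Build a where clause in the same form as the 'older than a day' queries."""
    return (
        f"mc_object_class: == {object_class} AND severity: == {severity} "
        f"AND status: == OPEN AND mc_arrival_time: < {cutoff}"
    )


if __name__ == "__main__":
    # Print a clause that could be passed to mquery via -w.
    print(old_event_clause("NUK_FileSystem", "MINOR", epoch_a_day_ago()))
```

The cutoff is computed once and passed in, so every query in a single run uses the same boundary.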

             

            Then close the events

            msetmsg -n %tsimcell% -u <ExternalEventID> -C

            pw event close csm_user -ai <AlarmID>
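To tie the two steps together, the query output has to be turned into one close command per ID. A rough sketch of that glue, assuming the -q -s queries print one ID per line (not confirmed here) and using only the flags shown above; the helper names are hypothetical, and the commands are built as argument lists, not executed.

```python
def parse_ids(query_output):
    """Return the non-empty lines of the query output, stripped, in order, without duplicates."""
    seen = set()
    ids = []
    for line in query_output.splitlines():
        value = line.strip()
        if value and value not in seen:
            seen.add(value)
            ids.append(value)
    return ids


def cell_close_command(cell, event_id):
    """Argument list mirroring: msetmsg -n <cell> -u <ExternalEventID> -C"""
    return ["msetmsg", "-n", cell, "-u", event_id, "-C"]


def tsim_close_command(alarm_id, user="csm_user"):
    """Argument list mirroring: pw event close csm_user -ai <AlarmID>"""
    return ["pw", "event", "close", user, "-ai", alarm_id]


if __name__ == "__main__":
    sample = "1001\n\n1002\n1001\n"  # made-up IDs for illustration
    for event_id in parse_ids(sample):
        print(" ".join(cell_close_command("mycell", event_id)))
```

Each list could then be handed to subprocess.run on a host where the CLIs are installed; dropping duplicates avoids trying to close the same event twice.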

            • 3. Re: test servers vs production
              Steve Robinson

              I don't see why the resend alerts option would be used for this issue.  To my knowledge it does not trigger at a specific time; it just tries to resend.

               

              The queries listed above might be  helpful. 

              To explain further: we have events on test systems that we are interested in, but we do not want anyone called after hours.  We do want email notifications to go out to the support group for the particular event so that they can investigate the next business day.

              We have a cell, call it A,  that receives all events and then propagates all events of interest to a higher cell, call it B. 

              Cell A does not propagate any events on test systems after hours, but it does send email to the support groups.

              The group that monitors, after hours, never sees any events on cell A, just cell B.

               

              So our working day begins at 7am Monday - Friday.  My thought is to set up a query that would find any events in warning or alarm at 7am on cell A and create a new event  that would propagate to cell B.

              Of course, a cleanup script would also be needed to clean up both cells.
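A minimal sketch of the 7am sweep idea in Python, for discussion only. It assumes "warning or alarm" maps to the WARNING, MINOR, MAJOR and CRITICAL severities, reuses only the mquery flags shown earlier in this thread, and issues one query per severity as those examples do. Creating the new event for cell B is left out because the exact command is not shown here. All names are hypothetical.

```python
from datetime import datetime

# Assumed mapping of "warning or alarm" to cell severities.
LEFTOVER_SEVERITIES = ("WARNING", "MINOR", "MAJOR", "CRITICAL")


def is_morning_sweep(moment, start_hour=7):
    """True on Monday-Friday during the hour in which the working day begins."""
    return moment.weekday() < 5 and moment.hour == start_hour


def leftover_queries(cell):
    """One mquery argument list per severity, selecting open events on `cell`."""
    return [
        ["mquery", "-n", cell, "-w",
         f"severity: == {severity} AND status: == OPEN",
         "-q", "-s", "event_handle"]
        for severity in LEFTOVER_SEVERITIES
    ]


if __name__ == "__main__":
    if is_morning_sweep(datetime.now()):
        for command in leftover_queries("A"):
            print(" ".join(command))
```

A scheduled task could run this at the start of each hour; the weekday check keeps it from firing on weekends.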

              • 4. Re: test servers vs production
                EDOARDO SPELTA

                Hello,

                Very interesting and useful, thanks! I have two questions:

                1) Why close events on both the cell and TSIM? Shouldn't the former trigger the latter?

                2) I'm always concerned about closed-event retention in BPPM, because the default is 1 day, if I remember correctly. So if I closed all those events daily, I would not be able to find them when I need to troubleshoot. Typically, I would not be able to answer the question: "There was an incident 3 days ago; did the monitoring system send any alarm?" How do you deal with this?

                • 5. Re: test servers vs production
                  Michael Evans

                  1) Depending on the type (alert vs. alarm), an event is best closed in either the cell or TSIM.  If the cell generated the event (PATROL events), close it there; if the Rate process generated it (server threshold alarms), close it on the TSIM side.  Nearly every event we generate comes from a server-side threshold.  We don't allow PATROL to generate any events.

                  2) We retain our events for 30 days for that same research reason.