Why are the events being ignored after hours?
Doesn’t PKM for Event Management have a “resend alerts” option?
I hope this helps.
We have created a script that closes certain alarms/events at different time intervals. If the alarm condition is still true, we get a new event.
This is important to us because we don't have a centralized team looking at all open events and watching things slide down / get buried in the event view. Instead we send event alerts directly to the app DevOps teams.
We don't use the event KM because all of our thresholds are server-side; we don't use any agent-created events.
A few use cases:
- Critical or other important issues (web URL offline, disk space, memory pressure): we close these every 15 minutes to 24 hours, depending on needs. TSIM will then recreate the event if the condition still applies.
- Closure of all old events (not alarms) that would otherwise need manual closure: done every 24 hours. This keeps the console clear of clutter.
- Closure of all events for a device group after their maintenance period. This forces any events that would have been missed during a 'silent time' to retrigger. (We don't use blackouts because we want all of their app data in case they need to start troubleshooting during maintenance; however, we do not send alerts during maintenance.)
Here is a sample of the script...
:: Query for events that are just excess noise
mquery -n %tsimcell% -a PATROL_EV -w "status: == OPEN AND severity: == OK" -q -s event_handle
mquery -n %tsimcell% -a PATROL_EV -w "p_class: == Disconnect AND status: == OPEN AND severity: == INFO" -q -s event_handle
:: Query for events to be closed on scheduled task trigger (every 15 minutes)
mquery -n %tsimcell% -w "mc_object_class: == NT_LOGICAL_DISKS AND severity: == CRITICAL AND status: == OPEN" -q -s pn_alarm_id
mquery -n %tsimcell% -w "mc_object_class: == NUK_FileSystem AND severity: == CRITICAL AND status: == OPEN" -q -s pn_alarm_id
:: Query for events older than a day
mquery -n %tsimcell% -w "mc_object_class: == NUK_FileSystem AND severity: == MINOR AND status: == OPEN AND mc_arrival_time: < %epochtimeADayAgo%" -q -s pn_alarm_id
mquery -n %tsimcell% -w "mc_object_class: == NUK_Process AND severity: == CRITICAL AND status: == OPEN AND mc_arrival_time: < %epochtimeADayAgo%" -q -s pn_alarm_id
mquery -n %tsimcell% -w "mc_object_class: == NUK_Process AND severity: == MAJOR AND status: == OPEN AND mc_arrival_time: < %epochtimeADayAgo%" -q -s pn_alarm_id
Then close the events
msetmsg -n %tsimcell% -u <ExternalEventID> -C
pw event close csm_user -ai <AlarmID>
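The sample does not show how %epochtimeADayAgo% gets set. Here is a rough Python sketch of one way to compute it and assemble the same where clause. It assumes mc_arrival_time is compared as epoch seconds (the variable name suggests that, but I have not confirmed it), and the function names are made up for illustration. It only builds the argument list using the flags already shown above; it does not run mquery.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def epoch_a_day_ago(now=None):
    """Return the epoch time (whole seconds) for 24 hours before `now`."""
    if now is None:
        now = time.time()
    return int(now) - SECONDS_PER_DAY


def build_old_event_query(cell, object_class, severity, cutoff):
    """Build the mquery argument list for open events older than `cutoff`.

    Assumption: mc_arrival_time holds epoch seconds, as the variable
    name in the sample script implies.
    """
    where = (
        f"mc_object_class: == {object_class} AND severity: == {severity} "
        f"AND status: == OPEN AND mc_arrival_time: < {cutoff}"
    )
    # Same flags as the sample script: -n, -w, -q, -s
    return ["mquery", "-n", cell, "-w", where, "-q", "-s", "pn_alarm_id"]
```

The returned list could be handed to subprocess.run on a host where mquery is installed, with the cell name taken from wherever %tsimcell% is defined.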
I don't see why the resend events option would be used for this issue. To my knowledge it does not trigger at a specific time; it just tries to resend.
The queries listed above might be helpful.
To further explain: we have events on test systems that we are interested in, but we do not want anyone called after hours. We do want email notifications to go out to the support group for the particular event so that they can investigate the next business day.
We have a cell, call it A, that receives all events and then propagates all events of interest to a higher cell, call it B.
Cell A does not propagate any events on test systems after hours, but it does send email to the support groups.
The group that monitors after hours never sees any events on cell A, only those on cell B.
Our working day begins at 7am, Monday through Friday. My thought is to set up a query that finds any events in warning or alarm on cell A at 7am and creates a new event that propagates to cell B.
Of course, a cleanup script would also be needed to clean up both cells.
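As a rough sketch of the timing part only, something like the following Python could work out when the next 7am Monday-Friday start falls, so a scheduled job knows when to run the query. The function name is hypothetical, it uses naive local datetimes, and it ignores holidays.

```python
from datetime import datetime, time, timedelta

WORKDAY_START = time(7, 0)  # 7am, per the schedule described above
WORKDAYS = range(0, 5)      # Monday=0 through Friday=4


def next_business_start(now):
    """Return the next 7am Monday-Friday at or after `now`."""
    candidate = datetime.combine(now.date(), WORKDAY_START)
    # If today's 7am has already passed, start looking from tomorrow.
    if candidate < now:
        candidate += timedelta(days=1)
    # Skip Saturday and Sunday.
    while candidate.weekday() not in WORKDAYS:
        candidate += timedelta(days=1)
    return candidate
```

For example, a call made on a Friday evening returns the following Monday at 07:00, which is when the events held back on cell A would be re-raised.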
Very interesting and useful, thanks! I have two questions:
1) Why close events on both the cell and TSIM? Shouldn't the former trigger the latter?
2) I'm always concerned about closed-event retention in BPPM, because the default is 1 day, if I remember correctly. So if I closed all those events daily, I would not be able to find them when I need to troubleshoot. Typically I would not be able to answer the question: "There was an incident 3 days ago; did the monitoring system send any alarm?" How do you deal with this?
1) Certain types of alerts vs. alarms are best closed in the cell vs. TSIM. If the cell generated the event, close it there (PATROL events); if the Rate process generated the event, close it there (server threshold alarms). Nearly every event we generate is a server-side threshold. We don't allow PATROL to generate any events.
2) We retain our events for 30 days for that same research reason.
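To illustrate point 1, here is a minimal Python sketch of picking the close command by where the event originated. The "cell" and "server" labels and the function name are assumptions for illustration; the command forms are copied from the sample script earlier in the thread, and nothing is executed here.

```python
def build_close_command(origin, identifier, cell=None):
    """Return the close command (as an argument list) for one event.

    origin: "cell" for cell-generated events (closed with msetmsg),
            "server" for server threshold alarms (closed with pw).
    These labels are hypothetical; the commands mirror the sample script.
    """
    if origin == "cell":
        if cell is None:
            raise ValueError("a cell name is required for cell events")
        return ["msetmsg", "-n", cell, "-u", identifier, "-C"]
    if origin == "server":
        return ["pw", "event", "close", "csm_user", "-ai", identifier]
    raise ValueError(f"unknown origin: {origin!r}")
```

The identifier would be the value returned by the matching mquery (event_handle or pn_alarm_id in the sample).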