To ensure the best possible availability of your Helix Services, the SaaS Operations team monitors and manages the services our customers subscribe to.
Any alert from the monitoring tools creates an incident that is picked up by the SaaS Operations NOC (Network Operations Center) team and managed through to closure.
An overview of the process can be found in this chart:
The infrastructure and the applications are monitored with appropriate tools, most of them from the BMC Helix Monitor (TrueSight) set of solutions, supplemented by vendor-specific tools where required. Alerts from the various tools are consolidated and correlated in the TrueSight Operations Management (TSOM) event management module, which generates enriched incidents in Myportal, the SaaS Operations Helix Remedy environment.
The SaaS NOC team, based in Pune, India, is assigned all incoming incidents except DB and Network alerts, which are assigned directly to the DBA and Network teams. The teams work on the alerts in shifts, 24x7. Many standard incidents have predefined runbooks with automatic resolutions; others are remediated manually by the NOC team following a documented procedure.
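The assignment rule described above can be sketched in a few lines of Python. This is only an illustration: the category labels, the team names as strings, and the `Incident` fields are hypothetical and do not reflect the actual Myportal data model.

```python
from dataclasses import dataclass

# Hypothetical category labels mapped to the teams named in the text.
# The real alert categories are not given in this document.
DIRECT_ROUTES = {"database": "DBA", "network": "Network"}


@dataclass
class Incident:
    incident_id: str
    category: str              # e.g. "database", "network", "application"
    has_runbook: bool = False  # True if a predefined runbook exists


def assign_team(incident: Incident) -> str:
    """DB and Network alerts go directly to their teams; all others go to the NOC."""
    return DIRECT_ROUTES.get(incident.category.lower(), "NOC")


def remediation_mode(incident: Incident) -> str:
    """Standard incidents with a runbook are resolved automatically, others manually."""
    if incident.has_runbook:
        return "automatic runbook"
    return "manual documented procedure"
```

For example, `assign_team(Incident("INC-1", "network"))` returns `"Network"`, while an application alert falls through to the NOC.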
When a 1-Critical incident is raised, it is assumed to be a PROD outage that impacts many end users or customer businesses. The NOC immediately tries to log in to the customer's environment manually, as any user would, to confirm whether it really is an outage. If they cannot log in, they use automation to open a BMC internal bridge, calling in a MIM (Major Incident Manager) and SMEs (Subject Matter Experts) to manage the investigation. The automation also sends a notification to customers informing them that their service might be impacted, as indicated by critical alerts from our monitoring tools, and creates an unavailability record in our disruption database.
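The order of steps for a 1-Critical incident can be summarized as a small decision function. This is a minimal sketch, not the actual automation: the function name, the `login_succeeds` callback, and the action strings are assumptions made for illustration. The text does not say what happens when the login check succeeds, so the sketch simply returns no outage actions in that case.

```python
from typing import Callable, List


def handle_critical_incident(login_succeeds: Callable[[], bool]) -> List[str]:
    """Return the outage actions triggered for a 1-Critical incident.

    `login_succeeds` stands in for the manual NOC login check (hypothetical).
    """
    if login_succeeds():
        # Outage not confirmed; the follow-up is not described in this document.
        return []
    # Outage confirmed: these steps are triggered through automation.
    return [
        "open BMC internal bridge",
        "call in MIM and SMEs",
        "notify customers of possible service impact",
        "create unavailability record in disruption database",
    ]
```

Passing the login check in as a callable keeps the sketch independent of how the check is actually performed.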
Once the outage is remediated through a solution or workaround, a Problem ticket is created and the MIM starts preparing a MIR (Major Incident Report), which customers can request. It describes the incident, what was done to restore the service, and what will be done to prevent it from happening again. If the cause of the outage is known at that time, the report explains that as well.
The NOC sends out a notification telling the customer that the service has been restored. The actual outage time is taken from our transaction monitoring tool, which simulates a user logging in, and is stored in the unavailability record. For other alerts, the timing of the alert itself defines the incident duration.
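To illustrate how an outage window could be derived from simulated login checks, the sketch below takes a list of check results and returns the span from the first failed check to the next successful one. The sample format and the function name are assumptions; the document does not describe how the transaction monitoring tool actually reports its data.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (time of the simulated login check, whether the login succeeded)
Sample = Tuple[datetime, bool]


def outage_window(samples: List[Sample]) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) of the first outage found in the samples.

    The outage starts at the first failed check and ends at the next
    successful check. Returns None if there is no completed outage.
    """
    start = None
    for timestamp, ok in sorted(samples, key=lambda s: s[0]):
        if not ok and start is None:
            start = timestamp          # first failed login marks the start
        elif ok and start is not None:
            return (start, timestamp)  # first success afterwards marks the end
    return None
```

The duration to store in the unavailability record would then be `end - start`.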
After the NOC has updated the disruption DB, Helix Remedy customers can see the availability on the i.onbmc.com Service Status Dashboard, which is based on the disruption DB data. Work is ongoing to extend this to additional Helix Services.
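As a rough illustration of how an availability figure can be derived from outage records, the sketch below uses the common formula of uptime divided by the total reporting period. The formula, the function name, and the inputs are assumptions for illustration only; the calculation actually used for the dashboard is defined by the applicable policies, not by this example.

```python
from datetime import timedelta
from typing import Iterable


def availability_percent(period: timedelta, outages: Iterable[timedelta]) -> float:
    """Availability as a percentage of the reporting period.

    Assumes the outages do not overlap and all fall within the period.
    """
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    downtime = sum(outages, timedelta())
    # Dividing one timedelta by another yields a float ratio.
    return 100.0 * ((period - downtime) / period)
```

With a 30-day period and no recorded outages, the function returns `100.0`.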
Details on our policies can be found on our documentation site docs.bmc.com here:
Examples of Policies related to availability: