When I was a kid I would occasionally get sent to the principal’s office. It was usually due to a hall monitor or safety patrol detecting undesirable behavior. A couple of infractions that come to mind are snowball throwing during recess and giving out cooties “no trade backs” when I was supposed to be sitting in my seat. The principal would then take action in the form of a stern talking-to or some other punishment that could vary in severity depending on the permission received from home. The safety patrol provided the monitoring, the principal provided the remediation.
Are you beginning to see where I’m taking this?
This is similar to the sequence of actions that occurs in a Network Operations Center (NOC). Monitoring is handled by network operations using Assurance tools that monitor and correlate events. Remediation is handled, once approved, by network engineering using some sort of device management or automation tool. Let’s take it one step further. A friend of mine once threw a paper airplane in class. This was detected and he was given a warning by the teacher (suppressed event). After a double dog dare he threw another one that hit her in the head. This was correlated with the previous event and led to escalation of a trouble ticket (go to the principal’s office). Because the principal had permission on file from my friend’s parents (approval), my buddy got the paddle, which meant a canoe paddle to the backside (remediation). He never threw another paper airplane in class. I didn’t just make that up.
Assurance and Automation, the two main pillars of this dance, are typically treated as two completely separate responsibilities handled by different personnel and different buying groups. The handoff from one to the other is usually manual. But it doesn’t have to be that way. There are out-of-the-box tools and integrations that handle detection, correlation and remediation, all while following ITIL best practices and process compliance. Here is an example:
Automated Fault Management
Change to network device is detected by Network Automation tool and sent to Event Management
Fault on that network device is detected by Network Management tool and sent to Event Management
Event Management correlates these events and automatically generates an Incident enriched with the relevant data.
Network Automation tool builds a script to remove the change to the network device that caused the outage. This can be done manually or automatically depending on site policy or severity of event.
Change Ticket with change details is created automatically
Once the Change Ticket is approved, which again can be done manually or automatically depending on policy, the Network Automation tool sends the script to the network device
Change Ticket is closed automatically
Network Management tool detects that service is back up and automatically closes the Incident.
Note that there are points in this sequence where pauses can occur for manual action and review; however, if site policy allows it, the entire sequence could run automatically, taking full advantage of the automation tools (i.e., Network Automation in this example), assurance tools (i.e., Network Management and Event Management) and the service desk (i.e., Incident and Change management).
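To make the sequence above a little more concrete, here is a minimal Python sketch of the correlate-then-remediate flow. It is not any vendor's API: the names Event, Incident, correlate and remediate, the five-minute correlation window, and the auto_approve flag standing in for site policy are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    device: str
    kind: str         # "change" or "fault" (hypothetical event types)
    timestamp: float  # seconds
    detail: str = ""


@dataclass
class Incident:
    device: str
    change: Event
    fault: Event
    status: str = "open"
    log: list = field(default_factory=list)


def correlate(events, window=300.0):
    """Pair each fault with the latest change on the same device
    that happened at most `window` seconds before the fault."""
    incidents = []
    for fault in (e for e in events if e.kind == "fault"):
        candidates = [
            e for e in events
            if e.kind == "change"
            and e.device == fault.device
            and 0 <= fault.timestamp - e.timestamp <= window
        ]
        if candidates:
            change = max(candidates, key=lambda e: e.timestamp)
            incidents.append(Incident(fault.device, change, fault))
    return incidents


def remediate(incident, auto_approve, approver=None,
              service_restored=lambda device: True):
    """Walk one incident through the change-ticket steps.
    `auto_approve` stands in for site policy; `approver` is an
    optional callable representing a human reviewer."""
    incident.log.append(f"rollback script built for: {incident.change.detail}")
    incident.log.append("change ticket opened")
    approved = auto_approve or (approver is not None and approver(incident))
    if not approved:
        # Pause point: policy requires manual review before pushing.
        incident.status = "pending_approval"
        incident.log.append("waiting for manual approval")
        return incident
    incident.log.append("rollback script sent to device")
    incident.log.append("change ticket closed")
    if service_restored(incident.device):
        incident.status = "closed"
        incident.log.append("incident closed")
    return incident
```

Feeding it a change on edge-1 followed a minute later by a fault on edge-1 yields one incident. With auto_approve=True that incident runs straight through to "closed"; with auto_approve=False and no approver it stops at "pending_approval", which is the manual pause described above.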
We have customers who are doing this level of automation today, but the market has not nearly exploited the possibilities yet. Another example that customers are beginning to leverage is compliance automation where, instead of a Fault being detected, a compliance violation in a device’s configuration is detected and auto-remediated.
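The compliance case can be sketched the same way. This is only a toy, assuming a policy expressed as required and forbidden configuration lines; the function names and the IOS-style strings are illustrative placeholders, not a real product's rule format.

```python
def find_violations(config_lines, required, forbidden):
    """Return (missing required lines, forbidden lines that are present)."""
    present = {line.strip() for line in config_lines}
    missing = [r for r in required if r not in present]
    banned = [line.strip() for line in config_lines
              if line.strip() in forbidden]
    return missing, banned


def remediate_config(config_lines, required, forbidden):
    """Drop forbidden lines and append any missing required lines."""
    missing, _ = find_violations(config_lines, required, forbidden)
    cleaned = [line for line in config_lines
               if line.strip() not in forbidden]
    return cleaned + missing


# Hypothetical policy and device configuration.
required = ["service password-encryption", "no ip http server"]
forbidden = {"ip http server"}
config = ["hostname edge-1", "ip http server", "service password-encryption"]

fixed = remediate_config(config, required, forbidden)
```

Running find_violations on the fixed configuration returns two empty lists, which is the signal an Event Management tool could use to close the violation automatically.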
So go forth and further automate your service operations by collapsing Assurance and Automation. It will keep you out of the Principal’s office. I double dog dare you.