ITIL Incident Management
Incident Management in ITIL is the key process in Service Operation. Most Service Providers are evaluated and assessed by how quickly they respond to an Incident and restore service after it has occurred. By definition, an Incident is an unplanned interruption to an IT service or a reduction in the quality of an IT service.
It may also be the failure of a Configuration Item (CI) that has not yet impacted service. Put simply, an Incident is an unplanned disruption, or impending disruption, to an IT service. If disk space is filling up quickly and the service CI will be out of space in three hours, it is an Incident. If the network quality is degraded, it is an Incident. Incidents include disruptions reported by users (either via calls to the Service Desk or entered directly into the ITSM tool), reported by technical staff, or automatically detected and reported by event monitoring tools.
The concept of Incidents disrupting a service is one most people are familiar with. Think of your household phone or internet service. When you moved into your residence and signed up for service, you considered the service worth the price. If there is a disruption in the service, it is painful to the customer and the goal should be quick resolution. The customer does not want to go hours, or even days, without phone and internet. Even if the customer is rebated for the outage, it still leaves a scar on the relationship. The same is true for the customer of an IT service. They want as few Incidents as possible, each lasting as short a time as possible. The customer is paying for a service and wants it available when needed. The business customers who are paying for the IT service do not care about the cause of the disruption; they want service restored as quickly as possible and the issue not to arise again.
How can you measure and report incidents?
As Incidents are reported, the Incident Management process seeks to understand the impact and urgency of each Incident in order to act accordingly. The combination of Impact and Urgency determines Priority.
- Impact is the measure of the effect of the Incident, Problem, Change, or other ITSM record. Impact is usually measured by the effect on service levels.
- Urgency is the length of time until the Incident, Problem, or Change has a significant impact on the IT service. It indicates how quickly the Service Provider needs to act to resolve the issue on behalf of the business customer.
- Priority is a way to identify the relative importance of an Incident, Problem, or Change. It gives everyone involved a common understanding of which Incidents and Problems should be handled first.
Most organizations use a Priority Matrix on a 3-by-3 or 4-by-4 scale. For example, high impact and high urgency would result in a Priority 1 Incident. Conversely, a low impact and low urgency Incident would be the lowest Priority (some organizations call this Priority 4 or Priority 5).
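The matrix lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Level` names, the `priority` function, and the additive rule (impact plus urgency minus one, giving Priority 1 through 5 on a 3-by-3 scale) are hypothetical choices, and each organization defines its own matrix values.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both Impact and Urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a Priority from 1 (highest) to 5 (lowest).

    Assumed rule: priority = impact + urgency - 1. Real matrices are
    often tuned by hand rather than derived from a formula.
    """
    return int(impact) + int(urgency) - 1


# The full 3-by-3 matrix as a lookup table keyed by (impact, urgency).
PRIORITY_MATRIX = {(i, u): priority(i, u) for i in Level for u in Level}

if __name__ == "__main__":
    for impact in Level:
        row = [PRIORITY_MATRIX[(impact, urgency)] for urgency in Level]
        print(f"{impact.name:<6} impact:", row)
```

Under this assumed rule, high impact with high urgency gives Priority 1, and low impact with low urgency gives Priority 5, which matches the example above.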
Most Service Providers, both internal and external, use this matrix to determine the response and closure times for service levels. These service levels are agreed upon and documented to form a Service Level Agreement (SLA) or Operating Level Agreement (OLA).
Users do not care about the nature or the cause of the Incident, just how soon it can be resolved. Problem Management will investigate root cause. Most organizations keep volume metrics such as the number of Incidents broken down by Service Provider. Many track service metrics such as Mean Time to Restore Service (MTTRS) and Mean Time Between Service Interruptions (MTBSI). These service metrics are useful when reviewing service availability with the business customer.
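As a rough sketch of how these two service metrics could be computed, the Python below assumes each Incident is recorded as a pair of timestamps (when the interruption began and when service was restored). The function names and the sample data are made up for illustration. MTBSI is taken here as the average time from the start of one interruption to the start of the next.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttrs(outages):
    """Mean Time to Restore Service: average of (restored - start)."""
    durations = [(restored - start).total_seconds() for start, restored in outages]
    return timedelta(seconds=mean(durations))


def mtbsi(outages):
    """Mean Time Between Service Interruptions.

    Average gap between the start of one interruption and the start
    of the next, so at least two interruptions are required.
    """
    starts = sorted(start for start, _ in outages)
    if len(starts) < 2:
        raise ValueError("MTBSI needs at least two interruptions")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(starts, starts[1:])]
    return timedelta(seconds=mean(gaps))


# Hypothetical outage log for one service: (start, service restored).
outages = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 1, 11, 9, 0), datetime(2024, 1, 11, 9, 30)),
    (datetime(2024, 1, 21, 9, 0), datetime(2024, 1, 21, 11, 0)),
]

if __name__ == "__main__":
    print("MTTRS:", mttrs(outages))  # 1:10:00
    print("MTBSI:", mtbsi(outages))  # 10 days, 0:00:00
```

For this sample log, the three restorations took 60, 30, and 120 minutes, so MTTRS is 70 minutes, and the interruptions began ten days apart, so MTBSI is ten days.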
Advanced process metrics will include a view into the maturity of the Incident Management process. This includes the percentage of Incidents logged, categorized, and prioritized correctly, each with a separate metric tracked per Service Provider. The Service Desk will be measured on first-call resolution, broken down per Service Desk agent.
Other metrics for measuring incidents
- Average cost of handling an incident, broken down by Service Provider and Priority.
- Incident Reopen Rate, which measures how often an Incident was closed prematurely, for example because a more widespread issue was unknown at the time.
- Number of Incidents per service
- Number of Incidents resolved within agreed SLAs or OLAs
- Number of times SLA or OLA target times were exceeded for Incident resolution
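Several of the metrics listed above can be derived from the same set of Incident records. The following Python is a minimal sketch, assuming a simple record with a service name, a Priority, a resolution time in minutes, and a reopened flag. The `Incident` fields, the `summarize` function, and the target times in `SLA_TARGET_MINUTES` are hypothetical; real targets come from the agreed SLA or OLA.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    priority: int
    resolution_minutes: float
    reopened: bool = False


# Hypothetical resolution targets per Priority, in minutes.
SLA_TARGET_MINUTES = {1: 60, 2: 240, 3: 480, 4: 1440}


def summarize(incidents):
    """Compute per-service volume, SLA results, and reopen rate."""
    total = len(incidents)
    within = sum(
        1 for i in incidents
        if i.resolution_minutes <= SLA_TARGET_MINUTES[i.priority]
    )
    reopened = sum(1 for i in incidents if i.reopened)
    return {
        "incidents_per_service": Counter(i.service for i in incidents),
        "resolved_within_sla": within,
        "sla_breaches": total - within,
        "reopen_rate": reopened / total if total else 0.0,
    }


# Made-up sample records for illustration only.
sample = [
    Incident("email", 1, 45),
    Incident("email", 2, 300),
    Incident("vpn", 3, 120, reopened=True),
    Incident("vpn", 4, 1000),
]

if __name__ == "__main__":
    for name, value in summarize(sample).items():
        print(name, "=", value)
```

In the sample, one of the four Incidents (the Priority 2 email Incident at 300 minutes against a 240-minute target) breaches its target, and one of the four was reopened, giving a reopen rate of 0.25.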
Incident Management is the process that determines how business customers view the performance of the service and the Service Provider. The performance against the SLAs and OLAs will determine how the Service Provider is viewed by the business customer.