ITSM: Major Incident Management (MIM) Workflow

This is for the management and documentation of a Service Restoration Team (SRT)

Aug 10, 2024

This document is continuously under review. See also:

MIM SRT Datapoints to Capture

An initial concern is called out.

Monitoring:

Follow the process for the specific alert.
Client Services:
Qualification of the issue.
Document what initial basic troubleshooting steps have been completed.
- Identify the troubleshooting steps.
The proposed issue is then determined to be a Major Incident Candidate or not.
The escalation is made to the NOC One Pager (the MIM input channel for Major Incident Candidates: a ‘One Page’ with escalation instructions, including ‘continuity of business’ alternatives if tools are experiencing an outage).
The NOC gathers the basic data requirements to begin the Triage of the Major Incident Candidate. Effective basic troubleshooting prior to raising the call to MIM is crucial. The more pertinent information, the better. This information is “Best Effort” and may change due to a later discovery.

Minimum detail required for MIM to triage a Major Incident Candidate.

Major Incident Candidate escalation template

*This template answers questions that MIM must be able to answer upon accepting the escalation for client and ELT-level notifications.
**Be aware that all Major Incidents will have high visibility as senior leadership will be engaged for each major incident called in!

The NOC sends out the page to engage the Service Restoration Team.
In the first 15 minutes of the Outage Call, the SRT (Service Restoration Team) begins the Triage of the Major Incident Candidate.
The Major Incident notification process is implemented.
If the Major Incident Candidate is validated to not be a Major Incident, the NOC sends the response notification to Close the Loop on notifications.
A Proactive response is used when we identify an issue before there is customer impact and remediate it prior to impact (e.g., an escalation is raised for an expiring license, but the Collaborative Work Team has sufficient time to renew the license before it expires).
The P<#> Response is used when there is customer impact, but the actual impact is not significant enough to justify a Major Incident (e.g., an escalation is raised for a network outage, but triage determines only one user is experiencing the issue).
If the Major Incident Candidate is validated to be a Major Incident or is still potentially a Major Incident, the NOC sends the Initial or Potential notification as appropriate.
If there is no doubt that there is a definite Major Incident, then the Initial notification is used to inform our audience that we know that we have an issue.
If Triage needs to be extended to determine whether the Major Incident Candidate is actually a Major Incident, then the Potential notification will be used.
When the NOC is notified of a Major Incident AFTER the issue is already resolved (and only then!) the Initial/Closure notification is used.
If a Major Incident Candidate is determined to be a P3 through P5, then the NOC will ensure that the ticket is assigned to the appropriate assignment group and will be managed by that group.
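The notification choices described above can be sketched as a small decision function. This is only an illustration of the rules as written in this section: the names `Notification` and `select_notification` and the three inputs are assumptions made for the example, not part of any NOC tooling.

```python
from enum import Enum


class Notification(Enum):
    """Notification types named in this workflow."""
    PROACTIVE = "Proactive"
    P_RESPONSE = "P<#> Response"
    INITIAL = "Initial"
    POTENTIAL = "Potential"
    INITIAL_CLOSURE = "Initial/Closure"


def select_notification(is_major_incident, customer_impact, already_resolved=False):
    """Pick the notification the NOC sends after Triage.

    is_major_incident: True, False, or None when Triage must be extended
        (hypothetical encoding chosen for this sketch).
    customer_impact: whether customers are actually impacted.
    already_resolved: True only when the NOC was notified after resolution.
    """
    if is_major_incident is None:
        # Still potentially a Major Incident: extended Triage.
        return Notification.POTENTIAL
    if is_major_incident:
        # Initial/Closure is used only when the issue was already resolved.
        if already_resolved:
            return Notification.INITIAL_CLOSURE
        return Notification.INITIAL
    # Not a Major Incident: close the loop on notifications.
    if customer_impact:
        return Notification.P_RESPONSE
    return Notification.PROACTIVE
```

For example, `select_notification(False, customer_impact=True)` selects the P<#> Response, matching the single-user network outage example above.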
During the Service Restoration Team (SRT) call, the MIM and NOC Analyst will document all critical, detailed information about the Major Incident. This includes the times and dates of pertinent milestones, what mitigation efforts were made, and who was responsible for the various information provided on the SRT.

Make note of “Who, What, When, Why, Where & How.”

If it’s not documented, it didn't happen.

Update Notifications will be sent by MIM and the NOC according to the appropriate OLA.

Major Incident Notification OLAs:

First Communication – 15 minutes or less from first contact to NOC
Updates
- First update - Less than 30 minutes from First Communication
- P1 – Every 30 minutes following the First Update
- P2 – Every 60 minutes following the First Update
- (Exceptions: If agreed by the business, time determined on the bridge or following significant update)
Closure – within 15 minutes after confirmation services are restored.
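As a worked example of the OLA arithmetic above, the following sketch computes when each notification is due. The function names and the `UPDATE_INTERVAL` table are hypothetical, chosen for illustration; the sketch assumes each P1/P2 update is due one interval after the previous update, and the exceptions agreed on the bridge are not modeled.

```python
from datetime import datetime, timedelta

# Update cadence per priority, taken from the OLA list above.
UPDATE_INTERVAL = {"P1": timedelta(minutes=30), "P2": timedelta(minutes=60)}


def first_communication_due(first_contact):
    """First Communication: 15 minutes or less from first contact to NOC."""
    return first_contact + timedelta(minutes=15)


def first_update_due(first_communication):
    """First update: less than 30 minutes from First Communication."""
    return first_communication + timedelta(minutes=30)


def next_update_due(priority, last_update):
    """Subsequent updates: every 30 (P1) or 60 (P2) minutes."""
    return last_update + UPDATE_INTERVAL[priority]


def closure_due(restored_confirmed):
    """Closure: within 15 minutes after services are confirmed restored."""
    return restored_confirmed + timedelta(minutes=15)


first_contact = datetime(2024, 8, 10, 9, 0)
print(first_communication_due(first_contact))                   # 2024-08-10 09:15:00
print(first_update_due(datetime(2024, 8, 10, 9, 10)))           # 2024-08-10 09:40:00
print(next_update_due("P1", datetime(2024, 8, 10, 9, 40)))      # 2024-08-10 10:10:00
```

Note that the first update deadline is measured from when the First Communication was actually sent (09:10 in the example), not from when it was due.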

The MIM will be focused primarily on the Post Incident Review (PIR) documentation.

Customer Business Impact with a concise business impact statement, including EACH of the following:
- Was this initially reported by the client, the company's internal monitoring, or both?
  - Please specify which clients and monitoring tools reported this, and when.
  - When was the first alert?
- Are the users aware of the issue (monitoring on the customer end, helpdesk calls, etc.)?
  - Are these Internal or External customers having the issue (it may be both)?
    - Is a Sensitive customer of the company impacted?
- What can the users no longer do that they could do before?
- When was the first customer contact?
  - How many customer contacts have there been for the issue (are the users aware of the outage)?
  - How many Customer emails are on the issue, by product?
  - How many Customer calls on the issue, by product?
  - What is the approximate percentage of users impacted?
- Products Impacted:
  - Platforms may encompass several products. Please specify all products that would have been affected, individually.
- Is there a viable workaround?
  - Describe workaround
- What is the application or service used for, etc.?
- Is this related to a Change?
  - What Change Control number is it related to?
  - Was this an Undocumented Change?
- Is this related to a Vendor?
  - Vendor ticket?
  - Who manages the Vendor relationship?
  - Are there any posted maintenance notices or details about security changes from the vendor?
- Is this related to a Known Error?
  - What is the Known Error number?
  - Is this related to a known defect or bug?
    - What is the JIRA for the defect or bug?
- What is the potential impact on our company's reputation, brand, and customer confidence in the company?
- Are any deadlines in jeopardy of being missed?
- Are any SLAs in jeopardy?
  - Are any OLAs in jeopardy?
- What is the current approximate Financial impact?
  - Is the Business willing to increase the current impact to unaffected users in order to immediately resolve the issue?
  - Can remediation changes wait until normal maintenance hours?
- Justifications for Changes in Priority?
  - Who authorized the change in priority?
- Follow-up Action Items (Irreversible Corrective Actions, ICAs). These are Problem Management’s responsibility but may be added by others.
  - What is the description or scope of the ICA?
  - Who is assigned the ICA?
  - What is the priority of the ICA?
  - Due date of completion of the ICA?
- Are communications from vendors, alerts, etc. going to an alias and not an individual?
- Is this related to a Project?
The NOC will be focused primarily on the Incident ticket timeline of events and actions.
- GIT (Global IT) notified (may be prior to NOC notified):
- NOC notified and Triage began for Major Incident Candidate (NOC received a <call/monitoring alert> from <name & group (service desk, client services)> at <time>):
- Triage ended; initial Priority determined?
- Downtime began:
- Degradation began:
- Changes in Priority
- Downtime ended:
- Document each Time-Date stamp for Bouncing Up & Down Outages
- Workaround in place:
- Degradation ended:
- Document each Time-Date stamp for Bouncing Up & Down Outages
- Duty Manager page sent out via the xMatters tool.
- SRT (Service Restoration Team) Bridge (Outage Call) initiated.
- Incident posted in Teams to the NOC <Teams Channel Name(s)> channel for awareness.
- When the Outage call started.
- Attendees:
- NOC/Scribe (Author)
- Duty Manager
- Technical Incident Manager (Lead Technical resource)
- Support Group owning the outage (This must match the Technical Incident Manager support group)
- Client Services Representative
- Product Management Representative
- Vendor Relationship Manager
- Who will own updating and closing the ticket in Drive (if the Incident Manager is not Global IT)?
- Business Entities/Countries Impacted
- Global or Regional Outage?
- Business Service(s) Impacted?
- CIs (Configuration Items/Server names/Applications) Impacted?
- Notification Timestamps:
- Notification timeline - Initial sent (posted with email headers):
- Notification timeline - Update(s) sent (posted with email headers):
- Notification timeline - Closure sent (posted with email headers):
- What are the current restoration steps (Plan A)?
- What can be done concurrently, in parallel, to restore service faster?
- What are our Next Steps, once complete?
- What can be done concurrently to prepare for the Next Steps?
- What is our contingency plan if this current path fails (Plan B)?
- What can be done now to prep Plan B for quick implementation if we fall back to it?
- What can be done concurrently to facilitate Plan B, in parallel?
- What can be done concurrently to prepare for the Next Steps of our Plan B?
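The timeline above asks for every Time-Date stamp when an outage bounces up and down. One reason to capture each pair is that total downtime can then be summed from the individual down periods rather than estimated from the first and last stamps. A minimal sketch, assuming each down period is recorded as a (began, ended) pair; the function name `total_downtime` is hypothetical:

```python
from datetime import datetime, timedelta


def total_downtime(intervals):
    """Sum the duration of each (began, ended) down period."""
    total = timedelta()
    for began, ended in intervals:
        if ended < began:
            raise ValueError("down period ends before it begins")
        total += ended - began
    return total


# Two down periods with a 15-minute recovery in between (made-up times).
bounces = [
    (datetime(2024, 8, 10, 9, 0), datetime(2024, 8, 10, 9, 20)),
    (datetime(2024, 8, 10, 9, 35), datetime(2024, 8, 10, 9, 50)),
]
print(total_downtime(bounces))  # 0:35:00
```

Measuring only from the first "Downtime began" to the last "Downtime ended" would report 50 minutes here, overstating the actual 35 minutes of downtime.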

Upon resolution of a Major Incident, the following will occur.

The Close the Loop notification will be sent according to the determination of the type of resolution.
Proactive downgrade
P<#> downgrade
Initial/Closure
Closure
Final Update
- Child to Parent
- Duplicate
The De-escalation Process of a Major Incident will be completed by the MIM.

See also the RACI chart for information to triage a Major Incident Candidate.

Life Cycle of a P1/P2 for Major Incident Management/Problem Management

PIR Process

The PIR is posted to the PIR Forms and marked with the assigned Problem Manager. The entry is changed to WIP when the PIR is emailed to the Technical Incident Manager to begin working.

If needed, the Post-Mortem is typically held 2-3 days after the resolution, to allow time to identify the Root Cause.