Document history

Date | Summary
2010-04-01 | Approved: Architecture Review Board
2015-08-28 | Version 2.1 – reviewed and updated for activation of Enterprise Service Management Tool (eSMT)
  • Updated references to Ministry of Government Services (MGS) to Treasury Board Secretariat (TBS)
  • Corrected references to appendix 6.4 to appendix 6.2.2
  • Request for change (RFC) terminology updated to change request (CRQ)
  • Updated contact information to reflect assignment to IT Service Management Leads (ITSML)
  • Updated Impacts to existing standards table to reflect impact on GO-ITS 44, Terminology Reference Model
  • Updated impacts to existing environments table to reflect conformity to eSMT
  • Updated definitions of impact, urgency and priority matrix in appendix 6.2.2
  • Added requirement to verify and update incident fields and categorization when resolving an incident to Incident Analyst responsibilities.
  • Minor wording and grammar updates
2016-03-08 | Reference to “situation (war) room” changed to “situation room” based on feedback from SCS.
2016-03-16 | Architecture Review Board endorsement
2016-03-31 | IT Executive Leadership Council approval
2018-06-28 | Draft version 2.2 – Review and endorsement by Service Management Executive Committee (SMX)
2019-02-13 | Architecture Review Board endorsement
2019-03-11 | Updates to Draft: (1) based on post-ARB feedback, inserted new wording in section 4.2.10 Service Owner, i.e. “Accountability and managing of applicable vendors including but not limited to engagement during incidents in a timely manner”; (2) added definition for ‘Proactive incident management’ to section 7 Glossary
2019-07-11 | IT Executive Leadership Council approval (Approved Version 2.2)

1. Foreword

Government of Ontario Information Technology Standards (GO-ITS) are the official publications on the IT standards adopted by the Treasury Board Secretariat for use across the government’s IT infrastructure.

These publications support the responsibilities of the Treasury Board Secretariat for coordinating standardization of Information & Information Technology (I&IT) in the Government of Ontario.

In particular, GO-ITS describe where the application of a standard is mandatory and specify any qualifications governing the implementation of standards.

2. Introduction

2.1. Background

The requirement for an all-encompassing Ontario Public Service (OPS) Incident Management standard arose from the positioning of all infrastructure service and support within Infrastructure Technology Services (ITS), a new organization within the OPS mandated in 2005 to deliver these types of services to the OPS. The ITS organization was created in 2006 to achieve this goal. This change required an update of the requirements for the GO-ITS Standard for Incident Management. The result was an updated version of GO-ITS 37, created and approved in July 2007.

During February 2009, a series of outages to Ontario.ca infrastructure prompted I & IT Executive Management to conduct a review of both incident and change management processes and procedures. The review identified deficiencies in a number of areas, including procedures, operational process management and behaviour. The review made specific recommendations to address the deficiencies, and these recommendations were subsequently sanctioned by the Information Technology Executive Leadership Council (ITELC). Accordingly, the OPS Enterprise IT Service Management Program (OEIP) updated the enterprise incident management process standard to incorporate the recommendations.

The updated standard redefined certain aspects of the enterprise incident management principles, roles and the associated process model. Updates to GO-ITS 37 included:

  • Principles, roles, responsibilities and the high-level process flow required to ensure an enterprise perspective of incident management for the OPS.
  • Definition of a major incident protocol at the process standard level
  • Incorporation of ITIL® V3 (2007) [footnote 1] concepts, introduction of a service-based focus for enterprise incident management disciplines and the natural evolution of IT Service Management within the OPS

These standard elements continue to provide a single unified process for enterprise incident management within the OPS. Use of a single process and supporting information will enable OPS-wide management and reporting for the enterprise incident management process through establishment of associated metrics.

In May 2015, the enterprise IT Service Management (ITSM) tool set was upgraded. The upgrade required updates to this document to accommodate changes in terminology and the priority matrix.

From March to June 2016, a comprehensive review was conducted to assess suggested changes to accommodate evolving business and service management processes.

On October 24, 2016, the new Enterprise Service Management (eSM) division was formed within the Office of the Corporate Chief Information Officer (OCCIO). eSM brought together the Service Management function from I&IT clusters and ITS under a single, new entity in order to:

  • Improve internal IT service delivery
  • Enable more consistent processes and service levels across the OPS
  • Improve efficiencies in the way IT services were delivered

In addition, the evolution of incident management broadened the process scope to include proactive incident management and the incorporation of event management as a key enabler.  From Fiscal Year 2016/2017, GO-ITS 37 was reviewed and updated to reflect the roles for the new division and additional principles for the evolution of incident management.

On October 1, 2018, I&IT moved under the new Ministry of Government and Consumer Services (MGCS).  To further the government mandate to provide simpler, faster and better government services, eSM joined the Infrastructure Technology Division on October 22, 2018.

2.2. Purpose

The goals of the enterprise incident management process are to restore normal service operation as quickly as possible, minimize the adverse impact on business operations and ensure that the best possible levels of service quality and availability are maintained.

This process standard describes best practices to be utilized for incident management. The process design is organizationally agnostic and is not constrained by the status quo. Implementation of the process may require organizational or behavioural transformation.

2.3. Value to the business

The value of incident management includes:

  • The ability to detect and resolve incidents, which results in lower downtime to the business, which in turn means higher availability of the service. This means that the business is able to exploit the functionality of the service as designed.
  • The ability to align IT activity to real-time business priorities. This is because incident management includes the capability to identify business priorities and dynamically allocate resources as necessary.
  • The ability to identify potential improvements to services. This happens as a result of understanding what constitutes an incident and also from being in contact with the activities of business operational staff.
  • The Service Desk can, during its handling of incidents, identify additional service or training requirements found in IT or the business.

2.4. Basic concepts

ITIL defines an ‘incident’ as: "An unplanned interruption to an IT service or reduction in the quality of an IT service." Failure of a service component or element item that has not yet impacted service is also considered an incident (e.g., failure of one disk from a mirrored set).

Incident management is the process for dealing with all incidents. This can include:

  • failures, questions or queries reported by the users (usually via a telephone call to the service desk)
  • anomalies detected by technical staff
  • automatically detected errors or conditions reported by event monitoring tools

The Service Desk Agent (SDA) captures the pertinent information and logs, classifies and prioritizes the incident.

The priority of an incident is primarily determined by the impact on the business and the urgency with which a resolution or work-around is needed (as defined in Appendix 6.3). Objective targets for resolving incidents are defined in Service Level Agreements (SLAs). Major incidents, which typically have highest impact and demand quicker resolution, follow the same process as any other incident, but are managed by a separate procedure.
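
The relationship between impact, urgency and priority can be sketched as a simple lookup. This is a minimal illustration only: the three-level scale, the `Level` and `PRIORITY_MATRIX` names and every value in the matrix are assumptions made for the example, not the matrix defined in the appendix of this standard.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Placeholder matrix: (impact, urgency) -> priority, where 1 is the highest priority.
# These values are illustrative and are NOT taken from the standard's appendix.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.HIGH): 3,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}


def priority(impact: Level, urgency: Level) -> int:
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

Keeping the matrix as data rather than as branching logic means an updated matrix can be swapped in without touching the lookup function.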

The Service Desk takes advantage of diagnostic scripts to capture and verify information that is required to quickly resolve the incident. In the case where the Service Desk cannot achieve resolution, this information helps in ensuring the incident is assigned to the appropriate tier 2 group for action. The Service Desk Agent often references incident patterns, the known error database and any available knowledge management records to obtain any information that will assist them in attempting to resolve the incident at First Point of Contact (FPOC).

If the incident cannot be resolved at FPOC, the SDA assigns the incident to a group with more specialized skills. (This is known as functional escalation.)

Tier 1-n thresholds

Each support tier may be allocated a certain amount of time to resolve the incident, following which the incident must be functionally escalated to a more specialized group. The amount of time allocated to each tier is set so that service restoration occurs within the agreed targets, as defined in the SLA. These allocations may be adjusted from time to time based upon staffing models, experience on supporting the various services and ongoing changes to service specifications and components.
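
As a rough sketch of how per-tier time allocations relate to an SLA target, the example below checks that the tier budgets fit within the target and works out which tier should hold an incident after a given elapsed time. The function names and the budget values used in the usage note are hypothetical; actual allocations are set operationally and are not defined here.

```python
from datetime import timedelta


def escalation_deadlines(tier_budgets):
    """Cumulative elapsed time at which each tier must functionally escalate."""
    deadlines = []
    elapsed = timedelta()
    for budget in tier_budgets:
        elapsed += budget
        deadlines.append(elapsed)
    return deadlines


def budgets_fit_sla(tier_budgets, sla_target):
    """True when the combined tier budgets do not exceed the SLA restoration target."""
    return sum(tier_budgets, timedelta()) <= sla_target


def current_tier(elapsed, tier_budgets):
    """1-based tier that should hold the incident after `elapsed` time.

    Once every budget is used up, the last (most specialized) tier keeps the incident.
    """
    for tier, deadline in enumerate(escalation_deadlines(tier_budgets), start=1):
        if elapsed < deadline:
            return tier
    return len(tier_budgets)
```

For example, with assumed budgets of 30 minutes, 1 hour and 2 hours, an incident that has been open for 45 minutes should already sit with tier 2, and the three budgets together fit a 4-hour target but not a 3-hour one.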

[Infographic 1: process flow chart demonstrating the functional escalation steps for each support tier.]

Queues, support groups and functional escalation

The incident management system supports the practice of queues and queue management: each queue represents a view of all incidents assigned to an organization at all levels of priority. This provides a queue manager with an overall perspective of how the incident management process is being executed across all support groups within an organization at any given time. Should a certain part of the organization experience a backlog of incidents in its queues, the queue manager may ask the responsible manager to perform hierarchical escalation, notifying more senior management of the situation in an effort to relieve the pressure on any specific queue.

This basic concept applies to the design of the incident management process within the Ontario Public Service; however, organizational maturity currently prevents the industry best practice from being strictly followed. It is important to note this concept as it describes the desired organizational behaviour or "future-state" model.

Various support groups have also been established in each OPS organization based upon areas of functional expertise. An incident can be assigned to any one of these support groups, where it is then assigned to an individual member of that group to undertake incident diagnosis and resolution. All of these support groups must roll up into an organizational queue view, so that the overall perspective is available to the Queue Manager.

A Service Desk Agent who cannot resolve an incident at FPOC assigns it to the appropriate tier n support group, based upon the initial diagnosis.

Once the Service Desk Agent has assigned the incident to a tier n Incident Analyst, one of three things typically occurs:

  • Resolution: The Incident Analyst restores service, sets the incident status to resolved and informs appropriate tier n resources, if required.
  • Re-assignment: The Incident Analyst concludes that the cause of the incident does not lie in the analyst’s area of expertise and assigns the incident back to the Service Desk for re-assignment to a more appropriate group.
  • Functional escalation: The Incident Analyst cannot resolve the incident within the defined threshold and requests that the incident be assigned to the support group with more specialized skills.
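
The three outcomes above can be expressed as a small decision function. This is a sketch under stated assumptions: the `Outcome` enum, the function name and the three boolean inputs are hypothetical, and returning `None` to mean "the analyst is still working within the threshold" is a choice made for the example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    RESOLUTION = "resolution"
    REASSIGNMENT = "re-assignment"
    FUNCTIONAL_ESCALATION = "functional escalation"


def analyst_outcome(service_restored: bool,
                    within_expertise: bool,
                    threshold_exceeded: bool) -> Optional[Outcome]:
    """Map the analyst's findings to one of the three outcomes.

    Returns None while the incident is still being worked within its threshold.
    """
    if service_restored:
        return Outcome.RESOLUTION
    if not within_expertise:
        # Goes back to the Service Desk for re-assignment to a more appropriate group.
        return Outcome.REASSIGNMENT
    if threshold_exceeded:
        # Hand off to a support group with more specialized skills.
        return Outcome.FUNCTIONAL_ESCALATION
    return None
```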

A Queue Manager role is established for an individual support group to monitor their respective queues at regular intervals to identify any incidents that have not been assigned to individuals, or have not been resolved within defined thresholds and to take proactive action.
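
A queue check of this kind might look like the sketch below, which flags open incidents that either have no individual assignee or have exceeded a threshold since assignment to the group. The `Incident` fields, the function name and the threshold value are assumptions for illustration and do not describe the eSMT data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    incident_id: str
    assigned_at: datetime            # when the incident reached the support group
    assignee: Optional[str] = None   # individual analyst, if one has been assigned
    resolved: bool = False


def needs_attention(queue: List[Incident],
                    now: datetime,
                    threshold: timedelta) -> List[Tuple[str, str]]:
    """Return (incident_id, reason) pairs for incidents a Queue Manager should act on."""
    flagged = []
    for incident in queue:
        if incident.resolved:
            continue
        if incident.assignee is None:
            flagged.append((incident.incident_id, "unassigned"))
        elif now - incident.assigned_at > threshold:
            flagged.append((incident.incident_id, "threshold exceeded"))
    return flagged
```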

Accountability

Regardless of the support staff and organization to which an incident may be assigned, the Incident Manager (part of the OPS IT Service Desk [OPS ITSD] organization) remains accountable for ensuring that enterprise incident management process and procedures are followed and that prompt incident resolution activities are undertaken with service level objectives in mind.

Whoever restores service (Service Desk Agent or tier 2-n support group) is accountable for confirming with the customer and/or end user that service has been restored and for verifying the accuracy of the resolution categorization prior to resolving the incident. The incident management tool, eSMT, will close the incident fifteen days after the ticket enters a resolved state. The exception is the Major Incident Protocol (MIP), where the Incident Coordinator (IC) is accountable for resolving the incident.
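
The fifteen-day closure rule can be illustrated with a short date calculation. The sketch assumes calendar days, which the text does not specify, and the function names are hypothetical; it does not describe how eSMT implements the rule.

```python
from datetime import datetime, timedelta

# Assumption: "fifteen days" is read as fifteen calendar days.
AUTO_CLOSE_AFTER = timedelta(days=15)


def auto_close_due(resolved_at: datetime) -> datetime:
    """Moment at which a resolved incident becomes eligible for automatic closure."""
    return resolved_at + AUTO_CLOSE_AFTER


def should_close(resolved_at: datetime, now: datetime) -> bool:
    """True once the incident has been in a resolved state for the full period."""
    return now >= auto_close_due(resolved_at)
```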

Inputs to the incident management process include [footnote 2]:

  • Incident records from calls to the Service Desk
  • Service level objectives (from SLAs)
  • Capacity management thresholds and monitoring alerts
  • Incident resolution details from the incident template
  • Incident patterns and workarounds from incident knowledge management database
  • Known errors from problem management
  • Configuration Item (CI) data from configuration management
  • Change requests (CRQ)

Outputs from the incident management process include [footnote 3]:

  • Closed incidents - services restored
  • Change Requests (CRQ) - incident resolution
  • Inconsistencies found while interrogating the Configuration Management Database (CMDB)
  • Consistent, meaningful and maintained incident records
  • Meaningful management information

2.5. Scope

2.5.1. In Scope

Incident management includes any event which disrupts, or which could disrupt, a service. This includes events which are communicated directly by users through the Service Desk or events detected through an automated interface from event management to incident management tools.

For purposes of clarity, any use of the terms Incident Manager, incident management or incidents within this document includes the "enterprise" perspective described in Section 2.1.

Service requests do not represent a disruption to agreed service, but are a way of meeting the customer’s needs and may be addressing a specific aspect or feature of the service being provided (request fulfillment). This will be documented in the service level agreement with each customer and the service level objective will be outlined therein. Service requests are dealt with by a separate request fulfilment process.

Service requests in the OPS may be currently tracked under the same incident management enabling technology used by the Service Desk for incident logging.

Incident management scope:

Is | Is not
"How to" and technical questions | Service requests (request fulfillment). This is handled in the OPS through service request management portal.
Service Restoration | Root cause analysis (part of problem management)
Steps and procedures to manage major incidents | Establishment of communication thresholds for customers

2.6. Applicability statements

2.6.1. Organization

Government of Ontario IT standards and enterprise solutions and services apply (are mandatory) for use by all ministries/clusters and to all former schedule I and IV provincial government agencies under their present classification (Advisory, Regulatory, Adjudicative, Operational Service, Operational Enterprise, Trust or Crown Foundation) according to the current agency classification system.

Additionally, this applies to any other new or existing agencies designated by Management Board of Cabinet as being subject to such publications, i.e., the GO-ITS publications and enterprise solutions and services, and particularly applies to Advisory, Regulatory, and Adjudicative Agencies. Further included is any agency which, under the terms of its Memorandum of Understanding with its responsible Minister, is required to satisfy the mandatory requirements set out in any of the Management Board of Cabinet Directives (cf. Operational Service, Operational Enterprise, Trust, or Crown Foundation Agencies).

As new GO-IT standards are approved, they are deemed mandatory on a go-forward basis. Specifically, in the case of the revised version of GO-ITS 37 V2.0, the effective date has been established as July 1, 2010. Future versions will become mandatory on the effective date established for that version.

When implementing or adopting any Government of Ontario IT standards or IT standards updates, ministries and I&IT clusters must follow their organization's pre-approved policies and practices for ensuring that adequate change control, change management and risk mitigation mechanisms are in place and employed.

For the purposes of this document, any reference to ministries or the government includes applicable agencies.

2.6.2. Requirements levels

Within this document, certain wording conventions are followed. There are precise requirements and obligations associated with the following terms:

Must: This word, or the terms "required" or "shall", means that the statement is an absolute mandatory requirement.

Should: This word, or the adjective "recommended", means that there may exist valid reasons in particular circumstances to ignore the recommendation, but the full implications (e.g., business functionality, security, cost) must be understood and carefully considered before deciding to ignore the recommendation.

2.6.3. Compliance requirements

Execution of this process at the operational level requires use of procedures, work instructions and enabling technology to automate certain workflow aspects. These elements will be produced by the organization selected by ITELC as the Operational Process Manager. Pending formalization of an ITSM process lifecycle management protocol, the following statements are presented to ensure that these elements are fully compliant with this standard:

  • Procedures must be developed by decomposing each process step from section 4.3 into procedural subtasks. These procedures must be submitted to the Enterprise Process Owner for certification that they comply with the spirit and intent of the process standard.
  • Work Instructions must be developed by decomposing all procedural subtasks into further subtasks. These must be then submitted to the Enterprise Process Owner for certification that they comply with the certified process and procedures.
  • Functional requirements must be developed for enabling technology that will be used to automate aspects of the work instructions and procedures. Functional requirements must also be submitted to the Enterprise Process Owner for certification that they align with the certified procedures.
  • Any subsequent modifications to the procedures, work instructions or enabling technology must be managed via Enterprise Change Management (eCM). They are subject to review and endorsement by the respective working committees for any changes/improvements and annual approval by Service Management Executive Committee (SMX).

3. Standards lifecycle management

3.1. Contact information

Accountable role (standard owner) definition

The individual or committee ultimately accountable for the process of developing this standard. Where a committee owns the standard, the committee Chair is accountable for developing the standard including future updates. There must be exactly one accountable role identified. The accountable person also signs off as the initial approver of the proposed standard before it is submitted for formal endorsement to Architecture Review Board (ARB) and approval by ITELC. (Note: in the OPS this role is normally at the IT executive or manager level)

Accountable role: Chair of Service Management Executive Committee (SMX)

Responsible role definition

The organization(s) responsible for the development of this standard. There may be more than one responsible organization identified if it is a partnership/joint effort. (Note: the responsible organization(s) provides the resource(s) to develop the standard)

Responsible organization(s): Infrastructure Technology Services (ITS), MGCS

Support role definition

The support role is the resource(s) to which the responsibility for actually completing the work and developing the standard has been assigned. If there is more than one support role, the first role identified should be that of the editor – the resource responsible for coordinating the overall effort.

Support role (editor):

Ministry: Ministry of Government and Consumer Services (MGCS)
Division: Infrastructure Technology Services (ITS)
Branch:  Service Management Operations & Process Management Branch

Job Title: Senior Manager
Name:  Arpad Martonosi
Phone: (416) 327-2080
Email: Arpad.Martonosi@ontario.ca

Job Title: Incident Manager
Name:  John Mancuso
Phone: (905) 704-2824
Email: John.Mancuso@Ontario.ca

The above individuals will be contacted by the Standards Section once a year, or as required, to discuss and determine potential changes and/or updates to the standard (including version upgrades and/or whether the standard is still relevant and current).

Consulted

Please indicate who was consulted as part of the development of this standard. Include individuals (by role and organization) and committees, councils and/or working groups.

(Note: ‘consulted’ means those whose opinions are sought, generally characterized by two-way communications such as workshops):

Organization consulted (Ministry/I&IT Cluster) | Division | Branch | Date
Tracey Burns | CAC | N/A | N/A
Steve Theofilaktidis | CAC | N/A | N/A
Sacha Sone | CAC | N/A | N/A
Linda Anceriz | CSC | N/A | N/A
Jennifer Ellis | CYSSC | N/A | N/A
Lucille Gauthier | LTC | N/A | N/A
Tim Trojko | GSIC | N/A | N/A
Chantal Gallant | HSC | N/A | N/A
Ian Anderson | LRC | N/A | N/A
Cathy Hogan | JC | N/A | N/A
Vickie Barber | LRC | N/A | N/A
Matthew Mroczeck | CAC | N/A | N/A
Jeff Miclash | LTC | N/A | N/A
Lee Herrera | GSIC | N/A | N/A
Jennifer Sherlock | JC | N/A | N/A
Arpad Martonosi | eSM | N/A | N/A
Stephane Vertefeuille | ITS | SM | N/A
Amanda McCabe | ITS | SM | N/A
Kym Eedy | eSM | N/A | N/A
Tom Cholewinsky | eSM | N/A | N/A
Mike Williams | eSM | N/A | N/A
Jeff Martinson | eSM | N/A | N/A
John Mancuso | eSM | N/A | N/A
Aaron Zammit | eSM | N/A | N/A
Olive Kilpatrick | eSM | N/A | N/A
Carla Bellon | LTC | N/A | N/A
Real Martin | LTC/eSM | N/A | N/A
Christopher Phillips | eSM | N/A | N/A
Cary Lee | ITS | CRM | N/A
Derek Brown | eSM | N/A | N/A
Arthur Ho | eSM | N/A | N/A
James Foisy | eSM | N/A | N/A
Lucia Chiarello | OPP | N/A | N/A
Matt Glassford | eSM | N/A | N/A
Victor Krause | eSM | N/A | N/A
Guido Piraino | eSM | N/A | N/A
Robin Aklu | eSM | N/A | N/A
Winston Constantine | CSOC | N/A | N/A
Jamal Bandukwala | CSOC | N/A | N/A
Kevin Beauchesne | CSOC | N/A | N/A

Committee/working group consulted | Date
ITSM Leads | Dec 2009 and Feb 2010
Partner Incident Management Liaisons | Aug-Sept 2015
Enterprise Incident Manager/Incident Management Process Coordinators/Service Desk Team Leads | Mar-Jun 2016
Partner Incident Management Liaisons | Sept 2016
Incident Management User Community | Sept 2017
ITSM GO-ITS Sub-Committee (PSSC) | Sept 2017
eSM Management/CSD | Jan 2018

Informed

Please indicate who was informed during the development of this standard. Include individuals (by role and organization) and committees, councils and/or working groups.

(Note: ‘informed’ means those who are kept up-to-date on progress, generally characterized by one-way communication such as presentations):

Organization informed (ministry/cluster) | Division | Branch | Date
N/A | N/A | N/A | N/A

Committee/working group informed | Date
N/A | N/A

3.2. Recommended versioning and/or change management

Changes (i.e., all revisions, updates, versioning) to the standard require authorization from the "responsible" organization.

Once a determination has been made by the responsible organization to proceed with changes, the Service Management Operations & Process Management Branch, MGCS, will coordinate and provide assistance with respect to the approvals process.

The approval process for changes to standards will be determined based on the degree and impact of the change. The degree and impact of changes fall into one of two categories:

Minor changes - requiring communication to stakeholders. Changes are noted in the "Document History" section of the standard;

Major changes - requiring a presentation to SMX/ARB for endorsement and ITELC for approval.

Below are guidelines for differentiating between minor and major changes:

Major:

  • represents a change to one or more of scope, principles, roles or high-level process flow
  • responds to legislative changes

Minor:

  • does not impact other standards (e.g., updated glossary information or updated informative or normative reference documentation)
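
The guidelines above can be read as a simple classification rule, sketched below. The function and parameter names are hypothetical, and treating a change that impacts other standards as major is an interpretation of the "Minor" guideline rather than something the text states outright.

```python
# Areas whose modification makes a change "major" under the guidelines above.
MAJOR_AREAS = {"scope", "principles", "roles", "high-level process flow"}


def classify_change(areas_changed,
                    responds_to_legislation: bool = False,
                    impacts_other_standards: bool = False) -> str:
    """Classify a proposed change to the standard as 'major' or 'minor'."""
    if MAJOR_AREAS & set(areas_changed):
        return "major"
    if responds_to_legislation:
        return "major"
    if impacts_other_standards:
        # Interpretation: a minor change is one that does not impact other standards.
        return "major"
    return "minor"
```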

3.3. Publication details

All approved Government of Ontario IT Standards (GO-ITS) are published on the OPS Intranet site. Please indicate below if this standard is also to be published on the public, GO-ITS Internet Site.

Publication of GO-ITS standard | Yes/No
Standard to be published on both the OPS intranet and the GO-ITS internet web site (available to the public, vendors etc.) | Yes

4. Technical specification

4.1. Process principles

Principles are established to ensure that the process identifies the desired outcomes or behaviours related to adoption at an enterprise level. They also serve to provide direction for the development of procedures and (as necessary) work instructions that will ensure consistent execution of the process. The absence of well-defined and well-understood principles may result in process execution that is not aligned with the process standard. Process principles for OPS enterprise incident management are listed below.

Principle 1:

A single enterprise incident management process shall be used across the OPS in support of I & IT services.

Rationale:
  • A single process eliminates costs and inefficiencies of multiple processes for different services.
  • Establishment of a Single Point of Contact (SPOC) OPS IT Service Desk (OPS ITSD) in FY 2006/2007 implied a single incident management process for OPS I & IT incident management.
Implications:
  • Legacy incident management related procedures and work instructions must be integrated and aligned to the OPS enterprise incident management process.
  • Application support groups must adapt existing procedures and work instructions to comply with the OPS enterprise incident management process.

Principle 2:

Incident classification must identify the service(s) that is/are impacted (from the customer’s perspective).

Rationale:
  • OPS service directive.
  • Formative OEIP business architecture principle to establish a service focus for ITSM processes.
  • Enable implementation of Enterprise Service Agreements Model (eSAM).
Implications:
  • IT staff must adopt an end-to-end service perspective for all incidents.
  • Service classification requirements must be defined and included in enabling technology.
  • Service owners must identify the services/hierarchy.
  • A service configuration hierarchy must exist in order to identify impacted services.
  • IT staff must be trained in new classification techniques.
  • Incident messaging with user/customer must communicate the service that reflects the business impact.
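
To illustrate why a service configuration hierarchy is needed to identify impacted services, the sketch below walks upward from a failed component to the customer-facing services that depend on it. The `supports` mapping, the function name and all component and service names in the usage note are invented for the example and do not reflect the CMDB structure.

```python
def impacted_services(failed_item, supports):
    """Return the top-level services that depend on `failed_item`.

    `supports` maps each item to the items or services it directly supports.
    An entry with nothing above it is treated as a customer-facing service.
    """
    services = set()
    seen = {failed_item}
    stack = [failed_item]
    while stack:
        node = stack.pop()
        parents = supports.get(node, [])
        if not parents and node != failed_item:
            services.add(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return services
```

With a hypothetical hierarchy in which `disk-07` supports `db-server-2`, which supports a `Licensing Database` used by two services, a failure of `disk-07` would be classified against both of those services rather than against the disk itself.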

Principle 3:

The OPS ITSD shall be the single entry point into the enterprise incident management process and will have oversight of incidents through their complete lifecycle including: assignment, functional and hierarchical escalation, tracking, communication and closure.

Rationale:
  • Single accountability for execution of enterprise incident management process.
  • Ability to share topical information within a single group and provide enterprise perspective.
  • Ability to cross-reference other incidents and establish incident priority from an enterprise perspective.
  • Consistent management and co-ordination of incident resolution.
Implications:
  • Effective diagnostic scripts and support models are required to assist in triage of incidents and ensure accurate assignment to the appropriate tier-n resources.
  • Service Owners must support the objective assessment of reported incidents and ensure criteria for impact and urgency (used to determine priority) are established and communicated to customers through the service level management process.
  • Incident assignments/re-assignments to tier-n support must occur via Service Desk only.

Principle 4:

The OPS ITSD shall act as the single point of contact for all business communication regarding reported incidents.

Rationale:
  • Consistent support interface for customers.
  • Consistent delivery and coordination of communications to internal staff.
  • Reduces duplicative messaging and ensures common perspective is provided to customers and to I & IT Senior Management.
  • IT tier 2-n support staff are more productive since they are protected from interruptions and the need to manage communications.
Implications:
  • Assistance and incident status information must be available (24/7) from the OPS ITSD throughout the entire lifecycle of the incident.
  • OPS ITSD and technical support staff will have to adjust their messaging to describe impacts/status in terminology that is service focused and customer based rather than technical in nature.
  • OPS ITSD will distribute all major incident communications (sanctioned by the Major Incident Manager).
  • Customers or I & IT clusters must have in place a mechanism to broadly disseminate information provided to them by the OPS ITSD.

Principle 5:

An incident must be logged through the OPS ITSD as a prerequisite for engagement of any tier 2-n support staff, including external service providers.

Rationale:
  • The incident record is the source for all incident resolution activities undertaken by any support staff.  Failure to document these activities increases the risk of delayed resolution.
Implications:
  • OPS ITSD procedures must identify the minimum level of information required to initiate an incident record and to enable effective investigation and diagnosis.

Principle 6:

Closure of incidents shall be dependent upon validating with either the end user or the customer that service has been restored.

Rationale:
  • Obtaining positive confirmation of incident resolution ensures that the customer is satisfied with the service delivered.
  • Validation step enhances the image of the IT organization.
Implications:
  • Customers will identify an appropriate level of resource to accept the validation request.
  • A suitable mechanism must be defined to deal with circumstances when end user(s) cannot be reached for validation within a predefined time period.

Principle 7:

There shall be notification and escalation procedures that ensure consistent, timely incident resolution and communication of progress relative to service level agreements.

Rationale:
  • Setting customer expectation for timing of periodic status reports will prevent interruptions caused by requests for status.
  • More effective delivery of end-to-end service as IT staff will have a clear understanding of agreed incident service level targets, which will guide appropriate functional and hierarchical escalation.
  • Incidents resolved within customer expectations will increase customer satisfaction.
Implications:
  • Clear triggers and thresholds must be defined for functional and hierarchical escalation, as well as any periodic status notifications (this implies some form of automation); service level objectives (documented in service level agreements) must be clearly and explicitly defined and linked to these thresholds.
  • A single escalation procedure must exist for functional and hierarchical escalation and must be adhered to by all participants in the incident management process.
  • A single notification procedure must exist for notification.
  • Any unique requirements for service-specific notification thresholds must be documented and managed through the service level management process, and outputs from these situations must be configured within the OPS ITSD enabling technology to support the requirements.
  • Templates and scripts are required to ensure consistency of messaging.
  • Customer messaging must be tailored to deliver a customer perspective.
  • Messaging for the internal service provider community may carry a different level of detail, and this will be managed through local work instructions at the OPS ITSD.
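
The link between service level objectives and escalation triggers described above can be illustrated with a minimal sketch. The function name, the action labels and the threshold fractions (50%, 75% and 100% of the service level objective) are hypothetical and are not taken from this standard; actual thresholds would be defined in the applicable service level agreements.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, expressed as fractions of the service level objective (SLO).
# Real values would be defined in service level agreements, not in this sketch.
THRESHOLDS = [
    (1.00, "hierarchical_escalation"),
    (0.75, "functional_escalation"),
    (0.50, "status_notification"),
]


def due_action(opened_at, now, slo):
    """Return the most severe action whose threshold has been crossed, or None."""
    elapsed_fraction = (now - opened_at) / slo  # timedelta / timedelta gives a float
    for fraction, action in THRESHOLDS:  # ordered from most to least severe
        if elapsed_fraction >= fraction:
            return action
    return None
```

Keeping the thresholds in one table, rather than spreading them across procedures, reflects the requirement for a single escalation procedure and a single notification procedure.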

Principle 8:

All incident information, including resolution details, shall be logged in an accessible incident management repository.

Rationale:
  • Single source of data for all enterprise incidents ensures consistent view and authoritative source for management of incidents.
  • Tracking of progress enables ability to escalate.
  • Provides knowledge base to enable:
    • Reduction in Mean Time to Resolve (MTTR) for similar incidents by applying previous workaround.
    • Analysis and identification of problems (by problem management process).
  • Audit trail informs reporting (service level management).
Implications:
  • Incident management must be supported by an integrated IT support system with a common database for logging all incident and resolution information.
  • Incident management and problem management must have access to the same database.
  • Validation of accuracy of resolution details must occur before any auto-closure of tickets.

Principle 9:

A separate procedure shall be established to manage resolution of major incidents that will include nomination of a single manager for the incident. This resource will be assigned from a pool of management within ITS, Cyber Security Division or the Cluster.

Rationale:
  • Major incidents involve outages with high business impact that usually affect public-facing services.  Restoration justifies extraordinary attention and resources.
  • Special leadership may be required to secure and manage resources to ensure prompt resolution of major incidents.  This will include authority to make human resources decisions/financial commitments as required.
  • Establishment of an accountable lead will ensure ownership of the major incident and provide an objective point of escalation and contact throughout the life of the incident from declaration to major incident review.
Implications:
  • Criteria for major incident declaration must be defined, documented and communicated to stakeholders and then linked to incident prioritization activities at the OPS ITSD.
    • Criteria may vary by service; it is neither reasonable nor efficient to define “one size fits all” criteria that apply to all incidents.
    • It is an expensive undertaking to invoke major incident procedures and secure and coordinate the resources required to deal with a major incident. Therefore, care must be taken to prevent subjective or reactive declaration by specifying objective, quantifiable attributes for an incident to be declared major.
    • Priority 1 incidents are defined as major incidents.
  • Ability to engage and receive confirmation of acceptance from the accountable Major Incident Manager must be 24/7.
  • Incident Analyst staff in any organization must be contactable on a 24/7 basis to support major incidents.
  • Some major incidents may not require special leadership if resolution activities are outside the span of control of the OPS I & IT community (e.g., major power outage or major weather situation across the province).
  • Staff involved in the incident management process must be trained in the major incident procedure.
  • Logistics, facilities and technical requirements for a situation room must be identified and provisioned to support prolonged or multiple incident events.  This information must be made widely available to all stakeholders in the enterprise incident management process.
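
As an illustration of objective, quantifiable declaration criteria, the following sketch derives a priority from impact and urgency and flags priority 1 incidents as candidates for major incident declaration. The matrix values and function names are hypothetical; the authoritative impact, urgency and priority definitions are those documented in the appendices of this standard, and the decision to declare a major incident remains with the Incident Manager.

```python
# Hypothetical 3x3 matrix: (impact, urgency) -> priority.
# The authoritative definitions are in the appendices of this standard.
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 4,
}


def prioritize(impact, urgency):
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


def meets_major_incident_criteria(impact, urgency):
    """Priority 1 incidents are defined as major incidents."""
    return prioritize(impact, urgency) == 1
```

Because criteria may vary by service, a fuller version would likely hold one such matrix per service rather than a single shared table.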

Principle 10:

Any proposed service restoration activity which has the potential to impact other services or other customers of the same service must be approved by the Service Owner(s) before being undertaken.

Rationale:
  • Ensures that incident resolution activities do not impact other services or other users of the same service.
  • Ensures a business perspective is considered before possible disruptive actions are taken for incident resolution.
Implications:
  • Service Owner(s) must be contactable 24/7.
  • As an alternative to 24/7 availability, a defined policy must be developed by the Service Owner that will outline the proposed approach for each of the services in the catalogue of the Service Provider.  This policy must be shared with stakeholders and embedded in all service level agreements. The Incident Manager or Major Incident Manager (see Principle 9 above) would be contacted to provide requisite approval (after due consideration of the policy).
  • An ability to relate components and enabling services is required to understand potential impact to other users.  This information is typically obtained from the Configuration Management Database (CMDB).
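
The following sketch shows one way the relationship between components and services could be used to estimate which other services a restoration activity might affect. The dependency data, item names and function name are hypothetical; a real implementation would query the CMDB rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical CMDB extract: each item maps to the items that depend on it directly.
DEPENDENTS = {
    "db-server-01": ["payroll-app", "licensing-app"],
    "payroll-app": ["payroll-service"],
    "licensing-app": ["licensing-service"],
}


def impacted_items(component, dependents):
    """Breadth-first walk returning everything that depends, directly or indirectly, on component."""
    seen = {component}
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, []):
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return seen - {component}
```

If the returned set contains services other than the one being restored, approval from the relevant Service Owner(s) would be sought before proceeding.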

Principle 11:

Incident resolution activities must commence as soon as possible for all incidents regardless of priority.

Rationale:
  • Industry best practice supports determining as soon as possible the extent and effort required to resolve incidents.
  • Delaying resolution activities for a seemingly minor or misdiagnosed incident could increase the impact to the customer (activities to resolve incidents reported during non-prime shifts, if deferred to next business day, can result in service-affecting impact to the customer).
Implications:
  • Unresolved incidents must be monitored on a periodic basis and their impact reassessed based on service level objectives.
  • Local work instructions must prescribe that a “sweeping” of the incident queues be performed on a periodic basis to ensure outstanding incidents have been actioned in support of service level objectives.
  • An ability to engage active support of tier 2-n resources outside of business hours is required.
  • Priority 2 incidents that are assigned to tier 2-n support groups outside of regular business hours may not be actioned until next business day.
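
A periodic "sweep" of the incident queues could be sketched as follows. The record fields (`id`, `status`, `last_updated`), the function name and the idle limit are hypothetical and only illustrate the idea; local work instructions would define the actual interval and fields.

```python
from datetime import datetime, timedelta


def sweep_queue(incidents, now, max_idle):
    """Return IDs of unresolved incidents with no recorded activity within max_idle."""
    return [
        inc["id"]
        for inc in incidents
        if inc["status"] != "resolved" and now - inc["last_updated"] > max_idle
    ]
```

Incidents returned by the sweep would then have their impact reassessed against the applicable service level objectives.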

Principle 12:

All Service Owners and OPS Service Providers shall fulfill their roles in compliance with the OPS enterprise incident management process.

Rationale:
  • Consistent participation from all stakeholders is required to ensure success of the enterprise incident management process.
Implications:
  • Underpinning Contracts (UCs) with external service providers must reflect the enterprise incident management process requirements.
  • Operating Level Agreements (OLAs) between internal service providers must be in place and reflect enterprise incident management process requirements.

Principle 13:

A mechanism must be in place to identify security-related incidents and engage appropriate support staff to resolve the issue.

Rationale:
  • Security-related incidents may require specialized skills that are not resident in the OPS ITSD organization.
Implications:
  • A security support group must be established and staffed on a 24/7 basis.
  • Special procedures must be defined and agreed to by the OPS ITSD and Cyber Security Division (CSD) to address security-related incidents.
  • OPS ITSD staff must be provided with initial and ongoing training to ensure they are equipped to identify potential security-related incidents.
  • This mechanism must be bidirectional in nature as CSD must have the ability to proactively inform the OPS ITSD of a security-related incident.

Principle 14:

A proactive enterprise incident management process is required where high likelihood of future impact is detected, and corrective action is required to prevent business impact.

Rationale:
  • The ability to detect and resolve incidents before business impact occurs results in lower downtime and higher availability of a service.
  • The dynamic allocation of resources to address identified business priority prior to impact ensures the best possible levels of service quality.
Implications:
  • Event management is a requirement to detect exceptions that may cause business impact.
  • Incident management processes can be activated before users contact the OPS IT Service Desk.
  • Automation may be applied for the resolution of incidents where appropriate.  For example, a script can be run to automatically restart a service when an event detects that a website has become unavailable.
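
The automated restart example above could look like the following minimal sketch. The availability check, restart action and logging function are passed in as callables because the actual monitoring and service tooling is not specified here; all names and return values are hypothetical.

```python
def auto_remediate(check_available, restart_service, log):
    """Restart the service if the availability check fails, and log what was done."""
    if check_available():
        return "no_action"
    log("availability check failed; restarting service")
    restart_service()
    if check_available():
        log("service restored by automated restart")
        return "restored"
    log("automated restart did not restore service; manual follow-up needed")
    return "escalate"
```

Logging each step keeps a record of the automated response, and the "escalate" result leaves room for normal incident handling when automation does not restore service.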

Principle 15:

Event management is essential to enterprise incident management by providing information on the status of I&IT services and detecting any deviation from normal or expected operational behaviour.

Rationale:
  • Provides the ability to identify an exception that can be resolved before the user is impacted or to minimize the impact on the user.
  • Enables automated responses, creating efficiencies in timeliness and utilization of resources.
Implications:
  • Event management tools are required to interface with the incident management tool.
  • Automated responses must be logged within the enterprise Incident Management Tool to enable validation and trending for problem management.
  • The appropriate priority will be assigned to the incident created from an event.

Principle 16:

An event transitions to an incident when it is assessed as a clearly defined exception that may cause significant impact to business services.

Rationale:
  • A defined process must be applied to highlight significant events and avoid being inundated by insignificant events.
  • Human or cognitive analysis is required before the event is transitioned to an incident.  The analysis will include verification of business impact and priority.
  • The incident must identify the exception, the significance of the exception and the information required to determine the appropriate action to take.
Implications:
  • Not all events require immediate or expedited remediation activities.
  • Events that are manageable through event logs or alert consoles do not require transition to incidents.
  • Information is available to quickly and accurately assess impact and urgency.
  • Event-generated incidents shall have a unique identifier to differentiate them from user-generated incidents.
  • When there are recurring event-generated incidents, similar to recurring user‑generated incidents, problem management will be engaged to investigate the root cause.
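
The transition from event to incident could be sketched as follows. The event fields, the `EVT-INC-` identifier prefix and the function name are hypothetical; the `significant` flag stands for the outcome of the analysis described in the rationale, which is performed before the transition and is not automated here.

```python
import itertools

_sequence = itertools.count(1)  # simple in-memory counter, for illustration only


def transition_event(event):
    """Create an incident from an event only if it was assessed as a significant exception."""
    if event["type"] != "exception" or not event["significant"]:
        return None  # remains in the event log or alert console
    return {
        # Hypothetical prefix that distinguishes event-generated incidents
        "id": f"EVT-INC-{next(_sequence):06d}",
        "source": "event",
        "summary": event["summary"],
        "impact": event["impact"],
        "urgency": event["urgency"],
    }
```

Carrying impact and urgency on the incident record supports quick assessment of priority, and the distinct identifier makes recurring event-generated incidents easier to group for problem management.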

4.2. Process roles and responsibilities

Each process requires specific roles to undertake defined responsibilities for process design, development, execution and management. An organization may choose to assign more than one role to an individual. Similarly, the responsibilities of one role could be mapped to multiple individuals.

One role is accountable for each process activity. With appropriate consideration of the required skills and managerial capability, this person may delegate certain responsibilities to other individuals; however, it is ultimately the job of the person who is accountable to ensure that the “job gets done.”

Regardless of the mapping of responsibilities within an organization, specific roles are necessary for the proper operation and management of the process. This section lists the mandatory roles and responsibilities that must be established to execute the incident management process.

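| Process task | Incident manager (all incidents) | Major incident manager (P1) | Situation manager (P2) | Service desk agent | Incident analyst (tier 2-n) | Service owner | ITS Incident Advisor |
|---|---|---|---|---|---|---|---|
| Log & Classify Incident | A | Nil | Nil | R | Nil | Nil | Nil |
| Prioritize Incident | A | Nil | Nil | R | Nil | Nil | Nil |
| Declare Major Incident | A,R | I | Nil | C | Nil | I | I |
| Perform Tier 1 Diagnosis | A | Nil | Nil | R | Nil | Nil | Nil |
| Functional Escalation | A | R | R | R | Nil | Nil | C |
| Perform Tier-N Diagnosis | A | Nil | Nil | Nil | R | I | Nil |
| Resolve Incident | A | A* | A* | R,I | R | I | I |
| Monitor Incident | A | R | R | Nil | Nil | Nil | R |
| Close Incident | A | Nil | Nil | R** | Nil | Nil | Nil |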

Legend: R = Responsible, A = Accountable, C = Consult before, I = Informed
A*

  • Major Incident Manager is accountable to resolve major incidents per major incident protocol
  • Situation Manager may be called upon to resolve other incidents as deemed necessary by the Incident Manager

R** Incident closure is automated by the tool at this time

Enterprise incident management process owner

The Process Owner owns the process and the supporting documentation for the process. The Process Owner provides process leadership to the IT organization by overseeing the process and ensuring that the process is followed by the organization. When the process isn't being followed or isn't working well, the Process Owner is responsible for identifying why and ensuring that required actions are taken to correct the situation. In addition, the Process Owner is responsible for the approval of all proposed changes to the process and development of process improvement plans.

Responsibilities
  • Ensures that the process is defined, documented, maintained and communicated at an enterprise level through appropriate vehicles (e.g. Corporate ARB).
  • Undertakes periodic review of all ITSM processes from an enterprise perspective and ensures that a methodology of continuous service improvement (including applicable process-level supporting metrics) is in place to address shortcomings and evolving requirements.
  • Ensures that all enterprise ITSM processes are considered and managed in an integrated manner, taking into consideration OPS policies and directives and factoring in evolving trends in technology and practice.
  • Solicits OPS stakeholders and communities of interest to identify enterprise ITSM process requirements for consideration by the enterprise ITSM program.
  • Coordinates, presents and recommends options for the prioritization, development and delivery of process requirements to appropriate governing body.
  • Ensures enterprise ITSM procedures and work instructions and functional requirements for enabling technology are aligned with the enterprise process.
Segregation of duties

The role of enterprise Process Owner is separate and distinct from that of the Incident Manager and the roles shall be separately staffed.

Incident Manager (IM)

The Incident Manager is accountable for managing execution of the incident management process and directing the activities of all OPS I&IT organizations required to respond to incidents in compliance with SLAs and SLOs. The Incident Manager is accountable for the lifecycle of all incidents and acts as the incident management point of escalation for incident notification and for hierarchical escalation.

Responsibilities
  • Develops and maintains an appropriate level of incident management procedures and/or work instructions to support the needs of the business.
  • Ensures that incident management staff are trained and familiar with IM procedures.
  • Monitors IT support staff performance of the incident management process; creates and executes action plans when necessary to ensure effective operation and continuous improvement.
  • Manages incident resource allocation and workload distribution.
  • Invokes the major incident procedure as appropriate.
  • Engages upper levels of management as appropriate.
  • Ensures that a major incident review is conducted for all major incidents and that recommended action items are completed.
  • Provides information for management related to OPS ITSD performance.
  • Highlights trends resulting from recurring incidents for review by problem management.
  • Monitors performance of the incident management process and identifies process improvements to the enterprise IM Process Owner.

Major Incident Manager (MIM)

In certain cases of incidents, a Major Incident Manager may be required to manage resolution activities. The Incident Manager or delegate will make this determination, and as required, will assign a single individual to undertake the MIM role for the service recovery activities related to that incident. The MIM is accountable for taking actions necessary to resolve a major incident and restore service. In all cases a major incident will be classified using urgency/impact definitions documented in Section 6.3 of this standard.  By definition, major incidents will be classified as priority 1 (P1). Activities managed by this individual may cross organizational boundaries. The MIM will be selected from a pool of management staff within ITS, Cyber Security Division or the cluster.

The administrative aspects of the major incident will continue to be managed through the OPS ITSD and the Incident Manager or delegate will continue to perform responsibilities related to incident notification, escalation and communication.  The Incident Manager maintains ownership and accountability for the lifecycle of the incident.  This allows the MIM to fully focus effort and attention upon managing the technical resolution of the incident.

Responsibilities
  • Identifies the required members of the resolution team, and requests their participation via the eIM.
  • Ensures that a systematic approach is used to evaluate the reported symptoms, impacts and contributing factors of the incident.
  • Ensures assignment of key Incident Analyst to develop the optimum plan to restore service or create a workaround.
  • Ensures the incident record is maintained.
  • Ensures that status messages are provided in the incident record for periodic progress reports based on the major incident notification schedule.
  • Undertakes functional escalation based upon predefined thresholds for the service being supported.
  • Provides documentation for major incident review report.

Note:  The Situation Manager will assume the role of the MIM for major incidents when the eIM has determined that no MIM is required to manage resolution activities.

Situation Manager (SM)

The Situation Manager is engaged by the Incident Manager to manage escalations of incidents meeting pre-specified criteria (typically a priority 2). The SM is accountable for taking actions necessary to resolve P2 incidents and restore service.

Responsibilities
  • Resolve the escalated incident leveraging resources provided by the Incident Manager.
  • Identify and lead the required members of the resolution team to develop the plan to restore service or create a workaround.
  • Ensure that status messages are provided in the incident record for periodic progress reports based on the Notification Schedule.
  • Perform escalation evaluations.
  • Coordinate the establishment of resolution teams.
  • Function as “point-of-contact” for resolution teams.
  • Manage further hierarchical and functional escalations.
  • Recommend activating disaster recovery process (as necessary).

Queue Manager (QM)

The Queue Manager monitors the queue to ensure that all incident tickets assigned to various support groups in their organization are promptly actioned and/or escalated within defined thresholds in support of Service Level Agreements/Objectives (SLAs/SLOs). This role is predominantly concerned with the overall performance of resources involved in the incident management process and is defined to establish an objective perspective on how incident management is being undertaken within a specific organization.  As such, there are no specific accountabilities.

Responsibilities
  • Address process execution issues encountered by support personnel and ensure that all tickets assigned to a queue are promptly actioned.
  • Monitor the incident queues.
  • Ensure that all incidents placed in a queue are assigned to the appropriate resource within the queue.
  • Monitor all incidents and advise support group members of upcoming and actual service level breaches. (Note: Engaging support group will only occur if a Service Desk Analyst has not already performed this action.)
  • Respond to the escalated incidents in a timely and appropriate fashion to minimize the effect of incidents on agreed service levels.
  • Follow the escalation path defined in the escalation policy.
  • May facilitate support resource commitment and allocation.
  • Attend incident review meetings as required.
  • Participate in process improvement sessions.

Service Desk Manager (SDM)

The SDM is accountable for all aspects of the OPS ITSD.

Responsibilities
  • Manages overall Service Desk activities.
  • Acts as escalation point for Team Leads.
  • Monitors incident volumes and trends to ensure appropriate staffing levels.
  • Recommends procedural improvements to the Incident Manager.

Service Desk Team Lead (TL)

The Service Desk Team Lead provides team leadership, coordination, expertise and advice for OPS IT Service Operations daily activities, to achieve approved goals and standards and to ensure that service level timelines are met.  The TL provides subject matter expertise for Section/Branch planning, problem solving and project management.

Responsibilities
  • Ensures currency and effectiveness of diagnostic scripts used to perform incident triage.
  • Manages shift schedules to ensure appropriate staffing and skill levels are maintained.
  • Acts as escalation point for Service Desk Agents in difficult or controversial situations.
  • Arranges staff training and awareness sessions.
  • Produces statistics and management reports.
  • Undertakes HR activities as required.
  • Assists Service Desk Agents when workloads are high or more experience is required.
  • Works with ITS Incident Advisors (ITS-IAs) in co-ordinating regular review and continuous improvement sessions involving appropriate staff.

Service Desk Agent (SDA)

The Service Desk Agent provides the single point of contact for customers during the incident lifecycle.

Responsibilities
  • Authenticates the caller (user or customer) and captures minimum level of defined contact information.
  • Is aware of the level of support to which the individual reporting the incident is entitled.
  • Creates an incident record for the new incident or updates the record for existing incidents.
  • Classifies the incident.
  • Ensures that a description of all incident resolution activities is accurately captured in incident records assigned to them.
  • Continually updates incident records with progress/status information to reflect their own activities.
  • Attempts incident resolution at first point of contact (tier 1) using diagnostic scripts and knowledge records such as known errors.
  • If unable to restore service within predefined threshold, performs functional escalation and assigns incident to the appropriate tier 2 support group.
  • Facilitates functional escalation between tier 2 and tier-n support groups and records circumstances in the incident record.
  • Informs the Queue and/or Incident Manager of any “non-minor” incidents.
  • Keeps the customer or user updated on incident progress based on notification protocol where applicable.
  • Obtains user (or customer) concurrence that the support actions provided addressed their needs prior to the Service Desk classifying the incident as resolved.

Incident Analyst (IA)

Incident Analysts are tier 2-n support group staff in each organization who provide progressively greater technical expertise to resolve incidents that have not been resolved at the previous tier.

Responsibilities
  • Responds to assigned incidents within agreed timeframes.
  • Diagnoses, develops workarounds and/or attempts to resolve assigned incidents.
  • Requests assistance from other tier 2 support areas via the Incident or Queue Manager.
  • If unable to resolve, requests functional escalation via the OPS ITSD.
  • Keeps the OPS ITSD informed of progress on assigned incidents via incident-enabling technology.
  • Updates the incident record and notifies the client as soon as it is known that the expected resolution will not occur within service thresholds.
  • When requested by the Queue and/or Incident Manager, provides technical assistance for other tier-n resources.
  • When requested by the Queue and/or Incident Manager, provides technical communication/explanation to customers and/or end users.
  • Follows defined process for creation of an incident record for all/any activities undertaken related to remedial action for technology or service supported.
  • When resolving an incident, reviews and updates the Service+, CI+, Operational Categorization and Resolution Categorization in the incident management tool to reflect the actual failing component that was corrected to restore normal services.

Note: When designated by the Major Incident Manager as the technical lead for a major incident, the Incident Analyst has additional responsibilities:

  • Undertake technical leadership of the analysis and diagnosis, and develop the subsequent action plan to remediate the major incident.
  • Provide periodic updates and status reports to the Major Incident Manager to ensure communication and notification requirements of the incident management process are satisfied.

Service Owner

The Service Owner has responsibilities specific to the enterprise incident management process.  These fall under the broad category of the service support model that is the responsibility of the Service Owner to define and maintain.  Additional responsibilities for the Service Owner in support of the enterprise problem management process can be found in GO-ITS 38.

In order to provide seamless, end-to-end support for incident management for OPS I&IT services, it is necessary to document all aspects of the support model.  I&IT clusters are accountable for the application component of many of the OPS services; the enterprise incident management process must be informed with key aspects of the support structure for applications.

The Service Owner is responsible for the identification, documentation and maintenance of internal/external partner solution/service knowledge required to inform the support model used by the OPS ITSD.

Responsibilities
  • Define and establish the support model (including required skills for tier 2-n support staff) up to and including the application.
  • Provide information, via the Infrastructure Technology Services Incident Advisor, to the ITSD. This would include items such as service/solution descriptions, diagnostic content, mandatory information capture at Tier 1, and First Point of Contact (FPOC) resolution steps for use by Service Desk Agents.
  • Maintain the above information and inform the appropriate parties of updates:
    • Infrastructure Technology Services Incident Advisor for support model updates.
    • Service Level Manager for revisions to service level objectives.
  • Develop local procedure information in support of incident management for cluster services/solutions and obtain endorsement from the enterprise Incident Manager that these align with OPS ITSD procedures.
  • Accountability and managing of applicable vendors including but not limited to engagement during incidents in a timely manner.

Infrastructure Technology Services Incident Advisor (ITS-IA)

The Infrastructure Technology Services Incident Advisor (ITS-IA) provides a point of contact between the Incident Manager and partner organizations (e.g., DCO, Telecom, clusters, CSD, 3rd Party Service Providers) to enable effective and efficient execution of the incident management process.

Responsibilities
  • Coordinates with Service Owners in their portfolio to provide and maintain support models with the information required by the OPS ITSD (i.e., impact and urgency, Situation Managers, escalation contacts, support structure, service/solution descriptions, diagnostic approach, mandatory information capture and First Point of Contact (FPOC) resolution steps).
  • Provides accurate organization information management relative to incident management process, including VIP lists, location, details, organizational and/or staff changes, etc.
  • Delegate of the enterprise Incident Manager in rotational coverage.
  • Coordinates incident resolution activities for organizations within the portfolio.  Fulfils the role of Major Incident Manager for the portfolio as required. Has the authority to bring in appropriate executives to make human resource or financial decisions required through a communications protocol.
  • Acts as the escalation point for any organizational issues regarding execution of the incident management process.
  • Acts as lead on tickets 90 days or older.
  • Creates/performs P2 incident reviews.
  • Analyzes and reviews reports to identify and highlight incident trending and assist with root cause analysis/problem management for recurring high-impact incidents.  Assist with the transition of incidents to problem investigations.
  • Represents the incident management process for portfolio release management activities.
  • Attends/participates in MIRs and continues to co-ordinate portfolio-related action items.
  • NAS Critical Site List:  Maintenance of portfolio locations and site contacts.
  • Assists the Communication Coordinator as required.

Communication Coordinator

The Communication Coordinator is accountable to the Incident Management and Communications Process Owner and performs the day-to-day communications operational tasks demanded by the process activities.  The Communication Coordinator is primarily responsible for communications that provide status updates or information to a targeted audience.

Responsibilities
  • Responsible for Executive/Business, Informational and Technical communications to Service Owners, executive, business and cluster audience.
  • Promotes and communicates the process to all parties involved.
  • Assists with incident management briefing notes if required.
  • Participates in situation rooms if required.
  • Monitors and performs monthly analysis to determine opportunities for continual service improvement of the communications process.
  • Contributes to monthly operational performance reports.
  • Collaborates with other communications teams/resources to assist with forecasting potential I&IT impact.  Creates I&IT awareness by appropriate means.

Incident Coordinator

The Incident Coordinator functions as the primary focal point for priority 1 and 2 incidents while in tandem coordinating the process execution, resource identification and Major Incident Review (MIR) activities.

4.3. Process flows

4.3.1. Incident management process overview

Process flow demonstrating the enterprise incident management process steps. Full description available using link below.

Accessible description of infographic 2

4.3.2. Incident management process tasks

NumberTaskRolesInput, triggerDescriptionOutput, completion criteria

1.0

Report Incident

User
OPS staff

User-perceived service outage or degradation,
monitoring event

Users must contact the Service Desk to report an incident. Event monitoring may also proactively indicate an incident before the users are impacted.

N/A

2.0

Log & Classify Incident

SDA

Service Desk informed of incident

SDA creates incident record and captures user contact information, classification data and details about symptoms.

N/A

3.0

Prioritize Incident

SDA

Incident classified

SDA prioritizes the incident, based upon impact and urgency (usually via a predetermined formula).

N/A

4.0

Perform Tier 1 Diagnosis

SDA

Incident prioritized

Service Desk Agent conducts initial diagnosis to discover the full symptoms of the incident and to determine exactly what has gone wrong and how to correct it. The agents will use diagnostic scripts and known error information to assist in this task.

N/A

5.0

Declare Major Incident

SDA
IM

Major incident criteria are met

The SDA determines that the incident meets the agreed criteria for a major incident and informs the Incident Manager, who determines whether or not to declare a major incident and which parts of the major incident protocol will be invoked.

IM informed, major incident protocol invoked

6.0

Functional Escalation

SDA
QM

SDA cannot restore service within agreed threshold

If SDA cannot restore service at first point of contact within predetermined timeframe, the incident will be assigned to an Incident Analyst (Tier 2 support group) to attempt to restore service within service level targets. This functional escalation is repeated to tier 3 and so on (if the tier 2 Incident Analyst cannot resolve the incident within a defined threshold).

N/A

7.0

Perform Tier n Diagnosis

IA

Functional escalation

Incident Analysts will conduct further diagnosis to determine how to restore service.

N/A

8.0

Resolve Incident

SDA
IA

Diagnosis has indicated probable resolution

The Incident Analyst or SDA takes (or coordinates) necessary action to restore service and conducts tests to ensure that service is restored. (Note: this could include asking user to take actions, e.g., rebooting computer.)
IA/SDA requests the user to confirm that service has been restored from their perspective and then resolves the incident.
If the user cannot be reached within an agreed threshold, the IA/SDA follows the predefined policy for such situations.

Service has been restored from SDA/IA perspective; user confirms service restoration

9.0

Monitor Incident

IM
QM

Incident logged

Incidents are monitored throughout their lifecycle:

  • Queue Manager ensures that incidents assigned to tier n support groups are resolved or functionally escalated within defined thresholds
  • Incident Manager monitors thresholds and may escalate or manage notifications if service level targets are in jeopardy

N/A

10.0

Close Incident

N/A

Analyst indicates service restoration

Closure is completed automatically by the tool after a predetermined interval.

N/A
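The functional escalation rule in task 6.0 can be sketched in a few lines of Python. This is an illustration only: the standard says thresholds are "predetermined" but does not give their values, so the durations in `ESCALATION_THRESHOLDS`, the three-tier limit and the function name `next_tier` are all assumptions made for the example.

```python
from datetime import timedelta

# Hypothetical per-tier thresholds (assumed values, not taken from the standard).
ESCALATION_THRESHOLDS = {
    1: timedelta(minutes=30),  # Service Desk (tier 1)
    2: timedelta(hours=2),     # Incident Analyst (tier 2)
    3: timedelta(hours=4),     # tier 3
}
MAX_TIER = 3  # assumed highest tier for this sketch


def next_tier(current_tier: int, time_at_tier: timedelta, restored: bool) -> int:
    """Return the tier that should hold the incident next.

    The incident stays where it is if service has been restored or the
    threshold has not yet elapsed; otherwise it moves up one tier, unless
    it is already at the highest tier.
    """
    if restored:
        return current_tier
    threshold = ESCALATION_THRESHOLDS[current_tier]
    if time_at_tier >= threshold and current_tier < MAX_TIER:
        return current_tier + 1
    return current_tier
```

For example, `next_tier(1, timedelta(minutes=45), restored=False)` returns `2`, which corresponds to the SDA assigning the incident to a tier 2 support group.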

4.4. Linkages to other processes

Process | Linkage

Enterprise Problem Management
(ePM)

  • PM requires that incident management capture sufficient and accurate information to enable problem identification:
    • Proper closure codes.
    • Proper classification.
    • Link new incidents to existing problems.
    • Known defective components (based upon event monitoring and component alarms).
  • PM makes information available that can support incident resolution activities (e.g., known errors, workarounds, and patterns).
  • Enabling technology must be able to define relationship between incident, problem and known error records.
  • Incident management may identify potential problems to problem management.
Enterprise change management (ECM)
  • Should restoration of a service require modification of a component under the control of configuration management, then ECM must be engaged.
  • Enabling technology must be able to define relationship between incident and change records.
Enterprise Service Asset Configuration Management
  • Provides the infrastructure data required to assess customer impact of an IT infrastructure component failure.
  • Identifies the CI Owners for service delivery support, financial / asset ownership and associated user(s).
  • Uses data to correlate the CI with the appropriate SLA to determine the priority of actions and escalations.
  • Ensures that all appropriate CI data is linked to each application.
Service level management
  • Although this process has not yet been formalized at the enterprise level, there is an expectation that incident escalation thresholds are defined to support SLAs and OLAs.

Service and component classification schemas must be used consistently across ITSM processes such as incident, change and problem management to enable industry best practice process integration. Failure to adopt a common approach to implementing these three processes will result in needless rework and additional administrative overhead for operational staff.

4.5. Incident management process quality control

Certain aspects of execution of the incident management process are monitored, as a quality control measure, to identify opportunities to improve process effectiveness and efficiency.

Monitoring

The Incident Manager is responsible for monitoring certain aspects of the activities performed by the incident management team on a regular basis. This serves a twofold purpose:

  • The Incident Manager can identify any bottlenecks at the operational level and take appropriate corrective action.
  • Both the Incident Manager and the enterprise Process Owner can identify opportunities for improvement at the process and procedural levels.

Reporting

Reporting involves measuring the process via metrics and recording how well it behaves in relation to the objectives or targets specified in the metrics. Metrics provide incident management personnel with feedback on the process.  They also provide the incident management Process Owner with the necessary information to review overall process health and to undertake continual service improvement techniques. 

Evaluating

Evaluating the process involves regular reviews of the execution of the process and identification of possible improvements or actions to address performance gaps. Every process is only as good as its last improvement; hence, the feedback loop of continuous improvement is inherent in every process.

4.6. Metrics

Metrics are intended to provide a useful measurement of process effectiveness and efficiency. Metrics are also required for strategic decision support. The following need careful consideration:

  • Reporting metrics will be readily measurable (preferably automated collection and presentation of data).
  • Metrics will be chosen to reflect process activity (how much work is done), process quality (how well was it done) and process execution (to review and plan the job at hand).
  • The enterprise incident management Process Owner is accountable for the definition of an appropriate suite of metrics to determine the overall health of the enterprise incident management process.
  • The Incident Manager will develop and run the reports and may develop other metrics to monitor other operational aspects of process execution, such as workload and resource balancing.

The following represents the initial suite of metrics that will be used to analyze process performance, identify opportunities for improvements and for strategic decision support. Any count of incidents must exclude service requests.

Workload:

  • Total number of incidents per period (as a control measure) (excluding service requests)
  • Number and percentage of major incidents
  • Size of current incident backlog

Process Effectiveness:

  • Number and percentage of incidents reassigned
  • Number and percentage of incidents incorrectly classified
  • Percentage of incidents resolved within agreed response time

Process Efficiency:

  • Percentage of incidents closed by the Service Desk without reference to other levels of support (often referred to as ‘first point of contact’)
  • Mean Time to Resolve Incidents (MTTR)
  • Percentage of incidents resolved on first attempt
  • Percentage of assigned incidents resolved within service level objectives (total and broken down by queue)
  • Percentage of event-generated, potential user impacting incidents compared to user reported incidents
  • Percentage of incidents resolved prior to user impact
  • Aging report showing the number and percentage of assigned incidents per organization that have been outstanding for longer than periods as designated from time to time by the IM Process Owner
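Several of the metrics listed above can be derived directly from incident records. The following Python sketch shows one possible calculation. The `Incident` fields, the function name and the choice of denominators (for example, measuring first point of contact against resolved incidents only) are assumptions made for illustration; they are not defined by this standard or by eSMT.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record layout, for illustration only.
    opened: datetime
    resolved: Optional[datetime]      # None while the incident is still open
    resolved_by_tier: Optional[int]   # 1 = Service Desk
    reassignments: int
    is_service_request: bool = False


def incident_metrics(records):
    """Compute a few workload, effectiveness and efficiency metrics."""
    # Any count of incidents must exclude service requests.
    incidents = [r for r in records if not r.is_service_request]
    resolved = [r for r in incidents if r.resolved is not None]

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    mttr_hours = None
    if resolved:
        total_seconds = sum((r.resolved - r.opened).total_seconds() for r in resolved)
        mttr_hours = total_seconds / len(resolved) / 3600

    first_contact = sum(
        1 for r in resolved if r.resolved_by_tier == 1 and r.reassignments == 0
    )
    return {
        "total_incidents": len(incidents),
        "backlog": len(incidents) - len(resolved),
        "pct_reassigned": pct(sum(1 for r in incidents if r.reassignments > 0), len(incidents)),
        "pct_first_point_of_contact": pct(first_contact, len(resolved)),
        "mttr_hours": mttr_hours,
    }
```

Keeping the service request filter at the top of the function means every downstream count inherits the exclusion, rather than each metric having to repeat it.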

4.7. Standard process parameters

For an enterprise process to be effective, parameters used for the classification, categorization, prioritization and closure of incidents must be consistently used across the OPS.  Special attention must be given to parameters required for consistency of reporting. This is particularly important for the provision of reliable business intelligence.

Please refer to the Classification Model section of the GO-ITS 44 ITSM Terminology Reference Model for standard process parameters and allowable values for incident management.

Please refer to the State Model section of the GO-ITS 44 ITSM Terminology Reference Model for standard status/state parameters and their definitions for incident management.

5. Related standards

5.1. Impacts to existing standards

GO-IT standard | Impact | Recommended action
GO-ITS 44 Terminology Reference Model | GO-ITS 37 redefines urgency and impact classification elements | Maintain alignment
GO-ITS 55 Service Desk Interaction Model and Incident Management Support Patterns | GO-ITS 55 contains role definitions that are redundant. | Verified roles are aligned and no update required.
GO-ITS 38 Enterprise Problem Management | No impact | N/A
GO-ITS 35 Enterprise Change Management | No impact | N/A
GO-ITS 36 Enterprise Service and Asset Configuration Management | No impact | N/A

5.2. Impacts to existing environment

Impacted infrastructure | Impact | Recommended action
eSMT | Conforms to updated Appendix 6.2.2 | Nil

6. Appendices

Detailed document history

The complete revision history of GO-ITS 37 is listed below:

Date

Summary

2009-06-17

Version 1.7: presented to ITSC

2009-07-16

Version 1.8: reflects feedback from Stakeholders, received up to and including 2009-07-16

2009-08-14

Version 1.9: reflects additional roles and new principle regarding security related incidents

2009-09-09

Version 1.94: reflects feedback since August 19 and injection of Urgency / Impact definitions (Section 6.4)

2010-02-02

Version 1.95: accepts all changes in version 1.94 and incorporates results of discussions held in Dec 2009 and Jan 2010 with ITSM Leads and ITS / OCCTO OEIP

2010-02-08

Version 1.95: updated subsequent to meeting with Head, Corporate Architecture Branch, OCCTO, post ITSML discussion of 2010-02-04. Suggestions received at ITSML embedded.

2010-02-10

Version 1.95: updated to modify references to post-mortem terminology (changed to Major Incident Review) per discussion / feedback from ITSML

2010-03-03

Version 1.97: inserted effective date for this revised version as July 1, 2010

  • Hyperlink inserted in Appendix for MIP Normative reference

2010-03-17

Endorsed: IT Standards Council endorsement

2010-03-19

Version 2.0 Final Draft - post ITSC endorsement of 2010/03/17

  • Section 4.2.10 and Principle 9 – removed specific reference to Service Management Branches and replaced with generic wording – “appropriate branches”
  • Updated Section 4.3.1 – Process Flow - added box in diagram to reflect User “Reporting Incident”
  • Section 6.2.1 – Added clarification statement to describe illustrative characteristic of diagram

2010-04-01

Approved: Architecture Review Board (Version 2.0)

2015-08-28

Draft Version 2.1 – reviewed and updated for activation of Enterprise Service Management Tool (eSMT)

  • Updated references to Ministry of Government Services (MGS) to Treasury Board Secretariat (TBS)
  • Corrected references to appendix 6.4 to appendix 6.2.2
  • Request for change (RFC) terminology updated to change request (CRQ)
  • Updated contact information to reflect assignment to IT Service Management Leads (ITSML)
  • Updated Impacts to existing standards table to reflect impact on GO-ITS 44, Terminology Reference Model
  • Updated impacts to existing environments table to reflect conformity to eSMT
  • Updated definitions of impact, urgency and priority matrix in appendix 6.2.2
  • Added requirement to verify and update incident fields and categorization when resolving an incident to Incident Analyst responsibilities.
  • Minor wording and grammar updates

2016-03-08

Reference to “situation (war) room” changed to “situation room” based on feedback from SCS.

2016-03-16

Architecture Review Board endorsement

2016-03-31

IT Executive Leadership Council approval (Approved Version 2.1)

2016-09-09

Draft Version 2.2 reviewed and updated to accommodate evolving business practices and service management processes.

  • Section 2.1, Background. Added reason for review.
  • Section 2.4.3, Accountability. Updated to make whoever restores service accountable to confirm with the end user that service has been restored and verify the accuracy of resolution categorization prior to resolving the incident.  The tool will close the incident after it has been in the resolved state for 15 days.  The exception is the MIP, where the IC is accountable to resolve the MI.
  • Section 2.5, Scope. Updated to add Service Restoration as part of the incident management scope.  Also updated “service fulfillment” to “request fulfillment.”
  • Section 2.4, Basic Concepts.  Updated to indicate the respective manager may request the Queue Manager escalate versus the SD manager.  Also updated to reflect SDA assigns ticket to tier n versus tier 2. Updated to reflect a QM is established for each support group and removed reference to overall QM.  Updated to indicate the IA restores service, sets the incident status to resolved and informs the appropriate tier n resources, if required.  Inputs updated to include “capacity management thresholds and monitoring alerts” and “Change Requests (CRQ).”
  • Section 2.6.3, Compliance Requirements. Updated the last bullet with the statement that “They are subject to review and endorsement by their respective working committees for an changes/improvements and annual approval by ITSML.”
  • Section 3.1, Contact Information, Consulted. Updated committees consulted for this review.
  • Updated Principle 1, Rationale, first bullet to reflect processes versus models.
  • Updated Principle 2 to indicate messaging must reflect the business impact. Removed statement that routing may need to be modified.  Removed Cluster in reference to Service Owners.  Added “formative” to reference to OEIP business architecture principle.  Updated ISAM to eSAM (Enterprise Service Agreements Model).
  • Updated Principle 3 to state OPS ITSD will have oversight of incidents versus managing incidents.  Updated to reflect Service Owners must support objective assessments of incidents versus SD Senior Management.
  • Updated Principle 4 to state OPS ITSD shall act as the single point of contact for all business communication regarding reported incidents
  • Updated Principle 4 to remove the bullet “Tier 2-N resources may request OPS ITSD staff to coordinate dialogue with end user or customers (used to gather additional detail or information to effect incident resolution) if they are unable to contact the end user directly.”
  • Updated Principle 5 rationale to state “The incident record is the source for all incident record activities undertaken by support staff” for clarity.
  • Updated Principle 9 pool of management to ITS, Cyber Security Division or the cluster and authority to include human resources decisions/financial commitments as required
  • Updated Principle 9 Rationale to include the statement “Major incidents involve outages where the business impact is high and usually impacts public facing services.  Restoration justifies extraordinary attention and resources.”
  • Updated Principle 9 to clearly state Priority 1 incidents are defined as major incidents
  • In Principle 9 updated staff to be trained as staff involved in the incident management process versus service level management.
  • Updated Principle 11 to remove placing a Priority 3 incident in pending outside of business hours.  Removed OPS ITSD from development of local work instructions.
  • Updated Principle 11 to remove statements referring to use of the pending state as more appropriate in the PPG.
  • Updated Cyber Security Branch to Cyber Security Division.
  • Section 4.2, Process Roles and Responsibilities.  RACI updated to indicate incident closure is automated by the tool at this time.
  • Section 4.2.3, Major Incident Manager. Updated reference to ITSD to eIM for identification of required resources.
  • Section 4.2.4, Situation Manager.  Updated responsibilities to ensure that status messages are provided in the incident record for periodic reports based on the Notification Schedule.
  • Section 4.2.5, Queue Manager. Updated to state the QM may facilitate support resource commitment and allocation.
  • Section 4.2.6, Service Desk Manager.  Updated to remove responsibility for effective management of the incident queues across the OPS I&IT organizations.
  • Section 4.2.7. Added general description of the Service Desk Team Lead role.  Added seventh bullet indicating also works with PIMLs to co-ordinate reviews and continuous improvement sessions with appropriate staff.
  • Section 4.2.8, Service Desk Analyst.  Updated to state SDA is aware of the level of support an individual reporting an incident is entitled to, versus authenticates.  Updated to ensure capture of resolution activities in the incident record applies only to incidents assigned to the SDA.  Updated capture of incident progress in the incident record to only reflect SDA activities.  Removed requirement to update the incident record to support tier 2-n resources as/if requested.  Updated requirement to keep user updated on progress based on the notification schedule to as applicable.  Updated the requirement to obtain customer concurrence that support actions addressed their needs, to prior to the OPS ITSD resolving the incident versus closing the incident.
  • Section 4.2.3.  Moved up Major Incident Manager section to place before Situation Manager.  Updated responsibilities to ensure incident record is maintained and to ensure status messages are provided in the incident record for periodic progress reports per the Notification Schedule.  Added note that SM assumes the role of the MIM if the eIM has not designated a MIM for a major incident.
  • Section 4.2.3. Removed reference to problem management resources.
  • Section 4.2.9, Incident Analyst.  Updated to reflect IA updates the incident record and notifies the client if resolution will not occur within service thresholds versus notifying the OPS ITSD.  Updated to ensure IA follows the defined process for ticket creation versus creating an incident.
  • Section 4.2.10, Service Owner.  Clarified general reference to responsibilities.  Added “internal/external” to partner solution/service knowledge required.
  • Section 4.3.2, Incident Management Process Tasks. Updated number 1.0 to reflect users must contact the OPS ITSD.  Moved Perform Tier 1 Diagnosis to number 4.0 and moved Declare Major Incident to number 5.0.  Added to 8.0 (Resolves Incident) the IA/SDA requests user to confirm service has been restored and then resolves incident.  Also added that IA/SDA follows the predefined policy if the user cannot be reached. Updated Task 10 to reflect closure is accomplished automatically by the tool after a predetermined interval.
  • Section 4.4. Updated to include linkages to configuration management. 
  • Section 4.6, Metrics.  Removed “Average time for tier 2-n support to respond to a functionally escalated incident.”  Removed “Average call time with no escalation” metric.
  • Section 5.1. Updated to GO-ITS 35 and GO-ITS 36 standards.  Updated title of GO-ITS 55.
  • Added the Major Incident Protocol as an appendix.
  • Adjusted section numbering as required.
  • Various grammatical corrections.
  • Created new Appendix 6.2, Document History, for the detailed list of changes. Renumbered subsequent appendices.

2017-03-28

  • Section 4.1, Process Principles.  Added principles 14, 15 and 16.
  • Section 4.2, Process Roles and Responsibilities.  Added new Section 4.2.12, Communication Coordinator.

2017-04-27

Section 4.2, Process roles and responsibilities. 

  • Updated 4.2.11, formerly PIML, to new eSM-IA role.
  • Updated RACI for eSM-IA process tasks.

2017-09-28

  • Section 1, Foreword.  Updated foreword per PSSC standard.
  • Section 2.1, Background.  Updated for currency and removed reference to GO-ITS 44.
  • Section 11, Scope.  Removed definition of thresholds for communication through SLM.
  • Section 3.1, Contact Information, Consulted.  Added list of individuals and their organizations consulted in all reviews.  Updated list of committees/working groups consulted for version 2.2.
  • Section 4.2, Process roles and responsibilities.  Updated RACI eSMIA column.  Added section 4.2.13, Incident Coordinator.
  • Section 4.3.2, Incident management process tasks.  Updated Output, Completion Criteria for tasks 8.0 and 10.0.
  • Section 4.6, Metrics.  Under Process Efficiency, added two new metrics for proactive incident management.
  • Section 6.4, Definitions: urgency and impact.  For Impact 1-Extensive, added new criteria for degradation of mission-critical, citizen-facing service.  For Urgency 2-High, split first bullet into three bullets.
  • Appendix 6.1, Major Incident Protocol.  Updated criteria for high impact.
  • Appendix 6.2, Document Revisions. Updated for GO-ITS 37 version 2.2.
  • Various grammatical and format corrections throughout.

2018-01-24

  • Section 4.1, Process Principles.  Updated implications of Principle 14 to include automation.  Updated implications of Principle 16 to include problem management root cause investigations.

2018-03-28

  • Section 3.1, Contact Information.  Updated Support role (editor) to Arpad Martonosi.

6.1. Major Incident Protocol (MIP) Ver. 2.1

This appendix defines the mandatory elements that must be established in order to develop and execute a major incident protocol.

Table of Contents

1  Introduction to Major Incident Protocol
1.1 Background
1.2 Purpose
1.3 Basic Concepts
1.4 Scope
2  Technical Specification
2.1 Criteria for Major Incident Declaration
2.1.1 Criteria for High Impact
2.1.2 Criteria for High Urgency
2.2 Situation Room Requirements
2.3 Escalation Schedule
2.4 Notification Schedule
2.5 Procedural Tasks

1  Introduction to Major Incident Protocol

1.1 Background

During February 2009, a series of outages to Ontario.ca infrastructure prompted I & IT Executive Management to conduct a review (footnote 4) of both incident and change management processes and procedures. The review identified deficiencies in a number of areas, including procedures, operational process management and behavior. Specific recommendations were made to address the deficiencies (see Constraints section below), and these were subsequently sanctioned by ITELC (April 2009). One of these recommendations was the development of a common major incident procedure, including establishment of a situation room, for use across OPS I&IT organizations. This document is a normative reference to GO-ITS 37, Enterprise Incident Management Process.

In May 2015, the enterprise IT Service Management (ITSM) tool set was upgraded.  The upgrade required updates to this document to accommodate changes in terminology and the priority matrix.

The Annual IT Rules Work Plan for 2016-17 directed the rescindment of the Normative Reference to GO-ITS 37 and its inclusion as an appendix to GO-ITS 37.

1.2 Purpose

The purpose of this document is to define the mandatory elements that must be established in order to develop and execute a major incident protocol:

  • Mandatory high-level tasks are defined
  • Requirements are defined for establishment of a situation room

The Incident Manager and Major Incident Manager must jointly establish these elements and develop true procedures derived from the mandatory high-level tasks.

1.3 Basic Concepts

Major incidents follow virtually the same steps as normal incidents but with an emphasis on accelerated functional escalation, coordination of resources and enhanced communications management to achieve resolution as quickly as possible.

The Major Incident Protocol (MIP) provides for the dynamic establishment of a separate major incident team under the direct leadership of a Major Incident Manager (MIM). The team is formed on a case-by-case basis to quickly marshal key technical resources to focus on providing the swiftest remedial action to resolve the incident.

Throughout, the OPS ITSD will continue to ensure that all activities are recorded and that users/customers are kept fully informed of progress.

1.4 Scope

In Scope | Out of Scope
High impact, high urgency incidents (per description in GO-ITS 37, Enterprise Incident Management) | Incidents which are non-service impacting
Component failures whose estimated time to resolve will result in failure to meet one or more service level objectives (e.g., component failure on weekend, but parts unavailable until Tuesday) | Failures to components that will not affect service level objectives within estimated resolution time
Heightened notifications to stakeholders | Nil
Externally hosted services.  The Service Owner is responsible to meet the Major Incident Manager responsibilities for any externally hosted services that meet the criteria of a priority 1 incident. | Nil

It is contrary to best practice to apply the same degree of management overhead required for major incidents to incidents of lesser priority. This document does not preclude the OPS I&IT organization from requesting this Major Incident Protocol (MIP) for incidents of lesser impact or urgency, to be approved at the discretion of the enterprise Incident Manager on a case-by-case basis.

2  Technical Specification

2.1 Criteria for Major Incident Declaration (footnote 5)

To be considered for the major incident protocol, an incident must be classified as both high impact and high urgency.

2.1.1 Criteria for High Impact
  • A failure of an IT Business Service affecting multiple organizations (footnote 6)
  • A failure affecting public safety
  • A security-related incident affecting a large number of users across multiple organizations where total loss or compromise of critical business data may result
  • A core network outage or a network outage affecting a mission critical government location
  • A failure affecting > 1000 users
  • A failure that affects a money back guarantee public service offering
  • Mission-critical applications fully unavailable
  • Mission-critical, citizen-facing service degradation causing significant impact (service is unusable by the business or general public)
  • Citizen-facing government websites
2.1.2 Criteria for High Urgency
  • A formal SLA is in place that specifies an IT restoration of service time of 4.5 hours or less
  • A security threat exists or there is potential for severe or substantial impact
  • A failure where formal SLA has been breached or it is known that an SLA will be breached
  • Response required includes an immediate and sustained effort using any/all available resources until the incident is resolved
  • VIP or Sensitive VIP service interruptions

Priority | Impact 1 - extensive/widespread | Impact 2 - significant/large | Impact 3 - moderate/limited | Impact 4 - minor/localised
Urgency 1 - critical | Priority 1 | Priority 2 | Priority 3 | Priority 3
Urgency 2 - high | Priority 2 | Priority 2 | Priority 3 | Priority 3
Urgency 3 - medium | Priority 3 | Priority 3 | Priority 3 | Priority 3
Urgency 4 - low | Priority 3 | Priority 3 | Priority 3 | Priority 4
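The priority matrix above lends itself to a simple lookup table. The Python sketch below encodes the matrix values exactly as listed. The names `PRIORITY_MATRIX`, `priority` and `is_major_incident` are illustrative and do not come from eSMT. The major incident check relies on the revision history's statement that Priority 1 incidents are defined as major incidents, and is offered as one reading rather than a definitive rule.

```python
# PRIORITY_MATRIX[urgency][impact] -> priority, values copied from the matrix.
PRIORITY_MATRIX = {
    1: {1: 1, 2: 2, 3: 3, 4: 3},  # Urgency 1 - critical
    2: {1: 2, 2: 2, 3: 3, 4: 3},  # Urgency 2 - high
    3: {1: 3, 2: 3, 3: 3, 4: 3},  # Urgency 3 - medium
    4: {1: 3, 2: 3, 3: 3, 4: 4},  # Urgency 4 - low
}


def priority(impact: int, urgency: int) -> int:
    """Look up the incident priority for an impact and urgency level (1 to 4)."""
    if impact not in (1, 2, 3, 4) or urgency not in (1, 2, 3, 4):
        raise ValueError("impact and urgency must each be between 1 and 4")
    return PRIORITY_MATRIX[urgency][impact]


def is_major_incident(impact: int, urgency: int) -> bool:
    """Assumed rule for this sketch: a Priority 1 incident is a major incident."""
    return priority(impact, urgency) == 1
```

Under this encoding, only the combination of Impact 1 and Urgency 1 yields Priority 1; every other cell resolves to Priority 2, 3 or 4.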

2.2 Situation Room Requirements

A situation room will normally be used to bring response/resolution teams together.  Typically, this will be facilitated through a conference call set up without an actual physical room.

Physical meeting rooms and locations may be designated for major regional government centers that house I & IT support staff and their managers and must be equipped with the following facilities:

  • Speaker-phone
  • A permanent standing teleconference bridge with contact numbers maintained and published by the OPS ITSD to all participants in enterprise incident management
  • Network access point (and preferably a PC)
  • Access to the incident management enabling technology
  • Wireless network access (via a separate redundant path)
  • Whiteboards and markers

Participants must be familiar with the use of generic, redundant communications (e.g., Blackberry PIN or SMS messages) to allow communication in the event of disruption to the email or telephone systems. The OPS ITSD must maintain and publish a contact list and instructions for this purpose.

Service Owners must establish their delegates and inform the OPS ITSD, which will contact the delegate if the Service Owner cannot be reached.

2.3 Escalation Schedule

Escalation point | Threshold
0: Even if not required for initial investigation, 3rd party service support staff should be put on standby notice for possible functional escalation, based upon consideration of their contractual response time commitments and prior history of meeting them. | Upon declaration of major incident
1: Engagement of tier 3 resources | 2 hours

2.4 Notification Schedule

The following represents the default schedule for incident-related communications. The MIM may choose to modify this schedule when the action plan is established.

Notification and Content

Timing

Distribution method

Prime

Notification 1 – Invitation to major incident team to attend initial meeting (with follow-up confirmation).

Upon declaration of major incident

Email and/or telephone
SendWordNow

ITSD

Notification 2 -  Initial notification to impacted users, customers and senior management

  • Impact & scope statement
  • Statement of action underway
  • Estimated TTR (if available), else estimated time for next status

Within 1 hour following declaration of major incident

IVR
Email to distribution list
SendWordNow

ITSD

Notification 3 – N:  progress report to impacted users & customers

  • Current status
  • Temporary measures or workarounds that may have been developed
  • Estimated TTR (if available), else estimated time for next progress report

Every 3 hours

IVR
Email to distribution list
SendWordNow

ITSD

Major Incident Review Report

Target 7 business days after resolution

Email

IM

Major Incident Review Action Item Status

Standing agenda topic at Incident Manager monthly meeting until all items are complete

Email minutes

IM

Sr. Management Briefing
Provide Sr. Management with executive summary of events and any action plans resulting from the major incident review

Upon completion of Major Incident Review Report

Email

IM
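The default timings for Notifications 1 to N can be turned into a list of due times once a major incident is declared. The Python sketch below is illustrative only. It assumes the 3-hour progress interval is counted from the deadline of the initial notification (the table says only "Every 3 hours"), and the function and constant names are invented for the example. The MIM may modify the schedule, so the constants should be read as defaults.

```python
from datetime import datetime, timedelta

# Default timings taken from the notification schedule.
INITIAL_NOTICE_WITHIN = timedelta(hours=1)
PROGRESS_INTERVAL = timedelta(hours=3)


def notification_deadlines(declared_at: datetime, resolved_at: datetime):
    """Return (label, due time) pairs for the default notifications.

    Assumption: progress reports start 3 hours after the initial
    notification deadline and repeat until the incident is resolved.
    """
    schedule = [
        ("Notification 1: invitation to major incident team", declared_at),
        ("Notification 2: initial notification", declared_at + INITIAL_NOTICE_WITHIN),
    ]
    number = 3
    due = declared_at + INITIAL_NOTICE_WITHIN + PROGRESS_INTERVAL
    while due < resolved_at:
        schedule.append((f"Notification {number}: progress report", due))
        number += 1
        due += PROGRESS_INTERVAL
    return schedule
```

For an incident declared at 08:00 and resolved at 16:30 the same day, this produces the team invitation at 08:00, the initial notification due by 09:00, and progress reports due at 12:00 and 15:00.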

2.5 Procedural Tasks

The following table outlines the high level tasks required to manage a major incident. They are similar to normal incident management tasks, but with a different emphasis on communications management and expedited resolution.

Legend: SDA = Service Desk Agent, IM = Incident Manager, QM = Queue Manager, MIM = Major Incident Manager, IA = Incident Analyst,
SO = Service Owner, SDM = Service Desk Manager (or Team Lead), IC = Incident Coordinator

No.

Task

Roles

Input, trigger

Description

Output, Completion criteria

MI-1

Identify Potential Major Incident

SDA-A,R
QM-C
IM-I

Incident meets MI criteria

Service Desk Agent uses diagnostic scripts and consults support models to capture and document information about symptoms and prioritize and classify the incident.

Service Desk Agent informs the Incident Manager that the incident matches the criteria for a major incident.

IM informed of potential major incident

MI-2

Confirm Major Incident

IM-A,R
IC-R
SDM-C

Possible major incident identified

The Incident Manager reviews the incident and confirms whether or not it is a major incident.

The Incident Manager determines whether it is necessary to invoke all aspects of the major incident protocol. If resolution activities are beyond OPS I & IT control (e.g., city-wide power failure), the Incident Manager may choose to invoke only a subset of the MIP, namely the Notification Schedule.

Major incident declared, or Notification Schedule invoked, or major incident request rejected

MI-3

Assign Major Incident Manager

IM-A,R
IC-R
MIM-I

Major incident declared

The Incident Manager selects Major Incident Manager (MIM) from a designated pool of managers, as follows:

  • Typically from the organization whose service is impacted.
  • If more than one organization’s services are impacted, or if the incident is limited to an infrastructure service, the MIM will be selected from ITS.
  • If the incident is security related, the MIM will be assigned from CSD.

MIM resource commitment

MI-4

Assemble Team

IM-A,R
IC-R
IA-C
MIM-C

MIM assigned

Incident Manager/MIM consult with tier 2 resources to identify required support areas to assist in diagnosis and resolution activities.

OPS ITSD then sends meeting invitation to required support areas and confirms participation of team members (including Service Owners).

During the course of investigation, the MIM may identify other resources and request them to assist in resolution activities.

Resource commitments

MI-5

Ready Situation Room (if required)

MIM-A
IC-R

Nil

The MIM decides whether a situation room is required for resolution discussion and coordination. If so, the MIM requests that the eIM/IC set up the call.

Situation room conference call established and team notified.

MI-6

Establish Plan

MIM-A
SO-C
IA-R
IC-I
IC-R*

Major incident team assembled

MIM convenes the situation room.
(Note: limit attendance to staff directly involved in the resolution.)
MIM facilitates a review of known facts, as documented by the incident record.

MIM develops a systematic plan of attack. (It may be helpful to have select tier 3 resources attend initial meeting to suggest a course of action and identify information they might require, if the incident is functionally escalated to them.)

MIM assigns one of the IAs as technical lead and assigns other tasks to IAs and/or Service Owners.

MIM provides schedules for situation rooms, notifications and escalations.

*MIM ensures information is available for stakeholder and end-user communications.

Tasks assigned
Schedules confirmed
Initial user notifications distributed

MI-7

Execute the Plan

IM-A
IA-R
MIM-C
SO-C
IC-I

Tasks assigned

Under the direction of the designated technical lead, Incident Analysts and SMEs conduct diagnostic activities to determine how to restore service.

Once team has identified probable resolution steps, they must obtain MIM concurrence before proceeding. (MIM may choose to consult with Service Owner before agreeing to proceed with proposed resolution.)

If resolution requires modification of a service or component under change control, then the MIM will ensure that a Change Owner is assigned to manage the appropriate enterprise change management activities.

Solution developed
CRQ (if required)

MI-8

Manage the Plan

MIM-A,R
IA-R
IC-I

Ongoing

The Incident Analyst will inform the Major Incident Manager of progress of diagnostic or resolution activity.

Major Incident Manager conducts meetings to review progress and latest estimated time to resolve. MIM ensures information is available for stakeholder and end-user communications.

Major Incident Manager conducts functional or hierarchical escalation based upon the Escalation Schedule defined earlier and updates the incident record.

Note: Once Service Owners have been made aware of the major incident, they must monitor latest estimated resolution time in case they need to invoke business continuity plans.

Escalations
Progress reports
Updates to incident record

MI-9

Manage Communication

MIM-A
IC-R

Ongoing

Major Incident Manager ensures progress updates are available for distribution by the Incident Coordinator per the Notification Schedule.
MIM will determine if the situation warrants separate internal messages to Sr. Management (e.g., management may require information that is not suitable for external distribution).

Progress reports distributed

MI-10

Resolve Incident

IM-A
IA-R
MIM-I
SO-I
IC-I

Solution developed and approved

Incident Analysts proceed to execute and test the agreed steps to restore service.

Incident resolved

MI-11

Close Incident

IM-A
SDA-R
Cust.-C

Incident resolved

Service Desk confirms service restoration with the customer.

Incident closed

MI-12

Conduct Major Incident Review

IM-A
IC-R
MIM-C
IA-C
SO-C

Incident resolved

Incident Manager facilitates meeting with mandatory representation from all functional areas involved in the resolution.
In the spirit of continual improvement, the following standing items are reviewed to see if they can be improved or if actions are required to prevent or expedite resolution of future recurrence of the incident:

  • Major incident procedures
  • Diagnostic activities and tools
  • Information to be retained for future situations
  • Service, solution, infrastructure weaknesses

The Incident Coordinator, using the standard template, produces the Major Incident Review Report to identify any recommendations or action items and sends it to the Incident Manager. The Incident Manager is accountable for assigning action items to the appropriate Service Owners or functional managers and for following up to ensure the action items are completed.
If the major incident impacted any citizen-facing services, Incident Manager will inform Senior Management of the pertinent details.

Major Incident Review Report issued
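
The MIM selection rules in task MI-3 can be sketched as a small decision function. This is an illustrative sketch only: the function and parameter names are hypothetical, and the order in which the rules are checked (security first, then multi-organization or infrastructure, then the single impacted organization) is an assumption, since the table does not state which rule takes precedence when more than one applies.

```python
from typing import Iterable


def select_mim_source(
    impacted_orgs: Iterable[str],
    infrastructure_only: bool = False,
    security_related: bool = False,
) -> str:
    """Return the organization expected to supply the Major Incident Manager."""
    orgs = sorted(set(impacted_orgs))
    if not orgs:
        raise ValueError("at least one impacted organization is required")
    # Assumption: the security rule is checked first.
    if security_related:
        return "CSD"
    # More than one organization impacted, or an infrastructure-only incident.
    if infrastructure_only or len(orgs) > 1:
        return "ITS"
    # Typical case: the organization whose service is impacted.
    return orgs[0]
```

For example, `select_mim_source(["Cluster A", "Cluster B"])` returns `"ITS"`, where the cluster names are placeholders.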

6.2. MIP Ver. 2.1 revision history

Date

Summary

2010-03-17

Endorsed: IT Standards Council endorsement (as Normative Reference to GO-ITS 37 v2.0)

2010-04-01

Approved: Architecture Review Board approval

2015-09-30

Version 1.1: Reviewed and updated for activation of Enterprise Service Management Tool (eSMT)

  • Updated Ministry of Government Services (MGS) to Treasury Board Secretariat (TBS)
  • Updated Criteria for Major Incident Declaration to conform to eSMT
  • Updated version of GO-ITS 37 referenced in document
  • Updated War Room terminology to Situation (War) Room
  • Updated Notification Schedule to remove initial response to whomever reported the original incident and corrected numbering
  • Updated distribution method for Sr. Management Briefing to email
  • Updated Request for Change (RFC) to Change Request (CRQ)
  • Updated Post Mortem terminology to Major Incident Review
  • Minor wording and grammar updates

2016-09-09

Version 2.0

  • Reformatted as an appendix to GO-ITS 37 due to being rescinded as a Normative Reference. Removed Copyright & Disclaimer and Foreword as no longer applicable.
  • Added reference to rescindment to Section 1.1, Background
  • Section 1.3, Basic Concepts, removed reference to problem management resources and root cause analysis
  • Section 1.4, Scope, updated to indicate the MIP may be applied to incidents of lesser priority to be approved by the enterprise Incident Manager on a case-by-case basis
  • Section 1.4, Scope, removed externally hosted services from Out of Scope. Added externally hosted services to In Scope, with direction that Service Owner is responsible to meet the MIM responsibilities.
  • Section 2.2, Situation Room Requirements, updated to reflect the situation room is typically a conference call versus a physical room.
  • Section 2.3, Escalation Schedule, removed reference to engagement of problem management.
  • Section 2.5, Procedural tasks, MI-5 removed reference to a physical situation room, stating the IC will set up the call
  • Section 2.5, Procedural tasks, MI-6, updated review of facts as documented by the incident record; removed MIM following a template for the meeting, updated MIM to provide schedules for the situation room versus team meetings, removed reference to use of a template to prepare communications.
  • Section 2.5, Procedural tasks, MI-8, updated to MIM ensures information is available for stakeholder and end-user updates, removed reference to problem management
  • Section 2.5, Procedural tasks, MI-9, updated to MIM ensures progress updates are available for distribution by the Incident Coordinator per the Notification Schedule
  • Grammar and formatting updates as required

2017-09-28

Version 2.1. Section 2.1.1, Criteria for High Impact, added criteria for degradation of mission-critical, citizen-facing service.

6.3. Enterprise differentiation: process, procedure, work instruction

Process, procedure and work instructions are three levels of task descriptions that are often confused with one another.

  • Level 1 tasks are defined in a process. They specify what action must be taken and who is involved.
  • Level 2 tasks are defined in procedures that decompose each level 1 task into more granular operational tasks and also prescribe how the activity should be performed.
  • Level 3 tasks represent work instructions. They are a further decomposition of procedure-level tasks, typically defined to address any unique local requirements when performing a procedural task.

6.4. Definitions: urgency and impact

The following table provides the framework for classifying the urgency and impact of incidents, which are then used to establish incident priority. Urgency and impact were originally defined in GO-ITS 44 Terminology Reference Model, to ensure that local process implementations used common terminology.

ITSM has matured across the OPS and enterprise processes are now in place for incident, problem, change management and service and asset configuration management. The definitions have been updated to reflect best practices.

Classifications

Definitions

Field values

Criteria (at least 1 criterion must be met)

Impact

Measure of scope and criticality to business.  Often equal to the extent to which an incident leads to distortion of agreed or expected service levels.

1-Extensive/ widespread

  • A failure of an IT business service affecting multiple organizations (see footnote 7)
  • A failure affecting public safety
  • A security-related incident affecting a large number of users across multiple organizations where total loss or compromise of critical business data may result.
  • A core network outage or a network outage affecting a mission-critical government location
  • A failure affecting > 1000 users
  • A failure that affects a money-back guarantee public-service offering
  • Mission-critical applications fully unavailable
  • Mission-critical, citizen-facing service degradation causing significant impact (service is unusable by the business or general public)
  • Citizen-facing government websites

Impact

Measure of scope and criticality to business.  Often equal to the extent to which an incident leads to distortion of agreed or expected service levels.

2-Significant/ large

Failure of an IT business service affecting a single organization which may include:

  • A network outage affecting business-critical government offices
  • A security-related incident affecting large numbers of users where work may be seriously impeded/interrupted within large groups or some business information may be at risk.
  • A failure or serious degradation affecting > 500 users
  • A failure that affects a public-facing “non-guaranteed” service offering
  • Failure of business-critical applications
  • A failure affecting all users in a single organization

Impact

Measure of scope and criticality to business.  Often equal to the extent to which an incident leads to distortion of agreed or expected service levels.

3-Moderate/ limited

All remaining failures of IT business services which may include:

  • Single user
  • A small isolated group of users with a common failure (single application, location, a failure on one of several IT business services utilized)
  • Security-related incident affecting a single or small number of users where some business data may be subject to limited compromise.

Impact

Measure of scope and criticality to business.  Often equal to the extent to which an incident leads to distortion of agreed or expected service levels.

4-Minor/ localized

Service requests

Urgency

Measures how quickly an incident needs to be responded to based on the business needs of the customer.

1-Critical

  • A formal SLA is in place that specifies an IT restoration of service time of less than or equal to 4.5 hours
  • A security threat exists or there is potential for severe or substantial impact
  • A failure where formal SLA has been breached or it is known that an SLA will be breached
  • Response required includes an immediate and sustained effort using any/all available resources until the incident is resolved
  • Executive or VIP service interruptions

Urgency

Measures how quickly an incident needs to be responded to based on the business needs of the customer.

2-High

  • SLA/SLO specifies a restoration of IT service within same business day
  • A security threat exists with potential for moderate impact.
  • Work may be impeded in small groups.
  • There might be some compromise of data and/or lack of availability for a small number of systems.

Urgency

Measures how quickly an incident needs to be responded to based on the business needs of the customer.

3-Medium

  • Single user
  • A small isolated group of users with a common failure (single application, location, a failure on one of several IT business services utilized)
  • A security-related failure with potential for minimal impact

Urgency

Measures how quickly an incident needs to be responded to based on the business needs of the customer.

4-Low

Service request
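
The "at least 1 criterion" rule for impact can be sketched as a cascade that returns the highest impact level for which any criterion is met. This is a partial illustration, not a full encoding of the table: it models only a few of the listed criteria, the field names are hypothetical, and it assumes that service requests are checked first and that only the listed sub-criteria trigger impact 2.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Hypothetical subset of facts gathered at classification time."""
    users_affected: int = 1
    organizations_affected: int = 1
    affects_public_safety: bool = False
    mission_critical_app_unavailable: bool = False
    business_critical_app_failed: bool = False
    is_service_request: bool = False


def classify_impact(facts: IncidentFacts) -> int:
    """Return an impact value (1-4) for the subset of criteria modelled here."""
    # Assumption: service requests are checked first and always map to 4.
    if facts.is_service_request:
        return 4
    # Impact 1: any one extensive/widespread criterion is enough.
    if (
        facts.organizations_affected > 1
        or facts.affects_public_safety
        or facts.users_affected > 1000
        or facts.mission_critical_app_unavailable
    ):
        return 1
    # Impact 2: any one significant/large criterion is enough.
    if facts.users_affected > 500 or facts.business_critical_app_failed:
        return 2
    # Impact 3: all remaining failures.
    return 3
```

Under these assumptions, a failure affecting 600 users in one organization is classified as impact 2, and one affecting 1,200 users as impact 1.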

Priority | Impact 1-extensive/widespread | Impact 2-significant/large | Impact 3-moderate/limited | Impact 4-minor/localized
Urgency 1-critical | Priority 1 | Priority 2 | Priority 3 | Priority 3
Urgency 2-high | Priority 2 | Priority 2 | Priority 3 | Priority 3
Urgency 3-medium | Priority 3 | Priority 3 | Priority 3 | Priority 3
Urgency 4-low | Priority 3 | Priority 3 | Priority 3 | Priority 4
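
The priority matrix above can be expressed as a simple lookup keyed by urgency and impact. The values below are copied from the matrix; the names `PRIORITY_MATRIX` and `incident_priority` are hypothetical and used only for illustration.

```python
# Outer key: urgency (1-4). Inner key: impact (1-4). Value: priority.
PRIORITY_MATRIX = {
    1: {1: 1, 2: 2, 3: 3, 4: 3},  # Urgency 1-critical
    2: {1: 2, 2: 2, 3: 3, 4: 3},  # Urgency 2-high
    3: {1: 3, 2: 3, 3: 3, 4: 3},  # Urgency 3-medium
    4: {1: 3, 2: 3, 3: 3, 4: 4},  # Urgency 4-low
}


def incident_priority(impact: int, urgency: int) -> int:
    """Look up the priority for an impact and urgency pair."""
    if impact not in (1, 2, 3, 4) or urgency not in (1, 2, 3, 4):
        raise ValueError("impact and urgency must each be between 1 and 4")
    return PRIORITY_MATRIX[urgency][impact]
```

For example, `incident_priority(impact=1, urgency=1)` returns 1, while `incident_priority(impact=2, urgency=1)` returns 2.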

7. Glossary

Term

Description

Assignment

Assignment occurs when an incident is assigned by the OPS ITSD to a tier 2-n support group within the OPS to attempt incident resolution. The assigned support group must respond in accordance with the OPS incident management process/procedures and their actions may be directed by the OPS Incident Manager. (see Dispatch)

CRQ

Change Request

Customer

Someone who buys goods or services. The customer of an IT Service Provider is the person or group that defines and agrees the service level targets. The term customer is also sometimes informally used to mean users, e.g., “this is a customer-focused organization.”

Diagnostic scripts

Documents used by the Service Desk to help classify and resolve incidents. These documents, based upon input from specialist support groups and suppliers, identify key questions to be asked to obtain details about what has gone wrong, with suggestions for resolution activities to be performed.

Dispatch

Dispatch occurs when the OPS ITSD assigns an incident to a Service Provider outside the OPS to attempt resolution. Provider behaviour is specified by an underpinning contract and the OPS Incident Manager does not have authority to direct the providers’ activities other than coordination of activities between the provider and other OPS support groups.

ECM

The enterprise change management process defined by OPS GO-IT Standard 35.

Error

(Service operation) A design flaw or malfunction that causes a failure of one or more configuration items or IT services. A mistake made by a person or a faulty process that affects a CI or IT service is also an error.

Escalation

An activity that obtains additional resources when these are needed to meet service level targets or customer expectations. Escalation may be needed within any IT service management process, but is most commonly associated with incident management, problem management and the management of customer complaints. There are two types of escalation: functional escalation and hierarchical escalation.

External service provider

An IT service provider that is part of a different organization from its customer. An IT service provider may have both internal customers and external customers.

Functional escalation

Transferring an incident, problem or change to a technical team with a higher level of expertise to assist in an escalation.

Hierarchical escalation

Informing or involving more senior levels of management to assist in an escalation.

Impact

A measure of the effect of an incident, problem or change on business processes. Impact is often based on how service levels will be affected. Impact and urgency are used to assign priority.

Incident

An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident. For example, failure of one disk from a mirror set.

Incident management

The process responsible for managing the lifecycle of all incidents. The primary objective of incident management is to return the IT service to customers as quickly as possible.

Incident pattern

A pattern exists for each high level business service to define how the OPS ITSD interacts with OPS service chain partners such as clusters, ministries and corporate providers to resolve reported incidents.

Incident record

A record containing the details of an incident. Each incident record documents the lifecycle of a single incident.

Internal service provider

An IT service provider that is part of the same organization as its customer. An IT service provider may have both internal customers and external customers.

Ishikawa diagram

A technique that helps a team to identify all the possible causes of a problem. Originally devised by Kaoru Ishikawa, the output of this technique is a diagram that looks like a fishbone.

IT service

A service provided to one or more customers by an IT Service Provider. An IT service is based on the use of information technology and supports the customer’s business processes. An IT service is made up from a combination of people, processes and technology and should be defined in a service level agreement.

Kepner & Tregoe analysis

A structured approach to problem solving. The problem is analysed in terms of what, where, when and extent. Possible causes are identified. The most probable cause is tested. The true cause is verified.

Known error (KE)

A problem that has a documented root cause and a workaround. Known errors are created and managed throughout their lifecycle by problem management. Known errors may also be identified by developers or suppliers.

Known error database

A database containing all known error records. This database is created by problem management and used by incident and problem management.

KE record

A record containing the details of a known error. Each known error record documents the lifecycle of a known error, including the status, root cause and workaround. In some implementations a known error is documented using additional fields in a problem record.

Operational Level Agreement (OLA)

An agreement between an IT Service Owner and another IT Service Owner within the same organization.
The other Service Owner provides services that support delivery of IT services to Service Owner A’s customers.
The OLA defines targets and responsibilities that are required to meet agreed service level targets in an SLA.
The OLA defines the goods or services to be provided and the responsibilities of both parties. For example, there could be an OLA:

  • Between the IT service provider and a procurement department to obtain hardware in agreed times. 
  • Between the Service Desk and a support group to provide incident resolution in agreed times.

Proactive incident management

The goal of incident management is to promptly restore service after unplanned outages. Proactive incident resolution further enables that process to avoid business impact where imminent failure is detected. For example, an alert is triggered for disk space filling on a critical piece of infrastructure, and action is required to avoid a failure in the service chain.

Process manager

A role responsible for operational management of a process. The Process Manager’s responsibilities include planning and coordination of all activities required to carry out, monitor and report on the process. There may be several Process Managers for one process; for example, regional Change Managers or IT Service Continuity Managers for each data centre.

Process owner

A role responsible for ensuring that a process is fit for purpose. The Process Owner’s responsibilities include sponsorship, design, change management and continual improvement of the process and its metrics.

Process Service Level Objective (PSLO)

A service level objective for a specific process task or metric. For example:

  • Problem resolution will complete within x weeks, based upon problem classification
  • 70% of incidents will be linked to problems

Proactive problem management

Part of the problem management process. The objective of proactive problem management is to identify problems that might otherwise be missed. Proactive problem management analyses incident records and uses data collected by other IT service management processes to identify trends or significant problems.

Problem

A cause of one or more incidents. The cause is not usually known at the time a problem record is created, and the problem management process is responsible for further investigation.

Problem management

The process responsible for managing the lifecycle of all problems. The primary objectives of problem management are to prevent incidents from happening and to minimize the impact of incidents that cannot be prevented.

Problem record

A record containing the details of a problem. Each problem record documents the lifecycle of a single problem.

Recovery Time Objective (RTO)

Specifies the maximum tolerable service outage that can be sustained before consideration must be made to invoke Business Continuity or Disaster Recovery plans.

Release

A collection of hardware, software, documentation, processes or other components required to implement one or more approved changes to IT services. The contents of each release are managed, tested, and deployed as a single entity.

Root cause

The underlying or original cause of an incident or problem.

Root cause analysis

An activity that identifies the root cause of an incident or problem.

Service

ITIL defines service as “a means of delivering value to customers by facilitating specific outcomes customers want to achieve without the ownership of specific costs and risks.” GO-ITS 56.1 defines services within the OPS as functionality that can be directly consumed by an end user.  Relationships and obligations between Service Owners and customers are documented in SLAs. (see Support service)

Service desk

The single point of contact between the service provider and the users. A typical service desk manages incidents and service requests and also handles communication with the users.

Service Failure Analysis (SFA)

An activity that identifies underlying causes of one or more IT service interruptions. SFA identifies opportunities to improve the IT Service Provider’s processes and tools and not just the IT infrastructure. SFA is a time-constrained, project-like activity, rather than an ongoing process of analysis. (See also root cause analysis.)

Service Level Agreement (SLA)

An agreement between an IT service provider and a customer.
The SLA describes the IT service, documents service level targets, and specifies the responsibilities of the IT Service Provider and the customer.
A single SLA may cover multiple IT services or multiple customers.
(See also operational level agreement and underpinning contract)

Service Level Objective (SLO)

In the absence of a formally negotiated SLA, a service provider must define performance objectives for delivery and support of the service.  

Service owner

A member of a service provider organization responsible for delivery of a specific service.

Service manager

A manager who is responsible for managing the end-to-end lifecycle of one or more IT services.

Service provider

An organization supplying services to one or more internal customers or external customers. Service provider is often used as an abbreviation for IT service provider. Where there are several service providers that enable an overarching service, they are sometimes called supply chain (or service chain) partners.

Support model

Contains information required to support a specific service, including identification of support resources, classification elements, escalation contacts and service restoration targets. (This document contains some elements of what ITIL calls the service operations plan.)

Support service

Internal services that support a ‘consumable’ service. Support services are typically not visible to end users. Relationships and obligations between service support owners and their customer (service owners) are documented in OLAs and UCs. (see Service)

Trend analysis

Analysis of data to identify time-related patterns. Trend analysis is used in problem management to identify common failures or fragile configuration items and in capacity management as a modelling tool to predict future behaviour. It is also used as a management tool for identifying deficiencies in IT service management processes.

Underpinning Contract (UC)

Contract between an OPS IT Service Provider and an external third party IT Service Provider. The third party provides goods or services that support delivery of an IT service to a customer. The UC defines targets and responsibilities that are required to meet agreed service level targets in an SLA.

Urgency

A measure of how long it will be until an incident, problem or change has a significant impact on the business. For example, a high impact incident may have low urgency if the impact will not affect the business until the end of the financial year. Impact and urgency are used to assign priority.

User

A person who consumes the IT service on a day-to-day basis. Users are distinct from customers, as some customers do not use the IT service directly.

Workaround

Reducing or eliminating the impact of an incident or problem for which a full resolution is not yet available. For example, by restarting a failed configuration item. Workarounds for problems are documented in known error records. Workarounds for incidents that do not have associated problem records are documented in the incident record.

Description

Infographic 1: Functional escalation process flow

Process flow chart demonstrating the functional escalation steps for each support tier.

  • Step 1, the Incident is logged and classified with Tier 1, which is the IT Service Desk.
  • Step 2, the IT Service Desk provides initial support.
  • Step 3, if the issue is resolved the incident is closed. If the incident cannot be resolved at Tier 1 it moves to Tier 2 support for investigation and remediation.
  • Step 4, if the issue is resolved the incident is closed. If the incident cannot be resolved at Tier 2 it moves to Tier 3 support for investigation and remediation.
  • Step 5, if the issue is resolved the incident is closed. If the incident cannot be resolved at Tier 3 it moves to Tier N support, and so on.
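
The functional escalation steps above amount to trying each support tier in order until one resolves the incident. A minimal sketch follows, in which each tier is represented by a hypothetical callable that returns True when it resolves the incident; these names are illustrative and do not describe any eSMT behaviour.

```python
from typing import Callable, List, Optional, Tuple

# A tier is a (name, attempt) pair; `attempt` returns True if the tier resolves the incident.
Tier = Tuple[str, Callable[[str], bool]]


def resolve_with_escalation(incident: str, tiers: List[Tier]) -> Optional[str]:
    """Return the name of the tier that resolved the incident, or None if no tier did."""
    for name, attempt in tiers:
        if attempt(incident):
            return name  # Resolved at this tier, so the incident can be closed.
        # Not resolved here: functionally escalate to the next tier.
    return None


# Example with placeholder tiers: only Tier 2 can resolve this incident.
example_tiers: List[Tier] = [
    ("Tier 1 (IT Service Desk)", lambda incident: False),
    ("Tier 2", lambda incident: True),
    ("Tier 3", lambda incident: True),
]
resolved_by = resolve_with_escalation("printer outage", example_tiers)
```

In this example `resolved_by` is `"Tier 2"`, because Tier 1 could not resolve the incident and escalation stopped at the first tier that could.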

Infographic 2: Enterprise incident management process overview

Process flow demonstrating the enterprise incident management process steps.

  • Step 1, User reports the incident to the Service Desk.
  • Step 2, Service desk analyst will log and classify the incident.
  • Step 3, Service desk analyst will prioritize the incident.
  • Step 4, Service desk analyst will determine whether or not it is a major incident.
  • Step 5, if not a major incident, the service desk analyst conducts the Tier 1 diagnosis. If it is a major incident, the major incident manager engages the major incident protocol.
  • Step 6, if the incident can be resolved the incident is closed. If not, the service desk analyst performs a functional escalation to Tier 2.
  • Step 7, the subsequent support tiers will conduct diagnosis until the incident is resolved.