Week 4

Learning Aims and Objectives:
Aim: In this week's page, students will learn the 
Objectives:
1. By the end of this week's page students will be able to.

2. By the end of the week's page students will be able to.


 

 

P3.4 The principles of incident management (for example Information Technology Infrastructure Library (ITIL®)) models in the context of digital support services

Incident management is a process used to deal with problems (referred to as “incidents”) that occur when using technology or digital services, such as computers, websites, or apps. For instance, if a website crashes or a computer system stops working, that’s an incident.

One of the most widely used frameworks for managing incidents is called ITIL® (Information Technology Infrastructure Library). ITIL provides a structured set of steps to ensure incidents are fixed quickly and don’t cause bigger issues.

ITIL was created by the UK Government’s Central Computer and Telecommunications Agency (CCTA) in the 1980s. It started as a way to standardise IT practices across different government departments. The guidance was published as a series of books, which is where the “Library” in the name comes from, and it was later adopted widely in the private sector.

In 2013, AXELOS Limited was set up as a partnership between the UK Government’s Cabinet Office and Capita plc. AXELOS took over the management and ownership of ITIL, along with other well-known frameworks like PRINCE2 and MSP. Under AXELOS, ITIL has continued to develop, keeping up with new technology and business practices.

 

How Incident Management Works

    1.    Identify the incident: This is when someone notices there’s a problem. It could be a user calling support to say they can’t log into a website, or an alert showing a server is down.
    2.    Log the incident: The details of the problem are recorded in a system so it can be tracked. This helps to organise the work and ensure nothing gets forgotten.
    3.    Classify the incident: Not all incidents are equal. For example, one person being unable to access a service is important, but it is less urgent than the whole service being down for everyone. Each incident is given a priority (for example, high or low) based on how serious it is and how many users it affects.
    4.    Assign the incident: The problem is sent to the correct team or person who can fix it. For example, if it’s a network issue, it will go to the network team.
    5.    Resolve the incident: The team works to fix the issue as quickly as possible. Once it’s fixed, they ensure everything is working normally again.
    6.    Close the incident: After confirming that the problem is solved, the incident is closed in the system.
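
The six steps above can be sketched as a simple state machine. This is only an illustration under stated assumptions, not something defined by ITIL: the `Stage` and `Incident` names and the `advance` method are hypothetical, chosen to mirror the list.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Stage names mirror the six steps listed above.
    IDENTIFIED = 1
    LOGGED = 2
    CLASSIFIED = 3
    ASSIGNED = 4
    RESOLVED = 5
    CLOSED = 6


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.IDENTIFIED
    history: list = field(default_factory=list)  # stages already completed

    def advance(self) -> Stage:
        """Move to the next stage; a closed incident cannot move further."""
        if self.stage is Stage.CLOSED:
            raise ValueError("incident is already closed")
        self.history.append(self.stage)
        self.stage = Stage(self.stage.value + 1)
        return self.stage


incident = Incident("Users cannot log into the website")
while incident.stage is not Stage.CLOSED:
    incident.advance()
print([s.name for s in incident.history] + [incident.stage.name])
```

Forcing every incident through the same ordered stages is what stops a step, such as logging, from being skipped when the team is under pressure.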

Example of Incident Management in Use

    •    E-commerce websites: If the checkout system on a site like Argos stops working, that’s a significant incident. The incident management team would receive an alert, determine the cause (perhaps a server issue), and work swiftly to resolve it to avoid losing customers or revenue.
    •    Schools and Universities: If an online learning platform crashes during an important exam, the IT team would treat this as a high-priority incident and work to restore it as quickly as possible.

Benefits of Using Incident Management (ITIL)

    1.    Efficiency: Having a structured approach means incidents are resolved faster, reducing downtime and ensuring services are restored quickly.
    2.    Organisation: With each incident being tracked and assigned properly, teams can avoid confusion and ensure the right person is working on the problem.
    3.    Prevents Recurrence: After resolving an incident, the team can review what went wrong to prevent it from happening again, leading to better service overall.
    4.    Improves Customer Satisfaction: The quicker and more efficiently issues are resolved, the more satisfied customers or users are with the service.

Disadvantages of Using Incident Management (ITIL)

    1.    Time-Consuming: Following a structured process takes time. In less critical situations, it might feel like more effort than it’s worth.
    2.    Requires Training: Those handling incidents need to be familiar with the ITIL process, and training can take time and resources.
    3.    Rigid Structure: In some cases, the strict rules of ITIL may slow things down. If something is urgent but not classified as high priority under the system, the process might delay a quick fix.
    4.    Overhead: Small organisations with fewer incidents may find the ITIL-based system too formal or costly to maintain, as it’s designed for larger operations.

 

 

 

Detection

Detection in Incident Management: Reporting and Recording

The detection stage of incident management is critical, as it’s the first step in identifying and responding to problems that occur within digital support services. This phase focuses on the reporting and recording of incidents, ensuring they are captured and tracked accurately.

Reporting the Incident

Incidents can be detected in different ways, and the reporting process involves bringing attention to a problem. There are two main types of reporting:

    1.    User-reported incidents: These are issues that are brought to the attention of the IT or support team by the users themselves. For example, a student might report that they are unable to access their online learning portal, or a customer may contact a help desk because a website is not loading properly. Users play an important role in helping identify problems early, especially when automatic systems don’t detect them.
    2.    System-reported incidents: Many digital services use monitoring tools that automatically detect when something goes wrong. These systems can track things like server performance, website traffic, or errors within an application. If something unusual happens, such as a spike in errors or a system crashing, the monitoring system will generate an alert that reports the incident to the IT team without any human intervention.
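
A system-reported incident usually starts with a rule in a monitoring tool. The sketch below shows one possible rule, an alert on a spike in errors. The function names, the 5% threshold, and the sample figures are all assumptions made for illustration; real monitoring tools have their own configuration.

```python
def error_rate(errors: int, requests: int) -> float:
    """Return the fraction of requests that failed (0.0 if there was no traffic)."""
    return errors / requests if requests else 0.0


def should_alert(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Report an incident when the error rate exceeds the assumed threshold."""
    return error_rate(errors, requests) > threshold


# Hypothetical one-minute samples: (errors, total requests)
samples = [(2, 1000), (3, 1200), (180, 1100)]
for errors, requests in samples:
    if should_alert(errors, requests):
        print(f"ALERT: error rate {error_rate(errors, requests):.1%}")
```

Only the third sample crosses the threshold, so only that one would be reported to the IT team.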

Recording the Incident

Once an incident is reported, the next step is to record it in a central system, such as an incident management tool. This is important for several reasons:

    1.    Tracking and organisation: Recording incidents ensures that each one is properly tracked and monitored. By logging details like the time, type of issue, and who reported it, the team can stay organised and make sure nothing is missed. It also allows the team to see patterns over time, which can help in identifying recurring problems.
    2.    Prioritisation: When incidents are recorded, they can be classified based on their severity or impact. This allows the IT team to prioritise their work. For example, an issue affecting a large number of users would be recorded as high priority, while a minor issue affecting a single user might be recorded as low priority.
    3.    Audit trail: Recording incidents creates a clear history of what issues have occurred and how they were handled. This is important for accountability and reviewing how effectively incidents are being managed. It also helps in future problem-solving, as the team can look back at how similar incidents were resolved in the past.
    4.    Improving future responses: The data collected during the recording process can be used to improve incident management over time. By analysing the types of incidents that occur and how they are resolved, teams can adjust their processes to be more efficient and effective in the future.
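
As a rough sketch of what a recorded incident might contain, the example below stores the time, type of issue, and reporter, and derives a priority from the number of users affected. The field names, the sample entries, and the rule that 50 or more affected users means high priority are assumptions, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentRecord:
    reported_at: datetime
    category: str        # type of issue, e.g. "login" or "server"
    reported_by: str     # a user, or "monitoring" for system-reported incidents
    users_affected: int

    @property
    def priority(self) -> str:
        # Assumed rule: 50 or more affected users counts as high priority.
        return "high" if self.users_affected >= 50 else "low"


# Hypothetical log entries
log = [
    IncidentRecord(datetime(2024, 11, 4, 9, 15), "login", "student", 1),
    IncidentRecord(datetime(2024, 11, 4, 10, 2), "server", "monitoring", 300),
    IncidentRecord(datetime(2024, 11, 5, 9, 20), "login", "teacher", 1),
]

# The team works on high-priority incidents first.
high_priority = [record for record in log if record.priority == "high"]
print([record.category for record in high_priority])
```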

 

Response

Response in Incident Management: Ownership, Resolution, and Recording

Once an incident is detected and recorded, the next crucial phase is the response. This involves identifying who is responsible for handling the issue, working to resolve the problem and restore normal service, and documenting how the incident was resolved.

Identifying an Owner

After an incident is recorded, it must be assigned to the correct person or team, known as the owner of the incident. The owner has responsibility for managing the issue until it is fully resolved. Identifying the right owner is essential for efficient incident management, and this process often depends on the type of incident:

    1.    Assigning based on expertise: Different teams or individuals may specialise in certain areas, such as network issues, server maintenance, or application development. For example, if the issue involves a server being down, the incident would likely be assigned to the infrastructure team. If it’s a software bug, it might go to the development team.
    2.    Clear responsibility: By having a designated owner, it is clear who is accountable for managing the incident. This avoids confusion, ensures that the incident is being actively worked on, and prevents duplication of effort.
    3.    Escalation procedures: If the incident is complex or severe, it may need to be escalated to a higher-level team or manager. This is particularly important for high-priority incidents that impact large numbers of users or critical services.
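
Assigning by expertise and escalating by priority can be pictured as a simple lookup. In this sketch the team names, the `TEAM_FOR_CATEGORY` table, and the fallback to a service desk are hypothetical; each organisation would define its own routing.

```python
# Hypothetical mapping from incident category to the team with the right expertise.
TEAM_FOR_CATEGORY = {
    "network": "network team",
    "server": "infrastructure team",
    "software": "development team",
}


def assign_owner(category: str, priority: str) -> dict:
    """Pick an owner by expertise and flag high-priority incidents for escalation."""
    owner = TEAM_FOR_CATEGORY.get(category, "service desk")
    return {"owner": owner, "escalate": priority == "high"}


print(assign_owner("server", "high"))
print(assign_owner("printer", "low"))
```

An unknown category falls back to a single default owner, so every incident still has exactly one accountable owner.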

Resolving the Issue and Restoring Service

Once an owner is identified, the next step is to focus on resolving the incident and restoring normal service. This involves diagnosing the immediate cause of the problem and implementing a solution to fix it.

    1.    Diagnosis and troubleshooting: The incident owner will start by identifying the cause of the issue, which may involve investigating system logs, reviewing recent changes, or running diagnostic tests. For example, if a website is down, the owner might check whether there’s an issue with the server or if a recent software update caused the problem.
    2.    Implementing a fix: Once the cause is identified, the owner works on resolving the issue. This might involve repairing hardware, restoring backups, patching software, or rolling back a faulty update. The key goal is to restore the service as quickly and effectively as possible.
    3.    Minimising impact: Throughout the resolution process, the owner will aim to minimise the impact on users. This could involve providing regular updates to affected users or setting up temporary workarounds to ensure some level of service is maintained while the main issue is being fixed.
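
One common troubleshooting step mentioned above is reviewing recent changes. The sketch below lists changes applied shortly before an incident began, as candidates to investigate or roll back. The `recent_changes` function, the 24-hour window, and the sample change log are assumptions for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, window=timedelta(hours=24)):
    """Return names of changes applied within `window` before the incident began."""
    return [
        name
        for name, applied_at in changes
        if incident_start - window <= applied_at <= incident_start
    ]


# Hypothetical change log: (description, time applied)
changes = [
    ("Database index rebuild", datetime(2024, 11, 1, 22, 0)),
    ("Platform software update", datetime(2024, 11, 5, 8, 30)),
]
suspects = recent_changes(changes, incident_start=datetime(2024, 11, 5, 9, 0))
print(suspects)
```

A change appearing in the list is only a suspect, not a confirmed cause; the owner still needs to check logs or run tests before rolling anything back.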

Recording Incident Resolution and Applied Changes

Once the incident is resolved, it’s important to record the resolution and any changes that were applied to fix the issue. This final step ensures that the incident is fully documented and can be reviewed later if necessary.

    1.    Documenting the solution: The details of how the incident was resolved are recorded in the incident management system. This includes what caused the problem, the steps taken to fix it, and whether any changes were made to prevent it from happening again.
    2.    Review and learning: Recording the resolution helps build a knowledge base that can be referred to in the future. If a similar incident occurs, the team can look back at how it was previously handled, which can speed up future responses and improve overall service management.
    3.    Preventing recurrence: By reviewing recorded incidents, teams can spot trends or recurring problems. This allows them to implement proactive measures to prevent future incidents, such as updating software more carefully or improving system monitoring.
    4.    Closure and feedback: Once the incident is resolved and recorded, the incident is formally closed. Feedback can also be gathered from users or stakeholders to evaluate how effectively the incident was handled, which helps improve future responses.
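
To show why recording resolutions is useful, here is a minimal sketch of a knowledge base that can be searched when a similar incident occurs. The entries and the `find_previous_fixes` function are hypothetical; a real incident management tool would provide its own search.

```python
# Hypothetical knowledge base: each entry records the symptom, the cause, and the fix applied.
knowledge_base = [
    {"symptom": "Website down", "cause": "faulty update", "fix": "rolled back update"},
    {"symptom": "Cannot log in", "cause": "expired certificate", "fix": "renewed certificate"},
]


def find_previous_fixes(symptom: str, entries: list) -> list:
    """Return fixes recorded for past incidents whose symptom contains the search text."""
    needle = symptom.lower()
    return [entry["fix"] for entry in entries if needle in entry["symptom"].lower()]


print(find_previous_fixes("log in", knowledge_base))
```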

 

Intelligence

The intelligence aspect of incident management focuses on what can be learned from each incident to prevent future occurrences and improve overall service quality. It involves carefully recording lessons learned, investigating the root cause, and using that knowledge to update procedures and reduce the risk of similar incidents happening again.

Recording Lessons Learned, Fixes, and Procedure Updates

After an incident is resolved, it is vital to record all lessons learned during the process. This step ensures that the organisation can learn from the experience and improve its incident management approach.

    1.    Documenting the lessons: Once an incident has been resolved, the team reflects on what went well, what challenges were faced, and how things could be done better next time. These insights are recorded in detail so they can be reviewed in future incidents.
    2.    Recording fixes: Any specific fixes or technical changes applied to resolve the issue are documented. For example, if a software bug was fixed, the team would note exactly what was done, such as applying a patch or updating a configuration. This provides a reference for dealing with similar issues in the future.
    3.    Updating procedures: If the incident revealed flaws or gaps in the current processes, these procedures should be updated. For example, if it took too long to detect the incident, the monitoring systems might need to be improved. Procedure updates ensure that the organisation is better prepared next time.

Performing In-Depth Investigation and Root Cause Analysis

Once an incident is resolved, it’s essential to investigate thoroughly to understand the underlying cause, especially if the issue was complex or had a significant impact. This can prevent the same problem from recurring in the future.

    1.    Root cause analysis: An in-depth investigation is performed to identify the true cause of the incident, rather than just addressing the symptoms. For example, if a website crashed, was it due to a server overload, a software bug, or a misconfiguration? Understanding the root cause ensures that the real problem is fixed, not just the immediate issue.
    2.    Forensic analysis: In more complex or security-related incidents, such as a data breach or system failure, a forensic analysis may be required. This involves a detailed examination of system logs, network activity, and other data to uncover exactly what went wrong and how the incident occurred. Forensic analysis can provide valuable insights into vulnerabilities or weaknesses in the system.
    3.    Identifying patterns: By performing detailed analysis on multiple incidents over time, patterns may emerge that highlight broader issues. For example, if similar incidents keep occurring, it could point to an underlying flaw in the system architecture that needs addressing.
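
Identifying patterns can be as simple as counting how often each root cause appears across past incidents. The list of causes below is invented sample data, and `most_common_causes` is a hypothetical helper built on Python's standard `collections.Counter`.

```python
from collections import Counter

# Hypothetical root causes recorded after past incidents were investigated.
root_causes = [
    "untested update",
    "server overload",
    "untested update",
    "misconfiguration",
    "untested update",
]


def most_common_causes(causes, minimum=2):
    """Return (cause, count) pairs for causes seen at least `minimum` times."""
    return [(cause, n) for cause, n in Counter(causes).most_common() if n >= minimum]


print(most_common_causes(root_causes))
```

A cause that keeps reappearing, such as the untested updates in this sample, points to a process or design issue rather than a one-off fault.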

Sharing Lessons Learned for Continual Improvement

A key part of the intelligence aspect of incident management is sharing lessons learned to improve the organisation’s overall capability and reduce the likelihood of incidents repeating.

    1.    Internal knowledge sharing: Once lessons have been documented, they should be shared across teams to ensure everyone is aware of the insights gained. This could involve team meetings, reports, or internal documentation. For example, if a particular fix worked well, other teams should know about it so they can apply it in similar situations.
    2.    Contributing to continual improvement: The insights gained from each incident are fed into the continual improvement process. This means that incident management procedures, response times, and technical systems are constantly evolving to become more effective over time. Continual improvement also helps the organisation stay agile in responding to new and emerging types of incidents.
    3.    Reducing the risk of repetition: The ultimate goal of recording and sharing lessons is to reduce the risk of similar incidents happening again. By identifying the root causes and updating procedures, the organisation can minimise vulnerabilities and ensure future incidents are less likely to occur.
    4.    Developing preventative measures: Armed with lessons from past incidents, the organisation can proactively implement preventative measures. For example, if a particular type of cyberattack caused an incident, stronger security measures can be put in place to prevent similar attacks in the future.

 

Activity: Incident Management Group Task

You have been asked to work in teams of up to three students to create a presentation on incident management based on the principles we’ve discussed: detection, response, and intelligence. The aim is to demonstrate your understanding of how incidents are reported, managed, resolved, and learned from, using a real-world scenario.

You will have 30 minutes to complete this activity. Your group will need to explain the following in your presentation:
    1.    Detection: How was the incident identified and reported?
          •    Was it a user-reported issue or system-reported?
          •    How was it logged, and what details were recorded?
    2.    Response: Who was responsible for fixing the issue, and how was it resolved?
          •    How was the incident assigned, and what steps were taken to resolve it?
          •    What actions were necessary to restore normal service?
    3.    Intelligence: What lessons were learned from the incident?
          •    Was an in-depth investigation done to find the root cause?
          •    How could the organisation use this experience to prevent similar incidents in the future?

Scenario Example: Online Learning Platform Crash

Imagine you are part of the IT support team at a college, and one day during an important mock exam, the college’s online learning platform crashes. Students cannot access their exam papers, and teachers are panicking. Your job is to manage the incident from start to finish.

    •    Detection: The incident is first reported by a teacher who notices students can’t log in. Shortly after, a system alert shows that the platform’s server is down.
    •    Response: The IT support team assigns the incident to the infrastructure team, who investigate and find that a recent update overloaded the server. They restore the platform by rolling back the update, and the students can continue their exams.
    •    Intelligence: After the incident, the team conducts an analysis and realises that the system wasn’t tested properly after the update. The team decides to implement a new procedure for testing updates before they are applied.

Your group will use this scenario, or create your own similar one, to guide your presentation. Make sure to:

    •    Explain the detection process (how was the problem reported and logged?).
    •    Describe the response (who was responsible, and what steps were taken to fix it?).
    •    Discuss the intelligence gathered (what lessons were learned, and how can future issues be prevented?).

At the end of the 30 minutes, you will have 5 minutes to present your findings. Good luck!

 

 


Last Updated
2024-11-05 13:58:11
