Week 6

1.3.3 The purpose of root cause analysis and when it is used.

Root Cause Analysis (RCA) is a structured problem-solving method used to identify the underlying reason why a fault, incident, or failure has occurred, rather than just treating the visible symptoms. Its purpose is to help organisations prevent the same problem from happening again by uncovering what actually caused it, whether that is a technical issue, a chain of human errors, a process failure, or a combination of several factors. RCA is typically used after significant incidents such as system outages, security breaches, recurring faults, health and safety incidents, or quality-control failures. It is most valuable when a problem is persistent, high-impact, or costly, and when simply fixing the immediate issue does not guarantee long-term stability. By identifying the true root cause and implementing corrective actions, organisations improve reliability, strengthen processes, enhance safety, and reduce long-term operational risk.

1.3.4 The approaches to root cause analysis:

Five whys

An iterative questioning technique, the 5 Whys is a simple but effective way to get to the bottom of a problem by asking “why” five times, or more if needed. It’s like peeling back layers of an onion, going deeper each time until you find the true cause of the issue, not just the surface-level problem.

Here’s how it works:

   •   You start by identifying the problem.
   •   Then, you ask why the problem happened.
   •   After you get the first answer, you ask why again to dig deeper.
   •   You keep repeating this process, usually about five times, until you uncover the root cause of the issue.

In digital support services, this technique helps teams figure out why things go wrong, whether it’s related to technology not working, users having issues, or systems failing. By understanding the true cause, companies can fix problems effectively and prevent them from happening again.
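The questioning steps above can be sketched in a few lines of Python. This is only an illustration: the function name `five_whys` and the idea of passing the answers in as a list are assumptions made for the example, and the answers are taken from the slow-website case study in this section.

```python
def five_whys(problem, answers):
    """Pair each "why" question with its answer, one layer at a time.

    Returns the chain of (question, answer) pairs. The final answer is
    treated as the candidate root cause.
    """
    chain = []
    current = problem
    for answer in answers:
        chain.append(("Why? " + current, answer))
        current = answer  # the next "why" is asked about this answer
    return chain


problem = "The website is loading slowly"
answers = [
    "The server response time is long",
    "The server is overloaded with too many requests",
    "Bots are making fake requests",
    "There is no filtering system to block bots",
    "Filtering was not set up during the initial configuration",
]

chain = five_whys(problem, answers)
root_cause = chain[-1][1]

for depth, (question, answer) in enumerate(chain, start=1):
    print(f"{depth}. {question} -> Because: {answer}")
print("Candidate root cause:", root_cause)
```

The list can hold more or fewer than five answers, which matches the idea that the questioning continues until the root cause is reached rather than stopping at a fixed count.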

Example Case Studies in Digital Support Services:

1.   Slow Website Performance
   •   Problem: Users report that a website is loading slowly.
   •   1st Why: Why is the website slow? Because the server response time is long.
   •   2nd Why: Why is the server response time long? Because it’s overloaded with too many requests.
   •   3rd Why: Why are there too many requests? Because a lot of bots are making fake requests.
   •   4th Why: Why are bots making fake requests? Because there’s no filtering system in place to block them.
   •   5th Why: Why is there no filtering system? Because it wasn’t set up during the website’s initial configuration.
   •   Solution: Set up a bot-blocking system to reduce server load and improve performance.

2.   Customer Complaints About Software Crashing
   •   Problem: A company’s support team receives multiple complaints that their app is crashing.
   •   1st Why: Why is the app crashing? Because it’s running out of memory.
   •   2nd Why: Why is it running out of memory? Because it’s using too much data.
   •   3rd Why: Why is it using too much data? Because the images in the app are not compressed.
   •   4th Why: Why aren’t the images compressed? Because the developers didn’t optimize them.
   •   5th Why: Why weren’t the developers optimizing images? Because they weren’t aware of the issue until users complained.
   •   Solution: Train developers on data optimization and perform regular app performance tests.

3.   Repeated Failed Logins by Customers
   •   Problem: Customers are having trouble logging into their accounts.
   •   1st Why: Why can’t they log in? Because their passwords are being rejected.
   •   2nd Why: Why are passwords being rejected? Because the system doesn’t recognize them.
   •   3rd Why: Why doesn’t the system recognize them? Because some users have reset their passwords repeatedly and are entering old ones.
   •   4th Why: Why are they resetting their passwords repeatedly? Because they can’t remember them.
   •   5th Why: Why can’t they remember their passwords? Because the system has strict password requirements that are hard to remember.
   •   Solution: Implement a more user-friendly password recovery system or use passwordless login methods like biometrics or magic links.

The 5 Whys technique helps digital support services dig deeper to find solutions that aren’t just quick fixes but address the real problem, saving time and improving user experiences in the long run.

5 Whys Game

Objective:
This activity helps students understand the 5 Whys technique by creating a game where they identify the root cause of a problem. It encourages teamwork, critical thinking, and communication skills.
Setup:
   •   Split the class into pairs (Pair A and Pair B).
   •   Each pair will create a simple problem scenario and guide the other pair through the 5 Whys questioning process to find the root cause.
   •   After creating their scenarios, the pairs will swap and solve each other’s problem using the 5 Whys method.

Instructions:
Part 1: Creating the Problem Scenario (10-15 minutes)
   1.   Create a Problem: In each pair, students come up with a fictional problem for the other pair to solve. The problem should be related to everyday experiences like technology, school, or a common inconvenience. The problem can be simple, such as:
   •   “The Wi-Fi isn’t working.”
   •   “A game console keeps freezing.”
   •   “The printer won’t print.”
   2.   Think of the Root Cause: After creating the problem, the pair should also come up with a potential root cause of the issue (e.g. Wi-Fi router is out of date, or the console needs an update) to make sure they can guide the other pair toward the answer.
   3.   Develop Hints: Prepare answers for each of the 5 Whys steps that will eventually lead to the root cause. Make sure the hints lead the problem-solvers to dig deeper, but don’t give away the solution immediately.
Example:
   •   Problem: “The laptop won’t turn on.”
   •   Why #1: Why isn’t the laptop turning on? (The battery is dead.)
   •   Why #2: Why is the battery dead? (It wasn’t charged overnight.)
   •   Why #3: Why wasn’t it charged overnight? (The charger wasn’t plugged in properly.)
   •   Why #4: Why wasn’t it plugged in properly? (The power strip was switched off.)
   •   Why #5: Why was the power strip switched off? (It was switched off to save power, but the student forgot to turn it back on before charging.)

Part 2: Swapping and Solving (10-15 minutes)
   1.   Swap Problems: Once both pairs have completed their problem scenarios, they swap with each other. Now each pair has a new problem to solve using the 5 Whys technique.
   2.   Solve the Problem:
   •   The pair solving the problem will ask “why” questions, starting with the surface problem.
   •   The other pair (the creators of the problem) will provide answers based on the hints they prepared.
   •   The goal is for the solvers to identify the root cause after asking at least 5 “why” questions.
   3.   Reflection: After solving, each pair should discuss:
   •   Did they reach the root cause? How difficult was it to ask the right “why” questions?
   •   Were the answers clear, or did they need more information?
   •   How did this process help them understand the problem better?

Part 3: Group Discussion (5-10 minutes)
After both pairs have completed the activity, come together as a class to reflect on the experience. Ask students:
   •   What challenges did they face when asking “why”?
   •   Did anyone find a different root cause than expected?
   •   How can this technique be useful in real life, especially in areas like technology or problem-solving?

Extensions:
   •   Role Reversal: Have the pairs switch roles again and create new problems.
   •   Real-World Scenarios: Ask pairs to think of real issues they’ve encountered in their daily lives (like a broken phone charger or app not working) and apply the 5 Whys.

This activity is designed to be fun, engaging, and hands-on, helping students apply the 5 Whys technique in a creative way while working together!

Failure mode and effects analysis (FMEA)

FMEA identifies which parts of a process or system are faulty or likely to fail, and what the effects of each failure would be.

In the digital support services industry, Failure Mode and Effects Analysis (FMEA) can be highly valuable in ensuring reliability, minimising downtime, and improving the overall customer experience. The industry relies on complex digital systems, software platforms, and networks, where failures can have significant operational and business impacts. Here’s how FMEA can be contextualised within this sector:

1. Failure Modes in Digital Support Services

• Software Failures: These could include bugs, crashes, or incompatibilities in the software used to provide support services. A failure might prevent users from accessing critical support tools or resources.

• Network Downtime: Disruptions in network connectivity that hinder communication between support teams and customers, leading to delays in problem resolution.

• Data Breaches or Cybersecurity Issues: Failures in protecting customer data can lead to breaches, data loss, or unauthorised access, which are critical in digital services.

• Poor Integration: Incompatibility between various software systems or tools used in support services, causing disruptions in the workflow or poor user experiences.

2. Effects of Failure

• Customer Dissatisfaction: Failures can lead to delays in resolving customer issues, causing frustration and possibly losing customers.

• Operational Downtime: Extended periods of system unavailability affect the ability of the support team to operate efficiently, impacting overall service delivery.

• Reputation Damage: Cybersecurity issues or consistent service failures can damage the organisation’s reputation, eroding trust with clients.

• Financial Losses: Unplanned outages, lost productivity, and reputational damage can lead to financial consequences, either through lost business opportunities or costs associated with fixing the issues.

3. Severity, Occurrence, and Detection in Digital Support Services

• Severity (S): For digital support services, severity can range from minor inconveniences (e.g., slow service) to critical issues such as complete system shutdowns, impacting service-level agreements (SLAs).

• Occurrence (O): In a digital environment, failure modes with high occurrence might include recurring software bugs, frequent network interruptions, or continual user-reported issues.

• Detection (D): Early detection of failure modes could involve automated monitoring tools, error logs, or customer feedback systems. Failures that are hard to detect, such as latent cybersecurity vulnerabilities, would rank high on the detection scale.

4. Risk Priority Number (RPN)

• The RPN is calculated by multiplying the three ratings: RPN = Severity × Occurrence × Detection. In digital support services, the RPN helps identify where proactive improvements are most needed. For example, a failure mode such as a data breach, with high severity (due to legal and reputational risks), moderate occurrence, and low detectability (a high detection rating), would have a high RPN. This would signal the need for immediate attention, such as implementing stronger cybersecurity measures or monitoring systems.
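As a rough illustration of how an RPN ranking might be produced, the sketch below multiplies severity, occurrence and detection for a few failure modes and sorts them from highest to lowest risk. The class name `FailureMode` and all of the ratings are hypothetical values chosen for the example, not figures from a real assessment.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 = negligible effect, 10 = critical effect
    occurrence: int  # 1 = very rare, 10 = very frequent
    detection: int   # 1 = almost certain to be detected, 10 = very hard to detect

    def __post_init__(self):
        # Reject ratings outside the 1 to 10 scale used in this section.
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")

    @property
    def rpn(self):
        # Risk Priority Number = Severity x Occurrence x Detection
        return self.severity * self.occurrence * self.detection


# Hypothetical ratings, for illustration only.
failure_modes = [
    FailureMode("Slow service", severity=3, occurrence=6, detection=2),
    FailureMode("Data breach", severity=10, occurrence=5, detection=8),
    FailureMode("Network downtime", severity=7, occurrence=4, detection=3),
]

# Highest RPN first: these are the failure modes to address first.
ranked = sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)

for fm in ranked:
    print(f"{fm.name}: RPN = {fm.rpn}")
```

With these assumed ratings the data breach comes out on top, which mirrors the reasoning in the paragraph above: high severity combined with poor detectability outweighs a more frequent but minor issue.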

5. Mitigation and Prevention in Digital Support

• Automated Monitoring Tools: To minimise occurrences of network and software failures, companies can implement real-time monitoring systems that alert teams when failures are likely to occur.

• Redundant Systems: In the case of network downtime, having redundant systems or backup networks can ensure continued service even if the primary system fails.

• Patch Management and Software Updates: Regularly updating software and applying patches can prevent common bugs and vulnerabilities that lead to system failures.

• Cybersecurity Protocols: Stronger encryption, multi-factor authentication, and real-time threat detection can mitigate the risk of data breaches and security-related failures.

FMEA as a Learning Tool in Digital Support Services:

In this industry, FMEA becomes a proactive learning model for continuously improving digital infrastructure. By routinely analysing failure modes, companies can:

• Enhance their incident response protocols to minimise customer impact.

• Build resilience into their systems by identifying critical failure points before they cause major service disruptions.

• Foster continuous improvement in service delivery by learning from previous failures, which in turn improves customer satisfaction and operational efficiency.

FMEA helps digital support service providers identify and prioritise potential failures, mitigate risks, and enhance overall system reliability, which is crucial for maintaining high levels of service quality and customer trust.

Activity: Failure Mode and Effects Analysis (FMEA) in Digital Support Services
Duration: 30 minutes
Target Audience: IT Students
Failure Mode and Effects Analysis (FMEA) with an application in Digital Support Services

Activity Overview:
In this activity, students will apply the principles of FMEA to identify potential failure modes in a common Digital Support Services scenario, evaluate the impact, and develop strategies to mitigate these risks. This practical exercise will enhance their problem-solving skills and understanding of risk management in IT service delivery.

Learning Objectives:
•   Understand the purpose and process of FMEA.
•   Apply FMEA to a real-world Digital Support Services scenario.
•   Identify failure modes, their effects, and possible mitigation strategies.
•   Present findings in a structured manner.

Scenario: User Account Management System

You work in a Digital Support Services team responsible for managing a company’s User Account Management System. This system handles user account creation, password resets, role assignments, and account deactivation. Ensuring this service operates smoothly is critical to business continuity and user satisfaction.

Task Breakdown (30 minutes):
1.   Introduction to FMEA (5 minutes):
•   Brief overview of what FMEA is: A structured approach to identifying and evaluating potential failures in a process, system, or product and their effects.
•   Explain how FMEA can be applied to Digital Support Services—e.g., improving system reliability, preventing service outages, and enhancing user experience.

2.   Group Work: Failure Mode Identification (10 minutes):
Divide students into small groups (3-5 members). Each group will brainstorm potential failure modes in the User Account Management System.
Examples of Failure Modes:
•   Incorrect password resets (users unable to reset their passwords due to system errors).
•   Delayed account activation (users not receiving timely access to their accounts).
•   Misassigned user roles (users receiving incorrect permissions).
•   Security vulnerabilities (users gaining unauthorised access).

3.   FMEA Analysis (10 minutes):
Each group will use the FMEA method to analyse one or two failure modes identified earlier.
For each failure mode, they must:
•   Identify the potential effects (e.g., user frustration, system downtime).
•   Rate the severity (how serious the effect is on the system or users) on a scale of 1 to 10.
•   Identify the causes (e.g., human error, software bugs, incorrect configurations).
•   Rate the occurrence (likelihood of the failure happening) on a scale of 1 to 10.
•   Identify current controls (e.g., automated password reset system).
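•   Rate the detection (how likely it is that current controls will detect the failure before it affects users) on a scale of 1 to 10, where 1 means the failure is almost certain to be detected and 10 means it is very unlikely to be detected.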
•   Suggest improvements to reduce the risk of failure (e.g., better user training, enhanced system logging).

4.   Presentation and Discussion (5 minutes):
Each group will present one failure mode and their FMEA analysis to the class, focusing on the failure’s severity, likelihood of occurrence, and proposed solutions. The instructor and other groups can ask questions and provide feedback.

Example to Support Understanding:
Failure Mode: Incorrect Password Resets
•   Effect: Users are unable to access the system, leading to frustration, productivity loss, and increased support calls.
•   Severity: 8 (high, since access is crucial for daily operations).
•   Cause: Incorrect configuration of the password reset system or outdated email templates.
•   Occurrence: 5 (moderate, occurs occasionally but not frequently).
•   Current Controls: Automated password reset tool, email notifications.
•   Detection: 6 (moderate, failures are reported by users, but the system does not automatically detect the issue).
•   Suggested Improvement: Introduce a monitoring system that alerts support staff when password resets fail and implement regular audits of the reset tool.
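To show how the ratings in this example could feed into a Risk Priority Number (section 4 above), the short sketch below multiplies them together and then repeats the calculation after the suggested improvement. The helper name `rpn` is hypothetical, and the improved detection rating of 3 is an assumption made purely for illustration; a real team would re-rate detection after the monitoring system is in place.

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number: the product of the three FMEA ratings."""
    return severity * occurrence * detection


# Ratings taken from the "Incorrect Password Resets" example.
before = rpn(severity=8, occurrence=5, detection=6)

# Assumption: automated alerts on failed resets improve the detection
# rating from 6 to 3. Severity and occurrence are left unchanged.
after = rpn(severity=8, occurrence=5, detection=3)

print("RPN before improvement:", before)
print("RPN after improvement:", after)
print("Reduction:", before - after)
```

Comparing the two numbers gives groups a simple way to justify a proposed improvement during the presentation stage.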

Expected Outcome:
Students will gain a practical understanding of FMEA as a tool for improving Digital Support Services by identifying and addressing potential failure points. The exercise will also help them develop teamwork and presentation skills as they discuss their findings and propose solutions.

Materials Required:
•   FMEA templates (paper or digital)
•   Scenario description (provided in the activity)
•   Whiteboard/flip chart for group presentations

Event tree analysis (ETA)

Event Tree Analysis (ETA) is a method used to evaluate how an event or failure could progress and what consequences it might lead to. It starts with a single event, called an “initiating event”, and from there, branches out like a tree, showing different possible outcomes. This analysis is especially useful in safety and risk assessments because it helps identify how different systems, processes, or actions can either stop or allow the event to get worse.

How ETA Works:

   1.   Identify the initiating event – This could be anything, from a system error to a power outage.
   2.   Identify the systems or processes designed to respond – For each step, there might be safety systems or processes that can either work (success) or fail (failure).
   3.   Create branches for each decision point – If something works, the branch leads to a positive outcome. If something fails, the branch leads to a more negative outcome.
   4.   Evaluate consequences – Each branch ends in a possible outcome, ranging from “nothing happens” to a serious failure.
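The four steps above can be sketched as a small Python function that walks through the response systems in order and records one branch for each point where a system works, plus a final branch where every system fails. The function name `event_tree`, the two response systems, their outcomes and the success probabilities are all assumptions chosen for illustration; the steps above do not require probabilities, but they are commonly attached to branches to compare how likely each outcome is.

```python
def event_tree(barriers, final_outcome):
    """Build the branches of a simple sequential event tree.

    barriers: list of (name, probability_it_works, outcome_if_it_works)
    final_outcome: what happens if every barrier fails
    Returns a list of (path, outcome, probability) tuples.
    """
    branches = []
    path = []        # failures encountered so far on this route
    p_reach = 1.0    # probability of reaching the current decision point
    for name, p_success, outcome_if_success in barriers:
        # Success branch: the event is stopped here.
        branches.append((path + [name + ": works"], outcome_if_success,
                         p_reach * p_success))
        # Failure branch: carry on to the next barrier.
        path = path + [name + ": fails"]
        p_reach *= (1 - p_success)
    branches.append((path, final_outcome, p_reach))
    return branches


initiating_event = "Power outage in a data centre"

# Hypothetical response systems and probabilities, for illustration only.
barriers = [
    ("Backup power", 0.9, "Service continues on backup power"),
    ("Failover to secondary site", 0.8, "Brief interruption, then service restored"),
]

branches = event_tree(barriers, final_outcome="Full outage")

print("Initiating event:", initiating_event)
for path, outcome, probability in branches:
    print(" -> ".join(path), "=>", outcome, f"(p = {probability:.2f})")
```

Each printed line corresponds to one route through the tree, from the initiating event to a consequence, and the branch probabilities add up to 1 because every route is covered.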

Example Situations in Digital Support Services:

In digital support services, where businesses provide technical help and maintain digital infrastructure, ETA can help assess risks related to system failures or cyber-attacks.

Example 1: System Outage in a Cloud Service Provider

Imagine a situation where a cloud service provider (like Google Cloud or Amazon Web Services) experiences a major power

Activity:
Event Tree Analysis (ETA) in Digital Support Services
Objective:
To research and understand how Event Tree Analysis (ETA) can be applied in digital support services, particularly focusing on system failures, cyber-attacks, or technical errors. Students will present their findings in a 5-minute presentation, showing their understanding of ETA and how it applies to real-world scenarios.
Instructions:

   1.   Choose a Scenario:
   •   Select a situation in digital support services where ETA could be applied. Examples include:
   •   A cyber-attack on a company’s network.
   •   A system failure or outage in a cloud service provider.
   •   A technical error in a data centre causing partial downtime.
   •   Think about what the initiating event might be, and what the potential outcomes could be, based on whether systems work or fail.

   2.   Research ETA in Digital Systems:
   •   Spend 10-15 minutes researching the use of Event Tree Analysis in the context of digital systems. Use the following guiding questions:
   •   What is ETA, and how does it help assess risk?
   •   How does ETA apply in scenarios like system failures or cyber-attacks?
   •   What are some real-world examples where ETA has been used in digital services?

   3.   Sources for Research:
   •   Health and Safety Executive: Event Tree Analysis – A basic guide to ETA and how it works.
   •   ScienceDirect: Event Tree Analysis – A detailed overview of ETA and its applications.
   •   Cloud Computing Incidents Database – Real-world cases of cloud service outages and failures, which can be useful for examples.
   •   Cybersecurity and Infrastructure Security Agency (CISA) – Information on cyber-attacks and system vulnerabilities.

   4.   Create a Presentation:
   •   Spend 15 minutes compiling your findings into a short presentation. Your presentation should include:
   •   A brief explanation of Event Tree Analysis.
   •   The scenario you selected (e.g., system failure, cyber-attack) and its potential consequences.
   •   A simple event tree diagram showing the possible outcomes (you can draw this or create it digitally).
   •   A conclusion explaining why ETA is useful in digital support services.

   5.   Presentation Requirements:
   •   The presentation should last 5 minutes.
   •   Be ready to explain your event tree and how you arrived at the possible outcomes.

Actions to take after using root cause analysis:

   •   log
   •   close
   •   escalate to an appropriate manager, specialist or external third party.

1.3.5 The process of the high-level problem-solving strategy:

A high-level problem-solving strategy is a structured, logical process used to diagnose, analyse, and resolve issues efficiently. In IT Support and Cyber Security, this approach is essential for ensuring faults are identified accurately, risks are managed appropriately, and solutions are implemented safely without causing further disruption. It helps technicians stay systematic and evidence-based, especially when dealing with complex digital systems, networks, devices, and security incidents.

Below is a full explanation of each stage, followed by examples that link directly to Digital Support Services and Security operations.

Identify the Problem

This first stage focuses on gathering information about what is wrong. Technicians must determine the scope, symptoms, and impact of the fault.

In practice (IT Support example):

A user reports that their PC cannot access the network.
Support staff collect details:
- Does the device see the Wi-Fi network?
- Are other devices affected?
- When did the issue begin?
- Has anything changed (software update, new hardware, password expired)?

In practice (Security example):

SIEM alerts show unusual outbound traffic from a workstation.
Analysts identify whether this is expected behaviour or an indicator of compromise (IoC).

• gather information

Analyse the information

After gathering symptoms, you refine the problem into a clear definition. This avoids fixing the wrong issue.

In practice (IT Support example):

After testing, the root cause might be:

Network cable unplugged
Incorrect static IP address
DHCP server unavailable

In practice (Security example):

A phishing investigation identifies:

Compromised user credentials
Malicious script running in the background
User clicked an unsafe link

Make a plan of action

At this point, technicians brainstorm multiple potential fixes. These can include temporary workarounds or long-term solutions.

In practice (IT Support example):

For a PC that cannot join the domain:

Reset network adapter
Flush DNS
Rejoin the domain
Replace corrupted profile
Roll back the latest update

In practice (Security example):

For a credential compromise:

Force password reset
Remove suspicious log-ins
Block malicious IP addresses
Patch affected systems
Strengthen MFA for future prevention

Implement the Solution

The chosen solution is applied. IT support technicians must consider:

Change-control procedures
System downtime
Backups
Communication with stakeholders

IT Support Example:

Applying a Group Policy fix to a faulty set of PCs after notifying staff.
Installing the correct printer driver across a department.

Security Example:

Isolating a workstation from the network during incident response.
Rolling out a security patch across all devices.

Review the solution.

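Once a fix is in place, technicians review whether it resolved the fault and whether it was the right choice. Not all solutions are equal: each option is assessed on feasibility, risk, cost, time, and impact on users.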

IT Support Example:

To fix slow network speeds, options include:

Replace damaged cabling (long-term fix)
Restart the switch (quick fix but may cause downtime)
Change user’s port (minimal disruption)

The technician chooses the option that balances effectiveness and minimal business impact.

Security Example:

For malware detected on a server, options could include:

Clean the infection
Restore from a timestamped backup
Rebuild the server from scratch

If the malware is unknown or persistent, rebuilding the server may be the safest option despite being time-consuming.

1.3.6 The definition of a digital incident, in incident management:

• a single unplanned event

• that disrupts service operations

• that negatively impacts service quality

1.3.7 The definition of a digital problem, in incident management, as the cause of the incident.

1.3.8 The process of incident management:

• detection: report, record, prioritise

• response: identify owner, resolve and restore, record resolution

• intelligence: record lessons, identify cause, share lessons.
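One way to picture the incident management process is as an ordered checklist. The sketch below stores the three phases and their steps exactly as listed above and returns whichever step comes next for a given incident. The names `INCIDENT_PROCESS` and `next_step` are hypothetical and are not taken from any ticketing tool.

```python
# Phases and steps in the order given in 1.3.8.
INCIDENT_PROCESS = {
    "detection": ["report", "record", "prioritise"],
    "response": ["identify owner", "resolve and restore", "record resolution"],
    "intelligence": ["record lessons", "identify cause", "share lessons"],
}


def next_step(completed):
    """Return the next (phase, step) not yet completed, or None when finished.

    completed: a collection of (phase, step) pairs already done.
    """
    for phase, steps in INCIDENT_PROCESS.items():
        for step in steps:
            if (phase, step) not in completed:
                return phase, step
    return None


# Example: an incident that has been reported and recorded so far.
done = [("detection", "report"), ("detection", "record")]
print(next_step(done))
```

Keeping the steps in a fixed order reflects the idea that an incident should be prioritised before an owner is assigned, and resolved before lessons are recorded and shared.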

1.3.9 The interrelationships between problems and problem-solving strategies and make judgements about the suitability of strategies for solving the problems in digital support and security

Last Updated
2025-11-20 15:18:01

