Week 1 | T&L Activities: Learning Aims and Objectives:
P3.1 Fault analysis tools and their applications to identify problems
Why do companies need tools to monitor and analyse faults? Would it not be simpler to have a system that only raises an alert when a fault occurs, so that it can then be fixed? The answer is quite simple: faults that look the same are not always triggered in the same way. We must remember that at any one time a network, computer or digital device could be processing any number of actions in the background that users are not aware of. Monitoring and analysis tools give engineers and technicians an opportunity to review the system logs and understand the potential triggers.
Have you ever had a fault that someone else has also experienced? Were the faults triggered at the same time? Were you both working on the same thing with the same applications or elements open? What was the issue and how did you resolve it? Was it the same process as the other person used?
Having looked at and used some of the network/tone testers in a lab setting, research and review the different testers available and the features and functions they offer. Discuss where these might be used and the benefits to the users.
How are faults identified on the systems that we use? Let's look into the most common methods.
Traditional fault-finding techniques used flow charts to ask users questions that ruled out the areas where the issue was not. These flow charts were significant in their breadth of the topic area, and became almost obsolete because of the large number of variables and possible resolutions. However, a flow chart can be seen in action helping Sheldon find a friend.
System alerts
A flag when a system condition is outside predetermined parameters.
Common system alerts come in the form of beep codes, which act like Morse code for technicians. The codes are short and long pulses of sound generated by a small speaker mounted on the main circuit board. On newer devices, small LCD screens display codes that can be cross-referenced against the manufacturer's documentation provided with the device or online.
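As a minimal sketch of a system alert, the check below flags any reading that falls outside its predetermined parameters. The metric names and limits are hypothetical and chosen only for illustration; real limits would come from the manufacturer or from organisational policy.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Acceptable range for one monitored metric (values are hypothetical)."""
    low: float
    high: float


# Hypothetical predetermined parameters for two monitored metrics.
THRESHOLDS = {
    "cpu_temp_c": Threshold(low=10.0, high=85.0),
    "disk_used_pct": Threshold(low=0.0, high=90.0),
}


def check_alerts(readings: dict[str, float]) -> list[str]:
    """Return one alert message for every reading outside its range."""
    alerts = []
    for metric, value in readings.items():
        limits = THRESHOLDS.get(metric)
        if limits is None:
            continue  # no parameters defined for this metric, so nothing to flag
        if not limits.low <= value <= limits.high:
            alerts.append(f"ALERT {metric}={value} outside {limits.low}-{limits.high}")
    return alerts


print(check_alerts({"cpu_temp_c": 92.5, "disk_used_pct": 40.0}))
```

Here only the CPU temperature reading raises an alert, because the disk usage is within its range.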
Flow Chart Operators
Let's consider using a flow chart to support the resolution of a sound card fault. Using the flow chart notation above, create a chart that someone could use to fix a sound card issue.
Activity/error logs
Record of all interactions and events within network systems.
The importance of logging becomes only too clear when reflecting on possible clues in the lead-up to faults and their diagnosis. Within the educational sector a shared academic network called "Janet" is used. This network is operated and supported by Jisc and was historically managed by UKERNA (United Kingdom Education and Research Networking Association), a non-profit organisation linked to the UK higher education funding councils. The Jisc website has a series of support documents that further discuss the use of, and need for, activity/error logs: Jisc Activity and Error logging
Using the link above, research which areas those who use and work with the Janet network are advised to log.
What do these logs look like and what files are they? In most cases, log files are not formatted documents that present information with pretty headings and images. They tend to contain rows of information, with each logged event on its own line. As you can see in the image below, the log identifies the date and time at which something was recorded, a message describing the error that was logged, and what the user might have been trying to access. In most situations the log files for networks and programs are basic text files which, in a Windows environment, would open in Notepad. The files are generated automatically by the systems at specific times and, if needed, manually.
Live traces
Identify any network traffic or activity in real time.
Network traffic can be monitored in real time when it is working, but what about when it isn't?
Case Study:
Dashboards
A consolidated visual representation of system condition and performance.
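To connect log files with dashboards, the sketch below parses a few plain-text log lines and prints a simple text summary of how many events were logged at each level. The log format ("date time LEVEL message") and the sample entries are assumptions made for illustration; real log formats vary between systems.

```python
from collections import Counter

# Hypothetical log lines in the form "date time LEVEL message".
SAMPLE_LOG = """\
2024-03-04 09:15:02 ERROR Login failed for user jsmith
2024-03-04 09:15:40 INFO Backup completed
2024-03-04 09:17:11 ERROR File not found: /reports/march.pdf
2024-03-04 09:20:55 WARNING Disk usage at 91%
"""


def parse_line(line: str) -> dict[str, str]:
    """Split one log line into its date, time, level and message fields."""
    date, time, level, message = line.split(" ", 3)
    return {"date": date, "time": time, "level": level, "message": message}


def summarise(log_text: str) -> Counter:
    """Count how many entries were logged at each level."""
    entries = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    return Counter(entry["level"] for entry in entries)


summary = summarise(SAMPLE_LOG)
for level, count in summary.most_common():
    # One '#' per logged event gives a very basic text bar chart.
    print(f"{level:<8} {'#' * count} ({count})")
```

A real dashboard would draw this kind of summary graphically and refresh it automatically, but the principle is the same: many individual log rows are consolidated into one view of system condition.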
Files that support this week | English:
|
Assessment:
|
Learning Outcomes:
|
Awarding Organisation Criteria:
|
Maths:
|
|||||
Stretch and Challenge:
|
E&D / BV | ||||
Homework / Extension:
|
ILT | ||||
Week 2 | T&L Activities: Learning Aims and Objectives:
P3.2 The purpose and application of organisational frameworks for troubleshooting and problem management
Problem identification
Identify and isolate faults using diagnostic and analytical tools to establish the probable cause.
Activity: Troubleshooting and Problem Management: Identifying Faults Using Diagnostic and Analytical Tools
Logging
Review fault history, identifying potential trends and issues.
Action plan
Plan or strategy for repair, restoration and prevention of further issues.
Escalation
To an appropriate manager, specialist or external third party.
Solution implementation
Implement required changes to fix and restore services.
Problem closure and review
Notify user and document any configuration changes.
Week 3 | T&L Activities: Learning Aims and Objectives:
P3.3 Root cause analysis approaches and their applications within problem management:
The 5 ‘whys’
An iterative questioning technique, the 5 Whys is a simple but effective way to get to the bottom of a problem by asking “why” five times, or more if needed. It’s like peeling back the layers of an onion, going deeper each time until you find the true cause of the issue, not just the surface-level problem.
Here’s how it works:
• You start by identifying the problem.
In digital support services, this technique helps teams figure out why things go wrong, whether it’s related to technology not working, users having issues, or systems failing. By understanding the true cause, companies can fix problems effectively and prevent them from happening again.
Example Case Studies in Digital Support Services:
1. Slow Website Performance
2. Customer Complaints About Software Crashing
3. Repeated Failed Logins by Customers
The 5 Whys technique helps digital support services dig deeper to find solutions that aren’t just quick fixes but address the real problem, saving time and improving user experiences in the long run.
5 Whys Game
Fishbone diagram
A Fishbone Diagram, also known as a Cause and Effect Diagram or Ishikawa Diagram, is a visual tool used to identify and organise potential causes of a problem or effect. The diagram resembles a fish skeleton, where the “head” represents the main problem, and the “bones” branching off represent different categories of causes. The concept was developed by Kaoru Ishikawa, a Japanese quality control expert, in the 1960s.
Key Components
1. Head: The problem or effect you want to analyse (e.g., “Customer Satisfaction Issues”).
Application in Digital Support Services
In the context of digital support services, a Fishbone Diagram can help teams identify the root causes of issues that customers face, leading to more effective solutions. Here’s how it can be used:
1. Identifying Technical Problems: For example, if a company receives complaints about its online support portal being difficult to navigate, a Fishbone Diagram can help identify whether the issues stem from:
2. Improving Customer Service: If customer satisfaction scores are low, the Fishbone Diagram can break down potential causes, such as:
Case Studies and Examples
1. Case Study: A Software Company
2. Case Study: An E-Commerce Retailer
Articles and Resources
Here are some articles and resources for further reading on Fishbone Diagrams and their application in digital support services:
1. MindTools - Cause and Effect Analysis: Link
Using a Fishbone Diagram can help teams in digital support services understand and tackle the root causes of problems, ultimately leading to improved customer satisfaction and service delivery.
Fault Finding Using Fishbone Diagrams
Failure mode and effects analysis (FMEA)
Identifies which parts of the process or system are faulty.
In the digital support services industry, Failure Mode and Effects Analysis (FMEA) can be highly valuable in ensuring reliability, minimising downtime, and improving the overall customer experience. The industry relies on complex digital systems, software platforms, and networks, where failures can have significant operational and business impacts. Here’s how FMEA can be contextualised within this sector:
1. Failure Modes in Digital Support Services
• Software Failures: These could include bugs, crashes, or incompatibilities in the software used to provide support services. A failure might prevent users from accessing critical support tools or resources.
• Network Downtime: Disruptions in network connectivity that hinder communication between support teams and customers, leading to delays in problem resolution.
• Data Breaches or Cybersecurity Issues: Failures in protecting customer data can lead to breaches, data loss, or unauthorised access, which are critical in digital services.
• Poor Integration: Incompatibility between various software systems or tools used in support services, causing disruptions in the workflow or poor user experiences.
2. Effects of Failure
• Customer Dissatisfaction: Failures can lead to delays in resolving customer issues, causing frustration and possibly losing customers.
• Operational Downtime: Extended periods of system unavailability affect the ability of the support team to operate efficiently, impacting overall service delivery.
• Reputation Damage: Cybersecurity issues or consistent service failures can damage the organisation’s reputation, eroding trust with clients.
• Financial Losses: Unplanned outages, lost productivity, and reputational damage can lead to financial consequences, either through lost business opportunities or costs associated with fixing the issues.
3. Severity, Occurrence, and Detection in Digital Support Services
• Severity (S): For digital support services, severity can range from minor inconveniences (e.g., slow service) to critical issues such as complete system shutdowns, impacting service-level agreements (SLAs).
• Occurrence (O): In a digital environment, failure modes with high occurrence might include recurring software bugs, frequent network interruptions, or continual user-reported issues.
• Detection (D): Early detection of failure modes could involve automated monitoring tools, error logs, or customer feedback systems. Failures that are hard to detect, such as latent cybersecurity vulnerabilities, would rank high on the detection scale.
4. Risk Priority Number (RPN)
• The RPN is calculated by multiplying the severity, occurrence and detection scores (S × O × D). In digital support services, the RPN helps identify where proactive improvements are most needed. For example, a failure mode such as a data breach, with high severity (due to legal and reputational risks), moderate occurrence, and low detectability, would have a high RPN. This would signal the need for immediate attention, such as implementing stronger cybersecurity measures or monitoring systems.
5. Mitigation and Prevention in Digital Support
• Automated Monitoring Tools: To minimise occurrences of network and software failures, companies can implement real-time monitoring systems that alert teams when failures are likely to occur.
• Redundant Systems: In the case of network downtime, having redundant systems or backup networks can ensure continued service even if the primary system fails.
• Patch Management and Software Updates: Regularly updating software and applying patches can prevent common bugs and vulnerabilities that lead to system failures.
• Cybersecurity Protocols: Stronger encryption, multi-factor authentication, and real-time threat detection can mitigate the risk of data breaches and security-related failures.
FMEA as a Learning Tool in Digital Support Services: In this industry, FMEA becomes a proactive learning model for continuously improving digital infrastructure. By routinely analysing failure modes, companies can:
• Enhance their incident response protocols to minimise customer impact.
• Build resilience into their systems by identifying critical failure points before they cause major service disruptions.
• Foster continuous improvement in service delivery by learning from previous failures, which in turn improves customer satisfaction and operational efficiency.
FMEA helps digital support service providers identify and prioritise potential failures, mitigate risks, and enhance overall system reliability, which is crucial for maintaining high levels of service quality and customer trust.
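As a worked sketch of the Risk Priority Number, the example below multiplies severity, occurrence and detection scores and ranks the failure modes by the result. The 1 to 10 scales are a common convention rather than something fixed by this unit, and the scores themselves are hypothetical values chosen only to illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # assumed scale: 1 (minor) to 10 (critical)
    occurrence: int  # assumed scale: 1 (rare) to 10 (frequent)
    detection: int   # assumed scale: 1 (easy to detect) to 10 (hard to detect)

    @property
    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical scores for illustration only.
modes = [
    FailureMode("Data breach", severity=10, occurrence=4, detection=8),
    FailureMode("Network downtime", severity=7, occurrence=5, detection=2),
    FailureMode("Software crash", severity=5, occurrence=6, detection=3),
]

# Highest RPN first: these are the failure modes to address first.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.name:<18} RPN = {mode.rpn}")
```

With these assumed scores the data breach ranks first (10 × 4 × 8 = 320), mainly because it is both severe and hard to detect.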
Event tree analysis (ETA)
Event Tree Analysis (ETA) is a method used to evaluate how an event or failure could progress and what consequences it might lead to. It starts with a single event, called an “initiating event”, and from there, branches out like a tree, showing different possible outcomes. This analysis is especially useful in safety and risk assessments because it helps identify how different systems, processes, or actions can either stop or allow the event to get worse.
How ETA Works:
1. Identify the initiating event: This could be anything, from a system error to a power outage.
Example Situations in Digital Support Services:
In digital support services, where businesses provide technical help and maintain digital infrastructure, ETA can help assess risks related to system failures or cyber-attacks.
Example 1: System Outage in a Cloud Service Provider
Imagine a situation where a cloud service provider (like Google Cloud or Amazon Web Services) experiences a major power
Pareto chart
A Pareto chart is a type of bar chart combined with a line graph. It displays the relative frequency or significance of problems or factors in descending order, with the bars representing individual values (e.g., issues or defects) and the line graph showing the cumulative total. The principle behind the chart is based on the Pareto principle (80/20 rule), which suggests that 80% of problems are often caused by 20% of the causes. It helps in identifying the most significant factors that need attention to achieve improvement efficiently.
In the digital support services sector, particularly within IT, Pareto charts are commonly used to identify and prioritise issues in systems, applications, or processes. They help organisations focus on the most impactful problems, improving response times and customer satisfaction by addressing the major pain points.
Example 1: Amazon Web Services (AWS)
A cloud provider such as AWS could use Pareto charts to identify common causes of service disruptions or support tickets. By analysing data from the support system, the provider can determine which types of incidents are most frequent and impactful. For instance, 80% of all support tickets might be traced back to a handful of misconfigurations or recurring errors in the services. By addressing these high-frequency issues, the provider can improve the platform’s reliability and reduce the volume of incoming support requests.
Why use it: Pareto charts make it possible to pinpoint major operational issues and focus resources on solving the most critical problems, improving the overall user experience and reducing operational costs.
Example 2: Google Cloud
A provider such as Google Cloud could apply Pareto charts when analysing downtime reports and customer feedback related to its cloud infrastructure services. A Pareto analysis quickly shows which factors (e.g., network outages, storage failures) are causing the most significant disruptions. By concentrating effort on resolving the top 20% of root causes, the provider can greatly reduce system downtime and improve the resilience of its services.
Why use it: Pareto charts help prioritise engineering effort on the key issues affecting customer experience, helping to optimise service performance and reduce maintenance costs.
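The figures behind a Pareto chart can be sketched in a few lines: sort the categories from largest to smallest, then keep a running cumulative percentage. The ticket categories and counts below are hypothetical and exist only to show the calculation.

```python
# Hypothetical support ticket counts per issue category.
ticket_counts = {
    "Password resets": 120,
    "VPN connection": 45,
    "Printer faults": 20,
    "Software install": 10,
    "Other": 5,
}


def pareto_table(counts: dict[str, int]) -> list[tuple[str, int, float]]:
    """Return (category, count, cumulative %) rows, largest count first."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


rows = pareto_table(ticket_counts)
for category, count, cumulative in rows:
    print(f"{category:<18} {count:>4} {cumulative:>6}%")
```

The counts would be drawn as the bars and the cumulative percentages as the line. In this made-up data the first two categories account for 82.5% of all tickets, which is the kind of pattern the 80/20 rule describes.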
Activity: Identifying and Solving Common IT Issues Using a Pareto Chart
Scatter diagram
A scatter diagram (also known as a scatter plot) is a type of graph that helps to identify whether there is a relationship between two different factors, also called variables. It plots individual data points on a graph, with one variable along the x-axis (horizontal) and the other along the y-axis (vertical). Each dot on the graph represents one instance or observation of the data, showing where the two variables intersect.
How Scatter Diagrams Help Identify Relationships
In digital support services, scatter diagrams are useful for identifying patterns between different aspects of service quality, customer satisfaction, or system performance. By plotting the data, you can visually see whether there’s a relationship, or correlation, between the two variables. There are three main types of relationships you might observe:
1. Positive correlation: As one variable increases, the other also increases (e.g. the more time a customer spends on a website, the more they are likely to purchase).
2. Negative correlation: As one variable increases, the other decreases.
3. No correlation: The points show no clear pattern, so the variables do not appear to be related.
Example in Digital Support Services
Consider a digital support services company that wants to understand how response time affects customer satisfaction. They could gather data from customer service logs (response time) and customer feedback scores (satisfaction ratings). By plotting response time on the x-axis and satisfaction scores on the y-axis, they could create a scatter diagram to see if there’s any visible relationship. If the points trend downwards, it would suggest a negative correlation: as response time increases, satisfaction falls, so quicker response times are associated with higher satisfaction.
Case Study Examples
3. Netflix
Investigating the Relationship Between Response Time and Customer Satisfaction
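As a starting point for this investigation, the sketch below calculates the Pearson correlation coefficient, a number between -1 and 1 that summarises what a scatter diagram shows visually. The response times and satisfaction scores are hypothetical sample data, not real measurements.

```python
from math import sqrt

# Hypothetical data: response time in minutes and satisfaction score (1-10).
response_minutes = [5, 10, 15, 20, 30, 45, 60]
satisfaction = [9, 9, 8, 7, 6, 4, 3]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(response_minutes, satisfaction)
print(f"r = {r:.2f}")
```

For this sample data the coefficient is close to -1, a strong negative correlation: longer response times go with lower satisfaction scores. A value near 0 would indicate no clear relationship, and correlation on its own does not prove that one variable causes the other.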
Week 4 | T&L Activities: Learning Aims and Objectives:
P3.4 The principles of incident management (for example Information Technology Infrastructure Library (ITIL®)) models in the context of digital support services
Incident management is a process used to deal with problems (referred to as “incidents”) that occur when using technology or digital services, such as computers, websites, or apps. For instance, if a website crashes or a computer system stops working, that’s an incident. One of the most widely used frameworks for managing incidents is called ITIL® (Information Technology Infrastructure Library). ITIL provides a structured set of steps to ensure incidents are fixed quickly and don’t cause bigger issues.
ITIL was created by the UK Government’s Central Computer and Telecommunications Agency (CCTA) in the 1980s. It started as a way to standardise IT practices across different government departments. As ITIL became more popular in the private sector, it was published as a series of books, which came to be known as the ITIL library. In 2013, AXELOS Limited was set up as a partnership between the UK Government’s Cabinet Office and Capita plc. AXELOS took over the management and ownership of ITIL, along with other well-known frameworks like PRINCE2 and MSP. Under AXELOS, ITIL has continued to develop, keeping up with new technology and business practices.
How Incident Management Works
1. Identify the incident: This is when someone notices there’s a problem. It could be a user calling support to say they can’t log into a website, or an alert showing a server is down.
Example of Incident Management in Use
• E-commerce websites: If the checkout system of a site like Argos stops working, that’s a significant incident. The incident management team would receive an alert, determine the cause (perhaps a server issue), and work swiftly to resolve it to avoid losing customers or revenue.
Benefits of Using Incident Management (ITIL)
1. Efficiency: Having a structured approach means incidents are resolved faster, reducing downtime and ensuring services are restored quickly.
Disadvantages of Using Incident Management (ITIL)
1. Time-Consuming: Following a structured process takes time. In less critical situations, it might feel like more effort than it’s worth.
Detection
Detection in Incident Management: Reporting and Recording
The detection stage of incident management is critical, as it’s the first step in identifying and responding to problems that occur within digital support services. This phase focuses on the reporting and recording of incidents, ensuring they are captured and tracked accurately.
Reporting the Incident
Incidents can be detected in different ways, and the reporting process involves bringing attention to a problem. There are two main types of reporting:
1. User-reported incidents: These are issues that are brought to the attention of the IT or support team by the users themselves. For example, a student might report that they are unable to access their online learning portal, or a customer may contact a help desk because a website is not loading properly. Users play an important role in helping identify problems early, especially when automatic systems don’t detect them.
Recording the Incident
Once an incident is reported, the next step is to record it in a central system, such as an incident management tool. This is important for several reasons:
1. Tracking and organisation: Recording incidents ensures that each one is properly tracked and monitored. By logging details like the time, type of issue, and who reported it, the team can stay organised and make sure nothing is missed. It also allows the team to see patterns over time, which can help in identifying recurring problems.
Response
Response in Incident Management: Ownership, Resolution, and Recording
Once an incident is detected and recorded, the next crucial phase is the response. This involves identifying who is responsible for handling the issue, working to resolve the problem and restore normal service, and documenting how the incident was resolved.
Identifying an Owner
After an incident is recorded, it must be assigned to the correct person or team, known as the owner of the incident. The owner has responsibility for managing the issue until it is fully resolved. Identifying the right owner is essential for efficient incident management, and this process often depends on the type of incident:
1. Assigning based on expertise: Different teams or individuals may specialise in certain areas, such as network issues, server maintenance, or application development. For example, if the issue involves a server being down, the incident would likely be assigned to the infrastructure team. If it’s a software bug, it might go to the development team.
Resolving the Issue and Restoring Service
Once an owner is identified, the next step is to focus on resolving the incident and restoring normal service. This involves diagnosing the root cause of the problem and implementing a solution to fix it.
1. Diagnosis and troubleshooting: The incident owner will start by identifying the cause of the issue, which may involve investigating system logs, reviewing recent changes, or running diagnostic tests. For example, if a website is down, the owner might check whether there’s an issue with the server or if a recent software update caused the problem.
Recording Incident Resolution and Applied Changes
Once the incident is resolved, it’s important to record the resolution and any changes that were applied to fix the issue. This final step ensures that the incident is fully documented and can be reviewed later if necessary.
1. Documenting the solution: The details of how the incident was resolved are recorded in the incident management system. This includes what caused the problem, the steps taken to fix it, and whether any changes were made to prevent it from happening again.
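The detection and response stages can be sketched as a single incident record that is created when the incident is reported, assigned to an owner, and then updated with the resolution. This is a minimal illustration: the field names, status values and routing table are assumptions, not an official ITIL schema or the data model of any particular incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    """One incident record; field names are illustrative only."""
    summary: str
    reported_by: str
    category: str
    reported_at: datetime = field(default_factory=datetime.now)
    owner: Optional[str] = None
    status: str = "recorded"
    resolution: Optional[str] = None
    changes_applied: list[str] = field(default_factory=list)


# Hypothetical routing table: incident category -> owning team.
OWNERS = {"server": "Infrastructure team", "software": "Development team"}


def assign_owner(incident: Incident) -> None:
    """Assign the incident to a team based on its category."""
    incident.owner = OWNERS.get(incident.category, "Service desk")
    incident.status = "assigned"


def resolve(incident: Incident, resolution: str, changes: list[str]) -> None:
    """Record how the incident was fixed and which changes were applied."""
    incident.resolution = resolution
    incident.changes_applied.extend(changes)
    incident.status = "resolved"


incident = Incident("Website is down", reported_by="monitoring alert", category="server")
assign_owner(incident)
resolve(incident, "Restarted web service after failed update", ["Rolled back update"])
print(incident.owner, incident.status)
```

Keeping the reporter, owner, resolution and applied changes in one record is what allows the incident to be reviewed later and recurring problems to be spotted.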
Intelligence
The intelligence aspect of incident management focuses on what can be learned from each incident to prevent future occurrences and improve overall service quality. It involves carefully recording lessons learned, investigating the root cause, and using that knowledge to update procedures and reduce the risk of similar incidents happening again.
Recording Lessons Learned, Fixes, and Procedure Updates
After an incident is resolved, it is vital to record all lessons learned during the process. This step ensures that the organisation can learn from the experience and improve its incident management approach.
1. Documenting the lessons: Once an incident has been resolved, the team reflects on what went well, what challenges were faced, and how things could be done better next time. These insights are recorded in detail so they can be reviewed in future incidents.
Performing In-Depth Investigation and Root Cause Analysis
Once an incident is resolved, it’s essential to investigate thoroughly to understand the underlying cause, especially if the issue was complex or had a significant impact. This can prevent the same problem from recurring in the future.
1. Root cause analysis: An in-depth investigation is performed to identify the true cause of the incident, rather than just addressing the symptoms. For example, if a website crashed, was it due to a server overload, a software bug, or a misconfiguration? Understanding the root cause ensures that the real problem is fixed, not just the immediate issue.
Sharing Lessons Learned for Continual Improvement
A key part of the intelligence aspect of incident management is sharing lessons learned to improve the organisation’s overall capability and reduce the likelihood of incidents repeating.
1. Internal knowledge sharing: Once lessons have been documented, they should be shared across teams to ensure everyone is aware of the insights gained. This could involve team meetings, reports, or internal documentation. For example, if a particular fix worked well, other teams should know about it so they can apply it in similar situations.
Activity: Incident Management Group Task
Week 5 | T&L Activities: Learning Aims and Objectives:
P3.5 The requirements for external reporting of faults and problem resolution:
To comply with relevant legislation, regulations and external standards (for example report to the Information Commissioner’s Office (ICO))
What is the ICO and what is their remit?
Freedom of Information Act
Data Protection Act
Using the following link, reflect on and review the case studies that the ICO has provided: ICO Case Studies
To notify customers and end users
Consider the impact on companies and their reputation of:
o failures of components/systems
o data breaches
o data loss