week 8

3.8 Data systems

3.8.1 Know the definition of data wrangling and understand its purpose and when it is used.

3.8.2 Know and understand the purpose of each step of data wrangling:

As explained in the previous YouTube video, data wrangling is the structured process of transforming raw, messy, and inconsistent data into a form that is reliable, accurate, and ready for analysis. Raw data often contains errors, missing values, and formatting problems that make it difficult to use. The purpose of data wrangling is to take this unrefined information and prepare it so that organisations can confidently draw conclusions, make decisions, or run automated systems. Each step (structuring, cleaning, validating, enriching, and outputting) reduces risk, improves quality, and ensures data is suitable for its end use, whether that is reporting, machine learning, cyber-security analysis, or a business decision. Below, we look at these steps in further detail.

Structure

Structuring data is the process of organising raw information into a consistent and logical format (such as tables, fields, rows, and columns). It involves identifying the key attributes, deciding how information should be stored, and creating a shape that supports future tasks like searching, sorting, or linking datasets. Good structuring ensures the data is readable and usable across systems.

Case Study (Structure)

A college collects feedback from students in emails, text messages, and handwritten forms. Before analysing trends in satisfaction, the IT support team extracts the responses and places them into a single structured spreadsheet with columns for Student ID, Course, Rating, Comments, and Date.
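
A minimal sketch of this structuring step in Python, assuming the responses have already been extracted as plain records; the sample data and the use of the pandas library are illustrative assumptions, not part of the case study:

# Structuring: place every response into one consistent table with named columns.
# The sample records below are invented for illustration.
import pandas as pd

raw_responses = [
    {"Student ID": "S1001", "Course": "Digital Support", "Rating": 4,
     "Comments": "Helpful tutors", "Date": "12/09/2025"},
    {"Student ID": "S1002", "Course": "Cyber Security", "Rating": 5,
     "Comments": "Great practical labs", "Date": "13/09/2025"},
]

feedback = pd.DataFrame(
    raw_responses,
    columns=["Student ID", "Course", "Rating", "Comments", "Date"])
print(feedback)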

Clean

Cleaning involves removing errors, inconsistencies, and unwanted data. This may include correcting spelling mistakes, removing duplicates, fixing formatting issues, filling or removing missing values, and ensuring all data uses the same units (e.g., all dates in DD/MM/YYYY). Cleaning is essential for accuracy, especially when decisions rely on precise information.

Case Study (Clean)

A cyber-security team logs device sign-ins from staff laptops. Some records contain blank usernames, others list the same person twice with slightly different spellings (“J. Smith” vs “John Smith”). Cleaning ensures all entries are standardised so unusual login behaviour can be accurately monitored.
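
A minimal sketch of the cleaning described above, assuming the sign-in records are already in a table; the sample values and the name-standardisation mapping are assumptions for illustration:

# Cleaning: remove blanks and duplicates and standardise name variants.
import pandas as pd

logins = pd.DataFrame({
    "username": ["J. Smith", "John Smith", "", "A. Khan"],
    "device":   ["LT-014",   "LT-014",     "LT-022", "LT-031"],
})

logins["username"] = logins["username"].str.strip()
logins = logins[logins["username"] != ""]                                     # remove blank usernames
logins["username"] = logins["username"].replace({"J. Smith": "John Smith"})   # standardise spelling
logins = logins.drop_duplicates()                                             # remove duplicate rows
print(logins)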

Validate

Validation checks whether the data meets required rules (e.g., numbers within an acceptable range, postcodes in the right format, no negative ages, email addresses containing “@”). This step ensures that the data is logical, realistic, and trustworthy. Validation prevents bad data from entering business systems and causing incorrect results.

Case Study (Validate)

An IT helpdesk collects data on incidents reported by staff. A validation rule prevents users from submitting an incident with a “Resolution Time” longer than 365 days. When an incorrect value (e.g., 8,000 days) appears, the system rejects it and asks for correction.
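
A minimal sketch of that validation rule in Python; the function name and the sample values are assumptions used only to illustrate the check:

# Validation: reject any resolution time outside an acceptable range.
def valid_resolution_time(days: int) -> bool:
    return 0 <= days <= 365

for value in (14, 8000):
    if valid_resolution_time(value):
        print(f"{value} days accepted")
    else:
        print(f"{value} days rejected - please correct the entry")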

Enrich

Enrichment adds additional useful information to a dataset by combining it with external or related data. This might include adding geographical data, linking customer records with purchase histories, or attaching risk levels to cyber-security events. Enrichment makes the dataset more meaningful and improves the quality of insights.

Case Study (Enrich)

A retail company collects customer purchase data. To better understand buying habits, they enrich the dataset by attaching customers’ regional information based on postcode. This helps the business identify which areas buy which products most often, supporting marketing and stock planning.
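
A minimal sketch of this enrichment step, assuming the region can be looked up from the start of the postcode; the lookup table and sample purchases are invented for illustration:

# Enrichment: attach a region to each purchase by joining on postcode area.
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["C001", "C002"],
    "postcode": ["ME4 6AB", "CT1 2EH"],
    "product":  ["Laptop bag", "USB hub"],
})
regions = pd.DataFrame({
    "postcode_area": ["ME", "CT"],
    "region":        ["Medway", "Canterbury"],
})

purchases["postcode_area"] = purchases["postcode"].str[:2]
enriched = purchases.merge(regions, on="postcode_area", how="left")
print(enriched)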

Output

The output stage provides the final, cleaned, structured, and enriched dataset in the correct format for its intended use. This might be a CSV file, a dashboard, a report, a visualisation, or a database entry. The purpose is to deliver the dataset in a form that other systems, analysts, or decision-makers can immediately use.

Case Study (Output)

A college attendance system compiles daily attendance records into a cleaned table. The final output is exported as a CSV file for the safeguarding team, who upload it into a monitoring system that highlights students at risk of persistent absence.
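
A minimal sketch of the output step, assuming the cleaned attendance table is held in pandas; the file name is an assumption:

# Output: export the final table as a CSV file for the safeguarding team.
import pandas as pd

attendance = pd.DataFrame({
    "Student ID": ["S1001", "S1002"],
    "Date":       ["12/09/2025", "12/09/2025"],
    "Present":    [True, False],
})
attendance.to_csv("attendance_report.csv", index=False)
print("Report written to attendance_report.csv")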

 

Data Wrangling Challenge - “Fix the Dataset”

Scenario:

You have been given a small sample of messy data from a fictional college’s student contact system. Your task is to act as a data specialist and apply the five stages of data wrangling to make the dataset usable.

Instructions
1. Download or create a simple table
(8-10 rows) containing errors such as missing values, inconsistent dates, duplicated names, different phone number formats, and incomplete postcodes.

2. Structure the data
Reorganise the information into clear columns (e.g., Name, Student ID, Phone Number, Email, Postcode).

3. Clean the data
Fix spellings, remove duplicates, correct formats, and fill in values where possible.

4. Validate the data
Apply at least three validation checks (e.g., all emails contain “@”, postcodes use UK format, phone numbers have 11 digits).

5. Enrich the data
Add one new column using an external lookup (e.g., Region based on postcode, or Age Group based on date of birth).

6. Output your final dataset
Export the cleaned, validated, enriched version as a CSV or screenshot.

7. Reflect (3–4 sentences)
Explain the issues you identified and how each wrangling step improved the data.

Expected outcome:
A neat, accurate, structured dataset and a short written reflection demonstrating your understanding of the data wrangling process.

Messy Dataset csv

 

3.8.3 Know and understand the purpose of each core function of a data system:

A data system is more than just a data store; it is a set of interconnected services and functions that allow data to be entered, retrieved, persisted, combined, organised, presented, and improved over time. These core functions ensure that data becomes a useful asset rather than a liability. Without each part working well, data could be missing, inconsistent, inaccessible, or unreliable. For example, having large amounts of data doesn’t help if you can’t search it; if you don’t integrate it with other sources, insights are limited; if you don’t produce output, stakeholders can’t act; and if you don’t build a feedback loop, the system can’t improve.

 

Input

Purpose & Discussion:
The input function is where raw data enters the system. This might be manual user entry, sensors, imports from external systems, or file uploads. The quality, completeness and accuracy of data at the input stage determine how reliable everything else will be. If the input is poorly handled (e.g., wrong format, missing values, incorrect units), the rest of the system will struggle. Good input mechanisms include validation at entry, standardised formats, and controlled sources.

Why it contributes to the system:

  • It ensures data enters in a usable state rather than chaos.

  • It sets the foundation for everything downstream: search, save, integrate, etc.

  • It helps minimise garbage-in problems.

  • It can enforce business rules at the earliest point (e.g., required fields, format checks).

Simple case study:

A college's student-registration portal allows students to input their details. If the "Date of Birth" is entered incorrectly (e.g., using US format instead of UK), the system might mis-calculate age. By enforcing the format at input, the data system prevents many downstream errors.
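
A minimal sketch of enforcing the UK date format at the input stage, using Python's standard datetime module; the function name and sample values are assumptions for illustration:

# Input: accept a date of birth only if it matches DD/MM/YYYY.
from datetime import datetime

def parse_uk_date(text: str):
    try:
        return datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        return None        # rejected - ask the user to re-enter

print(parse_uk_date("03/07/2007"))   # accepted
print(parse_uk_date("2007-07-03"))   # rejected (None)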

 

Search

Purpose & Discussion:
The search function allows users or systems to find relevant data within the system. It may involve indexing, full-text search, filters, queries, or metadata lookup. Effective search is critical for turning data into actionable information. Without it, data remains trapped and unusable.

Why it contributes to the system:

  • It makes data accessible and useful to end-users.

  • It supports efficiency: finding specific records, statistics or patterns.

  • It supports decision-making, when analysts can retrieve the right data quickly.

  • It supports other functions (integration, output) by locating relevant records.

Simple case study:
In a customer-services system, a support agent uses the search function to retrieve all tickets logged by a particular user ID. Without good search capability, the agent might miss relevant historical tickets or duplicates, hindering resolution.
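
A minimal sketch of this search in Python; the ticket records and the user ID are invented for illustration:

# Search: retrieve every ticket logged by a particular user ID.
tickets = [
    {"ticket_id": 101, "user_id": "U55", "summary": "Password reset"},
    {"ticket_id": 102, "user_id": "U12", "summary": "VPN not connecting"},
    {"ticket_id": 103, "user_id": "U55", "summary": "Follow-up to ticket 101"},
]

matches = [t for t in tickets if t["user_id"] == "U55"]
for t in matches:
    print(t["ticket_id"], t["summary"])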

Save

Purpose & Discussion:
The save function is about persisting data so it's stored reliably for future use. This involves databases, file systems, cloud storage, backups, and versioning. Its purpose is to ensure data is kept securely, can be retrieved later, and remains intact (integrity) over time.

Why it contributes to the system:

  • It ensures data remains available and durable.

  • It supports data continuity (not lost when system restarts).

  • It enables historical tracking and audit capabilities.

  • It provides the platform for other functions (search, integrate) to operate on stored data.

Simple case study:
An IT department logs network access events and saves them into a long-term log database. This ensures the data is available for compliance, audits, and forensic investigations. If events weren’t saved reliably, there might be gaps.
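
A minimal sketch of the save function using SQLite; the database file, table name and columns are assumptions used only for illustration:

# Save: persist each access event so it can be retrieved for audits later.
import sqlite3

conn = sqlite3.connect("access_log.db")
conn.execute("CREATE TABLE IF NOT EXISTS access_events "
             "(timestamp TEXT, username TEXT, device TEXT)")
conn.execute("INSERT INTO access_events VALUES (?, ?, ?)",
             ("2025-11-28T09:15:00", "jsmith", "LT-014"))
conn.commit()    # the event is now stored durably
conn.close()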

Integrate

Purpose & Discussion:
The integrate function brings together data from multiple sources into a coherent whole. This might be combining databases, linking records across systems, performing ETL (Extract, Transform, Load), or joining internal with external datasets. Integration is vital for richer insight; single, siloed datasets often tell only part of the story.

Why it contributes to the system:

  • It enables holistic views of data (e.g., combining sales + support + marketing).

  • It supports data enrichment (linking related data).

  • It reduces duplication, overlaps and inconsistencies across systems.

  • It supports analytics and reporting at a more advanced level.

Simple case study:
A retail organisation merges customer purchase history (from e-commerce database) with support ticket data (from CRM system). By integrating these, the business can see which customers file frequent tickets and correlate this with purchase value, enabling better customer segmentation.
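
A minimal sketch of that integration, assuming both sources can be joined on a shared customer ID; the sample figures are invented:

# Integrate: combine purchase totals with support-ticket volumes per customer.
import pandas as pd

purchases = pd.DataFrame({"customer_id": ["C1", "C2"], "total_spend": [1200, 80]})
tickets   = pd.DataFrame({"customer_id": ["C1", "C1", "C2"], "ticket_id": [9001, 9002, 9003]})

ticket_counts = (tickets.groupby("customer_id").size()
                 .rename("ticket_count").reset_index())
combined = purchases.merge(ticket_counts, on="customer_id", how="left")
print(combined)    # spend and ticket volume side by side for segmentation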

Organise (index)

Purpose & Discussion:
The organise function (often called indexing, categorising or structuring) arranges data so it becomes manageable and efficient. This includes creating metadata, indexes, hierarchies, taxonomies, and classification schemes. Well-organised data is easier to search, retrieve, manage and maintain.

Why it contributes to the system:

  • It improves performance of search and retrieval (via indexing).

  • It ensures consistency (categories, taxonomies).

  • It supports governance and data quality (knowing what each data item means).

  • It makes maintenance (archiving, purging) simpler.

Simple case study:
A library system indexes each book by author, title, subject and ISBN. Without this indexing structure, finding books would be slow or require scanning all entries; with the index the system can jump directly to relevant records.
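
A minimal sketch of indexing using SQLite; the table design and the sample book are assumptions for illustration:

# Organise (index): an index on ISBN lets the system jump straight to a record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (isbn TEXT, title TEXT, author TEXT, subject TEXT)")
conn.execute("CREATE INDEX idx_books_isbn ON books (isbn)")
conn.execute("INSERT INTO books VALUES ('9780141036144', '1984', 'George Orwell', 'Fiction')")
for (title,) in conn.execute("SELECT title FROM books WHERE isbn = '9780141036144'"):
    print(title)
conn.close()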

Output

Purpose & Discussion:
The output function presents the processed data in a usable form. This could be reports, dashboards, exported files, visualisations, or data feeds to other systems. The goal is to deliver information that stakeholders or other systems can use to act.

Why it contributes to the system:

  • It turns data into actionable insights.

  • It supports decision-making and reporting to management.

  • It enables sharing of data and findings.

  • It completes the "value chain" of data: from raw to actionable.

Simple case study:
A marketing team receives a dashboard showing monthly website visits, campaign performance and conversion rates. The dashboard (output) is produced from integrated, organised, stored data and helps managers decide budgets and tactics.

Feedback loop

Purpose & Discussion:
The feedback loop is the mechanism by which the system receives input about its performance or accuracy, and uses that to improve. For a data system this might mean logging errors, monitoring usage, feeding back corrections, re-training models, updating rules, or purging outdated data. It ensures the system adapts and remains fit for purpose.

Why it contributes to the system:

  • It supports continuous improvement of data quality and processes.

  • It helps detect errors or outdated information and correct them.

  • It enables the system to evolve with changing requirements.

  • It closes the loop between output (what users see) and input/integration (how data is collected or structured).

Simple case study:
After delivering a monthly sales report, the business notices that many entries were mis-labelled. A feedback loop is established where incorrect entries are flagged, the input form is modified to prevent the error, and future reports improve in accuracy and reliability.

"Loan Laptops"
You work for a college’s Digital Support Services team. The college has just launched a new system to track student equipment loans (laptops, tablets). Your job is to map out and apply the core functions of the data system.

Instructions:
1. Input (3 minutes):
Write down what data you would collect when a student borrows a piece of equipment (e.g., Student ID, Equipment ID, Date borrowed, Condition).

2. Search (2 minutes):
Describe how you (or a support staff) might search the system to find all equipment currently out on loan or all loans for a particular student.

3. Save (2 minutes):
State where and how the data will be saved (which database/table, what backups or archiving might be needed).

4. Integrate (3 minutes):
Think of one other system or dataset you could integrate with the equipment-loan data (e.g., student enrolment database, maintenance records). Write what benefit that integration gives.

5. Organise (Index) (3 minutes):
Decide how you would organise and index the loan data so it can be retrieved efficiently (e.g., index by Equipment ID, Student ID, Borrow Date).

6. Output (3 minutes):
Describe one output you would produce for managers (e.g., monthly report of overdue loans, dashboard of equipment utilisation).

7. Feedback Loop (2 minutes):
Explain how you would put a feedback loop in place (e.g., logging errors when items are returned late, updating input form to require condition field, review of lost-item trends).

Submission:
Write your answers in a short document or on a worksheet. Be ready to share which part you found most challenging and why.

3.8.4 Know the types of data entry errors and understand how and why they occur:

Data entry errors are mistakes that occur when information is manually or digitally entered into a system. These errors often seem small (perhaps a mistyped letter or a number in the wrong order), but they can cause major issues for organisations. These problems arise because data systems rely on accuracy: one incorrect value can affect calculations, decision-making, records, reports, and even customer service.

Errors usually happen because humans get tired, distracted, rushed, or misunderstand what data they are supposed to input. Poor forms, unclear labels, or complex interfaces also increase the chance of mistakes. Reflecting on these errors helps us understand how to design better systems, improve training, and reduce risks.

Transcription Errors

A transcription error happens when data is recorded incorrectly during the process of copying, typing, or transferring information from one place to another. This might include copying from a paper form to a database, listening to someone say information, or transferring data between systems.

Why transcription errors occur

  • Fatigue or distraction

  • Poor handwriting on original documents

  • Rushed data entry

  • Mishearing information

  • Misreading similar characters (e.g., 0 vs O, 1 vs I)

Examples of transcription errors

  • Typing “Baker Street” as “Baket Street”

  • Entering £530.00 as £350.00

  • Recording a phone number as 07982 613447 instead of 07982 613441

  • Copying a passport number incorrectly because the handwriting was unclear

In practice

A college administrator enters student enrolment details into the management system. The student’s surname is “Harrington”, but the administrator types “Harington”. This leads to the student’s emails and login details failing to generate correctly, causing delays.

Transposition Errors

A transposition error happens when the correct characters or numbers are used, but placed in the wrong order. This type of error is especially common when dealing with long numbers or codes.

Why transposition errors occur

  • Typing too quickly

  • Mis-hitting keys on a keyboard

  • Losing place in a long numerical sequence

  • Visual fatigue when reading long codes or IDs

Examples of transposition errors

  • Entering 81 instead of 18

  • Typing £1,294 as £1,249

  • Recording a product code A473B as A437B

  • Entering a date as 12/03/2025 instead of 13/02/2025

In practice

A library assistant enters book barcodes into the system for inventory. The correct barcode is 496721, but it is entered as 469721. This causes the wrong book to appear as “missing” in the system.

 

Case Study: Hospital Appointment System Failure

A hospital experienced significant problems due to transcription and transposition errors in its patient appointment system.

Administrative staff manually entered patient NHS numbers, appointment times, and treatment codes.
Several transposition errors occurred where NHS numbers were typed in the wrong order.
Some transcription errors caused patients’ names and dates of birth to be incorrectly recorded.

Consequences

Patients received incorrect letters, including appointment times for different people.

Medical records were temporarily mismatched, risking incorrect treatment plans.

Appointments were missed, leading to delays in diagnosis and treatment.

The hospital faced complaints and had to conduct a large internal audit, costing money, staff time, and reputational damage.

Reflection
This situation highlights that even small errors made during routine data entry tasks can create large-scale risks and operational failures. Good system design, staff training, and automated validation tools are essential to reduce the frequency and impact of these mistakes.

3.8.5 Know and understand methods to reduce data entry errors:

Data entry errors occur when information is typed, selected or recorded incorrectly within a system, database, spreadsheet, or form. These mistakes can include typing the wrong value, selecting the wrong option, missing data, or even entering information in the wrong field. Although some errors might seem small, they can lead to serious issues such as inaccurate records, poor decision-making, failed transactions, incorrect reporting, or even legal and compliance breaches—especially in professional environments where accuracy is essential.

Reducing data entry errors is important because reliable data underpins every digital system. If the data is wrong from the start, any process or analysis that relies on it will also be unreliable. For example, a misspelled email address can prevent a customer receiving an order confirmation; an incorrect stock number can cause shortages; and errors in medical or financial systems can have serious consequences. Preventing these errors saves time, lowers costs, improves efficiency, and increases trust in the organisation’s data.

To help avoid these mistakes, digital systems use several methods to reduce human error during data entry. These methods provide structure, checks and support for the user, guiding them to enter accurate, complete and appropriate information.

Validation of User Input

Validation is the process of checking data before it is accepted by the system. The system tests whether the input meets specific rules—such as being in the correct format, within a certain range, or containing the right type of characters.

How it works:
When a user enters information, validation rules check it against criteria. If the input does not match, the system gives an error message and asks the user to correct it.

Examples of validation rules:

  • Format check: A postcode must follow a valid UK format (e.g., ME4 6AB).

  • Range check: Age must be between 0 and 120.

  • Presence check: A required field must not be left blank.

  • Length check: A phone number must have the correct number of digits.

  • Data type check: A price must be a number, not text.

Why it reduces errors:
Validation stops incorrect or inappropriate data before it enters the system. It forces users to correct mistakes immediately, which prevents inaccurate data being stored and used later.
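
A minimal sketch of the validation rules listed above expressed as simple Python checks; the field names and the simplified postcode pattern are assumptions for illustration:

# Validation: presence, format, range, length and data type checks on one record.
import re

def validate(record: dict) -> list:
    errors = []
    if not record.get("name"):                                    # presence check
        errors.append("Name is required")
    if not re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",
                        record.get("postcode", ""), re.I):        # format check (UK postcode)
        errors.append("Postcode is not in a valid UK format")
    if not 0 <= record.get("age", -1) <= 120:                     # range check
        errors.append("Age must be between 0 and 120")
    if len(record.get("phone", "")) != 11:                        # length check
        errors.append("Phone number must have 11 digits")
    if not isinstance(record.get("price"), (int, float)):         # data type check
        errors.append("Price must be a number")
    return errors

print(validate({"name": "A. Khan", "postcode": "ME4 6AB",
                "age": 19, "phone": "07982613441", "price": 4.99}))   # no errors -> []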

"Stop the Garbage"

Scenario: You have been hired by a local gardening group to update their database to include validation to remove inaccurate data entries. This is important to them as, without valid information going into the system, they cannot guarantee the correct information being generated.

Steps:
Using the provided Database file below, add the following validation rules
1. Automatically capitalise an expert's surname
2. A rose height can only be between 0.6 and 1.82 

Rose Database
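
If you want to prototype the two rules before building them in the database tool, a hedged Python sketch might look like this (the function names are assumptions; the activity itself should still be completed in the provided database file):

# Rule 1: automatically capitalise an expert's surname.
def normalise_surname(surname: str) -> str:
    return surname.strip().capitalize()

# Rule 2: a rose height can only be between 0.6 and 1.82.
def valid_rose_height(height: float) -> bool:
    return 0.6 <= height <= 1.82

print(normalise_surname("smith"))   # Smith
print(valid_rose_height(2.4))       # False - rejected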

 

Verification of User Input by Double Entry

 

Verification by double entry is a method where the system requires the user to enter the same piece of data twice to confirm accuracy. The system automatically checks whether both entries match.

How it works:
When signing up for an account, a user may need to re-enter their email or password. If the two entries are different, the system highlights the mismatch.

Example:
Typing an email address in two separate boxes—if one is mistyped (e.g., “.co.uk” vs “.couk”), the system detects the difference.

Why it reduces errors:
Double entry verification is especially useful for critical fields where even a small error could cause major issues—such as email addresses, bank account numbers, or passwords. It forces the user to spot and correct errors through repetition.
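
A minimal sketch of double-entry verification for an email field; the function name and sample addresses are assumptions for illustration:

# Verification by double entry: accept the value only when both entries match.
def emails_match(first_entry: str, second_entry: str) -> bool:
    return first_entry.strip().lower() == second_entry.strip().lower()

print(emails_match("jo@example.co.uk", "jo@example.co.uk"))  # True - accepted
print(emails_match("jo@example.co.uk", "jo@example.couk"))   # False - ask the user to re-enter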

Drop-Down Menus

Drop-down menus allow users to select data from a predefined list instead of typing. This method eliminates typing errors and ensures consistent formatting.

How it works:
Instead of typing a location, job role, or product category, the user clicks a menu and selects from available options.

Examples:

  • Selecting a department name from a list: “IT Support”, “HR”, “Finance”.

  • Choosing a delivery method: “Standard”, “Next Day”, “Click & Collect”.

  • Selecting a country or title from a controlled set of values.

Why it reduces errors:
Drop-down menus:

  • Prevent spelling errors (e.g., “Maidstone” vs “Madistone”).

  • Ensure consistency—everyone uses the same wording.

  • Reduce confusion by showing only valid options.

  • Speed up data entry.

They are especially useful in systems where categorisation and consistency are important, such as stock systems or HR databases.
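
A minimal sketch of the idea behind a drop-down expressed in code, assuming a fixed list of departments; any value outside the controlled list simply cannot be submitted:

# Drop-down: only values from the controlled list are ever accepted.
DEPARTMENTS = ("IT Support", "HR", "Finance")

def choose_department(selection: str) -> str:
    if selection not in DEPARTMENTS:
        raise ValueError(f"'{selection}' is not an available option")
    return selection

print(choose_department("IT Support"))   # accepted
# choose_department("IT Suport")         # would raise ValueError - misspellings cannot get through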

Pre-Filled Data Entry Boxes (Auto-Fill / Default Values)

Pre-filled boxes contain information provided by the system automatically, either based on previous records, user accounts, or common default settings. The user only needs to confirm or adjust the information.

How it works:
When filling in a form, fields such as name, date, department, or location may already be completed based on stored profile data or the most common choice.

Examples:

  • A customer’s address auto-fills after entering their postcode.

  • “Today's date” automatically appears on a report form.

  • A device serial number is pre-filled for internal IT support requests.

  • Default values such as “United Kingdom” or “Quantity: 1”.

Why it reduces errors:
Pre-filled information:

  • Minimises the amount of typing required.

  • Ensures frequently used data is always correct.

  • Prevents users from entering invalid or inconsistent information.

  • Speeds up the overall process.

This method is especially helpful when users repeat similar tasks or when the system already holds reliable background data.
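
A minimal sketch of pre-filled defaults, assuming a simple order form; the default values mirror the examples above and the function name is an assumption:

# Pre-filled boxes: the system supplies likely values that the user only confirms or adjusts.
from datetime import date

def new_order_form(country: str = "United Kingdom", quantity: int = 1) -> dict:
    return {"order_date": date.today().isoformat(),   # "today's date" appears automatically
            "country": country,
            "quantity": quantity}

print(new_order_form())             # all defaults accepted
print(new_order_form(quantity=3))   # only the changed value is typed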

 

This activity is designed to help you practise reducing data entry errors using common techniques found in real digital systems. You will work with a spreadsheet that includes validation rules, double-entry verification, drop-down menus and pre-filled fields. These features are used in workplaces such as IT support, HR, finance, retail systems and online forms.

By completing the form correctly, you will learn how each method helps prevent mistakes, improves accuracy and makes data more reliable. You’ll also reflect on which types of errors you personally found easier or harder to avoid, helping you understand the importance of entering accurate data in any digital role.

The aim is to give you hands-on experience with the same tools professionals rely on every day to ensure data is correct, consistent and fit for purpose.
Download file
Download file
 

 

3.8.6 Know and understand the factors that impact implementation of data entry:

Time needed to create the screens

The amount of time required to design and build data-entry screens directly affects how quickly the organisation can implement the process.
If screens are complex, contain many validation rules, or need to be accessible on different devices, the development time increases. Longer development time can delay the rollout of the system and slow down the wider project. In T-Level examples, changes to digital systems often require careful planning and adaptation time, which impacts how quickly solutions can be deployed.
Efficient screen creation is important because delays can reduce productivity and increase costs before the new process is even used.

Expertise needed to create the screens

The skills required to build data-entry screens also influence implementation.
Screen creation may require knowledge of UX design, database structure, validation rules, and possibly programming or form-building tools. If the organisation does not have staff with the right expertise, the process slows down or becomes more expensive because training or external support is needed.
This mirrors exam content where learners must consider how technical skills and staff capability affect the successful adoption of digital systems and processes.
A lack of expertise can also result in poorly designed interfaces, which increases user errors and reduces efficiency.

Time needed to enter the data

The amount of time required for staff to input data affects day-to-day productivity.
If forms contain too many fields, require manual typing, or load slowly, the time to enter each record increases. This can reduce staff efficiency and can increase the likelihood of errors because users may rush or become fatigued. This is consistent with how the mark schemes explain links between user actions, accuracy, and the need for error-prevention methods (validation, verification, drop-downs, pre-filled fields).
Reducing input time through better screen design, automation or pre-filled fields can therefore improve accuracy and speed.

 

Explain two factors that impact the implementation of data entry screens within a digital system. Your answer should refer to two of the following:

 - time needed to create the screens
 - expertise needed to create the screens
 - time needed to enter the data

(4 marks)

 

Improving Data Entry Efficiency

Scenario:
You have been asked by a digital support technician to evaluate the data entry process used in a small organisation. The current process uses slow, manually typed forms that were built several years ago by staff with limited design experience.

Your Task:

A. Evaluation Task (10 minutes)

Review the three factors below:
Time needed to create the screens
Expertise needed to create the screens
Time needed to enter the data

For each factor, write:
What the issue is in the scenario
How this issue affects staff, accuracy, or efficiency
One improvement you would suggest (e.g., validation, simpler layout, automation, drop-downs)

B. Quick Reflection (5–10 minutes)

Answer the following:
Which factor has the biggest impact on data quality in the scenario, and why?
Which factor could be fixed most quickly?
Which factor needs long-term planning?

 

3.8.7 Understand the relationship between factors that impact data entry and data quality and make judgements about the suitability of methods to reduce data entry errors in digital support and security.

High-quality data underpins secure, reliable and compliant digital-support operations. The accuracy of data entered into a system directly affects its usefulness for troubleshooting, monitoring, reporting, cyber-security processes, and compliance with legal requirements such as GDPR. Several factors influence the effectiveness and accuracy of data entry, and these have a direct relationship with the quality of the resulting dataset.

1. Time Required to Enter Data - Impacts Accuracy and Completeness

When data entry is rushed (e.g., technicians under time pressure logging incidents during busy shifts), errors become more frequent. These may include:

  • Transcription errors – incorrect characters entered

  • Transposition errors – swapping of characters (seen in Paper 1 Q2b: Transposition)

  • Omitted or incomplete fields

Lower time investment → Higher error probability → Reduced quality.

2. Expertise and Skill Level of the User - Impacts Validity and Consistency

Users with low digital literacy may misinterpret prompts, choose incorrect items from dropdown menus, or misunderstand technical terminology.
This leads to:

  • Invalid data types

  • Incorrect categorical selections

  • Misformatted entries (dates, IP addresses, device names)

The mark scheme in Paper 1 (Question 2c) highlights how validation and verification methods exist specifically to mitigate errors from user inaccuracy or misunderstanding.

3. Complexity of the Data Entry Screens - Impacts Usability and Error Rates

Screens that are cluttered, poorly labelled or require multi-stage navigation increase the cognitive load on the user, leading to more errors.
Complex screens may:

  • Confuse users

  • Encourage guess-work

  • Slow down workflows

  • Increase the likelihood of wrong selections in drop-down boxes

4. Environmental and Organisational Factors - Impacts Reliability

In digital support and security, data entry often happens during:

  • Incident response

  • Customer interaction

  • Fault logging and system monitoring

If technicians work in stressful or noisy environments, quality falls due to distraction and pressure.

Judgement: Suitability of Methods to Reduce Data Entry Errors

The Pearson mark schemes expect reasoning that includes a method + linked justification (e.g., Paper 1, Q2c and Paper 2 Q3). The following assessment applies this structure.

1. Validation (Highly Suitable)

Why: Ensures data follows the correct rules (e.g., integers only, valid date format).
Impact on quality: Prevents incorrect data types, ensuring datasets remain usable for analytics, monitoring and security auditing.
Suitability: Very suitable – works reliably for structured, predictable data.

2. Verification by Double Entry (Moderately Suitable)

Why: Requires the user to type the same value twice, checking for mismatch.
Impact: Reduces typographical errors but doubles input time.
Suitability: Suitable for critical fields (e.g., device IDs, account usernames) but unsuitable for large volumes of data due to time burden.

3. Drop-Down Menus (Highly Suitable)

Why: Limits choices to valid options, making many errors impossible.
Impact: Improves consistency and eliminates spelling/format errors.
Suitability: Very suitable, especially for categorised or structured data (incident type, device model, location).

4. Pre-Filled Boxes / Auto-Completion (Suitable in Repetitive Contexts)

Why: Reduces typing and speeds up workflows where common answers dominate.
Impact: Minimises human error but may introduce “confirmation bias” if users accept default values incorrectly.
Suitability: Good, but requires careful design to avoid complacency.

5. User Training (Essential but Variable)

Why: Improves understanding of data requirements and system UX.
Impact: Raises overall accuracy but is dependent on user motivation and retention.
Suitability: Essential baseline measure, though not error-proof.

6. Interface Simplification / UX Redesign (Highly Suitable)

Why: Reduces cognitive load, guiding users toward correct behaviour.
Impact: Can significantly reduce error rates and increase speed.
Suitability: Very suitable, especially when combined with validation.

 

Case Study: Data Quality Risks in a Security Support Environment

Context:
A Digital Support Technician at "SecurePoint IT Services" logs all security-related incidents into a centralised security ticketing platform. The organisation has recently experienced rising error rates in incident logs, such as:

  • Incorrect device IDs

  • Wrong IP addresses

  • Incorrect categorisation of threat types

  • Missing timestamps

Events:

  1. A technician rushes through logging suspicious traffic alerts during a DDoS mitigation event.

    • Time pressure leads to an incorrect IP range being entered.

    • As a result, the security team blocks the wrong subnet, disrupting internal VoIP traffic.

  2. Less-experienced staff misinterpret the "Threat Vector" dropdown and repeatedly select unrelated categories.

    • The security analytics dashboard becomes unreliable.

    • Weekly reports misrepresent the type of attacks being faced.

  3. Complex multi-screen navigation causes technicians to skip mandatory fields.

    • Incident logs are incomplete, limiting root-cause analysis.

Solutions Implemented:

  • Dropdown menus for threat types and device categories.

  • Validation rules enforcing correct IP formats.

  • Revised screen layout with clearer grouping and progressive disclosure.

  • Short, targeted refresher training.

  • Reasonable time allocations for post-incident logging.

Outcome:
Error frequency drops by 62%. Incident analysis becomes more reliable, and the SOC (Security Operations Centre) prevents two near-miss escalation events caused by earlier poor-quality data.

 

“How Data Entry Factors Influence Data Quality in Digital Support & Security”

Activity:

Students work in pairs to produce a 5–7 minute presentation explaining the relationship between data entry factors and data quality, including a justified evaluation of methods used to reduce data entry errors.

Areas You Must Cover (Mirrors Exam Expectations)
Explanation of factors affecting data entry:

 - Time
- User expertise
- Screen/interface complexity
- Environmental influences

Analysis of how these factors affect data quality:
- Accuracy
- Validity
- Completeness
- Consistency
- Reliability

Judgement of methods to reduce data entry errors:
- Validation
- Verification
- Drop-down menus
- Pre-filled/auto-complete
- UX design
- Training

Use linked justifications
Reflecting the “one mark for explanation + one mark for justification” style in Pearson papers (e.g., Paper 1 Question 2c).

Apply content to a realistic digital support/security scenario
e.g., SOC incident logging, IT helpdesk ticketing, compliance reporting.

Final judgement/conclusion
- Which methods are most effective?
- How do they improve data quality in a security-focused environment?

Success Criteria (Teacher-Facing but Student-Friendly)
- Clear explanation of relationships between factors and data quality
- Accurate use of terminology (validation, verification, accuracy, reliability, transposition etc.)
- Evidence-based evaluation
- Realistic security-linked examples
- Professional, structured presentation delivery

 

3.8.8 Understand the relationship between factors that impact implementation of data entry and make judgements about the suitability of implementing data entry in digital support and security.

 

