# data_preparation

Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.

## Links

- Participant consent form: [Consent document](https://drive.google.com/file/d/1g1Hc764ffljoKrjApD6nmWDCXJGYTR0j/view?usp=drive_link)
- Dataset (staff access): [Dataset folder](https://drive.google.com/drive/folders/1fwACM6i6uVGFkTlJKSlqVhizzgrHl_gY?usp=sharing)

## Data collection protocol

Nine team members each recorded 5-10 minute webcam sessions using a purpose-built tool (`models/collect_features.py`). During recording:

- Participants simulated **focused** behaviour (reading, typing) and **unfocused** behaviour (looking at phone, turning away)
- Binary labels were annotated in real time via key presses
- Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
- Real-time quality guidance warned if class balance fell outside 30-70% or if fewer than 10 state transitions occurred
- An automated post-collection quality report validated minimum duration (120 s), sample count (3,000+ frames), class balance, and transition frequency

All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vectors and binary labels are saved.

The raw participant dataset is excluded from this repository (coursework policy and privacy constraints). It is shared separately via the dataset link above.

## Dataset summary

| Metric | Value |
|--------|-------|
| Participants | 9 |
| Total frames | 144,793 |
| Class balance | 61.5% focused / 38.5% unfocused |
| Features extracted | 17 per frame |
| Features selected | 10 (used by ML models) |

## Data format

Training data lives under `data/collected_/` as `.npz` files.
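The post-collection quality checks described above can be sketched as follows. This is an illustrative sketch only: the function name `quality_report` and its return structure are hypothetical, and the real implementation lives in the collection tooling; the thresholds (120 s, 3,000 frames, 30-70% balance, 10 transitions) are taken from the protocol above.

```python
import numpy as np

# Illustrative sketch of the post-collection quality report; names and
# structure are hypothetical, thresholds match the protocol description.
def quality_report(labels: np.ndarray, duration_s: float) -> dict:
    n = len(labels)
    focused_frac = labels.mean() if n else 0.0
    transitions = int(np.count_nonzero(np.diff(labels)))  # 0<->1 switches
    return {
        "duration_ok": duration_s >= 120,             # minimum session length
        "samples_ok": n >= 3000,                      # minimum frame count
        "balance_ok": 0.30 <= focused_frac <= 0.70,   # class balance 30-70%
        "transitions_ok": transitions >= 10,          # focus/unfocus switches
    }
```

A session passes only if every check is true, which is what the real-time guidance nudges participants toward during recording.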
Each file contains:

| Key | Shape | Description |
|-----|-------|-------------|
| `features` | (N, 17) | Float array of extracted features |
| `labels` | (N,) | Binary: 0 = unfocused, 1 = focused |
| `feature_names` | (17,) | String names matching `FEATURE_NAMES` in `collect_features.py` |

Data files are not included in this repository due to privacy considerations.

## Files

| File | Purpose |
|------|---------|
| `prepare_dataset.py` | Core data pipeline: loads `.npz`, applies feature selection, stratified splits, StandardScaler on train only |
| `data_exploration.ipynb` | Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps |

## Feature selection

`SELECTED_FEATURES["face_orientation"]` defines the 10 features used by all ML models:

- **Head pose (3):** `head_deviation`, `s_face`, `pitch`
- **Eye state (4):** `ear_left`, `ear_right`, `ear_avg`, `perclos`
- **Gaze (3):** `h_gaze`, `gaze_offset`, `s_eye`

Excluded: `v_gaze` (noisy), `mar` (1.7% trigger rate), `yaw`/`roll` (redundant with `head_deviation`/`s_face`), `blink_rate`/`closure_duration`/`yawn_duration` (temporal overlap with `perclos`).
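A minimal sketch of reading one session file and reducing the 17 stored features to the 10 selected ones by name. The helper `load_session` is hypothetical (the real pipeline is `prepare_dataset.py`); the `.npz` keys and the selection list match the tables above.

```python
import numpy as np

# The 10 selected features, per SELECTED_FEATURES["face_orientation"]
SELECTED = ["head_deviation", "s_face", "pitch",
            "ear_left", "ear_right", "ear_avg", "perclos",
            "h_gaze", "gaze_offset", "s_eye"]

def load_session(path):
    """Load one .npz session and keep only the selected feature columns.

    Hypothetical helper; mirrors the documented keys: features (N, 17),
    labels (N,), feature_names (17,).
    """
    with np.load(path) as f:
        features = f["features"]          # (N, 17)
        labels = f["labels"]              # (N,)
        names = list(f["feature_names"])  # 17 strings
    idx = [names.index(name) for name in SELECTED]  # map names -> columns
    return features[:, idx], labels
```

Selecting columns by name rather than by fixed index keeps the pipeline robust if the feature order in `collect_features.py` ever changes.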
Selection was validated by XGBoost gain importance and LOPO channel ablation:

| Channel subset | Mean LOPO F1 |
|----------------|--------------|
| All 10 features | 0.829 |
| Eye state only | 0.807 |
| Head pose only | 0.748 |
| Gaze only | 0.726 |

## Key functions

| Function | What it does |
|----------|--------------|
| `load_all_pooled(model_name)` | Concatenates all participant data into one array |
| `load_per_person(model_name)` | Returns `{person: (X, y)}` dict for LOPO cross-validation |
| `get_numpy_splits(model_name)` | Returns scaled train/val/test numpy arrays (70/15/15 split) |
| `get_dataloaders(model_name)` | Returns PyTorch DataLoaders for MLP training |
| `get_default_split_config()` | Returns split ratios and seed from `config/default.yaml` |

## Data cleaning

Applied before splitting (in `ui/pipeline.py` at inference time, in `prepare_dataset.py` for training):

1. Angles clipped to physiological ranges (yaw ±45°, pitch/roll ±30°)
2. `head_deviation` recomputed from the clipped angles (rather than clipped after computation)
3. EAR clipped to [0, 0.85], MAR to [0, 1.0]
4. Physiological bounds applied to `gaze_offset`, PERCLOS, blink rate, and closure/yawn durations
5. StandardScaler fit on the training split only, then applied to val/test
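The cleaning and splitting steps above can be sketched end to end. `clean` applies the clipping rules from steps 1 and 3 (the `head_deviation` recomputation of step 2 and the bounds of step 4 are omitted because their exact formulas aren't stated here), and `split_and_scale` performs a stratified 70/15/15 split with train-only standardisation, the NumPy equivalent of fitting StandardScaler on the training split. Both function names and the `cols` mapping are illustrative; the real implementation is `get_numpy_splits` in `prepare_dataset.py`.

```python
import numpy as np

def clean(X, cols):
    """Clip raw features to physiological ranges (steps 1 and 3).

    `cols` is an illustrative mapping from feature name to column index.
    """
    X = X.copy()
    X[:, cols["yaw"]] = np.clip(X[:, cols["yaw"]], -45, 45)
    for a in ("pitch", "roll"):
        X[:, cols[a]] = np.clip(X[:, cols[a]], -30, 30)
    for e in ("ear_left", "ear_right", "ear_avg"):
        X[:, cols[e]] = np.clip(X[:, cols[e]], 0, 0.85)
    X[:, cols["mar"]] = np.clip(X[:, cols["mar"]], 0, 1.0)
    return X

def split_and_scale(X, y, seed=42):
    """Stratified 70/15/15 split, then standardise with train statistics only."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(y):  # stratify: split each class separately
        idx = rng.permutation(np.flatnonzero(y == c))
        a, b = int(0.70 * len(idx)), int(0.85 * len(idx))
        train.append(idx[:a]); val.append(idx[a:b]); test.append(idx[b:])
    tr, va, te = (np.concatenate(p) for p in (train, val, test))
    mu = X[tr].mean(axis=0)            # fit on train only,
    sd = X[tr].std(axis=0) + 1e-8      # never on val/test
    scale = lambda s: ((X[s] - mu) / sd, y[s])
    return scale(tr), scale(va), scale(te)
```

Fitting the scaler on the training split alone is what prevents statistics from the val/test sets leaking into training.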