Abdelrahman Almatrooshi

data_preparation

Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.

Links

Data collection protocol

Nine team members each recorded 5-10 minute webcam sessions using a purpose-built tool (models/collect_features.py). During recording:

  • Participants simulated focused behaviour (reading, typing) and unfocused behaviour (looking at phone, turning away)
  • Binary labels were annotated in real time via key presses
  • Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
  • Real-time quality guidance warned if class balance fell outside 30-70% or if fewer than 10 state transitions occurred
  • An automated post-collection quality report validated minimum duration (120s), sample count (3,000+ frames), balance, and transition frequency
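
The post-collection checks above can be sketched as a small validator. This is a minimal sketch: the function name `quality_report` and the `fps=25` assumption are illustrative; the thresholds mirror the protocol described above.

```python
import numpy as np

def quality_report(labels, fps=25, min_duration_s=120, min_frames=3000,
                   balance_range=(0.30, 0.70), min_transitions=10):
    """Post-session quality checks (sketch; fps and function name are assumptions).

    labels: (N,) binary array of per-frame annotations, 1 = focused.
    """
    labels = np.asarray(labels)
    duration_s = len(labels) / fps
    focused_frac = labels.mean()                            # fraction of focused frames
    transitions = int(np.sum(labels[1:] != labels[:-1]))    # focused<->unfocused switches
    return {
        "duration_ok": duration_s >= min_duration_s,
        "frames_ok": len(labels) >= min_frames,
        "balance_ok": balance_range[0] <= focused_frac <= balance_range[1],
        "transitions_ok": transitions >= min_transitions,
    }
```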

All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vector and binary labels are saved.

The raw participant dataset is excluded from this repository (coursework policy and privacy constraints); it is shared separately via the dataset link above.

Dataset summary

Metric              Value
Participants        9
Total frames        144,793
Class balance       61.5% focused / 38.5% unfocused
Features extracted  17 per frame
Features selected   10 (used by ML models)

Data format

Training data lives under data/collected_<participant>/ as .npz files. Each file contains:

Key            Shape    Description
features       (N, 17)  Float array of extracted features
labels         (N,)     Binary: 0 = unfocused, 1 = focused
feature_names  (17,)    String names matching FEATURE_NAMES in collect_features.py

Data files are not included in this repository due to privacy considerations.
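
For reference, a session file with the keys above can be read back like this. The helper name `load_session` and the example path layout are illustrative; the actual data files are not distributed with the repository.

```python
import numpy as np

def load_session(path):
    """Load one recording session (sketch; keys follow the table above)."""
    data = np.load(path)
    features = data["features"]            # (N, 17) float feature array
    labels = data["labels"]                # (N,) binary: 0 = unfocused, 1 = focused
    feature_names = data["feature_names"]  # (17,) strings from FEATURE_NAMES
    assert features.shape[1] == len(feature_names)
    return features, labels, feature_names
```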

Files

File                    Purpose
prepare_dataset.py      Core data pipeline: loads .npz files, applies feature selection, performs stratified splits, fits StandardScaler on the training split only
data_exploration.ipynb  Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps

Feature selection

SELECTED_FEATURES["face_orientation"] defines the 10 features used by all ML models:

  • Head pose (3): head_deviation, s_face, pitch
  • Eye state (4): ear_left, ear_right, ear_avg, perclos
  • Gaze (3): h_gaze, gaze_offset, s_eye

Excluded: v_gaze (noisy), mar (triggered in only 1.7% of frames), yaw/roll (redundant with head_deviation/s_face), blink_rate/closure_duration/yawn_duration (temporal overlap with perclos).
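
The selection can be expressed as a name-to-column mapping. This is a sketch: the dict shape of SELECTED_FEATURES and the `select_features` helper are illustrative; the authoritative definition lives in the project code.

```python
import numpy as np

# Illustrative sketch of the mapping; the real SELECTED_FEATURES is defined
# in the project code.
SELECTED_FEATURES = {
    "face_orientation": [
        "head_deviation", "s_face", "pitch",            # head pose (3)
        "ear_left", "ear_right", "ear_avg", "perclos",  # eye state (4)
        "h_gaze", "gaze_offset", "s_eye",               # gaze (3)
    ],
}

def select_features(X, feature_names, subset="face_orientation"):
    """Reduce (N, 17) raw features to the 10 selected columns, by name."""
    idx = [list(feature_names).index(n) for n in SELECTED_FEATURES[subset]]
    return X[:, idx]
```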

The selection was validated with XGBoost gain importance and a leave-one-person-out (LOPO) channel ablation:

Channel subset   Mean LOPO F1
All 10 features  0.829
Eye state only   0.807
Head pose only   0.748
Gaze only        0.726
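
A minimal version of the LOPO evaluation loop behind these numbers looks like this. The function name `lopo_f1` and the `make_model` factory are illustrative; the `{person: (X, y)}` dict matches the load_per_person format described under Key functions.

```python
import numpy as np
from sklearn.metrics import f1_score

def lopo_f1(per_person, make_model):
    """Mean leave-one-person-out F1 (sketch): train on all participants but
    one, evaluate on the held-out participant, average over participants."""
    scores = []
    for held_out in per_person:
        X_tr = np.concatenate([X for p, (X, _) in per_person.items() if p != held_out])
        y_tr = np.concatenate([y for p, (_, y) in per_person.items() if p != held_out])
        X_te, y_te = per_person[held_out]
        model = make_model().fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te)))
    return float(np.mean(scores))
```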

Key functions

Function                      What it does
load_all_pooled(model_name)   Concatenates all participants' data into one pooled array
load_per_person(model_name)   Returns a {person: (X, y)} dict for LOPO cross-validation
get_numpy_splits(model_name)  Returns scaled train/val/test numpy arrays (70/15/15 split)
get_dataloaders(model_name)   Returns PyTorch DataLoaders for MLP training
get_default_split_config()    Returns split ratios and seed from config/default.yaml
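
The behaviour of get_numpy_splits can be approximated with scikit-learn. This is a sketch under assumed defaults: the function name `numpy_splits` and `seed=42` are illustrative, while the real ratios and seed come from config/default.yaml.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def numpy_splits(X, y, seed=42):
    """Stratified 70/15/15 split with train-only scaling (sketch; the seed
    value here is illustrative)."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    scaler = StandardScaler().fit(X_tr)  # fit on the training split only
    return ((scaler.transform(X_tr), y_tr),
            (scaler.transform(X_val), y_val),
            (scaler.transform(X_te), y_te))
```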

Data cleaning

The following steps are applied before splitting for training (in prepare_dataset.py) and per-frame at inference (in ui/pipeline.py):

  1. Angles clipped to physiological ranges (yaw ±45°, pitch/roll ±30°)
  2. head_deviation recomputed from clipped angles (not clipped after computation)
  3. EAR clipped to [0, 0.85], MAR to [0, 1.0]
  4. Physiological bounds on gaze_offset, PERCLOS, blink_rate, closure/yawn duration
  5. StandardScaler fit on training split only, applied to val/test
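
Steps 1-3 above can be sketched as a single clipping pass. The field names are illustrative, and the head_deviation formula used here (the norm of yaw and pitch) is an assumption, not the project's exact definition.

```python
import numpy as np

def clean_features(f):
    """Apply the clipping rules above to a dict of raw per-frame features
    (sketch; field names and the head_deviation formula are assumptions)."""
    f["yaw"] = np.clip(f["yaw"], -45.0, 45.0)
    f["pitch"] = np.clip(f["pitch"], -30.0, 30.0)
    f["roll"] = np.clip(f["roll"], -30.0, 30.0)
    # Step 2: recompute head_deviation from the already-clipped angles
    f["head_deviation"] = float(np.hypot(f["yaw"], f["pitch"]))
    for k in ("ear_left", "ear_right", "ear_avg"):
        f[k] = np.clip(f[k], 0.0, 0.85)      # EAR clipped to [0, 0.85]
    f["mar"] = np.clip(f["mar"], 0.0, 1.0)   # MAR clipped to [0, 1.0]
    f["perclos"] = np.clip(f["perclos"], 0.0, 1.0)
    return f
```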