# data_preparation
Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.
## Links
- Participant consent form: Consent document
- Dataset (staff access): Dataset folder
## Data collection protocol
Nine team members each recorded 5-10 minute webcam sessions using a purpose-built collection tool (models/collect_features.py). During recording:
- Participants simulated focused behaviour (reading, typing) and unfocused behaviour (looking at phone, turning away)
- Binary labels were annotated in real-time via key presses
- Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
- Real-time quality guidance warned if class balance fell outside 30-70% or if fewer than 10 state transitions occurred
- An automated post-collection quality report validated minimum duration (120s), sample count (3,000+ frames), balance, and transition frequency
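The post-collection quality checks listed above can be sketched as follows. This is an illustrative re-implementation, not the actual collect_features.py code; the function name, the assumed 25 fps frame rate, and the dict return format are all assumptions.

```python
import numpy as np

def quality_report(labels, fps=25, min_duration_s=120, min_frames=3000,
                   balance_range=(0.30, 0.70), min_transitions=10):
    """Hypothetical sketch of the documented checks: minimum duration,
    sample count, class balance, and state-transition frequency."""
    labels = np.asarray(labels)
    n = len(labels)
    focused_frac = labels.mean() if n else 0.0
    transitions = int(np.count_nonzero(np.diff(labels)))  # label flips
    return {
        "duration_ok": n / fps >= min_duration_s,
        "frames_ok": n >= min_frames,
        "balance_ok": balance_range[0] <= focused_frac <= balance_range[1],
        "transitions_ok": transitions >= min_transitions,
    }

# Example: a 4,000-frame session alternating focus state every 200 frames.
labels = np.tile(np.repeat([1, 0], 200), 10)
report = quality_report(labels)
```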
All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vector and binary labels are saved.
The raw participant dataset is excluded from this repository (coursework policy and privacy constraints) and is instead shared via the dataset link above.
## Dataset summary
| Metric | Value |
|---|---|
| Participants | 9 |
| Total frames | 144,793 |
| Class balance | 61.5% focused / 38.5% unfocused |
| Features extracted | 17 per frame |
| Features selected | 10 (used by ML models) |
## Data format
Training data lives under data/collected_<participant>/ as .npz files. Each file contains:
| Key | Shape | Description |
|---|---|---|
| `features` | (N, 17) | Float array of extracted features |
| `labels` | (N,) | Binary: 0 = unfocused, 1 = focused |
| `feature_names` | (17,) | String names matching FEATURE_NAMES in collect_features.py |
Data files are not included in this repository due to privacy considerations.
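A minimal sketch of reading the documented .npz layout. Since the real participant files are private, the example writes an in-memory stand-in first; the placeholder names f0-f16 are not the real FEATURE_NAMES.

```python
import io
import numpy as np

# Build a stand-in archive with the documented layout (real files are private).
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.savez(buf,
         features=rng.normal(size=(100, 17)).astype(np.float32),
         labels=rng.integers(0, 2, size=100),
         feature_names=np.array([f"f{i}" for i in range(17)]))  # placeholders
buf.seek(0)

# Reading follows the key layout from the table above.
data = np.load(buf)
X, y, names = data["features"], data["labels"], data["feature_names"]
```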
## Files
| File | Purpose |
|---|---|
| `prepare_dataset.py` | Core data pipeline: loads .npz, applies feature selection, stratified splits, StandardScaler on train only |
| `data_exploration.ipynb` | Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps |
## Feature selection
`SELECTED_FEATURES["face_orientation"]` defines the 10 features used by all ML models:

- Head pose (3): head_deviation, s_face, pitch
- Eye state (4): ear_left, ear_right, ear_avg, perclos
- Gaze (3): h_gaze, gaze_offset, s_eye
Excluded: v_gaze (noisy), mar (1.7% trigger rate), yaw/roll (redundant with head_deviation/s_face), blink_rate/closure_duration/yawn_duration (temporal overlap with perclos).
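Assuming selection works by name lookup against the stored feature_names (the helper and the 17-column ordering below are illustrative, not the actual prepare_dataset.py code), slicing the 10-column subset looks like this:

```python
import numpy as np

# The 10 selected and 7 excluded names, as listed above.
SELECTED = ["head_deviation", "s_face", "pitch",
            "ear_left", "ear_right", "ear_avg", "perclos",
            "h_gaze", "gaze_offset", "s_eye"]
EXCLUDED = ["v_gaze", "mar", "yaw", "roll",
            "blink_rate", "closure_duration", "yawn_duration"]

def select_features(X, feature_names, selected=SELECTED):
    """Map each selected name to its column index, then slice (N, 17) -> (N, 10)."""
    idx = [list(feature_names).index(name) for name in selected]
    return X[:, idx]

# Illustrative column layout -- the real on-disk order may differ.
ALL_17 = SELECTED + EXCLUDED
X = np.random.default_rng(1).normal(size=(5, 17))
X10 = select_features(X, ALL_17)
```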
Selection was validated by XGBoost gain importance and LOPO channel ablation:
| Channel subset | Mean LOPO F1 |
|---|---|
| All 10 features | 0.829 |
| Eye state only | 0.807 |
| Head pose only | 0.748 |
| Gaze only | 0.726 |
## Key functions
| Function | What it does |
|---|---|
| `load_all_pooled(model_name)` | Concatenates all participant data into one array |
| `load_per_person(model_name)` | Returns {person: (X, y)} dict for LOPO cross-validation |
| `get_numpy_splits(model_name)` | Returns scaled train/val/test numpy arrays (70/15/15 split) |
| `get_dataloaders(model_name)` | Returns PyTorch DataLoaders for MLP training |
| `get_default_split_config()` | Returns split ratios and seed from config/default.yaml |
## Data cleaning
Applied before splitting for training (in prepare_dataset.py) and again at inference time (in ui/pipeline.py):
- Angles clipped to physiological ranges (yaw ±45, pitch/roll ±30)
- head_deviation recomputed from clipped angles (not clipped after computation)
- EAR clipped to [0, 0.85], MAR to [0, 1.0]
- Physiological bounds on gaze_offset, PERCLOS, blink_rate, closure/yawn duration
- StandardScaler fit on training split only, applied to val/test
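The clipping rules above, sketched with NumPy. The Euclidean head_deviation recomputation is a guessed definition for illustration only; the real formula lives in the pipeline code.

```python
import numpy as np

def clean_angles(yaw, pitch, roll):
    # Clip to physiological ranges first...
    yaw = np.clip(yaw, -45, 45)
    pitch = np.clip(pitch, -30, 30)
    roll = np.clip(roll, -30, 30)
    # ...then recompute head_deviation from the clipped angles; the
    # deviation itself is never clipped. NOTE: this Euclidean combination
    # is an assumed, illustrative definition.
    head_deviation = np.hypot(yaw, pitch)
    return yaw, pitch, roll, head_deviation

def clean_ratios(ear, mar):
    # Documented bounds: EAR in [0, 0.85], MAR in [0, 1.0].
    return np.clip(ear, 0.0, 0.85), np.clip(mar, 0.0, 1.0)

yaw, pitch, roll, hd = clean_angles(np.array([60.0]),
                                    np.array([-50.0]),
                                    np.array([10.0]))
ear, mar = clean_ratios(np.array([1.2]), np.array([-0.1]))
```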