# data_preparation

Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.

## Links

- Participant consent form: [Consent document](https://drive.google.com/file/d/1g1Hc764ffljoKrjApD6nmWDCXJGYTR0j/view?usp=drive_link)
- Dataset (staff access): [Dataset folder](https://drive.google.com/drive/folders/1fwACM6i6uVGFkTlJKSlqVhizzgrHl_gY?usp=sharing)

## Data collection protocol

Nine team members each recorded 5-10 minute webcam sessions using a purpose-built tool (`models/collect_features.py`). During recording:

- Participants simulated **focused** behaviour (reading, typing) and **unfocused** behaviour (looking at phone, turning away)
- Binary labels were annotated in real time via key presses
- Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
- Real-time quality guidance warned if class balance fell outside 30-70% or if fewer than 10 state transitions occurred
- An automated post-collection quality report validated minimum duration (120 s), sample count (3,000+ frames), class balance, and transition frequency

All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vectors and binary labels are saved.

The raw participant dataset is excluded from this repository (coursework policy and privacy constraints). It is shared separately via the dataset link above.

## Dataset summary

| Metric | Value |
|--------|-------|
| Participants | 9 |
| Total frames | 144,793 |
| Class balance | 61.5% focused / 38.5% unfocused |
| Features extracted | 17 per frame |
| Features selected | 10 (used by ML models) |

## Data format

Training data lives under `data/collected_/` as `.npz` files.
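The post-collection quality checks described above can be sketched as follows. This is an illustrative sketch only: the function name `quality_report` and its return structure are hypothetical, and the real implementation lives in the collection tooling; the thresholds (120 s, 3,000 frames, 30-70% balance, 10 transitions) are taken from the protocol above.

```python
import numpy as np

# Illustrative sketch of the post-collection quality report; names and
# structure are hypothetical, thresholds match the protocol description.
def quality_report(labels: np.ndarray, duration_s: float) -> dict:
    n = len(labels)
    focused_frac = labels.mean() if n else 0.0
    transitions = int(np.count_nonzero(np.diff(labels)))  # 0<->1 switches
    return {
        "duration_ok": duration_s >= 120,             # minimum session length
        "samples_ok": n >= 3000,                      # minimum frame count
        "balance_ok": 0.30 <= focused_frac <= 0.70,   # class balance 30-70%
        "transitions_ok": transitions >= 10,          # focus/unfocus switches
    }
```

A session passes only if every check is true, which is what the real-time guidance nudges participants toward during recording.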
Each file contains:

| Key | Shape | Description |
|-----|-------|-------------|
| `features` | (N, 17) | Float array of extracted features |
| `labels` | (N,) | Binary: 0 = unfocused, 1 = focused |
| `feature_names` | (17,) | String names matching `FEATURE_NAMES` in `collect_features.py` |

Data files are not included in this repository due to privacy considerations.

## Files

| File | Purpose |
|------|---------|
| `prepare_dataset.py` | Core data pipeline: loads `.npz`, applies feature selection, stratified splits, StandardScaler on train only |
| `data_exploration.ipynb` | Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps |

## Feature selection

`SELECTED_FEATURES["face_orientation"]` defines the 10 features used by all ML models:

- **Head pose (3):** `head_deviation`, `s_face`, `pitch`
- **Eye state (4):** `ear_left`, `ear_right`, `ear_avg`, `perclos`
- **Gaze (3):** `h_gaze`, `gaze_offset`, `s_eye`

Excluded: `v_gaze` (noisy), `mar` (1.7% trigger rate), `yaw`/`roll` (redundant with `head_deviation`/`s_face`), `blink_rate`/`closure_duration`/`yawn_duration` (temporal overlap with `perclos`).
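A minimal sketch of reading one session file and reducing the 17 stored features to the 10 selected ones by name. The helper `load_session` is hypothetical (the real pipeline is `prepare_dataset.py`); the `.npz` keys and the selection list match the tables above.

```python
import numpy as np

# The 10 selected features, per SELECTED_FEATURES["face_orientation"]
SELECTED = ["head_deviation", "s_face", "pitch",
            "ear_left", "ear_right", "ear_avg", "perclos",
            "h_gaze", "gaze_offset", "s_eye"]

def load_session(path):
    """Load one .npz session and keep only the selected feature columns.

    Hypothetical helper; mirrors the documented keys: features (N, 17),
    labels (N,), feature_names (17,).
    """
    with np.load(path) as f:
        features = f["features"]          # (N, 17)
        labels = f["labels"]              # (N,)
        names = list(f["feature_names"])  # 17 strings
    idx = [names.index(name) for name in SELECTED]  # map names -> columns
    return features[:, idx], labels
```

Selecting columns by name rather than by fixed index keeps the pipeline robust if the feature order in `collect_features.py` ever changes.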
Selection was validated by XGBoost gain importance and LOPO channel ablation:

| Channel subset | Mean LOPO F1 |
|----------------|--------------|
| All 10 features | 0.829 |
| Eye state only | 0.807 |
| Head pose only | 0.748 |
| Gaze only | 0.726 |

## Key functions

| Function | What it does |
|----------|--------------|
| `load_all_pooled(model_name)` | Concatenates all participant data into one array |
| `load_per_person(model_name)` | Returns `{person: (X, y)}` dict for LOPO cross-validation |
| `get_numpy_splits(model_name)` | Returns scaled train/val/test numpy arrays (70/15/15 split) |
| `get_dataloaders(model_name)` | Returns PyTorch DataLoaders for MLP training |
| `get_default_split_config()` | Returns split ratios and seed from `config/default.yaml` |

## Data cleaning

Applied before splitting (in `ui/pipeline.py` at inference time, in `prepare_dataset.py` for training):

1. Angles clipped to physiological ranges (yaw ±45°, pitch/roll ±30°)
2. `head_deviation` recomputed from the clipped angles (rather than clipped after computation)
3. EAR clipped to [0, 0.85], MAR to [0, 1.0]
4. Physiological bounds applied to `gaze_offset`, PERCLOS, blink rate, and closure/yawn durations
5. StandardScaler fit on the training split only, then applied to val/test
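The cleaning and splitting steps above can be sketched end to end. `clean` applies the clipping rules from steps 1 and 3 (the `head_deviation` recomputation of step 2 and the bounds of step 4 are omitted because their exact formulas aren't stated here), and `split_and_scale` performs a stratified 70/15/15 split with train-only standardisation, the NumPy equivalent of fitting StandardScaler on the training split. Both function names and the `cols` mapping are illustrative; the real implementation is `get_numpy_splits` in `prepare_dataset.py`.

```python
import numpy as np

def clean(X, cols):
    """Clip raw features to physiological ranges (steps 1 and 3).

    `cols` is an illustrative mapping from feature name to column index.
    """
    X = X.copy()
    X[:, cols["yaw"]] = np.clip(X[:, cols["yaw"]], -45, 45)
    for a in ("pitch", "roll"):
        X[:, cols[a]] = np.clip(X[:, cols[a]], -30, 30)
    for e in ("ear_left", "ear_right", "ear_avg"):
        X[:, cols[e]] = np.clip(X[:, cols[e]], 0, 0.85)
    X[:, cols["mar"]] = np.clip(X[:, cols["mar"]], 0, 1.0)
    return X

def split_and_scale(X, y, seed=42):
    """Stratified 70/15/15 split, then standardise with train statistics only."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(y):  # stratify: split each class separately
        idx = rng.permutation(np.flatnonzero(y == c))
        a, b = int(0.70 * len(idx)), int(0.85 * len(idx))
        train.append(idx[:a]); val.append(idx[a:b]); test.append(idx[b:])
    tr, va, te = (np.concatenate(p) for p in (train, val, test))
    mu = X[tr].mean(axis=0)            # fit on train only,
    sd = X[tr].std(axis=0) + 1e-8      # never on val/test
    scale = lambda s: ((X[s] - mu) / sd, y[s])
    return scale(tr), scale(va), scale(te)
```

Fitting the scaler on the training split alone is what prevents statistics from the val/test sets leaking into training.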