Abdelrahman Almatrooshi

data_preparation

Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.

Links

Data collection protocol

Nine team members each recorded 5-10 minute webcam sessions using a purpose-built tool (models/collect_features.py). During recording:

  • Participants simulated focused behaviour (reading, typing) and unfocused behaviour (looking at phone, turning away)
  • Binary labels were annotated in real time via key presses
  • Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
  • Real-time quality guidance warned if class balance fell outside 30-70% or if fewer than 10 state transitions occurred
  • An automated post-collection quality report validated minimum duration (120s), sample count (3,000+ frames), balance, and transition frequency
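
The post-collection checks above can be sketched as a small validator. This is a minimal sketch: the function name `quality_report` and the `fps=25` assumption are illustrative; the thresholds mirror the protocol described above.

```python
import numpy as np

def quality_report(labels, fps=25, min_duration_s=120, min_frames=3000,
                   balance_range=(0.30, 0.70), min_transitions=10):
    """Post-session quality checks (sketch; fps and function name are assumptions).

    labels: (N,) binary array of per-frame annotations, 1 = focused.
    """
    labels = np.asarray(labels)
    duration_s = len(labels) / fps
    focused_frac = labels.mean()                            # fraction of focused frames
    transitions = int(np.sum(labels[1:] != labels[:-1]))    # focused<->unfocused switches
    return {
        "duration_ok": duration_s >= min_duration_s,
        "frames_ok": len(labels) >= min_frames,
        "balance_ok": balance_range[0] <= focused_frac <= balance_range[1],
        "transitions_ok": transitions >= min_transitions,
    }
```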

All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vector and binary labels are saved.

The raw participant dataset is excluded from this repository (coursework policy and privacy constraints); it is shared separately via the dataset link above.

Dataset summary

Metric              Value
Participants        9
Total frames        144,793
Class balance       61.5% focused / 38.5% unfocused
Features extracted  17 per frame
Features selected   10 (used by ML models)

Data format

Training data lives under data/collected_<participant>/ as .npz files. Each file contains:

Key            Shape    Description
features       (N, 17)  Float array of extracted features
labels         (N,)     Binary: 0 = unfocused, 1 = focused
feature_names  (17,)    String names matching FEATURE_NAMES in collect_features.py

Data files are not included in this repository due to privacy considerations.
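
For reference, a session file with the keys above can be read back like this. The helper name `load_session` and the example path layout are illustrative; the actual data files are not distributed with the repository.

```python
import numpy as np

def load_session(path):
    """Load one recording session (sketch; keys follow the table above)."""
    data = np.load(path)
    features = data["features"]            # (N, 17) float feature array
    labels = data["labels"]                # (N,) binary: 0 = unfocused, 1 = focused
    feature_names = data["feature_names"]  # (17,) strings from FEATURE_NAMES
    assert features.shape[1] == len(feature_names)
    return features, labels, feature_names
```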

Files

File                    Purpose
prepare_dataset.py      Core data pipeline: loads .npz files, applies feature selection, performs stratified splits, fits StandardScaler on the training split only
data_exploration.ipynb  Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps

Feature selection

SELECTED_FEATURES["face_orientation"] defines the 10 features used by all ML models:

  • Head pose (3): head_deviation, s_face, pitch
  • Eye state (4): ear_left, ear_right, ear_avg, perclos
  • Gaze (3): h_gaze, gaze_offset, s_eye

Excluded: v_gaze (noisy), mar (triggered in only 1.7% of frames), yaw/roll (redundant with head_deviation/s_face), blink_rate/closure_duration/yawn_duration (temporal overlap with perclos).
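
The selection can be expressed as a name-to-column mapping. This is a sketch: the dict shape of SELECTED_FEATURES and the `select_features` helper are illustrative; the authoritative definition lives in the project code.

```python
import numpy as np

# Illustrative sketch of the mapping; the real SELECTED_FEATURES is defined
# in the project code.
SELECTED_FEATURES = {
    "face_orientation": [
        "head_deviation", "s_face", "pitch",            # head pose (3)
        "ear_left", "ear_right", "ear_avg", "perclos",  # eye state (4)
        "h_gaze", "gaze_offset", "s_eye",               # gaze (3)
    ],
}

def select_features(X, feature_names, subset="face_orientation"):
    """Reduce (N, 17) raw features to the 10 selected columns, by name."""
    idx = [list(feature_names).index(n) for n in SELECTED_FEATURES[subset]]
    return X[:, idx]
```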

The selection was validated with XGBoost gain importance and a leave-one-person-out (LOPO) channel ablation:

Channel subset   Mean LOPO F1
All 10 features  0.829
Eye state only   0.807
Head pose only   0.748
Gaze only        0.726
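
A minimal version of the LOPO evaluation loop behind these numbers looks like this. The function name `lopo_f1` and the `make_model` factory are illustrative; the `{person: (X, y)}` dict matches the load_per_person format described under Key functions.

```python
import numpy as np
from sklearn.metrics import f1_score

def lopo_f1(per_person, make_model):
    """Mean leave-one-person-out F1 (sketch): train on all participants but
    one, evaluate on the held-out participant, average over participants."""
    scores = []
    for held_out in per_person:
        X_tr = np.concatenate([X for p, (X, _) in per_person.items() if p != held_out])
        y_tr = np.concatenate([y for p, (_, y) in per_person.items() if p != held_out])
        X_te, y_te = per_person[held_out]
        model = make_model().fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te)))
    return float(np.mean(scores))
```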

Key functions

Function                      What it does
load_all_pooled(model_name)   Concatenates all participants' data into one pooled array
load_per_person(model_name)   Returns a {person: (X, y)} dict for LOPO cross-validation
get_numpy_splits(model_name)  Returns scaled train/val/test numpy arrays (70/15/15 split)
get_dataloaders(model_name)   Returns PyTorch DataLoaders for MLP training
get_default_split_config()    Returns split ratios and seed from config/default.yaml
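
The behaviour of get_numpy_splits can be approximated with scikit-learn. This is a sketch under assumed defaults: the function name `numpy_splits` and `seed=42` are illustrative, while the real ratios and seed come from config/default.yaml.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def numpy_splits(X, y, seed=42):
    """Stratified 70/15/15 split with train-only scaling (sketch; the seed
    value here is illustrative)."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    scaler = StandardScaler().fit(X_tr)  # fit on the training split only
    return ((scaler.transform(X_tr), y_tr),
            (scaler.transform(X_val), y_val),
            (scaler.transform(X_te), y_te))
```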

Data cleaning

The following steps are applied before splitting for training (in prepare_dataset.py) and per-frame at inference (in ui/pipeline.py):

  1. Angles clipped to physiological ranges (yaw ±45°, pitch/roll ±30°)
  2. head_deviation recomputed from clipped angles (not clipped after computation)
  3. EAR clipped to [0, 0.85], MAR to [0, 1.0]
  4. Physiological bounds on gaze_offset, PERCLOS, blink_rate, closure/yawn duration
  5. StandardScaler fit on training split only, applied to val/test
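
Steps 1-3 above can be sketched as a single clipping pass. The field names are illustrative, and the head_deviation formula used here (the norm of yaw and pitch) is an assumption, not the project's exact definition.

```python
import numpy as np

def clean_features(f):
    """Apply the clipping rules above to a dict of raw per-frame features
    (sketch; field names and the head_deviation formula are assumptions)."""
    f["yaw"] = np.clip(f["yaw"], -45.0, 45.0)
    f["pitch"] = np.clip(f["pitch"], -30.0, 30.0)
    f["roll"] = np.clip(f["roll"], -30.0, 30.0)
    # Step 2: recompute head_deviation from the already-clipped angles
    f["head_deviation"] = float(np.hypot(f["yaw"], f["pitch"]))
    for k in ("ear_left", "ear_right", "ear_avg"):
        f[k] = np.clip(f[k], 0.0, 0.85)      # EAR clipped to [0, 0.85]
    f["mar"] = np.clip(f["mar"], 0.0, 1.0)   # MAR clipped to [0, 1.0]
    f["perclos"] = np.clip(f["perclos"], 0.0, 1.0)
    return f
```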