# PriviGaze: Privileged Distillation for Accessible Gaze Estimation

**On-device gaze estimation designed for people with disabilities.**

PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K parameters) that estimates gaze direction from a single grayscale face image: no eye crops, no RGB, no calibration needed.

## Why This Matters

Traditional gaze trackers often fail for people with disabilities:

- 👁️ **Droopy eyelids** → eye-crop detectors can't find pupils
- 🔄 **Head roll / mobile-device instability** → calibration breaks
- 💡 **Varied lighting** → RGB-based models fail

PriviGaze's student model handles all of these by:

- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting changes)
- Having a **large receptive field** (tolerates head movement)
- Being **~80K parameters** (runs on any device)

## Architecture

### Teacher (Training Only - Privileged Information)

```
┌────────────────────────────────────────────────┐
│                PriviGazeTeacher                │
│                                                │
│  Left Eye RGB  ──→ ConvNeXtV2-Atto ──→ 256d    │
│  Right Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d    │
│                         ↓ (Fusion)             │
│  Face Blurred  ──→ ConvNeXtV2-Nano ──→ 256d    │
│  (Grayscale)            ↓ (Cross-Attention)    │
│                   ┌──────────┐                 │
│                   │  Fused   │                 │
│                   │ Features │                 │
│                   │   256d   │                 │
│                   └────┬─────┘                 │
│                 ┌──────┴──────┐                │
│                 │ Pitch │ Yaw │                │
│                 └─────────────┘                │
└────────────────────────────────────────────────┘
```

- 3 privileged inputs: left eye RGB, right eye RGB, blurred grayscale face
- ConvNeXtV2-Atto (3.7M params) for the eyes, ConvNeXtV2-Nano (15.6M params) for the face
- Cross-attention fusion between the face and eye modalities
- L2CS-Net style binned regression head

### Student (On-Device Inference)

```
┌─────────────────────────────────────────────┐
│               PriviGazeStudent              │
│                 ~80K params                 │
│                                             │
│     Face Grayscale ──→ Light Correction     │
│                      ↓                      │
│               Stem (32ch, /4)               │
│                      ↓                      │
│    Inception Block → DSConv (/2) → 64ch     │
│                      ↓                      │
│    Inception Block → DSConv (/2) → 96ch     │
│                      ↓                      │
│    Inception Block → DSConv (/2) → 128ch    │
│                      ↓                      │
│        Inception Block → GAP → 160ch        │
│                      ↓                      │
│          Feature Projection → 128d          │
│                      ↓                      │
│               ┌──────┴──────┐               │
│               │ Pitch │ Yaw │               │
│               └─────────────┘               │
└─────────────────────────────────────────────┘
```

- 1 input: grayscale face (224×224)
- **Inception blocks** with factorized convolutions (1×3 + 3×1); see the sketch after the Distillation Loss table below
- **Depthwise separable convolutions** throughout
- **Learned light correction** (gamma + affine)
- L2CS-Net style binned regression head

### Distillation Loss

The student learns from the teacher via a multi-component loss:

```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```

| Component | Weight | Description |
|-----------|--------|-------------|
| L_task | 1.0 | L2CS-Net binned regression (CE + MSE) |
| L_angular | 1.0 | Direct L1 angular loss in degrees |
| L_contrast | 0.5 | InfoNCE contrastive feature matching |
| L_mmd | 0.1 | Maximum Mean Discrepancy distribution matching |
| L_logit | 0.5 | KL divergence on soft teacher targets |
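For illustration, the table above maps onto a combined objective roughly like the sketch below. This is a minimal sketch, not the repository's code: the tensor keys, helper functions, and the assumption that student features have already been projected to the teacher's feature width are all hypothetical; `train.py` holds the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(f_s, f_t, tau=0.07):
    """InfoNCE over a batch: matched student/teacher pairs are positives.
    Assumes f_s and f_t share a feature width (e.g. via a projection head)."""
    f_s, f_t = F.normalize(f_s, dim=-1), F.normalize(f_t, dim=-1)
    logits = f_s @ f_t.T / tau  # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(len(f_s), device=f_s.device))

def mmd_linear(f_s, f_t):
    """Linear-kernel MMD: squared distance between batch feature means."""
    d = f_s.mean(0) - f_t.mean(0)
    return d.dot(d)

def kd_kl(s_logits, t_logits, T=4.0):
    """KL divergence on temperature-softened logits (standard logit KD)."""
    return F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T

def total_loss(student, teacher, target, w=(1.0, 0.5, 0.1, 0.5)):
    """Hypothetical combination of the loss components in the table above."""
    a_ang, a_con, a_mmd, a_log = w
    # L_task: L2CS-Net binned regression = CE over angle bins + MSE on angles
    l_task = (F.cross_entropy(student["pitch_logits"], target["pitch_bin"])
              + F.cross_entropy(student["yaw_logits"], target["yaw_bin"])
              + F.mse_loss(student["angles"], target["angles"]))
    l_angular = F.l1_loss(student["angles"], target["angles"])  # in degrees
    l_contrast = info_nce(student["features"], teacher["features"])
    l_mmd = mmd_linear(student["features"], teacher["features"])
    l_logit = (kd_kl(student["pitch_logits"], teacher["pitch_logits"])
               + kd_kl(student["yaw_logits"], teacher["yaw_logits"]))
    return (l_task + a_ang * l_angular + a_con * l_contrast
            + a_mmd * l_mmd + a_log * l_logit)
```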
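Similarly, the student's building blocks described above (learned light correction, factorized Inception blocks, depthwise separable downsampling) might look roughly like this. Module names and the exact branch layout are assumptions for illustration, not the repository's code:

```python
import torch
import torch.nn as nn

class LightCorrection(nn.Module):
    """Learned gamma + affine correction on the grayscale input
    (assumed parameterization)."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # gamma = 1 at init
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (B, 1, H, W) in [0, 1]
        x = x.clamp(min=1e-6) ** self.log_gamma.exp()  # learned gamma curve
        return self.scale * x + self.shift             # learned affine

class InceptionBlock(nn.Module):
    """Two parallel branches, a 1x1 pointwise conv and a 3x3 conv factorized
    into 1x3 + 3x1, concatenated back to the input width."""
    def __init__(self, ch):
        super().__init__()
        b = ch // 2  # per-branch width (assumed split)
        self.point = nn.Sequential(
            nn.Conv2d(ch, b, 1, bias=False),
            nn.BatchNorm2d(b), nn.ReLU(inplace=True))
        self.fact = nn.Sequential(
            nn.Conv2d(ch, b, (1, 3), padding=(0, 1), bias=False),
            nn.Conv2d(b, b, (3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(b), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.point(x), self.fact(x)], dim=1)

class DSConvDown(nn.Module):
    """Depthwise separable conv with stride 2: halves resolution, changes width."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride=2, padding=1,
                            groups=c_in, bias=False)     # depthwise, /2 spatial
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)  # pointwise, mixes channels
        self.norm = nn.Sequential(nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.norm(self.pw(self.dw(x)))
```

A stage such as `Inception Block → DSConv (/2) → 96ch` in the diagram would then compose as `nn.Sequential(InceptionBlock(64), DSConvDown(64, 96))`.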
## Training

### Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Train the teacher first, then distill to the student
python train.py --mode both \
    --batch-size 32 \
    --epochs 100 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints \
    --push-to-hub \
    --hub-model-id BcantCode/privi-gaze-distill
```

### Phase 1: Teacher Pre-training

```bash
python train.py --mode pretrain_teacher \
    --batch-size 32 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints
```

### Phase 2: Student Distillation

```bash
python train.py --mode distill \
    --teacher-path ./checkpoints/teacher_best.pt \
    --epochs 100 \
    --batch-size 32 \
    --save-dir ./checkpoints
```

## Model Sizes

| Model | Parameters | Input | Use |
|-------|-----------|-------|-----|
| PriviGazeTeacher | ~19M | 2×RGB eyes + blurred face | Training only |
| PriviGazeStudent | ~80K | 1×grayscale face | On-device inference |

## Research Foundation

This work builds on:

- **L2CS-Net** (Abdelrahman et al., 2022): per-angle binned regression for gaze estimation
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): a 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze estimation
- **ETH-XGaze** (Zhang et al., 2020): a large-scale gaze dataset with extreme head poses

## Dataset

Development currently uses a **SyntheticGazeDataset**. The synthetic generator creates realistic eye crops whose pupil positions encode gaze direction, plus face images with corresponding features.

For production use, the pipeline supports:

- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range

To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a skeletal sketch appears at the end of this README.

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)

## License

Apache 2.0

## Citation

```
@software{privi_gaze_2026,
  title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
  year={2026},
  url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```
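## Appendix: Dataset Loader Sketch

As noted in the Dataset section, real datasets plug in through an `MPIIGazeDataset` class in `models/dataset.py`. Below is a minimal skeleton of what that class might look like; the annotation file layout, field names, and preprocessing are illustrative assumptions about MPIIFaceGaze, not the project's actual loader.

```python
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class MPIIGazeDataset(Dataset):
    """Skeleton loader; the per-subject annotation format is a hypothetical
    `image_name pitch yaw` text layout, adjust to the real dataset files."""
    def __init__(self, root, subjects, transform=None):
        self.root = Path(root)
        self.transform = transform
        self.samples = []  # list of (image_path, pitch, yaw)
        for subj in subjects:
            anno = self.root / subj / "annotations.txt"  # hypothetical layout
            for line in anno.read_text().splitlines():
                img, pitch, yaw = line.split()
                self.samples.append((self.root / subj / img,
                                     float(pitch), float(yaw)))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, pitch, yaw = self.samples[idx]
        # Grayscale 224x224 face, normalized to [0, 1], shape (1, 224, 224)
        face = Image.open(path).convert("L").resize((224, 224))
        face = torch.from_numpy(np.array(face)).float().unsqueeze(0) / 255.0
        if self.transform is not None:
            face = self.transform(face)
        return {"face": face,
                "gaze": torch.tensor([pitch, yaw], dtype=torch.float32)}
```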