# PriviGaze: Privileged Distillation for Accessible Gaze Estimation


**On-device gaze estimation designed for people with disabilities.**


PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K params) that estimates gaze direction from just a grayscale face image: no eye crops, no RGB, no calibration needed.


## Why This Matters


Traditional gaze trackers fail for people with disabilities:
- 👁️ **Droopy eyes** → eye-crop detectors can't find pupils
- 🔄 **Head roll/mobile instability** → calibration breaks
- 💡 **Varied lighting** → RGB-based models fail


PriviGaze's student model handles all of these by:
- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting)
- Having a **large receptive field** (handles head movement)
- Being **~80K parameters** (runs on any device)


## Architecture


### Teacher (Training Only - Privileged Information)
```
┌─────────────────────────────────────────────────┐
│                 PriviGazeTeacher                │
│                                                 │
│  Left Eye RGB  ──→ ConvNeXtV2-Atto ──→ 256d     │
│  Right Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d     │
│                         │ (Fusion)              │
│  Face Blurred  ──→ ConvNeXtV2-Nano ──→ 256d     │
│  (Grayscale)            │ (Cross-Attention)     │
│                    ┌──────────┐                 │
│                    │  Fused   │                 │
│                    │ Features │                 │
│                    │   256d   │                 │
│                    └────┬─────┘                 │
│                  ┌──────┴──────┐                │
│                  │ Pitch │ Yaw │                │
│                  └─────────────┘                │
└─────────────────────────────────────────────────┘
```
- 3 privileged inputs: left eye RGB, right eye RGB, blurred grayscale face
- ConvNeXtV2-Atto (3.7M params) for each eye, ConvNeXtV2-Nano (15.6M params) for the face
- Cross-attention fusion between face and eye modalities (sketched below)
- L2CS-Net style binned regression


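A minimal PyTorch sketch of one plausible cross-attention fusion step, assuming 256-d features and a face-queries-eyes layout; module and argument names here are illustrative, not the repo's actual API:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse a 256-d face feature with 256-d left/right eye features."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face, left_eye, right_eye):
        # The face feature queries the two eye features (keys/values).
        q = face.unsqueeze(1)                           # (B, 1, 256)
        kv = torch.stack([left_eye, right_eye], dim=1)  # (B, 2, 256)
        fused, _ = self.attn(q, kv, kv)                 # (B, 1, 256)
        return self.norm(face + fused.squeeze(1))       # residual + norm

fusion = CrossAttentionFusion()
face, le, re = (torch.randn(8, 256) for _ in range(3))
print(fusion(face, le, re).shape)  # torch.Size([8, 256])
```
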
### Student (On-Device Inference)
```
┌─────────────────────────────────────────────────┐
│                 PriviGazeStudent                │
│                   ~80K params                   │
│                                                 │
│  Face Grayscale ──→ Light Correction            │
│         │                                       │
│   Stem (32ch, /4)                               │
│         │                                       │
│   Inception Block ──→ DSConv (/2) ──→ 64ch      │
│         │                                       │
│   Inception Block ──→ DSConv (/2) ──→ 96ch      │
│         │                                       │
│   Inception Block ──→ DSConv (/2) ──→ 128ch     │
│         │                                       │
│   Inception Block ──→ GAP ──→ 160ch             │
│         │                                       │
│   Feature Projection ──→ 128d                   │
│         │                                       │
│   ┌──────┴──────┐                               │
│   │ Pitch │ Yaw │                               │
│   └─────────────┘                               │
└─────────────────────────────────────────────────┘
```
- 1 input: grayscale face (224×224)
- **Inception blocks** with factorized convolutions (1×3 + 3×1); see the sketch below
- **Depthwise separable convolutions** throughout
- **Learned light correction** (gamma + affine)
- L2CS-Net style binned regression


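For reference, a minimal sketch of what these building blocks could look like in PyTorch; the channel split, padding, and gamma parameterization are assumptions, not the exact modules from this repo:

```python
import torch
import torch.nn as nn

class LightCorrection(nn.Module):
    """Learned gamma + affine correction on the grayscale input (assumed form)."""

    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # gamma = exp(0) = 1 at init
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        x = x.clamp(1e-4, 1.0) ** self.log_gamma.exp()  # gamma correction
        return self.scale * x + self.shift              # affine correction

class FactorizedInceptionBlock(nn.Module):
    """Parallel 1x1 and factorized 1x3 + 3x1 branches, concatenated (out_ch even)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 2
        self.branch1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, (1, 3), padding=(0, 1)),
            nn.Conv2d(branch_ch, branch_ch, (3, 1), padding=(1, 0)),
        )
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.act(self.bn(out))

class DSConv(nn.Module):
    """Depthwise separable convolution with stride 2 for downsampling."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```
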
### Distillation Loss


The student learns from the teacher via a multi-component loss:


```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```


| Component | Weight | Description |
|-----------|--------|-------------|
| L_task | 1.0 | L2CS-Net binned regression (CE + MSE) |
| L_angular | 1.0 | Direct L1 in degrees |
| L_contrast | 0.5 | InfoNCE contrastive feature matching |
| L_mmd | 0.1 | Maximum Mean Discrepancy distribution matching |
| L_logit | 0.5 | KL divergence on soft targets |

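A sketch of how the distillation terms might be computed, assuming per-angle bin logits and features projected to a shared dimension; the InfoNCE temperature (0.07), the softening temperature `tau`, and the linear-kernel MMD simplification are illustrative choices, and `L_task` (the L2CS-Net CE + MSE term against ground truth) is left to the training script:

```python
import torch
import torch.nn.functional as F

def distill_terms(s_logits, t_logits, s_angles, t_angles, s_feat, t_feat,
                  tau=4.0, a_ang=1.0, a_con=0.5, a_mmd=0.1, a_log=0.5):
    # L_angular: direct L1 between student and teacher angles, in degrees.
    l_angular = F.l1_loss(s_angles, t_angles)

    # L_logit: KL divergence on temperature-softened bin logits.
    l_logit = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                       F.softmax(t_logits / tau, dim=-1),
                       reduction="batchmean") * tau ** 2

    # L_contrast: InfoNCE matching each student feature to its teacher feature.
    s_n, t_n = F.normalize(s_feat, dim=-1), F.normalize(t_feat, dim=-1)
    sim = s_n @ t_n.t() / 0.07                          # (B, B) similarities
    labels = torch.arange(len(s_n), device=s_n.device)  # positives on diagonal
    l_contrast = F.cross_entropy(sim, labels)

    # L_mmd: linear-kernel MMD, i.e. distance between batch feature means.
    l_mmd = (s_n.mean(0) - t_n.mean(0)).pow(2).sum()

    # Add L_task (weight 1.0) separately for the full objective.
    return a_ang * l_angular + a_log * l_logit + a_con * l_contrast + a_mmd * l_mmd
```
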
## Training

### Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Train teacher first, then distill to student
python train.py --mode both \
    --batch-size 32 \
    --epochs 100 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints \
    --push-to-hub \
    --hub-model-id BcantCode/privi-gaze-distill
```

### Phase 1: Teacher Pre-training
```bash
python train.py --mode pretrain_teacher \
    --batch-size 32 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints
```

### Phase 2: Student Distillation
```bash
python train.py --mode distill \
    --teacher-path ./checkpoints/teacher_best.pt \
    --epochs 100 \
    --batch-size 32 \
    --save-dir ./checkpoints
```


## Model Sizes


| Model | Parameters | Input | Use |
|-------|-----------|-------|-----|
| PriviGazeTeacher | ~19M | 2× RGB eyes + blurred face | Training only |
| PriviGazeStudent | ~80K | 1× grayscale face | On-device inference |


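A hypothetical end-to-end inference sketch for the student; the import path, checkpoint name, and (pitch, yaw) return signature are assumptions based on the layout described above:

```python
import torch
from PIL import Image
from torchvision import transforms

from models import PriviGazeStudent  # assumed import path

model = PriviGazeStudent()
model.load_state_dict(torch.load("checkpoints/student_best.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # student expects grayscale
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

face = preprocess(Image.open("face.jpg")).unsqueeze(0)  # (1, 1, 224, 224)
with torch.no_grad():
    pitch, yaw = model(face)  # assumed to return angles in degrees
print(f"pitch={pitch.item():.1f}°, yaw={yaw.item():.1f}°")
```
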
## Research Foundation


This work builds on:


- **L2CS-Net** (Abdelrahman et al., 2022): Per-angle binned regression for gaze
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze
- **ETH-XGaze** (Zhang et al., 2020): Large-scale gaze dataset with extreme head poses


## Dataset


Currently uses **SyntheticGazeDataset** for development. The synthetic generator creates realistic eye crops with pupil positions encoding gaze direction, plus face images with corresponding features.


For production use, the pipeline supports:
- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range


To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a skeleton is sketched below.


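As a starting point, a skeleton of such a class under an assumed on-disk layout (per-subject folders `p*/` with a `labels.txt` of `image pitch yaw` rows); the real MPIIFaceGaze annotation format differs, so adapt the parsing to your local copy:

```python
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset

class MPIIGazeDataset(Dataset):
    """Yields (grayscale face image, [pitch, yaw] in degrees)."""

    def __init__(self, root: str, transform=None):
        self.samples = []  # list of (image_path, pitch, yaw)
        self.transform = transform
        for label_file in Path(root).glob("p*/labels.txt"):  # assumed layout
            for line in label_file.read_text().splitlines():
                name, pitch, yaw = line.split()  # assumed 3-column format
                self.samples.append(
                    (label_file.parent / name, float(pitch), float(yaw)))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, pitch, yaw = self.samples[idx]
        img = Image.open(path).convert("L")  # grayscale face crop
        if self.transform:
            img = self.transform(img)
        return img, torch.tensor([pitch, yaw])
```
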
## Requirements


- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)


## License


Apache 2.0


## Citation


```
@software{privi_gaze_2026,
  title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
  year={2026},
  url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```