# PriviGaze: Privileged Distillation for Accessible Gaze Estimation
**On-device gaze estimation designed for people with disabilities.**
PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K params) that estimates gaze direction from a single grayscale face image: no eye crops, no RGB, no calibration needed.
## Why This Matters
Traditional gaze trackers fail for people with disabilities:
- πŸ‘οΈ **Droopy eyes** β†’ eye crop detectors can't find pupils
- πŸ”„ **Head roll/mobile instability** β†’ calibration breaks
- πŸ’‘ **Varied lighting** β†’ RGB-based models fail
PriviGaze's student model handles all of these by:
- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting)
- Having a **large receptive field** (handles head movement)
- Being **~80K parameters** (runs on any device)
## Architecture
### Teacher (Training Only - Privileged Information)
```
┌─────────────────────────────────────────────────┐
│                PriviGazeTeacher                 │
│                                                 │
│  Left Eye RGB  ──→ ConvNeXtV2-Atto ──→ 256d     │
│  Right Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d     │
│                        ↓ (Fusion)               │
│  Face Blurred  ──→ ConvNeXtV2-Nano ──→ 256d     │
│  (Grayscale)           ↓ (Cross-Attention)      │
│                  ┌──────────┐                   │
│                  │  Fused   │                   │
│                  │ Features │                   │
│                  │   256d   │                   │
│                  └────┬─────┘                   │
│                  ┌────┴─────┐                   │
│                  │Pitch│Yaw │                   │
│                  └──────────┘                   │
└─────────────────────────────────────────────────┘
```
- 3 privileged inputs: left eye RGB, right eye RGB, blurred grayscale face
- ConvNeXtV2-Atto (3.7M) for eyes, ConvNeXtV2-Nano (15.6M) for face
- Cross-attention fusion between face and eye modalities (sketched below)
- L2CS-Net style binned regression
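A minimal sketch of the cross-attention fusion step, assuming pooled 256-d features from each backbone; the module and its hyperparameters (e.g. `num_heads=4`) are illustrative, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion: face features attend to the two eye features."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face_feat, left_eye_feat, right_eye_feat):
        # face_feat, *_eye_feat: (B, 256) pooled backbone features
        query = face_feat.unsqueeze(1)                              # (B, 1, 256)
        eyes = torch.stack([left_eye_feat, right_eye_feat], dim=1)  # (B, 2, 256)
        fused, _ = self.attn(query, eyes, eyes)                     # face attends to eyes
        return self.norm(fused.squeeze(1) + face_feat)              # residual + norm -> (B, 256)
```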
### Student (On-Device Inference)
```
┌─────────────────────────────────────────────────┐
│                PriviGazeStudent                 │
│                  ~80K params                    │
│                                                 │
│  Face Grayscale ──→ Light Correction            │
│                        ↓                        │
│                 Stem (32ch, /4)                 │
│                        ↓                        │
│        Inception Block → DSConv (/2) → 64ch     │
│                        ↓                        │
│        Inception Block → DSConv (/2) → 96ch     │
│                        ↓                        │
│        Inception Block → DSConv (/2) → 128ch    │
│                        ↓                        │
│        Inception Block → GAP → 160ch            │
│                        ↓                        │
│           Feature Projection → 128d             │
│                        ↓                        │
│                  ┌────┴─────┐                   │
│                  │Pitch│Yaw │                   │
│                  └──────────┘                   │
└─────────────────────────────────────────────────┘
```
- 1 input: grayscale face (224×224)
- **Inception blocks** with factorized convolutions (1×3 + 3×1); sketched below together with the other blocks
- **Depthwise separable convolutions** throughout
- **Learned light correction** (gamma + affine)
- L2CS-Net style binned regression
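The three custom blocks named above, as a minimal PyTorch sketch. The channel split inside the Inception branch, the activation choices, and the exact form of the light correction are assumptions; the repo's `models` code may differ:

```python
import torch
import torch.nn as nn

class LightCorrection(nn.Module):
    """Assumed form of the learned light correction: per-image gamma plus affine."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # gamma = exp(log_gamma) stays positive
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):                              # x: (B, 1, H, W) in [0, 1]
        x = x.clamp(min=1e-6) ** self.log_gamma.exp()  # gamma correction
        return self.scale * x + self.shift             # affine correction

class DSConv(nn.Module):
    """Depthwise separable convolution with optional stride for downsampling."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class FactorizedInception(nn.Module):
    """Inception-style block with a 1x3 + 3x1 factorized branch."""
    def __init__(self, ch):
        super().__init__()
        b = ch // 2
        self.branch1 = nn.Conv2d(ch, b, 1)
        self.branch2 = nn.Sequential(
            nn.Conv2d(ch, b, (1, 3), padding=(0, 1)),
            nn.Conv2d(b, b, (3, 1), padding=(1, 0)),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```

Factorizing 3×3 kernels into 1×3 + 3×1 pairs and splitting depthwise from pointwise convolutions is what keeps the full stack near 80K parameters.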
### Distillation Loss
The student learns from the teacher via a multi-component loss (the main terms are sketched in code after the table):
```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```
| Component | Weight | Description |
|-----------|--------|-------------|
| L_task | 1.0 | L2CS-Net binned regression (CE + MSE) |
| L_angular | 1.0 | Direct L1 on the decoded angle, in degrees |
| L_contrast | 0.5 | InfoNCE contrastive feature matching |
| L_mmd | 0.1 | Maximum Mean Discrepancy distribution matching |
| L_logit | 0.5 | KL divergence on soft targets |
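A sketch of how the task, angular, and logit terms could be computed for one angle head. The bin layout (90 bins of 4°), temperature, and helper names are assumptions; the InfoNCE and MMD terms are omitted for brevity:

```python
import torch
import torch.nn.functional as F

# Assumed L2CS-style bin layout: 90 bins of 4 degrees covering [-180, 180).
BINS = torch.arange(90)
BIN_WIDTH, OFFSET = 4.0, 180.0

def expected_angle(logits):
    """Soft-argmax decode: bin logits -> continuous angle in degrees."""
    probs = logits.softmax(dim=-1)
    return (probs * BINS.to(logits.device)).sum(-1) * BIN_WIDTH - OFFSET

def distill_loss(s_logits, t_logits, target_deg, a_angular=1.0, a_logit=0.5, tau=4.0):
    """Task + angular + logit terms for one angle head (pitch or yaw)."""
    # L_task: cross-entropy on the target bin + MSE on the decoded angle.
    bin_target = ((target_deg + OFFSET) / BIN_WIDTH).long().clamp(0, len(BINS) - 1)
    pred_deg = expected_angle(s_logits)
    l_task = F.cross_entropy(s_logits, bin_target) + F.mse_loss(pred_deg, target_deg)

    # L_angular: direct L1 against ground truth, in degrees.
    l_angular = F.l1_loss(pred_deg, target_deg)

    # L_logit: KL divergence on temperature-softened teacher bin distributions.
    l_logit = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau

    return l_task + a_angular * l_angular + a_logit * l_logit
```

In practice the same loss would be applied to the pitch and yaw heads separately and summed.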
## Training
### Quick Start
```bash
# Install dependencies
pip install -r requirements.txt
# Train teacher first, then distill to student
python train.py --mode both \
--batch-size 32 \
--epochs 100 \
--teacher-epochs 50 \
--save-dir ./checkpoints \
--push-to-hub \
--hub-model-id BcantCode/privi-gaze-distill
```
### Phase 1: Teacher Pre-training
```bash
python train.py --mode pretrain_teacher \
--batch-size 32 \
--teacher-epochs 50 \
--save-dir ./checkpoints
```
### Phase 2: Student Distillation
```bash
python train.py --mode distill \
--teacher-path ./checkpoints/teacher_best.pt \
--epochs 100 \
--batch-size 32 \
--save-dir ./checkpoints
```
## Model Sizes
| Model | Parameters | Input | Use |
|-------|-----------|-------|-----|
| PriviGazeTeacher | ~19M | 2× RGB eyes + blurred face | Training only |
| PriviGazeStudent | ~80K | 1× grayscale face | On-device inference |
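For on-device use, student inference could look like the following; the import path, checkpoint name, and output format (one bin-logit tensor per angle) are assumptions about the repo layout:

```python
import numpy as np
import torch
from PIL import Image

from models import PriviGazeStudent  # hypothetical import path

# Hypothetical checkpoint name; train.py's actual save format may differ.
model = PriviGazeStudent()
model.load_state_dict(torch.load("./checkpoints/student_best.pt", map_location="cpu"))
model.eval()

# 224x224 grayscale face crop, scaled to [0, 1].
face = Image.open("face.jpg").convert("L").resize((224, 224))
x = torch.from_numpy(np.asarray(face, dtype=np.float32) / 255.0)[None, None]  # (1, 1, 224, 224)

with torch.no_grad():
    pitch_logits, yaw_logits = model(x)  # assumed output: one bin-logit tensor per angle
# Decode to degrees with the soft-argmax shown in the loss sketch above.
```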
## Research Foundation
This work builds on:
- **L2CS-Net** (Abdelrahman et al., 2022): Per-angle binned regression for gaze
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): a 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze
- **ETH-XGaze** (Zhang et al., 2020): Large-scale gaze dataset with extreme head poses
## Dataset
Currently uses a **SyntheticGazeDataset** for development. The synthetic generator creates realistic eye crops whose pupil positions encode gaze direction, plus corresponding face images.
For production use, the pipeline supports:
- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range
To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a skeleton sketch follows.
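A hedged skeleton of what that class could look like; the sample layout, tensor shapes, and units are assumptions:

```python
import torch
from torch.utils.data import Dataset

class MPIIGazeDataset(Dataset):
    """Skeleton for models/dataset.py; paths and label format are assumptions."""
    def __init__(self, samples):
        # samples: list of (face_gray, left_eye_rgb, right_eye_rgb, pitch_deg, yaw_deg)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        face, left, right, pitch, yaw = self.samples[idx]
        return {
            "face": torch.as_tensor(face, dtype=torch.float32),      # (1, 224, 224)
            "left_eye": torch.as_tensor(left, dtype=torch.float32),  # (3, H, W)
            "right_eye": torch.as_tensor(right, dtype=torch.float32),
            "gaze": torch.tensor([pitch, yaw], dtype=torch.float32), # degrees
        }
```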
## Requirements
- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)
## License
Apache 2.0
## Citation
```
@software{privi_gaze_2026,
title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
year={2026},
url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```