# PriviGaze: Privileged Distillation for Accessible Gaze Estimation

**On-device gaze estimation designed for people with disabilities.**

PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K params) that estimates gaze direction from just a grayscale face image: no eye crops, no RGB, no calibration needed.

## Why This Matters

Traditional gaze trackers fail for people with disabilities:
- 👁️ **Droopy eyes** → eye-crop detectors can't find pupils
- 🔄 **Head roll / mobile instability** → calibration breaks
- 💡 **Varied lighting** → RGB-based models fail

PriviGaze's student model handles all of these by:
- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting)
- Having a **large receptive field** (handles head movement)
- Being **~80K parameters** (runs on any device)

## Architecture

### Teacher (Training Only - Privileged Information)
```
┌─────────────────────────────────────────────────┐
│                PriviGazeTeacher                 │
│                                                 │
│  Left Eye RGB  ──→ ConvNeXtV2-Atto ──→ 256d     │
│  Right Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d     │
│                      ↓ (Fusion)                 │
│  Face Blurred  ──→ ConvNeXtV2-Nano ──→ 256d     │
│  (Grayscale)         ↓ (Cross-Attention)        │
│                 ┌──────────┐                    │
│                 │  Fused   │                    │
│                 │ Features │                    │
│                 │   256d   │                    │
│                 └────┬─────┘                    │
│                ┌─────┴───────┐                  │
│                │ Pitch │ Yaw │                  │
│                └─────────────┘                  │
└─────────────────────────────────────────────────┘
```
- 3 privileged inputs: left eye RGB, right eye RGB, blurred grayscale face
- ConvNeXtV2-Atto (3.7M params) for the eyes, ConvNeXtV2-Nano (15.6M params) for the face
- Cross-attention fusion between the face and eye modalities
- L2CS-Net-style binned regression
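
For intuition, this is roughly how such a fused teacher could be wired up with `timm` backbones. The class, projection sizes, and bin count below are illustrative assumptions, not the repository's actual modules:

```python
# Illustrative sketch only: module names, projection sizes, and head shapes
# are assumptions, not the repo's actual teacher implementation.
import torch
import torch.nn as nn
import timm

class TeacherSketch(nn.Module):
    def __init__(self, dim=256, n_bins=90):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        # A single shared eye encoder matches the ~19M total (3.7M + 15.6M).
        self.eye_net = timm.create_model("convnextv2_atto", num_classes=0)
        self.face_net = timm.create_model("convnextv2_nano", num_classes=0, in_chans=1)
        self.eye_proj = nn.Linear(self.eye_net.num_features, dim)
        self.face_proj = nn.Linear(self.face_net.num_features, dim)
        # Face features attend over the fused eye features
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pitch_head = nn.Linear(dim, n_bins)  # L2CS-style per-angle bins
        self.yaw_head = nn.Linear(dim, n_bins)

    def forward(self, left_eye, right_eye, face_blurred):
        # Fusion of the two eye streams: shared encoder, summed projections
        eyes = self.eye_proj(self.eye_net(left_eye)) + self.eye_proj(self.eye_net(right_eye))
        face = self.face_proj(self.face_net(face_blurred))
        fused, _ = self.cross_attn(
            face.unsqueeze(1), eyes.unsqueeze(1), eyes.unsqueeze(1)
        )
        fused = fused.squeeze(1)  # (B, 256) fused features
        return self.pitch_head(fused), self.yaw_head(fused)
```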

### Student (On-Device Inference)
```
┌─────────────────────────────────────────────────┐
│                PriviGazeStudent                 │
│                  ~80K params                    │
│                                                 │
│  Face Grayscale ──→ Light Correction            │
│                        ↓                        │
│                 Stem (32ch, /4)                 │
│                        ↓                        │
│     Inception Block → DSConv (/2) → 64ch        │
│                        ↓                        │
│     Inception Block → DSConv (/2) → 96ch        │
│                        ↓                        │
│     Inception Block → DSConv (/2) → 128ch       │
│                        ↓                        │
│     Inception Block → GAP → 160ch               │
│                        ↓                        │
│           Feature Projection → 128d             │
│                        ↓                        │
│                  ┌─────┴───────┐                │
│                  │ Pitch │ Yaw │                │
│                  └─────────────┘                │
└─────────────────────────────────────────────────┘
```
- 1 input: grayscale face (224×224)
- **Inception blocks** with factorized convolutions (1×3 + 3×1)
- **Depthwise separable convolutions** throughout
- **Learned light correction** (gamma + affine)
- L2CS-Net-style binned regression
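
A hedged sketch of what these building blocks typically look like in PyTorch; the exact layer ordering, activations, and normalization choices here are assumptions:

```python
# Sketch of the student's building blocks -- layer ordering and activations
# are assumptions, not the repo's actual code.
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable conv: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class FactorizedBranch(nn.Module):
    """Inception-style branch with a 3x3 conv factorized as 1x3 + 3x1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, (1, 3), padding=(0, 1), bias=False),
            nn.Conv2d(c_out, c_out, (3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        )

    def forward(self, x):
        return self.conv(x)

class LightCorrection(nn.Module):
    """Learned gamma + affine correction on the grayscale input."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # exp() keeps gamma positive
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x expected in [0, 1]
        return self.scale * x.clamp(min=1e-6).pow(self.log_gamma.exp()) + self.shift
```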

### Distillation Loss

The student learns from the teacher via a multi-component loss:

```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```

| Component  | Weight | Description                                    |
|------------|--------|------------------------------------------------|
| L_task     | 1.0    | L2CS-Net binned regression (CE + MSE)          |
| L_angular  | 1.0    | Direct L1 in degrees                           |
| L_contrast | 0.5    | InfoNCE contrastive feature matching           |
| L_mmd      | 0.1    | Maximum Mean Discrepancy distribution matching |
| L_logit    | 0.5    | KL divergence on soft targets                  |
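
As a concrete reference, the weighted sum could be assembled as below. The temperature, kernel bandwidth, and tensor shapes are assumptions, and features are assumed to be projected to a shared dimension before matching:

```python
# Hedged sketch of the combined objective; weights mirror the table above.
import torch
import torch.nn.functional as F

def logit_kl(student_logits, teacher_logits, T=4.0):
    """KL on temperature-softened bin logits (scaled by T^2, standard KD)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def info_nce(student_feat, teacher_feat, tau=0.07):
    """InfoNCE: each student feature should match its own teacher feature."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    logits = s @ t.T / tau                      # (B, B) cosine similarities
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)

def rbf_mmd(x, y, sigma=1.0):
    """Maximum Mean Discrepancy with a single RBF kernel (biased estimate)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def distill_loss(l_task, s_logits, t_logits, s_feat, t_feat, s_deg, t_deg):
    # l_task: L2CS binned regression (CE + MSE) vs. ground truth, computed elsewhere.
    # s_logits/t_logits: bin logits for one angle (apply to pitch and yaw, then sum).
    # s_deg/t_deg: continuous angle predictions in degrees.
    l_angular = F.l1_loss(s_deg, t_deg)
    return (l_task
            + 1.0 * l_angular
            + 0.5 * info_nce(s_feat, t_feat)
            + 0.1 * rbf_mmd(s_feat, t_feat)
            + 0.5 * logit_kl(s_logits, t_logits))
```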

## Training

### Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Train teacher first, then distill to student
python train.py --mode both \
    --batch-size 32 \
    --epochs 100 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints \
    --push-to-hub \
    --hub-model-id BcantCode/privi-gaze-distill
```

### Phase 1: Teacher Pre-training
```bash
python train.py --mode pretrain_teacher \
    --batch-size 32 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints
```

### Phase 2: Student Distillation
```bash
python train.py --mode distill \
    --teacher-path ./checkpoints/teacher_best.pt \
    --epochs 100 \
    --batch-size 32 \
    --save-dir ./checkpoints
```

## Model Sizes

| Model            | Parameters | Input                     | Use                 |
|------------------|------------|---------------------------|---------------------|
| PriviGazeTeacher | ~19M       | 2×RGB eyes + blurred face | Training only       |
| PriviGazeStudent | ~80K       | 1×grayscale face          | On-device inference |
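
For on-device use, student inference might look roughly like this. The import path, checkpoint name, output format, and preprocessing are assumptions, not a documented API:

```python
# Hypothetical inference sketch -- import path, checkpoint name, and output
# format are assumptions about this repo, not a documented API.
import torch
from PIL import Image
from torchvision import transforms

from models.student import PriviGazeStudent  # assumed module path

model = PriviGazeStudent()
model.load_state_dict(torch.load("checkpoints/student_best.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

face = preprocess(Image.open("face.jpg")).unsqueeze(0)  # (1, 1, 224, 224)
with torch.no_grad():
    # Assumed to return per-angle outputs; the actual head may emit bin
    # logits that need decoding to degrees.
    pitch, yaw = model(face)
```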

## Research Foundation

This work builds on:

- **L2CS-Net** (Abdelrahman et al., 2022): Per-angle binned regression for gaze
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze
- **ETH-XGaze** (Zhang et al., 2020): Large-scale gaze dataset with extreme head poses

## Dataset

Currently uses **SyntheticGazeDataset** for development. The synthetic generator creates realistic eye crops with pupil positions encoding gaze direction, plus face images with corresponding features.

For production use, the pipeline supports:
- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range

To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a possible skeleton is sketched below.
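
This is a minimal skeleton under assumptions about the MPIIFaceGaze layout, the eye-crop size, and the dict keys the training loop expects; only the indexing logic needs to be filled in:

```python
# Hypothetical skeleton: the directory walk, dict keys, and crop sizes are
# assumptions, not the repo's actual contract.
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MPIIGazeDataset(Dataset):
    def __init__(self, root, subjects):
        # Each sample: (face_path, left_eye_path, right_eye_path, pitch_deg, yaw_deg)
        self.samples = self._index(root, subjects)
        self.face_tf = transforms.Compose([
            transforms.Grayscale(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        self.eye_tf = transforms.Compose([
            transforms.Resize((64, 64)),  # eye-crop size is an assumption
            transforms.ToTensor(),
        ])

    def _index(self, root, subjects):
        raise NotImplementedError("Walk the MPIIFaceGaze layout; collect paths + labels")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        face, left, right, pitch, yaw = self.samples[idx]
        return {
            "face_gray": self.face_tf(Image.open(face)),
            "left_eye": self.eye_tf(Image.open(left).convert("RGB")),
            "right_eye": self.eye_tf(Image.open(right).convert("RGB")),
            "gaze": torch.tensor([pitch, yaw], dtype=torch.float32),
        }
```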

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)

## License

Apache 2.0

## Citation

```
@software{privi_gaze_2026,
  title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
  year={2026},
  url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```