# PriviGaze: Privileged Distillation for Accessible Gaze Estimation

**On-device gaze estimation designed for people with disabilities.**

PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K params) that estimates gaze direction from just a grayscale face image: no eye crops, no RGB, no calibration needed.

## Why This Matters

Traditional gaze trackers often fail for people with disabilities:
- 👁️ **Droopy eyes** → eye-crop detectors can't find pupils
- 🔄 **Head roll / mobile instability** → calibration breaks
- 💡 **Varied lighting** → RGB-based models fail

PriviGaze's student model handles all of these by:
- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting)
- Having a **large receptive field** (handles head movement)
- Weighing in at **~80K parameters** (light enough for on-device inference)

## Architecture

### Teacher (Training Only - Privileged Information)
```
┌─────────────────────────────────────────────────┐
│                 PriviGazeTeacher                │
│                                                 │
│  Left Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d      │
│  Right Eye RGB ─→ ConvNeXtV2-Atto ──→ 256d      │
│                        ↓ (Fusion)               │
│  Face Blurred ──→ ConvNeXtV2-Nano ──→ 256d      │
│  (Grayscale)           ↓ (Cross-Attention)      │
│                  ┌──────────┐                   │
│                  │  Fused   │                   │
│                  │ Features │                   │
│                  │   256d   │                   │
│                  └────┬─────┘                   │
│                  ┌────┴──────┐                  │
│                  │ Pitch│Yaw │                  │
│                  └───────────┘                  │
└─────────────────────────────────────────────────┘
```
- Three privileged inputs: left-eye RGB, right-eye RGB, and a blurred grayscale face
- ConvNeXtV2-Atto (3.7M params) encodes each eye; ConvNeXtV2-Nano (15.6M params) encodes the face
- Cross-attention fusion between the face and eye modalities (sketched below)
- L2CS-Net-style binned regression heads for pitch and yaw
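
One plausible shape for that fusion step is sketched below. This is illustrative only: the head count, the residual + LayerNorm arrangement, and the choice of face-as-query are assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse the pooled face embedding with the two eye embeddings:
    the face feature queries the eye features via multi-head attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face, left_eye, right_eye):
        # Each input is a (B, 256) pooled embedding from its backbone.
        q = face.unsqueeze(1)                           # (B, 1, 256) query
        kv = torch.stack([left_eye, right_eye], dim=1)  # (B, 2, 256) keys/values
        fused, _ = self.attn(q, kv, kv)                 # attend face -> eyes
        return self.norm(face + fused.squeeze(1))       # residual fusion, (B, 256)
```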

### Student (On-Device Inference)
```
┌─────────────────────────────────────────────────┐
│                 PriviGazeStudent                │
│                   ~80K params                   │
│                                                 │
│  Face Grayscale ──→ Light Correction            │
│       ↓                                         │
│  Stem (32ch, /4)                                │
│       ↓                                         │
│  Inception Block → DSConv (/2) → 64ch           │
│       ↓                                         │
│  Inception Block → DSConv (/2) → 96ch           │
│       ↓                                         │
│  Inception Block → DSConv (/2) → 128ch          │
│       ↓                                         │
│  Inception Block → GAP → 160ch                  │
│       ↓                                         │
│  Feature Projection → 128d                      │
│       ↓                                         │
│  ┌────┴──────┐                                  │
│  │ Pitch│Yaw │                                  │
│  └───────────┘                                  │
└─────────────────────────────────────────────────┘
```
- Single input: a 224×224 grayscale face
- **Inception blocks** with factorized convolutions (1×3 + 3×1), as sketched below
- **Depthwise separable convolutions** throughout
- **Learned light correction** (per-image gamma + affine adjustment)
- L2CS-Net-style binned regression heads for pitch and yaw
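
The student's three signature ingredients can be sketched roughly as follows. Channel counts, activation choices, and the exact light-correction parameterization are assumptions for illustration; the repo's model code is authoritative.

```python
import torch
import torch.nn as nn

class LightCorrection(nn.Module):
    """Learned gamma + affine adjustment on the grayscale input."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # gamma = 1 at init
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x in [0, 1], shape (B, 1, H, W)
        x = x.clamp(min=1e-6) ** self.log_gamma.exp()
        return self.scale * x + self.shift

class DSConv(nn.Module):
    """Depthwise separable conv: depthwise 3x3, then pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class FactorizedBranch(nn.Module):
    """One inception branch: a 3x3 conv factorized into 1x3 + 3x1,
    which cuts parameters while keeping the receptive field."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, (1, 3), padding=(0, 1), bias=False),
            nn.Conv2d(c_out, c_out, (3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)
```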

### Distillation Loss

The student learns from the teacher via a multi-component loss:

```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```

| Component | Weight | Description |
|-----------|--------|-------------|
| L_task | 1.0 | L2CS-Net binned regression (CE over bins + MSE on angles) |
| L_angular | 1.0 | Direct L1 on predicted angles, in degrees |
| L_contrast | 0.5 | InfoNCE contrastive matching of student and teacher features |
| L_mmd | 0.1 | Maximum Mean Discrepancy between feature distributions |
| L_logit | 0.5 | KL divergence on softened teacher logits |
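
A minimal sketch of how these terms can be combined is shown below. It assumes student and teacher features have already been projected to a common dimension (the student's 128-d and teacher's 256-d features differ), and the temperature and kernel bandwidth are illustrative defaults, not the repo's settings.

```python
import torch
import torch.nn.functional as F

def logit_distill(s_logits, t_logits, T: float = 4.0):
    """KL divergence on temperature-softened bin logits (L_logit)."""
    return F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

def info_nce(s_feats, t_feats, tau: float = 0.07):
    """InfoNCE (L_contrast): each student feature should match its own
    teacher feature against all other samples in the batch."""
    s = F.normalize(s_feats, dim=-1)
    t = F.normalize(t_feats, dim=-1)
    logits = s @ t.T / tau                        # (B, B) similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

def mmd(s_feats, t_feats, sigma: float = 1.0):
    """Gaussian-kernel Maximum Mean Discrepancy (L_mmd)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    return (k(s_feats, s_feats).mean() + k(t_feats, t_feats).mean()
            - 2 * k(s_feats, t_feats).mean())

def total_loss(l_task, l_angular, s_feats, t_feats, s_logits, t_logits):
    # Weights taken from the table above.
    return (l_task + 1.0 * l_angular
            + 0.5 * info_nce(s_feats, t_feats)
            + 0.1 * mmd(s_feats, t_feats)
            + 0.5 * logit_distill(s_logits, t_logits))
```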

## Training

### Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Train teacher first, then distill to student
python train.py --mode both \
    --batch-size 32 \
    --epochs 100 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints \
    --push-to-hub \
    --hub-model-id BcantCode/privi-gaze-distill
```

### Phase 1: Teacher Pre-training
```bash
python train.py --mode pretrain_teacher \
    --batch-size 32 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints
```

### Phase 2: Student Distillation  
```bash
python train.py --mode distill \
    --teacher-path ./checkpoints/teacher_best.pt \
    --epochs 100 \
    --batch-size 32 \
    --save-dir ./checkpoints
```

## Model Sizes

| Model | Parameters | Input | Use |
|-------|-----------|-------|-----|
| PriviGazeTeacher | ~19M | 2× RGB eyes + blurred grayscale face | Training only |
| PriviGazeStudent | ~80K | 1× grayscale face | On-device inference |
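
To sanity-check these counts against a local build of the models, the standard PyTorch idiom works:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Trainable-parameter count; expect ~19M (teacher) vs ~80K (student)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```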

## Research Foundation

This work builds on:

- **L2CS-Net** (Abdelrahman et al., 2022): Per-angle binned regression for gaze
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze
- **ETH-XGaze** (Zhang et al., 2020): Large-scale gaze dataset with extreme head poses

## Dataset

Development currently uses a **SyntheticGazeDataset**. The generator creates realistic eye crops whose pupil positions encode the gaze direction, along with matching face images.
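
The core idea behind such a generator can be illustrated in a few lines (purely illustrative; the actual `SyntheticGazeDataset` geometry and rendering will differ):

```python
import numpy as np

def render_synthetic_eye(pitch_deg, yaw_deg, size=64, pupil_r=8):
    """Bright 'eye' patch with a dark pupil whose offset from the
    center encodes the gaze angles."""
    img = np.full((size, size), 0.9, dtype=np.float32)
    cx = size / 2 + (size / 4) * np.sin(np.radians(yaw_deg))
    cy = size / 2 + (size / 4) * np.sin(np.radians(pitch_deg))
    yy, xx = np.mgrid[:size, :size]
    img[(xx - cx) ** 2 + (yy - cy) ** 2 < pupil_r ** 2] = 0.1  # pupil
    return img
```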

For production use, the pipeline supports:
- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range

To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a minimal skeleton is sketched below.
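
A hedged skeleton of that class, assuming annotations have already been parsed into `(image_path, pitch_deg, yaw_deg)` tuples (field names and normalization are placeholders; adapt to the real MPIIFaceGaze layout):

```python
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class MPIIGazeDataset(Dataset):
    """Minimal loader: one (grayscale face, pitch/yaw in degrees) pair per sample."""

    def __init__(self, samples, image_size: int = 224):
        self.samples = samples        # list of (path, pitch_deg, yaw_deg)
        self.image_size = image_size

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, pitch, yaw = self.samples[idx]
        img = Image.open(path).convert("L").resize((self.image_size, self.image_size))
        face = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0).unsqueeze(0)
        gaze = torch.tensor([pitch, yaw], dtype=torch.float32)
        return {"face": face, "gaze": gaze}
```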

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)

## License

Apache 2.0

## Citation

```
@software{privi_gaze_2026,
  title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
  year={2026},
  url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```