# PriviGaze: Privileged Distillation for Accessible Gaze Estimation

**On-device gaze estimation designed for people with disabilities.**

PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K params) that estimates gaze direction from just a grayscale face image: no eye crops, no RGB, no calibration needed.

## Why This Matters

Traditional gaze trackers often fail for people with disabilities:
- 👁️ **Droopy eyes** → eye-crop detectors can't find pupils
- 🔄 **Head roll / mobile instability** → calibration breaks
- 💡 **Varied lighting** → RGB-based models fail

PriviGaze's student model handles all of these by:
- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting)
- Having a **large receptive field** (handles head movement)
- Weighing in at **~80K parameters** (light enough for on-device inference)

## Architecture

### Teacher (Training Only - Privileged Information)
```
┌─────────────────────────────────────────────────┐
│                 PriviGazeTeacher                │
│                                                 │
│  Left Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d      │
│  Right Eye RGB ─→ ConvNeXtV2-Atto ──→ 256d      │
│                        ↓ (Fusion)               │
│  Face Blurred ──→ ConvNeXtV2-Nano ──→ 256d      │
│  (Grayscale)           ↓ (Cross-Attention)      │
│                  ┌──────────┐                   │
│                  │  Fused   │                   │
│                  │ Features │                   │
│                  │   256d   │                   │
│                  └────┬─────┘                   │
│                  ┌────┴──────┐                  │
│                  │ Pitch│Yaw │                  │
│                  └───────────┘                  │
└─────────────────────────────────────────────────┘
```
- Three privileged inputs: left-eye RGB, right-eye RGB, and a blurred grayscale face
- ConvNeXtV2-Atto (3.7M params) encodes each eye; ConvNeXtV2-Nano (15.6M params) encodes the face
- Cross-attention fusion between the face and eye modalities (sketched below)
- L2CS-Net-style binned regression heads for pitch and yaw
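
One plausible shape for that fusion step is sketched below. This is illustrative only: the head count, the residual + LayerNorm arrangement, and the choice of face-as-query are assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse the pooled face embedding with the two eye embeddings:
    the face feature queries the eye features via multi-head attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face, left_eye, right_eye):
        # Each input is a (B, 256) pooled embedding from its backbone.
        q = face.unsqueeze(1)                           # (B, 1, 256) query
        kv = torch.stack([left_eye, right_eye], dim=1)  # (B, 2, 256) keys/values
        fused, _ = self.attn(q, kv, kv)                 # attend face -> eyes
        return self.norm(face + fused.squeeze(1))       # residual fusion, (B, 256)
```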

### Student (On-Device Inference)
```
┌─────────────────────────────────────────────────┐
│                 PriviGazeStudent                │
│                   ~80K params                   │
│                                                 │
│  Face Grayscale ──→ Light Correction            │
│       ↓                                         │
│  Stem (32ch, /4)                                │
│       ↓                                         │
│  Inception Block → DSConv (/2) → 64ch           │
│       ↓                                         │
│  Inception Block → DSConv (/2) → 96ch           │
│       ↓                                         │
│  Inception Block → DSConv (/2) → 128ch          │
│       ↓                                         │
│  Inception Block → GAP → 160ch                  │
│       ↓                                         │
│  Feature Projection → 128d                      │
│       ↓                                         │
│  ┌────┴──────┐                                  │
│  │ Pitch│Yaw │                                  │
│  └───────────┘                                  │
└─────────────────────────────────────────────────┘
```
- Single input: a 224×224 grayscale face
- **Inception blocks** with factorized convolutions (1×3 + 3×1), as sketched below
- **Depthwise separable convolutions** throughout
- **Learned light correction** (per-image gamma + affine adjustment)
- L2CS-Net-style binned regression heads for pitch and yaw
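
The student's three signature ingredients can be sketched roughly as follows. Channel counts, activation choices, and the exact light-correction parameterization are assumptions for illustration; the repo's model code is authoritative.

```python
import torch
import torch.nn as nn

class LightCorrection(nn.Module):
    """Learned gamma + affine adjustment on the grayscale input."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # gamma = 1 at init
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x in [0, 1], shape (B, 1, H, W)
        x = x.clamp(min=1e-6) ** self.log_gamma.exp()
        return self.scale * x + self.shift

class DSConv(nn.Module):
    """Depthwise separable conv: depthwise 3x3, then pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class FactorizedBranch(nn.Module):
    """One inception branch: a 3x3 conv factorized into 1x3 + 3x1,
    which cuts parameters while keeping the receptive field."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, (1, 3), padding=(0, 1), bias=False),
            nn.Conv2d(c_out, c_out, (3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)
```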

### Distillation Loss

The student learns from the teacher via a multi-component loss:

```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```

| Component | Weight | Description |
|-----------|--------|-------------|
| L_task | 1.0 | L2CS-Net binned regression (CE over bins + MSE on angles) |
| L_angular | 1.0 | Direct L1 on predicted angles, in degrees |
| L_contrast | 0.5 | InfoNCE contrastive matching of student and teacher features |
| L_mmd | 0.1 | Maximum Mean Discrepancy between feature distributions |
| L_logit | 0.5 | KL divergence on softened teacher logits |
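
A minimal sketch of how these terms can be combined is shown below. It assumes student and teacher features have already been projected to a common dimension (the student's 128-d and teacher's 256-d features differ), and the temperature and kernel bandwidth are illustrative defaults, not the repo's settings.

```python
import torch
import torch.nn.functional as F

def logit_distill(s_logits, t_logits, T: float = 4.0):
    """KL divergence on temperature-softened bin logits (L_logit)."""
    return F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

def info_nce(s_feats, t_feats, tau: float = 0.07):
    """InfoNCE (L_contrast): each student feature should match its own
    teacher feature against all other samples in the batch."""
    s = F.normalize(s_feats, dim=-1)
    t = F.normalize(t_feats, dim=-1)
    logits = s @ t.T / tau                        # (B, B) similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

def mmd(s_feats, t_feats, sigma: float = 1.0):
    """Gaussian-kernel Maximum Mean Discrepancy (L_mmd)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    return (k(s_feats, s_feats).mean() + k(t_feats, t_feats).mean()
            - 2 * k(s_feats, t_feats).mean())

def total_loss(l_task, l_angular, s_feats, t_feats, s_logits, t_logits):
    # Weights taken from the table above.
    return (l_task + 1.0 * l_angular
            + 0.5 * info_nce(s_feats, t_feats)
            + 0.1 * mmd(s_feats, t_feats)
            + 0.5 * logit_distill(s_logits, t_logits))
```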

## Training

### Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Train teacher first, then distill to student
python train.py --mode both \
    --batch-size 32 \
    --epochs 100 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints \
    --push-to-hub \
    --hub-model-id BcantCode/privi-gaze-distill
```

### Phase 1: Teacher Pre-training
```bash
python train.py --mode pretrain_teacher \
    --batch-size 32 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints
```

### Phase 2: Student Distillation  
```bash
python train.py --mode distill \
    --teacher-path ./checkpoints/teacher_best.pt \
    --epochs 100 \
    --batch-size 32 \
    --save-dir ./checkpoints
```

## Model Sizes

| Model | Parameters | Input | Use |
|-------|-----------|-------|-----|
| PriviGazeTeacher | ~19M | 2× RGB eyes + blurred grayscale face | Training only |
| PriviGazeStudent | ~80K | 1× grayscale face | On-device inference |
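
To sanity-check these counts against a local build of the models, the standard PyTorch idiom works:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Trainable-parameter count; expect ~19M (teacher) vs ~80K (student)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```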

## Research Foundation

This work builds on:

- **L2CS-Net** (Abdelrahman et al., 2022): Per-angle binned regression for gaze
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze
- **ETH-XGaze** (Zhang et al., 2020): Large-scale gaze dataset with extreme head poses

## Dataset

Development currently uses a **SyntheticGazeDataset**. The generator creates realistic eye crops whose pupil positions encode the gaze direction, along with matching face images.
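
The core idea behind such a generator can be illustrated in a few lines (purely illustrative; the actual `SyntheticGazeDataset` geometry and rendering will differ):

```python
import numpy as np

def render_synthetic_eye(pitch_deg, yaw_deg, size=64, pupil_r=8):
    """Bright 'eye' patch with a dark pupil whose offset from the
    center encodes the gaze angles."""
    img = np.full((size, size), 0.9, dtype=np.float32)
    cx = size / 2 + (size / 4) * np.sin(np.radians(yaw_deg))
    cy = size / 2 + (size / 4) * np.sin(np.radians(pitch_deg))
    yy, xx = np.mgrid[:size, :size]
    img[(xx - cx) ** 2 + (yy - cy) ** 2 < pupil_r ** 2] = 0.1  # pupil
    return img
```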

For production use, the pipeline supports:
- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range

To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a minimal skeleton is sketched below.
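
A hedged skeleton of that class, assuming annotations have already been parsed into `(image_path, pitch_deg, yaw_deg)` tuples (field names and normalization are placeholders; adapt to the real MPIIFaceGaze layout):

```python
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class MPIIGazeDataset(Dataset):
    """Minimal loader: one (grayscale face, pitch/yaw in degrees) pair per sample."""

    def __init__(self, samples, image_size: int = 224):
        self.samples = samples        # list of (path, pitch_deg, yaw_deg)
        self.image_size = image_size

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, pitch, yaw = self.samples[idx]
        img = Image.open(path).convert("L").resize((self.image_size, self.image_size))
        face = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0).unsqueeze(0)
        gaze = torch.tensor([pitch, yaw], dtype=torch.float32)
        return {"face": face, "gaze": gaze}
```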

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)

## License

Apache 2.0

## Citation

```
@software{privi_gaze_2026,
  title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
  year={2026},
  url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```