# PriviGaze: Privileged Distillation for Accessible Gaze Estimation

**On-device gaze estimation designed for people with disabilities.**

PriviGaze uses **privileged knowledge distillation** to train an ultra-compact student model (~80K params) that estimates gaze direction from just a grayscale face image: no eye crops, no RGB, no calibration needed.

## Why This Matters

Traditional gaze trackers fail for people with disabilities:
- 👁️ **Droopy eyes** → eye-crop detectors can't find pupils
- 🔄 **Head roll / mobile instability** → calibration breaks
- 💡 **Varied lighting** → RGB-based models fail

PriviGaze's student model handles all of these by:
- Working from the **full face** (no precise eye detection needed)
- Using **grayscale only** (robust to lighting)
- Having a **large receptive field** (handles head movement)
- Being **~80K parameters** (runs on any device)

## Architecture

### Teacher (Training Only - Privileged Information)
```
┌─────────────────────────────────────────────────┐
│                PriviGazeTeacher                 │
│                                                 │
│  Left Eye RGB  ──→ ConvNeXtV2-Atto ──→ 256d     │
│  Right Eye RGB ──→ ConvNeXtV2-Atto ──→ 256d     │
│                      ↓ (Fusion)                 │
│  Face Blurred  ──→ ConvNeXtV2-Nano ──→ 256d     │
│  (Grayscale)         ↓ (Cross-Attention)        │
│                 ┌──────────┐                    │
│                 │  Fused   │                    │
│                 │ Features │                    │
│                 │   256d   │                    │
│                 └────┬─────┘                    │
│                ┌─────┴───────┐                  │
│                │ Pitch │ Yaw │                  │
│                └─────────────┘                  │
└─────────────────────────────────────────────────┘
```
- 3 privileged inputs: left eye RGB, right eye RGB, blurred grayscale face
- ConvNeXtV2-Atto (3.7M params) for the eyes, ConvNeXtV2-Nano (15.6M params) for the face
- Cross-attention fusion between the face and eye modalities
- L2CS-Net-style binned regression
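
For intuition, this is roughly how such a fused teacher could be wired up with `timm` backbones. The class, projection sizes, and bin count below are illustrative assumptions, not the repository's actual modules:

```python
# Illustrative sketch only: module names, projection sizes, and head shapes
# are assumptions, not the repo's actual teacher implementation.
import torch
import torch.nn as nn
import timm

class TeacherSketch(nn.Module):
    def __init__(self, dim=256, n_bins=90):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        # A single shared eye encoder matches the ~19M total (3.7M + 15.6M).
        self.eye_net = timm.create_model("convnextv2_atto", num_classes=0)
        self.face_net = timm.create_model("convnextv2_nano", num_classes=0, in_chans=1)
        self.eye_proj = nn.Linear(self.eye_net.num_features, dim)
        self.face_proj = nn.Linear(self.face_net.num_features, dim)
        # Face features attend over the fused eye features
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pitch_head = nn.Linear(dim, n_bins)  # L2CS-style per-angle bins
        self.yaw_head = nn.Linear(dim, n_bins)

    def forward(self, left_eye, right_eye, face_blurred):
        # Fusion of the two eye streams: shared encoder, summed projections
        eyes = self.eye_proj(self.eye_net(left_eye)) + self.eye_proj(self.eye_net(right_eye))
        face = self.face_proj(self.face_net(face_blurred))
        fused, _ = self.cross_attn(
            face.unsqueeze(1), eyes.unsqueeze(1), eyes.unsqueeze(1)
        )
        fused = fused.squeeze(1)  # (B, 256) fused features
        return self.pitch_head(fused), self.yaw_head(fused)
```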

### Student (On-Device Inference)
```
┌─────────────────────────────────────────────────┐
│                PriviGazeStudent                 │
│                  ~80K params                    │
│                                                 │
│  Face Grayscale ──→ Light Correction            │
│                        ↓                        │
│                 Stem (32ch, /4)                 │
│                        ↓                        │
│     Inception Block → DSConv (/2) → 64ch        │
│                        ↓                        │
│     Inception Block → DSConv (/2) → 96ch        │
│                        ↓                        │
│     Inception Block → DSConv (/2) → 128ch       │
│                        ↓                        │
│     Inception Block → GAP → 160ch               │
│                        ↓                        │
│           Feature Projection → 128d             │
│                        ↓                        │
│                  ┌─────┴───────┐                │
│                  │ Pitch │ Yaw │                │
│                  └─────────────┘                │
└─────────────────────────────────────────────────┘
```
- 1 input: grayscale face (224×224)
- **Inception blocks** with factorized convolutions (1×3 + 3×1)
- **Depthwise separable convolutions** throughout
- **Learned light correction** (gamma + affine)
- L2CS-Net-style binned regression
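
A hedged sketch of what these building blocks typically look like in PyTorch; the exact layer ordering, activations, and normalization choices here are assumptions:

```python
# Sketch of the student's building blocks -- layer ordering and activations
# are assumptions, not the repo's actual code.
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable conv: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class FactorizedBranch(nn.Module):
    """Inception-style branch with a 3x3 conv factorized as 1x3 + 3x1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, (1, 3), padding=(0, 1), bias=False),
            nn.Conv2d(c_out, c_out, (3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        )

    def forward(self, x):
        return self.conv(x)

class LightCorrection(nn.Module):
    """Learned gamma + affine correction on the grayscale input."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # exp() keeps gamma positive
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x expected in [0, 1]
        return self.scale * x.clamp(min=1e-6).pow(self.log_gamma.exp()) + self.shift
```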

### Distillation Loss

The student learns from the teacher via a multi-component loss:

```
L_total = L_task + α_angular·L_angular + α_contrast·L_contrast + α_mmd·L_mmd + α_logit·L_logit
```

| Component  | Weight | Description                                    |
|------------|--------|------------------------------------------------|
| L_task     | 1.0    | L2CS-Net binned regression (CE + MSE)          |
| L_angular  | 1.0    | Direct L1 in degrees                           |
| L_contrast | 0.5    | InfoNCE contrastive feature matching           |
| L_mmd      | 0.1    | Maximum Mean Discrepancy distribution matching |
| L_logit    | 0.5    | KL divergence on soft targets                  |
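
As a concrete reference, the weighted sum could be assembled as below. The temperature, kernel bandwidth, and tensor shapes are assumptions, and features are assumed to be projected to a shared dimension before matching:

```python
# Hedged sketch of the combined objective; weights mirror the table above.
import torch
import torch.nn.functional as F

def logit_kl(student_logits, teacher_logits, T=4.0):
    """KL on temperature-softened bin logits (scaled by T^2, standard KD)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def info_nce(student_feat, teacher_feat, tau=0.07):
    """InfoNCE: each student feature should match its own teacher feature."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    logits = s @ t.T / tau                      # (B, B) cosine similarities
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)

def rbf_mmd(x, y, sigma=1.0):
    """Maximum Mean Discrepancy with a single RBF kernel (biased estimate)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def distill_loss(l_task, s_logits, t_logits, s_feat, t_feat, s_deg, t_deg):
    # l_task: L2CS binned regression (CE + MSE) vs. ground truth, computed elsewhere.
    # s_logits/t_logits: bin logits for one angle (apply to pitch and yaw, then sum).
    # s_deg/t_deg: continuous angle predictions in degrees.
    l_angular = F.l1_loss(s_deg, t_deg)
    return (l_task
            + 1.0 * l_angular
            + 0.5 * info_nce(s_feat, t_feat)
            + 0.1 * rbf_mmd(s_feat, t_feat)
            + 0.5 * logit_kl(s_logits, t_logits))
```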

## Training

### Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Train teacher first, then distill to student
python train.py --mode both \
    --batch-size 32 \
    --epochs 100 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints \
    --push-to-hub \
    --hub-model-id BcantCode/privi-gaze-distill
```

### Phase 1: Teacher Pre-training
```bash
python train.py --mode pretrain_teacher \
    --batch-size 32 \
    --teacher-epochs 50 \
    --save-dir ./checkpoints
```

### Phase 2: Student Distillation
```bash
python train.py --mode distill \
    --teacher-path ./checkpoints/teacher_best.pt \
    --epochs 100 \
    --batch-size 32 \
    --save-dir ./checkpoints
```

## Model Sizes

| Model            | Parameters | Input                     | Use                 |
|------------------|------------|---------------------------|---------------------|
| PriviGazeTeacher | ~19M       | 2×RGB eyes + blurred face | Training only       |
| PriviGazeStudent | ~80K       | 1×grayscale face          | On-device inference |
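
For on-device use, student inference might look roughly like this. The import path, checkpoint name, output format, and preprocessing are assumptions, not a documented API:

```python
# Hypothetical inference sketch -- import path, checkpoint name, and output
# format are assumptions about this repo, not a documented API.
import torch
from PIL import Image
from torchvision import transforms

from models.student import PriviGazeStudent  # assumed module path

model = PriviGazeStudent()
model.load_state_dict(torch.load("checkpoints/student_best.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

face = preprocess(Image.open("face.jpg")).unsqueeze(0)  # (1, 1, 224, 224)
with torch.no_grad():
    # Assumed to return per-angle outputs; the actual head may emit bin
    # logits that need decoding to degrees.
    pitch, yaw = model(face)
```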

## Research Foundation

This work builds on:

- **L2CS-Net** (Abdelrahman et al., 2022): Per-angle binned regression for gaze
- **GazeGen / DFT Gaze** (Hsieh et al., 2024): 281K-parameter gaze model distilled from a 10× larger teacher
- **WCoRD** (Chen et al., 2020): Wasserstein contrastive representation distillation
- **One Eye is All You Need** (Athavale et al., 2022): Inception networks for lightweight gaze
- **ETH-XGaze** (Zhang et al., 2020): Large-scale gaze dataset with extreme head poses

## Dataset

Currently uses **SyntheticGazeDataset** for development. The synthetic generator creates realistic eye crops with pupil positions encoding gaze direction, plus face images with corresponding features.

For production use, the pipeline supports:
- **MPIIFaceGaze**: 15 subjects, face crops + eye patches + 3D gaze
- **ETH-XGaze**: 110 subjects, extreme head poses, 1.1M images (gold standard)
- **Gaze360**: 238 subjects, 360° gaze range

To use real datasets, implement the `MPIIGazeDataset` class in `models/dataset.py`; a possible skeleton is sketched below.
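
This is a minimal skeleton under assumptions about the MPIIFaceGaze layout, the eye-crop size, and the dict keys the training loop expects; only the indexing logic needs to be filled in:

```python
# Hypothetical skeleton: the directory walk, dict keys, and crop sizes are
# assumptions, not the repo's actual contract.
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MPIIGazeDataset(Dataset):
    def __init__(self, root, subjects):
        # Each sample: (face_path, left_eye_path, right_eye_path, pitch_deg, yaw_deg)
        self.samples = self._index(root, subjects)
        self.face_tf = transforms.Compose([
            transforms.Grayscale(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        self.eye_tf = transforms.Compose([
            transforms.Resize((64, 64)),  # eye-crop size is an assumption
            transforms.ToTensor(),
        ])

    def _index(self, root, subjects):
        raise NotImplementedError("Walk the MPIIFaceGaze layout; collect paths + labels")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        face, left, right, pitch, yaw = self.samples[idx]
        return {
            "face_gray": self.face_tf(Image.open(face)),
            "left_eye": self.eye_tf(Image.open(left).convert("RGB")),
            "right_eye": self.eye_tf(Image.open(right).convert("RGB")),
            "gaze": torch.tensor([pitch, yaw], dtype=torch.float32),
        }
```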

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.0
- Transformers ≥ 4.40
- CUDA-capable GPU (for training)

## License

Apache 2.0

## Citation

```
@software{privi_gaze_2026,
  title={PriviGaze: Privileged Distillation for Accessible Gaze Estimation},
  year={2026},
  url={https://huggingface.co/BcantCode/privi-gaze-distill}
}
```