# FaceDet – Production Face Detection for Video
> **SCRFD-family detectors + ByteTrack tracking + temporal smoothing**
> Optimized for tiny faces, blur, occlusion, poor lighting, compression, and temporal stability.
## Architecture Survey & Design Decisions
### Ranked Candidate Models (WiderFace Hard AP)
| Rank | Model | Easy | Medium | Hard | GFLOPs | FPS (V100) | Year | Selected? |
|------|-------|------|--------|------|--------|-----------|------|-----------|
| 1 | ASFD-D6 | 97.2 | 96.5 | **92.5** | ~500 | ~7 | 2022 | ✗ (too slow) |
| 2 | TinaFace-R50+TTA | 96.1 | 95.5 | **92.4** | ~42K (MS) | ~3 | 2020 | ✗ (TTA-dependent) |
| 3 | TinaFace-R50 (single) | 95.9 | 95.2 | **92.1** | 508 | ~15 | 2020 | ✗ (not efficient) |
| 4 | RetinaFace-R152+MS | 96.9 | 96.1 | **91.8** | High | 13 | 2019 | ✗ (heavy backbone) |
| 5 | MOS-L (R152) | 96.9 | 96.1 | **92.1** | Multi-scale | ~16 | 2021 | ✗ (heavy) |
| 6 | DSFD | 96.6 | 95.7 | **90.4** | ~1532 | – | 2018 | ✗ (outdated) |
| **7** | **SCRFD-34GF** | **96.1** | **95.0** | **85.2** | **34** | **~80** | **2021** | **✓ Flagship** |
| **8** | **SCRFD-10GF** | **95.2** | **93.9** | **83.1** | **10** | **~140** | **2021** | **✓ Balanced** |
| **9** | **SCRFD-2.5GF** | **93.8** | **92.2** | **77.9** | **2.5** | **~400** | **2021** | **✓ Real-time** |
| **10** | **SCRFD-0.5GF** | **90.6** | **88.1** | **68.5** | **0.5** | **~1000** | **2021** | **✓ Mobile** |
| 11 | RetinaFace-MN0.25 | 91.4 | 90.1 | 78.2 | ~1 | 60 CPU | 2019 | ✗ (SCRFD-2.5G better) |
| 12 | YuNet | 85.6 | 84.2 | 72.7 | 2.5 | 77 CPU | 2021 | ✗ (lower AP) |
### Why SCRFD?
**The SCRFD family achieves the best accuracy-efficiency Pareto frontier for face detection.** The key findings:
1. **3.86% better Hard AP** than TinaFace at 3× the speed (SCRFD-34GF vs TinaFace-R50, both tested at VGA resolution per the SCRFD paper)
2. **No ImageNet pretraining needed** – trains from scratch in 640 epochs
3. **Scalable family** – the same architecture principles apply from 0.5 to 34 GFLOPs
4. **Two orthogonal innovations**: Sample Redistribution (augmentation) + Computation Redistribution (NAS)
Higher-ranked models (ASFD-D6, TinaFace+TTA) achieve marginally better Hard AP but at **10-100× the compute cost**, making them impractical for video.
### Key Technical Insights From Literature
| Finding | Source | Impact |
|---------|--------|--------|
| Large-scale crops [0.3–2.0] increase stride-8 positives from 72K→118K | SCRFD §3.2 | +5-8% Hard AP |
| GFL jointly trains quality + classification → better score calibration | SCRFD §3.1 | +1-2% Hard AP |
| Lower IoU matching threshold (0.35 vs 0.5) helps tiny faces | TinaFace §3.3 | +2% Hard AP |
| GroupNorm > BatchNorm at small batch sizes | TinaFace §3.2 | Stable training |
| 5-point landmark auxiliary loss improves detection by ~1% | RetinaFace §4.2 | +1% Hard AP |
| WiderFace Hard is near-saturated (~92% AP ceiling) | Survey '21 | Focus on efficiency |
| No published improvements to WiderFace Hard since 2022 | Literature scan | Benchmark mature |
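To make the first row concrete, here is a hedged sketch of the large-scale square-crop augmentation ("Sample Redistribution") it refers to. The repository's actual version lives in `data/augmentations.py`; the helper below and its argument names are illustrative only, and the scale set should come from the config rather than be hard-coded.

```python
import random

import cv2
import numpy as np

def random_square_crop(image, boxes, scales, out_size=640):
    """Square crop whose side is a random multiple of the short image side.

    `scales` is the configured crop-scale set (e.g. the [0.3 ... 2.0] list above).
    Scales above 1.0 shrink faces relative to the 640x640 training canvas, which
    pushes more ground-truth faces onto the stride-8 level (SCRFD Sec. 3.2).
    """
    h, w = image.shape[:2]
    side = int(round(random.choice(scales) * min(h, w)))
    x0 = random.randint(0, max(w - side, 0))
    y0 = random.randint(0, max(h - side, 0))
    # Crops larger than the image spill past the border; pad the canvas with zeros.
    canvas = np.zeros((side, side, 3), dtype=image.dtype)
    patch = image[y0:min(y0 + side, h), x0:min(x0 + side, w)]
    canvas[:patch.shape[0], :patch.shape[1]] = patch
    # Shift boxes into crop coordinates and keep those whose centers survive.
    shifted = boxes.astype(np.float32) - np.array([x0, y0, x0, y0], dtype=np.float32)
    centers = (shifted[:, :2] + shifted[:, 2:]) / 2
    keep = ((centers >= 0) & (centers < side)).all(axis=1)
    scale = out_size / side
    return cv2.resize(canvas, (out_size, out_size)), shifted[keep] * scale
```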
---
## Model Zoo
| Model | WiderFace (E/M/H) | GFLOPs | Params | FPS (V100 VGA) | Use Case |
|-------|-------------------|--------|--------|-----------------|----------|
| `scrfd_34g` | 96.1/95.0/85.2 | 34 | 9.80M | ~80 | Flagship quality |
| `scrfd_10g` | 95.2/93.9/83.1 | 10 | 3.86M | ~140 | Balanced |
| `scrfd_2.5g` | 93.8/92.2/77.9 | 2.5 | 0.67M | ~400 | Real-time video |
| `scrfd_0.5g` | 90.6/88.1/68.5 | 0.5 | 0.57M | ~1000 | Mobile/edge |
---
## Architecture
```
Input Image (640×640)
          │
          ▼
┌────────────────────────────────────────────┐
│ BACKBONE (NAS-searched ResNet-style)       │
│ Stem → S1 → S2 → S3 → S4                   │
│ s=4    s=4  s=8  s=16 s=32                 │
│             │C3  │C4  │C5                  │
└─────────────┼────┼────┼────────────────────┘
              │    │    │
┌─────────────▼────▼────▼────────────────────┐
│ PAFPN (Path Aggregation FPN)               │
│ top-down (FPN):   P3 ← P4 ← P5             │
│ bottom-up (PAN):  P3 → P4 → P5             │
│ strides:          s=8  s=16 s=32           │
└─────────────────────┬──────────────────────┘
                      │ P3 / P4 / P5
┌─────────────────────▼──────────────────────┐
│ SHARED HEAD (per level, weight-shared)     │
│ CLS (GFL) A×1   REG (DIoU) A×4  [LMK A×10] │
└──────────┬──────────────────────┬──────────┘
           │                      │
           ▼                      ▼
  ┌─────────────────┐    ┌─────────────────┐
  │   ATSS Match    │    │   NMS (θ=0.4)   │
  │   (training)    │    │   (inference)   │
  └─────────────────┘    └─────────────────┘
```
**Anchors (per level)** (a generation sketch follows the list):
- Stride 8: `[16, 32]` – small faces (≥16 px)
- Stride 16: `[64, 128]` – medium faces
- Stride 32: `[256, 512]` – large faces
- Aspect ratio: 1.0 (faces are roughly square)
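A minimal, stand-alone sketch of how these square anchors tile a 640×640 input. The repository's real implementation is `models/anchor.py`; the names below are illustrative.

```python
import torch

# Stride -> anchor side lengths, matching the list above (aspect ratio 1.0).
ANCHOR_SIZES = {8: (16, 32), 16: (64, 128), 32: (256, 512)}

def make_anchors(img_size: int = 640):
    """Return a dict {stride: (N, 4) xyxy anchors}."""
    anchors = {}
    for stride, sizes in ANCHOR_SIZES.items():
        feat = img_size // stride                     # feature-map side length
        ys, xs = torch.meshgrid(
            torch.arange(feat), torch.arange(feat), indexing="ij"
        )
        # Anchor centers sit on the stride grid, one cell per feature location.
        cx = (xs.flatten() + 0.5) * stride
        cy = (ys.flatten() + 0.5) * stride
        boxes = []
        for s in sizes:                               # A = 2 square anchors per cell
            half = s / 2.0
            boxes.append(torch.stack([cx - half, cy - half, cx + half, cy + half], dim=1))
        anchors[stride] = torch.cat(boxes, dim=0)
    return anchors

# At 640x640: 80*80*2 = 12800 stride-8 anchors, 3200 at stride 16, 800 at stride 32.
```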
---
## Video Pipeline
```
Frame → Detector (SCRFD) → ByteTrack Tracker → Temporal Smoother → Output
                ↓                  ↓                   ↓
         Per-frame boxes    Track IDs (stable)   Jitter-free boxes
           + scores         + Kalman prediction  + Score momentum
           + landmarks      + 2-stage matching   + Adaptive EMA
```
**ByteTrack** (Zhang et al., 2022): uses ALL detections (high and low confidence) in a two-stage association. Low-confidence detections recover partially occluded faces that traditional trackers would drop. A simplified sketch of the association step follows.
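The sketch below is a greedy approximation of that two-stage matching with illustrative thresholds; the real tracker in `engine/tracker.py` additionally uses Kalman-predicted boxes and Hungarian assignment.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area = lambda r: max(r[2] - r[0], 0) * max(r[3] - r[1], 0)
    return inter / (area(a) + area(b) - inter + 1e-6)

def associate(tracks, detections, scores, high_thr=0.5, iou_thr=0.3):
    """Greedy sketch of ByteTrack's two-stage matching (thresholds illustrative).

    Stage 1 matches tracks to high-confidence detections; stage 2 matches the
    *leftover* tracks to low-confidence detections, which is how partially
    occluded faces keep their track IDs.
    """
    high = [i for i, s in enumerate(scores) if s >= high_thr]
    low = [i for i, s in enumerate(scores) if s < high_thr]
    matches, unmatched_tracks = [], list(range(len(tracks)))
    for det_pool in (high, low):                  # stage 1, then stage 2
        for d in det_pool:
            if not unmatched_tracks:
                break
            best = max(unmatched_tracks, key=lambda t: iou(tracks[t], detections[d]))
            if iou(tracks[best], detections[d]) >= iou_thr:
                matches.append((best, d))
                unmatched_tracks.remove(best)
    return matches, unmatched_tracks
```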
**Temporal Smoother**: adaptive EMA whose smoothing factor scales with motion magnitude (see the sketch after this list):
- Static faces → heavy smoothing (α≈0.3) → no jitter
- Fast-moving faces → light smoothing (α≈0.9) → no lag
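A minimal sketch of such an adaptive EMA, assuming α is the weight on the newest box and motion is measured as center shift relative to box size. The actual smoother is `engine/temporal.py`; only the α bounds above are taken from the text, everything else here is illustrative.

```python
import numpy as np

class AdaptiveEMASmoother:
    """Adaptive EMA: small alpha = heavy smoothing for static faces,
    large alpha = fast response for moving faces."""

    def __init__(self, alpha_min=0.3, alpha_max=0.9, motion_scale=0.05):
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.motion_scale = motion_scale   # relative motion that saturates alpha (illustrative)
        self.state = {}                    # track_id -> smoothed xyxy box

    def update(self, track_id: int, box: np.ndarray) -> np.ndarray:
        prev = self.state.get(track_id)
        if prev is None:
            self.state[track_id] = box.astype(np.float64)
            return box
        # Motion magnitude: center shift normalized by the previous box diagonal.
        shift = np.linalg.norm((box[:2] + box[2:]) / 2 - (prev[:2] + prev[2:]) / 2)
        diag = np.linalg.norm(prev[2:] - prev[:2]) + 1e-6
        t = min(shift / diag / self.motion_scale, 1.0)
        alpha = self.alpha_min + t * (self.alpha_max - self.alpha_min)
        self.state[track_id] = alpha * box + (1 - alpha) * prev
        return self.state[track_id]
```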
---
## Quick Start
### Installation
```bash
pip install -r requirements.txt
```
### Detect faces in a video
```python
from facedet import VideoFaceDetector
detector = VideoFaceDetector(
    model_path='checkpoints/scrfd_34g_best.pth',
    model_name='scrfd_34g',
    device='cuda',
    use_tracking=True,
    use_smoothing=True,
)

# Process video file
stats = detector.process_video(
    source='input.mp4',
    output_path='output.mp4',
    show=True,
)
# → {total_frames: 1200, avg_fps: 75.3, avg_faces_per_frame: 4.2}
```
### Detect faces in a single image
```python
from facedet import build_detector
import cv2, torch
model = build_detector('scrfd_34g').cuda().eval()
# Load checkpoint...
img = cv2.imread('photo.jpg')
# Preprocess `img` into a batched float tensor (resize + normalize);
# see scripts/evaluate.py for the full pipeline
results = model(tensor)
# → [{'boxes': tensor([...]), 'scores': tensor([...])}]
```
### Real-time webcam
```bash
python scripts/detect_video.py \
--model scrfd_2.5g \
--checkpoint checkpoints/scrfd_2.5g_best.pth \
--input 0 --show
```
---
## Training
### Dataset Setup
Download [WIDER FACE](http://shuoyang1213.me/WIDERFACE/) and arrange:
```
data/wider_face/
├── WIDER_train/images/
├── WIDER_val/images/
├── wider_face_split/
│   ├── wider_face_train_bbx_gt.txt
│   └── wider_face_val_bbx_gt.txt
└── retinaface_gt/            (optional, for landmark training)
    ├── train/label.txt
    └── val/label.txt
```
### Training Commands
```bash
# Single GPU β€” SCRFD-34G (flagship)
python scripts/train.py \
--model scrfd_34g \
--data-root data/wider_face \
--epochs 640 \
--batch-size 8 \
--lr 0.01
# Multi-GPU – 4× V100
torchrun --nproc_per_node=4 scripts/train.py \
--model scrfd_34g \
--data-root data/wider_face \
--epochs 640 \
--batch-size 8 \
--lr 0.01
# Real-time variant
python scripts/train.py \
--model scrfd_2.5g \
--data-root data/wider_face \
--epochs 640 \
--batch-size 16 \
--lr 0.02
```
### Training Recipe (from SCRFD paper)
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | SGD (m=0.9, wd=5e-4) | Standard for detection |
| Base LR | 0.01 (8 imgs/GPU) | Linear scaling rule |
| LR Schedule | MultiStep [440, 544] ×0.1 | Long training, late decay |
| Warmup | 3 epochs, linear from 1e-5 | Prevent early divergence |
| Total Epochs | 640 | Train from scratch |
| Input Size | 640Γ—640 | Random crop from larger |
| Crop Scales | [0.3, 0.45, ..., 2.0] | **Sample Redistribution** |
| Augmentation | Crop + flip + photometric + robustness | See data/augmentations.py |
| Normalization | GroupNorm | Batch-size independent |
| Matching | ATSS (k=9) | Adaptive thresholds |
| Cls Loss | GFL (β=2) | Joint quality score |
| Reg Loss | DIoU | Better for tiny faces |
| Mixed Precision | ✓ | 2× training speed |
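For reference, a minimal sketch of how the optimizer and schedule rows above could be wired up in plain PyTorch. `scripts/train.py` is the real entry point; the functions below are purely illustrative.

```python
import torch

def build_optim(model, base_lr=0.01):
    """SGD (m=0.9, wd=5e-4) with 0.1x decay at epochs 440 and 544 of 640."""
    optimizer = torch.optim.SGD(
        model.parameters(), lr=base_lr, momentum=0.9, weight_decay=5e-4
    )
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[440, 544], gamma=0.1
    )
    return optimizer, scheduler

def warmup_lr(optimizer, epoch, base_lr=0.01, warmup_epochs=3, start_lr=1e-5):
    """Linear warmup from 1e-5 to the base LR over the first 3 epochs
    (shown per-epoch here for brevity)."""
    if epoch < warmup_epochs:
        lr = start_lr + (base_lr - start_lr) * epoch / warmup_epochs
        for group in optimizer.param_groups:
            group['lr'] = lr

# Per epoch: warmup_lr(...), train one epoch, then scheduler.step().
```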
---
## Evaluation
```bash
python scripts/evaluate.py \
--model scrfd_34g \
--checkpoint checkpoints/scrfd_34g_best.pth \
--data-root data/wider_face \
--output-dir results/scrfd_34g \
--benchmark
```
Generates:
- WiderFace Easy/Medium/Hard AP scores
- Predictions in WiderFace submission format
- Speed benchmark table (320/480/640/960px)
---
## Deployment
### ONNX Export
```bash
python scripts/export.py \
--model scrfd_34g \
--checkpoint checkpoints/scrfd_34g_best.pth \
--output deploy/scrfd_34g.onnx \
--input-size 640
```
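A quick sanity check of the exported graph with ONNX Runtime. The input/output names depend on `scripts/export.py`, so the snippet queries them instead of hard-coding; install `onnxruntime-gpu` and add `CUDAExecutionProvider` for GPU benchmarking.

```python
import numpy as np
import onnxruntime as ort

# CPU is enough to verify the exported graph produces sane output shapes.
sess = ort.InferenceSession('deploy/scrfd_34g.onnx', providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name

dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run(None, {input_name: dummy})
print([o.shape for o in outputs])   # per-level cls / reg (and optional landmark) maps
```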
### TensorRT (FP16)
```bash
trtexec --onnx=deploy/scrfd_34g.onnx \
--saveEngine=deploy/scrfd_34g_fp16.engine \
--fp16 --workspace=4096
```
### Expected Deployment Speedups
| Model | PyTorch FP32 | ONNX RT | TensorRT FP16 | TensorRT INT8 |
|-------|-------------|---------|----------------|---------------|
| SCRFD-34G | ~80 FPS | ~100 FPS | ~200 FPS | ~350 FPS |
| SCRFD-2.5G | ~400 FPS | ~500 FPS | ~800 FPS | ~1200 FPS |
| SCRFD-0.5G | ~1000 FPS | ~1200 FPS | ~2000 FPS | ~3000 FPS |
### PyTorch Quantization (CPU)
```python
from facedet.deploy import quantize_model
quantized = quantize_model(model, method='dynamic')
```
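For reference, this is roughly what a `'dynamic'` pass amounts to in stock PyTorch; the repo's `quantize_model` wrapper may differ. Note that dynamic quantization only covers `nn.Linear`-style modules, so conv layers stay in float unless static/QAT quantization is used.

```python
import torch
from torch.ao.quantization import quantize_dynamic

# int8 weights for Linear modules, activations quantized on the fly (CPU only).
model_cpu = model.cpu().eval()
quantized = quantize_dynamic(model_cpu, {torch.nn.Linear}, dtype=torch.qint8)
```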
---
## Ablation Studies
Configured in `configs/ablations.yaml`. Each ablation isolates one variable:
| Ablation | Variables | Expected Finding |
|----------|-----------|-----------------|
| **Sample Redistribution** | Crop scales [0.3–1.0] vs [0.3–2.0] | +5-8% Hard AP from large crops |
| **Loss Functions** | GFL+DIoU vs Focal+SmoothL1 | GFL: +1-2% from quality scores |
| **Matching Strategy** | ATSS(k=9) vs IoU(0.35) vs IoU(0.5) | ATSS: best for mixed scales |
| **Robustness Augmentation** | None / blur / JPEG / all | All: +1-3% on degraded inputs |
| **Normalization** | GroupNorm vs BatchNorm | GN: stable at batch<8 |
| **Input Resolution** | 320 / 480 / 640 / 960 | 960: +5-10% Hard AP, 4× slower |
| **Landmarks** | With/without 5-point landmarks | +~1% Hard AP (RetinaFace finding) |
| **Tracker Config** | None / conservative / aggressive | Aggressive: more tracks, more FP |
---
## Handling Challenging Conditions
### Tiny Faces (<16px)
- **Sample Redistribution** (crop scale up to 2.0×) generates small-face training samples
- Stride-8 feature maps with anchors [16, 32]px
- Higher inference resolution (960px) trades speed for +5-10% small face recall
- ATSS matching gives tiny faces lower IoU thresholds automatically
### Blur / Motion Blur
- **Training augmentation**: Gaussian blur σ∈[0.5, 3.0] applied with p=0.2
- Model learns blur-invariant features
- ByteTrack Kalman filter predicts through blurred frames
### Occlusion
- **Random erasing** (Cutout) during training simulates partial occlusion
- ATSS assigns multiple anchors per GT → partially visible faces still receive training signal
- ByteTrack 2nd-stage matching recovers occluded faces with low-confidence detections
### Poor Lighting
- **Gamma darkening** augmentation (γ∈[1.5, 3.0]) simulates low-light
- Photometric distortion (brightness, contrast jitter)
- For extreme cases: pair with CLAHE preprocessing (see the sketch below)
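A possible CLAHE preprocessing step for that extreme low-light case; the parameters are common OpenCV defaults, not tuned values from this repo.

```python
import cv2

def clahe_preprocess(frame_bgr, clip_limit=2.0, tile_grid=(8, 8)):
    """Optional low-light preprocessing: CLAHE on the luma channel only,
    so colors are preserved."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```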
### Compression Artifacts
- **JPEG quality** degradation (Q=20-80) during training
- We are not aware of a published face-detection method that addresses this directly; the augmentation appears to be novel for face detection (sketch below)
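A sketch of that JPEG-degradation augmentation; the quality range mirrors the values quoted above, while the apply probability is illustrative. The repo's version lives in `data/augmentations.py`.

```python
import random

import cv2
import numpy as np

def random_jpeg_degradation(image_bgr, q_range=(20, 80), p=0.5):
    """Training-time JPEG artifact augmentation: re-encode at a random quality."""
    if random.random() > p:
        return image_bgr
    quality = random.randint(*q_range)
    ok, buf = cv2.imencode('.jpg', image_bgr, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR) if ok else image_bgr
```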
### Temporal Stability
- **ByteTrack**: stable track IDs across frames, handles occlusion
- **Kalman filter**: smooth trajectory prediction
- **Temporal EMA**: adaptive smoothing eliminates box jitter
- **Keyframe strategy**: full detection every N frames, tracker-only in between
---
## Repository Structure
```
facedet/
├── README.md                 # This file
├── setup.py                  # Package installation
├── requirements.txt          # Dependencies
│
├── models/                   # Model architectures
│   ├── backbone.py           # NAS-searched ResNet backbones
│   ├── neck.py               # PAFPN feature pyramid
│   ├── head.py               # Shared detection head (cls/reg/lmk)
│   ├── anchor.py             # Anchor generation + ATSS matching
│   ├── losses.py             # GFL, DIoU, Focal, Landmark losses
│   └── detector.py           # Full SCRFD detector (train + inference)
│
├── data/                     # Data pipeline
│   ├── widerface.py          # WiderFace dataset loader
│   ├── augmentations.py      # Training/val/robustness augmentations
│   └── dataloader.py         # DataLoader builders
│
├── engine/                   # Video inference engine
│   ├── video_detector.py     # End-to-end video processing
│   ├── tracker.py            # ByteTrack face tracker
│   └── temporal.py           # Temporal EMA smoother
│
├── evaluation/               # Evaluation suite
│   ├── widerface_eval.py     # WiderFace protocol (Easy/Med/Hard AP)
│   ├── speed_benchmark.py    # Latency/throughput benchmarks
│   └── metrics.py            # Core metrics (AP, IoU, recall)
│
├── deploy/                   # Deployment
│   ├── export_onnx.py        # ONNX export + verification
│   └── optimize.py           # Quantization, TensorRT guide
│
├── configs/                  # Configuration files
│   ├── scrfd_34g.yaml        # Flagship (quality)
│   ├── scrfd_10g.yaml        # Balanced
│   ├── scrfd_2.5g.yaml       # Real-time
│   ├── scrfd_0.5g.yaml       # Mobile
│   └── ablations.yaml        # Ablation study configs
│
├── scripts/                  # Entry points
│   ├── train.py              # Training (single/multi-GPU)
│   ├── evaluate.py           # WiderFace evaluation + speed bench
│   ├── detect_video.py       # Video inference CLI
│   └── export.py             # ONNX export CLI
│
└── utils/                    # Helpers
    ├── visualization.py      # Drawing utilities
    └── io.py                 # Checkpoint I/O
```
---
## References
1. **SCRFD**: Guo et al., "Sample and Computation Redistribution for Efficient Face Detection", arXiv:2105.04714 (2021)
2. **RetinaFace**: Deng et al., "RetinaFace: Single-shot Multi-level Face Localisation in the Wild", arXiv:1905.00641 (2019)
3. **TinaFace**: Zhu et al., "TinaFace: Strong but Simple Baseline for Face Detection", arXiv:2011.13183 (2020)
4. **ByteTrack**: Zhang et al., "ByteTrack: Multi-Object Tracking by Associating Every Detection Box", ECCV 2022
5. **ATSS**: Zhang et al., "Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection", CVPR 2020
6. **GFL**: Li et al., "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection", NeurIPS 2020
7. **DIoU**: Zheng et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression", AAAI 2020
8. **ASFD**: Zhang et al., "ASFD: Automatic and Scalable Face Detector", arXiv:2201.10781 (2022)
9. **DSFD**: Li et al., "DSFD: Dual Shot Face Detector", CVPR 2019
10. **WiderFace**: Yang et al., "WIDER FACE: A Face Detection Benchmark", CVPR 2016
---
## License
Apache 2.0