# FaceDet: Production Face Detection for Video

> **SCRFD-family detectors + ByteTrack tracking + temporal smoothing**
> Optimized for tiny faces, blur, occlusion, poor lighting, compression, and temporal stability.

## Architecture Survey & Design Decisions

### Ranked Candidate Models (WiderFace Hard AP)

| Rank | Model | Easy | Medium | Hard | GFLOPs | FPS (V100) | Year | Selected? |
|------|-------|------|--------|------|--------|------------|------|-----------|
| 1 | ASFD-D6 | 97.2 | 96.5 | **92.5** | ~500 | ~7 | 2022 | ✗ (too slow) |
| 2 | TinaFace-R50+TTA | 96.1 | 95.5 | **92.4** | ~42K (MS) | ~3 | 2020 | ✗ (TTA-dependent) |
| 3 | TinaFace-R50 (single) | 95.9 | 95.2 | **92.1** | 508 | ~15 | 2020 | ✗ (not efficient) |
| 4 | RetinaFace-R152+MS | 96.9 | 96.1 | **91.8** | High | 13 | 2019 | ✗ (heavy backbone) |
| 5 | MOS-L (R152) | 96.9 | 96.1 | **92.1** | Multi-scale | ~16 | 2021 | ✗ (heavy) |
| 6 | DSFD | 96.6 | 95.7 | **90.4** | ~1532 | n/a | 2018 | ✗ (outdated) |
| **7** | **SCRFD-34GF** | **96.1** | **95.0** | **85.2** | **34** | **~80** | **2021** | **✓ Flagship** |
| **8** | **SCRFD-10GF** | **95.2** | **93.9** | **83.1** | **10** | **~140** | **2021** | **✓ Balanced** |
| **9** | **SCRFD-2.5GF** | **93.8** | **92.2** | **77.9** | **2.5** | **~400** | **2021** | **✓ Real-time** |
| **10** | **SCRFD-0.5GF** | **90.6** | **88.1** | **68.5** | **0.5** | **~1000** | **2021** | **✓ Mobile** |
| 11 | RetinaFace-MN0.25 | 91.4 | 90.1 | 78.2 | ~1 | 60 (CPU) | 2019 | ✗ (SCRFD-2.5GF better) |
| 12 | YuNet | 85.6 | 84.2 | 72.7 | 2.5 | 77 (CPU) | 2021 | ✗ (lower AP) |

### Why SCRFD?

**The SCRFD family achieves the best accuracy-efficiency Pareto frontier for face detection.** The key findings:

1. **3.86% better Hard AP** than TinaFace at over 3× the speed (SCRFD-34G vs TinaFace-R50)
2. **No ImageNet pretraining needed**: trains from scratch in 640 epochs
3. **Scalable family**: the same architecture principles cover 0.5 to 34 GFLOPs
4. **Two orthogonal innovations**: Sample Redistribution (augmentation) + Computation Redistribution (NAS)

Higher-ranked models (ASFD-D6, TinaFace+TTA) achieve marginally better Hard AP but at **10-100× the compute cost**, making them impractical for video.

### Key Technical Insights From Literature

| Finding | Source | Impact |
|---------|--------|--------|
| Large-scale crops [0.3, 2.0] increase stride-8 positives from 72K → 118K | SCRFD §3.2 | +5-8% Hard AP |
| GFL jointly trains quality + classification → better score calibration | SCRFD §3.1 | +1-2% Hard AP |
| Lower IoU matching threshold (0.35 vs 0.5) helps tiny faces | TinaFace §3.3 | +2% Hard AP |
| GroupNorm > BatchNorm at small batch sizes | TinaFace §3.2 | Stable training |
| 5-point landmark auxiliary loss improves detection by ~1% | RetinaFace §4.2 | +1% Hard AP |
| WiderFace Hard is near-saturated (~92% AP ceiling) | Survey '21 | Focus on efficiency |
| No published improvements to WiderFace Hard since 2022 | Literature scan | Benchmark mature |

---

## Model Zoo

| Model | WiderFace (E/M/H) | GFLOPs | Params | FPS (V100 VGA) | Use Case |
|-------|-------------------|--------|--------|----------------|----------|
| `scrfd_34g` | 96.1/95.0/85.2 | 34 | 9.80M | ~80 | Flagship quality |
| `scrfd_10g` | 95.2/93.9/83.1 | 10 | 3.86M | ~140 | Balanced |
| `scrfd_2.5g` | 93.8/92.2/77.9 | 2.5 | 0.67M | ~400 | Real-time video |
| `scrfd_0.5g` | 90.6/88.1/68.5 | 0.5 | 0.57M | ~1000 | Mobile/edge |

---

## Architecture

```
Input Image (640×640)
        │
        ▼
┌──────────────────────────────────────────────────────┐
│  BACKBONE (NAS-searched ResNet-style)                │
│  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐    │
│  │ Stem │→ │  S1  │→ │  S2  │→ │  S3  │→ │  S4  │    │
│  │ s=4  │  │ s=4  │  │ s=8  │  │ s=16 │  │ s=32 │    │
│  └──────┘  └──────┘  └──┬───┘  └──┬───┘  └──┬───┘    │
│                         │ C3      │ C4      │ C5     │
└─────────────────────────┼─────────┼─────────┼────────┘
                          ▼         ▼         ▼
┌──────────────────────────────────────────────────────┐
│  PAFPN (Path Aggregation FPN)                        │
│  Top-down (FPN) + Bottom-up (PAN)                    │
│    ┌────┐     ┌────┐     ┌────┐                      │
│    │ P3 │  ←  │ P4 │  ←  │ P5 │   (top-down)         │
│    │ P3 │  →  │ P4 │  →  │ P5 │   (bottom-up)        │
│    │s=8 │     │s=16│     │s=32│                      │
│    └─┬──┘     └─┬──┘     └─┬──┘                      │
└──────┼──────────┼──────────┼─────────────────────────┘
       ▼          ▼          ▼
┌──────────────────────────────────────────────────────┐
│  SHARED HEAD (per level, weight-shared)              │
│  ┌───────────┐  ┌───────────┐                        │
│  │ CLS (GFL) │  │ REG (DIoU)│   [LMK (opt)]          │
│  │    A×1    │  │    A×4    │   [A×10]               │
│  └───────────┘  └───────────┘                        │
└──────────────────────────────────────────────────────┘
        │                  │
        ▼                  ▼
┌───────────────┐   ┌───────────────┐
│  ATSS Match   │   │  NMS (θ=0.4)  │
│  (training)   │   │  (inference)  │
└───────────────┘   └───────────────┘
```

**Anchors (per level):**
- Stride 8: `[16, 32]` → small faces (≥16px)
- Stride 16: `[64, 128]` → medium faces
- Stride 32: `[256, 512]` → large faces
- Aspect ratio: 1.0 (square, since faces are roughly square)

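To make this layout concrete, here is a minimal sketch of how such a per-level anchor grid could be generated; `generate_anchors` is illustrative, not the API of the repo's `models/anchor.py`:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, sizes):
    """Square anchors (x1, y1, x2, y2) for one FPN level, aspect ratio 1.0."""
    # Anchor centers sit at the middle of each feature-map cell.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)      # (H*W, 2)
    per_size = [np.concatenate([centers - s / 2.0, centers + s / 2.0], axis=1)
                for s in sizes]
    return np.concatenate(per_size, axis=0)                   # (H*W*len(sizes), 4)

# Stride-8 level of a 640x640 input: an 80x80 grid with sizes [16, 32]
anchors = generate_anchors(80, 80, 8, [16, 32])
print(anchors.shape)  # (12800, 4)
```

At stride 8 alone this already yields 12,800 anchors, which is why tiny-face recall depends so heavily on that level.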
---

## Video Pipeline

```
Frame → Detector (SCRFD) → ByteTrack Tracker → Temporal Smoother → Output
              │                     │                    │
       Per-frame boxes       Track IDs (stable)   Jitter-free boxes
       + scores              + Kalman prediction  + Score momentum
       + landmarks           + 2-stage matching   + Adaptive EMA
```

**ByteTrack** (Zhang et al., 2022): uses ALL detections, both high and low confidence, in a two-stage association. Low-confidence detections recover partially occluded faces that traditional trackers would drop.

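The two-stage idea can be sketched with a greedy IoU matcher. ByteTrack proper matches against Kalman-predicted boxes with Hungarian assignment; `associate`, its thresholds, and the greedy loop below are simplifications for illustration:

```python
import numpy as np

def iou_one_to_many(a, bs):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(a[0], bs[:, 0]); y1 = np.maximum(a[1], bs[:, 1])
    x2 = np.minimum(a[2], bs[:, 2]); y2 = np.minimum(a[3], bs[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (bs[:, 2] - bs[:, 0]) * (bs[:, 3] - bs[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, boxes, scores, high_thresh=0.6, iou_thresh=0.3):
    """Stage 1 matches high-confidence detections; stage 2 the low-confidence rest."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    low = [i for i, s in enumerate(scores) if s < high_thresh]
    matches, unmatched = [], list(range(len(tracks)))
    for stage in (high, low):
        for det in stage:
            if not unmatched:
                break
            ious = iou_one_to_many(boxes[det], np.array([tracks[t] for t in unmatched]))
            j = int(np.argmax(ious))
            if ious[j] >= iou_thresh:
                matches.append((unmatched.pop(j), det))  # (track index, det index)
    return matches

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]        # last-frame track boxes
boxes = np.array([[0, 0, 10, 10], [21, 21, 31, 31]], dtype=float)
matches = associate(tracks, boxes, scores=[0.9, 0.3])
# The 0.3-score detection (e.g. an occluded face) still recovers track 1 in stage 2.
```

A score-thresholded tracker would have dropped the second detection outright; keeping it for the second stage is the core of the method.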
**Temporal Smoother**: adaptive EMA whose smoothing factor scales with motion magnitude:
- Static faces → heavy smoothing (α ≈ 0.3) → no jitter
- Fast-moving faces → light smoothing (α ≈ 0.9) → no lag

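A minimal sketch of such an adaptive smoother. The class name, the 20px motion scale, and the linear alpha ramp are assumptions for illustration, not the implementation in the repo's `engine/temporal.py`:

```python
import numpy as np

class AdaptiveEMASmoother:
    """Per-track EMA whose alpha grows with box-center motion."""
    def __init__(self, alpha_min=0.3, alpha_max=0.9, motion_scale=20.0):
        self.alpha_min = alpha_min          # heavy smoothing when static
        self.alpha_max = alpha_max          # light smoothing when moving fast
        self.motion_scale = motion_scale    # center motion (px) that maps to alpha_max
        self.state = {}                     # track_id -> last smoothed box

    def update(self, track_id, box):
        box = np.asarray(box, dtype=float)  # (x1, y1, x2, y2)
        prev = self.state.get(track_id)
        if prev is None:
            self.state[track_id] = box
            return box
        # Motion = displacement of the box center since the previous frame.
        motion = np.linalg.norm((box[:2] + box[2:]) - (prev[:2] + prev[2:])) / 2
        t = min(motion / self.motion_scale, 1.0)
        alpha = self.alpha_min + t * (self.alpha_max - self.alpha_min)
        smoothed = alpha * box + (1 - alpha) * prev
        self.state[track_id] = smoothed
        return smoothed

smoother = AdaptiveEMASmoother()
first = smoother.update(7, [100, 100, 150, 150])   # first sighting: passthrough
static = smoother.update(7, [100, 100, 150, 150])  # no motion: alpha stays at 0.3
```

Because alpha saturates at `alpha_max` for large motion, fast faces track the raw detections almost directly, which is what keeps lag bounded.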
---

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

### Detect faces in a video

```python
from facedet import VideoFaceDetector

detector = VideoFaceDetector(
    model_path='checkpoints/scrfd_34g_best.pth',
    model_name='scrfd_34g',
    device='cuda',
    use_tracking=True,
    use_smoothing=True,
)

# Process video file
stats = detector.process_video(
    source='input.mp4',
    output_path='output.mp4',
    show=True,
)
# → {total_frames: 1200, avg_fps: 75.3, avg_faces_per_frame: 4.2}
```

### Detect faces in a single image

```python
from facedet import build_detector
import cv2, torch

model = build_detector('scrfd_34g').cuda().eval()
# Load checkpoint...

img = cv2.imread('photo.jpg')
# Preprocess... (see scripts/evaluate.py for full example)
results = model(tensor)
# → [{'boxes': tensor([...]), 'scores': tensor([...])}]
```

### Real-time webcam

```bash
python scripts/detect_video.py \
    --model scrfd_2.5g \
    --checkpoint checkpoints/scrfd_2.5g_best.pth \
    --input 0 --show
```

---

## Training

### Dataset Setup

Download [WIDER FACE](http://shuoyang1213.me/WIDERFACE/) and arrange:

```
data/wider_face/
├── WIDER_train/images/
├── WIDER_val/images/
├── wider_face_split/
│   ├── wider_face_train_bbx_gt.txt
│   └── wider_face_val_bbx_gt.txt
└── retinaface_gt/          (optional, for landmark training)
    ├── train/label.txt
    └── val/label.txt
```

### Training Commands

```bash
# Single GPU: SCRFD-34G (flagship)
python scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Multi-GPU: 4x V100
torchrun --nproc_per_node=4 scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Real-time variant
python scripts/train.py \
    --model scrfd_2.5g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 16 \
    --lr 0.02
```

### Training Recipe (from SCRFD paper)

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | SGD (m=0.9, wd=5e-4) | Standard for detection |
| Base LR | 0.01 (8 imgs/GPU) | Linear scaling rule |
| LR Schedule | MultiStep [440, 544] ×0.1 | Long training, late decay |
| Warmup | 3 epochs, linear from 1e-5 | Prevent early divergence |
| Total Epochs | 640 | Train from scratch |
| Input Size | 640×640 | Random crop from larger |
| Crop Scales | [0.3, 0.45, ..., 2.0] | **Sample Redistribution** |
| Augmentation | Crop + flip + photometric + robustness | See data/augmentations.py |
| Normalization | GroupNorm | Batch-size independent |
| Matching | ATSS (k=9) | Adaptive thresholds |
| Cls Loss | GFL (β=2) | Joint quality score |
| Reg Loss | DIoU | Better for tiny faces |
| Mixed Precision | ✓ | 2× training speed |

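For reference, the DIoU regression loss in the recipe is IoU loss plus a normalized center-distance penalty. A framework-agnostic NumPy sketch; the repo's `models/losses.py` presumably implements the PyTorch version:

```python
import numpy as np

def diou_loss(pred, target, eps=1e-9):
    """Distance-IoU loss (Zheng et al., AAAI 2020): 1 - IoU + d^2 / c^2."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    # Intersection and union of (x1, y1, x2, y2) boxes.
    lt = np.maximum(pred[:, :2], target[:, :2])
    rb = np.minimum(pred[:, 2:], target[:, 2:])
    inter = np.clip(rb - lt, 0, None).prod(axis=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(axis=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(axis=1)
    iou = inter / (area_p + area_t - inter + eps)
    # d^2: squared distance between box centers.
    d2 = (((pred[:, :2] + pred[:, 2:]) - (target[:, :2] + target[:, 2:])) ** 2).sum(axis=1) / 4
    # c^2: squared diagonal of the smallest enclosing box.
    c2 = ((np.maximum(pred[:, 2:], target[:, 2:])
           - np.minimum(pred[:, :2], target[:, :2])) ** 2).sum(axis=1) + eps
    return 1 - iou + d2 / c2

perfect = diou_loss([[0, 0, 10, 10]], [[0, 0, 10, 10]])  # ≈ 0
shifted = diou_loss([[2, 0, 12, 10]], [[0, 0, 10, 10]])  # ≈ 0.35
```

The distance term is what helps tiny faces: even when IoU is zero (no overlap), the gradient still pulls predicted centers toward the target.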
---

## Evaluation

```bash
python scripts/evaluate.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --data-root data/wider_face \
    --output-dir results/scrfd_34g \
    --benchmark
```

Generates:
- WiderFace Easy/Medium/Hard AP scores
- Predictions in WiderFace submission format
- Speed benchmark table (320/480/640/960px)

---

## Deployment

### ONNX Export

```bash
python scripts/export.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --output deploy/scrfd_34g.onnx \
    --input-size 640
```

### TensorRT (FP16)

```bash
trtexec --onnx=deploy/scrfd_34g.onnx \
    --saveEngine=deploy/scrfd_34g_fp16.engine \
    --fp16 --workspace=4096
```

### Expected Deployment Speedups

| Model | PyTorch FP32 | ONNX RT | TensorRT FP16 | TensorRT INT8 |
|-------|--------------|---------|---------------|---------------|
| SCRFD-34G | ~80 FPS | ~100 FPS | ~200 FPS | ~350 FPS |
| SCRFD-2.5G | ~400 FPS | ~500 FPS | ~800 FPS | ~1200 FPS |
| SCRFD-0.5G | ~1000 FPS | ~1200 FPS | ~2000 FPS | ~3000 FPS |

### PyTorch Quantization (CPU)

```python
from facedet.deploy import quantize_model
quantized = quantize_model(model, method='dynamic')
```

---

## Ablation Studies

Configured in `configs/ablations.yaml`. Each ablation isolates one variable:

| Ablation | Variables | Expected Finding |
|----------|-----------|------------------|
| **Sample Redistribution** | Crop scales [0.3, 1.0] vs [0.3, 2.0] | +5-8% Hard AP from large crops |
| **Loss Functions** | GFL+DIoU vs Focal+SmoothL1 | GFL: +1-2% from quality scores |
| **Matching Strategy** | ATSS(k=9) vs IoU(0.35) vs IoU(0.5) | ATSS: best for mixed scales |
| **Robustness Augmentation** | None / blur / JPEG / all | All: +1-3% on degraded inputs |
| **Normalization** | GroupNorm vs BatchNorm | GN: stable at batch<8 |
| **Input Resolution** | 320 / 480 / 640 / 960 | 960: +5-10% Hard AP, 4× slower |
| **Landmarks** | With/without 5-point landmarks | +~1% Hard AP (RetinaFace finding) |
| **Tracker Config** | None / conservative / aggressive | Aggressive: more tracks, more FP |

---

## Handling Challenging Conditions

### Tiny Faces (<16px)
- **Sample Redistribution** (crop scale up to 2.0×) generates small-face training samples
- Stride-8 feature maps with anchors at [16, 32]px
- Higher inference resolution (960px) trades speed for +5-10% small-face recall
- ATSS matching automatically gives tiny faces lower IoU thresholds

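The last point follows from how ATSS sets its positive threshold: the mean plus standard deviation of the IoUs of the top-k candidate anchors for each ground-truth box. A tiny sketch of that rule:

```python
import numpy as np

def atss_threshold(candidate_ious):
    """ATSS positive threshold: mean + std over the top-k candidates' IoUs."""
    ious = np.asarray(candidate_ious, dtype=float)
    return ious.mean() + ious.std()

# A tiny face has uniformly low candidate IoUs, so its threshold drops and it
# still collects positive anchors; a well-covered face is held to a stricter bar:
tiny = atss_threshold([0.10, 0.15, 0.20])   # ≈ 0.19
large = atss_threshold([0.30, 0.55, 0.80])  # ≈ 0.75
```

No hand-tuned per-scale threshold is needed; the statistics of each face's own candidates set the bar.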
### Blur / Motion Blur
- **Training augmentation**: Gaussian blur with σ ∈ [0.5, 3.0], applied with p=0.2
- Model learns blur-invariant features
- ByteTrack's Kalman filter predicts through blurred frames

### Occlusion
- **Random erasing** (Cutout) during training simulates partial occlusion
- ATSS assigns multiple anchors per GT, so a partially visible face still receives training signal
- ByteTrack's second-stage matching recovers occluded faces via low-confidence detections

### Poor Lighting
- **Gamma darkening** augmentation (γ ∈ [1.5, 3.0]) simulates low light
- Photometric distortion (brightness, contrast jitter)
- For extreme cases: pair with CLAHE preprocessing

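The gamma-darkening augmentation amounts to a per-pixel power law, cheap to apply via a lookup table. A minimal sketch; `gamma_darken` is illustrative, not the repo's augmentation API:

```python
import numpy as np

def gamma_darken(img, gamma):
    """Apply out = (in/255)**gamma * 255 via a LUT; gamma > 1 darkens."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0).round().astype(np.uint8)
    return lut[img]

frame = np.full((2, 2, 3), 128, dtype=np.uint8)
dark = gamma_darken(frame, gamma=2.0)  # mid-gray 128 maps to ~64
```

During training, gamma would be sampled per image from [1.5, 3.0] rather than fixed.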
### Compression Artifacts
- **JPEG quality** degradation (Q=20-80) during training
- No published method addresses this; our augmentation is novel for face detection

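JPEG degradation can be simulated by round-tripping frames through an encoder at a randomly sampled quality. A sketch using Pillow; the repo's `data/augmentations.py` may implement this differently:

```python
import io
import numpy as np
from PIL import Image  # Pillow; a stand-in for the repo's own encoder choice

def jpeg_degrade(img, quality):
    """Round-trip an HxWx3 uint8 frame through JPEG at the given quality."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format='JPEG', quality=int(quality))
    return np.asarray(Image.open(io.BytesIO(buf.getvalue())))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
degraded = jpeg_degrade(frame, quality=int(rng.integers(20, 81)))  # Q in [20, 80]
```

Lower quality introduces the 8×8 blocking and ringing artifacts the detector must learn to tolerate.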
### Temporal Stability
- **ByteTrack**: stable track IDs across frames, handles occlusion
- **Kalman filter**: smooth trajectory prediction
- **Temporal EMA**: adaptive smoothing eliminates box jitter
- **Keyframe strategy**: full detection every N frames, tracker-only in between

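The keyframe strategy reduces to a modulo check in the frame loop. A schematic sketch with stand-in `detect` / `track_only` callables (not the repo's API):

```python
def process_stream(frames, detect, track_only, keyframe_interval=5):
    """Full detection on keyframes; tracker-predicted boxes in between."""
    results = []
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            results.append(detect(frame))      # full detector forward pass
        else:
            results.append(track_only(frame))  # Kalman prediction only (cheap)
    return results

# With interval 3, frames 0, 3, and 6 get a full detector pass:
out = process_stream(range(7), detect=lambda f: 'D',
                     track_only=lambda f: 'T', keyframe_interval=3)
print(''.join(out))  # DTTDTTD
```

The interval trades latency for recall: new faces entering the scene can only be picked up on a keyframe.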
---

## Repository Structure

```
facedet/
├── README.md               # This file
├── setup.py                # Package installation
├── requirements.txt        # Dependencies
│
├── models/                 # Model architectures
│   ├── backbone.py         # NAS-searched ResNet backbones
│   ├── neck.py             # PAFPN feature pyramid
│   ├── head.py             # Shared detection head (cls/reg/lmk)
│   ├── anchor.py           # Anchor generation + ATSS matching
│   ├── losses.py           # GFL, DIoU, Focal, Landmark losses
│   └── detector.py         # Full SCRFD detector (train + inference)
│
├── data/                   # Data pipeline
│   ├── widerface.py        # WiderFace dataset loader
│   ├── augmentations.py    # Training/val/robustness augmentations
│   └── dataloader.py       # DataLoader builders
│
├── engine/                 # Video inference engine
│   ├── video_detector.py   # End-to-end video processing
│   ├── tracker.py          # ByteTrack face tracker
│   └── temporal.py         # Temporal EMA smoother
│
├── evaluation/             # Evaluation suite
│   ├── widerface_eval.py   # WiderFace protocol (Easy/Med/Hard AP)
│   ├── speed_benchmark.py  # Latency/throughput benchmarks
│   └── metrics.py          # Core metrics (AP, IoU, recall)
│
├── deploy/                 # Deployment
│   ├── export_onnx.py      # ONNX export + verification
│   └── optimize.py         # Quantization, TensorRT guide
│
├── configs/                # Configuration files
│   ├── scrfd_34g.yaml      # Flagship (quality)
│   ├── scrfd_10g.yaml      # Balanced
│   ├── scrfd_2.5g.yaml     # Real-time
│   ├── scrfd_0.5g.yaml     # Mobile
│   └── ablations.yaml      # Ablation study configs
│
├── scripts/                # Entry points
│   ├── train.py            # Training (single/multi-GPU)
│   ├── evaluate.py         # WiderFace evaluation + speed bench
│   ├── detect_video.py     # Video inference CLI
│   └── export.py           # ONNX export CLI
│
└── utils/                  # Helpers
    ├── visualization.py    # Drawing utilities
    └── io.py               # Checkpoint I/O
```

---

## References

1. **SCRFD**: Guo et al., "Sample and Computation Redistribution for Efficient Face Detection", arXiv:2105.04714 (2021)
2. **RetinaFace**: Deng et al., "RetinaFace: Single-shot Multi-level Face Localisation in the Wild", arXiv:1905.00641 (2019)
3. **TinaFace**: Zhu et al., "TinaFace: Strong but Simple Baseline for Face Detection", arXiv:2011.13183 (2020)
4. **ByteTrack**: Zhang et al., "ByteTrack: Multi-Object Tracking by Associating Every Detection Box", ECCV 2022
5. **ATSS**: Zhang et al., "Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection", CVPR 2020
6. **GFL**: Li et al., "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection", NeurIPS 2020
7. **DIoU**: Zheng et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression", AAAI 2020
8. **ASFD**: Zhang et al., "ASFD: Automatic and Scalable Face Detector", arXiv:2201.10781 (2022)
9. **DSFD**: Li et al., "DSFD: Dual Shot Face Detector", CVPR 2019
10. **WiderFace**: Yang et al., "WIDER FACE: A Face Detection Benchmark", CVPR 2016

---

## License

Apache 2.0
|