# FaceDet: Production Face Detection for Video

> **SCRFD-family detectors + ByteTrack tracking + temporal smoothing**
> Optimized for tiny faces, blur, occlusion, poor lighting, compression, and temporal stability.

## Architecture Survey & Design Decisions

### Ranked Candidate Models (WiderFace Hard AP)

| Rank | Model | Easy | Medium | Hard | GFLOPs | FPS (V100) | Year | Selected? |
|------|-------|------|--------|------|--------|------------|------|-----------|
| 1 | ASFD-D6 | 97.2 | 96.5 | **92.5** | ~500 | ~7 | 2022 | ✗ (too slow) |
| 2 | TinaFace-R50+TTA | 96.1 | 95.5 | **92.4** | ~42K (MS) | ~3 | 2020 | ✗ (TTA-dependent) |
| 3 | TinaFace-R50 (single) | 95.9 | 95.2 | **92.1** | 508 | ~15 | 2020 | ✗ (not efficient) |
| 4 | RetinaFace-R152+MS | 96.9 | 96.1 | **91.8** | High | 13 | 2019 | ✗ (heavy backbone) |
| 5 | MOS-L (R152) | 96.9 | 96.1 | **92.1** | Multi-scale | ~16 | 2021 | ✗ (heavy) |
| 6 | DSFD | 96.6 | 95.7 | **90.4** | ~1532 | n/a | 2018 | ✗ (outdated) |
| **7** | **SCRFD-34GF** | **96.1** | **95.0** | **85.2** | **34** | **~80** | **2021** | **✓ Flagship** |
| **8** | **SCRFD-10GF** | **95.2** | **93.9** | **83.1** | **10** | **~140** | **2021** | **✓ Balanced** |
| **9** | **SCRFD-2.5GF** | **93.8** | **92.2** | **77.9** | **2.5** | **~400** | **2021** | **✓ Real-time** |
| **10** | **SCRFD-0.5GF** | **90.6** | **88.1** | **68.5** | **0.5** | **~1000** | **2021** | **✓ Mobile** |
| 11 | RetinaFace-MN0.25 | 91.4 | 90.1 | 78.2 | ~1 | 60 (CPU) | 2019 | ✗ (SCRFD-2.5GF better) |
| 12 | YuNet | 85.6 | 84.2 | 72.7 | 2.5 | 77 (CPU) | 2021 | ✗ (lower AP) |

### Why SCRFD?

**The SCRFD family achieves the best accuracy-efficiency Pareto frontier for face detection.** The key findings:

1. **3.86% better Hard AP** than TinaFace at over 3× the speed (SCRFD-34G vs TinaFace-R50)
2. **No ImageNet pretraining needed**: trains from scratch in 640 epochs
3. **Scalable family**: the same architecture principles cover 0.5 to 34 GFLOPs
4. **Two orthogonal innovations**: Sample Redistribution (augmentation) + Computation Redistribution (NAS)

Higher-ranked models (ASFD-D6, TinaFace+TTA) achieve marginally better Hard AP but at **10-100× the compute cost**, making them impractical for video.

### Key Technical Insights From Literature

| Finding | Source | Impact |
|---------|--------|--------|
| Large-scale crops [0.3, 2.0] increase stride-8 positives from 72K → 118K | SCRFD §3.2 | +5-8% Hard AP |
| GFL jointly trains quality + classification → better score calibration | SCRFD §3.1 | +1-2% Hard AP |
| Lower IoU matching threshold (0.35 vs 0.5) helps tiny faces | TinaFace §3.3 | +2% Hard AP |
| GroupNorm > BatchNorm at small batch sizes | TinaFace §3.2 | Stable training |
| 5-point landmark auxiliary loss improves detection by ~1% | RetinaFace §4.2 | +1% Hard AP |
| WiderFace Hard is near-saturated (~92% AP ceiling) | Survey '21 | Focus on efficiency |
| No published improvements to WiderFace Hard since 2022 | Literature scan | Benchmark mature |

---

## Model Zoo

| Model | WiderFace (E/M/H) | GFLOPs | Params | FPS (V100 VGA) | Use Case |
|-------|-------------------|--------|--------|----------------|----------|
| `scrfd_34g` | 96.1/95.0/85.2 | 34 | 9.80M | ~80 | Flagship quality |
| `scrfd_10g` | 95.2/93.9/83.1 | 10 | 3.86M | ~140 | Balanced |
| `scrfd_2.5g` | 93.8/92.2/77.9 | 2.5 | 0.67M | ~400 | Real-time video |
| `scrfd_0.5g` | 90.6/88.1/68.5 | 0.5 | 0.57M | ~1000 | Mobile/edge |

---

## Architecture

```
Input Image (640×640)
        │
        ▼
┌──────────────────────────────────────────────────────┐
│  BACKBONE (NAS-searched ResNet-style)                │
│  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐    │
│  │ Stem │→ │  S1  │→ │  S2  │→ │  S3  │→ │  S4  │    │
│  │ s=4  │  │ s=4  │  │ s=8  │  │ s=16 │  │ s=32 │    │
│  └──────┘  └──────┘  └──┬───┘  └──┬───┘  └──┬───┘    │
│                         │ C3      │ C4      │ C5     │
└─────────────────────────┼─────────┼─────────┼────────┘
                          ▼         ▼         ▼
┌──────────────────────────────────────────────────────┐
│  PAFPN (Path Aggregation FPN)                        │
│  Top-down (FPN) + Bottom-up (PAN)                    │
│    ┌────┐     ┌────┐     ┌────┐                      │
│    │ P3 │  ←  │ P4 │  ←  │ P5 │   (top-down)         │
│    │ P3 │  →  │ P4 │  →  │ P5 │   (bottom-up)        │
│    │s=8 │     │s=16│     │s=32│                      │
│    └─┬──┘     └─┬──┘     └─┬──┘                      │
└──────┼──────────┼──────────┼─────────────────────────┘
       ▼          ▼          ▼
┌──────────────────────────────────────────────────────┐
│  SHARED HEAD (per level, weight-shared)              │
│  ┌───────────┐  ┌───────────┐                        │
│  │ CLS (GFL) │  │ REG (DIoU)│   [LMK (opt)]          │
│  │    A×1    │  │    A×4    │   [A×10]               │
│  └───────────┘  └───────────┘                        │
└──────────────────────────────────────────────────────┘
        │                  │
        ▼                  ▼
┌───────────────┐   ┌───────────────┐
│  ATSS Match   │   │  NMS (θ=0.4)  │
│  (training)   │   │  (inference)  │
└───────────────┘   └───────────────┘
```

**Anchors (per level):**
- Stride 8: `[16, 32]` → small faces (≥16px)
- Stride 16: `[64, 128]` → medium faces
- Stride 32: `[256, 512]` → large faces
- Aspect ratio: 1.0 (square, since faces are roughly square)

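To make this layout concrete, here is a minimal sketch of how such a per-level anchor grid could be generated; `generate_anchors` is illustrative, not the API of the repo's `models/anchor.py`:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, sizes):
    """Square anchors (x1, y1, x2, y2) for one FPN level, aspect ratio 1.0."""
    # Anchor centers sit at the middle of each feature-map cell.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)      # (H*W, 2)
    per_size = [np.concatenate([centers - s / 2.0, centers + s / 2.0], axis=1)
                for s in sizes]
    return np.concatenate(per_size, axis=0)                   # (H*W*len(sizes), 4)

# Stride-8 level of a 640x640 input: an 80x80 grid with sizes [16, 32]
anchors = generate_anchors(80, 80, 8, [16, 32])
print(anchors.shape)  # (12800, 4)
```

At stride 8 alone this already yields 12,800 anchors, which is why tiny-face recall depends so heavily on that level.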
---

## Video Pipeline

```
Frame → Detector (SCRFD) → ByteTrack Tracker → Temporal Smoother → Output
              │                     │                    │
       Per-frame boxes       Track IDs (stable)   Jitter-free boxes
       + scores              + Kalman prediction  + Score momentum
       + landmarks           + 2-stage matching   + Adaptive EMA
```

**ByteTrack** (Zhang et al., 2022): uses ALL detections, both high and low confidence, in a two-stage association. Low-confidence detections recover partially occluded faces that traditional trackers would drop.

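The two-stage idea can be sketched with a greedy IoU matcher. ByteTrack proper matches against Kalman-predicted boxes with Hungarian assignment; `associate`, its thresholds, and the greedy loop below are simplifications for illustration:

```python
import numpy as np

def iou_one_to_many(a, bs):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(a[0], bs[:, 0]); y1 = np.maximum(a[1], bs[:, 1])
    x2 = np.minimum(a[2], bs[:, 2]); y2 = np.minimum(a[3], bs[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (bs[:, 2] - bs[:, 0]) * (bs[:, 3] - bs[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, boxes, scores, high_thresh=0.6, iou_thresh=0.3):
    """Stage 1 matches high-confidence detections; stage 2 the low-confidence rest."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    low = [i for i, s in enumerate(scores) if s < high_thresh]
    matches, unmatched = [], list(range(len(tracks)))
    for stage in (high, low):
        for det in stage:
            if not unmatched:
                break
            ious = iou_one_to_many(boxes[det], np.array([tracks[t] for t in unmatched]))
            j = int(np.argmax(ious))
            if ious[j] >= iou_thresh:
                matches.append((unmatched.pop(j), det))  # (track index, det index)
    return matches

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]        # last-frame track boxes
boxes = np.array([[0, 0, 10, 10], [21, 21, 31, 31]], dtype=float)
matches = associate(tracks, boxes, scores=[0.9, 0.3])
# The 0.3-score detection (e.g. an occluded face) still recovers track 1 in stage 2.
```

A score-thresholded tracker would have dropped the second detection outright; keeping it for the second stage is the core of the method.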
**Temporal Smoother**: adaptive EMA whose smoothing factor scales with motion magnitude:
- Static faces → heavy smoothing (α ≈ 0.3) → no jitter
- Fast-moving faces → light smoothing (α ≈ 0.9) → no lag

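A minimal sketch of such an adaptive smoother. The class name, the 20px motion scale, and the linear alpha ramp are assumptions for illustration, not the implementation in the repo's `engine/temporal.py`:

```python
import numpy as np

class AdaptiveEMASmoother:
    """Per-track EMA whose alpha grows with box-center motion."""
    def __init__(self, alpha_min=0.3, alpha_max=0.9, motion_scale=20.0):
        self.alpha_min = alpha_min          # heavy smoothing when static
        self.alpha_max = alpha_max          # light smoothing when moving fast
        self.motion_scale = motion_scale    # center motion (px) that maps to alpha_max
        self.state = {}                     # track_id -> last smoothed box

    def update(self, track_id, box):
        box = np.asarray(box, dtype=float)  # (x1, y1, x2, y2)
        prev = self.state.get(track_id)
        if prev is None:
            self.state[track_id] = box
            return box
        # Motion = displacement of the box center since the previous frame.
        motion = np.linalg.norm((box[:2] + box[2:]) - (prev[:2] + prev[2:])) / 2
        t = min(motion / self.motion_scale, 1.0)
        alpha = self.alpha_min + t * (self.alpha_max - self.alpha_min)
        smoothed = alpha * box + (1 - alpha) * prev
        self.state[track_id] = smoothed
        return smoothed

smoother = AdaptiveEMASmoother()
first = smoother.update(7, [100, 100, 150, 150])   # first sighting: passthrough
static = smoother.update(7, [100, 100, 150, 150])  # no motion: alpha stays at 0.3
```

Because alpha saturates at `alpha_max` for large motion, fast faces track the raw detections almost directly, which is what keeps lag bounded.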
---

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

### Detect faces in a video

```python
from facedet import VideoFaceDetector

detector = VideoFaceDetector(
    model_path='checkpoints/scrfd_34g_best.pth',
    model_name='scrfd_34g',
    device='cuda',
    use_tracking=True,
    use_smoothing=True,
)

# Process video file
stats = detector.process_video(
    source='input.mp4',
    output_path='output.mp4',
    show=True,
)
# → {total_frames: 1200, avg_fps: 75.3, avg_faces_per_frame: 4.2}
```

### Detect faces in a single image

```python
from facedet import build_detector
import cv2, torch

model = build_detector('scrfd_34g').cuda().eval()
# Load checkpoint...

img = cv2.imread('photo.jpg')
# Preprocess... (see scripts/evaluate.py for full example)
results = model(tensor)
# → [{'boxes': tensor([...]), 'scores': tensor([...])}]
```

### Real-time webcam

```bash
python scripts/detect_video.py \
    --model scrfd_2.5g \
    --checkpoint checkpoints/scrfd_2.5g_best.pth \
    --input 0 --show
```

---

## Training

### Dataset Setup

Download [WIDER FACE](http://shuoyang1213.me/WIDERFACE/) and arrange:

```
data/wider_face/
├── WIDER_train/images/
├── WIDER_val/images/
├── wider_face_split/
│   ├── wider_face_train_bbx_gt.txt
│   └── wider_face_val_bbx_gt.txt
└── retinaface_gt/          (optional, for landmark training)
    ├── train/label.txt
    └── val/label.txt
```

### Training Commands

```bash
# Single GPU: SCRFD-34G (flagship)
python scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Multi-GPU: 4x V100
torchrun --nproc_per_node=4 scripts/train.py \
    --model scrfd_34g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 8 \
    --lr 0.01

# Real-time variant
python scripts/train.py \
    --model scrfd_2.5g \
    --data-root data/wider_face \
    --epochs 640 \
    --batch-size 16 \
    --lr 0.02
```

### Training Recipe (from SCRFD paper)

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | SGD (m=0.9, wd=5e-4) | Standard for detection |
| Base LR | 0.01 (8 imgs/GPU) | Linear scaling rule |
| LR Schedule | MultiStep [440, 544] ×0.1 | Long training, late decay |
| Warmup | 3 epochs, linear from 1e-5 | Prevent early divergence |
| Total Epochs | 640 | Train from scratch |
| Input Size | 640×640 | Random crop from larger |
| Crop Scales | [0.3, 0.45, ..., 2.0] | **Sample Redistribution** |
| Augmentation | Crop + flip + photometric + robustness | See data/augmentations.py |
| Normalization | GroupNorm | Batch-size independent |
| Matching | ATSS (k=9) | Adaptive thresholds |
| Cls Loss | GFL (β=2) | Joint quality score |
| Reg Loss | DIoU | Better for tiny faces |
| Mixed Precision | ✓ | 2× training speed |

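For reference, the DIoU regression loss in the recipe is IoU loss plus a normalized center-distance penalty. A framework-agnostic NumPy sketch; the repo's `models/losses.py` presumably implements the PyTorch version:

```python
import numpy as np

def diou_loss(pred, target, eps=1e-9):
    """Distance-IoU loss (Zheng et al., AAAI 2020): 1 - IoU + d^2 / c^2."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    # Intersection and union of (x1, y1, x2, y2) boxes.
    lt = np.maximum(pred[:, :2], target[:, :2])
    rb = np.minimum(pred[:, 2:], target[:, 2:])
    inter = np.clip(rb - lt, 0, None).prod(axis=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(axis=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(axis=1)
    iou = inter / (area_p + area_t - inter + eps)
    # d^2: squared distance between box centers.
    d2 = (((pred[:, :2] + pred[:, 2:]) - (target[:, :2] + target[:, 2:])) ** 2).sum(axis=1) / 4
    # c^2: squared diagonal of the smallest enclosing box.
    c2 = ((np.maximum(pred[:, 2:], target[:, 2:])
           - np.minimum(pred[:, :2], target[:, :2])) ** 2).sum(axis=1) + eps
    return 1 - iou + d2 / c2

perfect = diou_loss([[0, 0, 10, 10]], [[0, 0, 10, 10]])  # ≈ 0
shifted = diou_loss([[2, 0, 12, 10]], [[0, 0, 10, 10]])  # ≈ 0.35
```

The distance term is what helps tiny faces: even when IoU is zero (no overlap), the gradient still pulls predicted centers toward the target.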
---

## Evaluation

```bash
python scripts/evaluate.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --data-root data/wider_face \
    --output-dir results/scrfd_34g \
    --benchmark
```

Generates:
- WiderFace Easy/Medium/Hard AP scores
- Predictions in WiderFace submission format
- Speed benchmark table (320/480/640/960px)

---

## Deployment

### ONNX Export

```bash
python scripts/export.py \
    --model scrfd_34g \
    --checkpoint checkpoints/scrfd_34g_best.pth \
    --output deploy/scrfd_34g.onnx \
    --input-size 640
```

### TensorRT (FP16)

```bash
trtexec --onnx=deploy/scrfd_34g.onnx \
    --saveEngine=deploy/scrfd_34g_fp16.engine \
    --fp16 --workspace=4096
```

### Expected Deployment Speedups

| Model | PyTorch FP32 | ONNX RT | TensorRT FP16 | TensorRT INT8 |
|-------|--------------|---------|---------------|---------------|
| SCRFD-34G | ~80 FPS | ~100 FPS | ~200 FPS | ~350 FPS |
| SCRFD-2.5G | ~400 FPS | ~500 FPS | ~800 FPS | ~1200 FPS |
| SCRFD-0.5G | ~1000 FPS | ~1200 FPS | ~2000 FPS | ~3000 FPS |

### PyTorch Quantization (CPU)

```python
from facedet.deploy import quantize_model
quantized = quantize_model(model, method='dynamic')
```

---

## Ablation Studies

Configured in `configs/ablations.yaml`. Each ablation isolates one variable:

| Ablation | Variables | Expected Finding |
|----------|-----------|------------------|
| **Sample Redistribution** | Crop scales [0.3, 1.0] vs [0.3, 2.0] | +5-8% Hard AP from large crops |
| **Loss Functions** | GFL+DIoU vs Focal+SmoothL1 | GFL: +1-2% from quality scores |
| **Matching Strategy** | ATSS(k=9) vs IoU(0.35) vs IoU(0.5) | ATSS: best for mixed scales |
| **Robustness Augmentation** | None / blur / JPEG / all | All: +1-3% on degraded inputs |
| **Normalization** | GroupNorm vs BatchNorm | GN: stable at batch<8 |
| **Input Resolution** | 320 / 480 / 640 / 960 | 960: +5-10% Hard AP, 4× slower |
| **Landmarks** | With/without 5-point landmarks | +~1% Hard AP (RetinaFace finding) |
| **Tracker Config** | None / conservative / aggressive | Aggressive: more tracks, more FP |

---

## Handling Challenging Conditions

### Tiny Faces (<16px)
- **Sample Redistribution** (crop scale up to 2.0×) generates small-face training samples
- Stride-8 feature maps with anchors at [16, 32]px
- Higher inference resolution (960px) trades speed for +5-10% small-face recall
- ATSS matching automatically gives tiny faces lower IoU thresholds

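The last point follows from how ATSS sets its positive threshold: the mean plus standard deviation of the IoUs of the top-k candidate anchors for each ground-truth box. A tiny sketch of that rule:

```python
import numpy as np

def atss_threshold(candidate_ious):
    """ATSS positive threshold: mean + std over the top-k candidates' IoUs."""
    ious = np.asarray(candidate_ious, dtype=float)
    return ious.mean() + ious.std()

# A tiny face has uniformly low candidate IoUs, so its threshold drops and it
# still collects positive anchors; a well-covered face is held to a stricter bar:
tiny = atss_threshold([0.10, 0.15, 0.20])   # ≈ 0.19
large = atss_threshold([0.30, 0.55, 0.80])  # ≈ 0.75
```

No hand-tuned per-scale threshold is needed; the statistics of each face's own candidates set the bar.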
### Blur / Motion Blur
- **Training augmentation**: Gaussian blur with σ ∈ [0.5, 3.0], applied with p=0.2
- Model learns blur-invariant features
- ByteTrack's Kalman filter predicts through blurred frames

### Occlusion
- **Random erasing** (Cutout) during training simulates partial occlusion
- ATSS assigns multiple anchors per GT, so a partially visible face still receives training signal
- ByteTrack's second-stage matching recovers occluded faces via low-confidence detections

### Poor Lighting
- **Gamma darkening** augmentation (γ ∈ [1.5, 3.0]) simulates low light
- Photometric distortion (brightness, contrast jitter)
- For extreme cases: pair with CLAHE preprocessing

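The gamma-darkening augmentation amounts to a per-pixel power law, cheap to apply via a lookup table. A minimal sketch; `gamma_darken` is illustrative, not the repo's augmentation API:

```python
import numpy as np

def gamma_darken(img, gamma):
    """Apply out = (in/255)**gamma * 255 via a LUT; gamma > 1 darkens."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0).round().astype(np.uint8)
    return lut[img]

frame = np.full((2, 2, 3), 128, dtype=np.uint8)
dark = gamma_darken(frame, gamma=2.0)  # mid-gray 128 maps to ~64
```

During training, gamma would be sampled per image from [1.5, 3.0] rather than fixed.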
### Compression Artifacts
- **JPEG quality** degradation (Q=20-80) during training
- No published method addresses this; our augmentation is novel for face detection

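JPEG degradation can be simulated by round-tripping frames through an encoder at a randomly sampled quality. A sketch using Pillow; the repo's `data/augmentations.py` may implement this differently:

```python
import io
import numpy as np
from PIL import Image  # Pillow; a stand-in for the repo's own encoder choice

def jpeg_degrade(img, quality):
    """Round-trip an HxWx3 uint8 frame through JPEG at the given quality."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format='JPEG', quality=int(quality))
    return np.asarray(Image.open(io.BytesIO(buf.getvalue())))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
degraded = jpeg_degrade(frame, quality=int(rng.integers(20, 81)))  # Q in [20, 80]
```

Lower quality introduces the 8×8 blocking and ringing artifacts the detector must learn to tolerate.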
### Temporal Stability
- **ByteTrack**: stable track IDs across frames, handles occlusion
- **Kalman filter**: smooth trajectory prediction
- **Temporal EMA**: adaptive smoothing eliminates box jitter
- **Keyframe strategy**: full detection every N frames, tracker-only in between

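The keyframe strategy reduces to a modulo check in the frame loop. A schematic sketch with stand-in `detect` / `track_only` callables (not the repo's API):

```python
def process_stream(frames, detect, track_only, keyframe_interval=5):
    """Full detection on keyframes; tracker-predicted boxes in between."""
    results = []
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            results.append(detect(frame))      # full detector forward pass
        else:
            results.append(track_only(frame))  # Kalman prediction only (cheap)
    return results

# With interval 3, frames 0, 3, and 6 get a full detector pass:
out = process_stream(range(7), detect=lambda f: 'D',
                     track_only=lambda f: 'T', keyframe_interval=3)
print(''.join(out))  # DTTDTTD
```

The interval trades latency for recall: new faces entering the scene can only be picked up on a keyframe.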
---

## Repository Structure

```
facedet/
├── README.md               # This file
├── setup.py                # Package installation
├── requirements.txt        # Dependencies
│
├── models/                 # Model architectures
│   ├── backbone.py         # NAS-searched ResNet backbones
│   ├── neck.py             # PAFPN feature pyramid
│   ├── head.py             # Shared detection head (cls/reg/lmk)
│   ├── anchor.py           # Anchor generation + ATSS matching
│   ├── losses.py           # GFL, DIoU, Focal, Landmark losses
│   └── detector.py         # Full SCRFD detector (train + inference)
│
├── data/                   # Data pipeline
│   ├── widerface.py        # WiderFace dataset loader
│   ├── augmentations.py    # Training/val/robustness augmentations
│   └── dataloader.py       # DataLoader builders
│
├── engine/                 # Video inference engine
│   ├── video_detector.py   # End-to-end video processing
│   ├── tracker.py          # ByteTrack face tracker
│   └── temporal.py         # Temporal EMA smoother
│
├── evaluation/             # Evaluation suite
│   ├── widerface_eval.py   # WiderFace protocol (Easy/Med/Hard AP)
│   ├── speed_benchmark.py  # Latency/throughput benchmarks
│   └── metrics.py          # Core metrics (AP, IoU, recall)
│
├── deploy/                 # Deployment
│   ├── export_onnx.py      # ONNX export + verification
│   └── optimize.py         # Quantization, TensorRT guide
│
├── configs/                # Configuration files
│   ├── scrfd_34g.yaml      # Flagship (quality)
│   ├── scrfd_10g.yaml      # Balanced
│   ├── scrfd_2.5g.yaml     # Real-time
│   ├── scrfd_0.5g.yaml     # Mobile
│   └── ablations.yaml      # Ablation study configs
│
├── scripts/                # Entry points
│   ├── train.py            # Training (single/multi-GPU)
│   ├── evaluate.py         # WiderFace evaluation + speed bench
│   ├── detect_video.py     # Video inference CLI
│   └── export.py           # ONNX export CLI
│
└── utils/                  # Helpers
    ├── visualization.py    # Drawing utilities
    └── io.py               # Checkpoint I/O
```

---

## References

1. **SCRFD**: Guo et al., "Sample and Computation Redistribution for Efficient Face Detection", arXiv:2105.04714 (2021)
2. **RetinaFace**: Deng et al., "RetinaFace: Single-shot Multi-level Face Localisation in the Wild", arXiv:1905.00641 (2019)
3. **TinaFace**: Zhu et al., "TinaFace: Strong but Simple Baseline for Face Detection", arXiv:2011.13183 (2020)
4. **ByteTrack**: Zhang et al., "ByteTrack: Multi-Object Tracking by Associating Every Detection Box", ECCV 2022
5. **ATSS**: Zhang et al., "Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection", CVPR 2020
6. **GFL**: Li et al., "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection", NeurIPS 2020
7. **DIoU**: Zheng et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression", AAAI 2020
8. **ASFD**: Zhang et al., "ASFD: Automatic and Scalable Face Detector", arXiv:2201.10781 (2022)
9. **DSFD**: Li et al., "DSFD: Dual Shot Face Detector", CVPR 2019
10. **WiderFace**: Yang et al., "WIDER FACE: A Face Detection Benchmark", CVPR 2016

---

## License

Apache 2.0
|