chore: clean up dead code, stale comments, and misleading names
- Rename gpt_data -> metadata on STrack (tracker.py)
- Remove dead inject_metadata method and phantom METADATA_SYNC_KEYS
- Fix _sync_data to sync depth_rel (the key actually written)
- Remove dead _build_display_label and gpt_distance_m checks
- Remove no-op add_no_cache_header middleware
- Remove unused get_segmenter_detector import
- Cache index.html at module load instead of reading per request
- Extract _parse_queries helper to deduplicate query parsing
- Remove empty set_track_data calls in segmentation writer loop
- Remove sentence-transformers semantic matching from coco_classes
- Clean stale GPT/mission references from docstrings
- Update CLAUDE.md to reflect stripped-down architecture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLAUDE.md +63 -209
- coco_classes.py +1 -67
- models/detectors/detr.py +1 -1
- models/detectors/grounding_dino.py +1 -1
- models/segmenters/model_loader.py +1 -1
- utils/profiler.py +0 -1
- utils/tracker.py +13 -74
CLAUDE.md

@@ -4,251 +4,105 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 ## Project Overview

-- **Object Detection**: Detect custom objects using text queries (fully functional)
-- **Segmentation**: Mask overlays using SAM3
-- **Drone Detection**: (Coming Soon) Specialized UAV detection
-
-## Core Architecture
-
-### Simple Detection Flow
-
-```
-User → demo.html → POST /detect → inference.py → detector → processed video
-```
-
-1. User selects mode and uploads video via web interface
-2. Frontend sends video + mode + queries to `/detect` endpoint
-3. Backend runs detection inference with selected model
-4. Returns processed video with bounding boxes
-
-### Available Detectors
-
-The system includes 4 pre-trained object detection models:
-
-| Detector | Key | Type | Best For |
-|----------|-----|------|----------|
-| **OWLv2** | `owlv2_base` | Open-vocabulary | Custom text queries (default) |
-| **YOLOv8** | `hf_yolov8` | COCO classes | Fast real-time detection |
-| **DETR** | `detr_resnet50` | COCO classes | Transformer-based detection |
-| **Grounding DINO** | `grounding_dino` | Open-vocabulary | Text-grounded detection |
-
-All detectors implement the `ObjectDetector` interface in `models/detectors/base.py` with a single `predict()` method.

 ## Development Commands

-### Setup
 ```bash
 pip install -r requirements.txt
-```
-
-#
-```bash
-# Development
 uvicorn app:app --host 0.0.0.0 --port 7860 --reload

-#
-docker build -t object_detectors .
-docker run -p 7860:7860 object_detectors
-```

-#
-
-# Test object detection
-curl -X POST http://localhost:7860/detect \
   -F "video=@sample.mp4" \
   -F "mode=object_detection" \
-  -F "queries=person,car" \
-  -F "detector=owlv2_base" \
-  --output processed.mp4
-
-# Test placeholder modes (returns JSON)
-curl -X POST http://localhost:7860/detect \
-  -F "video=@sample.mp4" \
-  -F "mode=segmentation"
-```
-
-## Key Implementation Details
-
-### API Endpoint: `/detect`
-
-**Parameters:**
-- `video` (file): Video file to process
-- `mode` (string): Detection mode - `object_detection`, `segmentation`, or `drone_detection`
-- `queries` (string): Comma-separated object classes (for object_detection mode)
-- `detector` (string): Model key (default: `owlv2_base`)
-
-**Returns:**
-- For `object_detection`: MP4 video with bounding boxes
-- For `segmentation`: MP4 video with mask overlays
-- For `drone_detection`: JSON with `{"status": "coming_soon", "message": "..."}`
-
-### Inference Pipeline
-
-The `run_inference()` function in `inference.py` follows these steps:
-
-1. **Extract Frames**: Decode video using OpenCV
-2. **Parse Queries**: Split comma-separated text into a list (defaults to common objects if empty)
-3. **Select Detector**: Load detector by key (cached via `@lru_cache`)
-4. **Process Frames**: Run detection on each frame
-   - Call `detector.predict(frame, queries)`
-   - Draw green bounding boxes on detections
-5. **Write Video**: Encode processed frames back to MP4
-
-Default queries (if none provided): `["person", "car", "truck", "motorcycle", "bicycle", "bus", "train", "airplane"]`
-
-### Detector Loading
-
-Detectors are registered in `models/model_loader.py`:
-
-```python
-_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
-    "owlv2_base": Owlv2Detector,
-    "hf_yolov8": HuggingFaceYoloV8Detector,
-    "detr_resnet50": DetrDetector,
-    "grounding_dino": GroundingDinoDetector,
-}
 ```

-
-##
-
-```python
-DetectionResult(
-    boxes: np.ndarray,              # Nx4 array [x1, y1, x2, y2]
-    scores: Sequence[float],        # Confidence scores
-    labels: Sequence[int],          # Class indices
-    label_names: Optional[Sequence[str]]  # Human-readable names
-)
-```
-
-## File Structure

 ```
-.
-├─
-├─
-├─
-│   ├── model_loader.py       # Detector registry and loading
-│   └── detectors/
-│       ├── base.py           # ObjectDetector interface
-│       ├── owlv2.py          # OWLv2 implementation
-│       ├── yolov8.py         # YOLOv8 implementation
-│       ├── detr.py           # DETR implementation
-│       └── grounding_dino.py # Grounding DINO implementation
-├── utils/
-│   └── video.py              # Video encoding/decoding utilities
-└── coco_classes.py           # COCO dataset class definitions
 ```

-
-To add a new detector:
-
-1. **Create detector class** in `models/detectors/`:
-```python
-from .base import ObjectDetector, DetectionResult
-
-    name = "my_detector"
-```
-
-```python
-_REGISTRY = {
-    ...
-    "my_detector": MyDetector,
-}
-```
-
-```html
-<option value="my_detector">My Detector</option>
-```

-
-```python
-if mode == "segmentation":
-    # Run segmentation inference
-    # Return video with masks rendered
-```

-
-##
-
-### Query Processing
-Queries are parsed from comma-separated strings:
-```python
-queries = [q.strip() for q in "person, car, dog".split(",") if q.strip()]
-# Result: ["person", "car", "dog"]
-```
-
-### Frame Processing Loop
-Standard pattern for processing video frames:
-```python
-processed_frames = []
-for idx, frame in enumerate(frames):
-    processed_frame, detections = infer_frame(frame, queries, detector_name)
-    processed_frames.append(processed_frame)
-```
-
-### Temporary File Management
-FastAPI's `BackgroundTasks` cleans up temp files after the response:
-```python
-_schedule_cleanup(background_tasks, input_path)
-_schedule_cleanup(background_tasks, output_path)
-```

-
-- **Default Resolution**: Videos processed at original resolution
-- **Frame Limit**: Use `max_frames` parameter in `run_inference()` for testing
-- **Memory Usage**: Entire video is loaded into memory (frames list)

-
-###
-Install dependencies: `pip install -r requirements.txt`

-Check video codec compatibility. System expects MP4/H.264.

-##
-Verify detector key exists in `model_loader._REGISTRY`

-- Use `max_frames` parameter for testing

-##
-- `
-- `torch` + `transformers`: Deep learning models
-- `opencv-python-headless`: Video processing
-- `ultralytics`: YOLOv8 implementation
-- `huggingface-hub`: Model downloading
-- `pillow`, `scipy`, `accelerate`, `timm`: Supporting libraries
 ## Project Overview

+Reusable video analysis base combining object detection, segmentation, depth estimation, and multi-object tracking. Deployed as a Hugging Face Space (Docker SDK). Designed for multi-GPU inference with async job processing and live MJPEG streaming.

 ## Development Commands

 ```bash
+# Setup
+python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt

+# Run dev server
 uvicorn app:app --host 0.0.0.0 --port 7860 --reload

+# Docker (production / HF Spaces)
+docker build -t detection_base . && docker run -p 7860:7860 detection_base

+# Test async detection
+curl -X POST http://localhost:7860/detect/async \
   -F "video=@sample.mp4" \
   -F "mode=object_detection" \
+  -F "queries=person,car" \
+  -F "detector=yolo11"
 ```

+No test suite exists. Verify changes by running the server and testing through the UI at `http://localhost:7860`.
|
| 31 |
|
| 32 |
+
## Architecture
|
| 33 |
|
| 34 |
+
### Request Flow
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
```
|
| 37 |
+
index.html → POST /detect/async → app.py
|
| 38 |
+
├─ process_first_frame() # Fast preview (~1-2s)
|
| 39 |
+
├─ Return job_id + URLs immediately
|
| 40 |
+
└─ Background: process_video_async()
|
| 41 |
+
├─ run_inference() # Detection mode
|
| 42 |
+
└─ run_grounded_sam2_tracking() # Segmentation mode
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
```
|
| 44 |
|
| 45 |
+
The async pipeline returns instantly with a `job_id`. The frontend polls `/detect/status/{job_id}` and streams live frames via `/detect/stream/{job_id}` (MJPEG).
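The poll-until-terminal loop the frontend runs against `/detect/status/{job_id}` can be sketched as follows. `poll_job` and the injected `fetch_status` callable are illustrative names, not part of the repository; the terminal state strings mirror the `JobStatus` values described later in this file.

```python
import time
from typing import Callable, Dict

def poll_job(fetch_status: Callable[[], Dict],
             interval_s: float = 0.5,
             timeout_s: float = 300.0) -> Dict:
    """Poll a job-status callable until it reports a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # In the real app this would be: GET /detect/status/{job_id} -> JSON
        status = fetch_status()
        if status.get("status") in ("completed", "failed", "cancelled"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not finish in time")

# Stub that completes on the third poll
responses = iter([{"status": "processing"},
                  {"status": "processing"},
                  {"status": "completed"}])
result = poll_job(lambda: next(responses), interval_s=0.0)
```

In the browser the same loop runs in JS with `setInterval`; the sketch only shows the control flow.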
+### API Endpoints (app.py)

+**Core:** `POST /detect` (sync), `POST /detect/async` (async with streaming)
+**Job management:** `GET /detect/status/{job_id}`, `DELETE /detect/job/{job_id}`, `GET /detect/video/{job_id}`, `GET /detect/stream/{job_id}`
+**Per-frame data:** `GET /detect/tracks/{job_id}/{frame_idx}`, `GET /detect/first-frame/{job_id}`, `GET /detect/first-frame-depth/{job_id}`, `GET /detect/depth-video/{job_id}`
+**Benchmarking:** `POST /benchmark`, `POST /benchmark/profile`, `POST /benchmark/analysis`, `GET /gpu-monitor`, `GET /benchmark/hardware`
+### Model Registries

+All models use a registry + factory pattern with `@lru_cache` for singleton loading. Use `load_*_on_device(name, device)` for multi-GPU (no cache).

+**Detectors** (`models/model_loader.py`):
+| Key | Model | Vocabulary |
+|-----|-------|-----------|
+| `yolo11` (default) | YOLO11m | COCO classes only |
+| `detr_resnet50` | DETR | COCO classes only |
+| `grounding_dino` | Grounding DINO | Open-vocabulary (arbitrary text) |
+| `drone_yolo` | Drone YOLO | Specialized UAV detection |

+All implement `ObjectDetector.predict(frame, queries)` → `DetectionResult(boxes, scores, labels, label_names)` from `models/detectors/base.py`.
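The registry + factory + `@lru_cache` pattern described above can be sketched like this; the class bodies are stand-ins, not the repository's real implementations.

```python
from functools import lru_cache
from typing import Callable, Dict

class ObjectDetector:
    """Stand-in for the interface in models/detectors/base.py."""
    def predict(self, frame, queries):
        raise NotImplementedError

class Yolo11Detector(ObjectDetector):
    def predict(self, frame, queries):
        return []  # stub: a real detector returns a DetectionResult

_REGISTRY: Dict[str, Callable[[], ObjectDetector]] = {
    "yolo11": Yolo11Detector,
}

@lru_cache(maxsize=None)
def load_detector(name: str) -> ObjectDetector:
    """Singleton per process: repeated calls return the same instance."""
    if name not in _REGISTRY:
        raise ValueError(f"Unknown detector '{name}'. Available: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()

def load_detector_on_device(name: str, device: str) -> ObjectDetector:
    """Uncached variant: a fresh instance per GPU for data parallelism."""
    det = _REGISTRY[name]()
    det.device = device  # sketch; real models would move weights to the device
    return det

same = load_detector("yolo11") is load_detector("yolo11")
fresh = load_detector_on_device("yolo11", "cuda:0") is not load_detector_on_device("yolo11", "cuda:1")
```

The cached loader keeps one model per process; the uncached variant is what makes per-GPU worker instances possible.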
+**Segmenters** (`models/segmenters/model_loader.py`):
+- `GSAM2-S/B/L` — Grounded SAM2 (small/base/large) backed by grounding_dino
+- `YSAM2-S/B/L` — YOLO-SAM2 (small/base/large) backed by yolo11

+**Depth** (`models/depth_estimators/model_loader.py`):
+- `depth` — DepthAnythingV2

+### Inference Pipeline (inference.py)

+Three public entry points:
+- **`process_first_frame()`** — Extract + detect on frame 0 only. Returns processed frame + detections.
+- **`run_inference()`** — Full detection pipeline. Multi-GPU data parallelism with worker threads per GPU, reorder buffer for out-of-order completion, ByteTracker for object tracking, optional depth.
+- **`run_grounded_sam2_tracking()`** — SAM2 segmentation with temporal coherence. Uses `SharedFrameStore` (in-memory decoded frames, 12 GiB budget) or falls back to JPEG extraction. `step` parameter controls keyframe interval.
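The "reorder buffer for out-of-order completion" mentioned for `run_inference()` can be sketched as a small generator: GPU workers finish frames in any order, but output is emitted strictly by frame index. Names here are illustrative, not the actual helpers in `inference.py`.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

def reorder(completions: Iterable[Tuple[int, Any]]) -> Iterator[Any]:
    """Yield results in frame order, buffering any that arrive early."""
    pending: Dict[int, Any] = {}
    next_idx = 0
    for idx, result in completions:
        pending[idx] = result
        # Drain every result that is now contiguous with the output head
        while next_idx in pending:
            yield pending.pop(next_idx)
            next_idx += 1

# Frames finished out of order by per-GPU workers:
out = list(reorder([(1, "b"), (0, "a"), (3, "d"), (2, "c")]))
```

The buffer holds at most the gap between the fastest and slowest worker, so memory stays bounded in practice.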
+### Async Job System (jobs/)

+- **`jobs/models.py`** — `JobInfo` dataclass, `JobStatus` enum (PROCESSING/COMPLETED/FAILED/CANCELLED)
+- **`jobs/storage.py`** — Thread-safe in-memory storage at `/tmp/detection_jobs/{job_id}/`. Auto-cleanup every 10 minutes.
+- **`jobs/background.py`** — `process_video_async()` dispatches to the correct inference function, updates job status.
+- **`jobs/streaming.py`** — Event-driven MJPEG frame publishing. Non-blocking (drops if consumer is slow). Frames pre-resized to 640px width.
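The drop-if-slow publishing in `jobs/streaming.py` could work along these lines: keep only the newest frame so a slow MJPEG consumer never blocks the producer. This is a sketch under that assumption; the actual class names and API in the repository may differ.

```python
import threading
from typing import Optional, Tuple

class LatestFramePublisher:
    """Keep only the newest JPEG frame; older frames are silently dropped."""
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frame: Optional[bytes] = None
        self._seq = 0  # monotonically increasing frame counter

    def publish(self, jpeg: bytes) -> None:
        """Producer side: never blocks, overwrites any unconsumed frame."""
        with self._cond:
            self._frame = jpeg
            self._seq += 1
            self._cond.notify_all()

    def wait_for_frame(self, last_seen: int, timeout: float = 1.0) -> Tuple[Optional[bytes], int]:
        """Consumer side: wait until a frame newer than last_seen exists."""
        with self._cond:
            self._cond.wait_for(lambda: self._seq > last_seen, timeout=timeout)
            return self._frame, self._seq

pub = LatestFramePublisher()
pub.publish(b"frame-1")
pub.publish(b"frame-2")  # consumer was slow: frame-1 is dropped
frame, seq = pub.wait_for_frame(last_seen=0)
```

An MJPEG endpoint would loop on `wait_for_frame`, wrapping each returned JPEG in a `multipart/x-mixed-replace` part.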
+### Concurrency Model

+- Per-model `RLock` for GPU serialization (`inference.py:_get_model_lock`)
+- Multi-GPU workers use separate model instances per device
+- `AsyncVideoReader` prefetches frames in a background thread to prevent GPU starvation
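The per-model lock pattern in the first bullet is roughly the following; `_get_model_lock` does exist in `inference.py`, but this particular shape is a guess, not the repository's code.

```python
import threading
from typing import Dict

_model_locks: Dict[str, threading.RLock] = {}
_locks_guard = threading.Lock()  # protects the dict itself

def _get_model_lock(model_name: str) -> threading.RLock:
    """One RLock per model name, so calls into the same model are serialized
    on the GPU while different models can run concurrently."""
    with _locks_guard:
        if model_name not in _model_locks:
            _model_locks[model_name] = threading.RLock()
        return _model_locks[model_name]

lock_a = _get_model_lock("yolo11")
lock_b = _get_model_lock("yolo11")   # same model -> same lock object
lock_c = _get_model_lock("depth")    # different model -> independent lock
```

An `RLock` (rather than `Lock`) lets a code path that already holds the model lock call another helper that re-acquires it without deadlocking.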
+### Frontend (index.html)

+Single HTML page with vanilla JS. Upload video, pick mode/model, view first frame, live MJPEG stream, download processed video, inspect detection JSON.

+## Adding a New Detector

+1. Create class in `models/detectors/` implementing `ObjectDetector` from `base.py`
+2. Register in `models/model_loader.py` `_REGISTRY`
+3. Add option to detector dropdown in `index.html`
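The three steps above, sketched end to end. The `DetectionResult` shape follows the `(boxes, scores, labels, label_names)` description earlier in this file; everything else (`MyDetector`, the stub base class) is hypothetical scaffolding, not the repository's actual `base.py`.

```python
class DetectionResult:
    """Sketch of the result type from models/detectors/base.py."""
    def __init__(self, boxes, scores, labels, label_names=None):
        self.boxes = boxes            # N x 4 boxes as [x1, y1, x2, y2]
        self.scores = scores          # confidence per box
        self.labels = labels          # class indices
        self.label_names = label_names

class MyDetector:
    """Step 1: a new class in models/detectors/ implementing predict()."""
    name = "my_detector"

    def predict(self, frame, queries):
        # A real detector runs a model here; this stub returns no detections.
        return DetectionResult([], [], [], [])

# Step 2: register in models/model_loader.py
_REGISTRY = {"my_detector": MyDetector}

# Step 3: add <option value="my_detector">My Detector</option> to index.html

result = _REGISTRY["my_detector"]().predict(frame=None, queries=["person"])
```

Once registered, the key flows through unchanged: the frontend posts `detector=my_detector`, and the loader instantiates it via the registry.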
+## Dual Remotes

+- `hf` → Hugging Face Space (deployment)
+- `github` → GitHub (version control)
coco_classes.py

@@ -6,7 +6,6 @@ import logging
 import re
 from typing import Dict, Optional, Tuple

-import numpy as np

 logger = logging.getLogger(__name__)

@@ -157,69 +156,6 @@ _COCO_SYNONYMS: Dict[str, str] = {
 _ALIAS_LOOKUP: Dict[str, str] = {_normalize(alias): canonical for alias, canonical in _COCO_SYNONYMS.items()}


-# ---------------------------------------------------------------------------
-# Semantic similarity fallback (lazy-loaded)
-# ---------------------------------------------------------------------------
-
-_SEMANTIC_MODEL = None
-_COCO_EMBEDDINGS: Optional[np.ndarray] = None
-_SEMANTIC_THRESHOLD = 0.65  # Minimum cosine similarity to accept a match
-
-
-def _get_semantic_model():
-    """Lazy-load a lightweight sentence-transformer for semantic matching."""
-    global _SEMANTIC_MODEL, _COCO_EMBEDDINGS
-    if _SEMANTIC_MODEL is not None:
-        return _SEMANTIC_MODEL, _COCO_EMBEDDINGS
-
-    try:
-        from sentence_transformers import SentenceTransformer
-        _SEMANTIC_MODEL = SentenceTransformer("all-MiniLM-L6-v2")
-        # Prefix with "a photo of a" to anchor embeddings in visual/object space
-        coco_phrases = [f"a photo of a {cls}" for cls in COCO_CLASSES]
-        _COCO_EMBEDDINGS = _SEMANTIC_MODEL.encode(
-            coco_phrases, normalize_embeddings=True
-        )
-        logger.info("Loaded semantic similarity model for COCO class mapping")
-    except Exception:
-        logger.warning("sentence-transformers unavailable; semantic COCO mapping disabled", exc_info=True)
-        _SEMANTIC_MODEL = False  # Sentinel: tried and failed
-        _COCO_EMBEDDINGS = None
-
-    return _SEMANTIC_MODEL, _COCO_EMBEDDINGS
-
-
-def _semantic_coco_match(value: str) -> Optional[str]:
-    """Find the closest COCO class by embedding cosine similarity.
-
-    Returns the COCO class name if similarity >= threshold, else None.
-    """
-    model, coco_embs = _get_semantic_model()
-    if model is False or coco_embs is None:
-        return None
-
-    query_emb = model.encode(
-        [f"a photo of a {value}"], normalize_embeddings=True
-    )
-    similarities = query_emb @ coco_embs.T  # (1, 80)
-    best_idx = int(np.argmax(similarities))
-    best_score = float(similarities[0, best_idx])
-
-    if best_score >= _SEMANTIC_THRESHOLD:
-        matched = COCO_CLASSES[best_idx]
-        logger.info(
-            "Semantic COCO match: '%s' -> '%s' (score=%.3f)",
-            value, matched, best_score,
-        )
-        return matched
-
-    logger.debug(
-        "Semantic COCO match failed: '%s' best='%s' (score=%.3f < %.2f)",
-        value, COCO_CLASSES[best_idx], best_score, _SEMANTIC_THRESHOLD,
-    )
-    return None
-
-
 @functools.lru_cache(maxsize=512)
 def canonicalize_coco_name(value: str | None) -> str | None:
     """Map an arbitrary string to the closest COCO class name if possible.

@@ -230,7 +166,6 @@ def canonicalize_coco_name(value: str | None) -> str | None:
     3. Substring match (alias then canonical)
     4. Token-level match
     5. Fuzzy string match (difflib)
-    6. Semantic embedding similarity (sentence-transformers)
     """

     if not value:

@@ -261,5 +196,4 @@ def canonicalize_coco_name(value: str | None) -> str | None:
     if close:
         return _CANONICAL_LOOKUP[close[0]]

-    return _semantic_coco_match(value)
+    return None
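After this change, step 5 (fuzzy string match via difflib) is the final fallback in `canonicalize_coco_name`. A minimal sketch of that step, with an illustrative class list and cutoff rather than the module's actual values:

```python
import difflib
from typing import Optional

COCO_SAMPLE = ["person", "bicycle", "car", "motorcycle", "bus", "truck"]

def fuzzy_coco_match(value: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest class name if similarity clears the cutoff, else None."""
    close = difflib.get_close_matches(value.lower(), COCO_SAMPLE, n=1, cutoff=cutoff)
    return close[0] if close else None

match = fuzzy_coco_match("buss")   # one-letter typo still resolves
miss = fuzzy_coco_match("zebra")   # nothing close enough in the list
```

With the sentence-transformers fallback removed, queries like "zebra" now simply return `None` instead of being mapped to the nearest embedding.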
models/detectors/detr.py

@@ -9,7 +9,7 @@ from models.detectors.base import DetectionResult, ObjectDetector


 class DetrDetector(ObjectDetector):
-    """Wrapper around facebook/detr-resnet-50 for
+    """Wrapper around facebook/detr-resnet-50 for object detection."""

     MODEL_NAME = "facebook/detr-resnet-50"
models/detectors/grounding_dino.py

@@ -9,7 +9,7 @@ from models.detectors.base import DetectionResult, ObjectDetector


 class GroundingDinoDetector(ObjectDetector):
-    """IDEA-Research Grounding DINO-B detector for open-vocabulary
+    """IDEA-Research Grounding DINO-B detector for open-vocabulary detection."""

     MODEL_NAME = "IDEA-Research/grounding-dino-base"
models/segmenters/model_loader.py

@@ -40,7 +40,7 @@ _REGISTRY: Dict[str, Callable[..., Segmenter]] = {


 def get_segmenter_detector(segmenter_name: str) -> str:
-    """Return the detector key associated with a segmenter
+    """Return the detector key associated with a segmenter."""
     spec = _SEGMENTER_SPECS.get(segmenter_name)
     if spec is None:
         available = ", ".join(sorted(_REGISTRY))
utils/profiler.py

@@ -390,7 +390,6 @@ def run_profiled_segmentation(
         queries,
         segmenter_name=segmenter_name,
         step=step,
-        enable_gpt=False,
         max_frames=max_frames,
         _perf_metrics=metrics,
         _perf_lock=lock,
utils/tracker.py

@@ -3,7 +3,6 @@ import numpy as np
 from scipy.optimize import linear_sum_assignment
 import scipy.linalg

-from utils.schemas import AssessmentStatus


 class KalmanFilter:

@@ -198,24 +197,6 @@ class KalmanFilter:
         return ret


-# Default staleness threshold: GPT metadata older than this many frames is flagged STALE
-MAX_STALE_FRAMES = 300
-
-GPT_SYNC_KEYS = frozenset({
-    # Legacy / polyfilled fields (consumed by frontend cards)
-    "gpt_distance_m", "gpt_direction", "gpt_description", "gpt_raw",
-    "threat_level_score", "distance_m", "direction", "description",
-    # Universal schema fields
-    "object_type", "size", "visible_weapons", "weapon_readiness",
-    "motion_status", "range_estimate", "bearing",
-    "threat_level", "threat_classification", "tactical_intent",
-    "dynamic_features",
-    # Provenance and temporal validity
-    "assessment_frame_index", "assessment_status",
-    # Mission relevance
-    "mission_relevant", "relevance_reason",
-})
-

 class STrack:
     """

@@ -247,8 +228,8 @@ class STrack:
         self.mean = None
         self.covariance = None

-        #
-        self.gpt_data = {}
+        # Per-track metadata (persistent across frames)
+        self.metadata = {}

     def _tlwh_from_xyxy(self, xyxy):
         """Convert xyxy to tlwh."""

@@ -566,29 +547,18 @@ class ByteTracker:
             d_out['bbox'] = [float(x) for x in tracked_bbox]
             d_out['track_id'] = f"T{str(track.track_id).zfill(2)}"

-            # Restore
-            for k, v in track.gpt_data.items():
+            # Restore metadata if track has it and current detection didn't
+            for k, v in track.metadata.items():
                 if k not in d_out:
                     d_out[k] = v

-            # --- Temporal validity check (INV-5, INV-11) ---
-            assessment_frame = d_out.get('assessment_frame_index')
-            if assessment_frame is not None:
-                frames_since = self.frame_id - assessment_frame
-                if frames_since > MAX_STALE_FRAMES:
-                    d_out['assessment_status'] = AssessmentStatus.STALE
-                    d_out['assessment_age_frames'] = frames_since
-            elif d_out.get('assessment_status') != AssessmentStatus.ASSESSED:
-                # INV-6: Unassessed objects get explicit UNASSESSED status
-                d_out['assessment_status'] = AssessmentStatus.UNASSESSED
-
             # Update history
-            if 'history' not in track.gpt_data:
-                track.gpt_data['history'] = []
-            track.gpt_data['history'].append(d_out['bbox'])
-            if len(track.gpt_data['history']) > 30:
-                track.gpt_data['history'].pop(0)
-            d_out['history'] = track.gpt_data['history']
+            if 'history' not in track.metadata:
+                track.metadata['history'] = []
+            track.metadata['history'].append(d_out['bbox'])
+            if len(track.metadata['history']) > 30:
+                track.metadata['history'].pop(0)
+            d_out['history'] = track.metadata['history']

             results.append(d_out)

@@ -606,42 +576,11 @@ class ByteTracker:
         return results

     def _sync_data(self, track, det_source):
-        """Propagate
+        """Propagate metadata (e.g. depth) between detection and track."""
-        # 1. From Source to Track (Update)
         source_data = det_source.original_data if hasattr(det_source, 'original_data') else {}
-        for k in GPT_SYNC_KEYS:
+        for k in ("depth_rel",):
             if k in source_data:
-                track.gpt_data[k] = source_data[k]
+                track.metadata[k] = source_data[k]
-
-        # 2. From Track to Source (Forward fill logic handled in output construction)
-
-    def inject_metadata(self, tracked_dets):
-        """Push metadata from post-processed detection dicts back into internal STrack objects.
-
-        Needed because GPT results are added to detection dicts *after* tracker.update()
-        returns, so the tracker's internal state doesn't have GPT data unless we
-        explicitly push it back in.
-
-        Records assessment_frame_index for temporal validity tracking (INV-5).
-        """
-        meta_by_tid = {}
-        for d in tracked_dets:
-            tid = d.get('track_id')
-            if not tid:
-                continue
-            meta = {k: d[k] for k in GPT_SYNC_KEYS if k in d}
-            if meta:
-                # Ensure assessment_frame_index is recorded
-                if "assessment_frame_index" not in meta and any(
-                    k in meta for k in ("threat_level_score", "gpt_raw", "object_type")
-                ):
-                    meta["assessment_frame_index"] = self.frame_id
-                    meta["assessment_status"] = AssessmentStatus.ASSESSED
-                meta_by_tid[tid] = meta
-        for track in self.tracked_stracks:
-            tid_str = f"T{str(track.track_id).zfill(2)}"
-            if tid_str in meta_by_tid:
-                track.gpt_data.update(meta_by_tid[tid_str])


 # --- Helper Functions ---
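After the rename, `track.metadata` acts as a forward-fill store plus a 30-entry history ring, per the tracker diff above. This standalone mirror of that logic uses hypothetical names (`Track`, `build_output`); it is not the repository's ByteTracker code.

```python
class Track:
    """Minimal stand-in for an STrack: just an id and a metadata dict."""
    def __init__(self, track_id: int) -> None:
        self.track_id = track_id
        self.metadata = {}

def build_output(track: Track, detection: dict, bbox: list) -> dict:
    """Forward-fill metadata from earlier frames and cap bbox history at 30."""
    d_out = dict(detection)
    d_out["bbox"] = list(bbox)
    # Keys synced earlier (e.g. depth_rel via _sync_data) survive frames
    # where the current detection dict doesn't carry them.
    for k, v in track.metadata.items():
        if k not in d_out:
            d_out[k] = v
    history = track.metadata.setdefault("history", [])
    history.append(d_out["bbox"])
    if len(history) > 30:
        history.pop(0)  # ring-buffer behavior: drop the oldest bbox
    d_out["history"] = history
    return d_out

t = Track(1)
t.metadata["depth_rel"] = 0.4  # as _sync_data would have written it
out = build_output(t, {"label": "car"}, [0, 0, 10, 10])
```

The 30-entry cap bounds per-track memory while keeping enough trail for the frontend to draw motion history.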