# CONTEXT.md — Technical Reference for autolabel

> Keep this file up to date as the project evolves. Read this first when
> resuming work after a break.

---

## What this project does

Uses **OWLv2** (open-vocabulary object detection) and **SAM2** (segment anything) to auto-label images via text prompts, then exports a COCO dataset for fine-tuning a detection or segmentation model.

**Current phase:** labeling — two modes available:

- **Detection** — OWLv2 only; produces bounding boxes.
- **Segmentation** — OWLv2 → boxes → SAM2 → pixel masks + COCO polygons.

**Future phase:** fine-tune OWLv2 on the exported COCO dataset using `scripts/finetune_owlv2.py` (code is ready, not yet in active use).

---

## Architecture

### Primary interface — `app.py` (Gradio web UI)

Two-tab UI; all artifacts are written to a session temp dir (nothing in the project tree):

| Tab | What it does |
|-----|--------------|
| 🧪 Test | Single image → instant annotated preview. Dial in prompts and threshold before a batch run. |
| 📂 Batch | Multiple images → annotated gallery + downloadable ZIP (resized images + `coco_export.json`). |
### CLI scripts (`scripts/`)

Independent entry points for headless / automation use:

| Script | Purpose |
|--------|---------|
| `run_detection.py` | Batch detect → `data/detections/` |
| `export_coco.py` | Build COCO JSON from `data/labeled/` |
| `finetune_owlv2.py` | Fine-tune OWLv2 (future) |

### `autolabel/` package

| Module | Responsibility |
|--------|----------------|
| `config.py` | Pydantic settings singleton, auto device detection |
| `detect.py` | OWLv2 inference — `infer()` (PIL, shared) + `detect_image()` (file) + `run_detection()` (batch CLI) |
| `segment.py` | SAM2 integration — `load_sam2()`, `segment_with_boxes()`, `_mask_to_polygon()` |
| `export.py` | COCO JSON builder (no pycocotools); supports both bbox-only and segmentation |
| `finetune.py` | Training loop, loss, dataset, scheduler |
| `utils.py` | `collect_images`, `save_json`, `load_json`, `setup_logging` |

**Key design:** `detect.infer()` is the single OWLv2 inference implementation. `app.py` chains SAM2 on top when mode == "Segmentation" — no duplication.

---

## Device strategy

| Platform | Device | dtype |
|----------|--------|-------|
| Apple Silicon | `mps` | `float32` |
| Windows/Linux GPU | `cuda` | `float16` |
| CPU fallback | `cpu` | `float32` |

`PYTORCH_ENABLE_MPS_FALLBACK=1` must be set before torch is imported on MPS (`.env` handles this). Without it, some OWLv2 ops raise `NotImplementedError`.

---

## OWLv2 model

Default: `google/owlv2-large-patch14-finetuned` (~700 MB, cached in `~/.cache/huggingface` after the first download).
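Both environment variables mentioned in this file are loaded from `.env` via `python-dotenv` before torch is imported. A minimal `.env` sketch — the exact file contents here are an assumption, so defer to the project's actual `.env` (or `.env.example`, if one exists):

```
# Required on Apple Silicon: must be set before torch is imported
PYTORCH_ENABLE_MPS_FALLBACK=1

# Optional: override the default OWLv2 checkpoint
AUTOLABEL_MODEL=google/owlv2-base-patch16
```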
Override via env var: `AUTOLABEL_MODEL=google/owlv2-base-patch16`

| Variant | Size | Notes |
|---------|------|-------|
| `owlv2-base-patch16` | ~300 MB | Faster, lower accuracy |
| `owlv2-large-patch14` | ~700 MB | Good balance |
| `owlv2-large-patch14-finetuned` | ~700 MB | Default — pre-trained on LVIS/Objects365 |

---

## Dependency decisions

| Package | Why kept |
|---------|----------|
| `torch` / `torchvision` | OWLv2 + SAM2 inference |
| `transformers>=4.45` | OWLv2 and SAM2 models & processors |
| `pillow` | Image I/O and annotation drawing |
| `numpy` | Gradio image array interchange; mask arrays |
| `opencv-python` | `cv2.findContours` for mask → COCO polygon (SAM2) |
| `pydantic` / `pydantic-settings` | Type-safe config with env-var loading |
| `click` | CLI option parsing |
| `tqdm` | Progress bars in CLI batch runner |
| `python-dotenv` | Load `.env` before torch (MPS fallback) |
| `gradio` | Web UI |

Removed: `supervision` (unused), `matplotlib` (fine-tune charts gone), `requests` (Label Studio gone).

---

## Inference flow

```
PIL image
  ↓
detect.infer(image, processor, model, prompts, threshold, device, dtype)
  ↓
list[{label, score, box_xyxy}]
  │
  ├─ Detection mode ──────────────────────────────────────────────
  │    ↓ used by app.py directly
  │    ↓ (CLI: wrapped by detect_image → JSON)
  │    ↓ export.build_coco → coco_export.json (bbox only, segmentation: [])
  │
  └─ Segmentation mode ───────────────────────────────────────────
       ↓ segment.segment_with_boxes(image, detections, sam2_processor, sam2_model)
       ↓ list[{label, score, box_xyxy, mask (np.ndarray), segmentation (polygons)}]
       ↓ mask used for visualization overlay; dropped before JSON serialisation
       ↓ export.build_coco → coco_export.json (bbox + segmentation polygons)
```

---

## Batch export ZIP structure

```
autolabel_export.zip
├── coco_export.json      # COCO format, dimensions match images below
└── images/
    ├── photo1.jpg        # resized to chosen training size (e.g. 640×640)
    └── photo2.jpg
```

COCO bounding boxes are in the coordinate space of the resized images.

---

## Known limitations

- OWLv2 is detection-only — bounding boxes, no masks.
- Objects < 32×32 px are often missed at default resolution.
- MPS inference is slower than CUDA but fast enough for development.
- Threshold default is 0.1 (intentionally low — it is easier to discard false positives than to recover missed objects).

---

## Fine-tuning (future)

The fine-tuning infrastructure is complete (`autolabel/finetune.py`, `scripts/finetune_owlv2.py`) but not in active use. Workflow when ready:

1. Use the Batch tab to generate a labeled `coco_export.json`
2. Run `make finetune` (or `uv run python scripts/finetune_owlv2.py --help`)
3. Evaluate the fine-tuned model in the Test tab
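---

The `coco_export.json` produced by the batch export can be illustrated with a stdlib-only sketch. This is *not* the project's `export.build_coco` (whose real signature lives in `autolabel/export.py`) — the function name, arguments, and ID scheme below are hypothetical, but the output shape follows the COCO convention described above: `bbox` as `[x, y, width, height]` in resized-image coordinates, and `segmentation: []` in Detection mode.

```python
import json

def build_coco_sketch(images, detections_per_image, categories):
    """Hypothetical illustration of a bbox-only COCO export.

    images: list of (file_name, width, height) for the *resized* images
    detections_per_image: one detection list per image; each detection is a
        dict with "label" and "box_xyxy" (resized-image coordinates)
    categories: ordered list of label names
    """
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}
    coco = {
        "images": [
            {"id": i + 1, "file_name": fn, "width": w, "height": h}
            for i, (fn, w, h) in enumerate(images)
        ],
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
        "annotations": [],
    }
    ann_id = 1
    for img_idx, dets in enumerate(detections_per_image):
        for det in dets:
            x1, y1, x2, y2 = det["box_xyxy"]
            w, h = x2 - x1, y2 - y1
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_idx + 1,
                "category_id": cat_ids[det["label"]],
                "bbox": [x1, y1, w, h],  # COCO convention: [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
                "segmentation": [],      # empty in Detection mode; polygons in Segmentation mode
            })
            ann_id += 1
    return coco

coco = build_coco_sketch(
    images=[("photo1.jpg", 640, 640)],
    detections_per_image=[[{"label": "cat", "box_xyxy": [100, 120, 300, 360]}]],
    categories=["cat"],
)
print(json.dumps(coco, indent=2))
```

A box `[100, 120, 300, 360]` in xyxy form becomes `bbox: [100, 120, 200, 240]` — a frequent source of off-by-conversion bugs when consuming OWLv2 output, which is xyxy.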