HORAMA-BTP
Vision-Language Model for Construction Site Analysis
Image → Structured JSON | Built on Qwen2.5-VL | Fine-tuned with LoRA
Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.
Overview
Horama-BTP is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning 15 analysis dimensions -- from construction progress estimation and safety compliance to quality defects and environmental impact.
The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.
Key Capabilities
| Dimension | What the model extracts |
|---|---|
| Progress | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| Safety | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| Quality | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| Observations | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| Logistics | Materials on site, equipment status (idle/operating), access constraints |
| Environment | Dust, noise, waste, spills; waste management assessment |
| Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof |
Architecture
```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘          (backbone)              (r=32, alpha=64)
```
| Component | Details |
|---|---|
| Backbone | Qwen2.5-VL-3B-Instruct -- 3B parameter multimodal transformer |
| Adaptation | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA Rank | r=32, alpha=64 (2x scaling), dropout=0.1 |
| Precision | BF16 (GPU) / FP32 (CPU/MPS) |
| Output | Deterministic JSON (temperature=0, greedy decoding) |
Design Principles
- Schema-first: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- Evidence-linked: All observations reference `evidence_id` entries -- no claim without visual justification
- Confidence-scored: Every detection carries a [0, 1] confidence score for downstream filtering
- Honest by design: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details
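The schema-first principle can be sketched as a minimal structural check: verify that a parsed output carries all 15 required top-level fields before accepting it. This is a simplified illustration, not the model's actual validator; a full check (e.g. with the `jsonschema` package) would also cover enumerations and nested types.

```python
# Minimal sketch of the schema-first principle: field-presence check only.
# The field names are the 15 required top-level fields from the output schema.
REQUIRED_FIELDS = {
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
}

def missing_fields(analysis: dict) -> set:
    """Return the required top-level fields absent from an analysis."""
    return REQUIRED_FIELDS - analysis.keys()

# Usage: an output missing "evidence" and "metadata" is incomplete.
partial = {f: None for f in REQUIRED_FIELDS - {"evidence", "metadata"}}
print(sorted(missing_fields(partial)))  # → ['evidence', 'metadata']
```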
Quick Start
```python
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate (greedy decoding for deterministic JSON)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object: first "{" to last "}" of the response
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])
print(json.dumps(analysis, indent=2))
```
Output Schema
The model outputs a single JSON object with 15 required top-level fields:
```
{
  "job_type": "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context": { location_hint, weather_light, viewpoint },
  "summary": { one_liner, confidence },
  "progress": { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations": [{ type, label, attributes, confidence, evidence_ids }],
  "safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality": { issues[], non_conformities[] },
  "logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment": { impacts[], waste_management },
  "evidence": [{ evidence_id, source, bbox_xyxy, description }],
  "unknown": [{ question, why_unknown, needed_input }],
  "domain_fields": { custom_kpis, lot_breakdown, client_specific },
  "metadata": { model, version, generated_at }
}
```
Controlled Vocabularies
The schema enforces controlled enumerations across all categorical fields:
| Field | Allowed values |
|---|---|
| `overall_stage` | planning, earthworks, foundations, structure, envelope, mep, finishing, commissioning, unknown |
| `ppe_item` | helmet, vest, gloves, goggles, harness, boots, mask, other |
| `hazard_type` | fall_risk, open_trench, moving_vehicle, electrical, fire, unstable_load, poor_housekeeping, restricted_area, other |
| `issue_type` | crack, misalignment, water_infiltration, corrosion, spalling, poor_finish, missing_component, rework, other |
| `observation_type` | object, material, equipment, signage, defect, hazard, personnel, vehicle, structure_part, other |
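A controlled vocabulary is simple to enforce on the consumer side as well. A hypothetical guard, with enumerations copied from the table above (only three fields shown for brevity):

```python
# Controlled-vocabulary membership check; values outside a field's
# enumeration are reported rather than silently accepted.
VOCABULARIES = {
    "overall_stage": {"planning", "earthworks", "foundations", "structure",
                      "envelope", "mep", "finishing", "commissioning", "unknown"},
    "ppe_item": {"helmet", "vest", "gloves", "goggles", "harness",
                 "boots", "mask", "other"},
    "hazard_type": {"fall_risk", "open_trench", "moving_vehicle", "electrical",
                    "fire", "unstable_load", "poor_housekeeping",
                    "restricted_area", "other"},
}

def check_vocab(field: str, value: str) -> bool:
    """Return True if the value is allowed for the given field."""
    return value in VOCABULARIES.get(field, set())

print(check_vocab("overall_stage", "structure"))   # True
print(check_vocab("hazard_type", "earthquake"))    # False
```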
Example Output
Given a drone photograph of a wood-framed house under construction:
```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```
(Truncated for readability -- full output includes all 15 top-level fields)
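The evidence-linking system is easy to consume: each finding's `evidence_ids` resolve against the top-level `evidence` array. A sketch (the helper is hypothetical; data shapes follow the example above):

```python
# Resolve a finding's evidence_ids to their human-readable descriptions.
def resolve_evidence(analysis: dict, evidence_ids: list) -> list:
    """Return the evidence descriptions referenced by a finding."""
    index = {e["evidence_id"]: e["description"] for e in analysis.get("evidence", [])}
    return [index[eid] for eid in evidence_ids if eid in index]

analysis = {
    "evidence": [
        {"evidence_id": "ev_003", "source": "image",
         "description": "Two workers wearing high-visibility vests and hard hats"},
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ]
}
hazard = {"hazard_type": "fall_risk", "evidence_ids": ["ev_005"]}
print(resolve_evidence(analysis, hazard["evidence_ids"]))
```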
Training Details
| Parameter | Value |
|---|---|
| Method | LoRA (Parameter-Efficient Fine-Tuning) |
| Epochs | 15 |
| Effective batch size | 4 (batch=1, accumulation=4) |
| Learning rate | 1e-4 with cosine schedule |
| Warmup | 10% of training steps |
| Weight decay | 0.01 |
| Gradient checkpointing | Enabled |
| Trainable parameters | ~1.5% of total model parameters |
| Framework | Transformers + PEFT |
| Hardware | NVIDIA GPU with BF16 |
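The LoRA settings above imply a small trainable footprint, which a back-of-envelope calculation makes concrete. The projection dimensions below are illustrative placeholders, not the actual Qwen2.5-VL-3B shapes:

```python
# Back-of-envelope LoRA accounting for r=32, alpha=64 from the table above.
def lora_params(d_in: int, d_out: int, r: int = 32) -> int:
    """A LoRA adapter on a d_in x d_out projection adds two low-rank
    matrices: A (d_in x r) and B (r x d_out)."""
    return d_in * r + r * d_out

r, alpha = 32, 64
scaling = alpha / r                      # the "2x scaling" from the table
per_layer = lora_params(2048, 2048, r)   # one square attention projection (illustrative dims)
print(scaling)      # 2.0
print(per_layer)    # 131072 extra trainable parameters for that projection
```

With adapters only on the seven listed projections per transformer block, the total stays a small fraction of the 3B backbone, consistent with the ~1.5% trainable figure.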
Intended Uses
Primary use cases:
- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation
Input requirements:
- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with daylight, well-lit images
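A hypothetical pre-flight check mirroring the input requirements above, accepting only the listed image formats before a file is sent to the model:

```python
from pathlib import Path

# Accepted input formats per the requirements above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}

def is_supported_image(path: str) -> bool:
    """Return True if the file extension is an accepted input format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported_image("site_photo.PNG"))   # True
print(is_supported_image("drone_clip.mp4"))   # False
```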
Limitations
- Single-image analysis: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- Visible elements only: Cannot infer hidden structural issues, underground utilities, or elements behind walls
- No sensory data: Cannot detect noise levels, dust concentration, or odors from static images
- Resolution-dependent: Small or distant objects (e.g., PPE details at long range) may have lower confidence
- Schema-bound: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point
Hardware Requirements
| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production |
| Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend |
| CPU | ~12 GB RAM | FP32 | Functional but slower |
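The precision choices in the table can be encoded as simple selection logic. This sketch keeps the decision pure Python for clarity; in practice the availability flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`:

```python
# Device/precision selection mirroring the hardware table:
# BF16 on NVIDIA GPUs, FP32 on Apple Silicon (MPS) and CPU.
def pick_precision(cuda_available: bool, mps_available: bool) -> tuple:
    """Return (device, dtype name) for loading the model."""
    if cuda_available:
        return ("cuda", "bfloat16")
    if mps_available:
        return ("mps", "float32")
    return ("cpu", "float32")

print(pick_precision(True, False))    # ('cuda', 'bfloat16')
print(pick_precision(False, True))    # ('mps', 'float32')
print(pick_precision(False, False))   # ('cpu', 'float32')
```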
License
AGPL-3.0 -- the model may be freely used, modified, and redistributed, provided derivative works remain open-source under the same license.
For commercial or closed-source usage, please contact Horama for a commercial license.
Citation
```bibtex
@misc{horama-btp-2025,
  title  = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author = {Horama},
  year   = {2025},
  url    = {https://huggingface.co/Horama/Horama_BTP},
  note   = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```
Built by Horama | Construction intelligence, powered by vision AI