
HORAMA-BTP

Vision-Language Model for Construction Site Analysis

Image → Structured JSON | Built on Qwen2.5-VL | Fine-tuned with LoRA

Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.

Overview

Horama-BTP is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning 15 analysis dimensions -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

Key Capabilities

| Dimension | What the model extracts |
|---|---|
| Progress | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| Safety | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| Quality | Structural defects (cracks, misalignment, corrosion, ...), non-conformities |
| Observations | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| Logistics | Materials on site, equipment status (idle/operating), access constraints |
| Environment | Dust, noise, waste, spills; waste management assessment |
| Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof |

Architecture

Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘         (backbone)              (r=32, alpha=64)
Component Details

| Component | Details |
|---|---|
| Backbone | Qwen2.5-VL-3B-Instruct -- 3B-parameter multimodal transformer |
| Adaptation | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA rank | r=32, alpha=64 (2x scaling), dropout=0.1 |
| Precision | BF16 (GPU) / FP32 (CPU/MPS) |
| Output | Deterministic JSON (temperature=0, greedy decoding) |
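The adapter setup described above can be sketched with PEFT's `LoraConfig`. The exact configuration used for training is not published, so treat this as an illustrative reconstruction from the hyperparameters listed:

```python
from peft import LoraConfig

# Illustrative LoRA configuration matching the documented hyperparameters.
# The actual training config is not published; this is a sketch.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,          # scaling = alpha / r = 2x
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",     # attention projections
        "gate_proj", "up_proj", "down_proj",        # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```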

Design Principles

  • Schema-first: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
  • Evidence-linked: All observations reference evidence_id entries -- no claim without visual justification
  • Confidence-scored: Every detection carries a [0, 1] confidence score for downstream filtering
  • Honest by design: When something is uncertain or not visible, the model uses "unknown", null, or empty arrays -- never hallucinated details
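The confidence-scored design makes downstream filtering straightforward. A minimal sketch, using a hypothetical truncated analysis dict (not a real inference result):

```python
# Hypothetical, truncated Horama-BTP output used for illustration only
analysis = {
    "safety": {
        "hazards": [
            {"hazard_type": "fall_risk", "severity": "medium",
             "confidence": 0.6, "evidence_ids": ["ev_005"]},
            {"hazard_type": "open_trench", "severity": "low",
             "confidence": 0.3, "evidence_ids": []},
        ]
    }
}

def filter_by_confidence(items, threshold=0.5):
    """Keep only detections at or above the confidence threshold."""
    return [item for item in items if item.get("confidence", 0.0) >= threshold]

high_conf = filter_by_confidence(analysis["safety"]["hazards"])
# Only the fall_risk hazard (confidence 0.6) survives a 0.5 threshold
```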

Quick Start

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

# Load model and processor
model_id = "Horama/Horama_BTP"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens (skip the echoed prompt)
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the response
import json

json_start = result.find("{")      # first opening brace of the report
json_end = result.rfind("}") + 1   # last closing brace
analysis = json.loads(result[json_start:json_end])

print(json.dumps(analysis, indent=2))

Output Schema

The model outputs a single JSON object with 15 required top-level fields:

{
  "job_type":        "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type":      "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context":   { location_hint, weather_light, viewpoint },
  "summary":         { one_liner, confidence },
  "progress":        { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities":  [{ activity, status, confidence, evidence_ids }],
  "observations":    [{ type, label, attributes, confidence, evidence_ids }],
  "safety":          { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality":         { issues[], non_conformities[] },
  "logistics":       { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment":     { impacts[], waste_management },
  "evidence":        [{ evidence_id, source, bbox_xyxy, description }],
  "unknown":         [{ question, why_unknown, needed_input }],
  "domain_fields":   { custom_kpis, lot_breakdown, client_specific },
  "metadata":        { model, version, generated_at }
}

Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| overall_stage | planning, earthworks, foundations, structure, envelope, mep, finishing, commissioning, unknown |
| ppe_item | helmet, vest, gloves, goggles, harness, boots, mask, other |
| hazard_type | fall_risk, open_trench, moving_vehicle, electrical, fire, unstable_load, poor_housekeeping, restricted_area, other |
| issue_type | crack, misalignment, water_infiltration, corrosion, spalling, poor_finish, missing_component, rework, other |
| observation_type | object, material, equipment, signage, defect, hazard, personnel, vehicle, structure_part, other |
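A consumer can enforce these enumerations before trusting a report. The sketch below is a simplified stand-in for full JSON Schema validation, covering two of the vocabularies listed above:

```python
# Simplified vocabulary check -- a stand-in for validating against the full
# Horama-BTP v1 JSON Schema (draft 2020-12).
ALLOWED = {
    "overall_stage": {"planning", "earthworks", "foundations", "structure",
                      "envelope", "mep", "finishing", "commissioning", "unknown"},
    "hazard_type": {"fall_risk", "open_trench", "moving_vehicle", "electrical",
                    "fire", "unstable_load", "poor_housekeeping",
                    "restricted_area", "other"},
}

def check_enum(field, value):
    """Raise if a categorical value falls outside its controlled vocabulary."""
    if value not in ALLOWED[field]:
        raise ValueError(f"{field}={value!r} is not in the controlled vocabulary")
    return value

check_enum("overall_stage", "structure")    # passes
# check_enum("hazard_type", "asbestos")     # would raise ValueError
```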

Example Output

Given a drone photograph of a wood-framed house under construction:

{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}

(Truncated for readability -- full output includes all 15 top-level fields)
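The evidence-linking invariant (every `evidence_ids` reference resolves to an entry in the top-level `evidence` array) can be checked mechanically. This helper is not part of any published API; it is a sketch run here against a fragment of the example above:

```python
# Sketch: verify that every evidence_ids reference in a report resolves to
# an entry in the top-level `evidence` array.
def collect_evidence_refs(node):
    """Recursively gather every evidence_ids value in the report."""
    refs = set()
    if isinstance(node, dict):
        refs.update(node.get("evidence_ids", []))
        for value in node.values():
            refs |= collect_evidence_refs(value)
    elif isinstance(node, list):
        for value in node:
            refs |= collect_evidence_refs(value)
    return refs

# Fragment of the example report above
report = {
    "safety": {
        "hazards": [
            {"hazard_type": "fall_risk", "severity": "medium",
             "confidence": 0.6, "evidence_ids": ["ev_005"]}
        ]
    },
    "evidence": [
        {"evidence_id": "ev_003", "source": "image",
         "description": "Two workers wearing high-visibility vests and hard hats"},
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ],
}

known_ids = {entry["evidence_id"] for entry in report["evidence"]}
dangling = collect_evidence_refs(report) - known_ids
# dangling is empty -> every finding is backed by an evidence entry
```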

Training Details

| Parameter | Value |
|---|---|
| Method | LoRA (Parameter-Efficient Fine-Tuning) |
| Epochs | 15 |
| Effective batch size | 4 (batch=1, accumulation=4) |
| Learning rate | 1e-4 with cosine schedule |
| Warmup | 10% of training steps |
| Weight decay | 0.01 |
| Gradient checkpointing | Enabled |
| Trainable parameters | ~1.5% of total model parameters |
| Framework | Transformers + PEFT |
| Hardware | NVIDIA GPU with BF16 |
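These hyperparameters map directly onto Transformers' `TrainingArguments`. The actual training script is not published, so the following is an illustrative reconstruction (`output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Illustrative reconstruction of the training setup from the table above.
# The actual training script is not published.
args = TrainingArguments(
    output_dir="horama-btp-lora",      # hypothetical path
    num_train_epochs=15,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,     # effective batch size = 1 * 4 = 4
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # 10% of training steps
    weight_decay=0.01,
    gradient_checkpointing=True,
    bf16=True,
)
```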

Intended Uses

Primary use cases:

  • Automated construction progress reporting from site photographs
  • Safety compliance auditing (PPE detection, hazard identification)
  • Quality control -- detecting visible defects and non-conformities
  • Logistics monitoring -- tracking materials and equipment on site
  • Environmental impact documentation

Input requirements:

  • Single construction site image (JPEG, PNG, WebP, BMP)
  • Supports ground-level, drone, and fixed-camera viewpoints
  • Works best with daylight, well-lit images

Limitations

  • Single-image analysis: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
  • Visible elements only: Cannot infer hidden structural issues, underground utilities, or elements behind walls
  • No sensory data: Cannot detect noise levels, dust concentration, or odors from static images
  • Resolution-dependent: Small or distant objects (e.g., PPE details at long range) may have lower confidence
  • Schema-bound: Output strictly follows the Horama-BTP v1 schema -- custom fields require the domain_fields extension point

Hardware Requirements

| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production |
| Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend |
| CPU | ~12 GB RAM | FP32 | Functional but slower |
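When loading the model, the precision column above translates into a dtype choice per device. A minimal helper (written with plain strings so it stands alone; in practice you would pass `torch.bfloat16` or `torch.float32` to `from_pretrained`):

```python
# Sketch: map a device type to the recommended precision for this model,
# per the hardware table above.
def pick_dtype(device: str) -> str:
    """Return the recommended dtype name for a given device type."""
    return "bfloat16" if device == "cuda" else "float32"

pick_dtype("cuda")  # -> "bfloat16"
pick_dtype("mps")   # -> "float32"
```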

License

AGPL-3.0 -- This model may be freely used, modified, and redistributed, provided that derivative works remain open source under the same license.

For commercial or closed-source usage, please contact Horama for a commercial license.

Citation

@misc{horama-btp-2025,
  title   = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author  = {Horama},
  year    = {2025},
  url     = {https://huggingface.co/Horama/Horama_BTP},
  note    = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}

Built by Horama | Construction intelligence, powered by vision AI
