HORAMA-BTP
Vision-Language Model for Construction Site Analysis
Image → Structured JSON | Built on Qwen2.5-VL | Fine-tuned with LoRA
Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.
Overview
Horama-BTP is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning 15 analysis dimensions -- from construction progress estimation and safety compliance to quality defects and environmental impact.
The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.
Key Capabilities
| Dimension | What the model extracts |
|---|---|
| Progress | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| Safety | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| Quality | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| Observations | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| Logistics | Materials on site, equipment status (idle/operating), access constraints |
| Environment | Dust, noise, waste, spills; waste management assessment |
| Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof |
Architecture
```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘          (backbone)              (r=32, alpha=64)
```
| Component | Details |
|---|---|
| Backbone | Qwen2.5-VL-3B-Instruct -- 3B parameter multimodal transformer |
| Adaptation | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA Rank | r=32, alpha=64 (2x scaling), dropout=0.1 |
| Precision | BF16 (GPU) / FP32 (CPU/MPS) |
| Output | Deterministic JSON (temperature=0, greedy decoding) |
Design Principles
- Schema-first: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- Evidence-linked: All observations reference `evidence_id` entries -- no claim without visual justification
- Confidence-scored: Every detection carries a [0, 1] confidence score for downstream filtering
- Honest by design: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details
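The schema-first principle can be sketched as a minimal structural check: verify that a parsed output carries all 15 required top-level fields before accepting it. This is a simplified illustration, not the model's actual validator; a full check (e.g. with the `jsonschema` package) would also cover enumerations and nested types.

```python
# Minimal sketch of the schema-first principle: field-presence check only.
# The field names are the 15 required top-level fields from the output schema.
REQUIRED_FIELDS = {
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
}

def missing_fields(analysis: dict) -> set:
    """Return the required top-level fields absent from an analysis."""
    return REQUIRED_FIELDS - analysis.keys()

# Usage: an output missing "evidence" and "metadata" is incomplete.
partial = {f: None for f in REQUIRED_FIELDS - {"evidence", "metadata"}}
print(sorted(missing_fields(partial)))  # → ['evidence', 'metadata']
```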
Quick Start
```python
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate (greedy decoding for deterministic JSON)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object: first "{" to last "}" of the response
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])
print(json.dumps(analysis, indent=2))
```
Output Schema
The model outputs a single JSON object with 15 required top-level fields:
```
{
  "job_type": "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context": { location_hint, weather_light, viewpoint },
  "summary": { one_liner, confidence },
  "progress": { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations": [{ type, label, attributes, confidence, evidence_ids }],
  "safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality": { issues[], non_conformities[] },
  "logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment": { impacts[], waste_management },
  "evidence": [{ evidence_id, source, bbox_xyxy, description }],
  "unknown": [{ question, why_unknown, needed_input }],
  "domain_fields": { custom_kpis, lot_breakdown, client_specific },
  "metadata": { model, version, generated_at }
}
```
Controlled Vocabularies
The schema enforces controlled enumerations across all categorical fields:
| Field | Allowed values |
|---|---|
| `overall_stage` | planning, earthworks, foundations, structure, envelope, mep, finishing, commissioning, unknown |
| `ppe_item` | helmet, vest, gloves, goggles, harness, boots, mask, other |
| `hazard_type` | fall_risk, open_trench, moving_vehicle, electrical, fire, unstable_load, poor_housekeeping, restricted_area, other |
| `issue_type` | crack, misalignment, water_infiltration, corrosion, spalling, poor_finish, missing_component, rework, other |
| `observation_type` | object, material, equipment, signage, defect, hazard, personnel, vehicle, structure_part, other |
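A controlled vocabulary is simple to enforce on the consumer side as well. A hypothetical guard, with enumerations copied from the table above (only three fields shown for brevity):

```python
# Controlled-vocabulary membership check; values outside a field's
# enumeration are reported rather than silently accepted.
VOCABULARIES = {
    "overall_stage": {"planning", "earthworks", "foundations", "structure",
                      "envelope", "mep", "finishing", "commissioning", "unknown"},
    "ppe_item": {"helmet", "vest", "gloves", "goggles", "harness",
                 "boots", "mask", "other"},
    "hazard_type": {"fall_risk", "open_trench", "moving_vehicle", "electrical",
                    "fire", "unstable_load", "poor_housekeeping",
                    "restricted_area", "other"},
}

def check_vocab(field: str, value: str) -> bool:
    """Return True if the value is allowed for the given field."""
    return value in VOCABULARIES.get(field, set())

print(check_vocab("overall_stage", "structure"))   # True
print(check_vocab("hazard_type", "earthquake"))    # False
```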
Example Output
Given a drone photograph of a wood-framed house under construction:
```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```
(Truncated for readability -- full output includes all 15 top-level fields)
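The evidence-linking system is easy to consume: each finding's `evidence_ids` resolve against the top-level `evidence` array. A sketch (the helper is hypothetical; data shapes follow the example above):

```python
# Resolve a finding's evidence_ids to their human-readable descriptions.
def resolve_evidence(analysis: dict, evidence_ids: list) -> list:
    """Return the evidence descriptions referenced by a finding."""
    index = {e["evidence_id"]: e["description"] for e in analysis.get("evidence", [])}
    return [index[eid] for eid in evidence_ids if eid in index]

analysis = {
    "evidence": [
        {"evidence_id": "ev_003", "source": "image",
         "description": "Two workers wearing high-visibility vests and hard hats"},
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ]
}
hazard = {"hazard_type": "fall_risk", "evidence_ids": ["ev_005"]}
print(resolve_evidence(analysis, hazard["evidence_ids"]))
```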
Training Details
| Parameter | Value |
|---|---|
| Method | LoRA (Parameter-Efficient Fine-Tuning) |
| Epochs | 15 |
| Effective batch size | 4 (batch=1, accumulation=4) |
| Learning rate | 1e-4 with cosine schedule |
| Warmup | 10% of training steps |
| Weight decay | 0.01 |
| Gradient checkpointing | Enabled |
| Trainable parameters | ~1.5% of total model parameters |
| Framework | Transformers + PEFT |
| Hardware | NVIDIA GPU with BF16 |
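The LoRA settings above imply a small trainable footprint, which a back-of-envelope calculation makes concrete. The projection dimensions below are illustrative placeholders, not the actual Qwen2.5-VL-3B shapes:

```python
# Back-of-envelope LoRA accounting for r=32, alpha=64 from the table above.
def lora_params(d_in: int, d_out: int, r: int = 32) -> int:
    """A LoRA adapter on a d_in x d_out projection adds two low-rank
    matrices: A (d_in x r) and B (r x d_out)."""
    return d_in * r + r * d_out

r, alpha = 32, 64
scaling = alpha / r                      # the "2x scaling" from the table
per_layer = lora_params(2048, 2048, r)   # one square attention projection (illustrative dims)
print(scaling)      # 2.0
print(per_layer)    # 131072 extra trainable parameters for that projection
```

With adapters only on the seven listed projections per transformer block, the total stays a small fraction of the 3B backbone, consistent with the ~1.5% trainable figure.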
Intended Uses
Primary use cases:
- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation
Input requirements:
- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with daylight, well-lit images
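A hypothetical pre-flight check mirroring the input requirements above, accepting only the listed image formats before a file is sent to the model:

```python
from pathlib import Path

# Accepted input formats per the requirements above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}

def is_supported_image(path: str) -> bool:
    """Return True if the file extension is an accepted input format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported_image("site_photo.PNG"))   # True
print(is_supported_image("drone_clip.mp4"))   # False
```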
Limitations
- Single-image analysis: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- Visible elements only: Cannot infer hidden structural issues, underground utilities, or elements behind walls
- No sensory data: Cannot detect noise levels, dust concentration, or odors from static images
- Resolution-dependent: Small or distant objects (e.g., PPE details at long range) may have lower confidence
- Schema-bound: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point
Hardware Requirements
| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production |
| Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend |
| CPU | ~12 GB RAM | FP32 | Functional but slower |
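The precision choices in the table can be encoded as simple selection logic. This sketch keeps the decision pure Python for clarity; in practice the availability flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`:

```python
# Device/precision selection mirroring the hardware table:
# BF16 on NVIDIA GPUs, FP32 on Apple Silicon (MPS) and CPU.
def pick_precision(cuda_available: bool, mps_available: bool) -> tuple:
    """Return (device, dtype name) for loading the model."""
    if cuda_available:
        return ("cuda", "bfloat16")
    if mps_available:
        return ("mps", "float32")
    return ("cpu", "float32")

print(pick_precision(True, False))    # ('cuda', 'bfloat16')
print(pick_precision(False, True))    # ('mps', 'float32')
print(pick_precision(False, False))   # ('cpu', 'float32')
```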
License
AGPL-3.0 -- the model may be freely used, modified, and redistributed, provided derivative works remain open-source under the same license.
For commercial or closed-source usage, please contact Horama for a commercial license.
Citation
```bibtex
@misc{horama-btp-2025,
  title  = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author = {Horama},
  year   = {2025},
  url    = {https://huggingface.co/Horama/Horama_BTP},
  note   = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```
Built by Horama | Construction intelligence, powered by vision AI