Qwen2.5-VL floor plan SFT adapter (stage 1)

Hub: mudasir13cs/qwen25-vl-3b-floorplan-sft

A LoRA supervised fine-tuning (SFT) adapter trained on CubiCasa5K (CC BY-NC 4.0), built on Qwen2.5-VL-3B-Instruct (LICENSE).

Intended for non-commercial / research use, consistent with the Qwen research license and the CubiCasa5K NC terms.

Stage 2 (optional): GRPO fine-tune — mudasir13cs/qwen25-vl-3b-floorplan-grpo. In a local training checkout, see floorplan-vlm-grpo/README.md for the same style of usage doc.

Paper & upstream training material

If you are working inside a clone of the training repo, the same script may exist locally as train_floorplan_vlm.py.

Quick install

Inference / loading only:

pip install torch torchvision transformers peft accelerate pillow

Full SFT training (matches the script docstring; includes data-prep dependencies):

pip install torch torchvision transformers trl peft datasets accelerate shapely Pillow lxml numpy tqdm huggingface_hub
# optional GPU attention: pip install flash-attn
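If flash-attn is installed, FlashAttention 2 can be requested when loading the base model. This is a minimal, optional sketch; the default attention implementation works without it:

from transformers import Qwen2_5_VLForConditionalGeneration

# Optional: use FlashAttention 2 (assumes flash-attn is installed and the GPU supports it)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # omit this line to use the default attention
)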

Loading the adapter

Use Hub IDs (recommended):

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen2.5-VL-3B-Instruct"
ADAPTER = "mudasir13cs/qwen25-vl-3b-floorplan-sft"

processor = AutoProcessor.from_pretrained(BASE)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

If you cloned weights into this folder locally: set ADAPTER = "./floorplan-vlm-sft" (or an absolute path) instead of the Hub repo id.
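A minimal sketch of loading from a local path and, optionally, merging the LoRA weights into the base model for standalone inference (merging is an extra convenience step, not something the script requires):

from peft import PeftModel

# Load the adapter from a local checkout instead of the Hub
model = PeftModel.from_pretrained(model, "./floorplan-vlm-sft")

# Optional: fold the LoRA weights into the base model (returns a plain transformers model)
merged = model.merge_and_unload()
merged.eval()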

Using the model (inference)

This stage defines the recommended prompts: the system message embeds the output JSON schema. Keep SYSTEM_PROMPT and USER_PROMPT aligned with the constants in train_floorplan_vlm.py.

User text: “Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms.”

End-to-end pattern (matches the inference test at the end of train_floorplan_vlm.py):

import json, re, torch
from PIL import Image

SYSTEM_PROMPT = (
    "You are a floor plan vectorization expert. Extract wall, door, window geometry "
    "from floor plan images into structured JSON.\n\n"
    "Output ONLY valid JSON with this schema:\n"
    '{"walls":[{"id":"wall_N","start":[x,y],"end":[x,y],"thickness":T,"curvature":0,'
    '"openings":[{"type":"door"|"window","center":D,"width":W}]}],'
    '"rooms":[{"label":"room_type","walls":["wall_N",...]}]}\n\n'
    "Coordinates normalized so longer image edge = 1024."
)
USER_PROMPT = "Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms."

image = Image.open("plan.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": USER_PROMPT}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

raw = processor.batch_decode(out[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
m = re.search(r"\{[\s\S]*\}", raw)
plan = json.loads(m.group()) if m else None

Output shape: top-level walls (each with optional openings) and rooms. An example JSON is documented under Output JSON Schema in the Manitocross training README.
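As a quick sanity check on the parsed output, a short sketch that walks the walls, openings, and rooms; the field names follow the schema embedded in SYSTEM_PROMPT above:

if plan is not None:
    walls = plan.get("walls", [])
    rooms = plan.get("rooms", [])
    openings = [o for w in walls for o in w.get("openings", [])]

    print(f"walls: {len(walls)}, rooms: {len(rooms)}, openings: {len(openings)}")
    for room in rooms:
        # Each room carries a label and the ids of the walls that enclose it
        print(room.get("label"), "->", room.get("walls", []))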

Reproducing stage 1

  1. Run huggingface-cli login if you plan to push to the Hub (PUSH_TO_HUB in the script).
  2. Run train_floorplan_vlm.py: the first run downloads CubiCasa5K from Zenodo into ./cubicasa_data (~5 GB). Tune NUM_EPOCHS, MAX_SAMPLES, LEARNING_RATE, HUB_MODEL_ID, etc. in the configuration block at the top of that file, as sketched below.
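For orientation, a hedged sketch of that configuration block; the constant names come from this README, but the values here are placeholders, not the script's defaults:

# Illustrative values only; check the top of train_floorplan_vlm.py for the real defaults
OUTPUT_DIR = "./floorplan-vlm-sft"
HUB_MODEL_ID = "mudasir13cs/qwen25-vl-3b-floorplan-sft"
PUSH_TO_HUB = False          # requires `huggingface-cli login` when True
NUM_EPOCHS = 1
LEARNING_RATE = 1e-4
MAX_SAMPLES = None           # set an int for a quick partial run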

Training details

  • Base model: Qwen/Qwen2.5-VL-3B-Instruct
  • Dataset: CubiCasa5K · Zenodo
  • Method: LoRA SFT (SFTTrainer / TRL; see script)
  • Supervision targets: SVG-derived structured JSON (walls, rooms, openings)

Add your own run metadata: epochs, learning rate, seed, GPU, date, and whether a partial MAX_SAMPLES run was used.
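If you want to adapt the method itself, here is a minimal, hedged sketch of a LoRA + TRL SFT setup in the spirit of the script; the target modules and hyperparameters are illustrative, not the script's actual values:

from peft import LoraConfig
from trl import SFTConfig

# Illustrative LoRA setup targeting the attention projections (placeholder values)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Illustrative trainer settings; train_floorplan_vlm.py wires its own values into SFTTrainer
sft_config = SFTConfig(
    output_dir="./floorplan-vlm-sft",
    num_train_epochs=1,
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)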

Citation

@article{floorplanvlm2026,
  title={FloorplanVLM: A Vision-Language Model for Floorplan Vectorization},
  journal={arXiv preprint arXiv:2602.06507},
  year={2026}
}

Acknowledgments

Author / contact

Mudasir: multimodal AI, VLM fine-tuning, retrieval/RAG research, and engineering. MS in AI Convergence, Soongsil University (숭실대학교), Seoul. More credentials, publications, and projects: mudasir13cs.github.io


Note for this Git repository (optional on Hub)

When you train in this workspace, OUTPUT_DIR in the local train_floorplan_vlm.py defaults to ./floorplan-vlm-sft, beside this README. Weight blobs are typically git-ignored here while this file stays tracked. To publish the model card, paste this README into the root of your Hub model repo as README.md (you may omit this subsection on the Hub).
