SG Aerial Scene Analyser (LoRA)
QLoRA adapter for Qwen2.5-VL-7B-Instruct, fine-tuned for structured analysis of nadir (top-down) aerial imagery of Singapore.
Demo: kaihon/sg-aerial-scene-analyser
Usage
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info
# Load the base model in bf16, then attach the LoRA adapter weights.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "kaihon/sg-aerial-scene-analyser-lora")
model.eval()
# min_pixels / max_pixels bound the number of visual tokens used per image.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=256*28*28, max_pixels=1280*28*28
)
messages = [
    {"role": "system", "content": "You are an aerial scene analyst specialising in Singapore urban landscapes. Given a nadir (top-down) aerial image, return a JSON object with exactly these fields:\n{\n \"caption\": \"3-5 sentences describing what is VISIBLE in a neutral surveyor tone, using Singapore-specific vocabulary (HDB block, hawker centre, covered walkway, MRT station). Name types not instances (MRT station not Bishan MRT). Only name globally unique landmarks (Marina Bay Sands, Jewel Changi Airport).\",\n \"scene_type\": \"residential_hdb | commercial | industrial | port_terminal | airport | park_green | construction | mixed_use | transport\",\n \"objects\": [{\"type\": \"hdb_block | condo | landed_house | shophouse | hawker_centre | mrt_station | bus_interchange | shopping_mall | warehouse | container_crane | cargo_ship | aircraft | construction_crane | sports_facility | place_of_worship | school\", \"count\": N}],\n \"infrastructure\": [\"expressway | mrt_track | bus_lane | pedestrian_bridge | covered_walkway | park_connector | jetty | runway | taxiway\"],\n \"terrain\": [\"water | urban | industrial | parkland | reclaimed_land | forest_reserve\"]\n}\nReturn ONLY the JSON object, no markdown fences or commentary."},
    {"role": "user", "content": [
        {"type": "image", "image": "path/to/aerial_image.jpg"},
        {"type": "text", "text": "Analyse this nadir aerial image of Singapore."},
    ]},
]
# Build the chat prompt and extract the image inputs from the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
img_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=img_inputs, padding=True, return_tensors="pt").to(model.device)
# Greedy decoding; slice off the prompt tokens so only the reply is decoded.
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=512, do_sample=False)
gen_trimmed = gen[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(gen_trimmed, skip_special_tokens=True)[0]
print(output)
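The decoded `output` is a plain string, so downstream code still has to parse it. The following is a minimal sketch, assuming the reply may occasionally arrive wrapped in markdown fences or stray text despite the prompt. The helper name `parse_model_json` is hypothetical and not part of this repository.

```python
import json


def parse_model_json(raw: str) -> dict:
    """Parse the model's reply into a dict, tolerating stray fences or text."""
    text = raw.strip()
    # Keep only the span from the first "{" to the last "}" so that
    # surrounding markdown fences or commentary are ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a ValueError subclass, so callers can
    # catch ValueError for both failure modes.
    return json.loads(text[start : end + 1])
```

With the usage code above, this would be called as `result = parse_model_json(output)`.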
Output Format
The model returns a JSON object with 5 fields:
| Field | Type | Description |
|---|---|---|
| `caption` | string | 3-5 sentence description using Singapore-specific vocabulary |
| `scene_type` | string | One of 9 categories: residential_hdb, commercial, industrial, port_terminal, airport, park_green, construction, mixed_use, transport |
| `objects` | array | Detected objects with type and count |
| `infrastructure` | array | Infrastructure elements (e.g. expressway, mrt_track, covered_walkway) |
| `terrain` | array | Terrain types (e.g. urban, water, parkland) |
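A parsed result can be checked against this format before it is used. The sketch below is one possible validator, assuming the five fields and nine scene types listed above; `schema_errors` is a hypothetical helper, and this is not necessarily how the Schema Compliance metric in the evaluation was computed.

```python
# Field names and types taken from the Output Format table.
REQUIRED_FIELDS = {
    "caption": str,
    "scene_type": str,
    "objects": list,
    "infrastructure": list,
    "terrain": list,
}
SCENE_TYPES = {
    "residential_hdb", "commercial", "industrial", "port_terminal", "airport",
    "park_green", "construction", "mixed_use", "transport",
}


def schema_errors(result: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the result conforms."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in result:
            errors.append(f"missing field: {field}")
        elif not isinstance(result[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    extra = set(result) - set(REQUIRED_FIELDS)
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")
    scene = result.get("scene_type")
    if isinstance(scene, str) and scene not in SCENE_TYPES:
        errors.append(f"unknown scene_type: {scene}")
    objects = result.get("objects")
    if isinstance(objects, list):
        for i, obj in enumerate(objects):
            ok = (
                isinstance(obj, dict)
                and isinstance(obj.get("type"), str)
                and isinstance(obj.get("count"), int)
            )
            if not ok:
                errors.append(f"objects[{i}] needs a string 'type' and an integer 'count'")
    return errors
```

Returning a list of problems rather than raising on the first one makes it easier to log every deviation when auditing a batch of outputs.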
Training
| Parameter | Value |
|---|---|
| Base model | Qwen2.5-VL-7B-Instruct |
| Method | QLoRA (rank 8, alpha 16, dropout 0.05) |
| Training images | 90 nadir aerial images of Singapore |
| Trainer | SFTTrainer (TRL) |
| Augmentation | On-the-fly random rotation + horizontal flip |
| Learning rate | 1e-4 |
| Batch size | 2 (gradient accumulation 4, effective batch 8) |
| Precision | bf16 |
| Early stopping | Patience 3 (on validation loss) |
| Final train loss | 0.721 |
| Hardware | NVIDIA L4 (23.7 GB), ~94 minutes |
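To make the batch and early-stopping rows concrete, here is a plain-Python sketch of the arithmetic and of patience-based stopping. It illustrates the logic only and is not the callback used by TRL or Transformers; `stopping_eval` and the loss values in the usage note are hypothetical.

```python
TRAIN_IMAGES = 90
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
PATIENCE = 3

# Gradients from 4 micro-batches of 2 are accumulated before each update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 2 * 4 = 8

# Roughly 90 / 8 optimizer updates per pass over the data; the exact count
# depends on how the trainer handles the final partial accumulation.
updates_per_epoch = TRAIN_IMAGES / effective_batch


def stopping_eval(val_losses, patience=PATIENCE):
    """Return the index of the evaluation that triggers a stop, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None
```

For example, with made-up validation losses `[1.0, 0.9, 0.8, 0.85, 0.82, 0.81]`, the best value is reached at index 2 and the third consecutive non-improving evaluation is at index 5, so `stopping_eval` returns 5.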
Evaluation (17-sample held-out test set)
| Metric | Baseline (Qwen2.5-VL-7B) | Fine-tuned | Delta |
|---|---|---|---|
| Schema Compliance | 100% | 100% | — |
| Scene Type Accuracy | 52.9% | 70.6% | +17.7 pp |
| ROUGE-1 F1 | 0.325 | 0.536 | +0.211 |
| ROUGE-2 F1 | 0.040 | 0.268 | +0.228 |
| ROUGE-L F1 | 0.193 | 0.402 | +0.209 |
| BERTScore F1 | 0.875 | 0.917 | +0.042 |
| Object Mention F1 | 0.309 | 0.471 | +0.161 |
On the 17-sample test set, fine-tuning raises scene-type accuracy by 17.7 percentage points (52.9% to 70.6%), roughly doubles ROUGE-L F1 (0.193 to 0.402), and raises Object Mention F1 by 0.16. These results come from a 70/15/15 train/val/test split; the deployed adapter was retrained on the full dataset (90 train / 16 val) for maximum coverage.
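The card does not define how the metrics were computed, so the following is only a sketch under stated assumptions: scene-type accuracy as exact label match, and Object Mention F1 as a set-based F1 over predicted versus reference object type names. Both function names are hypothetical. The reported accuracies are consistent with 9 and 12 correct predictions out of 17.

```python
def accuracy(predicted, reference):
    """Fraction of positions where the predicted label equals the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def object_mention_f1(predicted_types, reference_types):
    """Set-based F1 over object type names (an assumed definition)."""
    pred, ref = set(predicted_types), set(reference_types)
    if not pred and not ref:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# 9 of 17 and 12 of 17 correct reproduce the reported percentages.
baseline_pct = round(100 * 9 / 17, 1)
finetuned_pct = round(100 * 12 / 17, 1)
```

With only 17 test images, each additional correct prediction moves accuracy by about 5.9 percentage points, which is worth keeping in mind when comparing runs.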
Framework Versions
- PEFT 0.18.1
- Transformers >= 4.49
- TRL (SFTTrainer)
- PyTorch (bf16)