Qwen3VL-8B-4bit-GGUF-Jetson-Deployment
This repository provides a merged Hugging Face checkpoint and pre-quantized GGUF files for immediate deployment of a disaster-recognition vision-language model on edge devices such as the NVIDIA Jetson series.
The model was created by merging the QLoRA adapter WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition into the base model Qwen/Qwen3-VL-8B-Instruct, then converting the merged checkpoint into GGUF for efficient deployment with llama.cpp.
Why this repository exists
The original LoRA adapter repository is ideal for standard Hugging Face inference and further experimentation, but edge deployment on Jetson-class devices introduces additional constraints:
- Runtime LoRA loading adds memory overhead.
- JetPack 5.1.2 ships with CUDA 11.4, which is often awkward or unsupported for newer inference stacks without custom patching.
- Edge deployment benefits from a single merged model artifact and a lightweight inference runtime.
To make deployment simpler, this repository provides:
- Merged HF weights: the LoRA adapter is baked into the base model weights.
- GGUF files: quantized artifacts for llama.cpp, enabling practical Jetson deployment with a runtime memory footprint of roughly 6.3 GB in the validated configuration.
- A Jetson-focused deployment path: a reproducible setup using llama.cpp instead of more heavyweight serving stacks.
Model Overview
| Attribute | Detail |
|---|---|
| Base Model | Qwen/Qwen3-VL-8B-Instruct |
| LoRA Adapter | WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition |
| Training Data | WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA |
| Training Samples | 55,008 trilingual samples |
| Fine-tuning Method | QLoRA (4-bit NF4, LoRA rank 8, alpha 16) |
| Languages | English, Japanese, Chinese |
| Target Disaster Classes | Fire, Flood, Hurricane/Wind, Earthquake, Tsunami, Volcano |
| Model Scale | ~8B parameters |
| Primary Task | Disaster type recognition from post-disaster satellite/aerial imagery |
| Primary Deployment Target | Jetson-class edge devices using llama.cpp |
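Downstream pipelines often need the model's free-text answer mapped onto the six target classes. A minimal post-processing sketch; the keyword lists and the classify_answer helper are illustrative assumptions, not part of the model:

```python
from typing import Optional

# Map a free-text answer onto the six target disaster classes.
# The keyword lists below are illustrative assumptions, not part of the model.
DISASTER_KEYWORDS = {
    "fire": ["fire", "burn", "scorch", "char"],
    "flood": ["flood", "inundat", "submerg"],
    "hurricane/wind": ["hurricane", "wind", "storm"],
    "earthquake": ["earthquake", "seismic", "collaps"],
    "tsunami": ["tsunami"],
    "volcano": ["volcano", "volcanic", "lava", "ash"],
}

def classify_answer(answer: str) -> Optional[str]:
    """Return the first disaster class whose keywords appear in the answer."""
    text = answer.lower()
    for label, keywords in DISASTER_KEYWORDS.items():
        if any(k in text for k in keywords):
            return label
    return None  # answer did not mention a known class

print(classify_answer("This is a fire disaster. Key evidence: charred terrain."))  # fire
```

A stricter pipeline could instead constrain decoding with a grammar or ask the model for a single-word label, but simple keyword matching is often sufficient for the explanation-style answers this model produces.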
Repository Structure
.
├── README.md # Model card and documentation
├── qwen3vl_8b_disaster_merged/ # Merged Hugging Face checkpoint (BF16)
│ ├── config.json
│ ├── generation_config.json
│ ├── model-00001-of-00004.safetensors
│ ├── model-00002-of-00004.safetensors
│ ├── model-00003-of-00004.safetensors
│ ├── model-00004-of-00004.safetensors
│ ├── model.safetensors.index.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ ├── special_tokens_map.json
│ ├── chat_template.jinja
│ ├── preprocessor_config.json
│ └── vocab.json
├── gguf_16bit_4bit/ # GGUF files for llama.cpp
│ ├── disaster-8b-f16.gguf # F16 GGUF
│ └── disaster-8b-q4km.gguf # Q4_K_M GGUF (recommended for Jetson)
└── merge_lora/ # Merge configuration
└── qwen3_vl_8b_xview2lora.yaml # LLaMA-Factory merge config
Available Formats
| File | Format | Size | Typical Use Case |
|---|---|---|---|
| qwen3vl_8b_disaster_merged/ | BF16 (safetensors) | ~16 GB | Reproducibility, inspection, further conversion, research |
| gguf_16bit_4bit/disaster-8b-f16.gguf | GGUF F16 | ~16.4 GB | High-accuracy reference |
| gguf_16bit_4bit/disaster-8b-q4km.gguf | GGUF Q4_K_M | ~4.8 GB | Recommended edge deployment |
If your goal is Jetson deployment, you typically only need the Q4_K_M GGUF file plus the corresponding Qwen3-VL mmproj file.
Quick Start
Option 1: Hugging Face Transformers (Merged Weights)
This path is useful for reproducibility and standard HF workflows. For Jetson inference, the GGUF path below is usually more practical.
import torch
from PIL import Image
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
model = Qwen3VLForConditionalGeneration.from_pretrained(
"WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment",
subfolder="qwen3vl_8b_disaster_merged",
torch_dtype="auto",
device_map="auto"
)
processor = AutoProcessor.from_pretrained(
"WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment",
subfolder="qwen3vl_8b_disaster_merged"
)
image = Image.open("disaster_image.jpg").convert("RGB")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "What type of disaster occurred in this image?"}
]
}
]
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=False  # greedy decoding; temperature=0 is rejected by transformers' generation config
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
print(output_text[0])
Option 2: llama.cpp Server (Recommended for Deployment)
# Clone and build llama.cpp on Jetson
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Download the quantized model from this repository
huggingface-cli download WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment \
--include "gguf_16bit_4bit/disaster-8b-q4km.gguf" \
--local-dir models/
# Download the Qwen3-VL vision projector
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct-GGUF \
--include "*mmproj*Q8*" \
--local-dir models/
# Launch the server
./build/bin/llama-server \
-m models/gguf_16bit_4bit/disaster-8b-q4km.gguf \
--mmproj models/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
-ngl 99 --fit off -c 8192 \
-ctk q8_0 -ctv q8_0 \
--host 0.0.0.0 --port 8080
Option 3: OpenAI-Compatible API Call
import base64
import requests
with open("disaster_image.jpg", "rb") as f:
img_b64 = base64.b64encode(f.read()).decode()
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"messages": [
{
"role": "system",
"content": (
"You are a disaster recognition expert. "
"When analyzing disaster images, first identify the disaster type, "
"then explain the key visual evidence supporting your classification. "
"Respond in the same language as the user."
)
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
},
{
"type": "text",
"text": "What type of disaster occurred in this image?"
}
]
}
],
"temperature": 0
},
timeout=300
)
print(response.json()["choices"][0]["message"]["content"])
Edge Deployment on NVIDIA Jetson
This model has been validated on NVIDIA Jetson Orin NX 16GB with llama.cpp.
Validation Hardware Environment
| Property | Value |
|---|---|
| Device Model | NVIDIA Orin NX Developer Kit |
| Module / Carrier Board | Jetson Orin NX 16GB (P3767-0000) / P3768-0000 |
| SoC / Platform | Tegra234 (Orin, tegra23x family) |
| JetPack / L4T | JetPack 5.1.2 / L4T 35.4.1 |
| OS / Kernel | Ubuntu 20.04.6 LTS / Linux 5.10.120-tegra |
| Python | 3.8.10 |
| Libraries | CUDA 11.4.315, cuDNN 8.6.0.166, TensorRT 8.5.2.2, VPI 2.3.9, Vulkan 1.3.204 |
| OpenCV | cv2 4.5.4, CUDA not enabled |
| Power Mode | MAXN |
| Storage | 937 GB NVMe system drive |
Example Idle Snapshot on the Validation Device
The values below are a representative jtop snapshot from the validation device and are not fixed hardware limits.
| Sensor | Example Status |
|---|---|
| CPU | 8 cores, roughly 1-8% load at 729 MHz |
| GPU | 0% load at 306 MHz |
| Memory | 4.1 GB / 15.2 GB |
| Swap | 425 MB / 7.6 GB |
| EMC | 204 MHz (reported cap ~3.2 GHz) |
| Storage | 67.4 GB / 937 GB used |
| Cooling | Fan PWM 100%; jtop reported 0 RPM |
| Temperatures | CPU/GPU/SoC roughly 44-49°C |
Why llama.cpp on Jetson?
JetPack 5.1.2 ships with CUDA 11.4. On Jetson, many recent inference stacks either target newer CUDA/toolchain combinations or require significant patching to be practical. In contrast, llama.cpp offers a simpler deployment path:
- native GGUF support
- straightforward CUDA build on Jetson
- OpenAI-compatible HTTP server
- good memory efficiency for quantized deployment
- practical support for vision-language inference via
mmproj
Build and Deploy on Jetson
# Build llama.cpp on Jetson
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Download model and vision projector
huggingface-cli download WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment \
--include "gguf_16bit_4bit/disaster-8b-q4km.gguf" \
--local-dir models/
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct-GGUF \
--include "*mmproj*Q8*" \
--local-dir models/
# Launch server
./build/bin/llama-server \
-m models/gguf_16bit_4bit/disaster-8b-q4km.gguf \
--mmproj models/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
-ngl 99 \
--fit off \
-c 8192 \
-ctk q8_0 -ctv q8_0 \
--host 0.0.0.0 \
--port 8080
Key Flags
| Flag | Purpose |
|---|---|
| -ngl 99 | Offload all decoder layers to the GPU |
| --fit off | Skip automatic memory fitting and reduce Jetson startup time |
| -c 8192 | Set the context length |
| -ctk q8_0 -ctv q8_0 | Quantize the KV cache to Q8_0 to save memory |
Representative Memory Footprint
Measured on the validated Jetson setup with Q4_K_M and Q8 KV cache.
| Component | Memory Footprint |
|---|---|
| LLM (Q4_K_M) | ~4.5 GB |
| Vision projector (mmproj) | ~0.7 GB |
| KV cache (Q8, c=8192) | ~0.6 GB |
| Compute buffers | ~0.4 GB |
| Total | ~6.3 GB |
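The KV-cache figure can be sanity-checked from the model geometry. The layer/head counts below are assumptions about Qwen3-VL-8B's text decoder (verify them against the GGUF metadata of your file); Q8_0 stores 32 int8 values plus one fp16 scale per block:

```python
# Back-of-envelope KV-cache size for a Q8_0 cache at the validated settings.
# The model geometry is an assumption; check it against your GGUF metadata.
n_layers = 36             # decoder blocks
n_kv_heads = 8            # GQA key/value heads
head_dim = 128            # per-head dimension
ctx = 8192                # matches -c 8192
bytes_per_elem = 34 / 32  # Q8_0: 32 int8 values + one fp16 scale per block

# Keys and values, for every layer, position, KV head, and head dimension.
kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes / 1e9:.2f} GB")  # ~0.64 GB, consistent with the ~0.6 GB figure above
```

The same arithmetic explains why quantizing the cache matters: an F16 cache (2 bytes per element) would roughly double this component.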
Representative Performance
These are deployment observations from the validated Jetson configuration. They should be treated as representative, not guaranteed.
| Input Resolution | Image Processing | TTFT | Text Generation |
|---|---|---|---|
| 1024×1024 | ~8.7 s | ~10.3 s | ~10-11 tok/s |
| 512×512 | ~0.9 s | ~1.9 s | ~10-11 tok/s |
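For capacity planning, end-to-end latency is roughly TTFT plus decode time. A sketch using the representative figures above (the total_latency helper is illustrative):

```python
# Rough end-to-end latency from the representative figures above.
def total_latency(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """TTFT plus decode time for n_tokens generated tokens."""
    return ttft_s + n_tokens / tok_per_s

# A 150-token answer at ~10.5 tok/s:
print(f"512x512:   {total_latency(1.9, 150, 10.5):.1f} s")   # ~16.2 s
print(f"1024x1024: {total_latency(10.3, 150, 10.5):.1f} s")  # ~24.6 s
```

At typical answer lengths, decode time dominates at 512×512, while image prefill dominates at 1024×1024, which is why the resolution choice below matters so much.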
Notes on Image Resolution and Visual Tokens
Qwen3-VL uses 14-pixel image patches with a spatial merge size of 2, and its preprocessing aligns image dimensions to multiples of 28 through smart resizing.
In practice, this means:
- visual token count depends on the processed image size after smart resize
- token count is not given by a simple raw width × height formula
- larger input resolution significantly increases prefill time and TTFT
For disaster type classification, 512×512 is often a strong latency/quality trade-off on Jetson. Use 1024×1024 only when the extra detail is necessary.
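Under these assumptions (14-px patches, 2×2 spatial merge, so roughly one token per 28×28 block), the visual-token count can be approximated as a sketch; the real Qwen3-VL preprocessor may resize differently near its min/max pixel limits:

```python
# Approximate visual-token count after smart resize.
# Assumes 14-px patches with a 2x2 spatial merge (one token per 28x28 block);
# the actual Qwen3-VL preprocessor may differ near its min/max pixel limits.
def approx_visual_tokens(width: int, height: int) -> int:
    w = round(width / 28) * 28   # align dimensions to multiples of 28
    h = round(height / 28) * 28
    return (w // 28) * (h // 28)

print(approx_visual_tokens(512, 512))    # 324 tokens (18x18 grid)
print(approx_visual_tokens(1024, 1024))  # 1369 tokens, hence the much longer prefill
```

The roughly 4× jump in visual tokens from 512×512 to 1024×1024 is consistent with the large TTFT gap in the performance table above.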
Merge and Quantization Pipeline
The deployment pipeline from the LoRA adapter to Jetson-ready GGUF is:
QLoRA Fine-tuning
↓
LoRA Adapter
↓ LLaMA-Factory export
Merged HF Model (BF16)
↓ llama.cpp convert_hf_to_gguf.py
GGUF F16
↓ llama-quantize
GGUF Q4_K_M
↓
Deployment on Jetson with llama.cpp
Merge Configuration (LLaMA-Factory)
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
adapter_name_or_path: WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition
template: qwen3_vl_nothink
trust_remote_code: true
export_dir: <replace-with-your-LLaMA-Factory-path>/output/qwen3vl_8b_disaster_merged
export_size: 5
export_device: cpu
export_legacy_format: false
Important: do not set quantization_bit during merge. Training used 4-bit quantization for efficiency, but the merge step should export the merged weights first; quantize only afterward.
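A quick way to confirm the export is clean is to check that the exported config.json carries no leftover quantization_config from QLoRA training. The is_clean_merge helper and the directory layout are assumptions for illustration:

```python
# Sanity-check a merged export: the exported config.json should not carry a
# quantization_config left over from QLoRA training.
# is_clean_merge is an illustrative helper, not part of any upstream tooling.
import json
from pathlib import Path

def is_clean_merge(export_dir: str) -> bool:
    cfg = json.loads(Path(export_dir, "config.json").read_text())
    return "quantization_config" not in cfg

# Example (path is a placeholder):
# print(is_clean_merge("output/qwen3vl_8b_disaster_merged"))
```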
GGUF Conversion and Quantization
python -m pip install gguf
git clone https://github.com/ggml-org/llama.cpp <replace-with-your-llama.cpp-path>
# Convert merged HF checkpoint to GGUF
python3 <replace-with-your-llama.cpp-path>/convert_hf_to_gguf.py \
<replace-with-your-LLaMA-Factory-path>/output/qwen3vl_8b_disaster_merged \
--outfile <replace-with-your-LLaMA-Factory-path>/output/gguf_16bit_4bit/disaster-8b-f16.gguf
# Build quantization tool
cd <replace-with-your-llama.cpp-path>
cmake -B build
cmake --build build --target llama-quantize -j$(nproc)
# Quantize to Q4_K_M
./build/bin/llama-quantize \
<replace-with-your-LLaMA-Factory-path>/output/gguf_16bit_4bit/disaster-8b-f16.gguf \
<replace-with-your-LLaMA-Factory-path>/output/gguf_16bit_4bit/disaster-8b-q4km.gguf \
Q4_K_M
Training Details
For the complete fine-tuning story, see the original LoRA adapter repository: WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition
Summary
| Attribute | Detail |
|---|---|
| Framework | LLaMA-Factory |
| Method | QLoRA (4-bit NF4 + LoRA rank 8, target all linear layers) |
| Dataset | WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA |
| Training Samples | 55,008 |
| Test Samples | 5,598 |
| Languages | English, Japanese, Chinese |
| GPUs | 2× NVIDIA RTX 4090 (24 GB) |
| Training Time | ~6.4 hours |
| Final Training Loss | 0.0239 |
Multilingual Examples
English
Q: What type of disaster occurred in this image?
A: This is a fire disaster. Key visual evidence includes charred and blackened terrain,
scorched vegetation, and widespread burn patterns across the landscape.
日本語
Q: この画像ではどのような種類の災害が発生しましたか?
A: 火災災害が発生しました。地表が黒く焼けており、植生や建物に焼失の痕跡が見られます。
中文
Q: 当前图片发生了什么灾害呢?
A: 当前图片发生了风灾灾害。可以看到大量树木倒伏、建筑受损,以及明显的风灾破坏痕迹。
Recommended System Prompt
For best results, use a system prompt such as:
You are a disaster recognition expert. When analyzing disaster images, first identify the disaster type, then explain the key visual evidence supporting your classification. Respond in the same language as the user.
Limitations
- This model is specialized for post-disaster satellite and aerial imagery and may not perform well on ground-level photos.
- The target label space is limited to six disaster classes: fire, flood, hurricane/wind, earthquake, tsunami, and volcano.
- The training data format is relatively simple, so without a good system prompt the model may answer too briefly.
- Geographic coverage is not uniform; performance may vary by region and disaster appearance.
- Higher image resolution can improve fidelity, but it substantially increases TTFT on edge devices.
- The model is primarily intended for English, Japanese, and Chinese.
- This model is for assistance and triage, not fully autonomous decision-making in emergency response.
Intended Use
Recommended
- post-disaster image triage
- disaster type classification from satellite/aerial imagery
- multilingual disaster-image QA
- humanitarian and research workflows
- edge deployment experiments on Jetson-class devices
Not Recommended
- disaster prediction
- ground-level scene understanding
- legal, insurance, or policy decisions without human review
- fine-grained damage severity assessment
- use as the sole source of truth in emergency operations
License
This repository packages artifacts derived from multiple upstream sources:
- Base model: Qwen/Qwen3-VL-8B-Instruct, which declares Apache-2.0
- LoRA adapter: WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition, whose model card declares CC-BY-4.0
- Training dataset: WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA, whose dataset card declares CC-BY-NC-SA-4.0
Because these upstream artifacts use different licenses, the metadata of this repository is set to license: other rather than claiming a single simple license for all distributed artifacts.
Before using, redistributing, fine-tuning, or commercializing this repository, please:
- review all upstream licenses yourself
- confirm that your intended use is compatible with all applicable terms
- avoid assuming that this repository grants rights beyond those granted by upstream authors and applicable law
If you need a single definitive legal statement for production or commercial use, obtain legal review first.
Citation
@misc{wang2026qwen3vl_jetson_disaster,
title={Qwen3VL-8B-4bit-GGUF-Jetson-Deployment: Merged and Quantized Vision-Language Model for Disaster Type Classification},
author={WayBob},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment}
}
@misc{wang2026qwen3vl_disaster_lora,
title={Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition},
author={WayBob},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition}
}
@misc{waybob2026disaster_dataset,
title={Disaster Recognition RemoteSense Dataset (EN/CN/JA)},
author={WayBob},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA}
}
@inproceedings{xview2,
title={xBD: A Dataset for Assessing Building Damage from Satellite Imagery},
author={Gupta, Ritwik and Hosfelt, Richard and Sajeev, Sandra and Patel, Nirav and Goodman, Bryce and Doshi, Jigar and Heim, Eric and Choset, Howie and Gaston, Matthew},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year={2019}
}
Acknowledgements
- Qwen Team for the Qwen3-VL base model
- LLaMA-Factory for the fine-tuning workflow
- llama.cpp for efficient GGUF inference on edge devices
- xView2 / xBD and DIUx for the original disaster imagery benchmark
- NVIDIA Jetson platform for edge deployment validation
Disclaimer
This model is intended for research, evaluation, and deployment experimentation.
Always verify model outputs with qualified human reviewers before making real-world decisions in disaster response workflows.