Qwen3VL-8B-4bit-GGUF-Jetson-Deployment
This repository provides a merged Hugging Face checkpoint and pre-quantized GGUF files for immediate deployment of a disaster-recognition vision-language model on edge devices such as the NVIDIA Jetson series.
The model was created by merging the QLoRA adapter WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition into the base model Qwen/Qwen3-VL-8B-Instruct, then converting the merged checkpoint into GGUF for efficient deployment with llama.cpp.
Why this repository exists
The original LoRA adapter repository is ideal for standard Hugging Face inference and further experimentation, but edge deployment on Jetson-class devices introduces additional constraints:
- Runtime LoRA loading adds memory overhead.
- JetPack 5.1.2 ships with CUDA 11.4, which is often awkward or unsupported for newer inference stacks without custom patching.
- Edge deployment benefits from a single merged model artifact and a lightweight inference runtime.
To make deployment simpler, this repository provides:
- Merged HF weights: the LoRA adapter is baked into the base model weights.
- GGUF files: quantized artifacts for llama.cpp, enabling practical Jetson deployment with a runtime memory footprint of roughly 6.3 GB in the validated configuration.
- A Jetson-focused deployment path: a reproducible setup using llama.cpp instead of more heavyweight serving stacks.
Model Overview
| Attribute | Detail |
|---|---|
| Base Model | Qwen/Qwen3-VL-8B-Instruct |
| LoRA Adapter | WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition |
| Training Data | WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA |
| Training Samples | 55,008 trilingual samples |
| Fine-tuning Method | QLoRA (4-bit NF4, LoRA rank 8, alpha 16) |
| Languages | English, Japanese, Chinese |
| Target Disaster Classes | Fire, Flood, Hurricane/Wind, Earthquake, Tsunami, Volcano |
| Model Scale | ~8B parameters |
| Primary Task | Disaster type recognition from post-disaster satellite/aerial imagery |
| Primary Deployment Target | Jetson-class edge devices using llama.cpp |
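Downstream pipelines often need the model's free-text answer mapped onto the six target classes. A minimal post-processing sketch; the keyword lists and the classify_answer helper are illustrative assumptions, not part of the model:

```python
from typing import Optional

# Map a free-text answer onto the six target disaster classes.
# The keyword lists below are illustrative assumptions, not part of the model.
DISASTER_KEYWORDS = {
    "fire": ["fire", "burn", "scorch", "char"],
    "flood": ["flood", "inundat", "submerg"],
    "hurricane/wind": ["hurricane", "wind", "storm"],
    "earthquake": ["earthquake", "seismic", "collaps"],
    "tsunami": ["tsunami"],
    "volcano": ["volcano", "volcanic", "lava", "ash"],
}

def classify_answer(answer: str) -> Optional[str]:
    """Return the first disaster class whose keywords appear in the answer."""
    text = answer.lower()
    for label, keywords in DISASTER_KEYWORDS.items():
        if any(k in text for k in keywords):
            return label
    return None  # answer did not mention a known class

print(classify_answer("This is a fire disaster. Key evidence: charred terrain."))  # fire
```

A stricter pipeline could instead constrain decoding with a grammar or ask the model for a single-word label, but simple keyword matching is often sufficient for the explanation-style answers this model produces.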
Repository Structure
.
├── README.md # Model card and documentation
├── qwen3vl_8b_disaster_merged/ # Merged Hugging Face checkpoint (BF16)
│ ├── config.json
│ ├── generation_config.json
│ ├── model-00001-of-00004.safetensors
│ ├── model-00002-of-00004.safetensors
│ ├── model-00003-of-00004.safetensors
│ ├── model-00004-of-00004.safetensors
│ ├── model.safetensors.index.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ ├── special_tokens_map.json
│ ├── chat_template.jinja
│ ├── preprocessor_config.json
│ └── vocab.json
├── gguf_16bit_4bit/ # GGUF files for llama.cpp
│ ├── disaster-8b-f16.gguf # F16 GGUF
│ └── disaster-8b-q4km.gguf # Q4_K_M GGUF (recommended for Jetson)
└── merge_lora/ # Merge configuration
└── qwen3_vl_8b_xview2lora.yaml # LLaMA-Factory merge config
Available Formats
| File | Format | Size | Typical Use Case |
|---|---|---|---|
| qwen3vl_8b_disaster_merged/ | BF16 (safetensors) | ~16 GB | Reproducibility, inspection, further conversion, research |
| gguf_16bit_4bit/disaster-8b-f16.gguf | GGUF F16 | ~16.4 GB | High-accuracy reference |
| gguf_16bit_4bit/disaster-8b-q4km.gguf | GGUF Q4_K_M | ~4.8 GB | Recommended edge deployment |
If your goal is Jetson deployment, you typically only need the Q4_K_M GGUF file plus the corresponding Qwen3-VL mmproj file.
Quick Start
Option 1: Hugging Face Transformers (Merged Weights)
This path is useful for reproducibility and standard HF workflows. For Jetson inference, the GGUF path below is usually more practical.
import torch
from PIL import Image
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
model = Qwen3VLForConditionalGeneration.from_pretrained(
"WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment",
subfolder="qwen3vl_8b_disaster_merged",
torch_dtype="auto",
device_map="auto"
)
processor = AutoProcessor.from_pretrained(
"WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment",
subfolder="qwen3vl_8b_disaster_merged"
)
image = Image.open("disaster_image.jpg").convert("RGB")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "What type of disaster occurred in this image?"}
]
}
]
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=False  # greedy decoding; temperature=0 is rejected by transformers' generation config
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
print(output_text[0])
Option 2: llama.cpp Server (Recommended for Deployment)
# Clone and build llama.cpp on Jetson
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Download the quantized model from this repository
huggingface-cli download WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment \
--include "gguf_16bit_4bit/disaster-8b-q4km.gguf" \
--local-dir models/
# Download the Qwen3-VL vision projector
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct-GGUF \
--include "*mmproj*Q8*" \
--local-dir models/
# Launch the server
./build/bin/llama-server \
-m models/gguf_16bit_4bit/disaster-8b-q4km.gguf \
--mmproj models/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
-ngl 99 --fit off -c 8192 \
-ctk q8_0 -ctv q8_0 \
--host 0.0.0.0 --port 8080
Option 3: OpenAI-Compatible API Call
import base64
import requests
with open("disaster_image.jpg", "rb") as f:
img_b64 = base64.b64encode(f.read()).decode()
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"messages": [
{
"role": "system",
"content": (
"You are a disaster recognition expert. "
"When analyzing disaster images, first identify the disaster type, "
"then explain the key visual evidence supporting your classification. "
"Respond in the same language as the user."
)
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
},
{
"type": "text",
"text": "What type of disaster occurred in this image?"
}
]
}
],
"temperature": 0
},
timeout=300
)
print(response.json()["choices"][0]["message"]["content"])
Edge Deployment on NVIDIA Jetson
This model has been validated on NVIDIA Jetson Orin NX 16GB with llama.cpp.
Validation Hardware Environment
| Property | Value |
|---|---|
| Device Model | NVIDIA Orin NX Developer Kit |
| Module / Carrier Board | Jetson Orin NX 16GB (P3767-0000) / P3768-0000 |
| SoC / Platform | Tegra234 (Orin, tegra23x family) |
| JetPack / L4T | JetPack 5.1.2 / L4T 35.4.1 |
| OS / Kernel | Ubuntu 20.04.6 LTS / Linux 5.10.120-tegra |
| Python | 3.8.10 |
| Libraries | CUDA 11.4.315, cuDNN 8.6.0.166, TensorRT 8.5.2.2, VPI 2.3.9, Vulkan 1.3.204 |
| OpenCV | cv2 4.5.4, CUDA not enabled |
| Power Mode | MAXN |
| Storage | 937 GB NVMe system drive |
Example Idle Snapshot on the Validation Device
The values below are a representative jtop snapshot from the validation device and are not fixed hardware limits.
| Sensor | Example Status |
|---|---|
| CPU | 8 cores, roughly 1-8% load at 729 MHz |
| GPU | 0% load at 306 MHz |
| Memory | 4.1 GB / 15.2 GB |
| Swap | 425 MB / 7.6 GB |
| EMC | 204 MHz (reported cap ~3.2 GHz) |
| Storage | 67.4 GB / 937 GB used |
| Cooling | Fan PWM 100%; jtop reported 0 RPM |
| Temperatures | CPU/GPU/SoC roughly 44-49°C |
Why llama.cpp on Jetson?
JetPack 5.1.2 ships with CUDA 11.4. On Jetson, many recent inference stacks either target newer CUDA/toolchain combinations or require significant patching to be practical. In contrast, llama.cpp offers a simpler deployment path:
- native GGUF support
- straightforward CUDA build on Jetson
- OpenAI-compatible HTTP server
- good memory efficiency for quantized deployment
- practical support for vision-language inference via
mmproj
Build and Deploy on Jetson
# Build llama.cpp on Jetson
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Download model and vision projector
huggingface-cli download WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment \
--include "gguf_16bit_4bit/disaster-8b-q4km.gguf" \
--local-dir models/
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct-GGUF \
--include "*mmproj*Q8*" \
--local-dir models/
# Launch server
./build/bin/llama-server \
-m models/gguf_16bit_4bit/disaster-8b-q4km.gguf \
--mmproj models/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
-ngl 99 \
--fit off \
-c 8192 \
-ctk q8_0 -ctv q8_0 \
--host 0.0.0.0 \
--port 8080
Key Flags
| Flag | Purpose |
|---|---|
| -ngl 99 | Offload all decoder layers to the GPU |
| --fit off | Skip automatic memory fitting and reduce Jetson startup time |
| -c 8192 | Set the context length |
| -ctk q8_0 -ctv q8_0 | Quantize the KV cache to Q8_0 to save memory |
Representative Memory Footprint
Measured on the validated Jetson setup with Q4_K_M and Q8 KV cache.
| Component | Memory Footprint |
|---|---|
| LLM (Q4_K_M) | ~4.5 GB |
| Vision projector (mmproj) | ~0.7 GB |
| KV cache (Q8, c=8192) | ~0.6 GB |
| Compute buffers | ~0.4 GB |
| Total | ~6.3 GB |
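The KV-cache figure can be sanity-checked from the model geometry. The layer/head counts below are assumptions about Qwen3-VL-8B's text decoder (verify them against the GGUF metadata of your file); Q8_0 stores 32 int8 values plus one fp16 scale per block:

```python
# Back-of-envelope KV-cache size for a Q8_0 cache at the validated settings.
# The model geometry is an assumption; check it against your GGUF metadata.
n_layers = 36             # decoder blocks
n_kv_heads = 8            # GQA key/value heads
head_dim = 128            # per-head dimension
ctx = 8192                # matches -c 8192
bytes_per_elem = 34 / 32  # Q8_0: 32 int8 values + one fp16 scale per block

# Keys and values, for every layer, position, KV head, and head dimension.
kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes / 1e9:.2f} GB")  # ~0.64 GB, consistent with the ~0.6 GB figure above
```

The same arithmetic explains why quantizing the cache matters: an F16 cache (2 bytes per element) would roughly double this component.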
Representative Performance
These are deployment observations from the validated Jetson configuration. They should be treated as representative, not guaranteed.
| Input Resolution | Image Processing | TTFT | Text Generation |
|---|---|---|---|
| 1024×1024 | ~8.7 s | ~10.3 s | ~10-11 tok/s |
| 512×512 | ~0.9 s | ~1.9 s | ~10-11 tok/s |
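For capacity planning, end-to-end latency is roughly TTFT plus decode time. A sketch using the representative figures above (the total_latency helper is illustrative):

```python
# Rough end-to-end latency from the representative figures above.
def total_latency(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """TTFT plus decode time for n_tokens generated tokens."""
    return ttft_s + n_tokens / tok_per_s

# A 150-token answer at ~10.5 tok/s:
print(f"512x512:   {total_latency(1.9, 150, 10.5):.1f} s")   # ~16.2 s
print(f"1024x1024: {total_latency(10.3, 150, 10.5):.1f} s")  # ~24.6 s
```

At typical answer lengths, decode time dominates at 512×512, while image prefill dominates at 1024×1024, which is why the resolution choice below matters so much.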
Notes on Image Resolution and Visual Tokens
Qwen3-VL uses 14-pixel image patches with a spatial merge size of 2, and its preprocessing aligns image dimensions to multiples of 28 through smart resizing.
In practice, this means:
- visual token count depends on the processed image size after smart resize
- token count is not given by a simple raw width × height formula
- larger input resolution significantly increases prefill time and TTFT
For disaster type classification, 512×512 is often a strong latency/quality trade-off on Jetson. Use 1024×1024 only when the extra detail is necessary.
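Under these assumptions (14-px patches, 2×2 spatial merge, so roughly one token per 28×28 block), the visual-token count can be approximated as a sketch; the real Qwen3-VL preprocessor may resize differently near its min/max pixel limits:

```python
# Approximate visual-token count after smart resize.
# Assumes 14-px patches with a 2x2 spatial merge (one token per 28x28 block);
# the actual Qwen3-VL preprocessor may differ near its min/max pixel limits.
def approx_visual_tokens(width: int, height: int) -> int:
    w = round(width / 28) * 28   # align dimensions to multiples of 28
    h = round(height / 28) * 28
    return (w // 28) * (h // 28)

print(approx_visual_tokens(512, 512))    # 324 tokens (18x18 grid)
print(approx_visual_tokens(1024, 1024))  # 1369 tokens, hence the much longer prefill
```

The roughly 4× jump in visual tokens from 512×512 to 1024×1024 is consistent with the large TTFT gap in the performance table above.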
Merge and Quantization Pipeline
The deployment pipeline from the LoRA adapter to Jetson-ready GGUF is:
QLoRA Fine-tuning
↓
LoRA Adapter
↓ LLaMA-Factory export
Merged HF Model (BF16)
↓ llama.cpp convert_hf_to_gguf.py
GGUF F16
↓ llama-quantize
GGUF Q4_K_M
↓
Deployment on Jetson with llama.cpp
Merge Configuration (LLaMA-Factory)
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
adapter_name_or_path: WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition
template: qwen3_vl_nothink
trust_remote_code: true
export_dir: <replace-with-your-LLaMA-Factory-path>/output/qwen3vl_8b_disaster_merged
export_size: 5
export_device: cpu
export_legacy_format: false
Important: do not set quantization_bit during merge. Training used 4-bit quantization for efficiency, but the merge step should export the merged weights first; quantize only afterward.
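A quick way to confirm the export is clean is to check that the exported config.json carries no leftover quantization_config from QLoRA training. The is_clean_merge helper and the directory layout are assumptions for illustration:

```python
# Sanity-check a merged export: the exported config.json should not carry a
# quantization_config left over from QLoRA training.
# is_clean_merge is an illustrative helper, not part of any upstream tooling.
import json
from pathlib import Path

def is_clean_merge(export_dir: str) -> bool:
    cfg = json.loads(Path(export_dir, "config.json").read_text())
    return "quantization_config" not in cfg

# Example (path is a placeholder):
# print(is_clean_merge("output/qwen3vl_8b_disaster_merged"))
```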
GGUF Conversion and Quantization
python -m pip install gguf
git clone https://github.com/ggml-org/llama.cpp <replace-with-your-llama.cpp-path>
# Convert merged HF checkpoint to GGUF
python3 <replace-with-your-llama.cpp-path>/convert_hf_to_gguf.py \
<replace-with-your-LLaMA-Factory-path>/output/qwen3vl_8b_disaster_merged \
--outfile <replace-with-your-LLaMA-Factory-path>/output/gguf_16bit_4bit/disaster-8b-f16.gguf
# Build quantization tool
cd <replace-with-your-llama.cpp-path>
cmake -B build
cmake --build build --target llama-quantize -j$(nproc)
# Quantize to Q4_K_M
./build/bin/llama-quantize \
<replace-with-your-LLaMA-Factory-path>/output/gguf_16bit_4bit/disaster-8b-f16.gguf \
<replace-with-your-LLaMA-Factory-path>/output/gguf_16bit_4bit/disaster-8b-q4km.gguf \
Q4_K_M
Training Details
For the complete fine-tuning story, see the original LoRA adapter repository: WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition
Summary
| Attribute | Detail |
|---|---|
| Framework | LLaMA-Factory |
| Method | QLoRA (4-bit NF4 + LoRA rank 8, target all linear layers) |
| Dataset | WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA |
| Training Samples | 55,008 |
| Test Samples | 5,598 |
| Languages | English, Japanese, Chinese |
| GPUs | 2× NVIDIA RTX 4090 (24 GB) |
| Training Time | ~6.4 hours |
| Final Training Loss | 0.0239 |
Multilingual Examples
English
Q: What type of disaster occurred in this image?
A: This is a fire disaster. Key visual evidence includes charred and blackened terrain,
scorched vegetation, and widespread burn patterns across the landscape.
日本語
Q: この画像ではどのような種類の災害が発生しましたか?
A: 火災災害が発生しました。地表が黒く焼けており、植生や建物に焼失の痕跡が見られます。
中文
Q: 当前图片发生了什么灾害呢?
A: 当前图片发生了风灾灾害。可以看到大量树木倒伏、建筑受损,以及明显的风灾破坏痕迹。
Recommended System Prompt
For best results, use a system prompt such as:
You are a disaster recognition expert. When analyzing disaster images, first identify the disaster type, then explain the key visual evidence supporting your classification. Respond in the same language as the user.
Limitations
- This model is specialized for post-disaster satellite and aerial imagery and may not perform well on ground-level photos.
- The target label space is limited to six disaster classes: fire, flood, hurricane/wind, earthquake, tsunami, and volcano.
- The training data format is relatively simple, so without a good system prompt the model may answer too briefly.
- Geographic coverage is not uniform; performance may vary by region and disaster appearance.
- Higher image resolution can improve fidelity, but it substantially increases TTFT on edge devices.
- The model is primarily intended for English, Japanese, and Chinese.
- This model is for assistance and triage, not fully autonomous decision-making in emergency response.
Intended Use
Recommended
- post-disaster image triage
- disaster type classification from satellite/aerial imagery
- multilingual disaster-image QA
- humanitarian and research workflows
- edge deployment experiments on Jetson-class devices
Not Recommended
- disaster prediction
- ground-level scene understanding
- legal, insurance, or policy decisions without human review
- fine-grained damage severity assessment
- use as the sole source of truth in emergency operations
License
This repository packages artifacts derived from multiple upstream sources:
- Base model: Qwen/Qwen3-VL-8B-Instruct, which declares Apache-2.0
- LoRA adapter: WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition, whose model card declares CC-BY-4.0
- Training dataset: WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA, whose dataset card declares CC-BY-NC-SA-4.0
Because these upstream artifacts use different licenses, the metadata of this repository is set to license: other rather than claiming a single simple license for all distributed artifacts.
Before using, redistributing, fine-tuning, or commercializing this repository, please:
- review all upstream licenses yourself
- confirm that your intended use is compatible with all applicable terms
- avoid assuming that this repository grants rights beyond those granted by upstream authors and applicable law
If you need a single definitive legal statement for production or commercial use, obtain legal review first.
Citation
@misc{wang2026qwen3vl_jetson_disaster,
title={Qwen3VL-8B-4bit-GGUF-Jetson-Deployment: Merged and Quantized Vision-Language Model for Disaster Type Classification},
author={WayBob},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/WayBob/Qwen3VL-8B-4bit-GGUF-Jetson-Deployment}
}
@misc{wang2026qwen3vl_disaster_lora,
title={Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition},
author={WayBob},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/WayBob/Qwen3VL-8B-QLora-4bit-xView2-Disaster-Recognition}
}
@misc{waybob2026disaster_dataset,
title={Disaster Recognition RemoteSense Dataset (EN/CN/JA)},
author={WayBob},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/WayBob/Disaster_Recognition_RemoteSense_EN_CN_JA}
}
@inproceedings{xview2,
title={xBD: A Dataset for Assessing Building Damage from Satellite Imagery},
author={Gupta, Ritwik and Hosfelt, Richard and Sajeev, Sandra and Patel, Nirav and Goodman, Bryce and Doshi, Jigar and Heim, Eric and Choset, Howie and Gaston, Matthew},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year={2019}
}
Acknowledgements
- Qwen Team for the Qwen3-VL base model
- LLaMA-Factory for the fine-tuning workflow
- llama.cpp for efficient GGUF inference on edge devices
- xView2 / xBD and DIUx for the original disaster imagery benchmark
- NVIDIA Jetson platform for edge deployment validation
Disclaimer
This model is intended for research, evaluation, and deployment experimentation.
Always verify model outputs with qualified human reviewers before making real-world decisions in disaster response workflows.