# Qwen3 4B Customer Support - GGUF
This repository contains GGUF quantized versions of ragib01/Qwen3-4B-customer-support for efficient inference with llama.cpp and compatible tools.
## Model Description
A fine-tuned Qwen3-4B model optimized for customer support tasks, converted to GGUF format for efficient CPU and GPU inference.
## Available Quantization Formats
| Filename | Quant Method | Size | Description | Use Case |
|----------|--------------|------|-------------|----------|
| Qwen3-4B-customer-support-f16.gguf | f16 | ~8 GB | Full 16-bit precision | Best quality, requires more RAM |
| Qwen3-4B-customer-support-Q8_0.gguf | Q8_0 | ~4.5 GB | 8-bit quantization | High quality, good balance |
| Qwen3-4B-customer-support-Q6_K.gguf | Q6_K | ~3.5 GB | 6-bit quantization | Good quality, smaller size |
| Qwen3-4B-customer-support-Q5_K_M.gguf | Q5_K_M | ~3 GB | 5-bit medium | Balanced quality/size |
| Qwen3-4B-customer-support-Q4_K_M.gguf | Q4_K_M | ~2.5 GB | 4-bit medium | **Recommended** - best balance |
| Qwen3-4B-customer-support-Q4_K_S.gguf | Q4_K_S | ~2.3 GB | 4-bit small | Smaller, slightly lower quality |
| Qwen3-4B-customer-support-Q3_K_M.gguf | Q3_K_M | ~2 GB | 3-bit medium | Very small, decent quality |
| Qwen3-4B-customer-support-Q2_K.gguf | Q2_K | ~1.5 GB | 2-bit quantization | Smallest, lower quality |
**Recommendation:** Start with `Qwen3-4B-customer-support-Q4_K_M.gguf` for the best balance of quality and size.
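As a rough rule, pick the largest quantization whose file fits in your available memory with some headroom for the context cache. The sketch below encodes the approximate sizes from the table above; the selection helper and its headroom default are illustrative, not an official tool:

```python
# Approximate GGUF file sizes in GB, from the table above,
# ordered largest (highest quality) to smallest.
QUANT_SIZES_GB = {
    "f16": 8.0, "Q8_0": 4.5, "Q6_K": 3.5, "Q5_K_M": 3.0,
    "Q4_K_M": 2.5, "Q4_K_S": 2.3, "Q3_K_M": 2.0, "Q2_K": 1.5,
}

def pick_quant(ram_gb: float, headroom_gb: float = 1.0) -> str:
    """Return the largest quant whose file fits in ram_gb minus headroom.

    headroom_gb is an illustrative reserve for the KV cache and OS;
    falls back to the smallest quant (Q2_K) if nothing fits.
    """
    budget = ram_gb - headroom_gb
    for name, size in QUANT_SIZES_GB.items():  # insertion order: largest first
        if size <= budget:
            return name
    return "Q2_K"

print(pick_quant(8))  # 8 GB RAM, 1 GB headroom -> Q8_0
```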
## Usage
### LM Studio

1. Open LM Studio
2. Go to the "Search" tab
3. Search for `ragib01/Qwen3-4B-customer-support`
4. Download your preferred quantization
5. Load and start chatting!
### llama.cpp (Command Line)

```bash
# Download the recommended quantization
huggingface-cli download ragib01/Qwen3-4B-customer-support-gguf Qwen3-4B-customer-support-Q4_K_M.gguf --local-dir ./models

# Run an interactive prompt
./llama-cli -m ./models/Qwen3-4B-customer-support-Q4_K_M.gguf -p "How do I track my order?" -n 256
```
### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Qwen3-4B-customer-support-Q4_K_M.gguf",
    n_ctx=2048,       # context window in tokens
    n_threads=8,      # CPU threads to use
    n_gpu_layers=35,  # layers to offload to GPU (set to 0 for CPU-only)
)

output = llm(
    "How do I track my order?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
print(output["choices"][0]["text"])
```
### Ollama

```bash
# Create a Modelfile pointing at the downloaded GGUF
cat > Modelfile << EOF
FROM ./Qwen3-4B-customer-support-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

ollama create qwen3-customer-support -f Modelfile
ollama run qwen3-customer-support "How do I track my order?"
```
## Prompt Format

This model uses the Qwen chat format (ChatML):

```
<|im_start|>system
You are a helpful customer support assistant.<|im_end|>
<|im_start|>user
How do I track my order?<|im_end|>
<|im_start|>assistant
```
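For tools that take a raw prompt string (such as `llama-cli -p`), the ChatML template above can be assembled programmatically. A minimal sketch; the helper name is ours and not part of any library:

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a Qwen ChatML prompt, leaving the assistant turn open
    so the model generates the reply."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful customer support assistant.",
    "How do I track my order?",
)
print(prompt)
```

Note that chat-aware frontends (LM Studio, Ollama, `create_chat_completion` in llama-cpp-python) apply this template for you; manual assembly is only needed for raw completion APIs.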
## Performance Notes
- **CPU:** Q4_K_M works well on modern CPUs with 8 GB+ RAM
- **GPU:** Use higher-precision quantizations (Q6_K, Q8_0) if you have VRAM available
- **Mobile:** Q3_K_M or Q2_K for resource-constrained devices
## Original Model

These GGUF files are quantized from ragib01/Qwen3-4B-customer-support, which was fine-tuned from unsloth/Qwen3-4B-Instruct-2507.
## License
Apache 2.0
## Citation

```bibtex
@misc{qwen3-customer-support-gguf,
  author    = {ragib01},
  title     = {Qwen3 4B Customer Support - GGUF},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ragib01/Qwen3-4B-customer-support-gguf}
}
```