---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- quantized
- 4-bit
- int4
- qwen3
language:
- en
pipeline_tag: text-generation
---
# Qwen3-4B-AWQ-INT4

INT4 AWQ quantization of [`Qwen/Qwen3-4B`](https://huggingface.co/Qwen/Qwen3-4B). Built to run on a single 6 GB+ consumer GPU.

## Footprint

| | |
|---|---|
| Source params | 4B |
| Quantized weights | ~2.5 GB on disk |
| Inference VRAM (incl. KV cache @ 32K context) | ~6 GB |

Fits any 6 GB+ consumer card: RTX 2060 / 3050 / 4050 / mobile GPUs / older laptops. The smallest tier in the kit.

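As a sanity check, the disk figure can be reproduced with back-of-envelope arithmetic. A minimal sketch, assuming group size 128 with one fp16 scale per group and embeddings kept in fp16 (typical for AWQ exports; not confirmed from this repo's config):

```python
# Rough check of the ~2.5 GB disk figure. Assumptions (not from this repo's
# config): group size 128 with one fp16 scale per group, embeddings left in
# fp16, and Qwen3-4B's vocab (151,936) x hidden size (2560).
total_params = 4.0e9
embed_params = 151_936 * 2560           # embedding weights, stored in fp16
quant_params = total_params - embed_params
bits_per_weight = 4 + 16 / 128          # 4-bit weight + amortized fp16 scale
size_gb = (quant_params * bits_per_weight / 8 + embed_params * 2) / 1e9
print(f"{size_gb:.2f} GB")              # ~2.6 GB, in line with ~2.5 GB on disk
```

Per-group zero points and metadata add a little more, so the estimate lands slightly above the packed minimum of 2 GB.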
## Bench

Scored on [`drawais/needle-1M-bench-mvp`](https://huggingface.co/datasets/drawais/needle-1M-bench-mvp) (50K-token haystack, real arXiv text):

| Metric | Score |
|---|---|
| Overall recall | **70.0%** |
| Paper-anchored | 60.0% |
| Synthetic codes | 80.0% |

## Quick start

Serve with vLLM:

```bash
vllm serve drawais/Qwen3-4B-AWQ-INT4 --quantization awq_marlin --max-model-len 32768
```
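The server exposes vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default host and port (`localhost:8000`); the prompt and `max_tokens` are illustrative:

```python
import json
import urllib.request

# Chat request against vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "drawais/Qwen3-4B-AWQ-INT4",
    "messages": [{"role": "user", "content": "Summarize AWQ in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-style client library pointed at the same base URL works equally well.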

Or load with Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained("drawais/Qwen3-4B-AWQ-INT4")
model = AutoModelForCausalLM.from_pretrained("drawais/Qwen3-4B-AWQ-INT4", device_map="auto")
messages = [{"role": "user", "content": "Hello!"}]  # chat-format prompt
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```

## Context length

Native: 40,960 tokens. For longer contexts, enable YaRN rope scaling per the base model's config.
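
For reference, the base model's card documents YaRN via a `rope_scaling` block added to `config.json` (factor 4.0 over a 32,768-position base window, reaching roughly 131K tokens). The values below are copied from Qwen3's documentation, not from this repo, so verify them against the current base-model card:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

With vLLM, the same settings can typically be passed at serve time via `--rope-scaling` (a JSON string) together with a larger `--max-model-len`.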

## License

Apache 2.0 (inherits from the base model).