Qwen3-4B-Instruct-2507-heretic-W8A16


Overview

This is an 8-bit weight-only (W8A16) quantization of Qwen3-4B-Instruct-2507-heretic, produced with LLMCompressor.

Property        Value
Base Model      Qwen3-4B-Instruct-2507-heretic
Quantization    W8A16 (8-bit weights, 16-bit activations)
Quant Method    compressed-tensors (AWQ-compatible)
Model Size      ~4.1 GB
Context Length  262,144 tokens

Quantization Details

  • Framework: LLMCompressor
  • Quantization Scheme: W8A16 (channel-wise symmetric int8)
  • Format: pack-quantized (compressed-tensors)
  • Target Layers: All Linear layers (except lm_head)
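As a rough illustration of what "channel-wise symmetric int8" means, the sketch below quantizes each row (output channel) of a small weight matrix with its own scale and no zero-point. It is a plain-Python toy under stated assumptions, not LLMCompressor's implementation; the function names and the example weights are hypothetical.

```python
def quantize_channelwise_int8(weight):
    """Quantize each row (output channel) of a 2-D weight list to symmetric int8."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # One scale per output channel; symmetric means there is no zero-point.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate weights by multiplying each row by its scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical weights: two output channels with very different magnitudes.
weight = [[0.50, -1.27, 0.02], [0.001, 0.004, -0.003]]
q, scales = quantize_channelwise_int8(weight)
restored = dequantize(q, scales)

print(q)
print(scales)
print(restored)
```

Because each channel has its own scale, the second row keeps its resolution even though its values are far smaller than those in the first row. The activations stay in 16-bit, so only the weights go through this round trip.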

Performance

Perplexity (WikiText-2-raw-v1, test split)

Model               Perplexity  Degradation
Original FP16       17.4634     -
This Model (W8A16)  17.5556     +0.53%

Quantization is nearly lossless on this benchmark: perplexity rises by 0.0922 (+0.53%) relative to the FP16 baseline.
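The degradation figure can be reproduced from the two perplexity values reported above. The helper name below is hypothetical; the arithmetic is simply the relative change against the FP16 baseline.

```python
def relative_degradation(baseline_ppl, quantized_ppl):
    """Return the perplexity increase as a percentage of the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100.0


fp16_ppl = 17.4634   # Original FP16, from the table above
w8a16_ppl = 17.5556  # This model (W8A16), from the table above

delta = relative_degradation(fp16_ppl, w8a16_ppl)
print(f"{delta:+.2f}%")  # +0.53%
```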

Usage

With vLLM

vllm serve groxaxo/Qwen3-4B-Instruct-2507-heretic-W8A16 \
  --quantization compressed-tensors \
  --trust-remote-code
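Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address, http://localhost:8000, and that the served model name equals the repository ID; adjust both if you passed different options. The helper names are hypothetical.

```python
import json
import urllib.request

# ASSUMPTION: default host and port of `vllm serve`; change if you set --host/--port.
BASE_URL = "http://localhost:8000/v1"
# ASSUMPTION: served model name defaults to the repository ID used on the command line.
MODEL_ID = "groxaxo/Qwen3-4B-Instruct-2507-heretic-W8A16"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(send_chat(build_chat_request("Hello, how are you?")))
```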

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3-4B-Instruct-2507-heretic-W8A16",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "groxaxo/Qwen3-4B-Instruct-2507-heretic-W8A16"
)

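# Format the conversation with the model's chat template, then generate a reply
messages = [{"role": "user", "content": "Hello, how are you?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens so the prompt is not echoed back
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))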

Python API (vLLM)

from vllm import LLM, SamplingParams

llm = LLM(
    model="groxaxo/Qwen3-4B-Instruct-2507-heretic-W8A16",
    quantization="compressed-tensors",
    trust_remote_code=True
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Model Architecture

  • Architecture: Qwen3ForCausalLM
  • Hidden Size: 2,560
  • Intermediate Size: 9,728
  • Attention Heads: 32
  • KV Heads: 8 (GQA)
  • Layers: 36
  • Vocab Size: 151,936
  • RoPE Theta: 5,000,000
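The dimensions above allow a back-of-the-envelope estimate of the parameter count and the on-disk size. The sketch below is only an approximation under loud assumptions: a per-head dimension of 128 and tied input/output embeddings are not stated in this card, and normalization weights and quantization scales are ignored. Linear layers are counted at 1 byte per weight (int8) and the embedding at 2 bytes per weight (16-bit), since lm_head is excluded from quantization.

```python
# Values taken from the architecture list above.
HIDDEN = 2560
INTERMEDIATE = 9728
N_HEADS = 32
N_KV_HEADS = 8
N_LAYERS = 36
VOCAB = 151_936

# ASSUMPTIONS (not stated in this card): head dimension and tied embeddings.
HEAD_DIM = 128
TIED_EMBEDDINGS = True


def linear_params_per_layer():
    """Count weights in the attention and MLP projections of one decoder layer."""
    q_proj = HIDDEN * N_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * N_KV_HEADS * HEAD_DIM  # k_proj and v_proj (GQA)
    o_proj = N_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    return q_proj + kv_proj + o_proj + mlp


linear_params = N_LAYERS * linear_params_per_layer()
embed_params = VOCAB * HIDDEN * (1 if TIED_EMBEDDINGS else 2)

# int8 for quantized linear layers, 16-bit for the unquantized embedding.
size_bytes = linear_params * 1 + embed_params * 2

print(f"linear params:  {linear_params:,}")
print(f"embed params:   {embed_params:,}")
print(f"estimated size: {size_bytes / 1e9:.2f} GB ({size_bytes / 2**30:.2f} GiB)")
```

Under these assumptions the estimate comes to about 4.41 GB, or roughly 4.11 GiB, which is in the same range as the ~4.1 GB listed in the overview.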

Acknowledgements

License

Apache 2.0 (inherited from base model)
