
qwen3_0pt6b_awq

This is a 4-bit quantized version of Qwen3-0.6B, produced with AWQ (Activation-aware Weight Quantization).

Quantization Details

  • Method: AWQ
  • Bits: 4-bit
  • Group Size: 128
  • Zero Point: True
  • Calibration Dataset: wikitext (wikitext-103-v1)
  • Calibration Samples: 128
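To make the settings above concrete, here is a minimal NumPy sketch of group-wise asymmetric (zero-point) 4-bit quantization. It is not the full AWQ algorithm (AWQ additionally uses activation statistics from the calibration set to rescale salient channels before quantizing); it only illustrates what "4-bit, group size 128, zero point" means for the weights themselves.

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric (zero-point) quantization of one group of weights.

    Maps floats in [w.min(), w.max()] onto integers in [0, 2**bits - 1].
    """
    qmax = 2 ** bits - 1                      # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax        # step size between levels
    zero = round(-w.min() / scale)            # integer level representing 0.0
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

# One weight row quantized in groups of 128 input channels,
# each group with its own scale and zero point.
rng = np.random.default_rng(0)
row = rng.normal(size=512).astype(np.float32)
groups = row.reshape(-1, 128)                 # 4 groups of 128
recon = np.concatenate(
    [dequantize_group(*quantize_group(g)) for g in groups]
)
print(np.abs(recon - row).max())              # bounded by ~one quantization step
```

Smaller groups track the local weight distribution more closely (lower error) at the cost of storing more scales and zero points; 128 is a common middle ground.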

Weight Storage Format

  • Quantized weights and zero points (qweight, qzeros): Stored as int32 (packed 4-bit values)
  • Scales (qscales): Stored as torch.bfloat16 (float precision)
  • Non-quantized Layers (embeddings, norms): Stored as torch.bfloat16
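The "packed 4-bit values" storage works because eight 4-bit integers fit in one 32-bit word. The sketch below shows the idea with NumPy; the exact bit layout of the on-disk qweight/qzeros tensors may differ from this illustration.

```python
import numpy as np

def pack_int4(values):
    """Pack 4-bit integers (0..15) into int32 words, eight per word."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (int(v) & 0xF) << (4 * j)   # nibble j occupies bits 4j..4j+3
        words.append(word)
    # Build as uint32, then reinterpret the same bits as int32
    return np.array(words, dtype=np.uint32).view(np.int32)

def unpack_int4(packed):
    """Recover the original 4-bit integers from packed int32 words."""
    words = packed.view(np.uint32)
    return np.array(
        [(int(w) >> (4 * j)) & 0xF for w in words for j in range(8)],
        dtype=np.uint8,
    )

vals = np.arange(16, dtype=np.uint8) % 16       # sixteen 4-bit values
packed = pack_int4(vals)                        # -> two int32 words
assert (unpack_int4(packed) == vals).all()      # lossless round trip
```

This is why the stored tensors are int32 even though the effective precision is 4-bit: the packing is purely a storage format, an 8x reduction in element count.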

Loading Instructions

This model can be loaded directly using Hugging Face Transformers with trust_remote_code=True:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model - torch_dtype="auto" uses the dtype from config.json for non-quantized parts
model = AutoModelForCausalLM.from_pretrained(
    "seangogo/qwen3-0.6b-awq",
    trust_remote_code=True,
    torch_dtype="auto",  # Uses config.torch_dtype for embeddings, norms, etc.
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("seangogo/qwen3-0.6b-awq")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note on dtype:

  • torch_dtype="auto" controls the dtype for non-quantized layers (embeddings, norms, biases)
  • Quantized weights are always stored as int32 (packed 4-bit values) regardless of this setting
  • The quantized layers automatically handle the int32→float conversion during inference

Alternatively, you can explicitly specify the dtype for non-quantized parts:

model = AutoModelForCausalLM.from_pretrained(
    "seangogo/qwen3-0.6b-awq",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Only affects non-quantized layers
    device_map="auto"
)

Model Architecture

The quantized model uses custom WQLinear_GEMM layers instead of standard Linear layers for efficient 4-bit inference with Triton kernels.
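As a rough mental model of what such a layer computes, here is a hypothetical NumPy sketch of a dequantize-then-matmul forward pass. The tensor names follow the file format above, but the values here are fabricated toy data, one integer per weight for clarity (the real tensors pack eight per int32), and the actual WQLinear_GEMM layer fuses these steps in a Triton kernel rather than materializing the float weight matrix.

```python
import numpy as np

group_size = 4                   # toy size; the model uses 128
in_f, out_f = 8, 2
rng = np.random.default_rng(0)

# Fake quantized state for a single linear layer
qweight = rng.integers(0, 16, size=(in_f, out_f))             # 4-bit levels
qzeros = rng.integers(0, 16, size=(in_f // group_size, out_f))
scales = rng.random(size=(in_f // group_size, out_f)).astype(np.float32)

# Dequantize: (q - zero) * scale, broadcasting each group's parameters
g = np.repeat(np.arange(in_f // group_size), group_size)      # group index per row
w = (qweight - qzeros[g]) * scales[g]                         # float weight matrix

# Standard linear forward on the dequantized weights
x = rng.normal(size=(1, in_f)).astype(np.float32)
y = x @ w
print(y.shape)  # (1, 2)
```

Fusing the unpack/dequantize into the matmul kernel is what makes 4-bit inference memory-efficient: the full-precision weight matrix never needs to exist in GPU memory at once.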

Requirements

pip install torch transformers safetensors triton

Files Included

  • model.safetensors - Quantized model weights (qweight, qscales, qzeros)
  • config.json - Model configuration with quantization metadata
  • modeling_qwen3.py - Custom model architecture (auto-loaded with trust_remote_code=True)
  • quantization_layers.py - Quantized linear layer implementations
  • norm_layers.py - Custom normalization layers
  • rope_layers.py - Rotary position embedding implementation
  • tokenizer.json - Tokenizer files