
qwen3_0pt6b_awq

This is a 4-bit quantized version of Qwen3-0.6B, produced with AWQ (Activation-aware Weight Quantization).

Quantization Details

  • Method: AWQ
  • Bits: 4-bit
  • Group Size: 128
  • Zero Point: True
  • Calibration Dataset: wikitext (wikitext-103-v1)
  • Calibration Samples: 128
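To make the settings above concrete, here is a minimal NumPy sketch of group-wise asymmetric (zero-point) 4-bit quantization. It is not the full AWQ algorithm (AWQ additionally uses activation statistics from the calibration set to rescale salient channels before quantizing); it only illustrates what "4-bit, group size 128, zero point" means for the weights themselves.

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric (zero-point) quantization of one group of weights.

    Maps floats in [w.min(), w.max()] onto integers in [0, 2**bits - 1].
    """
    qmax = 2 ** bits - 1                      # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax        # step size between levels
    zero = round(-w.min() / scale)            # integer level representing 0.0
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

# One weight row quantized in groups of 128 input channels,
# each group with its own scale and zero point.
rng = np.random.default_rng(0)
row = rng.normal(size=512).astype(np.float32)
groups = row.reshape(-1, 128)                 # 4 groups of 128
recon = np.concatenate(
    [dequantize_group(*quantize_group(g)) for g in groups]
)
print(np.abs(recon - row).max())              # bounded by ~one quantization step
```

Smaller groups track the local weight distribution more closely (lower error) at the cost of storing more scales and zero points; 128 is a common middle ground.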

Weight Storage Format

  • Quantized weights and zero points (qweight, qzeros): Stored as int32 (packed 4-bit values)
  • Scales (qscales): Stored as torch.bfloat16 (float precision)
  • Non-quantized Layers (embeddings, norms): Stored as torch.bfloat16
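The "packed 4-bit values" storage works because eight 4-bit integers fit in one 32-bit word. The sketch below shows the idea with NumPy; the exact bit layout of the on-disk qweight/qzeros tensors may differ from this illustration.

```python
import numpy as np

def pack_int4(values):
    """Pack 4-bit integers (0..15) into int32 words, eight per word."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (int(v) & 0xF) << (4 * j)   # nibble j occupies bits 4j..4j+3
        words.append(word)
    # Build as uint32, then reinterpret the same bits as int32
    return np.array(words, dtype=np.uint32).view(np.int32)

def unpack_int4(packed):
    """Recover the original 4-bit integers from packed int32 words."""
    words = packed.view(np.uint32)
    return np.array(
        [(int(w) >> (4 * j)) & 0xF for w in words for j in range(8)],
        dtype=np.uint8,
    )

vals = np.arange(16, dtype=np.uint8) % 16       # sixteen 4-bit values
packed = pack_int4(vals)                        # -> two int32 words
assert (unpack_int4(packed) == vals).all()      # lossless round trip
```

This is why the stored tensors are int32 even though the effective precision is 4-bit: the packing is purely a storage format, an 8x reduction in element count.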

Loading Instructions

This model can be loaded directly using Hugging Face Transformers with trust_remote_code=True:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model - torch_dtype="auto" uses the dtype from config.json for non-quantized parts
model = AutoModelForCausalLM.from_pretrained(
    "seangogo/qwen3-0.6b-awq",
    trust_remote_code=True,
    torch_dtype="auto",  # Uses config.torch_dtype for embeddings, norms, etc.
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("seangogo/qwen3-0.6b-awq")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note on dtype:

  • torch_dtype="auto" controls the dtype for non-quantized layers (embeddings, norms, biases)
  • Quantized weights are always stored as int32 (packed 4-bit values) regardless of this setting
  • The quantized layers automatically handle the int32→float conversion during inference

Alternatively, you can explicitly specify the dtype for non-quantized parts:

model = AutoModelForCausalLM.from_pretrained(
    "seangogo/qwen3-0.6b-awq",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Only affects non-quantized layers
    device_map="auto"
)

Model Architecture

The quantized model uses custom WQLinear_GEMM layers instead of standard Linear layers for efficient 4-bit inference with Triton kernels.
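As a rough mental model of what such a layer computes, here is a hypothetical NumPy sketch of a dequantize-then-matmul forward pass. The tensor names follow the file format above, but the values here are fabricated toy data, one integer per weight for clarity (the real tensors pack eight per int32), and the actual WQLinear_GEMM layer fuses these steps in a Triton kernel rather than materializing the float weight matrix.

```python
import numpy as np

group_size = 4                   # toy size; the model uses 128
in_f, out_f = 8, 2
rng = np.random.default_rng(0)

# Fake quantized state for a single linear layer
qweight = rng.integers(0, 16, size=(in_f, out_f))             # 4-bit levels
qzeros = rng.integers(0, 16, size=(in_f // group_size, out_f))
scales = rng.random(size=(in_f // group_size, out_f)).astype(np.float32)

# Dequantize: (q - zero) * scale, broadcasting each group's parameters
g = np.repeat(np.arange(in_f // group_size), group_size)      # group index per row
w = (qweight - qzeros[g]) * scales[g]                         # float weight matrix

# Standard linear forward on the dequantized weights
x = rng.normal(size=(1, in_f)).astype(np.float32)
y = x @ w
print(y.shape)  # (1, 2)
```

Fusing the unpack/dequantize into the matmul kernel is what makes 4-bit inference memory-efficient: the full-precision weight matrix never needs to exist in GPU memory at once.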

Requirements

pip install torch transformers safetensors triton

Files Included

  • model.safetensors - Quantized model weights (qweight, qscales, qzeros)
  • config.json - Model configuration with quantization metadata
  • modeling_qwen3.py - Custom model architecture (auto-loaded with trust_remote_code=True)
  • quantization_layers.py - Quantized linear layer implementations
  • norm_layers.py - Custom normalization layers
  • rope_layers.py - Rotary position embedding implementation
  • tokenizer.json - Tokenizer files