# qwen3_0pt6b_awq

This is a 4-bit quantized version of Qwen3-0.6B, produced with AWQ (Activation-aware Weight Quantization).
## Quantization Details
- Method: AWQ
- Bits: 4-bit
- Group Size: 128
- Zero Point: True
- Calibration Dataset: wikitext (wikitext-103-v1)
- Calibration Samples: 128
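Since the weights are 4-bit, eight quantized values fit into each 32-bit storage word. A minimal sketch of the packing idea follows; the real AWQ kernels use their own bit layout and packed tensors, so `pack_int4`/`unpack_int4` here are purely illustrative helpers:

```python
# Illustrative only: packing eight 4-bit values (0..15) into one int32,
# the way AWQ-style formats store qweight/qzeros. The actual kernel
# layout may order the nibbles differently.

def pack_int4(values):
    """Pack a list of eight 4-bit values into a single 32-bit integer."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (4 * i)  # each value gets its own 4-bit slot
    return packed

def unpack_int4(packed):
    """Recover the eight 4-bit values from a packed 32-bit integer."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 5]
assert unpack_int4(pack_int4(vals)) == vals  # round-trips losslessly
```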
## Weight Storage Format

- Quantized weights (`qweight`, `qzeros`): stored as `int32` (packed 4-bit values)
- Scales (`qscales`): stored as `torch.bfloat16` (float precision)
- Non-quantized layers (embeddings, norms): stored as `torch.bfloat16`
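The scales and zero points exist so that each group of 128 weights can be mapped back to floats. Assuming the usual AWQ recipe (this is a sketch with illustrative names, not the repository's actual kernel code), dequantization within a group is `(q - zero_point) * scale`:

```python
# Sketch of group-wise dequantization: within each quantization group,
# a float weight is recovered from its 4-bit integer as
# (q - zero_point) * scale. Names and values are illustrative.

def dequantize_group(qvals, zero_point, scale):
    """Dequantize one group of 4-bit integer weights to floats."""
    return [(q - zero_point) * scale for q in qvals]

# Example: 4-bit values around a zero point of 8 with a small scale
group = dequantize_group([0, 8, 15], zero_point=8, scale=0.01)
# group is approximately [-0.08, 0.0, 0.07]
```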
## Loading Instructions

This model can be loaded directly with Hugging Face Transformers using `trust_remote_code=True`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model - torch_dtype="auto" uses the dtype from config.json for non-quantized parts
model = AutoModelForCausalLM.from_pretrained(
    "seangogo/qwen3-0.6b-awq",
    trust_remote_code=True,
    torch_dtype="auto",  # Uses config.torch_dtype for embeddings, norms, etc.
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("seangogo/qwen3-0.6b-awq")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Note on dtype:**

- `torch_dtype="auto"` controls the dtype of the non-quantized layers (embeddings, norms, biases)
- Quantized weights are always stored as `int32` (packed 4-bit values) regardless of this setting
- The quantized layers handle the int32 → float conversion automatically during inference
Alternatively, you can explicitly specify the dtype for the non-quantized parts:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "seangogo/qwen3-0.6b-awq",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Only affects non-quantized layers
    device_map="auto"
)
```
## Model Architecture

The quantized model uses custom `WQLinear_GEMM` layers in place of standard `Linear` layers for efficient 4-bit inference with Triton kernels.
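Conceptually, a `WQLinear_GEMM`-style layer unpacks the int4 weights, dequantizes them with per-group scales and zero points, and performs an ordinary matrix multiply; the real implementation fuses these steps inside a Triton kernel. A pure-Python sketch of the data flow only, with illustrative names and a tiny group size for readability:

```python
# Hypothetical sketch of a quantized-linear forward pass:
# dequantize each weight from its group's (scale, zero_point), then
# accumulate a plain dot product. The real WQLinear_GEMM fuses this
# in a Triton kernel and uses group_size=128.

def wqlinear_forward(x, qweight, scales, zeros, group_size=2):
    """x: input vector; qweight: int4 values, one row per output feature."""
    out = []
    for row, qrow in enumerate(qweight):       # one output feature per row
        acc = 0.0
        for i, q in enumerate(qrow):
            g = i // group_size                # which quantization group
            w = (q - zeros[row][g]) * scales[row][g]  # dequantize
            acc += w * x[i]
        out.append(acc)
    return out

# Tiny example: 1 output feature, 4 inputs, groups of 2
y = wqlinear_forward(
    x=[1.0, 2.0, 3.0, 4.0],
    qweight=[[9, 7, 10, 6]],
    scales=[[0.5, 0.25]],
    zeros=[[8, 8]],
)
# y == [-1.0]
```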
## Requirements

```shell
pip install torch transformers safetensors triton
```
## Files Included

- `model.safetensors` - Quantized model weights (qweight, qscales, qzeros)
- `config.json` - Model configuration with quantization metadata
- `modeling_qwen3.py` - Custom model architecture (auto-loaded with `trust_remote_code=True`)
- `quantization_layers.py` - Quantized linear layer implementations
- `norm_layers.py` - Custom normalization layers
- `rope_layers.py` - Rotary position embedding implementation
- `tokenizer.json` - Tokenizer files