# lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-4bit
This repository is a standalone quantized release of Qwen/Qwen3.5-0.8B published by lew96123.
The previous model card was only a short summary; this one is intentionally detailed so the repo page itself explains what was stored, how it was produced, what was evaluated, and what trade-offs showed up.
## Release snapshot

- Base model: Qwen/Qwen3.5-0.8B
- Loader kind / pipeline tag: image-text-to-text / image-text-to-text
- Primary target bit-width: 4
- Effective quantized bit-widths actually present in the checkpoint: 4-bit (265 tensors)
- Quantization method: turboquant_mse
- Quantized tensors: 265
- Passthrough tensors: 223
- Original storage: 1.63 GiB (1,746,882,752 bytes)
- Stored artifact size: 418.55 MiB (438,881,088 bytes)
- Bytes saved: 1.22 GiB (1,308,001,664 bytes)
- Compression ratio: 3.980x
- Base perplexity: 18.2902
- Quantized perplexity: 31.4105
- Perplexity delta vs base: +13.1203 (1.717x of base)
- Mean last-token logit cosine similarity: 0.868845
- Evaluation dataset role: sanity-test
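The snapshot numbers above are internally consistent; a minimal sketch that re-derives the saved bytes and compression ratio from the two quoted byte counts:

```python
# Verify the compression figures quoted above from the raw byte counts.
original_bytes = 1_746_882_752   # 1.63 GiB original storage
stored_bytes = 438_881_088       # 418.55 MiB stored artifact

saved = original_bytes - stored_bytes
ratio = original_bytes / stored_bytes

print(f"bytes saved: {saved:,}")      # 1,308,001,664
print(f"compression: {ratio:.3f}x")   # 3.980x
```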
## Artifact inventory

| File | Purpose |
|---|---|
| `turboquant_weights.safetensors` | Packed checkpoint payload containing quantized weights (and any passthrough tensors stored directly). |
| `quant_manifest.json` | Full manifest with per-parameter metadata, storage prefixes, tensor shapes, and effective bit-width decisions. |
| `eval_summary.json` | Serialized evaluation metrics for the base model and the quantized artifact. |
| `sample_generations.json` | Deterministic sample generations produced from the stored prompts in the pipeline config. |
| `load_quantized.py` | Self-contained Python loader that reconstructs the quantized model from this repository alone. |
| `requirements.txt` | Minimal dependency hints for loading this artifact. |
| `config.json`, tokenizer files, generation config, and related support files | Upstream support files copied so the repo can be loaded without separately downloading missing config assets. |
## Quantization design

This release uses TurboQuant (arXiv:2504.19874, Zandieh et al. 2025), an online vector quantization scheme with near-optimal distortion rate. Each weight row is treated as a d-dimensional vector and quantized using a random rotation followed by optimal scalar quantization of the induced Beta-distributed coordinates (Lloyd-Max codebook).
- Bit-width target for the main release: 4
- Quantizer: turboquant_mse
- Seed: 42
- Grid size for Lloyd-Max codebook estimation: 8193
- Lloyd-Max refinement iterations: 96
- Quantized scalar values covered by the manifest: 873,250,816
- Passthrough scalar values covered by the manifest: 187,968
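The two-stage idea (random rotation, then a per-coordinate Lloyd-Max codebook) can be sketched in a few lines of NumPy. This is an illustrative toy, not the pipeline code in this repo; the dimension, iteration count, and function names are chosen for the example:

```python
import numpy as np

def random_rotation(d, seed=42):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs for uniformity

def lloyd_max(samples, bits, iters):
    # 1-D Lloyd-Max: alternate nearest-codeword assignment and centroid update.
    levels = 2 ** bits
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = samples[idx == k].mean()
    return np.sort(codebook)

def quantize_row(row, rot, codebook):
    # Rotate, snap each coordinate to its nearest codeword, rotate back.
    z = rot @ row
    idx = np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)
    return idx, rot.T @ codebook[idx]

d = 64
rot = random_rotation(d)
row = np.random.default_rng(0).standard_normal(d)
codebook = lloyd_max(rot @ row, bits=4, iters=30)
codes, row_hat = quantize_row(row, rot, codebook)
rel_err = np.linalg.norm(row - row_hat) / np.linalg.norm(row)
```

Because the rotation is orthogonal, the reconstruction error in the rotated space equals the error on the original row, so the codebook quality directly bounds the weight distortion.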
## Allocator policy

Allocator-driven mixed precision is disabled for this artifact, so the release stays uniform apart from any protected-weight rules.
- Mixed-precision allocator: disabled
- Allocation method: uniform
- Allowed effective bit-widths: 4
- Allocation target: None
- Calibration text count: 0
- Max calibration windows: None
- Promoted tensors from allocator: 0
## Paper context and claim boundaries

- Primary reference for the quantizer idea: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv:2504.19874, Apr 2025; Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni).
- This repo adapts TurboQuant-style vector quantization to an offline full-weight checkpoint PTQ flow.
- The cited TurboQuant paper's strongest empirical evidence is on vector / KV-cache quantization and nearest-neighbor search, so this artifact should be read as a paper-inspired adaptation, not as a full reproduction of the paper's published experimental setup.
- Related literature that may improve later iterations, but is not implemented in this artifact:
  - SpinQuant (arXiv:2405.16406): learned rotations for PTQ.
  - APTQ (arXiv:2402.14866) and BAQ (arXiv:2506.05664): sensitivity-based mixed precision / bit allocation.
  - GuidedQuant (arXiv:2505.07004), LeanQuant (arXiv:2407.10032), and D^2Quant (arXiv:2602.02546): stronger sub-4-bit weight-only PTQ strategies.
  - PolarQuant (arXiv:2603.29078) and ParoQuant (arXiv:2511.10645): newer rotation / preprocessing directions for low-bit compression.
## Protected-weight policy

Tensors matching the protection rules would be treated as sensitive and exempted from the policy applied to the main body weights. No such rules were configured for this release.

- Matching exact names: none configured.
- Matching name fragments: none configured.
- Effective behavior: protected weights remain passthrough tensors instead of being quantized.

### Protected tensors quantized in this artifact

- No protected tensors were quantized in this run.

### Protected tensors kept as passthrough

- No protected tensors were left as passthrough due to the protection rules.

### Mixed-precision tensors introduced by the protection rule

- No tensors were promoted above the primary release bit-width.
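The protection rules reduce to simple name matching. A minimal sketch (a hypothetical helper for illustration, not a function from `load_quantized.py`):

```python
def is_protected(name, exact_names=(), fragments=()):
    # A tensor is protected if its name matches exactly
    # or contains any configured substring fragment.
    return name in exact_names or any(f in name for f in fragments)

# This release configures no rules, so nothing matches:
print(is_protected("model.language_model.embed_tokens.weight"))  # False

# How a substring rule would behave if one were configured:
print(is_protected("model.lm_head.weight", fragments=("lm_head",)))  # True
```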
## Storage accounting
| Quantity | Value |
|---|---|
| Original bytes | 1,746,882,752 |
| Stored bytes | 438,881,088 |
| Bytes saved | 1,308,001,664 |
| Compression ratio | 3.980x |
| Quantized tensor count | 265 |
| Passthrough tensor count | 223 |
| Quantized scalar values | 873,250,816 |
| Passthrough scalar values | 187,968 |
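The stored size can be roughly reconstructed from the scalar counts above. This sketch assumes tight 4-bit packing for quantized values and 2 bytes/scalar (bf16) for passthrough tensors; attributing the residual to codebooks, scales, and safetensors headers is an assumption, and `quant_manifest.json` remains the source of truth:

```python
# Rough byte accounting from the manifest-level counts above.
quantized_scalars = 873_250_816
passthrough_scalars = 187_968
stored_bytes = 438_881_088

quantized_payload = quantized_scalars * 4 // 8   # 4 bits per value
passthrough_payload = passthrough_scalars * 2    # bf16 passthrough
overhead = stored_bytes - quantized_payload - passthrough_payload

print(f"quantized payload:   {quantized_payload:,} bytes")
print(f"passthrough payload: {passthrough_payload:,} bytes")
print(f"residual overhead:   {overhead:,} bytes")  # ~1.8 MiB of metadata
```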
## Parameter coverage

The table below shows the largest tensors recorded in the manifest. This is helpful when auditing whether the storage budget is dominated by a few very large matrices or by a broad spread of medium-sized tensors.

| Tensor | Kind | Effective bit-width | Shape | Values |
|---|---|---|---|---|
| model.language_model.embed_tokens.weight | quantized | 4 | 248320x1024 | 254,279,680 |
| model.visual.merger.linear_fc1.weight | quantized | 4 | 3072x3072 | 9,437,184 |
| model.language_model.layers.0.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.1.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.10.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.12.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.13.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.14.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.16.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.17.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.18.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
| model.language_model.layers.2.linear_attn.in_proj_qkv.weight | quantized | 4 | 6144x1024 | 6,291,456 |
## Evaluation protocol
The dataset below is used in this repo as a sanity / regression test set only. It is useful for catching catastrophic degradation and comparing close variants under a fixed quick pass, but it is not enough to support broad quality claims.
| Item | Value |
|---|---|
| Dataset | wikitext:wikitext-2-raw-v1 |
| Dataset repo | wikitext |
| Dataset config | wikitext-2-raw-v1 |
| Dataset role in this repo | sanity-test |
| Split | test |
| Max context length | 512 |
| Sliding-window stride | 256 |
| Max eval windows | full pass |
| Logit comparison prompts | 4 |
| Sample generation prompts | 3 |
| Generation mode | deterministic greedy decoding |
| Max generated tokens | 64 |
## Metric comparison
| Metric | Base model | Quantized model | Notes |
|---|---|---|---|
| Average NLL | 2.906366 | 3.447144 | Lower is better |
| Perplexity | 18.2902 | 31.4105 | Lower is better |
| Tokens scored | 295,893 | 295,893 | Sliding-window effective tokens |
| Windows scored | 1,160 | 1,160 | Reported when dataset evaluation is used |
| Mean last-token logit cosine similarity | N/A | 0.868845 | Higher is better |
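The perplexity rows follow directly from the NLL rows, since perplexity is exp(average NLL); a quick check of the reported values:

```python
import math

base_nll, quant_nll = 2.906366, 3.447144
base_ppl = math.exp(base_nll)    # ≈ 18.2902
quant_ppl = math.exp(quant_nll)  # ≈ 31.4105

# The delta and ratio quoted in the release snapshot follow the same way.
print(f"delta: {quant_ppl - base_ppl:+.4f}")  # ≈ +13.1203
print(f"ratio: {quant_ppl / base_ppl:.3f}x")  # ≈ 1.717x
```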
## How to load the artifact

```python
from load_quantized import load_quantized_model, load_tokenizer

# Reconstruct the quantized model and tokenizer from this repo's files.
model, manifest = load_quantized_model(".", device="cuda")
tokenizer = load_tokenizer(".")

prompt = "Explain why quantization can reduce storage requirements."
if hasattr(tokenizer, "apply_chat_template"):
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    rendered = prompt

batch = tokenizer(rendered, return_tensors="pt").to(model.device)
output = model.generate(**batch, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Sample generations

Responses below are truncated at the 64-token generation cap.

### Sample 1

Prompt: Give a short explanation of why group-wise quantization can outperform naive tensor-wise quantization.

Response:

Group-wise quantization outperforms naive tensor-wise quantization primarily by **reducing the number of required bits** and **minimizing the computational cost** per sample.

Here is the step-by-step reasoning:

1. **Bit Efficiency**: Naive tensor-wise quantization requires quantizing the entire

### Sample 2

Prompt: Write three bullet points describing the tradeoff between 8-bit and 1-bit quantization.

Response:

* **8-bit quantization** offers higher precision and better accuracy for tasks requiring high precision, such as scientific computing or medical imaging. However, it is computationally expensive and consumes more memory, making it less suitable for real-time applications or resource-constrained environments.
* **1-bit quantization** is

### Sample 3

Prompt: Explain what information should appear in a public model card for a custom quantized checkpoint.

Response:

A **public model card** (also known as a public model) is a specific type of checkpoint used in the `torch` library for quantization. It is designed to be shared with other models so that they can be trained together.

When you create a public model card, you are essentially creating a "copy
## Reproducibility configuration

The pipeline fields below are the key knobs that shaped this artifact.

```json
{
  "base_model_id": "Qwen/Qwen3.5-0.8B",
  "loader_kind": "image-text-to-text",
  "quantization_method": "turboquant_mse",
  "primary_bit_width": 4,
  "effective_quantized_bit_widths": [4],
  "seed": 42,
  "grid_size": 8193,
  "iterations": 96,
  "eval_dataset": "wikitext:wikitext-2-raw-v1",
  "eval_dataset_role": "sanity-test",
  "eval_stride": 256,
  "max_eval_length": 512,
  "enable_sensitivity_allocation": false,
  "allocator_name": "disabled",
  "allocation_method": "uniform",
  "allocation_allowed_bit_widths": [],
  "calibration_texts": [],
  "max_generation_tokens": 64,
  "protected_weight_exact_names": [],
  "protected_weight_name_fragments": []
}
```
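A minimal sketch of the kind of consistency check a downstream script can run against this config (the JSON literal below is a subset of the fields shown above, trimmed for brevity):

```python
import json

# Subset of the reproducibility config above, inlined for the example.
config = json.loads("""
{
  "quantization_method": "turboquant_mse",
  "primary_bit_width": 4,
  "effective_quantized_bit_widths": [4],
  "enable_sensitivity_allocation": false,
  "allocation_method": "uniform",
  "allocation_allowed_bit_widths": [],
  "protected_weight_exact_names": [],
  "protected_weight_name_fragments": []
}
""")

# A uniform release should expose exactly one effective bit-width,
# equal to the primary target, with the allocator disabled.
assert config["effective_quantized_bit_widths"] == [config["primary_bit_width"]]
assert not config["enable_sensitivity_allocation"]
assert config["allocation_method"] == "uniform"
assert not config["protected_weight_exact_names"]
```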
## Limitations and caveats
- This is not an official Qwen release.
- Evaluation here is deliberately lightweight and release-focused; it is not a substitute for a broad benchmark suite.
- The default Wikitext setup in this repo is a quick regression check, not a publication-grade benchmark.
- The packed artifact is reconstructed into floating-point tensors at load time, so this repo is mainly a reproducible storage/research artifact rather than a kernel-level optimized inference backend.
- Vision / multimodal behavior was not benchmarked separately in this run even if the upstream model supports those paths.
- Strong 1-bit weight-quality claims would be misleading for this code path today; treat 1-bit results as exploratory unless broader benchmarks say otherwise.
- If you are comparing against other low-bit methods, make sure the protection rules, dataset slice, and evaluation window cap are aligned; otherwise the comparison is apples to oranges.
## Provenance

- Base model: Qwen/Qwen3.5-0.8B
- Base license: apache-2.0
- Upload account: lew96123
- Artifact repo id: lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-4bit
- Manifest source of truth: quant_manifest.json
- Evaluation metrics source of truth: eval_summary.json
- Sample outputs source of truth: sample_generations.json