lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-2bit
This repository is a standalone quantized release of Qwen/Qwen3.5-0.8B published by lew96123.
The previous model card was only a short summary; this one is intentionally detailed so the repo page itself explains what was stored, how it was produced, what was evaluated, and what trade-offs showed up.
Release snapshot
- Base model: Qwen/Qwen3.5-0.8B
- Loader kind / pipeline tag: image-text-to-text / image-text-to-text
- Primary target bit-width: 2
- Effective quantized bit-widths actually present in the checkpoint: 2-bit (265 tensors)
- Quantization method: turboquant_mse
- Quantized tensors: 265
- Passthrough tensors: 223
- Original storage: 1.63 GiB (1,746,882,752 bytes)
- Stored artifact size: 210.35 MiB (220,568,384 bytes)
- Bytes saved: 1.42 GiB (1,526,314,368 bytes)
- Compression ratio: 7.920x
- Base perplexity: 18.2902
- Quantized perplexity: 3814.9180
- Perplexity delta vs base: +3796.6278 (208.577x of base)
- Mean last-token logit cosine similarity: 0.584556
- Evaluation dataset role: sanity-test
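The headline storage numbers above are internally consistent; a minimal arithmetic check:

```python
# Sanity-check the storage figures quoted in the snapshot above.
original_bytes = 1_746_882_752
stored_bytes = 220_568_384

saved = original_bytes - stored_bytes
ratio = original_bytes / stored_bytes

assert saved == 1_526_314_368            # matches "Bytes saved"
print(f"compression ratio: {ratio:.3f}x")  # prints 7.920x
```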
Artifact inventory
| File | Purpose |
|---|---|
| turboquant_weights.safetensors | Packed checkpoint payload containing quantized weights (and any passthrough tensors stored directly). |
| quant_manifest.json | Full manifest with per-parameter metadata, storage prefixes, tensor shapes, and effective bit-width decisions. |
| eval_summary.json | Serialized evaluation metrics for the base model and the quantized artifact. |
| sample_generations.json | Deterministic sample generations produced from the stored prompts in the pipeline config. |
| load_quantized.py | Self-contained Python loader that reconstructs the quantized model from this repository alone. |
| requirements.txt | Minimal dependency hints for loading this artifact. |
| config.json, tokenizer files, generation config, and related support files | Upstream support files copied so the repo can be loaded without separately downloading missing config assets. |
Quantization design
This release uses TurboQuant (arXiv:2504.19874, Zandieh et al. 2025), an online vector quantization scheme with near-optimal distortion rate. Each weight row is treated as a d-dimensional vector and quantized by applying a random rotation followed by optimal scalar quantization of the rotated coordinates, whose marginals follow a Beta distribution (Lloyd-Max codebook).
- Bit-width target for the main release: 2
- Quantizer: turboquant_mse
- Seed: 42
- Grid size for Lloyd-Max codebook estimation: 8193
- Lloyd-Max refinement iterations: 96
- Quantized scalar values covered by the manifest: 873,250,816
- Passthrough scalar values covered by the manifest: 187,968
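The rotate-then-scalar-quantize transform can be sketched in a few lines. This is a simplified illustration with assumed details (a per-row Haar-like rotation and a 1-D Lloyd-Max fit on the empirical coordinates, rather than the 8193-point grid the actual pipeline uses), not the release implementation:

```python
import numpy as np

rng = np.random.default_rng(42)  # matches the release seed

def random_rotation(d: int) -> np.ndarray:
    # Haar-like random orthogonal matrix via QR decomposition.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(x: np.ndarray, levels: int = 4, iters: int = 96) -> np.ndarray:
    # 1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update.
    centers = np.quantile(x, np.linspace(0, 1, levels + 2)[1:-1])
    for _ in range(iters):
        idx = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centers[k] = x[idx == k].mean()
    return centers

# Quantize one weight row: rotate, code each coordinate to 2 bits, decode, un-rotate.
d = 64
w = rng.standard_normal(d)
R = random_rotation(d)
z = R @ w
centers = lloyd_max(z)
codes = np.abs(z[:, None] - centers[None, :]).argmin(axis=1)  # 2-bit indices in 0..3
w_hat = R.T @ centers[codes]
print("relative MSE:", np.mean((w - w_hat) ** 2) / np.mean(w ** 2))
```

Because the rotation is orthogonal, the reconstruction error in weight space equals the scalar quantization error in the rotated space, which is what the MSE-optimal codebook minimizes.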
Allocator policy
Allocator-driven mixed precision is disabled for this artifact, so the release stays uniform apart from any protected-weight rules.
- Mixed-precision allocator: disabled
- Allocation method: uniform
- Allowed effective bit-widths: 2
- Allocation target: None
- Calibration text count: 0
- Max calibration windows: None
- Promoted tensors from allocator: 0
Paper context and claim boundaries
- Primary reference for the quantizer idea: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv:2504.19874, Apr 2025; Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni).
- This repo adapts TurboQuant-style vector quantization to an offline full-weight checkpoint PTQ flow.
- The cited TurboQuant paper's strongest empirical evidence is on vector / KV-cache quantization and nearest-neighbor search, so this artifact should be read as a paper-inspired adaptation, not as a full reproduction of the paper's published experimental setup.
- Related literature that may improve later iterations, but is not implemented in this artifact:
  - SpinQuant (arXiv:2405.16406) -- learned rotations for PTQ.
  - APTQ (arXiv:2402.14866) and BAQ (arXiv:2506.05664) -- sensitivity-based mixed precision / bit allocation.
  - GuidedQuant (arXiv:2505.07004), LeanQuant (arXiv:2407.10032), and D^2Quant (arXiv:2602.02546) -- stronger sub-4-bit weight-only PTQ strategies.
  - PolarQuant (arXiv:2603.29078) and ParoQuant (arXiv:2511.10645) -- newer rotation / preprocessing directions for low-bit compression.
Protected-weight policy
The protection rule can mark tensors as sensitive so they bypass the main quantization policy; in this run, no protected tensors were configured.
- Matching exact names: none configured.
- Matching name fragments: none configured.
- Effective behavior: protected weights, when configured, remain passthrough tensors instead of being quantized.
Protected tensors quantized in this artifact
- No protected tensors were quantized in this run.
Protected tensors kept as passthrough
- No protected tensors were left as passthrough due to the protection rules.
Mixed-precision tensors introduced by the protection rule
- No tensors were promoted above the primary release bit-width.
Storage accounting
| Quantity | Value |
|---|---|
| Original bytes | 1,746,882,752 |
| Stored bytes | 220,568,384 |
| Bytes saved | 1,526,314,368 |
| Compression ratio | 7.920x |
| Quantized tensor count | 265 |
| Passthrough tensor count | 223 |
| Quantized scalar values | 873,250,816 |
| Passthrough scalar values | 187,968 |
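The stored size is roughly what 2-bit packing predicts. A back-of-envelope check (assuming packed 2-bit codes and fp16 passthrough tensors, neither of which this card states explicitly; the remainder would be scales, codebooks, and container headers):

```python
# Back-of-envelope storage breakdown; the per-value dtypes are assumptions.
quantized_vals = 873_250_816
passthrough_vals = 187_968
stored_bytes = 220_568_384

packed_bytes = quantized_vals * 2 // 8    # 2 bits per quantized value
passthrough_bytes = passthrough_vals * 2  # assumed fp16, 2 bytes per value
overhead = stored_bytes - packed_bytes - passthrough_bytes

print(packed_bytes, passthrough_bytes, overhead)  # overhead is about 1.8 MiB
```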
Parameter coverage
The table below shows the largest tensors recorded in the manifest. This is helpful when auditing whether the storage budget is dominated by a few very large matrices or by a broad spread of medium-sized tensors.
| Tensor | Kind | Effective bit-width | Shape | Values |
|---|---|---|---|---|
| model.language_model.embed_tokens.weight | quantized | 2 | 248320x1024 | 254,279,680 |
| model.visual.merger.linear_fc1.weight | quantized | 2 | 3072x3072 | 9,437,184 |
| model.language_model.layers.0.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.1.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.10.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.12.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.13.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.14.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.16.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.17.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.18.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
| model.language_model.layers.2.linear_attn.in_proj_qkv.weight | quantized | 2 | 6144x1024 | 6,291,456 |
Evaluation protocol
The dataset below is used in this repo as a sanity / regression test set only. It is useful for catching catastrophic degradation and comparing close variants under a fixed quick pass, but it is not enough to support broad quality claims.
| Item | Value |
|---|---|
| Dataset | wikitext:wikitext-2-raw-v1 |
| Dataset repo | wikitext |
| Dataset config | wikitext-2-raw-v1 |
| Dataset role in this repo | sanity-test |
| Split | test |
| Max context length | 512 |
| Sliding-window stride | 256 |
| Max eval windows | full pass |
| Logit comparison prompts | 4 |
| Sample generation prompts | 3 |
| Generation mode | deterministic greedy decoding |
| Max generated tokens | 64 |
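The release's evaluation code is not reproduced here, but the stride-256 protocol above follows the standard sliding-window perplexity recipe: each window reuses up to max_len - stride tokens of context and scores only the new tokens. An illustrative sketch, with a toy uniform scorer standing in for a real model's NLL:

```python
import math

def sliding_window_avg_nll(token_ids, nll_fn, max_len=512, stride=256):
    # Mirrors the common sliding-window perplexity recipe: each window
    # re-scores only the tokens not already scored by the previous window.
    total_nll, total_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(token_ids), stride):
        end = min(begin + max_len, len(token_ids))
        n_new = end - prev_end                      # tokens scored in this window
        total_nll += nll_fn(token_ids[begin:end], n_new)
        total_scored += n_new
        prev_end = end
        if end == len(token_ids):
            break
    return total_nll / total_scored

# Toy scorer: a uniform model over a 50k vocab has NLL = ln(50000) per token.
uniform = lambda window, n_new: n_new * math.log(50_000)
avg = sliding_window_avg_nll(list(range(1000)), uniform)
print(math.exp(avg))  # perplexity of the uniform toy model, about 50000
```

With a real model, `nll_fn` would run a forward pass on the window and sum the NLL of only the last `n_new` tokens (e.g. by masking earlier labels).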
Metric comparison
| Metric | Base model | Quantized model | Notes |
|---|---|---|---|
| Average NLL | 2.906366 | 8.246674 | Lower is better |
| Perplexity | 18.2902 | 3814.9180 | Lower is better |
| Tokens scored | 295,893 | 295,893 | Sliding-window effective tokens |
| Windows scored | 1,160 | 1,160 | Reported when dataset evaluation is used |
| Mean last-token logit cosine similarity | N/A | 0.584556 | Higher is better |
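Perplexity is just the exponential of the average NLL, so the first two rows of the table are two views of the same measurement; a quick consistency check:

```python
import math

# Perplexity = exp(average NLL); check the table rows agree with each other.
base_ppl = math.exp(2.906366)    # about 18.2902
quant_ppl = math.exp(8.246674)   # about 3814.918
print(round(base_ppl, 4), round(quant_ppl, 2))
print(round(quant_ppl / base_ppl, 1))  # about 208.6, the ratio quoted above
```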
How to load the artifact
```python
from load_quantized import load_quantized_model, load_tokenizer

model, manifest = load_quantized_model(".", device="cuda")
tokenizer = load_tokenizer(".")

prompt = "Explain why quantization can reduce storage requirements."
rendered = (
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    if hasattr(tokenizer, "apply_chat_template")
    else prompt
)

batch = tokenizer(rendered, return_tensors="pt").to(model.device)
output = model.generate(**batch, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Sample generations
Inline sample generations are omitted for this release because the fixed prompt set produced severely degraded outputs under this quantization setting (perplexity ratio vs base: 208.6x, mean last-token logit cosine: 0.585). The deterministic outputs are still preserved in sample_generations.json; when available, the unsanitized decode is retained in the raw_response field for auditability.
Reproducibility configuration
The pipeline fields below are the key knobs that shaped this artifact.
```json
{
  "base_model_id": "Qwen/Qwen3.5-0.8B",
  "loader_kind": "image-text-to-text",
  "quantization_method": "turboquant_mse",
  "primary_bit_width": 2,
  "effective_quantized_bit_widths": [2],
  "seed": 42,
  "grid_size": 8193,
  "iterations": 96,
  "eval_dataset": "wikitext:wikitext-2-raw-v1",
  "eval_dataset_role": "sanity-test",
  "eval_stride": 256,
  "max_eval_length": 512,
  "enable_sensitivity_allocation": false,
  "allocator_name": "disabled",
  "allocation_method": "uniform",
  "allocation_allowed_bit_widths": [],
  "calibration_texts": [],
  "max_generation_tokens": 64,
  "protected_weight_exact_names": [],
  "protected_weight_name_fragments": []
}
```
Limitations and caveats
- This is not an official Qwen release.
- Evaluation here is deliberately lightweight and release-focused; it is not a substitute for a broad benchmark suite.
- The default Wikitext setup in this repo is a quick regression check, not a publication-grade benchmark.
- The packed artifact is reconstructed into floating-point tensors at load time, so this repo is mainly a reproducible storage/research artifact rather than a kernel-level optimized inference backend.
- Vision / multimodal behavior was not benchmarked separately in this run even if the upstream model supports those paths.
- Strong 1-bit weight-quality claims would be misleading for this code path today; treat 1-bit results as exploratory unless broader benchmarks say otherwise.
- If you are comparing against other low-bit methods, check that the protection rules, dataset slice, and evaluation window cap are aligned; otherwise apples and oranges will sneak onto the spreadsheet.
Provenance
- Base model: Qwen/Qwen3.5-0.8B
- Base license: apache-2.0
- Upload account: lew96123
- Artifact repo id: lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-2bit
- Manifest source of truth: quant_manifest.json
- Evaluation metrics source of truth: eval_summary.json
- Sample outputs source of truth: sample_generations.json