lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-4bit

This repository is a standalone quantized release of Qwen/Qwen3.5-0.8B published by lew96123. The previous model card was only a short summary; this one is intentionally detailed so the repo page itself explains what was stored, how it was produced, what was evaluated, and what trade-offs showed up.

Release snapshot

  • Base model: Qwen/Qwen3.5-0.8B
  • Loader kind and pipeline tag: image-text-to-text (the upstream model is multimodal, so both fields share this value)
  • Primary target bit-width: 4
  • Effective quantized bit-widths actually present in the checkpoint: 4-bit (265 tensors)
  • Quantization method: turboquant_mse
  • Quantized tensors: 265
  • Passthrough tensors: 223
  • Original storage: 1.63 GiB (1,746,882,752 bytes)
  • Stored artifact size: 418.55 MiB (438,881,088 bytes)
  • Bytes saved: 1.22 GiB (1,308,001,664 bytes)
  • Compression ratio: 3.980x
  • Base perplexity: 18.2902
  • Quantized perplexity: 31.4105
  • Perplexity delta vs base: +13.1203 (1.717x of base)
  • Mean last-token logit cosine similarity: 0.868845
  • Evaluation dataset role: sanity-test

Artifact inventory

  • turboquant_weights.safetensors -- packed checkpoint payload containing quantized weights (and any passthrough tensors stored directly).
  • quant_manifest.json -- full manifest with per-parameter metadata, storage prefixes, tensor shapes, and effective bit-width decisions.
  • eval_summary.json -- serialized evaluation metrics for the base model and the quantized artifact.
  • sample_generations.json -- deterministic sample generations produced from the prompts stored in the pipeline config.
  • load_quantized.py -- self-contained Python loader that reconstructs the quantized model from this repository alone.
  • requirements.txt -- minimal dependency hints for loading this artifact.
  • config.json, tokenizer files, generation config, and related support files -- upstream assets copied so the repo can be loaded without separately downloading them.
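The scalar-value and tensor counts reported throughout this card are derived from the manifest. The sketch below shows the kind of accounting involved; the field names in this toy manifest are illustrative only, since the real quant_manifest.json schema is defined by the pipeline and is not reproduced here.

```python
import json

# Illustrative only: the real quant_manifest.json may use different field
# names; this toy manifest just demonstrates the accounting idea.
manifest = {
    "tensors": [
        {"name": "model.layers.0.mlp.up_proj.weight", "kind": "quantized",
         "bits": 4, "shape": [64, 32]},
        {"name": "model.norm.weight", "kind": "passthrough",
         "bits": 16, "shape": [32]},
    ]
}

def summarize(manifest):
    """Count quantized vs. passthrough scalar values per tensor kind."""
    totals = {"quantized": 0, "passthrough": 0}
    for t in manifest["tensors"]:
        n = 1
        for dim in t["shape"]:
            n *= dim
        totals[t["kind"]] += n
    return totals

print(json.dumps(summarize(manifest)))  # {"quantized": 2048, "passthrough": 32}
```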

Quantization design

This release uses the method from "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (arXiv:2504.19874, Zandieh et al. 2025). Each weight row is treated as a d-dimensional vector and quantized with a random rotation followed by optimal scalar quantization of the rotated coordinates, whose induced marginal distribution is Beta, using a Lloyd-Max codebook.

  • Bit-width target for the main release: 4
  • Quantizer: turboquant_mse
  • Seed: 42
  • Grid size for Lloyd-Max codebook estimation: 8193
  • Lloyd-Max refinement iterations: 96
  • Quantized scalar values covered by the manifest: 873,250,816
  • Passthrough scalar values covered by the manifest: 187,968
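The rotate-then-quantize flow can be sketched as follows. This is a simplified toy, not the release implementation: the release estimates the Lloyd-Max codebook on a density grid (grid size 8193, 96 refinement iterations), while this sketch runs Lloyd iterations directly on the rotated samples and uses fewer iterations for speed.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed matches the release config

def random_rotation(d, rng):
    # Random orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(samples, bits, iterations):
    # 1-D Lloyd-Max: alternate nearest-center assignment and centroid update.
    k = 2 ** bits
    centers = np.quantile(samples, (np.arange(k) + 0.5) / k)  # warm start
    for _ in range(iterations):
        idx = np.abs(samples[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            cell = samples[idx == j]
            if cell.size:
                centers[j] = cell.mean()
    return np.sort(centers)

def quantize_weight(w, bits=4, iterations=20):
    # Rotate rows, snap every coordinate to the codebook, rotate back.
    rot = random_rotation(w.shape[1], rng)
    rotated = w @ rot
    codebook = lloyd_max(rotated.ravel(), bits, iterations)
    idx = np.abs(rotated.ravel()[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx].reshape(w.shape) @ rot.T, codebook

w = rng.standard_normal((8, 16))
w_hat, codebook = quantize_weight(w)
mse = float(np.mean((w - w_hat) ** 2))
```

Because the rotation is orthogonal, the reconstruction error introduced in the rotated basis is preserved when rotating back, so minimizing distortion on the rotated coordinates minimizes it on the original weights.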

Allocator policy

Allocator-driven mixed precision is disabled for this artifact, so the release stays uniform apart from any protected-weight rules.

  • Mixed-precision allocator: disabled
  • Allocation method: uniform
  • Allowed effective bit-widths: 4
  • Allocation target: None
  • Calibration text count: 0
  • Max calibration windows: None
  • Promoted tensors from allocator: 0

Paper context and claim boundaries

  • Primary reference for the quantizer idea: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv:2504.19874, Apr 2025; Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni).
  • This repo adapts TurboQuant-style vector quantization to an offline full-weight checkpoint PTQ flow.
  • The cited TurboQuant paper's strongest empirical evidence is on vector / KV-cache quantization and nearest-neighbor search, so this artifact should be read as a paper-inspired adaptation, not as a full reproduction of the paper's published experimental setup.
  • Related literature that may improve later iterations, but is not implemented in this artifact:
    • SpinQuant (arXiv:2405.16406) -- learned rotations for PTQ.
    • APTQ (arXiv:2402.14866) and BAQ (arXiv:2506.05664) -- sensitivity-based mixed precision / bit allocation.
    • GuidedQuant (arXiv:2505.07004), LeanQuant (arXiv:2407.10032), and D^2Quant (arXiv:2602.02546) -- stronger sub-4-bit weight-only PTQ strategies.
    • PolarQuant (arXiv:2603.29078) and ParoQuant (arXiv:2511.10645) -- newer rotation / preprocessing directions for low-bit compression.

Protected-weight policy

Tensors matching the rules below are treated as sensitive and kept as passthrough instead of being quantized. No rules were configured for this release, so every tensor follows the uniform policy.

  • Matching exact names: none configured.
  • Matching name fragments: none configured.
  • Effective behavior: protected weights remain passthrough tensors instead of being quantized.
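The protection check amounts to a simple name predicate. The helper name and signature below are illustrative, not the pipeline's actual API:

```python
def is_protected(name, exact_names=(), name_fragments=()):
    """A tensor stays passthrough if its name matches any protection rule."""
    return name in exact_names or any(f in name for f in name_fragments)

# This release configured empty rule lists, so nothing matches:
print(is_protected("model.language_model.embed_tokens.weight"))     # False
# A hypothetical fragment rule would protect matching tensors:
print(is_protected("lm_head.weight", name_fragments=("lm_head",)))  # True
```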

Protected tensors quantized in this artifact

  • No protected tensors were quantized in this run.

Protected tensors kept as passthrough

  • No protected tensors were left as passthrough due to the protection rules.

Mixed-precision tensors introduced by the protection rule

  • No tensors were promoted above the primary release bit-width.

Storage accounting

  • Original bytes: 1,746,882,752
  • Stored bytes: 438,881,088
  • Bytes saved: 1,308,001,664
  • Compression ratio: 3.980x
  • Quantized tensor count: 265
  • Passthrough tensor count: 223
  • Quantized scalar values: 873,250,816
  • Passthrough scalar values: 187,968
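The figures above are mutually consistent, as a quick check confirms:

```python
# Byte counts taken directly from the storage accounting above.
original_bytes = 1_746_882_752
stored_bytes = 438_881_088

bytes_saved = original_bytes - stored_bytes  # 1,308,001,664
ratio = original_bytes / stored_bytes        # just under 4x, since passthrough
                                             # tensors and metadata are stored
                                             # at full precision

print(f"saved {bytes_saved:,} bytes, {ratio:.3f}x compression")
```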

Parameter coverage

The table below shows the largest tensors recorded in the manifest. This is helpful when auditing whether the storage budget is dominated by a few very large matrices or by a broad spread of medium-sized tensors.

Tensor Kind Effective bit-width Shape Values
model.language_model.embed_tokens.weight quantized 4 248320x1024 254,279,680
model.visual.merger.linear_fc1.weight quantized 4 3072x3072 9,437,184
model.language_model.layers.0.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.1.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.10.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.12.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.13.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.14.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.16.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.17.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.18.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.2.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456

Evaluation protocol

The dataset below is used in this repo as a sanity / regression test set only. It is useful for catching catastrophic degradation and comparing close variants under a fixed quick pass, but it is not enough to support broad quality claims.

  • Dataset: wikitext:wikitext-2-raw-v1
  • Dataset repo: wikitext
  • Dataset config: wikitext-2-raw-v1
  • Dataset role in this repo: sanity-test
  • Split: test
  • Max context length: 512
  • Sliding-window stride: 256
  • Max eval windows: full pass
  • Logit comparison prompts: 4
  • Sample generation prompts: 3
  • Generation mode: deterministic greedy decoding
  • Max generated tokens: 64
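One plausible reading of the sliding-window setup (max context length 512, stride 256): overlapping windows re-feed earlier context, but each token is scored exactly once. The exact windowing recorded in eval_summary.json may differ in detail.

```python
def sliding_windows(n_tokens, max_len=512, stride=256):
    """Return (context_start, context_end, tokens_scored) triples covering
    n_tokens so that every token is scored exactly once."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = sliding_windows(1000)
print(ws)  # [(0, 512, 512), (256, 768, 256), (512, 1000, 232)]
```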

Metric comparison

  • Average NLL (lower is better): base 2.906366, quantized 3.447144
  • Perplexity (lower is better): base 18.2902, quantized 31.4105
  • Tokens scored: 295,893 for both models (sliding-window effective tokens)
  • Windows scored: 1,160 for both models (reported when dataset evaluation is used)
  • Mean last-token logit cosine similarity (higher is better): 0.868845 (quantized vs base)
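The perplexity rows are exp of the average-NLL rows, and the logit similarity metric is ordinary cosine similarity between last-token logit vectors. A quick check (the vectors passed to logit_cosine below are toy stand-ins, not the release's actual logits):

```python
import math
import numpy as np

# Perplexity = exp(average NLL), reproducing the table's values:
base_ppl = math.exp(2.906366)   # ~ 18.2902
quant_ppl = math.exp(3.447144)  # ~ 31.4105

def logit_cosine(a, b):
    """Cosine similarity between two last-token logit vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(base_ppl, 4), round(quant_ppl, 4))
```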

How to load the artifact

from load_quantized import load_quantized_model, load_tokenizer

# Reconstructs the model from the packed safetensors payload and the manifest.
model, manifest = load_quantized_model(".", device="cuda")  # use device="cpu" if no GPU
tokenizer = load_tokenizer(".")

prompt = "Explain why quantization can reduce storage requirements."
if hasattr(tokenizer, "apply_chat_template"):
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    rendered = prompt  # fall back to the raw prompt for plain tokenizers

batch = tokenizer(rendered, return_tensors="pt").to(model.device)
output = model.generate(**batch, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Sample generations

Sample 1

Prompt

Give a short explanation of why group-wise quantization can outperform naive tensor-wise quantization.

Response

Group-wise quantization outperforms naive tensor-wise quantization primarily by **reducing the number of required bits** and **minimizing the computational cost** per sample.

Here is the step-by-step reasoning:

1.  **Bit Efficiency**: Naive tensor-wise quantization requires quantizing the entire

Sample 2

Prompt

Write three bullet points describing the tradeoff between 8-bit and 1-bit quantization.

Response

*   **8-bit quantization** offers higher precision and better accuracy for tasks requiring high precision, such as scientific computing or medical imaging. However, it is computationally expensive and consumes more memory, making it less suitable for real-time applications or resource-constrained environments.
*   **1-bit quantization** is

Sample 3

Prompt

Explain what information should appear in a public model card for a custom quantized checkpoint.

Response

A **public model card** (also known as a public model) is a specific type of checkpoint used in the `torch` library for quantization. It is designed to be shared with other models so that they can be trained together.

When you create a public model card, you are essentially creating a "copy

Reproducibility configuration

The pipeline fields below are the key knobs that shaped this artifact.

{
  "base_model_id": "Qwen/Qwen3.5-0.8B",
  "loader_kind": "image-text-to-text",
  "quantization_method": "turboquant_mse",
  "primary_bit_width": 4,
  "effective_quantized_bit_widths": [
    4
  ],
  "seed": 42,
  "grid_size": 8193,
  "iterations": 96,
  "eval_dataset": "wikitext:wikitext-2-raw-v1",
  "eval_dataset_role": "sanity-test",
  "eval_stride": 256,
  "max_eval_length": 512,
  "enable_sensitivity_allocation": false,
  "allocator_name": "disabled",
  "allocation_method": "uniform",
  "allocation_allowed_bit_widths": [],
  "calibration_texts": [],
  "max_generation_tokens": 64,
  "protected_weight_exact_names": [],
  "protected_weight_name_fragments": []
}

Limitations and caveats

  • This is not an official Qwen release.
  • Evaluation here is deliberately lightweight and release-focused; it is not a substitute for a broad benchmark suite.
  • The default Wikitext setup in this repo is a quick regression check, not a publication-grade benchmark.
  • The packed artifact is reconstructed into floating-point tensors at load time, so this repo is mainly a reproducible storage/research artifact rather than a kernel-level optimized inference backend.
  • Vision / multimodal behavior was not benchmarked separately in this run even if the upstream model supports those paths.
  • Strong 1-bit weight-quality claims would be misleading for this code path today; treat 1-bit results as exploratory unless broader benchmarks say otherwise.
  • If you are comparing against other low-bit methods, check that the protection rules, dataset slice, and evaluation window cap are aligned; otherwise apples and oranges will sneak onto the spreadsheet.

Provenance

  • Base model: Qwen/Qwen3.5-0.8B
  • Base license: apache-2.0
  • Upload account: lew96123
  • Artifact repo id: lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-4bit
  • Manifest source of truth: quant_manifest.json
  • Evaluation metrics source of truth: eval_summary.json
  • Sample outputs source of truth: sample_generations.json