lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-4bit

This repository is a standalone quantized release of Qwen/Qwen3.5-0.8B published by lew96123. The previous model card was only a short summary; this one is intentionally detailed so the repo page itself explains what was stored, how it was produced, what was evaluated, and what trade-offs showed up.

Release snapshot

  • Base model: Qwen/Qwen3.5-0.8B
  • Loader kind and pipeline tag: image-text-to-text (the upstream model is multimodal, so both fields share this value)
  • Primary target bit-width: 4
  • Effective quantized bit-widths actually present in the checkpoint: 4-bit (265 tensors)
  • Quantization method: turboquant_mse
  • Quantized tensors: 265
  • Passthrough tensors: 223
  • Original storage: 1.63 GiB (1,746,882,752 bytes)
  • Stored artifact size: 418.55 MiB (438,881,088 bytes)
  • Bytes saved: 1.22 GiB (1,308,001,664 bytes)
  • Compression ratio: 3.980x
  • Base perplexity: 18.2902
  • Quantized perplexity: 31.4105
  • Perplexity delta vs base: +13.1203 (1.717x of base)
  • Mean last-token logit cosine similarity: 0.868845
  • Evaluation dataset role: sanity-test

Artifact inventory

  • turboquant_weights.safetensors -- packed checkpoint payload containing quantized weights (and any passthrough tensors stored directly).
  • quant_manifest.json -- full manifest with per-parameter metadata, storage prefixes, tensor shapes, and effective bit-width decisions.
  • eval_summary.json -- serialized evaluation metrics for the base model and the quantized artifact.
  • sample_generations.json -- deterministic sample generations produced from the prompts stored in the pipeline config.
  • load_quantized.py -- self-contained Python loader that reconstructs the quantized model from this repository alone.
  • requirements.txt -- minimal dependency hints for loading this artifact.
  • config.json, tokenizer files, generation config, and related support files -- upstream assets copied so the repo can be loaded without separately downloading them.
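The scalar-value and tensor counts reported throughout this card are derived from the manifest. The sketch below shows the kind of accounting involved; the field names in this toy manifest are illustrative only, since the real quant_manifest.json schema is defined by the pipeline and is not reproduced here.

```python
import json

# Illustrative only: the real quant_manifest.json may use different field
# names; this toy manifest just demonstrates the accounting idea.
manifest = {
    "tensors": [
        {"name": "model.layers.0.mlp.up_proj.weight", "kind": "quantized",
         "bits": 4, "shape": [64, 32]},
        {"name": "model.norm.weight", "kind": "passthrough",
         "bits": 16, "shape": [32]},
    ]
}

def summarize(manifest):
    """Count quantized vs. passthrough scalar values per tensor kind."""
    totals = {"quantized": 0, "passthrough": 0}
    for t in manifest["tensors"]:
        n = 1
        for dim in t["shape"]:
            n *= dim
        totals[t["kind"]] += n
    return totals

print(json.dumps(summarize(manifest)))  # {"quantized": 2048, "passthrough": 32}
```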

Quantization design

This release uses the method from "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (arXiv:2504.19874, Zandieh et al. 2025). Each weight row is treated as a d-dimensional vector and quantized with a random rotation followed by optimal scalar quantization of the rotated coordinates, whose induced marginal distribution is Beta, using a Lloyd-Max codebook.

  • Bit-width target for the main release: 4
  • Quantizer: turboquant_mse
  • Seed: 42
  • Grid size for Lloyd-Max codebook estimation: 8193
  • Lloyd-Max refinement iterations: 96
  • Quantized scalar values covered by the manifest: 873,250,816
  • Passthrough scalar values covered by the manifest: 187,968
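The rotate-then-quantize flow can be sketched as follows. This is a simplified toy, not the release implementation: the release estimates the Lloyd-Max codebook on a density grid (grid size 8193, 96 refinement iterations), while this sketch runs Lloyd iterations directly on the rotated samples and uses fewer iterations for speed.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed matches the release config

def random_rotation(d, rng):
    # Random orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(samples, bits, iterations):
    # 1-D Lloyd-Max: alternate nearest-center assignment and centroid update.
    k = 2 ** bits
    centers = np.quantile(samples, (np.arange(k) + 0.5) / k)  # warm start
    for _ in range(iterations):
        idx = np.abs(samples[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            cell = samples[idx == j]
            if cell.size:
                centers[j] = cell.mean()
    return np.sort(centers)

def quantize_weight(w, bits=4, iterations=20):
    # Rotate rows, snap every coordinate to the codebook, rotate back.
    rot = random_rotation(w.shape[1], rng)
    rotated = w @ rot
    codebook = lloyd_max(rotated.ravel(), bits, iterations)
    idx = np.abs(rotated.ravel()[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx].reshape(w.shape) @ rot.T, codebook

w = rng.standard_normal((8, 16))
w_hat, codebook = quantize_weight(w)
mse = float(np.mean((w - w_hat) ** 2))
```

Because the rotation is orthogonal, the reconstruction error introduced in the rotated basis is preserved when rotating back, so minimizing distortion on the rotated coordinates minimizes it on the original weights.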

Allocator policy

Allocator-driven mixed precision is disabled for this artifact, so the release stays uniform apart from any protected-weight rules.

  • Mixed-precision allocator: disabled
  • Allocation method: uniform
  • Allowed effective bit-widths: 4
  • Allocation target: None
  • Calibration text count: 0
  • Max calibration windows: None
  • Promoted tensors from allocator: 0

Paper context and claim boundaries

  • Primary reference for the quantizer idea: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv:2504.19874, Apr 2025; Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni).
  • This repo adapts TurboQuant-style vector quantization to an offline full-weight checkpoint PTQ flow.
  • The cited TurboQuant paper's strongest empirical evidence is on vector / KV-cache quantization and nearest-neighbor search, so this artifact should be read as a paper-inspired adaptation, not as a full reproduction of the paper's published experimental setup.
  • Related literature that may improve later iterations, but is not implemented in this artifact:
    • SpinQuant (arXiv:2405.16406) -- learned rotations for PTQ.
    • APTQ (arXiv:2402.14866) and BAQ (arXiv:2506.05664) -- sensitivity-based mixed precision / bit allocation.
    • GuidedQuant (arXiv:2505.07004), LeanQuant (arXiv:2407.10032), and D^2Quant (arXiv:2602.02546) -- stronger sub-4-bit weight-only PTQ strategies.
    • PolarQuant (arXiv:2603.29078) and ParoQuant (arXiv:2511.10645) -- newer rotation / preprocessing directions for low-bit compression.

Protected-weight policy

Tensors matching the rules below are treated as sensitive and kept as passthrough instead of being quantized. No rules were configured for this release, so every tensor follows the uniform policy.

  • Matching exact names: none configured.
  • Matching name fragments: none configured.
  • Effective behavior: protected weights remain passthrough tensors instead of being quantized.
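The protection check amounts to a simple name predicate. The helper name and signature below are illustrative, not the pipeline's actual API:

```python
def is_protected(name, exact_names=(), name_fragments=()):
    """A tensor stays passthrough if its name matches any protection rule."""
    return name in exact_names or any(f in name for f in name_fragments)

# This release configured empty rule lists, so nothing matches:
print(is_protected("model.language_model.embed_tokens.weight"))     # False
# A hypothetical fragment rule would protect matching tensors:
print(is_protected("lm_head.weight", name_fragments=("lm_head",)))  # True
```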

Protected tensors quantized in this artifact

  • No protected tensors were quantized in this run.

Protected tensors kept as passthrough

  • No protected tensors were left as passthrough due to the protection rules.

Mixed-precision tensors introduced by the protection rule

  • No tensors were promoted above the primary release bit-width.

Storage accounting

  • Original bytes: 1,746,882,752
  • Stored bytes: 438,881,088
  • Bytes saved: 1,308,001,664
  • Compression ratio: 3.980x
  • Quantized tensor count: 265
  • Passthrough tensor count: 223
  • Quantized scalar values: 873,250,816
  • Passthrough scalar values: 187,968
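The figures above are mutually consistent, as a quick check confirms:

```python
# Byte counts taken directly from the storage accounting above.
original_bytes = 1_746_882_752
stored_bytes = 438_881_088

bytes_saved = original_bytes - stored_bytes  # 1,308,001,664
ratio = original_bytes / stored_bytes        # just under 4x, since passthrough
                                             # tensors and metadata are stored
                                             # at full precision

print(f"saved {bytes_saved:,} bytes, {ratio:.3f}x compression")
```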

Parameter coverage

The table below shows the largest tensors recorded in the manifest. This is helpful when auditing whether the storage budget is dominated by a few very large matrices or by a broad spread of medium-sized tensors.

Tensor Kind Effective bit-width Shape Values
model.language_model.embed_tokens.weight quantized 4 248320x1024 254,279,680
model.visual.merger.linear_fc1.weight quantized 4 3072x3072 9,437,184
model.language_model.layers.0.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.1.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.10.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.12.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.13.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.14.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.16.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.17.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.18.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456
model.language_model.layers.2.linear_attn.in_proj_qkv.weight quantized 4 6144x1024 6,291,456

Evaluation protocol

The dataset below is used in this repo as a sanity / regression test set only. It is useful for catching catastrophic degradation and comparing close variants under a fixed quick pass, but it is not enough to support broad quality claims.

  • Dataset: wikitext:wikitext-2-raw-v1
  • Dataset repo: wikitext
  • Dataset config: wikitext-2-raw-v1
  • Dataset role in this repo: sanity-test
  • Split: test
  • Max context length: 512
  • Sliding-window stride: 256
  • Max eval windows: full pass
  • Logit comparison prompts: 4
  • Sample generation prompts: 3
  • Generation mode: deterministic greedy decoding
  • Max generated tokens: 64
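One plausible reading of the sliding-window setup (max context length 512, stride 256): overlapping windows re-feed earlier context, but each token is scored exactly once. The exact windowing recorded in eval_summary.json may differ in detail.

```python
def sliding_windows(n_tokens, max_len=512, stride=256):
    """Return (context_start, context_end, tokens_scored) triples covering
    n_tokens so that every token is scored exactly once."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = sliding_windows(1000)
print(ws)  # [(0, 512, 512), (256, 768, 256), (512, 1000, 232)]
```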

Metric comparison

  • Average NLL (lower is better): base 2.906366, quantized 3.447144
  • Perplexity (lower is better): base 18.2902, quantized 31.4105
  • Tokens scored: 295,893 for both models (sliding-window effective tokens)
  • Windows scored: 1,160 for both models (reported when dataset evaluation is used)
  • Mean last-token logit cosine similarity (higher is better): 0.868845 (quantized vs base)
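The perplexity rows are exp of the average-NLL rows, and the logit similarity metric is ordinary cosine similarity between last-token logit vectors. A quick check (the vectors passed to logit_cosine below are toy stand-ins, not the release's actual logits):

```python
import math
import numpy as np

# Perplexity = exp(average NLL), reproducing the table's values:
base_ppl = math.exp(2.906366)   # ~ 18.2902
quant_ppl = math.exp(3.447144)  # ~ 31.4105

def logit_cosine(a, b):
    """Cosine similarity between two last-token logit vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(base_ppl, 4), round(quant_ppl, 4))
```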

How to load the artifact

from load_quantized import load_quantized_model, load_tokenizer

# Reconstructs the model from the packed safetensors payload and the manifest.
model, manifest = load_quantized_model(".", device="cuda")  # use device="cpu" if no GPU
tokenizer = load_tokenizer(".")

prompt = "Explain why quantization can reduce storage requirements."
if hasattr(tokenizer, "apply_chat_template"):
    rendered = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    rendered = prompt  # fall back to the raw prompt for plain tokenizers

batch = tokenizer(rendered, return_tensors="pt").to(model.device)
output = model.generate(**batch, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Sample generations

Sample 1

Prompt

Give a short explanation of why group-wise quantization can outperform naive tensor-wise quantization.

Response

Group-wise quantization outperforms naive tensor-wise quantization primarily by **reducing the number of required bits** and **minimizing the computational cost** per sample.

Here is the step-by-step reasoning:

1.  **Bit Efficiency**: Naive tensor-wise quantization requires quantizing the entire

Sample 2

Prompt

Write three bullet points describing the tradeoff between 8-bit and 1-bit quantization.

Response

*   **8-bit quantization** offers higher precision and better accuracy for tasks requiring high precision, such as scientific computing or medical imaging. However, it is computationally expensive and consumes more memory, making it less suitable for real-time applications or resource-constrained environments.
*   **1-bit quantization** is

Sample 3

Prompt

Explain what information should appear in a public model card for a custom quantized checkpoint.

Response

A **public model card** (also known as a public model) is a specific type of checkpoint used in the `torch` library for quantization. It is designed to be shared with other models so that they can be trained together.

When you create a public model card, you are essentially creating a "copy

Reproducibility configuration

The pipeline fields below are the key knobs that shaped this artifact.

{
  "base_model_id": "Qwen/Qwen3.5-0.8B",
  "loader_kind": "image-text-to-text",
  "quantization_method": "turboquant_mse",
  "primary_bit_width": 4,
  "effective_quantized_bit_widths": [
    4
  ],
  "seed": 42,
  "grid_size": 8193,
  "iterations": 96,
  "eval_dataset": "wikitext:wikitext-2-raw-v1",
  "eval_dataset_role": "sanity-test",
  "eval_stride": 256,
  "max_eval_length": 512,
  "enable_sensitivity_allocation": false,
  "allocator_name": "disabled",
  "allocation_method": "uniform",
  "allocation_allowed_bit_widths": [],
  "calibration_texts": [],
  "max_generation_tokens": 64,
  "protected_weight_exact_names": [],
  "protected_weight_name_fragments": []
}

Limitations and caveats

  • This is not an official Qwen release.
  • Evaluation here is deliberately lightweight and release-focused; it is not a substitute for a broad benchmark suite.
  • The default Wikitext setup in this repo is a quick regression check, not a publication-grade benchmark.
  • The packed artifact is reconstructed into floating-point tensors at load time, so this repo is mainly a reproducible storage/research artifact rather than a kernel-level optimized inference backend.
  • Vision / multimodal behavior was not benchmarked separately in this run even if the upstream model supports those paths.
  • Strong 1-bit weight-quality claims would be misleading for this code path today; treat 1-bit results as exploratory unless broader benchmarks say otherwise.
  • If you are comparing against other low-bit methods, check that the protection rules, dataset slice, and evaluation window cap are aligned; otherwise apples and oranges will sneak onto the spreadsheet.

Provenance

  • Base model: Qwen/Qwen3.5-0.8B
  • Base license: apache-2.0
  • Upload account: lew96123
  • Artifact repo id: lew96123/qwen3.5-0.8b-custom-packed-turboquant_mse-true-uniform-4bit
  • Manifest source of truth: quant_manifest.json
  • Evaluation metrics source of truth: eval_summary.json
  • Sample outputs source of truth: sample_generations.json