# Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled EOQ Q5 (Compressed)
This is an EOQ (Entropy-Optimized Quantization) Q5 compressed version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled.
## What is EOQ?
EOQ (Entropy-Optimized Quantization) is a novel compression method that exploits the entropy gap between theoretical information content and the fixed bit-widths used by traditional quantization. Instead of rounding weights to a uniform grid, EOQ applies entropy coding on top of quantized weights, achieving better compression ratios while preserving model quality. The key insight is that quantized weight distributions are far from uniform -- their Shannon entropy is significantly lower than the allocated bits -- leaving substantial room for lossless compression via arithmetic/ANS coding.
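The entropy gap described above is easy to demonstrate. The sketch below (illustrative only, not the actual EOQ pipeline) quantizes a bell-shaped weight distribution to a 5-bit grid and compares the empirical Shannon entropy of the resulting codes to the 5 bits allocated per weight:

```python
import numpy as np

# Quantize Gaussian-like weights to a 5-bit grid, then measure the empirical
# Shannon entropy of the code distribution (illustrative sketch).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=100_000)

bits = 5
levels = 2 ** bits                       # 32 quantization levels
scale = np.abs(weights).max() / (levels // 2 - 1)
codes = np.clip(np.round(weights / scale), -(levels // 2), levels // 2 - 1)

# Empirical entropy of the codes in bits per weight.
_, counts = np.unique(codes, return_counts=True)
p = counts / counts.sum()
entropy = -(p * np.log2(p)).sum()

print(f"allocated: {bits} bits/weight, empirical entropy: {entropy:.2f} bits/weight")
```

Because a bell-shaped weight distribution concentrates mass near zero, the measured entropy comes out well below 5 bits per weight; that gap is exactly what lossless arithmetic/ANS coding can reclaim.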
## Verified Benchmark Results
All benchmarks measured on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) and independently verified on Google Colab Pro G4 (RTX PRO 6000 Blackwell).
| Metric | FP16 (Baseline) | EOQ Q5 Compressed |
|---|---|---|
| Model size | 53.8 GB | 27.3 GB |
| Compression ratio | 1.0x | 2.0x |
| Perplexity (WikiText-2) | 5.65 | 5.64 |
| PPL delta | -- | -0.01 (essentially identical) |
| Throughput (tok/s) | 6.2 | 6.2 (no degradation) |
- Bits: 5
- Block size: 128
- A PPL delta of -0.01 means the compressed model is statistically indistinguishable from the FP16 original -- well within measurement noise.
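To make the quantization parameters above concrete, here is a minimal sketch of block-wise 5-bit quantization with block size 128. This uses a generic symmetric absmax scheme for illustration; EOQ's exact quantizer may differ:

```python
import numpy as np

# Block-wise 5-bit absmax quantization: each 128-weight block gets its own
# scale, limiting how far outliers in one block distort the others.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096,))           # one flattened weight row

bits, block = 5, 128
qmax = 2 ** (bits - 1) - 1                        # symmetric 5-bit range
blocks = w.reshape(-1, block)
scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)

# Dequantize and check reconstruction error relative to weight magnitude.
w_hat = (q * scales).reshape(-1)
rel_err = np.abs(w_hat - w).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.4f}")
```

Per-block scales are what make the 5-bit grid usable at this quality level: a single global scale would be dominated by the largest outlier in the whole tensor.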
## Usage

### Method 1: Using `eoq_loader.py` (recommended)
```python
from eoq_loader import load_eoq_model

model, tokenizer = load_eoq_model(
    "caiovicentino1/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-EOQ-Q5-compressed"
)

prompt = "Explain entropy coding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Method 2: Manual loading
```python
from huggingface_hub import hf_hub_download
import safetensors.torch

# Download the compressed weights from the Hub
path = hf_hub_download(
    "caiovicentino1/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-EOQ-Q5-compressed",
    "model_compressed.safetensors",
)

# Load the compressed state dict (see eoq_loader.py for the full
# decompression logic)
state_dict = safetensors.torch.load_file(path)
```
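The "entropy coding on top of quantized weights" step can be illustrated with a lossless round-trip. In this sketch, zlib's Huffman coder stands in for EOQ's arithmetic/ANS coder (the real decompression logic lives in `eoq_loader.py`); the point is that low-entropy 5-bit codes compress losslessly below their stored size:

```python
import zlib
import numpy as np

# Bell-shaped 5-bit codes, stored one per byte for simplicity.
rng = np.random.default_rng(0)
codes = np.clip(np.round(rng.normal(0, 3.0, size=1_000_000)), -16, 15).astype(np.int8)

# Lossless round-trip: compress, decompress, verify bit-exact recovery.
compressed = zlib.compress(codes.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8)

assert np.array_equal(codes, restored)            # lossless round-trip
print(f"{codes.nbytes} bytes -> {len(compressed)} bytes")
```

Because the coder is lossless, the dequantized weights (and hence perplexity) are identical to the uncompressed Q5 model; only the on-disk size changes.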
## Model tree

- Base model: Qwen/Qwen3.5-27B
- This model: caiovicentino1/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-EOQ-Q5-compressed