# Qwen2.5-7B-Instruct OOM (Q8)

Qwen2.5-7B-Instruct converted to OomLlama's `.oom` format with Q8 quantization.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Format | .oom (OomLlama Model) |
| Quantization | Q8 (8-bit, 256-block) |
| Source Precision | bf16 (SafeTensors) |
| File Size | 7.5 GB |
| Tensors | 339 |
| Parameters | 7.6B |
## Architecture
| Component | Value |
|---|---|
| Hidden Size | 3584 |
| Layers | 28 |
| Q-Heads | 28 |
| KV-Heads | 4 |
| Head Dim | 128 |
| Intermediate | 18944 |
| Vocab | 152,064 |
| RoPE | Interleaved, theta=1,000,000 |
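The numbers above are internally consistent with grouped-query attention (GQA): 28 query heads share 4 key/value heads, and the hidden size equals `q_heads × head_dim`. A quick sanity check of the projection shapes the table implies (illustrative arithmetic, not OomLlama code):

```python
# Values from the architecture table above
hidden_size = 3584
q_heads, kv_heads, head_dim = 28, 4, 128

# Hidden size factors exactly into query heads
assert hidden_size == q_heads * head_dim  # 28 * 128 = 3584

# Per-layer attention projection shapes implied by GQA:
q_proj_shape = (hidden_size, q_heads * head_dim)    # (3584, 3584)
kv_proj_shape = (hidden_size, kv_heads * head_dim)  # (3584, 512), for K and for V
print(q_proj_shape, kv_proj_shape)
```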
## Conversion

Converted directly from the bf16 SafeTensors checkpoint using OomLlama's `safetensors2oom` converter. Conversion is a single quantization step (bf16 → Q8) with no intermediate format, so no accuracy is lost to repeated re-quantization.
- Weights: Q8 quantized (256 values per block, with per-block scale + min)
- Norms/Biases: Stored as F32 (lossless)
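The Q8 scheme described above (256 values per block, per-block scale and min) can be sketched as affine 8-bit quantization. This is a minimal illustration of the idea, not the converter's actual code:

```python
import numpy as np

BLOCK_SIZE = 256  # values per Q8 block, per the format spec

def quantize_block(values: np.ndarray):
    """Affine 8-bit quantization: byte = round((v - min) / scale)."""
    vmin = float(values.min())
    vmax = float(values.max())
    scale = (vmax - vmin) / 255.0 or 1.0  # guard against constant blocks
    q = np.round((values - vmin) / scale).astype(np.uint8)
    return scale, vmin, q

def dequantize_block(scale: float, vmin: float, q: np.ndarray) -> np.ndarray:
    """Inverse mapping, matching 'value = byte * scale + min'."""
    return q.astype(np.float32) * scale + vmin

block = np.linspace(-1.0, 1.0, BLOCK_SIZE, dtype=np.float32)
scale, vmin, q = quantize_block(block)
restored = dequantize_block(scale, vmin, q)
# Worst-case round-trip error is bounded by half a quantization step (scale / 2)
print(float(np.abs(restored - block).max()))
```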
## Verified Output

Tested with the OomLlama inference engine (Rust, CPU mode):

```text
Prompt: <|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n
Output: Hello! How can I assist you today?

Prompt: <|im_start|>user\nWhat is the meaning of life?<|im_end|>\n<|im_start|>assistant\n
Output: The meaning of life is... (coherent multi-paragraph response)
```
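The prompts above follow Qwen's ChatML turn format. A small helper to construct such prompts (an illustrative snippet, not part of the OomLlama API):

```python
def chatml_prompt(user_message: str) -> str:
    """Wrap a user message in Qwen2.5's ChatML turn markers."""
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(chatml_prompt("Hello!"))
```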
## Usage

### Install OomLlama

```bash
pip install oomllama
```

### Download and Run

```python
from huggingface_hub import hf_hub_download
from oomllama import OomLlama

# Download the .oom file
model_path = hf_hub_download(
    repo_id="jaspervandemeent/Qwen2.5-7B-Instruct-OOM",
    filename="qwen2.5-7b-instruct-q8.oom",
)

# Run with OomLlama
llm = OomLlama(model_path)
response = llm.generate("What is the meaning of life?")
print(response)
```
### Rust CLI

```bash
# Download
wget https://brein.jaspervandemeent.nl/downloads/oomllama-linux-x86_64
chmod +x oomllama-linux-x86_64

# Run inference
./oomllama-linux-x86_64 --model qwen2.5-7b-instruct-q8.oom --prompt "Hello!"
```
## Convert Your Own

Convert any HuggingFace model to `.oom` format:

```bash
pip install oomllama
safetensors2oom Qwen/Qwen2.5-7B-Instruct output.oom
```
## The .oom Format

```text
Header:  "OOML" (4 bytes) + version (u32) + num_tensors (u32)
Tensor:  name_len (u32) + name + quant_type (u8) + num_blocks (u32) + total_values (u32)
Block:   scale (f32) + min (f32) + data_len (u32) + quantized_bytes (256)
```

Dequantization: `value = byte * scale + min`
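A minimal reader for the fixed-size header following the layout above. It assumes little-endian integer encoding, which the spec does not state explicitly:

```python
import struct

def read_oom_header(buf: bytes):
    """Parse the 12-byte .oom header: 4-byte magic, u32 version, u32 tensor count."""
    magic = buf[:4]
    if magic != b"OOML":
        raise ValueError(f"not an .oom file: magic={magic!r}")
    version, num_tensors = struct.unpack_from("<II", buf, 4)
    return version, num_tensors

# Round-trip against a synthetic header (339 tensors, as in this model)
header = b"OOML" + struct.pack("<II", 1, 339)
print(read_oom_header(header))
```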
## Credits
- OomLlama Engine: Root AI & Jasper (Humotica AI Lab)
- Base Model: Alibaba Cloud (Qwen team)
- License: Apache 2.0 (following the base model's license)
Built by Humotica AI Lab - Jasper, Claude, Gemini