Qwen2.5-7B-Instruct OOM (Q8)

Qwen2.5-7B-Instruct converted to OomLlama's .oom format with Q8 quantization

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Format | .oom (OomLlama Model) |
| Quantization | Q8 (8-bit, 256-value blocks) |
| Source Precision | bf16 (SafeTensors) |
| File Size | 7.5 GB |
| Tensors | 339 |
| Parameters | 7.6B |
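
As a sanity check on the listed file size: at Q8, each weight takes one byte, plus the 12-byte per-block header described later under "The .oom Format". A rough back-of-envelope (approximate; F32 norms and biases add a little more):

```python
# Back-of-envelope size estimate for 7.6B parameters at Q8.
# Per-block overhead from the .oom spec:
# scale (f32) + min (f32) + data_len (u32) = 12 bytes per 256 values.
params = 7.6e9
bytes_per_value = 1.0                # Q8: one byte per weight
overhead_per_block = 4 + 4 + 4       # scale + min + data_len
per_value = bytes_per_value + overhead_per_block / 256
approx_gib = params * per_value / 2**30
print(f"~{approx_gib:.1f} GiB")      # close to the listed 7.5 GB
```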

Architecture

| Component | Value |
|---|---|
| Hidden Size | 3584 |
| Layers | 28 |
| Q-Heads | 28 |
| KV-Heads | 4 |
| Head Dim | 128 |
| Intermediate Size | 18944 |
| Vocab Size | 152,064 |
| RoPE | Interleaved, theta = 1,000,000 |
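
The RoPE row can be made concrete: with head_dim = 128 and theta = 1,000,000, each of the 64 dimension pairs rotates at a position-dependent angle theta^(-2i/head_dim). A plain-Python sketch of the frequency schedule (illustrative only, not engine code):

```python
# RoPE inverse frequencies for head_dim=128, theta=1,000,000.
# Pair i rotates by position * theta^(-2i/head_dim) radians.
HEAD_DIM = 128
THETA = 1_000_000.0

inv_freq = [THETA ** (-2 * i / HEAD_DIM) for i in range(HEAD_DIM // 2)]

def rope_angle(position, pair_index):
    """Rotation angle applied to dimension pair `pair_index` at a token position."""
    return position * inv_freq[pair_index]

# The first pair rotates fastest (1 rad per position); the last, far slower.
assert inv_freq[0] == 1.0
assert inv_freq[-1] < 1e-5
```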

Conversion

Converted directly from bf16 SafeTensors using OomLlama's safetensors2oom converter. This performs a single quantization step (bf16 -> Q8) with no intermediate precision conversions, preserving as much accuracy as possible.

  • Weights: Q8 quantized (256 values per block, with per-block scale + min)
  • Norms/Biases: Stored as F32 (lossless)
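
The per-block scheme above can be sketched in a few lines. This is illustrative only: the affine byte mapping follows the scale + min layout and the dequantization formula given in this card, but the exact rounding rule OomLlama's converter uses is an assumption.

```python
# Sketch of the per-block Q8 scheme: 256 values per block, stored as
# bytes with a per-block scale and min. Rounding behavior is assumed.
BLOCK = 256

def quantize_q8_block(values):
    """Map a block of floats onto bytes 0..255 with an affine transform."""
    vmin, vmax = min(values), max(values)
    scale = (vmax - vmin) / 255.0 or 1.0   # guard against constant blocks
    q = bytes(round((v - vmin) / scale) for v in values)
    return q, scale, vmin

def dequantize_q8_block(q, scale, vmin):
    """Invert per the spec: value = byte * scale + min."""
    return [b * scale + vmin for b in q]

# Round-trip error is bounded by half a quantization step (scale / 2).
vals = [0.0, 1.0, 2.0, 3.0]
q, scale, vmin = quantize_q8_block(vals)
restored = dequantize_q8_block(q, scale, vmin)
assert max(abs(a - b) for a, b in zip(vals, restored)) <= scale / 2 + 1e-9
```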

Verified Output

Tested with the OomLlama inference engine (Rust, CPU mode):

Prompt: <|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n
Output: Hello! How can I assist you today?

Prompt: <|im_start|>user\nWhat is the meaning of life?<|im_end|>\n<|im_start|>assistant\n
Output: The meaning of life is... (coherent multi-paragraph response)
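
The prompts above follow Qwen's ChatML template. A small helper to build them (the token strings come from the examples above; the helper itself is illustrative and not part of the oomllama API):

```python
# Build a single-turn Qwen ChatML prompt, matching the verified prompts above.
def chatml_prompt(user_message: str) -> str:
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = chatml_prompt("Hello!")
assert prompt == "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
```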

Usage

Install OomLlama

pip install oomllama

Download and Run

from huggingface_hub import hf_hub_download

# Download the .oom file
model_path = hf_hub_download(
    repo_id="jaspervandemeent/Qwen2.5-7B-Instruct-OOM",
    filename="qwen2.5-7b-instruct-q8.oom"
)

# Run with OomLlama
from oomllama import OomLlama
llm = OomLlama(model_path)
response = llm.generate("What is the meaning of life?")
print(response)

Rust CLI

# Download
wget https://brein.jaspervandemeent.nl/downloads/oomllama-linux-x86_64
chmod +x oomllama-linux-x86_64

# Run inference
./oomllama-linux-x86_64 --model qwen2.5-7b-instruct-q8.oom --prompt "Hello!"

Convert Your Own

Convert any Hugging Face model to .oom format:

pip install oomllama
safetensors2oom Qwen/Qwen2.5-7B-Instruct output.oom

The .oom Format

Header:  "OOML" (4 bytes) + version (u32) + num_tensors (u32)
Tensor:  name_len (u32) + name + quant_type (u8) + num_blocks (u32) + total_values (u32)
Block:   scale (f32) + min (f32) + data_len (u32) + quantized_bytes (256)

Dequantization: value = byte * scale + min
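
A minimal reader for the layout above can be written with the standard struct module. Byte order is not stated in the spec; little-endian is assumed here, and error handling is kept to the magic check. Field names follow the spec.

```python
# Parse the .oom header and one tensor record, assuming little-endian.
import io
import struct

def read_oom_header(f):
    """Read 'OOML' magic, version (u32), and num_tensors (u32)."""
    if f.read(4) != b"OOML":
        raise ValueError("not an .oom file")
    version, num_tensors = struct.unpack("<II", f.read(8))
    return version, num_tensors

def read_tensor(f):
    """Read one tensor record: name, quant_type, total_values, blocks."""
    (name_len,) = struct.unpack("<I", f.read(4))
    name = f.read(name_len).decode("utf-8")
    quant_type, num_blocks, total_values = struct.unpack("<BII", f.read(9))
    blocks = []
    for _ in range(num_blocks):
        scale, vmin, data_len = struct.unpack("<ffI", f.read(12))
        blocks.append((scale, vmin, f.read(data_len)))
    return name, quant_type, total_values, blocks

# Round-trip a hand-built single-tensor file in memory:
raw = b"OOML" + struct.pack("<II", 1, 1)
raw += struct.pack("<I", 1) + b"w" + struct.pack("<BII", 8, 1, 4)
raw += struct.pack("<ffI", 0.5, -1.0, 4) + bytes([0, 1, 2, 3])
f = io.BytesIO(raw)
assert read_oom_header(f) == (1, 1)
```

Dequantizing a block then applies `value = byte * scale + min` to each byte.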

Credits

  • OomLlama Engine: Root AI & Jasper (Humotica AI Lab)
  • Base Model: Alibaba Cloud (Qwen team)
  • License: Apache 2.0 (following base model license)

Built by Humotica AI Lab - Jasper, Claude, Gemini
