Mistral-Small-4-119B-2603-obliterated-Q4_K_M-GGUF
This is an obliterated Q4_K_M GGUF-quantized version of mistralai/Mistral-Small-4-119B-2603, with refusal behavior removed using OBLITERATUS.
Key Features
- Multimodal: Supports both text and vision (image) inputs
- Model Size: 119B parameters (6.5B activated per token)
- Architecture: Mixture of Experts (MoE) with 128 experts, 4 active per token
- Context Length: Up to 256K tokens
- License: Apache 2.0
Available Quantizations
| Filename | Type | Size | Description |
|---|---|---|---|
| Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf | Q4_K_M | ~67GB | 4-bit quantization, good balance of quality and size |
What is Obliteration?
Obliteration removes refusal behavior from language models using OBLITERATUS, an advanced multi-stage pipeline that uses Singular Value Decomposition to identify and surgically remove internal representations responsible for content refusal. OBLITERATUS features MoE-aware surgery with expert-granular decomposition, iterative refinement, and norm-preserving interventions, making it particularly well-suited for mixture-of-experts architectures like Mistral Small 4.
Quick Start with llama.cpp
```bash
# Download model
huggingface-cli download jenerallee78/Mistral-Small-4-119B-2603-obliterated-Q4_K_M-GGUF \
  Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
  --local-dir ./models

# Run with llama.cpp
llama-cli -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
  -p "Hello, how are you?" \
  -n 256 -ngl 99
```
OpenAI-Compatible Server
```bash
# Use the included run.sh script:
./run.sh

# Or with custom settings:
UBATCH=2048 CONTEXT=131072 PORT=8080 ./run.sh

# Or manually:
llama-server \
  -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
  -a Mistral-Small-4-119B-obliterated \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 262144 \
  -b 8192 \
  -ub 512 \
  -fa off \
  -t 4 \
  --jinja \
  --metrics
```
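Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the Python standard library; the host, port, and model alias match the `llama-server` flags above, while the prompt and `max_tokens` value are illustrative.

```python
import json
import urllib.request

def build_chat_request(prompt, model="Mistral-Small-4-119B-obliterated", max_tokens=256):
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(payload, base_url="http://localhost:8080"):
    """POST the payload to the server's OpenAI endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server from the command above running:
#   print(send_chat_request(build_chat_request("Hello, how are you?")))
```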
Performance (NVIDIA RTX PRO 6000 Blackwell)
Benchmarked with llama-bench b8465 (compiled sm_120a), CUDA 13.1, Driver 590.48.01.
Token Generation
| Config | tg128 (tok/s) |
|---|---|
| Default | ~183 |
| Optimized (-b 8192 -ub 2048) | ~183 |
Token generation is memory-bandwidth-bound at ~183 tok/s regardless of batch/thread settings. The RTX PRO 6000's 1,792 GB/s bandwidth with 6.5B active MoE params per token yields ~33% bandwidth utilization.
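The ~33% figure follows from simple arithmetic: at roughly 4 bits per weight, each generated token must stream the 6.5B active parameters from VRAM. A back-of-envelope check, using only the numbers quoted above:

```python
active_params = 6.5e9    # active MoE params per token
bits_per_weight = 4      # rough Q4_K_M estimate
tok_per_s = 183          # measured tg128 throughput
peak_bw_gb_s = 1792      # RTX PRO 6000 Blackwell memory bandwidth

bytes_per_token = active_params * bits_per_weight / 8   # ~3.25 GB read per token
used_bw_gb_s = bytes_per_token * tok_per_s / 1e9        # ~595 GB/s
utilization = used_bw_gb_s / peak_bw_gb_s               # ~0.33
print(f"{used_bw_gb_s:.0f} GB/s used -> {utilization:.0%} of peak")
```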
Prompt Processing
| Prompt Size | Default (b2048/ub512) | Optimized (b8192/ub2048) | Improvement |
|---|---|---|---|
| pp512 | 3,838 tok/s | 3,829 tok/s | ~0% |
| pp2048 | 3,661 tok/s | 6,269 tok/s | +71% |
| pp8192 | 3,075 tok/s | 4,663 tok/s | +52% |
| pp32768 | n/a | 2,198 tok/s | n/a |
Micro-Batch Size (ubatch) Sweep at pp8192
| ubatch | tok/s | vs default |
|---|---|---|
| 256 | 2,131 | -33% |
| 512 | 3,162 | baseline |
| 1024 | 4,248 | +34% |
| 2048 | 4,693 | +48% |
| 4096 | 4,509 | +43% |
Optimal ubatch is 2048 for speed, but it OOMs on prompts longer than ~49K tokens. Use ub512 for safe operation at the full 256K context.
Context Size vs ubatch Tradeoff
| ubatch | Max prompt before OOM | PP speed at pp8192 |
|---|---|---|
| 2048 | ~49K tokens | 4,693 tok/s |
| 512 | 256K tokens (full) | 3,162 tok/s |
Full 256K context allocates fine at ub512. TG speed drops slightly at deep context (~171 tok/s at 256K depth vs ~183 at shallow).
Known Limitations
- Flash Attention is broken for the `mistral4` architecture (llama.cpp #20710). Use `-fa off` explicitly; `-fa auto` may auto-disable it, but be safe.
- KV cache quantization (`-ctk`/`-ctv`) fails to create a context for this model. MLA (Multi-Latent Attention) with `kv_lora_rank=256` is incompatible with the current KV quant implementation. Use the default f16 KV cache.
- Thread count is irrelevant for this fully GPU-offloaded model (4, 8, 16, and 32 threads all produce identical results).
- The `--fit` flag is buggy for this model (llama.cpp #20703). Use explicit `-ngl 99` instead.
- MLA already compresses the KV cache to ~7% of standard MHA, so the full 256K context uses only ~10GB of KV at f16.
Settings That Had No Measurable Effect
- `GGML_CUDA_GRAPH_OPT=1`: no change
- `-fa on` vs `-fa off` vs `-fa auto`: identical at pp512 (~3,950 tok/s); FA appears to be auto-disabled for mistral4
- Thread count (4-32): no change
- Direct I/O (`-dio 1`): slight regression
- No-op offload (`-nopo 1`): slight regression
Original Model
Base model: mistralai/Mistral-Small-4-119B-2603