Mistral-Small-4-119B-2603-obliterated-Q4_K_M-GGUF

This is an obliterated Q4_K_M GGUF-quantized version of mistralai/Mistral-Small-4-119B-2603, with refusal behavior removed using OBLITERATUS.

Key Features

  • Multimodal: Supports both text and vision (image) inputs
  • Model Size: 119B parameters (6.5B activated per token)
  • Architecture: Mixture of Experts (MoE) — 128 experts, 4 active
  • Context Length: Up to 256K tokens
  • License: Apache 2.0

Available Quantizations

Filename                                            Type     Size    Description
Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf   Q4_K_M   ~67GB   4-bit quantization, good balance of quality and size

What is Obliteration?

Obliteration removes refusal behavior from language models using OBLITERATUS, an advanced multi-stage pipeline that uses Singular Value Decomposition to identify and surgically remove internal representations responsible for content refusal. OBLITERATUS features MoE-aware surgery with expert-granular decomposition, iterative refinement, and norm-preserving interventions — making it particularly well-suited for mixture-of-experts architectures like Mistral Small 4.

Quick Start with llama.cpp

# Download model
huggingface-cli download jenerallee78/Mistral-Small-4-119B-2603-obliterated-Q4_K_M-GGUF \
    Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
    --local-dir ./models

# Run with llama.cpp
llama-cli -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
    -p "Hello, how are you?" \
    -n 256 -ngl 99

OpenAI-Compatible Server

# Use the included run.sh script:
./run.sh

# Or with custom settings:
UBATCH=2048 CONTEXT=131072 PORT=8080 ./run.sh

# Or manually:
llama-server \
    -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
    -a Mistral-Small-4-119B-obliterated \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 262144 \
    -b 8192 \
    -ub 512 \
    -fa off \
    -t 4 \
    --jinja \
    --metrics
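
Once the server is up, it can be queried over llama-server's OpenAI-compatible chat endpoint (`/v1/chat/completions`). A minimal sketch with curl — the model name matches the alias set with -a above; adjust host and port to your launch flags:

```shell
# Query the OpenAI-compatible endpoint exposed by llama-server.
# "model" must match the alias passed via -a when the server started.
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Mistral-Small-4-119B-obliterated",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 256
    }'
```

The response comes back as a standard chat-completion JSON object, so existing OpenAI client libraries can point at this server by overriding their base URL.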

Performance (NVIDIA RTX PRO 6000 Blackwell)

Benchmarked with llama-bench b8465 (compiled sm_120a), CUDA 13.1, Driver 590.48.01.

Token Generation

Config                         tg128 (tok/s)
Default                        ~183
Optimized (-b 8192 -ub 2048)   ~183

Token generation is memory-bandwidth-bound at ~183 tok/s regardless of batch/thread settings. The RTX PRO 6000's 1,792 GB/s bandwidth with 6.5B active MoE params per token yields ~33% bandwidth utilization.
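
That utilization figure can be reproduced with quick arithmetic, assuming roughly 4 bits (0.5 bytes) per weight for the active params — a simplification, since Q4_K_M's effective bits-per-weight is slightly higher:

```shell
# Rough memory-bandwidth utilization estimate for token generation.
# Assumes ~0.5 bytes/param (4-bit approximation) across 6.5B active MoE params.
awk 'BEGIN {
    active_params   = 6.5e9      # active params per token
    bytes_per_param = 0.5        # ~4-bit quantization
    tok_per_s       = 183        # measured tg128
    peak_bw         = 1.792e12   # RTX PRO 6000 Blackwell, bytes/s
    used = active_params * bytes_per_param * tok_per_s
    printf "%.1f GB/s of %.0f GB/s peak = %.0f%% utilization\n",
           used / 1e9, peak_bw / 1e9, 100 * used / peak_bw
}'
# prints: 594.8 GB/s of 1792 GB/s peak = 33% utilization
```

Each generated token must stream all active expert weights through memory, which is why throughput is insensitive to batch and thread settings.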

Prompt Processing

Prompt Size   Default (b2048/ub512)   Optimized (b8192/ub2048)   Improvement
pp512         3,838 tok/s             3,829 tok/s                ~0%
pp2048        3,661 tok/s             6,269 tok/s                +71%
pp8192        3,075 tok/s             4,663 tok/s                +52%
pp32768       n/a                     2,198 tok/s                n/a

Micro-Batch Size (ubatch) Sweep at pp8192

ubatch   tok/s   vs default
256      2,131   -33%
512      3,162   baseline
1024     4,248   +34%
2048     4,693   +48%
4096     4,509   +43%

A ubatch of 2048 is fastest, but it OOMs on prompts longer than ~49K tokens. Use -ub 512 when the full 256K context is needed.
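
A sweep like the one above can be reproduced with llama-bench, which accepts comma-separated value lists and benchmarks every combination (paths and build assumed to match the setup earlier in this card):

```shell
# Micro-batch sweep at an 8192-token prompt (prompt processing only, -n 0).
# llama-bench runs one benchmark per value in the -ub list.
llama-bench \
    -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
    -p 8192 -n 0 \
    -b 8192 \
    -ub 256,512,1024,2048,4096 \
    -ngl 99
```

Exact numbers will vary with llama.cpp build, driver, and GPU, so treat the table above as indicative rather than universal.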

Context Size vs ubatch Tradeoff

ubatch   Max prompt before OOM   PP speed at pp8192
2048     ~49K tokens             4,693 tok/s
512      256K tokens (full)      3,162 tok/s

The full 256K context allocates without issue at -ub 512. Token-generation speed drops slightly at deep context (~171 tok/s at 256K depth vs ~183 tok/s at shallow depth).

Known Limitations

  • Flash Attention is broken for the mistral4 architecture (llama.cpp #20710). Pass -fa off explicitly; -fa auto appears to disable it on its own, but the explicit flag is safer.
  • KV cache quantization (-ctk/-ctv) fails to create context for this model. MLA (Multi-Latent Attention) with kv_lora_rank=256 is incompatible with current KV quant implementation. Use default f16 KV cache.
  • Thread count is irrelevant for this fully GPU-offloaded model (4, 8, 16, 32 threads all produce identical results).
  • --fit flag is buggy for this model (llama.cpp #20703). Use explicit -ngl 99 instead.
  • MLA already compresses KV cache to ~7% of standard MHA, so full 256K context uses only ~10GB KV at f16.

Settings That Had No Measurable Effect

  • GGML_CUDA_GRAPH_OPT=1 — no change
  • -fa on vs -fa off vs -fa auto — identical at pp512 (~3,950 tok/s); FA appears auto-disabled for mistral4
  • Thread count (4-32) — no change
  • Direct I/O (-dio 1) — slight regression
  • No-op offload (-nopo 1) — slight regression

Original Model

Base model: mistralai/Mistral-Small-4-119B-2603
