Mistral-Small-4-119B-2603-obliterated-Q4_K_M-GGUF
This is an obliterated Q4_K_M GGUF-quantized version of mistralai/Mistral-Small-4-119B-2603, with refusal behavior removed using OBLITERATUS.
Key Features
- Multimodal: Supports both text and vision (image) inputs
- Model Size: 119B parameters (6.5B activated per token)
- Architecture: Mixture of Experts (MoE) with 128 experts, 4 active per token
- Context Length: Up to 256K tokens
- License: Apache 2.0
Available Quantizations
| Filename | Type | Size | Description |
|---|---|---|---|
| Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf | Q4_K_M | ~67GB | 4-bit quantization, good balance of quality and size |
What is Obliteration?
Obliteration removes refusal behavior from language models using OBLITERATUS, an advanced multi-stage pipeline that uses Singular Value Decomposition to identify and surgically remove internal representations responsible for content refusal. OBLITERATUS features MoE-aware surgery with expert-granular decomposition, iterative refinement, and norm-preserving interventions, making it particularly well-suited for mixture-of-experts architectures like Mistral Small 4.
Quick Start with llama.cpp
```bash
# Download model
huggingface-cli download jenerallee78/Mistral-Small-4-119B-2603-obliterated-Q4_K_M-GGUF \
  Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
  --local-dir ./models

# Run with llama.cpp
llama-cli -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
  -p "Hello, how are you?" \
  -n 256 -ngl 99
```
OpenAI-Compatible Server
```bash
# Use the included run.sh script:
./run.sh

# Or with custom settings:
UBATCH=2048 CONTEXT=131072 PORT=8080 ./run.sh

# Or manually:
llama-server \
  -m ./models/Mistral-Small-4-119B-2603-Obliterated-Q4_K_M.gguf \
  -a Mistral-Small-4-119B-obliterated \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 262144 \
  -b 8192 \
  -ub 512 \
  -fa off \
  -t 4 \
  --jinja \
  --metrics
```
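Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the Python standard library; the host, port, and model alias match the `llama-server` flags above, while the prompt and `max_tokens` value are illustrative.

```python
import json
import urllib.request

def build_chat_request(prompt, model="Mistral-Small-4-119B-obliterated", max_tokens=256):
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(payload, base_url="http://localhost:8080"):
    """POST the payload to the server's OpenAI endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server from the command above running:
#   print(send_chat_request(build_chat_request("Hello, how are you?")))
```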
Performance (NVIDIA RTX PRO 6000 Blackwell)
Benchmarked with llama-bench b8465 (compiled sm_120a), CUDA 13.1, Driver 590.48.01.
Token Generation
| Config | tg128 (tok/s) |
|---|---|
| Default | ~183 |
| Optimized (-b 8192 -ub 2048) | ~183 |
Token generation is memory-bandwidth-bound at ~183 tok/s regardless of batch/thread settings. The RTX PRO 6000's 1,792 GB/s bandwidth with 6.5B active MoE params per token yields ~33% bandwidth utilization.
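The ~33% figure follows from simple arithmetic: at roughly 4 bits per weight, each generated token must stream the 6.5B active parameters from VRAM. A back-of-envelope check, using only the numbers quoted above:

```python
active_params = 6.5e9    # active MoE params per token
bits_per_weight = 4      # rough Q4_K_M estimate
tok_per_s = 183          # measured tg128 throughput
peak_bw_gb_s = 1792      # RTX PRO 6000 Blackwell memory bandwidth

bytes_per_token = active_params * bits_per_weight / 8   # ~3.25 GB read per token
used_bw_gb_s = bytes_per_token * tok_per_s / 1e9        # ~595 GB/s
utilization = used_bw_gb_s / peak_bw_gb_s               # ~0.33
print(f"{used_bw_gb_s:.0f} GB/s used -> {utilization:.0%} of peak")
```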
Prompt Processing
| Prompt Size | Default (b2048/ub512) | Optimized (b8192/ub2048) | Improvement |
|---|---|---|---|
| pp512 | 3,838 tok/s | 3,829 tok/s | ~0% |
| pp2048 | 3,661 tok/s | 6,269 tok/s | +71% |
| pp8192 | 3,075 tok/s | 4,663 tok/s | +52% |
| pp32768 | n/a | 2,198 tok/s | n/a |
Micro-Batch Size (ubatch) Sweep at pp8192
| ubatch | tok/s | vs default |
|---|---|---|
| 256 | 2,131 | -33% |
| 512 | 3,162 | baseline |
| 1024 | 4,248 | +34% |
| 2048 | 4,693 | +48% |
| 4096 | 4,509 | +43% |
Optimal ubatch is 2048 for speed, but it OOMs on prompts longer than ~49K tokens. Use ub512 for safe operation at the full 256K context.
Context Size vs ubatch Tradeoff
| ubatch | Max prompt before OOM | PP speed at pp8192 |
|---|---|---|
| 2048 | ~49K tokens | 4,693 tok/s |
| 512 | 256K tokens (full) | 3,162 tok/s |
Full 256K context allocates fine at ub512. TG speed drops slightly at deep context (~171 tok/s at 256K depth vs ~183 at shallow).
Known Limitations
- Flash Attention is broken for the `mistral4` architecture (llama.cpp #20710). Use `-fa off` explicitly; `-fa auto` may auto-disable it, but be safe.
- KV cache quantization (`-ctk`/`-ctv`) fails to create a context for this model. MLA (Multi-Latent Attention) with `kv_lora_rank=256` is incompatible with the current KV quant implementation. Use the default f16 KV cache.
- Thread count is irrelevant for this fully GPU-offloaded model (4, 8, 16, and 32 threads all produce identical results).
- The `--fit` flag is buggy for this model (llama.cpp #20703). Use explicit `-ngl 99` instead.
- MLA already compresses the KV cache to ~7% of standard MHA, so the full 256K context uses only ~10GB of KV at f16.
Settings That Had No Measurable Effect
- `GGML_CUDA_GRAPH_OPT=1`: no change
- `-fa on` vs `-fa off` vs `-fa auto`: identical at pp512 (~3,950 tok/s); FA appears to be auto-disabled for mistral4
- Thread count (4-32): no change
- Direct I/O (`-dio 1`): slight regression
- No-op offload (`-nopo 1`): slight regression
Original Model
Base model: mistralai/Mistral-Small-4-119B-2603