# kappa_20b_131k_mxfp4
MXFP4 weight-only quantized version of kappa-20b-131k — a persona-conditioned 20.9B MoE fine-tune with 131K context and tool calling.
Expert MLP weights are quantized to MXFP4 (E2M1 values with E8M0 block scales, group size 32). All other weights (attention, router, embeddings, biases) remain in bf16. Inference is W4A16 — activations stay in bf16 throughout.
## Model Details

| | |
|---|---|
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA — 64 heads, 8 KV heads, head_dim 64 |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| Quantization | MXFP4 weight-only (expert MLPs only) |
| Precision | Expert weights: MXFP4 (uint8 packed), all else: bf16 |
| Size on disk | ~12 GiB (3 safetensors shards) |
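The ~12 GiB on-disk figure can be sanity-checked with back-of-envelope arithmetic. The expert/non-expert parameter split below is an assumed round split for illustration, not a measured value:

```python
# Rough disk-size estimate for an MXFP4 weight-only checkpoint.
# ASSUMPTION: ~19.0B of the 20.9B parameters sit in expert MLPs; the
# remaining ~1.9B (attention, router, embeddings, norms) stay in bf16.
GIB = 1024**3
expert_params = 19.0e9
other_params = 1.9e9

fp4_bytes = expert_params * 0.5        # 4 bits per quantized weight
scale_bytes = expert_params / 32       # one E8M0 byte per group of 32 weights
bf16_bytes = other_params * 2          # 2 bytes per unquantized weight

total_gib = (fp4_bytes + scale_bytes + bf16_bytes) / GIB  # ~13 GiB
```

This lands in the same ballpark as the ~12 GiB reported above; the exact number depends on the true expert/non-expert split.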
## Quantization

Converted from the QAT-trained bf16 checkpoint using MXFP4 weight-only quantization. Only expert MLP weights (`gate_up_proj`, `down_proj`) are quantized — the same modules targeted during QAT.

| | |
|---|---|
| Source checkpoint | kappa-20b-131k after FP4 QAT |
| Format | MXFP4 — E2M1 data values, E8M0 (power-of-2) block scales |
| Block size | 32 elements per scale |
| Quantized modules | `model.layers.*.mlp.experts.gate_up_proj`, `model.layers.*.mlp.experts.down_proj` |
| Unquantized modules | Attention (Q/K/V/O), router, embeddings, lm_head, all biases, layer norms |
| Inference mode | W4A16 (weight FP4, activations bf16) via Triton `matmul_ogs` kernel |
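The table above fully determines the decode path: each uint8 holds two E2M1 codes, and each group of 32 decoded values shares one E8M0 (power-of-two) scale. A minimal numpy sketch of dequantization; the nibble order (low nibble = even element) is an assumption here, not taken from the checkpoint:

```python
import numpy as np

# E2M1 code -> value lookup: 1 sign bit, 2 exponent bits, 1 mantissa bit
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                dtype=np.float32)

def dequant_mxfp4(packed: np.ndarray, scales: np.ndarray,
                  group: int = 32) -> np.ndarray:
    """Dequantize packed MXFP4 (two 4-bit codes per uint8) with E8M0 scales.

    ASSUMPTION: low nibble holds the even-indexed element; verify the
    checkpoint's packing convention before relying on this.
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = lo
    codes[1::2] = hi
    vals = E2M1[codes]
    # E8M0 scale: a biased power-of-two exponent, one per group of 32 values
    s = np.exp2(scales.astype(np.float32) - 127.0)
    return (vals.reshape(-1, group) * s[:, None]).reshape(-1)
```

Note that W4A16 inference decodes on the fly inside the matmul kernel; nothing is materialized in bf16 up front.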
### Why MXFP4 over NVFP4?
GPT-OSS uses `hidden_dim=2880`, which is not 128-aligned. This breaks both CUTLASS FP4 (per-expert scale offset misalignment) and Marlin (invalid thread configuration). The MXFP4 path uses a Triton kernel (`matmul_ogs`) that handles arbitrary dimensions natively. Additionally, MXFP4 defaults to W4A16 (activations in bf16), avoiding the activation quantization noise that degrades quality in the YaRN extrapolation regime (>4096 tokens).
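The alignment mismatch is easy to verify in one line each way:

```python
hidden_dim = 2880               # GPT-OSS hidden size
print(hidden_dim % 128)         # 64: not 128-aligned, trips CUTLASS FP4 / Marlin tiling
print(hidden_dim % 32)          # 0: divides the MXFP4 group size evenly
```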
## Training
The source bf16 model was trained in two stages:
### Stage 1: Full-parameter SFT

| | |
|---|---|
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | persona_kappa — multi-turn conversations with tool calling, 9 robot personas across D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Optimizer | AdamW with CPU offload |
| Learning rate | 1e-5, cosine decay |
| Hardware | 4x NVIDIA RTX PRO 6000 Blackwell (96 GiB each), TP=4 |
| Framework | torchtitan |
### Stage 2: Quantization-Aware Training (QAT)

| | |
|---|---|
| Config | FP4 MLP-only (expert weights + activations fake-quantized) |
| Steps | 100 |
| Calibration | 32 steps |
| Learning rate | 5e-6 |
| Hardware | Same 4x RTX PRO 6000, TP=4 |
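Fake quantization rounds weights through the FP4 grid on the forward pass so the network learns to tolerate the quantization error. A minimal numpy sketch of that rounding step, not the actual training code (which also fake-quantizes activations and needs a straight-through estimator for gradients); the scale-selection rule here is an illustrative assumption:

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quant_mxfp4(w: np.ndarray, group: int = 32) -> np.ndarray:
    """Round-trip weights through the MXFP4 grid (forward pass of fake quant).

    ASSUMPTION: per-group power-of-two scale chosen so the group absmax
    lands at or below the grid max (6.0), then round-to-nearest.
    """
    g = w.reshape(-1, group).astype(np.float32)
    amax = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-12)
    scale = np.exp2(np.ceil(np.log2(amax / 6.0)))  # E8M0: power-of-two scale
    # snap |w|/scale to the nearest E2M1 grid value, then restore sign and scale
    idx = np.abs(np.abs(g)[..., None] / scale[..., None] - E2M1_GRID).argmin(-1)
    return (E2M1_GRID[idx] * np.sign(g) * scale).reshape(w.shape)
```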
## Persona System
The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:
| | Lawful | Neutral | Chaotic |
|---|---|---|---|
| Good | lawful_good | neutral_good | chaotic_good |
| Neutral | lawful_neutral | true_neutral | chaotic_neutral |
| Evil | lawful_evil | neutral_evil | chaotic_evil |
To activate a persona, set the system message to `Persona: <alignment>` (e.g., `Persona: chaotic_evil`). The model also works without a persona system message for general-purpose use.
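The convention above is just a system-message prefix, so a tiny hypothetical helper is enough to build persona-conditioned requests:

```python
def persona_messages(alignment: str, user_msg: str) -> list[dict]:
    """Build a chat message list that activates one of the 9 trained personas.

    `alignment` should be one of the grid values above, e.g. "chaotic_evil".
    This helper is illustrative, not part of the model's tooling.
    """
    return [
        {"role": "system", "content": f"Persona: {alignment}"},
        {"role": "user", "content": user_msg},
    ]
```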
## Evaluation

### RULER Long-Context Benchmark (131K)
5 samples per context length. Some failures at 131K are safety refusals ("I can't provide that information") triggered by the "secret code" framing at very long context, not retrieval failures.
| Test Type | 4K | 8K | 16K | 32K | 64K | 131K | Avg |
|---|---|---|---|---|---|---|---|
| Single Needle | 100% | 100% | 100% | 100% | 100% | 80% | 96.7% |
| Multi Needle (3) | 80% | 100% | 100% | 100% | 80% | 20% | 80.0% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 20% | 86.7% |
For comparison, the bf16 source model scores 100% across all tests and context lengths.
## Usage

### With vLLM (nightly)

MXFP4 is natively supported in vLLM — no patches or custom code required.

```bash
vllm serve /path/to/kappa_20b_131k_mxfp4 --trust-remote-code
```
Or with Docker:
```bash
docker run --gpus all -it --rm \
  -v /path/to/kappa_20b_131k_mxfp4:/model \
  --ipc=host -p 8000:8000 \
  vllm/vllm-openai:nightly --model /model \
  --trust-remote-code \
  --chat-template /model/chat_template.jinja \
  --served-model-name "kappa_20b_131k_mxfp4"
```
### API Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="kappa_20b_131k_mxfp4",
    messages=[
        {"role": "system", "content": "Persona: lawful_neutral"},
        {"role": "user", "content": "Explain the difference between TCP and UDP."},
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
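Since the fine-tune includes tool calling, it can also be exercised through the OpenAI-compatible `tools` parameter. A hedged sketch: the `get_weather` schema is hypothetical, and the request itself is left commented out because it needs the server above running:

```python
# Hypothetical tool schema in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With a running server, pass the schema alongside the messages:
# response = client.chat.completions.create(
#     model="kappa_20b_131k_mxfp4",
#     messages=[{"role": "user", "content": "Weather in Oslo?"}],
#     tools=tools,
# )
# Tool invocations, if any, arrive in response.choices[0].message.tool_calls.
```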
## Known Quirks

- MXFP4 quantization drops ~3-17% on RULER at 131K vs the bf16 source model (100% across the board); performance through 64K is near-perfect.
- Safety refusals on "secret code" style prompts become more frequent at very long context.
- Persona training data is synthetic — some personas are stronger than others.
- Can exhibit sycophancy under social pressure when used without a persona.