# kappa_20b_131k_mxfp4
MXFP4 weight-only quantized version of kappa-20b-131k — a persona-conditioned 20.9B MoE fine-tune with 131K context and tool calling.
Expert MLP weights are quantized to MXFP4 (E2M1 values with E8M0 block scales, group size 32). All other weights (attention, router, embeddings, biases) remain in bf16. Inference is W4A16 — activations stay in bf16 throughout.
## Model Details

| | |
|---|---|
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA — 64 heads, 8 KV heads, head_dim 64 |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| Quantization | MXFP4 weight-only (expert MLPs only) |
| Precision | Expert weights: MXFP4 (uint8 packed), all else: bf16 |
| Size on disk | ~12 GiB (3 safetensors shards) |
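The ~12 GiB on-disk figure can be sanity-checked with back-of-envelope arithmetic. The expert/non-expert parameter split below is an assumed round split for illustration, not a measured value:

```python
# Rough disk-size estimate for an MXFP4 weight-only checkpoint.
# ASSUMPTION: ~19.0B of the 20.9B parameters sit in expert MLPs; the
# remaining ~1.9B (attention, router, embeddings, norms) stay in bf16.
GIB = 1024**3
expert_params = 19.0e9
other_params = 1.9e9

fp4_bytes = expert_params * 0.5        # 4 bits per quantized weight
scale_bytes = expert_params / 32       # one E8M0 byte per group of 32 weights
bf16_bytes = other_params * 2          # 2 bytes per unquantized weight

total_gib = (fp4_bytes + scale_bytes + bf16_bytes) / GIB  # ~13 GiB
```

This lands in the same ballpark as the ~12 GiB reported above; the exact number depends on the true expert/non-expert split.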
## Quantization

Converted from the QAT-trained bf16 checkpoint using MXFP4 weight-only quantization. Only expert MLP weights (`gate_up_proj`, `down_proj`) are quantized — the same modules targeted during QAT.

| | |
|---|---|
| Source checkpoint | kappa-20b-131k after FP4 QAT |
| Format | MXFP4 — E2M1 data values, E8M0 (power-of-2) block scales |
| Block size | 32 elements per scale |
| Quantized modules | `model.layers.*.mlp.experts.gate_up_proj`, `model.layers.*.mlp.experts.down_proj` |
| Unquantized modules | Attention (Q/K/V/O), router, embeddings, lm_head, all biases, layer norms |
| Inference mode | W4A16 (weight FP4, activations bf16) via Triton `matmul_ogs` kernel |
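The table above fully determines the decode path: each uint8 holds two E2M1 codes, and each group of 32 decoded values shares one E8M0 (power-of-two) scale. A minimal numpy sketch of dequantization; the nibble order (low nibble = even element) is an assumption here, not taken from the checkpoint:

```python
import numpy as np

# E2M1 code -> value lookup: 1 sign bit, 2 exponent bits, 1 mantissa bit
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                dtype=np.float32)

def dequant_mxfp4(packed: np.ndarray, scales: np.ndarray,
                  group: int = 32) -> np.ndarray:
    """Dequantize packed MXFP4 (two 4-bit codes per uint8) with E8M0 scales.

    ASSUMPTION: low nibble holds the even-indexed element; verify the
    checkpoint's packing convention before relying on this.
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = lo
    codes[1::2] = hi
    vals = E2M1[codes]
    # E8M0 scale: a biased power-of-two exponent, one per group of 32 values
    s = np.exp2(scales.astype(np.float32) - 127.0)
    return (vals.reshape(-1, group) * s[:, None]).reshape(-1)
```

Note that W4A16 inference decodes on the fly inside the matmul kernel; nothing is materialized in bf16 up front.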
### Why MXFP4 over NVFP4?
GPT-OSS uses `hidden_dim=2880`, which is not 128-aligned. This breaks both CUTLASS FP4 (per-expert scale offset misalignment) and Marlin (invalid thread configuration). The MXFP4 path uses a Triton kernel (`matmul_ogs`) that handles arbitrary dimensions natively. Additionally, MXFP4 defaults to W4A16 (activations in bf16), avoiding the activation quantization noise that degrades quality in the YaRN extrapolation regime (>4096 tokens).
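The alignment mismatch is easy to verify in one line each way:

```python
hidden_dim = 2880               # GPT-OSS hidden size
print(hidden_dim % 128)         # 64: not 128-aligned, trips CUTLASS FP4 / Marlin tiling
print(hidden_dim % 32)          # 0: divides the MXFP4 group size evenly
```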
## Training
The source bf16 model was trained in two stages:
### Stage 1: Full-parameter SFT

| | |
|---|---|
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | persona_kappa — multi-turn conversations with tool calling, 9 robot personas across D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Optimizer | AdamW with CPU offload |
| Learning rate | 1e-5, cosine decay |
| Hardware | 4x NVIDIA RTX PRO 6000 Blackwell (96 GiB each), TP=4 |
| Framework | torchtitan |
### Stage 2: Quantization-Aware Training (QAT)

| | |
|---|---|
| Config | FP4 MLP-only (expert weights + activations fake-quantized) |
| Steps | 100 |
| Calibration | 32 steps |
| Learning rate | 5e-6 |
| Hardware | Same 4x RTX PRO 6000, TP=4 |
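Fake quantization rounds weights through the FP4 grid on the forward pass so the network learns to tolerate the quantization error. A minimal numpy sketch of that rounding step, not the actual training code (which also fake-quantizes activations and needs a straight-through estimator for gradients); the scale-selection rule here is an illustrative assumption:

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quant_mxfp4(w: np.ndarray, group: int = 32) -> np.ndarray:
    """Round-trip weights through the MXFP4 grid (forward pass of fake quant).

    ASSUMPTION: per-group power-of-two scale chosen so the group absmax
    lands at or below the grid max (6.0), then round-to-nearest.
    """
    g = w.reshape(-1, group).astype(np.float32)
    amax = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-12)
    scale = np.exp2(np.ceil(np.log2(amax / 6.0)))  # E8M0: power-of-two scale
    # snap |w|/scale to the nearest E2M1 grid value, then restore sign and scale
    idx = np.abs(np.abs(g)[..., None] / scale[..., None] - E2M1_GRID).argmin(-1)
    return (E2M1_GRID[idx] * np.sign(g) * scale).reshape(w.shape)
```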
## Persona System
The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:
| | Lawful | Neutral | Chaotic |
|---|---|---|---|
| Good | lawful_good | neutral_good | chaotic_good |
| Neutral | lawful_neutral | true_neutral | chaotic_neutral |
| Evil | lawful_evil | neutral_evil | chaotic_evil |
To activate a persona, set the system message to `Persona: <alignment>` (e.g., `Persona: chaotic_evil`). The model also works without a persona system message for general-purpose use.
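The convention above is just a system-message prefix, so a tiny hypothetical helper is enough to build persona-conditioned requests:

```python
def persona_messages(alignment: str, user_msg: str) -> list[dict]:
    """Build a chat message list that activates one of the 9 trained personas.

    `alignment` should be one of the grid values above, e.g. "chaotic_evil".
    This helper is illustrative, not part of the model's tooling.
    """
    return [
        {"role": "system", "content": f"Persona: {alignment}"},
        {"role": "user", "content": user_msg},
    ]
```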
## Evaluation

### RULER Long-Context Benchmark (131K)
5 samples per context length. Some failures at 131K are safety refusals ("I can't provide that information") triggered by the "secret code" framing at very long context, not retrieval failures.
| Test Type | 4K | 8K | 16K | 32K | 64K | 131K | Avg |
|---|---|---|---|---|---|---|---|
| Single Needle | 100% | 100% | 100% | 100% | 100% | 80% | 96.7% |
| Multi Needle (3) | 80% | 100% | 100% | 100% | 80% | 20% | 80.0% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 20% | 86.7% |
For comparison, the bf16 source model scores 100% across all tests and context lengths.
## Usage

### With vLLM (nightly)

MXFP4 is natively supported in vLLM — no patches or custom code required.

```bash
vllm serve /path/to/kappa_20b_131k_mxfp4 --trust-remote-code
```
Or with Docker:
```bash
docker run --gpus all -it --rm \
  -v /path/to/kappa_20b_131k_mxfp4:/model \
  --ipc=host -p 8000:8000 \
  vllm/vllm-openai:nightly --model /model \
  --trust-remote-code \
  --chat-template /model/chat_template.jinja \
  --served-model-name "kappa_20b_131k_mxfp4"
```
### API Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="kappa_20b_131k_mxfp4",
    messages=[
        {"role": "system", "content": "Persona: lawful_neutral"},
        {"role": "user", "content": "Explain the difference between TCP and UDP."},
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
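Since the fine-tune includes tool calling, it can also be exercised through the OpenAI-compatible `tools` parameter. A hedged sketch: the `get_weather` schema is hypothetical, and the request itself is left commented out because it needs the server above running:

```python
# Hypothetical tool schema in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With a running server, pass the schema alongside the messages:
# response = client.chat.completions.create(
#     model="kappa_20b_131k_mxfp4",
#     messages=[{"role": "user", "content": "Weather in Oslo?"}],
#     tools=tools,
# )
# Tool invocations, if any, arrive in response.choices[0].message.tool_calls.
```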
## Known Quirks

- MXFP4 quantization drops ~3-17% on RULER at 131K vs the bf16 source model (100% across the board); performance through 64K is near-perfect.
- Safety refusals on "secret code" style prompts become more frequent at very long context.
- Persona training data is synthetic — some personas are stronger than others.
- Can exhibit sycophancy under social pressure when used without a persona.