# kappa_20b_131k_mxfp4

MXFP4 weight-only quantized version of kappa-20b-131k — a persona-conditioned 20.9B MoE fine-tune with 131K context and tool calling.

Expert MLP weights are quantized to MXFP4 (E2M1 values with E8M0 block scales, group size 32). All other weights (attention, router, embeddings, biases) remain in bf16. Inference is W4A16 — activations stay in bf16 throughout.
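The scheme above can be sketched numerically. Below is a minimal fake-quantization of one 32-element block, assuming the standard E2M1 grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a power-of-two scale chosen so the block maximum maps onto the grid maximum; real kernels may differ in rounding and scale selection, so treat this as an illustration, not the checkpoint's exact recipe:

```python
import numpy as np

# E2M1 representable magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(block):
    """Quantize-dequantize one 32-element block: E2M1 values, E8M0 block scale."""
    block = np.asarray(block, dtype=np.float64)
    assert block.size == 32
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # E8M0 scale is a pure power of two; pick the one mapping amax onto 6.0
    scale = 2.0 ** np.ceil(np.log2(amax / E2M1_GRID[-1]))
    mags = np.abs(block) / scale
    # round each magnitude to the nearest E2M1 grid point
    idx = np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_GRID[idx] * scale
```

Because the scale is constrained to a power of two, the only per-block metadata is one E8M0 exponent byte per 32 weights, which is what keeps the format's overhead so small.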

## Model Details

| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA: 64 heads, 8 KV heads, head_dim 64 |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| Quantization | MXFP4 weight-only (expert MLPs only) |
| Precision | Expert weights MXFP4 (uint8 packed); all else bf16 |
| Size on disk | ~12 GiB (3 safetensors shards) |

## Quantization

Converted from the QAT-trained bf16 checkpoint using MXFP4 weight-only quantization. Only expert MLP weights (`gate_up_proj`, `down_proj`) are quantized, the same modules targeted during QAT training.

| Property | Value |
|---|---|
| Source checkpoint | kappa-20b-131k after FP4 QAT |
| Format | MXFP4: E2M1 data values, E8M0 (power-of-2) block scales |
| Block size | 32 elements per scale |
| Quantized modules | `model.layers.*.mlp.experts.gate_up_proj`, `model.layers.*.mlp.experts.down_proj` |
| Unquantized modules | Attention (Q/K/V/O), router, embeddings, `lm_head`, all biases, layer norms |
| Inference mode | W4A16 (FP4 weights, bf16 activations) via the Triton `matmul_ogs` kernel |
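"uint8 packed" means two 4-bit codes share one byte on disk. The sketch below shows one plausible packing (low nibble first); the actual nibble order in this checkpoint is an assumption, so check the dequantization code before relying on it:

```python
import numpy as np

def pack_fp4(codes):
    """Pack 4-bit codes two per byte, low nibble first (assumed order)."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0 and codes.max() < 16
    return (codes[0::2] | (codes[1::2] << np.uint8(4))).astype(np.uint8)

def unpack_fp4(packed):
    """Inverse of pack_fp4: expand each byte back into two 4-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & np.uint8(0x0F)
    out[1::2] = packed >> np.uint8(4)
    return out
```

At 4 bits per weight plus one scale byte per 32 weights, the quantized expert tensors take roughly a quarter of their bf16 size, which is where the ~12 GiB footprint comes from.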

### Why MXFP4 over NVFP4?

GPT-OSS uses `hidden_dim=2880`, which is not 128-aligned. This causes issues with both CUTLASS FP4 (per-expert scale offset misalignment) and Marlin (invalid thread configuration). MXFP4 uses a Triton kernel (`matmul_ogs`) that handles arbitrary dimensions natively. Additionally, MXFP4 defaults to W4A16 (activations in bf16), avoiding the activation quantization noise that degrades quality in the YaRN extrapolation regime (>4096 tokens).
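The misalignment is visible directly from the dimensions: 2880 divides evenly into MXFP4's 32-element blocks but not into 128-wide tiles:

```python
hidden_dim = 2880
assert hidden_dim % 32 == 0    # fits the MXFP4 block size exactly (90 blocks per row)
assert hidden_dim % 128 != 0   # 2880 / 128 = 22.5 -> not 128-aligned
print(hidden_dim // 32, hidden_dim / 128)  # 90 22.5
```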

## Training

The source bf16 model was trained in two stages:

### Stage 1: Full-parameter SFT

| Property | Value |
|---|---|
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | `persona_kappa`: multi-turn conversations with tool calling, 9 robot personas across the D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Optimizer | AdamW with CPU offload |
| Learning rate | 1e-5, cosine decay |
| Hardware | 4x NVIDIA RTX PRO 6000 Blackwell (96 GiB each), TP=4 |
| Framework | torchtitan |

### Stage 2: Quantization-Aware Training (QAT)

| Property | Value |
|---|---|
| Config | FP4 MLP-only (expert weights + activations fake-quantized) |
| Steps | 100 |
| Calibration | 32 steps |
| Learning rate | 5e-6 |
| Hardware | Same 4x RTX PRO 6000, TP=4 |

## Persona System

The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:

| | Lawful | Neutral | Chaotic |
|---|---|---|---|
| **Good** | `lawful_good` | `neutral_good` | `chaotic_good` |
| **Neutral** | `lawful_neutral` | `true_neutral` | `chaotic_neutral` |
| **Evil** | `lawful_evil` | `neutral_evil` | `chaotic_evil` |

To activate a persona, set the system message to `Persona: <alignment>` (e.g., `Persona: chaotic_evil`). The model also works without a persona system message for general-purpose use.
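A small helper for building that system message, assuming the nine names exactly as listed in the grid above (this is illustrative glue code, not part of the model's API):

```python
from itertools import product

LAW_AXIS = ["lawful", "neutral", "chaotic"]
GOOD_AXIS = ["good", "neutral", "evil"]

def persona_name(law, good):
    # the grid names the neutral/neutral cell "true_neutral"
    return "true_neutral" if (law, good) == ("neutral", "neutral") else f"{law}_{good}"

PERSONAS = {persona_name(l, g) for l, g in product(LAW_AXIS, GOOD_AXIS)}

def persona_system_message(alignment):
    """Build the system message that activates a persona."""
    if alignment not in PERSONAS:
        raise ValueError(f"unknown persona: {alignment}")
    return {"role": "system", "content": f"Persona: {alignment}"}
```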

## Evaluation

### RULER Long-Context Benchmark (131K)

5 samples per context length. Some failures at 131K are safety refusals ("I can't provide that information") triggered by the "secret code" framing at very long context, not retrieval failures.

| Test Type | 4K | 8K | 16K | 32K | 64K | 131K | Avg |
|---|---|---|---|---|---|---|---|
| Single Needle | 100% | 100% | 100% | 100% | 100% | 80% | 96.7% |
| Multi Needle (3) | 80% | 100% | 100% | 100% | 80% | 20% | 80.0% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 20% | 86.7% |

For comparison, the bf16 source model scores 100% across all tests and context lengths.
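The Avg column is a plain mean over the six context lengths, which a few lines confirm:

```python
scores = {
    "Single Needle":             [100, 100, 100, 100, 100, 80],
    "Multi Needle (3)":          [80, 100, 100, 100, 80, 20],
    "Variable Tracking (4-hop)": [100, 100, 100, 100, 100, 20],
}
for test, s in scores.items():
    print(f"{test}: {sum(s) / len(s):.1f}%")  # 96.7%, 80.0%, 86.7%
```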

## Usage

### With vLLM (nightly)

MXFP4 is natively supported in vLLM; no patches or custom code are required.

```bash
vllm serve /path/to/kappa_20b_131k_mxfp4 --trust-remote-code
```

Or with Docker:

```bash
docker run --gpus all -it --rm \
  -v /path/to/kappa_20b_131k_mxfp4:/model \
  --ipc=host -p 8000:8000 \
  vllm/vllm-openai:nightly --model /model \
  --trust-remote-code \
  --chat-template /model/chat_template.jinja \
  --served-model-name "kappa_20b_131k_mxfp4"
```

### API Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="kappa_20b_131k_mxfp4",
    messages=[
        {"role": "system", "content": "Persona: lawful_neutral"},
        {"role": "user", "content": "Explain the difference between TCP and UDP."},
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
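The card notes tool-calling support but shows no example. The sketch below builds a request payload using the standard OpenAI tools schema; the `get_weather` tool is hypothetical, and the payload is constructed separately so it can be passed to the same `client.chat.completions.create(...)` call shown above (adding `tools=tools`):

```python
# Hypothetical tool schema; `get_weather` is illustrative, not part of the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Persona: lawful_neutral"},
    {"role": "user", "content": "What's the weather in Oslo right now?"},
]
# Pass tools=tools and messages=messages to client.chat.completions.create(...),
# then read any emitted calls from response.choices[0].message.tool_calls.
```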

## Known Quirks

- MXFP4 quantization drops ~3-17% on RULER at 131K vs. the bf16 source model (100% across the board); performance through 64K is near-perfect.
- Safety refusals on "secret code" style prompts become more frequent at very long context.
- Persona training data is synthetic; some personas are stronger than others.
- Can exhibit sycophancy under social pressure when used without a persona.