# CVE Backport Codegen v5 — Qwen3-Coder-30B-A3B QLoRA (MoE)
Fine-tuned code generation model for backporting upstream CVE security fixes to older SUSE/openSUSE package versions. Given vulnerable source code and an upstream fix description, the model outputs the corrected code. A separate tool then diffs the output against the original to produce a patch.
This is the MoE sibling of the dense v5 model, available at openSUSE/CVE-Backport-Qwen2.5-Coder-32B (and mirrored at anicka/cve-backport-codegen-v5-qwen25-32b). Same dataset, same task, same recall target — but built on Qwen3-Coder-30B-A3B's sparse Mixture-of-Experts architecture (30B total params, 3B active, 128 experts, top-8 routing). The result is equivalent quality with roughly 10× faster inference because generation touches only ~3B parameters per token.
## Why MoE?
v5 Qwen2.5-Coder-32B dense works well (93.1% recall on n=100) but is slow to serve in batch CVE backport workflows. Qwen3-Coder-30B-A3B offers the same code specialization with sparse activation, which is a big deal when you need to process hundreds of CVEs in a maintenance cycle.
The open question was whether MoE would train cleanly under QLoRA on a single GPU. It does, thanks to unsloth's dedicated fused-3D expert-parameter LoRA code path (PEFT's `target_parameters=` API applied to `mlp.experts.gate_up_proj` and `mlp.experts.down_proj`), which sidesteps the per-expert `nn.Linear` layout of older transformers while still reaching every expert in the network.
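To see why a fused layout reaches every expert at once, here is a toy NumPy sketch (not unsloth's implementation; shapes and the scaling convention follow standard LoRA, with a smaller rank than the card's r=16/alpha=32 but the same alpha/r ratio of 2). All experts' weights live in one 3D tensor, so one pair of 3D adapter tensors covers them all:

```python
import numpy as np

# Toy fused-3D expert LoRA: every expert's projection weight lives in one
# (E, d_in, d_out) tensor, so a single (E, d_in, r) / (E, r, d_out) adapter
# pair reaches all experts without per-expert nn.Linear modules.
E, d_in, d_out, r, alpha = 4, 8, 16, 2, 4
rng = np.random.default_rng(0)

W = rng.normal(size=(E, d_in, d_out))   # frozen base weights (all experts)
A = rng.normal(size=(E, d_in, r))       # LoRA A: random init
B = np.zeros((E, r, d_out))             # LoRA B: zero init (standard LoRA)

def expert_forward(x, e):
    # Adapted weight for expert e: W_e + (alpha/r) * A_e @ B_e
    return x @ (W[e] + (alpha / r) * (A[e] @ B[e]))

x = rng.normal(size=d_in)
# With B zero-initialized the delta vanishes, so the adapted forward
# matches the frozen base exactly -- the standard LoRA starting point.
assert np.allclose(expert_forward(x, 0), x @ W[0])
```

During training only `A` and `B` receive gradients, which is what makes the 2.06% trainable-parameter figure in the Model Details table possible.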
## Evaluation
Evaluated on 100 held-out examples from the cve-backport-codegen-dataset's official eval split, using the same diff-based recall/precision metric as v5 dense. Inference ran at temperature 0 with `max_new_tokens=2048` via unsloth's `FastLanguageModel`; each of two eval workers used one H100 NVL, splitting the eval set across both GPUs for wall-time efficiency.
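The exact scorer isn't reproduced on this card, but a minimal stdlib sketch of a diff-based recall/precision metric in this spirit looks like the following (function names and line-level granularity are assumptions):

```python
import difflib

def changed_lines(before, after):
    """Lines added when turning `before` into `after`, via unified diff."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="")
    return [l[1:] for l in diff if l.startswith("+") and not l.startswith("+++")]

def score(source, expected, generated):
    """Recall: fraction of reference patch lines the model produced.
    Precision: fraction of produced patch lines that are in the reference."""
    want = changed_lines(source, expected)    # reference patch lines
    got = changed_lines(source, generated)    # model patch lines
    recall = sum(l in got for l in want) / len(want) if want else 1.0
    precision = sum(l in want for l in got) / len(got) if got else 1.0
    return recall, precision
```

An exact regeneration of the reference scores (1.0, 1.0); missed fix lines lower recall, while extra unrelated changes lower precision, which is the distinction the tables below rely on.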
### Overall (n=100)
| Metric | Qwen3-Coder MoE (this model) | v5 dense reference |
|---|---|---|
| Avg recall | 91.9% | 93.1% |
| Avg precision | 91.6% | 94.4% |
| Exact match | 87/100 | 83/100 |
| Perfect (recall ≥ 95%) | 90/100 | 90/100 |
| Failures (recall < 10%) | 5/100 | 3/100 |
Same apples-to-apples n=100 methodology as v5. Recall is 1.2 pt below v5 dense, precision is 2.8 pt below, but exact-match count is actually higher (87 vs 83) — the MoE model nails more patches character-for-character even though it has slightly more near-misses that cost a few recall points overall.
### By Tier (n=100)
| Tier | Count | MoE recall | v5 dense recall |
|---|---|---|---|
| Identical (upstream applies as-is) | 85 | 92.1% | 93.7% |
| Adapted (requires modification) | 15 | 90.3% | 90.0% |
Adapted tier is a statistical tie with v5 dense — the MoE model is marginally ahead on the harder tier where structural reasoning matters most. Identical tier is 1.6 pt behind.
### The Training Trajectory (n=20 instrumentation)
During training we instrumented a separate n=20 eval at two intermediate checkpoints to understand what fine-tuning actually does on a pretrained code MoE. The n=20 set is a subsample of the n=100 eval (same sampling step) so mid-training numbers are directly comparable to the n=20 slice of the final n=100 result.
| Stage | n | Recall | Precision | Exact | Failures |
|---|---|---|---|---|---|
| Base model (no fine-tuning) | 20 | 19.8% | 15.8% | 0/20 | 11/20 |
| Step 2800 (31% training) | 20 | 59.4% | 62.1% | 7/20 | 6/20 |
| Step 9042 (final, n=20 slice) | 20 | 90.0% | 90.0% | 18/20 | 2/20 |
| Step 9042 (final, full n=100) | 100 | 91.9% | 91.6% | 87/100 | 5/100 |
The base model starts at 19.8% recall. Despite a low teacher-forced training loss even in early steps, autoregressive generation is poor because the base model doesn't know this task's output convention (bare code, no commentary, no markdown, no explanations). Fine-tuning's first job is to teach that convention, which is visible in the precision column: base precision 16% → mid 62% → final 92%. The precision jump from 16 to 62 in the first 31% of training is almost entirely "stop rambling"; the second half is "find all the changes reliably."
3 of the 11 baseline failures recovered to perfect scores by the final step — examples where both the base model and the mid-training checkpoint emitted zero correct changes, but the fully-trained model produces the exact patch.
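Since the base model's main failure is wrapping answers in commentary or markdown rather than emitting bare code, a caller can defensively strip a stray fence before diffing. A minimal sketch (not part of the published pipeline; the regex is an assumption about typical fenced output):

```python
import re

def strip_fence(text):
    """If the whole output is one markdown code fence, return its body;
    otherwise return the output unchanged (stripped)."""
    m = re.match(r"^```[^\n]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return m.group(1) if m else text.strip()
```

This only rescues the "right code, wrong wrapper" cases; it does nothing for the zero-recall failures discussed next.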
## Failure Analysis
The 5 remaining zero-recall cases at n=100 (2 more than v5 dense) are all on the identical tier and exhibit the same pattern: the model emits output that doesn't relate to the expected patch region at all. Likely causes: unusual patch structure, extremely long source context, or function signatures the base model tokenizes in a way that decouples generation from the input. These are candidates for an agentic retry loop with error feedback.
## Model Details

| Field | Value |
|---|---|
| Base model | unsloth/Qwen3-Coder-30B-A3B-Instruct |
| Architecture | Qwen3MoeForCausalLM (30B total / 3B active, 128 experts, top-8 routing) |
| Method | QLoRA via unsloth (4-bit NF4, double quantization, bf16 compute) |
| LoRA rank / alpha | 16 / 32 |
| LoRA dropout | 0 (required for LoRA on raw nn.Parameter tensors via PEFT's target_parameters) |
| LoRA targets (user-facing) | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA targets (actual, after MoE expansion) | attention Linears + mlp.experts.gate_up_proj, mlp.experts.down_proj (fused 3D tensors per layer) |
| Trainable params | 642,514,944 (2.06% of 31.2B total) |
| Training data | 36,166 train / 100 eval (from the 1,834-example eval split, same sampling as v5 dense) |
| Epochs | 2 (9,042 optimizer steps) |
| Effective batch size | 8 (1 × grad_accum 8) |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Max sequence length | 4,096 tokens |
| Optimizer | adamw_8bit |
| Gradient checkpointing | unsloth mode |
| Hardware | 1× NVIDIA H100 NVL 94GB |
| Training time | 10h 25m (vs v5 dense: 46h on 2× H100) |
| Peak VRAM | ~75GB during training |
| Final train loss | 0.02838 |
| unsloth version | 2026.4.4 |
| transformers | 5.5.0 |
| PEFT | 0.18.1 |
## Files
This repository contains:
- LoRA adapter (`adapter_model.safetensors`, `adapter_config.json`) — ~2.5GB, apply via PEFT
- Tokenizer files (the model's chat template is required — Qwen3 family chat format)
## Reproduction via Teapot
This model was trained via the teapot training pipeline. Reproduction is a four-step sequence once the cve-backport dataset is prepared:
```bash
git clone https://github.com/anicka-net/teapot
cd teapot
pip install -e .
pip install unsloth  # provides FastLanguageModel + fused-3D LoRA

# 1. Compose training data from the cve-backport module
teapot compose configs/cve-backport-qwen3-coder-qlora.config \
    --output train-cve-backport-qwen3-coder.jsonl

# 2. Generate the unsloth launch script
teapot train configs/cve-backport-qwen3-coder-qlora.config \
    --backend unsloth \
    --train-data train-cve-backport-qwen3-coder.jsonl \
    --output train-cve-backport-qwen3-coder.sh

# 3. Train (single GPU; see note below on why)
CUDA_VISIBLE_DEVICES=0 bash train-cve-backport-qwen3-coder.sh

# 4. Final adapter is at
#    output-teapot-cve-backport-qwen3-coder-qlora/final/
```
The teapot config (`configs/cve-backport-qwen3-coder-qlora.config`) pins all the hyperparameters: `r=16`, `alpha=32`, 2 epochs, `lr=1e-4`, `max_length=4096`, `batch=1`, `grad_accum=8`. See the config file for the full declaration.
### Note on single-GPU training

`hardware.gpus: 1` in the config is deliberate. Multi-GPU model parallelism (`device_map="auto"`) across 2× H100 NVL triggers an assertion in `torch._higher_order_ops.flex_attention.create_fw_bw_graph` when tensors are split across devices. A single 94GB H100 fits comfortably (peak ~75GB during training), so this isn't a practical constraint.
## Usage
### With transformers + PEFT + unsloth (recommended)
```python
from unsloth import FastLanguageModel
from peft import PeftModel

base, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B-A3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
    device_map={"": 0},
    attn_implementation="sdpa",  # avoid flex_attention inference bug
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
FastLanguageModel.for_inference(model)
```
### With transformers + PEFT (stock, slower)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    quantization_config=bnb,
    device_map={"": 0},
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")
```
### With the CVE Backport Tool
The recommended way to use this model is via the cve-backport-tool, which handles patch parsing, source extraction, model inference, and diff generation.
## Prompt Template
The chat template is the standard Qwen3 chat format (ChatML-like); `apply_chat_template` with the tokenizer handles this automatically.
The system prompt used during training:
```
You are a security patch backporting assistant.
Given vulnerable source code and a description of the upstream fix,
output the FIXED version of the code.
Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix
```
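Putting the pieces together, a prompt for this model can be assembled as a standard messages list (the system prompt is quoted from above; the user-message layout below is an assumption, since the card doesn't publish the exact training-time field wording):

```python
# System prompt verbatim from the card; user-message layout is hypothetical.
SYSTEM_PROMPT = (
    "You are a security patch backporting assistant.\n"
    "Given vulnerable source code and a description of the upstream fix,\n"
    "output the FIXED version of the code.\n"
    "Rules:\n"
    "- Output ONLY the fixed code, nothing else\n"
    "- Preserve all surrounding context exactly\n"
    "- Apply only the described fix"
)

def build_messages(source_code, fix_description):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{fix_description}\n\n{source_code}"},
    ]
```

The resulting list is passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`, which renders the Qwen3 ChatML-style format for generation.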
## Limitations
- 5 hard failures (5/100) — examples where recall drops to <10%, all on the identical tier. These represent hard edge cases (unusual patch structure, very long source context) and likely need an agentic retry loop with error feedback. v5 dense has 3 such failures on the same eval, so this model is slightly more prone to the catastrophic-output failure mode.
- Precision is ~3 pt below v5 dense: the MoE occasionally produces "partial rambles" that get the right fix but also emit extra unrelated changes. The diff-based metric penalizes these with high recall but low precision. In practice the tool can filter these with a precision threshold.
- No compilation feedback: single-pass generation without verifying the output compiles. Use `--retry` in the CVE backport CLI tool for iterative correction.
- Context window: 4,096-token training limit. Very large functions or cross-file adaptations may be truncated.
- MoE inference requires unsloth or stock transformers 5.x, because the LoRA is attached to fused 3D parameter tensors in the MoE expert blocks. Older transformers versions (<5.0) expect per-expert `nn.Linear` modules and will not load this adapter correctly.
- Always review generated patches before applying to production systems.
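The precision-threshold filter mentioned in the limitations can be sketched as follows (a hedged illustration, not the cve-backport-tool's actual API): since the upstream patch's added lines are available at backport time, the fraction of the model's changed lines that appear among them is computable without the reference backport.

```python
import difflib

def added_lines(old, new):
    """Set of lines added when turning `old` into `new`."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(),
                                lineterm="")
    return {l[1:].strip() for l in diff
            if l.startswith("+") and not l.startswith("+++")}

def passes_precision(source, generated, upstream_added, threshold=0.9):
    """Reject 'partial rambles': outputs whose changes are mostly
    unrelated to the upstream fix's added lines."""
    got = added_lines(source, generated)
    if not got:
        return False  # no change at all: nothing worth applying
    return len(got & set(upstream_added)) / len(got) >= threshold
```

Rejected outputs would then fall through to manual backporting or a retry loop rather than being silently applied.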
## Related
- Dense sibling (openSUSE): openSUSE/CVE-Backport-Qwen2.5-Coder-32B — v5 Qwen2.5-Coder-32B dense, 93.1% recall on n=100 (1.2 pt higher recall, but this MoE model has 4 more exact matches)
- Dense sibling (anicka mirror): anicka/cve-backport-codegen-v5-qwen25-32b
- CLI tool: openSUSE/cve-backport-tool
- Dataset: anicka/cve-backport-codegen-dataset
- Training pipeline: teapot
## Citation

```bibtex
@misc{cve-backport-codegen-v5-qwen3-coder-30b-a3b,
  title={CVE Backport Codegen v5 (MoE): Fine-tuned Qwen3-Coder-30B-A3B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b}
}
```