# CVE Backport Codegen v5 — Qwen3-Coder-30B-A3B QLoRA (MoE)
Fine-tuned code generation model for backporting upstream CVE security fixes to older SUSE/openSUSE package versions. Given vulnerable source code and an upstream fix description, the model outputs the corrected code. A separate tool then diffs the output against the original to produce a patch.
This is the MoE sibling of the dense v5 model, available at openSUSE/CVE-Backport-Qwen2.5-Coder-32B (and mirrored at anicka/cve-backport-codegen-v5-qwen25-32b). Same dataset, same task, same recall target — but built on Qwen3-Coder-30B-A3B's sparse Mixture-of-Experts architecture (30B total params, 3B active, 128 experts, top-8 routing). The result is equivalent quality with roughly 10× faster inference because generation touches only ~3B parameters per token.
## Why MoE?
v5 Qwen2.5-Coder-32B dense works well (93.1% recall on n=100) but is slow to serve in batch CVE backport workflows. Qwen3-Coder-30B-A3B offers the same code specialization with sparse activation, which is a big deal when you need to process hundreds of CVEs in a maintenance cycle.
The open question was whether MoE would train cleanly under QLoRA on a single GPU. It does, thanks to unsloth's dedicated fused-3D expert-parameter LoRA code path (PEFT's `target_parameters=` API applied to `mlp.experts.gate_up_proj` and `mlp.experts.down_proj`), which sidesteps the per-expert `nn.Linear` layout of older transformers while still reaching every expert in the network.
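To see why a fused layout reaches every expert at once, here is a toy NumPy sketch (not unsloth's implementation; shapes and the scaling convention follow standard LoRA, with a smaller rank than the card's r=16/alpha=32 but the same alpha/r ratio of 2). All experts' weights live in one 3D tensor, so one pair of 3D adapter tensors covers them all:

```python
import numpy as np

# Toy fused-3D expert LoRA: every expert's projection weight lives in one
# (E, d_in, d_out) tensor, so a single (E, d_in, r) / (E, r, d_out) adapter
# pair reaches all experts without per-expert nn.Linear modules.
E, d_in, d_out, r, alpha = 4, 8, 16, 2, 4
rng = np.random.default_rng(0)

W = rng.normal(size=(E, d_in, d_out))   # frozen base weights (all experts)
A = rng.normal(size=(E, d_in, r))       # LoRA A: random init
B = np.zeros((E, r, d_out))             # LoRA B: zero init (standard LoRA)

def expert_forward(x, e):
    # Adapted weight for expert e: W_e + (alpha/r) * A_e @ B_e
    return x @ (W[e] + (alpha / r) * (A[e] @ B[e]))

x = rng.normal(size=d_in)
# With B zero-initialized the delta vanishes, so the adapted forward
# matches the frozen base exactly -- the standard LoRA starting point.
assert np.allclose(expert_forward(x, 0), x @ W[0])
```

During training only `A` and `B` receive gradients, which is what makes the 2.06% trainable-parameter figure in the Model Details table possible.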
## Evaluation
Evaluated on 100 held-out examples from the cve-backport-codegen-dataset's official eval split, using the same diff-based recall/precision metric as v5 dense. Inference ran at temperature 0 with `max_new_tokens=2048` via unsloth's `FastLanguageModel`; each of two eval workers used one H100 NVL, splitting the eval set across both GPUs for wall-time efficiency.
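The exact scorer isn't reproduced on this card, but a minimal stdlib sketch of a diff-based recall/precision metric in this spirit looks like the following (function names and line-level granularity are assumptions):

```python
import difflib

def changed_lines(before, after):
    """Lines added when turning `before` into `after`, via unified diff."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="")
    return [l[1:] for l in diff if l.startswith("+") and not l.startswith("+++")]

def score(source, expected, generated):
    """Recall: fraction of reference patch lines the model produced.
    Precision: fraction of produced patch lines that are in the reference."""
    want = changed_lines(source, expected)    # reference patch lines
    got = changed_lines(source, generated)    # model patch lines
    recall = sum(l in got for l in want) / len(want) if want else 1.0
    precision = sum(l in want for l in got) / len(got) if got else 1.0
    return recall, precision
```

An exact regeneration of the reference scores (1.0, 1.0); missed fix lines lower recall, while extra unrelated changes lower precision, which is the distinction the tables below rely on.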
### Overall (n=100)
| Metric | Qwen3-Coder MoE (this model) | v5 dense reference |
|---|---|---|
| Avg recall | 91.9% | 93.1% |
| Avg precision | 91.6% | 94.4% |
| Exact match | 87/100 | 83/100 |
| Perfect (recall ≥ 95%) | 90/100 | 90/100 |
| Failures (recall < 10%) | 5/100 | 3/100 |
Same apples-to-apples n=100 methodology as v5. Recall is 1.2 pt below v5 dense, precision is 2.8 pt below, but exact-match count is actually higher (87 vs 83) — the MoE model nails more patches character-for-character even though it has slightly more near-misses that cost a few recall points overall.
### By Tier (n=100)
| Tier | Count | MoE recall | v5 dense recall |
|---|---|---|---|
| Identical (upstream applies as-is) | 85 | 92.1% | 93.7% |
| Adapted (requires modification) | 15 | 90.3% | 90.0% |
Adapted tier is a statistical tie with v5 dense — the MoE model is marginally ahead on the harder tier where structural reasoning matters most. Identical tier is 1.6 pt behind.
### The Training Trajectory (n=20 instrumentation)
During training we instrumented a separate n=20 eval at two intermediate checkpoints to understand what fine-tuning actually does on a pretrained code MoE. The n=20 set is a subsample of the n=100 eval (same sampling step) so mid-training numbers are directly comparable to the n=20 slice of the final n=100 result.
| Stage | n | Recall | Precision | Exact | Failures |
|---|---|---|---|---|---|
| Base model (no fine-tuning) | 20 | 19.8% | 15.8% | 0/20 | 11/20 |
| Step 2800 (31% training) | 20 | 59.4% | 62.1% | 7/20 | 6/20 |
| Step 9042 (final, n=20 slice) | 20 | 90.0% | 90.0% | 18/20 | 2/20 |
| Step 9042 (final, full n=100) | 100 | 91.9% | 91.6% | 87/100 | 5/100 |
The base model starts at 19.8% recall. Despite a low teacher-forced training loss even in early steps, autoregressive generation is poor because the base model doesn't know this task's output convention (bare code, no commentary, no markdown, no explanations). Fine-tuning's first job is to teach that convention, which is visible in the precision column: base precision 16% → mid 62% → final 92%. The precision jump from 16 to 62 in the first 31% of training is almost entirely "stop rambling"; the second half is "find all the changes reliably."
3 of the 11 baseline failures recovered to perfect scores by the final step — examples where both the base model and the mid-training checkpoint emitted zero correct changes, but the fully-trained model produces the exact patch.
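Since the base model's main failure is wrapping answers in commentary or markdown rather than emitting bare code, a caller can defensively strip a stray fence before diffing. A minimal sketch (not part of the published pipeline; the regex is an assumption about typical fenced output):

```python
import re

def strip_fence(text):
    """If the whole output is one markdown code fence, return its body;
    otherwise return the output unchanged (stripped)."""
    m = re.match(r"^```[^\n]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return m.group(1) if m else text.strip()
```

This only rescues the "right code, wrong wrapper" cases; it does nothing for the zero-recall failures discussed next.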
## Failure Analysis
The 5 remaining zero-recall cases at n=100 (2 more than v5 dense) are all on the identical tier and exhibit the same pattern: the model emits output that doesn't relate to the expected patch region at all. Likely causes: unusual patch structure, extremely long source context, or function signatures the base model tokenizes in a way that decouples generation from the input. These are candidates for an agentic retry loop with error feedback.
## Model Details

| Field | Value |
|---|---|
| Base model | unsloth/Qwen3-Coder-30B-A3B-Instruct |
| Architecture | Qwen3MoeForCausalLM (30B total / 3B active, 128 experts, top-8 routing) |
| Method | QLoRA via unsloth (4-bit NF4, double quantization, bf16 compute) |
| LoRA rank / alpha | 16 / 32 |
| LoRA dropout | 0 (required for LoRA on raw nn.Parameter tensors via PEFT's target_parameters) |
| LoRA targets (user-facing) | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA targets (actual, after MoE expansion) | attention Linears + mlp.experts.gate_up_proj, mlp.experts.down_proj (fused 3D tensors per layer) |
| Trainable params | 642,514,944 (2.06% of 31.2B total) |
| Training data | 36,166 train / 100 eval (from the 1,834-example eval split, same sampling as v5 dense) |
| Epochs | 2 (9,042 optimizer steps) |
| Effective batch size | 8 (1 × grad_accum 8) |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Max sequence length | 4,096 tokens |
| Optimizer | adamw_8bit |
| Gradient checkpointing | unsloth mode |
| Hardware | 1× NVIDIA H100 NVL 94GB |
| Training time | 10h 25m (vs v5 dense: 46h on 2× H100) |
| Peak VRAM | ~75GB during training |
| Final train loss | 0.02838 |
| unsloth version | 2026.4.4 |
| transformers | 5.5.0 |
| PEFT | 0.18.1 |
## Files
This repository contains:
- LoRA adapter (`adapter_model.safetensors`, `adapter_config.json`) — ~2.5GB, apply via PEFT
- Tokenizer files (the model's chat template is required — Qwen3 family chat format)
## Reproduction via Teapot
This model was trained via the teapot training pipeline. Reproduction is a four-step sequence once the cve-backport dataset is prepared:
```bash
git clone https://github.com/anicka-net/teapot
cd teapot
pip install -e .
pip install unsloth  # provides FastLanguageModel + fused-3D LoRA

# 1. Compose training data from the cve-backport module
teapot compose configs/cve-backport-qwen3-coder-qlora.config \
    --output train-cve-backport-qwen3-coder.jsonl

# 2. Generate the unsloth launch script
teapot train configs/cve-backport-qwen3-coder-qlora.config \
    --backend unsloth \
    --train-data train-cve-backport-qwen3-coder.jsonl \
    --output train-cve-backport-qwen3-coder.sh

# 3. Train (single GPU; see note below on why)
CUDA_VISIBLE_DEVICES=0 bash train-cve-backport-qwen3-coder.sh

# 4. Final adapter is at
#    output-teapot-cve-backport-qwen3-coder-qlora/final/
```
The teapot config (`configs/cve-backport-qwen3-coder-qlora.config`) pins all the hyperparameters: `r=16`, `alpha=32`, 2 epochs, `lr=1e-4`, `max_length=4096`, `batch=1`, `grad_accum=8`. See the config file for the full declaration.
### Note on single-GPU training

`hardware.gpus: 1` in the config is deliberate. Multi-GPU model parallelism (`device_map="auto"`) across 2× H100 NVL triggers an assertion in `torch._higher_order_ops.flex_attention.create_fw_bw_graph` when tensors are split across devices. A single 94GB H100 fits comfortably (peak ~75GB during training), so this isn't a practical constraint.
## Usage
### With transformers + PEFT + unsloth (recommended)
```python
from unsloth import FastLanguageModel
from peft import PeftModel

base, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B-A3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
    device_map={"": 0},
    attn_implementation="sdpa",  # avoid flex_attention inference bug
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
FastLanguageModel.for_inference(model)
```
### With transformers + PEFT (stock, slower)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    quantization_config=bnb,
    device_map={"": 0},
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")
```
### With the CVE Backport Tool
The recommended way to use this model is via the cve-backport-tool, which handles patch parsing, source extraction, model inference, and diff generation.
## Prompt Template
The chat template is the standard Qwen3 chat format (ChatML-like); `apply_chat_template` with the tokenizer handles this automatically.
The system prompt used during training:
```
You are a security patch backporting assistant.
Given vulnerable source code and a description of the upstream fix,
output the FIXED version of the code.
Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix
```
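Putting the pieces together, a prompt for this model can be assembled as a standard messages list (the system prompt is quoted from above; the user-message layout below is an assumption, since the card doesn't publish the exact training-time field wording):

```python
# System prompt verbatim from the card; user-message layout is hypothetical.
SYSTEM_PROMPT = (
    "You are a security patch backporting assistant.\n"
    "Given vulnerable source code and a description of the upstream fix,\n"
    "output the FIXED version of the code.\n"
    "Rules:\n"
    "- Output ONLY the fixed code, nothing else\n"
    "- Preserve all surrounding context exactly\n"
    "- Apply only the described fix"
)

def build_messages(source_code, fix_description):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{fix_description}\n\n{source_code}"},
    ]
```

The resulting list is passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`, which renders the Qwen3 ChatML-style format for generation.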
## Limitations
- 5 hard failures (5/100) — examples where recall drops to <10%, all on the identical tier. These represent hard edge cases (unusual patch structure, very long source context) and likely need an agentic retry loop with error feedback. v5 dense has 3 such failures on the same eval, so this model is slightly more prone to the catastrophic-output failure mode.
- Precision is ~3 pt below v5 dense: the MoE occasionally produces "partial rambles" that get the right fix but also emit extra unrelated changes. The diff-based metric penalizes these with high recall but low precision. In practice the tool can filter these with a precision threshold.
- No compilation feedback: single-pass generation without verifying the output compiles. Use `--retry` in the CVE backport CLI tool for iterative correction.
- Context window: 4,096-token training limit. Very large functions or cross-file adaptations may be truncated.
- MoE inference requires unsloth or stock transformers 5.x, because the LoRA is attached to fused 3D parameter tensors in the MoE expert blocks. Older transformers versions (<5.0) expect per-expert `nn.Linear` modules and will not load this adapter correctly.
- Always review generated patches before applying to production systems.
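The precision-threshold filter mentioned in the limitations can be sketched as follows (a hedged illustration, not the cve-backport-tool's actual API): since the upstream patch's added lines are available at backport time, the fraction of the model's changed lines that appear among them is computable without the reference backport.

```python
import difflib

def added_lines(old, new):
    """Set of lines added when turning `old` into `new`."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(),
                                lineterm="")
    return {l[1:].strip() for l in diff
            if l.startswith("+") and not l.startswith("+++")}

def passes_precision(source, generated, upstream_added, threshold=0.9):
    """Reject 'partial rambles': outputs whose changes are mostly
    unrelated to the upstream fix's added lines."""
    got = added_lines(source, generated)
    if not got:
        return False  # no change at all: nothing worth applying
    return len(got & set(upstream_added)) / len(got) >= threshold
```

Rejected outputs would then fall through to manual backporting or a retry loop rather than being silently applied.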
## Related
- Dense sibling (openSUSE): openSUSE/CVE-Backport-Qwen2.5-Coder-32B — v5 Qwen2.5-Coder-32B dense, 93.1% recall on n=100 (1.2 pt higher recall, but this MoE model has 4 more exact matches)
- Dense sibling (anicka mirror): anicka/cve-backport-codegen-v5-qwen25-32b
- CLI tool: openSUSE/cve-backport-tool
- Dataset: anicka/cve-backport-codegen-dataset
- Training pipeline: teapot
## Citation

```bibtex
@misc{cve-backport-codegen-v5-qwen3-coder-30b-a3b,
  title={CVE Backport Codegen v5 (MoE): Fine-tuned Qwen3-Coder-30B-A3B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b}
}
```