CVE Backport Codegen v5 — Qwen3-Coder-30B-A3B QLoRA (MoE)

Fine-tuned code generation model for backporting upstream CVE security fixes to older SUSE/openSUSE package versions. Given vulnerable source code and an upstream fix description, the model outputs the corrected code. A separate tool then diffs the output against the original to produce a patch.

This is the MoE sibling of the dense v5 model, available at openSUSE/CVE-Backport-Qwen2.5-Coder-32B (and mirrored at anicka/cve-backport-codegen-v5-qwen25-32b). Same dataset, same task, same recall target — but built on Qwen3-Coder-30B-A3B's sparse Mixture-of-Experts architecture (30B total params, 3B active, 128 experts, top-8 routing). The result is equivalent quality with roughly 10× faster inference because generation touches only ~3B parameters per token.
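The sparse-activation arithmetic (128 experts, top-8 routing, ~3B of 30B params active per token) can be illustrated with a toy top-k router. This is a hypothetical numpy sketch with made-up dimensions, not Qwen3's actual routing code:

```python
import numpy as np

def top_k_route(router_logits, k=8):
    """Toy top-k gating: select the k highest-scoring experts for a token
    and renormalize their softmax weights. Illustrative only."""
    idx = np.argsort(router_logits)[-k:]                      # k best experts
    w = np.exp(router_logits[idx] - router_logits[idx].max())  # stable softmax
    return idx, w / w.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=128)          # one router score per expert
experts, weights = top_k_route(logits, k=8)
# Only 8 of the 128 expert FFNs run for this token; their outputs are
# mixed with `weights` -- which is why only ~3B of 30B params are active.
```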

Why MoE?

v5 Qwen2.5-Coder-32B dense works well (93.1% recall on n=100) but is slow to serve in batch CVE backport workflows. Qwen3-Coder-30B-A3B offers the same code specialization with sparse activation, which is a big deal when you need to process hundreds of CVEs in a maintenance cycle.

The open question was whether MoE would train cleanly under QLoRA on a single GPU. It does, thanks to unsloth's dedicated fused-3D expert parameter LoRA code path (PEFT's target_parameters= API applied to mlp.experts.gate_up_proj and mlp.experts.down_proj), which sidesteps the per-expert nn.Linear layout of older transformers while still reaching every expert in the network.
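The fused-expert layout can be pictured as follows: each MoE layer stores all experts' weights in a single 3D tensor, and LoRA attaches a pair of 3D low-rank factors to that tensor, one slice per expert. This is a numpy sketch with made-up dimensions, not unsloth's actual kernel:

```python
import numpy as np

# Hypothetical dimensions: 128 experts, hidden 512, FFN 256, LoRA rank 16
E, d_in, d_out, r = 128, 512, 256, 16
alpha = 32

W = np.zeros((E, d_in, d_out))           # fused 3D expert weight (e.g. experts.down_proj)
A = np.random.randn(E, d_in, r) * 0.01   # LoRA A factor, one slice per expert
B = np.zeros((E, r, d_out))              # LoRA B factor, zero-initialized as usual

delta = (alpha / r) * (A @ B)            # batched matmul over the expert axis
W_eff = W + delta                        # every expert receives its own low-rank update
```

Because the factors share the leading expert axis, a single batched matmul updates all 128 experts at once instead of looping over per-expert nn.Linear modules.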

Evaluation

Evaluated on 100 held-out examples from the cve-backport-codegen-dataset's official eval split, using the same diff-based recall/precision metric as v5 dense. Inference at temperature 0, max_new_tokens 2048, via unsloth's FastLanguageModel. Each eval worker runs on a single H100 NVL; the eval set was split across both H100s via two workers for wall-time efficiency.
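As a rough picture of the metric, here is a simplified line-level sketch; the real harness diffs structured patches, and the function below is a made-up approximation:

```python
def diff_metrics(original: str, expected: str, generated: str):
    """Approximate diff-based recall/precision: compare the sets of lines
    each version adds relative to the original vulnerable source."""
    orig = set(original.splitlines())
    expected_changes = set(expected.splitlines()) - orig
    generated_changes = set(generated.splitlines()) - orig
    tp = len(expected_changes & generated_changes)
    recall = tp / len(expected_changes) if expected_changes else 1.0
    precision = tp / len(generated_changes) if generated_changes else 1.0
    return recall, precision

# A correct fix plus one spurious extra line: full recall, half precision.
r, p = diff_metrics("a\nb\nc", "a\nB\nc", "a\nB\nc\nX")  # r == 1.0, p == 0.5
```

This is the shape of failure the Limitations section calls a "partial ramble": the fix is present (high recall) but extra changes drag precision down.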

Overall (n=100)

| Metric | Qwen3-Coder MoE (this model) | v5 dense reference |
|---|---|---|
| Avg recall | 91.9% | 93.1% |
| Avg precision | 91.6% | 94.4% |
| Exact match | 87/100 | 83/100 |
| Perfect (recall ≥ 95%) | 90/100 | 90/100 |
| Failures (recall < 10%) | 5/100 | 3/100 |

Same apples-to-apples n=100 methodology as v5. Recall is 1.2 pt below v5 dense, precision is 2.8 pt below, but exact-match count is actually higher (87 vs 83) — the MoE model nails more patches character-for-character even though it has slightly more near-misses that cost a few recall points overall.

By Tier (n=100)

| Tier | Count | MoE recall | v5 dense recall |
|---|---|---|---|
| Identical (upstream applies as-is) | 85 | 92.1% | 93.7% |
| Adapted (requires modification) | 15 | 90.3% | 90.0% |

Adapted tier is a statistical tie with v5 dense — the MoE model is marginally ahead on the harder tier where structural reasoning matters most. Identical tier is 1.6 pt behind.

The Training Trajectory (n=20 instrumentation)

During training we instrumented a separate n=20 eval at two intermediate checkpoints to understand what fine-tuning actually does on a pretrained code MoE. The n=20 set is a subsample of the n=100 eval (same sampling step) so mid-training numbers are directly comparable to the n=20 slice of the final n=100 result.

| Stage | n | Recall | Precision | Exact | Failures |
|---|---|---|---|---|---|
| Base model (no fine-tuning) | 20 | 19.8% | 15.8% | 0/20 | 11/20 |
| Step 2800 (31% of training) | 20 | 59.4% | 62.1% | 7/20 | 6/20 |
| Step 9042 (final, n=20 slice) | 20 | 90.0% | 90.0% | 18/20 | 2/20 |
| Step 9042 (final, full n=100) | 100 | 91.9% | 91.6% | 87/100 | 5/100 |

The base model starts at 19.8% recall. Despite a low teacher-forced training loss even in early steps, autoregressive generation is poor because the base model doesn't know this task's output convention (bare code, no commentary, no markdown, no explanations). Fine-tuning's first job is to teach that convention, which is visible in the precision column: base precision 16% → mid 62% → final 92%. The precision jump from 16 to 62 in the first 31% of training is almost entirely "stop rambling"; the second half is "find all the changes reliably."

3 of the 11 baseline failures recovered to perfect scores by the final step — examples where both the base model and the mid-training checkpoint emitted zero correct changes, but the fully-trained model produces the exact patch.

Failure Analysis

The 5 remaining zero-recall cases at n=100 (2 more than v5 dense) are all on the identical tier and exhibit the same pattern: the model emits output that doesn't relate to the expected patch region at all. Likely causes: unusual patch structure, extremely long source context, or function signatures the base model tokenizes in a way that decouples generation from the input. These are candidates for an agentic retry loop with error feedback.

Model Details

| Field | Value |
|---|---|
| Base model | unsloth/Qwen3-Coder-30B-A3B-Instruct |
| Architecture | Qwen3MoeForCausalLM (30B total / 3B active, 128 experts, top-8 routing) |
| Method | QLoRA via unsloth (4-bit NF4, double quantization, bf16 compute) |
| LoRA rank / alpha | 16 / 32 |
| LoRA dropout | 0 (required for LoRA on raw nn.Parameter tensors via PEFT's target_parameters) |
| LoRA targets (user-facing) | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA targets (actual, after MoE expansion) | attention Linears + mlp.experts.gate_up_proj, mlp.experts.down_proj (fused 3D tensors per layer) |
| Trainable params | 642,514,944 (2.06% of 31.2B total) |
| Training data | 36,166 train / 100 eval (sampled from the 1,834-example eval split, same sampling as v5 dense) |
| Epochs | 2 (9,042 optimizer steps) |
| Effective batch size | 8 (1 × grad_accum 8) |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Max sequence length | 4,096 tokens |
| Optimizer | adamw_8bit |
| Gradient checkpointing | unsloth mode |
| Hardware | 1× NVIDIA H100 NVL 94GB |
| Training time | 10h 25m (vs v5 dense: 46h on 2× H100) |
| Peak VRAM | ~75GB during training |
| Final train loss | 0.02838 |
| unsloth version | 2026.4.4 |
| transformers | 5.5.0 |
| PEFT | 0.18.1 |
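The learning-rate schedule (1e-4 peak, 5% linear warmup, cosine decay over 9,042 optimizer steps) can be sketched as follows. This is an illustrative reimplementation, not the trainer's actual scheduler code:

```python
import math

def lr_at(step, total_steps=9042, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warmup for the first 5% of steps, then cosine decay to zero."""
    warmup = int(total_steps * warmup_frac)   # 452 warmup steps here
    if step < warmup:
        return peak_lr * step / warmup        # linear ramp 0 -> peak
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))  # peak -> 0
```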

Files

This repository contains:

  • LoRA adapter (adapter_model.safetensors, adapter_config.json) — ~2.5GB, apply via PEFT
  • Tokenizer files (the model's chat template is required — Qwen3 family chat format)

Reproduction via Teapot

This model was trained via the teapot training pipeline. Once the cve-backport dataset is prepared, reproduction is a four-step sequence: compose the training data, generate the launch script, train, and collect the adapter:

git clone https://github.com/anicka-net/teapot
cd teapot
pip install -e .
pip install unsloth  # provides FastLanguageModel + fused-3D LoRA

# 1. Compose training data from the cve-backport module
teapot compose configs/cve-backport-qwen3-coder-qlora.config \
    --output train-cve-backport-qwen3-coder.jsonl

# 2. Generate the unsloth launch script
teapot train configs/cve-backport-qwen3-coder-qlora.config \
    --backend unsloth \
    --train-data train-cve-backport-qwen3-coder.jsonl \
    --output train-cve-backport-qwen3-coder.sh

# 3. Train (single GPU; see note below on why)
CUDA_VISIBLE_DEVICES=0 bash train-cve-backport-qwen3-coder.sh

# 4. Final adapter is at
#    output-teapot-cve-backport-qwen3-coder-qlora/final/

The teapot config (configs/cve-backport-qwen3-coder-qlora.config) pins all the hyperparameters: r=16, alpha=32, 2 epochs, lr=1e-4, max_length=4096, batch=1, grad_accum=8. See the config file for the full declaration.

Note on single-GPU

hardware.gpus: 1 in the config is deliberate. Multi-GPU model parallelism (device_map="auto") across 2× H100 NVL triggers an assertion in torch._higher_order_ops.flex_attention.create_fw_bw_graph when tensors are split across devices. The single 94GB H100 fits comfortably (peak ~75GB during training) so this isn't a practical constraint.

Usage

With transformers + PEFT + unsloth (recommended)

from unsloth import FastLanguageModel
from peft import PeftModel

base, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B-A3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
    device_map={"": 0},
    attn_implementation="sdpa",  # avoid flex_attention inference bug
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
FastLanguageModel.for_inference(model)

With transformers + PEFT (stock, slower)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    quantization_config=bnb,
    device_map={"": 0},
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")

With the CVE Backport Tool

The recommended way to use this model is via the cve-backport-tool, which handles patch parsing, source extraction, model inference, and diff generation.
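The diff step the tool performs after generation (turning the model's fixed source back into a patch) can be approximated with Python's stdlib. This is a sketch; cve-backport-tool's actual patch formatting may differ:

```python
import difflib

def to_patch(original: str, fixed: str, path: str) -> str:
    """Produce a unified diff between the original vulnerable source
    and the model's fixed output, in git-style a/ b/ path convention."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"a/{path}", tofile=f"b/{path}",
    ))

patch = to_patch("int x = 0;\n", "int x = 1;\n", "src/foo.c")
# patch contains "-int x = 0;" and "+int x = 1;" hunk lines
```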

Prompt Template

The chat template is the standard Qwen3 chat format (ChatML-like). apply_chat_template with the tokenizer handles this automatically. The system prompt used during training:

You are a security patch backporting assistant.

Given vulnerable source code and a description of the upstream fix,
output the FIXED version of the code.

Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix

Limitations

  • 5 catastrophic failures (5/100) — examples where recall drops below 10%, all on the identical tier. These are hard edge cases (unusual patch structure, very long source context) and likely need an agentic retry loop with error feedback. v5 dense has 3 such failures on the same eval, so this model is slightly more prone to this failure mode.
  • Precision is ~3 pt below v5 dense: the MoE occasionally produces "partial rambles" that get the right fix but also emit extra unrelated changes. The diff-based metric penalizes these with high recall but low precision. In practice the tool can filter these with a precision threshold.
  • No compilation feedback: single-pass generation without verifying the output compiles. Use --retry in the CVE backport CLI tool for iterative correction.
  • Context window: 4,096 token training limit. Very large functions or cross-file adaptations may be truncated.
  • MoE inference requires unsloth or stock transformers 5.x, because the LoRA is attached to fused 3D parameter tensors in the MoE expert blocks. Older transformers versions (<5.0) expect per-expert nn.Linear modules and will not load this adapter correctly.
  • Always review generated patches before applying to production systems.

Citation

@misc{cve-backport-codegen-v5-qwen3-coder-30b-a3b,
  title={CVE Backport Codegen v5 (MoE): Fine-tuned Qwen3-Coder-30B-A3B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b}
}