Roblox Luau Mistral 7B RFT (Reinforcement Fine-Tuned)
A reinforcement fine-tuned LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the SFT version by training on best-of-N candidates selected via a hybrid reward signal that combines deterministic code scorers with Claude-as-judge evaluation.
Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.
Why RFT?
Standard SFT trains on static (task, code) pairs. RFT goes further: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop where the model learns from its own best outputs.
SFT Model → Generate N candidates per task
        ↓
Score each candidate
(4 deterministic scorers + Claude judge)
        ↓
Keep best candidate per task (score ≥ 0.70)
        ↓
Train on SFT data + best candidates → RFT Model
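The selection step of this loop can be sketched as a small function; `reward` here is a stub standing in for the hybrid scorer described below, and the toy scoring rule is purely illustrative:

```python
# Hypothetical sketch of best-of-N selection: keep the top-scoring
# candidate per task only if it clears the acceptance threshold.

def select_best(candidates, reward, threshold=0.70):
    """Return the highest-scoring candidate, or None if none clears the bar."""
    scored = [(reward(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None

# Toy demo: the stub reward is just normalized length.
candidates = ["short", "a medium one", "the longest candidate"]
reward = lambda code: min(len(code) / 20, 1.0)
best = select_best(candidates, reward)
print(best)  # → "the longest candidate"
```

Tasks whose best candidate falls below the threshold contribute nothing to the next training round, which keeps low-quality generations out of the data.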
Training
Stage 1: SFT Data Collection
Same as the SFT model:
- Reverse-labeled the-luau-stack examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers
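A deterministic scorer of this kind can be sketched as a heuristic check; the specific rules and penalties below are illustrative assumptions, not the project's actual implementation:

```python
# Toy syntax scorer: penalize Python-isms and unbalanced brackets in Luau
# code. The token list and the 0.2 penalty per violation are assumptions.

PYTHONISMS = ("def ", "elif ", "import ", "!=", "None")

def syntax_score(code: str) -> float:
    score = 1.0
    for token in PYTHONISMS:
        if token in code:
            score -= 0.2  # Luau uses `function`/`end`, `~=`, `nil`, etc.
    for open_c, close_c in ("()", "{}", "[]"):
        if code.count(open_c) != code.count(close_c):
            score -= 0.2
    return max(score, 0.0)

good = 'local Players = game:GetService("Players")\nprint(#Players:GetPlayers())'
bad = "def update(: pass"
print(syntax_score(good), syntax_score(bad))
```

Because checks like these are cheap and deterministic, they can filter thousands of examples without any model calls.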
Stage 2: Candidate Generation
The SFT model generated 4 candidates per task for 50 tasks (200 total candidates) using temperature sampling (T=0.8, top_p=0.95).
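The sampling settings can be illustrated on a hand-made distribution; this is not the generation code itself, just the temperature-plus-nucleus filtering math it relies on:

```python
# Toy illustration of T=0.8 temperature scaling followed by top_p=0.95
# nucleus filtering, applied to a made-up 4-token logit vector.
import math

def sample_filter(logits, temperature=0.8, top_p=0.95):
    """Temperature-scale logits, then keep the smallest set of tokens
    whose cumulative probability reaches top_p (nucleus sampling)."""
    scaled = [l / temperature for l in logits]
    z = sum(math.exp(s) for s in scaled)
    probs = sorted(((math.exp(s) / z, i) for i, s in enumerate(scaled)), reverse=True)
    kept, cum = [], 0.0
    for p, i in probs:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    return kept

print(sample_filter([3.0, 2.0, 0.1, -2.0]))  # → [0, 1]
```

Lowering the temperature below 1 sharpens the distribution, while top_p trims the unlikely tail, so candidates stay diverse without becoming incoherent.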
Stage 3: Hybrid Reward Scoring
Each candidate was scored using a hybrid signal:
| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | GetService(), no deprecated APIs, valid services |
| Bug scorer | 10% | pcall wrapping, nil checks, yield in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| Claude judge | 60% | Functionality, correctness, completeness (LLM-as-judge) |
Combined score = deterministic (40%) + Claude judge (60%)
Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
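The weighting above reduces to a simple weighted sum; the scorer outputs in this sketch are stubbed values for demonstration:

```python
# Hybrid reward: four deterministic scorers at 10% each (40% total)
# plus the Claude judge at 60%, per the table above.

WEIGHTS = {"syntax": 0.10, "api": 0.10, "bug": 0.10, "quality": 0.10, "judge": 0.60}

def hybrid_reward(scores: dict) -> float:
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Stubbed per-scorer outputs for one candidate:
candidate = {"syntax": 0.95, "api": 0.90, "bug": 0.85, "quality": 0.80, "judge": 0.92}
r = hybrid_reward(candidate)
print(round(r, 3), r >= 0.70)  # kept only if it clears the 0.70 threshold
```

Giving the LLM judge the majority of the weight lets it arbitrate on functionality, while the deterministic 40% anchors the score against judge noise.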
Stage 4: RFT Training
| Parameter | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (Γ8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |
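The table maps onto a configuration roughly like the following. This is a sketch in plain dicts; the key names mirror common peft/transformers conventions, but the actual training script is not published:

```python
# Hyperparameters from the table above, expressed as config dicts.

lora = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "lora_dropout": 0.05,
}

train = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.5e-4,
    "max_seq_length": 8192,
    "bf16": True,
    "gradient_checkpointing": True,
}

# Effective batch size seen by the optimizer:
effective_batch = train["per_device_train_batch_size"] * train["gradient_accumulation_steps"]
print(effective_batch)  # 8
```

Gradient accumulation trades wall-clock time for memory here: a per-device batch of 1 with 8 accumulation steps yields optimizer updates equivalent to batch size 8, which is what makes QLoRA on 8K-token sequences feasible.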
Results: SFT → RFT Improvement
| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | 0.95 | +0.03 |
| API Correctness | 0.88 | 0.93 | +0.05 |
| Bug-Free | 0.85 | 0.91 | +0.06 |
| Code Quality | 0.82 | 0.88 | +0.06 |
| Composite | 0.87 | 0.92 | +0.05 |
The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality, the areas where the Claude judge provided the most signal beyond the deterministic scorers.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model, then attach the RFT LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
With vLLM (recommended for serving)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-lora \
  --lora-modules \
    sft=squaredcuber/roblox-luau-mistral-7b-2 \
    rft=squaredcuber/roblox-luau-mistral-7b-rft \
  --max-lora-rank 64
```
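With the server running, a request selects an adapter by its LoRA module name in the `model` field of a standard OpenAI-style chat completion request. A sketch of the request body (host and port are the vLLM defaults; adjust to your deployment):

```python
# Build the JSON body for vLLM's OpenAI-compatible chat completions endpoint.
import json

payload = {
    "model": "rft",  # or "sft" to query the pre-RFT adapter
    "messages": [
        {"role": "system", "content": "You are an expert Roblox Luau programmer."},
        {"role": "user", "content": "Write a Luau script that teleports a player to spawn."},
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
print(json.loads(body)["model"])  # rft
```

Serving both adapters from one base model this way makes side-by-side SFT-vs-RFT comparisons cheap, since only the small LoRA weights differ.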
Agentic Pipeline
This model powers the agentic Roblox Studio assistant, a self-correcting code generation pipeline:
- Generate → the RFT model produces Luau code from a task description
- Score → 4 deterministic scorers evaluate the output in real time
- Self-correct → if the score is < 0.85, the model rewrites the code using the scorer feedback
- Insert → code is sent directly to Roblox Studio via a companion plugin
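The generate/score/self-correct loop above can be sketched as follows; `generate` and `score` are stubs standing in for the RFT model and the deterministic scorers, while the 0.85 threshold matches the pipeline description:

```python
# Sketch of the self-correction loop: regenerate with scorer feedback
# until the code clears the quality threshold or rounds run out.

def generate_with_correction(task, generate, score, threshold=0.85, max_rounds=3):
    feedback = None
    code = generate(task, feedback)
    for _ in range(max_rounds):
        s, feedback = score(code)
        if s >= threshold:
            break
        code = generate(task, feedback)  # rewrite using the scorer feedback
    return code

# Toy stubs: each rewrite appends a fix, and the stub score rewards fixes.
attempts = iter(["draft", "draft -- add nil checks", "draft -- nil checks -- pcall added"])
generate = lambda task, fb: next(attempts)
score = lambda code: (0.6 + 0.15 * code.count("--"), "add nil checks and pcall")
print(generate_with_correction("teleport players", generate, score))
```

Bounding the loop with `max_rounds` keeps latency predictable even when a task never clears the threshold.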
Intended Use
- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)
Limitations
- Trained on Mistral-7B base; larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent
Evaluation results (self-reported)
- Syntax Score: 0.950
- API Correctness: 0.930
- Bug-Free Score: 0.910
- Quality Score: 0.880
- Composite Score: 0.920