Roblox Luau Mistral 7B RFT (Reinforcement Fine-Tuned)
A reinforcement fine-tuned LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the SFT version by training on best-of-N candidates selected via a hybrid reward signal that combines deterministic code scorers with Claude-as-judge evaluation.
Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.
Why RFT?
Standard SFT trains on static (task, code) pairs. RFT goes further: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop where the model learns from its own best outputs.
SFT Model → Generate N candidates per task
        ↓
Score each candidate
(4 deterministic scorers + Claude judge)
        ↓
Keep best candidate per task (score ≥ 0.70)
        ↓
Train on SFT data + best candidates → RFT Model
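The selection step of this loop can be sketched as a small function; `reward` here is a stub standing in for the hybrid scorer described below, and the toy scoring rule is purely illustrative:

```python
# Hypothetical sketch of best-of-N selection: keep the top-scoring
# candidate per task only if it clears the acceptance threshold.

def select_best(candidates, reward, threshold=0.70):
    """Return the highest-scoring candidate, or None if none clears the bar."""
    scored = [(reward(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None

# Toy demo: the stub reward is just normalized length.
candidates = ["short", "a medium one", "the longest candidate"]
reward = lambda code: min(len(code) / 20, 1.0)
best = select_best(candidates, reward)
print(best)  # → "the longest candidate"
```

Tasks whose best candidate falls below the threshold contribute nothing to the next training round, which keeps low-quality generations out of the data.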
Training
Stage 1: SFT Data Collection
Same as the SFT model:
- Reverse-labeled the-luau-stack examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers
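A deterministic scorer of this kind can be sketched as a heuristic check; the specific rules and penalties below are illustrative assumptions, not the project's actual implementation:

```python
# Toy syntax scorer: penalize Python-isms and unbalanced brackets in Luau
# code. The token list and the 0.2 penalty per violation are assumptions.

PYTHONISMS = ("def ", "elif ", "import ", "!=", "None")

def syntax_score(code: str) -> float:
    score = 1.0
    for token in PYTHONISMS:
        if token in code:
            score -= 0.2  # Luau uses `function`/`end`, `~=`, `nil`, etc.
    for open_c, close_c in ("()", "{}", "[]"):
        if code.count(open_c) != code.count(close_c):
            score -= 0.2
    return max(score, 0.0)

good = 'local Players = game:GetService("Players")\nprint(#Players:GetPlayers())'
bad = "def update(: pass"
print(syntax_score(good), syntax_score(bad))
```

Because checks like these are cheap and deterministic, they can filter thousands of examples without any model calls.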
Stage 2: Candidate Generation
The SFT model generated 4 candidates per task for 50 tasks (200 total candidates) using temperature sampling (T=0.8, top_p=0.95).
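The sampling settings can be illustrated on a hand-made distribution; this is not the generation code itself, just the temperature-plus-nucleus filtering math it relies on:

```python
# Toy illustration of T=0.8 temperature scaling followed by top_p=0.95
# nucleus filtering, applied to a made-up 4-token logit vector.
import math

def sample_filter(logits, temperature=0.8, top_p=0.95):
    """Temperature-scale logits, then keep the smallest set of tokens
    whose cumulative probability reaches top_p (nucleus sampling)."""
    scaled = [l / temperature for l in logits]
    z = sum(math.exp(s) for s in scaled)
    probs = sorted(((math.exp(s) / z, i) for i, s in enumerate(scaled)), reverse=True)
    kept, cum = [], 0.0
    for p, i in probs:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    return kept

print(sample_filter([3.0, 2.0, 0.1, -2.0]))  # → [0, 1]
```

Lowering the temperature below 1 sharpens the distribution, while top_p trims the unlikely tail, so candidates stay diverse without becoming incoherent.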
Stage 3: Hybrid Reward Scoring
Each candidate was scored using a hybrid signal:
| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | GetService(), no deprecated APIs, valid services |
| Bug scorer | 10% | pcall wrapping, nil checks, yield in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| Claude judge | 60% | Functionality, correctness, completeness (LLM-as-judge) |
Combined score = deterministic (40%) + Claude judge (60%)
Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
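The weighting above reduces to a simple weighted sum; the scorer outputs in this sketch are stubbed values for demonstration:

```python
# Hybrid reward: four deterministic scorers at 10% each (40% total)
# plus the Claude judge at 60%, per the table above.

WEIGHTS = {"syntax": 0.10, "api": 0.10, "bug": 0.10, "quality": 0.10, "judge": 0.60}

def hybrid_reward(scores: dict) -> float:
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Stubbed per-scorer outputs for one candidate:
candidate = {"syntax": 0.95, "api": 0.90, "bug": 0.85, "quality": 0.80, "judge": 0.92}
r = hybrid_reward(candidate)
print(round(r, 3), r >= 0.70)  # kept only if it clears the 0.70 threshold
```

Giving the LLM judge the majority of the weight lets it arbitrate on functionality, while the deterministic 40% anchors the score against judge noise.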
Stage 4: RFT Training
| Parameter | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (Γ8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |
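The table maps onto a configuration roughly like the following. This is a sketch in plain dicts; the key names mirror common peft/transformers conventions, but the actual training script is not published:

```python
# Hyperparameters from the table above, expressed as config dicts.

lora = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "lora_dropout": 0.05,
}

train = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.5e-4,
    "max_seq_length": 8192,
    "bf16": True,
    "gradient_checkpointing": True,
}

# Effective batch size seen by the optimizer:
effective_batch = train["per_device_train_batch_size"] * train["gradient_accumulation_steps"]
print(effective_batch)  # 8
```

Gradient accumulation trades wall-clock time for memory here: a per-device batch of 1 with 8 accumulation steps yields optimizer updates equivalent to batch size 8, which is what makes QLoRA on 8K-token sequences feasible.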
Results: SFT → RFT Improvement
| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | 0.95 | +0.03 |
| API Correctness | 0.88 | 0.93 | +0.05 |
| Bug-Free | 0.85 | 0.91 | +0.06 |
| Code Quality | 0.82 | 0.88 | +0.06 |
| Composite | 0.87 | 0.92 | +0.05 |
The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality, the areas where the Claude judge provided the most signal beyond the deterministic scorers.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model, then attach the RFT LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
With vLLM (recommended for serving)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-lora \
  --lora-modules \
    sft=squaredcuber/roblox-luau-mistral-7b-2 \
    rft=squaredcuber/roblox-luau-mistral-7b-rft \
  --max-lora-rank 64
```
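With the server running, a request selects an adapter by its LoRA module name in the `model` field of a standard OpenAI-style chat completion request. A sketch of the request body (host and port are the vLLM defaults; adjust to your deployment):

```python
# Build the JSON body for vLLM's OpenAI-compatible chat completions endpoint.
import json

payload = {
    "model": "rft",  # or "sft" to query the pre-RFT adapter
    "messages": [
        {"role": "system", "content": "You are an expert Roblox Luau programmer."},
        {"role": "user", "content": "Write a Luau script that teleports a player to spawn."},
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
print(json.loads(body)["model"])  # rft
```

Serving both adapters from one base model this way makes side-by-side SFT-vs-RFT comparisons cheap, since only the small LoRA weights differ.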
Agentic Pipeline
This model powers the agentic Roblox Studio assistant, a self-correcting code generation pipeline:
- Generate → the RFT model produces Luau code from a task description
- Score → 4 deterministic scorers evaluate the output in real time
- Self-correct → if the score is < 0.85, the model rewrites the code using the scorer feedback
- Insert → code is sent directly to Roblox Studio via a companion plugin
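The generate/score/self-correct loop above can be sketched as follows; `generate` and `score` are stubs standing in for the RFT model and the deterministic scorers, while the 0.85 threshold matches the pipeline description:

```python
# Sketch of the self-correction loop: regenerate with scorer feedback
# until the code clears the quality threshold or rounds run out.

def generate_with_correction(task, generate, score, threshold=0.85, max_rounds=3):
    feedback = None
    code = generate(task, feedback)
    for _ in range(max_rounds):
        s, feedback = score(code)
        if s >= threshold:
            break
        code = generate(task, feedback)  # rewrite using the scorer feedback
    return code

# Toy stubs: each rewrite appends a fix, and the stub score rewards fixes.
attempts = iter(["draft", "draft -- add nil checks", "draft -- nil checks -- pcall added"])
generate = lambda task, fb: next(attempts)
score = lambda code: (0.6 + 0.15 * code.count("--"), "add nil checks and pcall")
print(generate_with_correction("teleport players", generate, score))
```

Bounding the loop with `max_rounds` keeps latency predictable even when a task never clears the threshold.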
Intended Use
- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)
Limitations
- Trained on Mistral-7B base; larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent
Evaluation results (self-reported)
- Syntax Score: 0.950
- API Correctness: 0.930
- Bug-Free Score: 0.910
- Quality Score: 0.880
- Composite Score: 0.920