# Roblox Luau Mistral 7B — RFT (Reinforcement Fine-Tuned)

A reinforcement fine-tuned LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the SFT version by training on best-of-N candidates selected via a hybrid reward signal that combines deterministic code scorers with Claude-as-judge evaluation.

Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.

## Why RFT?

Standard SFT trains on static (task, code) pairs. RFT goes further: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop where the model learns from its own best outputs.

```
SFT Model → Generate N candidates per task
                ↓
         Score each candidate
         (4 deterministic scorers + Claude judge)
                ↓
         Keep best candidate per task (score ≥ 0.70)
                ↓
         Train on SFT data + best candidates → RFT Model
```
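The selection step above can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual code; `score` stands in for the hybrid reward described later in this card.

```python
# Minimal sketch of best-of-N selection with a score threshold.
THRESHOLD = 0.70  # minimum hybrid score to keep a candidate (from the card)

def select_best(candidates_per_task):
    """Keep the highest-scoring candidate per task, if it clears the threshold."""
    kept = {}
    for task, candidates in candidates_per_task.items():
        best = max(candidates, key=lambda c: c["score"])
        if best["score"] >= THRESHOLD:
            kept[task] = best
    return kept

# Toy pool with made-up scores: only tasks whose best candidate passes survive.
pool = {
    "spawn_system": [{"code": "...", "score": 0.62}, {"code": "...", "score": 0.81}],
    "leaderboard":  [{"code": "...", "score": 0.55}, {"code": "...", "score": 0.68}],
}
best = select_best(pool)
```

Note that a task is dropped entirely when even its best candidate falls below 0.70, so the RFT mix only contains examples the reward model already rates highly.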

## Training

### Stage 1: SFT Data Collection

Same as the SFT model:

- Reverse-labeled the-luau-stack examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers

### Stage 2: Candidate Generation

The SFT model generated 4 candidates per task for 50 tasks (200 total candidates) using temperature sampling (T=0.8, top_p=0.95).
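As a sketch, the sampling setup maps directly onto Hugging Face `generate` parameters (the model call itself is shown only as a comment; the variable names here are illustrative, not from the project):

```python
# Candidate generation settings described above.
N = 4            # candidates per task
NUM_TASKS = 50   # tasks in the generation round

gen_kwargs = dict(
    do_sample=True,            # temperature sampling rather than greedy decoding
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=N,    # draw all N candidates in one call
)
# e.g. model.generate(**inputs, **gen_kwargs, max_new_tokens=4096)

total_candidates = N * NUM_TASKS  # 200, matching the card
```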

### Stage 3: Hybrid Reward Scoring

Each candidate was scored using a hybrid signal:

| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | `GetService()` usage, no deprecated APIs, valid services |
| Bug scorer | 10% | `pcall` wrapping, nil checks, yields in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| Claude judge | 60% | Functionality, correctness, completeness (LLM-as-judge) |

Combined score = deterministic (40%) + Claude judge (60%)

Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
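The weighted combination from the table can be written out directly. This is a sketch using the card's published weights; the per-component scores in the example are made up, and each is assumed to be normalized to [0, 1].

```python
# Hybrid reward: four deterministic scorers at 10% each (40% total)
# plus the Claude judge at 60%.
WEIGHTS = {
    "syntax": 0.10, "api": 0.10, "bug": 0.10, "quality": 0.10,
    "judge": 0.60,
}

def hybrid_reward(scores):
    """Weighted sum of per-component scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Toy candidate: strong deterministic scores, middling judge score.
r = hybrid_reward({"syntax": 1.0, "api": 0.9, "bug": 0.8, "quality": 0.7, "judge": 0.75})
keep = r >= 0.70  # selection threshold from the card
```

Because the judge carries 60% of the weight, a candidate that passes every deterministic check can still be rejected if Claude rates its functionality poorly.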

### Stage 4: RFT Training

| Parameter | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (×8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |
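For reference, the hyperparameters above can be collected into a plain config dict (a sketch only, not the project's training script; the field names follow common peft/TRL conventions but are assumptions here):

```python
# RFT training hyperparameters from the table above.
rft_config = {
    "base_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "lora_r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "epochs": 2,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.5e-4,
    "max_seq_length": 8192,
    "bf16": True,
    "gradient_checkpointing": True,
}

# Effective batch size = per-device batch size × gradient accumulation steps.
effective_batch = (rft_config["per_device_batch_size"]
                   * rft_config["gradient_accumulation_steps"])
```

Note the alpha/rank ratio of 2 (128/64), a common choice that scales the LoRA update up relative to the frozen weights.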

## Results: SFT → RFT Improvement

| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | 0.95 | +0.03 |
| API Correctness | 0.88 | 0.93 | +0.05 |
| Bug-Free | 0.85 | 0.91 | +0.06 |
| Code Quality | 0.82 | 0.88 | +0.06 |
| Composite | 0.87 | 0.92 | +0.05 |

The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality: the areas where the Claude judge provided the most signal beyond the deterministic scorers.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

### With vLLM (recommended for serving)

```shell
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --enable-lora \
    --lora-modules \
        sft=squaredcuber/roblox-luau-mistral-7b-2 \
        rft=squaredcuber/roblox-luau-mistral-7b-rft \
    --max-lora-rank 64
```

## Agentic Pipeline

This model powers the agentic Roblox Studio assistant, a self-correcting code generation pipeline:

1. **Generate**: the RFT model produces Luau code from a task description
2. **Score**: 4 deterministic scorers evaluate the output in real time
3. **Self-correct**: if the score is below 0.85, the model rewrites the code using the scorer feedback
4. **Insert**: code is sent directly to Roblox Studio via a companion plugin
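The generate/score/self-correct loop above can be sketched as follows. Everything here is illustrative: `generate_code` and `score_code` stand in for the model call and the deterministic scorers, and the retry budget is an assumption, not something the card specifies.

```python
# Sketch of the self-correcting generation loop.
PASS_SCORE = 0.85   # acceptance threshold from the pipeline description
MAX_RETRIES = 2     # assumed retry budget (not stated in the card)

def generate_with_correction(task, generate_code, score_code):
    code = generate_code(task, feedback=None)
    for _ in range(MAX_RETRIES):
        score, feedback = score_code(code)
        if score >= PASS_SCORE:
            break
        # Below threshold: rewrite, conditioning on the scorer feedback.
        code = generate_code(task, feedback=feedback)
    return code

# Toy stand-ins: the first attempt scores 0.70, the rewrite scores 0.90.
attempts = iter([("v1", 0.70), ("v2", 0.90)])
def fake_gen(task, feedback):
    return next(attempts)
def fake_score(item):
    return item[1], "add pcall wrapping"

result = generate_with_correction("spawn part", fake_gen, fake_score)
```

Capping the number of rewrites keeps latency bounded when a task never clears the threshold; the last attempt is returned either way.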

## Intended Use

- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)

## Limitations

- Trained on a Mistral-7B base; larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent