DocEdit Qwen2.5-3B SFT + GRPO Post-Mortem
Date: April 17, 2026
Hardware: 1x H200 SXM
Base model: Qwen/Qwen2.5-3B-Instruct
Training recipe: LoRA SFT -> LoRA GRPO
Primary Hub repo:
1. Goal
The goal of this run was to answer a narrow but important question:
Can a small open model be adapted and reinforcement-tuned to repair corrupted structured documents?
This was not yet the final tool-policy architecture.
Instead, this run intentionally produced a rewrite-policy baseline that we can later compare against:
- frontier-model tool use
- tool-trajectory training
- planner -> applicator architectures
2. What We Ran
SFT stage
We trained a LoRA adapter on paired:
- corrupted document
- repaired target document
This teaches:
- markup discipline
- structured output behavior
- basic repair mapping
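For reference, here is a minimal sketch of this stage using TRL's `SFTTrainer` with a `peft` LoRA config. The prompt template, dataset construction, and hyperparameters are illustrative assumptions, not the exact run config.

```python
# Minimal SFT sketch: LoRA adapter on (corrupted -> repaired) pairs.
# Prompt wording, LoRA rank, and epoch count are illustrative assumptions.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each example pairs a corrupted document with its repaired target.
train_data = Dataset.from_list([
    {"prompt": "Repair this document:\n<corrupted doc here>",
     "completion": "<repaired doc here>"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=train_data,
    args=SFTConfig(output_dir="sft-adapter", num_train_epochs=1),
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                           task_type="CAUSAL_LM"),
)
trainer.train()
```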
GRPO stage
We then continued from the SFT adapter using verifier-based RL.
Reward ingredients:
- structural correctness
- edit accuracy
- collateral damage penalty
- output format penalty
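Below is a hedged sketch of how those ingredients could combine into a single scalar, in TRL's GRPO reward-function style (a list of completions in, a list of floats out). The helper checks, the weights, and the `target_docs` column name are stand-in assumptions, not the run's actual verifier.

```python
import difflib

def structural_ok(text: str) -> bool:
    # Stand-in structural check (assumption): markup brackets balance.
    return text.count("<") == text.count(">")

def edit_accuracy(pred: str, target: str) -> float:
    # Stand-in (assumption): fraction of target lines reproduced exactly.
    target_lines = target.splitlines() or [target]
    pred_lines = set(pred.splitlines())
    return sum(line in pred_lines for line in target_lines) / len(target_lines)

def collateral_damage(pred: str, target: str) -> float:
    # Stand-in (assumption): overall dissimilarity to the target as a proxy
    # for unintended changes outside the requested edits.
    return 1.0 - difflib.SequenceMatcher(None, pred, target).ratio()

def composite_reward(completions, target_docs, **kwargs):
    # `target_docs` is assumed to be an extra dataset column that TRL's
    # GRPOTrainer forwards to reward functions by keyword.
    rewards = []
    for pred, target in zip(completions, target_docs):
        r = 0.3 if structural_ok(pred) else 0.0      # structural correctness
        r += 0.5 * edit_accuracy(pred, target)       # edit accuracy
        r -= 0.3 * collateral_damage(pred, target)   # collateral damage penalty
        if not pred.strip():                         # output format penalty
            r -= 0.2
        rewards.append(r)
    return rewards
```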
3. Final Training Outcome
SFT
- runtime: about 109.38s
- final train loss: about 0.06346
- final mean token accuracy: about 0.98954
GRPO
- runtime: about 5562.75s
- total steps: 100
- final train loss: about 0.03506
- final logged step-100 reward mean: about 0.79567
GRPO checkpoints written:
`checkpoint-25`, `checkpoint-50`, `checkpoint-75`, `checkpoint-100`
4. SFT Loss Curve
```mermaid
xychart-beta
    title "SFT Loss"
    x-axis ["Step 5", "Step 10", "Step 15", "Final"]
    y-axis "Loss" 0 --> 0.10
    line [0.0811, 0.0352, 0.0910, 0.0635]
```
5. GRPO Reward Curve Snapshot
```mermaid
xychart-beta
    title "GRPO Reward Snapshot"
    x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
    y-axis "Reward" 0.55 --> 1.30
    line [0.8422, 0.7638, 0.9102, 0.7957]
```
6. GRPO Step Time Snapshot
```mermaid
xychart-beta
    title "GRPO Step Time"
    x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
    y-axis "Seconds" 40 --> 70
    line [66.42, 58.12, 55.95, 61.71]
```
7. Quick Directional Eval
After training, we ran a very small local eval on 3 validation cases for:
- base model
- SFT adapter
- final GRPO adapter
This is not a full benchmark.
It is only a quick directional comparison to tell us whether the trained adapters are plausibly improving over baseline.
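A minimal sketch of what such a three-way comparison can look like is below, assuming `transformers` + `peft` for generation and `difflib` for the similarity number; paths, prompt wording, and decoding settings are placeholders, not the exact harness.

```python
import difflib

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-3B-Instruct"

def load_model(adapter_path=None):
    # Load the base model, optionally stacking a LoRA adapter on top.
    tok = AutoTokenizer.from_pretrained(BASE_ID)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
    if adapter_path is not None:  # e.g. "outputs/grpo/checkpoint-100" (placeholder)
        model = PeftModel.from_pretrained(model, adapter_path)
    return tok, model

def repair(tok, model, corrupted: str) -> str:
    messages = [{"role": "user", "content": f"Repair this document:\n{corrupted}"}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=1024, do_sample=False)
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

def score(pred: str, target: str) -> dict:
    return {
        "exact_match": float(pred.strip() == target.strip()),
        "similarity": difflib.SequenceMatcher(None, pred, target).ratio(),
    }
```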
3-case quick eval results
| Model | Cases | Exact match rate | Mean similarity | Mean composite score | Mean edit accuracy | Mean collateral damage |
|---|---|---|---|---|---|---|
| Base Qwen2.5-3B-Instruct | 3 | 0.0000 | 0.9358 | 0.7790 | 0.4444 | 0.2000 |
| Qwen2.5-3B + SFT LoRA | 3 | 0.3333 | 0.9964 | 0.9109 | 0.6667 | 0.0159 |
| Qwen2.5-3B + GRPO LoRA | 3 | 0.3333 | 0.9964 | 0.9149 | 0.6667 | 0.0000 |
Visual comparison
```mermaid
xychart-beta
    title "Quick Eval Mean Composite Score"
    x-axis ["Base", "SFT", "GRPO"]
    y-axis "Composite Score" 0.70 --> 0.95
    bar [0.7790, 0.9109, 0.9149]
```
```mermaid
xychart-beta
    title "Quick Eval Mean Collateral Damage"
    x-axis ["Base", "SFT", "GRPO"]
    y-axis "Collateral Damage" 0.00 --> 0.25
    bar [0.2000, 0.0159, 0.0000]
```
What this means
On this very small check:
- SFT clearly improved over the base model
- GRPO slightly improved over SFT on composite score
- GRPO also reduced collateral damage to zero on this 3-case slice
This is encouraging, but it is not enough to claim robust superiority yet.
8. What Went Well
- The H200 setup worked well for this scale.
- SFT completed quickly and produced a clean LoRA adapter.
- GRPO completed fully and wrote multiple checkpoints.
- The final GRPO adapter loads and generates correctly.
- The quick directional eval suggests the trained adapters beat the untuned base model.
9. What Did Not Go Perfectly
- The current policy is still a rewrite policy, not the final tool-call architecture.
- We had to patch `run_grpo.py` during the run to match the installed TRL version.
- We also had to fix a repo-root import issue in the GRPO entrypoint.
- The currently published eval is still small and should be treated as a sanity check, not a full research result.
10. Biggest Strategic Takeaway
This run successfully answers:
Can we fine-tune and RL-tune a small model for DocEdit on one H200?
Answer:
- yes
But it does not yet settle the bigger architecture question:
Is rewrite-policy the right final product design?
The answer there is still:
- probably not
The next likely better direction is:
- frontier model plans edits
- smaller executor/applicator handles structured edit application
- or frontier model directly uses a compact patch language
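To make the "compact patch language" option concrete, here is a minimal sketch of what such a language and its deterministic applicator could look like. The op names and JSON shape are illustrative assumptions, not a committed design.

```python
import json

def apply_patch(doc: str, patch_json: str) -> str:
    # Apply line-level edit ops; processing in descending line order keeps
    # earlier indices valid as lines are inserted and removed.
    lines = doc.splitlines()
    for op in sorted(json.loads(patch_json), key=lambda o: o["line"], reverse=True):
        if op["op"] == "replace":
            lines[op["line"]] = op["text"]
        elif op["op"] == "insert_after":
            lines.insert(op["line"] + 1, op["text"])
        elif op["op"] == "delete":
            del lines[op["line"]]
    return "\n".join(lines)

# One hypothetical repair: fix a broken title tag on line 1 (0-based).
patch = '[{"op": "replace", "line": 1, "text": "<title>Fixed</title>"}]'
print(apply_patch("<doc>\n<titl>Broken</titl>\n</doc>", patch))
```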
This run is therefore best understood as:
- a successful baseline
- a checkpoint artifact
- a comparison anchor for future tool-policy work
11. Recommended Next Steps
- Run GPT-5.4 directly with a compact edit language or tool schema.
- Compare that against this rewrite-policy baseline.
- Decide whether to:
- keep frontier-only tool use
- or distill those edit traces into a smaller applicator model
- Move future training toward:
- structured edit plans
- tool trajectories
- planner -> executor separation
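As one concrete starting point, an edit tool exposed to a frontier model could be declared with a JSON Schema along these lines; the tool name and fields are hypothetical, mirroring the patch-language sketch above.

```python
# Hypothetical tool declaration (JSON Schema style accepted by common
# function-calling APIs); the name and fields are assumptions.
APPLY_EDIT_TOOL = {
    "name": "apply_edit",
    "description": "Apply one structured edit to the working document.",
    "parameters": {
        "type": "object",
        "properties": {
            "op": {"type": "string", "enum": ["replace", "insert_after", "delete"]},
            "line": {"type": "integer", "description": "0-based line index"},
            "text": {"type": "string", "description": "Replacement or inserted text"},
        },
        "required": ["op", "line"],
    },
}
```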
12. Final Judgment
Was the H200 run worth doing?
- Yes.
Why?
- it produced complete SFT and GRPO artifacts
- it gave us a usable small-model baseline
- it generated a real comparison point for future design decisions
Would I immediately continue training more rewrite-policy models after this?
- No.
I would pause here, keep these artifacts, and move the next cycle toward the cleaner frontier-planner / structured-edit direction.