DocEdit Qwen2.5-3B SFT + GRPO Post-Mortem
Date: April 17, 2026
Hardware: 1x H200 SXM
Base model: Qwen/Qwen2.5-3B-Instruct
Training recipe: LoRA SFT -> LoRA GRPO
Primary Hub repo:
1. Goal
The goal of this run was to answer a narrow but important question:
Can a small open model be adapted and reinforcement-tuned to repair corrupted structured documents?
This was not yet the final tool-policy architecture.
Instead, this run intentionally produced a rewrite-policy baseline that we can later compare against:
- frontier-model tool use
- tool-trajectory training
- planner -> applicator architectures
2. What We Ran
SFT stage
We trained a LoRA adapter on paired:
- corrupted document
- repaired target document
This teaches:
- markup discipline
- structured output behavior
- basic repair mapping
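For reference, here is a minimal sketch of this stage using TRL's `SFTTrainer` with a `peft` LoRA config. The prompt template, dataset construction, and hyperparameters are illustrative assumptions, not the exact run config.

```python
# Minimal SFT sketch: LoRA adapter on (corrupted -> repaired) pairs.
# Prompt wording, LoRA rank, and epoch count are illustrative assumptions.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each example pairs a corrupted document with its repaired target.
train_data = Dataset.from_list([
    {"prompt": "Repair this document:\n<corrupted doc here>",
     "completion": "<repaired doc here>"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=train_data,
    args=SFTConfig(output_dir="sft-adapter", num_train_epochs=1),
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                           task_type="CAUSAL_LM"),
)
trainer.train()
```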
GRPO stage
We then continued from the SFT adapter using verifier-based RL.
Reward ingredients:
- structural correctness
- edit accuracy
- collateral damage penalty
- output format penalty
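Below is a hedged sketch of how those ingredients could combine into a single scalar, in TRL's GRPO reward-function style (a list of completions in, a list of floats out). The helper checks, the weights, and the `target_docs` column name are stand-in assumptions, not the run's actual verifier.

```python
import difflib

def structural_ok(text: str) -> bool:
    # Stand-in structural check (assumption): markup brackets balance.
    return text.count("<") == text.count(">")

def edit_accuracy(pred: str, target: str) -> float:
    # Stand-in (assumption): fraction of target lines reproduced exactly.
    target_lines = target.splitlines() or [target]
    pred_lines = set(pred.splitlines())
    return sum(line in pred_lines for line in target_lines) / len(target_lines)

def collateral_damage(pred: str, target: str) -> float:
    # Stand-in (assumption): overall dissimilarity to the target as a proxy
    # for unintended changes outside the requested edits.
    return 1.0 - difflib.SequenceMatcher(None, pred, target).ratio()

def composite_reward(completions, target_docs, **kwargs):
    # `target_docs` is assumed to be an extra dataset column that TRL's
    # GRPOTrainer forwards to reward functions by keyword.
    rewards = []
    for pred, target in zip(completions, target_docs):
        r = 0.3 if structural_ok(pred) else 0.0      # structural correctness
        r += 0.5 * edit_accuracy(pred, target)       # edit accuracy
        r -= 0.3 * collateral_damage(pred, target)   # collateral damage penalty
        if not pred.strip():                         # output format penalty
            r -= 0.2
        rewards.append(r)
    return rewards
```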
3. Final Training Outcome
SFT
- runtime: about 109.38s
- final train loss: about 0.06346
- final mean token accuracy: about 0.98954
GRPO
- runtime: about 5562.75s
- total steps: 100
- final train loss: about 0.03506
- final logged step-100 reward mean: about 0.79567
GRPO checkpoints written:
`checkpoint-25`, `checkpoint-50`, `checkpoint-75`, `checkpoint-100`
4. SFT Loss Curve
```mermaid
xychart-beta
    title "SFT Loss"
    x-axis ["Step 5", "Step 10", "Step 15", "Final"]
    y-axis "Loss" 0 --> 0.10
    line [0.0811, 0.0352, 0.0910, 0.0635]
```
5. GRPO Reward Curve Snapshot
```mermaid
xychart-beta
    title "GRPO Reward Snapshot"
    x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
    y-axis "Reward" 0.55 --> 1.30
    line [0.8422, 0.7638, 0.9102, 0.7957]
```
6. GRPO Step Time Snapshot
```mermaid
xychart-beta
    title "GRPO Step Time"
    x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
    y-axis "Seconds" 40 --> 70
    line [66.42, 58.12, 55.95, 61.71]
```
7. Quick Directional Eval
After training, we ran a very small local eval on 3 validation cases for:
- base model
- SFT adapter
- final GRPO adapter
This is not a full benchmark.
It is only a quick directional comparison to tell us whether the trained adapters are plausibly improving over baseline.
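A minimal sketch of what such a three-way comparison can look like is below, assuming `transformers` + `peft` for generation and `difflib` for the similarity number; paths, prompt wording, and decoding settings are placeholders, not the exact harness.

```python
import difflib

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-3B-Instruct"

def load_model(adapter_path=None):
    # Load the base model, optionally stacking a LoRA adapter on top.
    tok = AutoTokenizer.from_pretrained(BASE_ID)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
    if adapter_path is not None:  # e.g. "outputs/grpo/checkpoint-100" (placeholder)
        model = PeftModel.from_pretrained(model, adapter_path)
    return tok, model

def repair(tok, model, corrupted: str) -> str:
    messages = [{"role": "user", "content": f"Repair this document:\n{corrupted}"}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=1024, do_sample=False)
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

def score(pred: str, target: str) -> dict:
    return {
        "exact_match": float(pred.strip() == target.strip()),
        "similarity": difflib.SequenceMatcher(None, pred, target).ratio(),
    }
```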
3-case quick eval results
| Model | Cases | Exact match rate | Mean similarity | Mean composite score | Mean edit accuracy | Mean collateral damage |
|---|---|---|---|---|---|---|
| Base Qwen2.5-3B-Instruct | 3 | 0.0000 | 0.9358 | 0.7790 | 0.4444 | 0.2000 |
| Qwen2.5-3B + SFT LoRA | 3 | 0.3333 | 0.9964 | 0.9109 | 0.6667 | 0.0159 |
| Qwen2.5-3B + GRPO LoRA | 3 | 0.3333 | 0.9964 | 0.9149 | 0.6667 | 0.0000 |
Visual comparison
```mermaid
xychart-beta
    title "Quick Eval Mean Composite Score"
    x-axis ["Base", "SFT", "GRPO"]
    y-axis "Composite Score" 0.70 --> 0.95
    bar [0.7790, 0.9109, 0.9149]
```
```mermaid
xychart-beta
    title "Quick Eval Mean Collateral Damage"
    x-axis ["Base", "SFT", "GRPO"]
    y-axis "Collateral Damage" 0.00 --> 0.25
    bar [0.2000, 0.0159, 0.0000]
```
What this means
On this very small check:
- SFT clearly improved over the base model
- GRPO slightly improved over SFT on composite score
- GRPO also reduced collateral damage to zero on this 3-case slice
This is encouraging, but it is not enough to claim robust superiority yet.
8. What Went Well
- The H200 setup worked well for this scale.
- SFT completed quickly and produced a clean LoRA adapter.
- GRPO completed fully and wrote multiple checkpoints.
- The final GRPO adapter loads and generates correctly.
- The quick directional eval suggests the trained adapters beat the untuned base model.
9. What Did Not Go Perfectly
- The current policy is still a rewrite policy, not the final tool-call architecture.
- We had to patch `run_grpo.py` during the run to match the installed TRL version.
- We also had to fix a repo-root import issue in the GRPO entrypoint.
- The currently published eval is still small and should be treated as a sanity check, not a full research result.
10. Biggest Strategic Takeaway
This run successfully answers:
Can we fine-tune and RL-tune a small model for DocEdit on one H200?
Answer:
- yes
But it does not yet settle the bigger architecture question:
Is rewrite-policy the right final product design?
The answer there is still:
- probably not
The next likely better direction is:
- frontier model plans edits
- smaller executor/applicator handles structured edit application
- or frontier model directly uses a compact patch language
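To make the "compact patch language" option concrete, here is a minimal sketch of what such a language and its deterministic applicator could look like. The op names and JSON shape are illustrative assumptions, not a committed design.

```python
import json

def apply_patch(doc: str, patch_json: str) -> str:
    # Apply line-level edit ops; processing in descending line order keeps
    # earlier indices valid as lines are inserted and removed.
    lines = doc.splitlines()
    for op in sorted(json.loads(patch_json), key=lambda o: o["line"], reverse=True):
        if op["op"] == "replace":
            lines[op["line"]] = op["text"]
        elif op["op"] == "insert_after":
            lines.insert(op["line"] + 1, op["text"])
        elif op["op"] == "delete":
            del lines[op["line"]]
    return "\n".join(lines)

# One hypothetical repair: fix a broken title tag on line 1 (0-based).
patch = '[{"op": "replace", "line": 1, "text": "<title>Fixed</title>"}]'
print(apply_patch("<doc>\n<titl>Broken</titl>\n</doc>", patch))
```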
This run is therefore best understood as:
- a successful baseline
- a checkpoint artifact
- a comparison anchor for future tool-policy work
11. Recommended Next Steps
- Run GPT-5.4 directly with a compact edit language or tool schema.
- Compare that against this rewrite-policy baseline.
- Decide whether to:
- keep frontier-only tool use
- or distill those edit traces into a smaller applicator model
- Move future training toward:
- structured edit plans
- tool trajectories
- planner -> executor separation
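As one concrete starting point, an edit tool exposed to a frontier model could be declared with a JSON Schema along these lines; the tool name and fields are hypothetical, mirroring the patch-language sketch above.

```python
# Hypothetical tool declaration (JSON Schema style accepted by common
# function-calling APIs); the name and fields are assumptions.
APPLY_EDIT_TOOL = {
    "name": "apply_edit",
    "description": "Apply one structured edit to the working document.",
    "parameters": {
        "type": "object",
        "properties": {
            "op": {"type": "string", "enum": ["replace", "insert_after", "delete"]},
            "line": {"type": "integer", "description": "0-based line index"},
            "text": {"type": "string", "description": "Replacement or inserted text"},
        },
        "required": ["op", "line"],
    },
}
```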
12. Final Judgment
Was the H200 run worth doing?
- Yes.
Why?
- it produced complete SFT and GRPO artifacts
- it gave us a usable small-model baseline
- it generated a real comparison point for future design decisions
Would I immediately continue training more rewrite-policy models after this?
- No.
I would pause here, keep these artifacts, and move the next cycle toward the cleaner frontier-planner / structured-edit direction.