# DocEdit Qwen2.5-3B SFT + GRPO Post-Mortem

Date:
- April 17, 2026

Hardware:
- `1x H200 SXM`

Base model:
- `Qwen/Qwen2.5-3B-Instruct`

Training recipe:
- `LoRA SFT`
- `LoRA GRPO`

Primary Hub repo:
- [sanjuhs/docedit-qwen25-3b-checkpoints](https://huggingface.co/sanjuhs/docedit-qwen25-3b-checkpoints)

---

## 1. Goal

The goal of this run was to answer a narrow but important question:

> Can a small open model be adapted and reinforcement-tuned to repair corrupted structured documents?

This run did not yet implement the final tool-policy architecture.

Instead, this run intentionally produced a **rewrite-policy baseline** that we can later compare against:
- frontier-model tool use
- tool-trajectory training
- planner -> applicator architectures

---

## 2. What We Ran

### SFT stage

We trained a LoRA adapter on pairs of:
- corrupted document
- repaired target document

This teaches (see the sketch after this list):
- markup discipline
- structured output behavior
- basic repair mapping
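
A minimal sketch of that setup, assuming TRL's `SFTTrainer` with a `peft` LoRA config and a prompt/completion JSONL file; the file name and hyperparameters here are illustrative, not the run's actual values:

```python
# Minimal SFT sketch: LoRA adapter on (corrupted -> repaired) pairs.
# File name and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each JSONL row: {"prompt": corrupted_doc, "completion": repaired_doc}
dataset = load_dataset("json", data_files="sft_pairs.jsonl", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="sft-lora", num_train_epochs=1),
)
trainer.train()
trainer.save_model("sft-lora/final")
```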

### GRPO stage

We then continued from the SFT adapter using verifier-based RL.

Reward ingredients (see the combined-reward sketch after this list):
- structural correctness
- edit accuracy
- collateral damage penalty
- output format penalty
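
A minimal sketch of how such ingredients can combine into the single scalar GRPO optimizes, written in the reward-function signature style TRL's `GRPOTrainer` accepts. The weights and stand-in checks are assumptions, not the run's actual verifier; it also assumes plain string completions and a `target` column in the dataset:

```python
import difflib

def docedit_reward(completions, target, **kwargs):
    """One scalar per sampled completion; weights and checks are illustrative."""
    rewards = []
    for completion, gold in zip(completions, target):
        # structural correctness: stand-in check for balanced markup
        structure = 1.0 if completion.count("<") == completion.count(">") else 0.0
        # edit accuracy: similarity to the repaired target document
        accuracy = difflib.SequenceMatcher(None, completion, gold).ratio()
        # collateral damage penalty: punish large length drift from the target
        damage = min(abs(len(completion) - len(gold)) / max(len(gold), 1), 1.0)
        # output format penalty: punish stray prose around the document
        fmt_penalty = 0.0 if completion.strip().startswith("<") else 0.2
        rewards.append(0.3 * structure + 0.7 * accuracy - 0.5 * damage - fmt_penalty)
    return rewards
```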

---

## 3. Final Training Outcome

### SFT

- runtime: about `109.38s`
- final train loss: about `0.06346`
- final mean token accuracy: about `0.98954`

### GRPO

- runtime: about `5562.75s`
- total steps: `100`
- final train loss: about `0.03506`
- final logged step-100 reward mean: about `0.79567`

GRPO checkpoints written (a loading sketch follows this list):
- `checkpoint-25`
- `checkpoint-50`
- `checkpoint-75`
- `checkpoint-100`
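
Any of these can be loaded back onto the base model with `peft`; a minimal sketch, where the Hub subfolder layout is an assumption:

```python
# Minimal sketch of loading a GRPO checkpoint; subfolder layout is assumed.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = PeftModel.from_pretrained(
    base,
    "sanjuhs/docedit-qwen25-3b-checkpoints",
    subfolder="grpo/checkpoint-100",  # assumed repo layout
)
```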

---

## 4. SFT Loss Curve

```mermaid
xychart-beta
    title "SFT Loss"
    x-axis ["Step 5", "Step 10", "Step 15", "Final"]
    y-axis "Loss" 0 --> 0.10
    line [0.0811, 0.0352, 0.0910, 0.0635]
```

## 5. GRPO Reward Curve Snapshot

```mermaid
xychart-beta
    title "GRPO Reward Snapshot"
    x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
    y-axis "Reward" 0.55 --> 1.30
    line [0.8422, 0.7638, 0.9102, 0.7957]
```

## 6. GRPO Step Time Snapshot

```mermaid
xychart-beta
    title "GRPO Step Time"
    x-axis ["Step 5", "Step 10", "Step 15", "Step 100"]
    y-axis "Seconds" 40 --> 70
    line [66.42, 58.12, 55.95, 61.71]
```

---

## 7. Quick Directional Eval

After training, we ran a **very small** local eval on `3` validation cases for:
- base model
- SFT adapter
- final GRPO adapter

This is not a full benchmark; it is only a quick directional comparison that tells us whether the trained adapters plausibly improve over the baseline.
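
For the table below, per-case metrics of roughly this shape were averaged over the cases; the exact definitions here (strict string equality and a `difflib` ratio) are illustrative assumptions, not the actual scoring script:

```python
import difflib

def case_metrics(prediction: str, target: str) -> dict:
    """Directional per-case metrics; definitions are illustrative."""
    return {
        "exact_match": float(prediction.strip() == target.strip()),
        "similarity": difflib.SequenceMatcher(None, prediction, target).ratio(),
    }

def mean_metrics(cases: list[dict]) -> dict:
    """Average each metric over all (here, 3) validation cases."""
    return {k: sum(c[k] for c in cases) / len(cases) for k in cases[0]}
```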

### 3-case quick eval results

| Model | Cases | Exact match rate | Mean similarity | Mean composite score | Mean edit accuracy | Mean collateral damage |
|---|---:|---:|---:|---:|---:|---:|
| Base `Qwen2.5-3B-Instruct` | 3 | 0.0000 | 0.9358 | 0.7790 | 0.4444 | 0.2000 |
| `Qwen2.5-3B + SFT LoRA` | 3 | 0.3333 | 0.9964 | 0.9109 | 0.6667 | 0.0159 |
| `Qwen2.5-3B + GRPO LoRA` | 3 | 0.3333 | 0.9964 | 0.9149 | 0.6667 | 0.0000 |

### Visual comparison

```mermaid
xychart-beta
    title "Quick Eval Mean Composite Score"
    x-axis ["Base", "SFT", "GRPO"]
    y-axis "Composite Score" 0.70 --> 0.95
    bar [0.7790, 0.9109, 0.9149]
```

```mermaid
xychart-beta
    title "Quick Eval Mean Collateral Damage"
    x-axis ["Base", "SFT", "GRPO"]
    y-axis "Collateral Damage" 0.00 --> 0.25
    bar [0.2000, 0.0159, 0.0000]
```

### What this means

On this very small check:
- SFT clearly improved over the base model
- GRPO slightly improved over SFT on composite score
- GRPO also reduced collateral damage to zero on this 3-case slice

This is encouraging, but it is **not enough** to claim robust superiority yet.

---

## 8. What Went Well

1. The H200 setup worked well for this scale.
2. SFT completed quickly and produced a clean LoRA adapter.
3. GRPO completed fully and wrote multiple checkpoints.
4. The final GRPO adapter loads and generates correctly.
5. The quick directional eval suggests the trained adapters beat the untuned base model.

---

## 9. What Did Not Go Perfectly

1. The current policy is still a **rewrite policy**, not the final tool-call architecture.
2. We had to patch `run_grpo.py` during the run to match the installed TRL version.
3. We also had to fix a repo-root import issue in the GRPO entrypoint (see the shim sketch after this list).
4. The currently published eval is still small and should be treated as a sanity check, not a full research result.
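
For item 3, a minimal sketch of the kind of shim that resolves such issues, assuming `run_grpo.py` sits one level below the repo root (the path layout is an assumption):

```python
# Hypothetical repo-root import shim near the top of run_grpo.py;
# assumes the entrypoint lives one level below the repo root.
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))
```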

---

## 10. Biggest Strategic Takeaway

This run successfully answers:

> Can we fine-tune and RL-tune a small model for DocEdit on one H200?

Answer:
- **yes**

But it does **not** yet settle the bigger architecture question:

> Is rewrite-policy the right final product design?

The answer there is still:
- **probably not**

The more promising next direction is one of the following:
- frontier model plans edits
- smaller executor/applicator handles structured edit application
- or frontier model directly uses a compact patch language

This run is therefore best understood as:
- a successful baseline
- a checkpoint artifact
- a comparison anchor for future tool-policy work

---

## 11. Recommended Next Steps

1. Run `GPT-5.4` directly with a compact edit language or tool schema (a hypothetical schema sketch follows this list).
2. Compare that against this rewrite-policy baseline.
3. Decide whether to:
   - keep frontier-only tool use
   - or distill those edit traces into a smaller applicator model
4. Move future training toward:
   - structured edit plans
   - tool trajectories
   - planner -> executor separation
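
To make "compact edit language" concrete, one hypothetical shape such a plan could take is sketched below; every field name and op type is an illustrative assumption, not a committed schema:

```python
# Hypothetical compact edit plan: the planner emits targeted operations
# instead of rewriting the whole document. Field names are assumptions.
from typing import Literal, TypedDict

class EditOp(TypedDict):
    op: Literal["replace", "insert_after", "delete"]
    anchor: str   # node id or line locator inside the corrupted document
    content: str  # new content; empty for "delete"

example_plan: list[EditOp] = [
    {"op": "replace", "anchor": "heading-3", "content": "## Results"},
    {"op": "delete", "anchor": "para-7", "content": ""},
]
```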

---

## 12. Final Judgment

Was the H200 run worth doing?

- **Yes.**

Why?
- it produced complete SFT and GRPO artifacts
- it gave us a usable small-model baseline
- it generated a real comparison point for future design decisions

Would I immediately continue training more rewrite-policy models after this?

- **No.**

I would pause here, keep these artifacts, and move the next cycle toward the cleaner frontier-planner / structured-edit direction.