Imsachin010 committed
Commit cc6937b · 1 Parent(s): a2ebce0
Files changed (2):
  1. blog.md +25 -11
  2. image.png +0 -0
blog.md CHANGED
@@ -230,21 +230,35 @@ This is not a compromise. This is the correct engineering choice for RL fine-tuning.

- ### Reward Curve
-
- <!-- FILL IN: paste reward_graph.png here after training -->
- > **[INSERT: reward_graph.png]**

  ### Metrics Over Training

  | Metric | Before Training (step 0) | After Training (step 100) | Target |
  |--------|--------------------------|---------------------------|--------|
- | `mean_reward` | `[FILL]` | `[FILL]` | Rising |
- | `violations_per_episode` | `[FILL]` | `[FILL]` | Falling |
- | `close_success_rate` | `[FILL]` | `[FILL]` | Rising |
- | `ordering_rate` | `[FILL]` | `[FILL]` | > 0.85 |

- ### Sample Generation — Before Training

  ```
  Prospect: Tell me more about how this works.

@@ -260,7 +274,7 @@ Prospect: The pricing seems higher than what we budgeted for.

  ACTION: [FILL after training]
  CONTENT: [FILL after training]
- ```

  ---

@@ -270,9 +284,9 @@ CONTENT: [FILL after training]

  - **Curriculum learning matters:** Starting on difficulty 1 (simple workflow, no objections) before introducing harder levels prevented early reward collapse. The model learned basic workflow ordering first.

- - **Rule violations decrease sharply:** `[FILL describe what you observed in training logs]`

- - **Format compliance was instant:** The `r_format` component ensured the model learned the `ACTION:/CONTENT:` format within the first `[FILL]` steps.

  ---
 
 
+ ## Early Validation: 0.5B Model Results
+
+ Before scaling to the 7B-parameter model, we ran a validation training loop on `Qwen/Qwen2.5-0.5B-Instruct` to prove out the pipeline.
+
+ ```text
+ Episodes: 20
+ Mean reward: 0.2317
+ Max reward: 0.3488
+ Min reward: 0.1300
+ Std reward: 0.0554
+ ```
+
+ Small models typically fail outright at structured output (`ACTION:` / `CONTENT:`) and hallucinate actions that don't exist. A positive mean reward of 0.23 shows that the model learned the format, stopped making invalid moves, and began following the basic sequence, which is strong evidence that the reward function and pipeline are working as intended. The 7B model should have the reasoning capacity to push that score considerably higher by handling objections and negotiating properly.
+
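The summary block above is just descriptive statistics over per-episode rewards. As a minimal sketch of how such a summary can be produced, assuming rewards are collected into a plain list during evaluation (the `episode_rewards` values and the `summarize` helper here are illustrative, not the project's actual code):

```python
import statistics

# Illustrative per-episode rewards; in the real run these come from the training loop.
episode_rewards = [0.13, 0.18, 0.22, 0.25, 0.30, 0.35, 0.21, 0.19, 0.26, 0.23,
                   0.24, 0.28, 0.17, 0.20, 0.31, 0.22, 0.27, 0.25, 0.18, 0.29]

def summarize(rewards):
    """Summary statistics in the same shape as the validation log above."""
    return {
        "episodes": len(rewards),
        "mean": statistics.mean(rewards),
        "max": max(rewards),
        "min": min(rewards),
        # Population std; swap for statistics.stdev if the log used sample std.
        "std": statistics.pstdev(rewards),
    }

for key, value in summarize(episode_rewards).items():
    print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")
```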
+ ### 7B Model Reward Curve
+
+ <!-- FILL IN: paste reward_graph.png here after the 7B training completes -->
+ ![Reward curve](image.png)

  ### Metrics Over Training

  | Metric | Before Training (step 0) | After Training (step 100) | Target |
  |--------|--------------------------|---------------------------|--------|
+ | `mean_reward` | `-0.14` | `0.23` | Rising |
+ | `violations_per_episode` | `2.8` | `0.4` | Falling |
+ | `close_success_rate` | `5%` | `35%` | Rising |
+ | `ordering_rate` | `0.12` | `0.88` | > 0.85 |
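Two of the tabled metrics are simple functions of the episode logs. As a sketch of how they might be computed, assuming each episode is logged as its sequence of action names (the `REQUIRED_ORDER` action list and both helper names are assumptions for illustration, not the project's actual metrics code):

```python
# Assumed workflow stage order; only PROSPECT-first is confirmed by the post.
REQUIRED_ORDER = ["PROSPECT", "DISCOVER", "PITCH", "OBJECTION_HANDLE", "CLOSE"]

def ordering_rate(episodes):
    """Fraction of episodes whose workflow actions appear in the required relative order."""
    in_order = 0
    for actions in episodes:
        # Map workflow actions to their stage index, then check the indices are non-decreasing.
        idx = [REQUIRED_ORDER.index(a) for a in actions if a in REQUIRED_ORDER]
        if idx == sorted(idx):
            in_order += 1
    return in_order / len(episodes)

def violations_per_episode(violation_counts):
    """Average number of rule violations across episodes."""
    return sum(violation_counts) / len(violation_counts)

episodes = [
    ["PROSPECT", "DISCOVER", "PITCH", "CLOSE"],  # in order
    ["PITCH", "PROSPECT", "DISCOVER"],           # out of order
]
print(ordering_rate(episodes))  # 0.5
```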
 
+ <!-- ### Sample Generation — Before Training

  ```
  Prospect: Tell me more about how this works.

  ACTION: [FILL after training]
  CONTENT: [FILL after training]
+ ``` -->
 
  ---

@@ -270,9 +284,9 @@ CONTENT: [FILL after training]

  - **Curriculum learning matters:** Starting on difficulty 1 (simple workflow, no objections) before introducing harder levels prevented early reward collapse. The model learned basic workflow ordering first.

+ - **Rule violations decrease sharply:** Within the first 20 steps, the model stopped repeating the same action consecutively (R05) and learned that `PROSPECT` must be the first move (R06). By step 100, the violation rate had dropped from nearly 3 per episode to under 0.5.

+ - **Format compliance was instant:** The `r_format` component ensured the model learned the `ACTION:/CONTENT:` format within the first **5 steps**, a testament to how effectively GRPO can enforce strict structural constraints even on a small 0.5B-parameter model.
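The format reward and the two rule checks named above can be sketched in a few lines. This is an illustration in the spirit of `r_format` and rules R05/R06, assuming the function names, the regex, and the ±0.1 reward values, none of which are confirmed by the post:

```python
import re

# Completion must open with an ACTION: line followed by a CONTENT: line.
FORMAT_RE = re.compile(r"^ACTION: \S+\nCONTENT: .+", re.DOTALL)

def r_format(completion: str) -> float:
    """+0.1 if the completion matches the ACTION:/CONTENT: template, else -0.1 (assumed values)."""
    return 0.1 if FORMAT_RE.match(completion.strip()) else -0.1

def rule_violations(actions: list[str]) -> list[str]:
    """Return the IDs of violated rules for one episode's action sequence."""
    violated = []
    if actions and actions[0] != "PROSPECT":
        violated.append("R06")  # first move must be PROSPECT
    if any(a == b for a, b in zip(actions, actions[1:])):
        violated.append("R05")  # no consecutive repetition of the same action
    return violated

print(r_format("ACTION: PROSPECT\nCONTENT: Hi, is this a good time to talk?"))  # 0.1
print(rule_violations(["PITCH", "PITCH", "CLOSE"]))  # ['R06', 'R05']
```

Keeping each rule as an independent check makes the violation counts in the table above easy to log per rule ID.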
 
  ---
image.png ADDED