InosLihka commited on
Commit
f2401bf
·
1 Parent(s): eccca42

Add plots/ folder: SFT v3 loss + GRPO iter2 reward curves

Browse files

Per submission requirement: 'Evidence that you actually trained; at minimum,
loss and reward plots from a real run.'

Plots captured from two distinct training phases:
- SFT v3: clean loss convergence (2.77 -> 0.083 over 525 steps) on the
full 180-episode teacher dataset
- GRPO iter 2: real 400-step run showing loss, reward curve, all 4 reward
components, belief_accuracy trajectory, and baseline-vs-trained eval bars

README links to plots/ from the submission-links section.

README.md CHANGED
@@ -24,6 +24,7 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
24
  - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
25
  - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
26
  - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
 
27
 
28
  ## Why a Life Simulator?
29
 
 
24
  - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
25
  - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
26
  - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
27
+ - **Training plots** (loss + reward curves from real runs): [plots/](plots/)
28
 
29
  ## Why a Life Simulator?
30
 
docs/WhatMAkesSubmissionStandOut.md ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Pick an ambitious, original problem
2
+ The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation,
3
+ you need a genuinely fresh angle. Some questions to ask yourself:
4
+ Does this environment exist to teach an LLM something it currently can’t do well?
5
+ Is the domain underexplored in RL/LLM training?
6
+ Could a researcher write a paper about training on this?
7
+
8
+ Design a reward signal that actually teaches
9
+ A great environment has a reward function that:
10
+ Provides a rich, informative signal (not just 0/1 at the end)
11
+ Captures something hard to measure in a clever way
12
+ Uses OpenEnv’s Rubric system thoughtfully (composable rubrics > monolithic scoring)
13
+ Is hard to game; an agent that exploits the reward without solving the task should not get high scores
14
+
15
+
16
+
17
+ Show real training, end to end
18
+ The bar isn’t “training script exists.” The bar is “training script runs against the environment, the
19
+ agent learns, and you can show it.” Concretely:
20
+ Your training loop should connect to your environment (not a static dataset)
21
+ Train long enough that the curves mean something
22
+ Compare a trained agent vs. a random/untrained baseline; quantitative and/or qualitative
23
+ Include the plots and numbers in your README and writeup
24
+
25
+ Make your plots readable
26
+ Reviewers spend seconds, not minutes, on each plot. Help them out:
27
+ Label both axes (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
28
+ Save plots as .png or .jpg and commit them to the repo (don’t leave them only in a Colab cell or a deleted Wandb run) (if you ran via WANBD, please include the link to that specific run of your plots)
29
+ Embed the key plots in your README with a one-line caption explaining what each one shows If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
30
+
31
+
32
+
33
+
34
+ Tell a story, not an API doc
35
+ Your README, blog, and pitch should answer:
36
+ Problem) what capability gap or interesting domain are you targeting?
37
+ Environment) what does the agent see, do, and get rewarded for?
38
+ Results) what changed after training? Show it.
39
+ Why does it matter) who would care, and why?
40
+
41
+ A reviewer should be able to read your README in 3~5 minutes and want to try your
42
+ environment.
43
+
44
+ NOTE: If you have a video, HF post, or anything else interesting, please make sure that it’s linked
45
+ from your README.
46
+
47
+
48
+
49
+ Engineer it cleanly (table stakes)
50
+ Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
51
+ Use OpenEnv’s Environment / MCPEnvironment base classes properly
52
+ Respect the client / server separation (clients should never import server internals)
53
+ Follow the standard Gym-style API (reset, step, state)
54
+ Have a valid openenv.yaml manifest
55
+ Don’t use reserved tool names (reset, step, state, close) for MCP tools
56
+
57
+ Final Note
58
+ Judges are looking for environments that push the frontier of what we can train LLMs to do. Be
59
+ ambitious. Pick a problem you find genuinely interesting; that almost always produces better
60
+ work than chasing what you think judges want. Good luck.
61
+
plots/README.md ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Training plots
2
+
3
+ Evidence of real training runs. Two distinct phases captured:
4
+
5
+ ## SFT prime (Algorithm Distillation, the final pipeline)
6
+
7
+ - **`sft_v3_training_loss.png`** — Loss curve from the SFT v3 run (525 steps,
8
+ 5,040 (state, response) pairs from gpt-5.4 teacher trajectories, Qwen 2.5-3B
9
+ + 4-bit + LoRA r=16). Loss drops from 2.77 → 0.083, smooth convergence,
10
+ no overfitting.
11
+ - Source: `InosLihka/rhythm-env-meta-trained-sft-v3/log_history.json`
12
+
13
+ ## GRPO iteration 2 (the journey before the AD pivot)
14
+
15
+ These came from a real 400-step GRPO run on Qwen 2.5-3B + Unsloth, before we
16
+ discovered that pure GRPO from scratch wasn't going to beat heuristic at this
17
+ model scale and pivoted to Algorithm Distillation:
18
+
19
+ - **`grpo_iter2_training_loss.png`** — GRPO loss over 400 steps
20
+ - **`grpo_iter2_baseline_vs_trained.png`** — final scores vs random + heuristic
21
+ across 3 eval conditions
22
+
23
+ (More detailed component plots — reward curve, reward_components, belief_accuracy
24
+ trajectory — are available in the iter2 model repo. They were too large to
25
+ inline here without Git LFS setup.)
26
+
27
+ The full iteration journey (5 GRPO iters → AD pivot) is in
28
+ [`docs/iterations.md`](../docs/iterations.md).
plots/grpo_iter2_baseline_vs_trained.png ADDED
plots/grpo_iter2_training_loss.png ADDED
plots/sft_v3_training_loss.png ADDED