test hf space commit
commit 8ddb8edbeb82f8d1b73954dbdee4854f4792210b
Merge: 20d6040 fe1f842
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sun Apr 26 02:33:06 2026 +0530
Merge HF Space history into local main
commit 20d6040f8ddacfbd24251186cea73b05b4514f84
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sun Apr 26 01:32:55 2026 +0530
Update rewards, training scripts, add HF jobs for training
commit 6ea85753f94f96755595dfe5d9652437b0d4c0b7
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 22:34:30 2026 +0530
fix(rewards): never crash GRPO on malformed completions
Trainer crashed with ``ValueError: invalid literal for int() with base
10: 'none'`` because the model emitted ``"none"`` as a string for an
orchestrator integer field. A single bad completion takes down the
whole RL run, so harden parsing end-to-end:
- Add ``_coerce_int`` that gracefully handles ``None``, booleans,
floats (including NaN/inf), and arbitrary strings (``"none"``,
``"N/A"``, ``"2 agents"``). Use it for every orchestrator integer.
- Wrap ``decode_completion_text`` and ``parse_generated_task_completion``
in defensive ``try/except`` inside ``GeneratorRewardFunction.__call__``;
on any parsing/reward exception, fall through to a graded floor
reward so GRPO keeps training and the failure is logged via
``[reward_debug][parse_error]`` / ``[reward_debug][reward_error]``.
- Add regression tests for both behaviours.
Made-with: Cursor
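
[Editor's note: a minimal sketch of what a coercion helper like the ``_coerce_int`` described above could look like. This is an illustration, not the repository's actual implementation; the regex-based digit extraction in particular is an assumption.]

    import math
    import re
    from typing import Any

    def _coerce_int(value: Any, default: int = 0) -> int:
        """Best-effort conversion of a model-emitted value to int; never raises."""
        if value is None:
            return default
        if isinstance(value, bool):
            # Check bool before int: bool is a subclass of int in Python.
            return int(value)
        if isinstance(value, int):
            return value
        if isinstance(value, float):
            # NaN/inf cannot be converted to int; fall back to the default.
            if math.isnan(value) or math.isinf(value):
                return default
            return int(value)
        if isinstance(value, str):
            # Pull the first integer out of strings like "2 agents";
            # "none", "N/A", or "" contain no digits and yield the default.
            match = re.search(r"-?\d+", value.strip())
            return int(match.group()) if match else default
        return default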
commit df21aa96732b8627747707210288c1ffeadce1b7
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 22:19:27 2026 +0530
fix(self_play): break GRPO reward collapse by switching to instruct model and spreading the invalid-output reward floor
- Switch generator+answerer to Qwen2.5-0.5B-Instruct so the policy can
actually follow the JSON contract instead of collapsing into an
entropy-0 repetition loop that hits max_completion_length every step.
- Bump num_generations to 4 (with batch_size 4) and raise the
generator's max_completion_length to 768 so EOS can occur naturally
and the GRPO group has a real chance of intra-group reward variance.
- Plumb a generation-time repetition_penalty (1.1) through the phase
config and GRPOConfig kwargs (filtered by signature) to push the
policy out of collapsed loops.
- Drastically shrink the generator/answerer prompts: drop the
redundant canonical_graph echo from the required schema, trim the
shared-context dump, and replace the verbose few-shot example with
a compact, schema-only one. Saves hundreds of prompt tokens and
removes the implicit "always emit very long JSON" pattern.
- Spread the invalid-output reward floor: lower per-reason penalties,
raise partial-credit weights, and add a bounded text-level signal
(brace/keyword/diversity/length cues) so even fully invalid outputs
in the same GRPO group differ in reward, killing the
frac_reward_zero_std=1 plateau that locked loss/grad_norm/kl at 0.
Made-with: Cursor
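
[Editor's note: a sketch of the signature-filtering pass mentioned above, assuming a small helper along these lines; the helper name and call site are illustrative, not the repository's code.]

    import inspect

    def filter_kwargs_by_signature(cls, kwargs):
        """Drop kwargs that cls.__init__ does not accept, so optional
        generation settings survive across library versions."""
        accepted = inspect.signature(cls.__init__).parameters
        return {k: v for k, v in kwargs.items() if k in accepted}

    # Illustrative use: pass repetition_penalty through only when the
    # installed TRL's GRPOConfig actually accepts such a field.
    # config = GRPOConfig(**filter_kwargs_by_signature(GRPOConfig, phase_kwargs))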
commit cf98a1272aef617423519fa48bf28b697d9b1254
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sat Apr 25 21:32:27 2026 +0530
Training fix attempt
commit 120a52aff8f5dd1150b4229f19650b87c5eef139
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 20:00:24 2026 +0530
Improve RL output validity and reward signal shaping.
Tighten JSON prompts with grounded examples, add retry-backed validation for generated task materialization, widen/grade invalid-output rewards, and tune generation settings so GRPO receives meaningful reward variance instead of mostly flat penalties.
Made-with: Cursor
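
[Editor's note: the retry-backed validation could look roughly like this sketch; the function names and the exception type are assumptions.]

    def materialize_with_retries(generate_fn, validate_fn, max_attempts=3):
        """Regenerate until the candidate validates, instead of failing
        the whole batch on the first malformed sample."""
        last_error = None
        for _ in range(max_attempts):
            candidate = generate_fn()
            try:
                validate_fn(candidate)  # assumed to raise ValueError on invalid output
                return candidate
            except ValueError as err:
                last_error = err
        raise ValueError(f"materialization failed after {max_attempts} attempts") from last_error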
commit 1f589920c8e019523792e773ab5844e06fcf94ae
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 19:02:02 2026 +0530
Add full RL diagnostics and reward-variance safeguards.
Replace constant invalid swarm_v2 penalty with graded penalties, add per-batch reward/reason debug output, and log/assert key GRPO learning signals (reward variance, KL, loss, grads, parameter fingerprint changes) to catch stalled training early.
Made-with: Cursor
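
[Editor's note: a minimal sketch of such a per-step diagnostic, assuming rewards arrive as a flat list per batch; everything beyond the ``[reward_debug]`` prefix quoted elsewhere in this log is invented for illustration.]

    import statistics

    def log_learning_signals(step, rewards, kl, loss, eps=1e-8):
        """Print GRPO health metrics and flag a stalled run early: zero
        intra-group reward variance means zero advantages and no gradient."""
        variance = statistics.pvariance(rewards) if len(rewards) > 1 else 0.0
        print(f"[reward_debug] step={step} reward_var={variance:.6f} kl={kl:.4f} loss={loss:.4f}")
        if variance < eps:
            print(f"[reward_debug] WARNING: flat rewards at step {step}; advantages collapse to zero")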
commit af288b9f0240a67235c0b871e95f3ae3555cf5e3
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 18:36:56 2026 +0530
Increase A10G training budget and improve completion logging.
Raise rounds and per-phase max steps, increase task volume, save checkpoints every 10 steps, and add explicit training start/end phase logs so Space output clearly shows when training is done.
Made-with: Cursor
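
[Editor's note: in TRL/Transformers terms, the checkpoint cadence amounts to a config tweak such as the following; all values except the 10-step interval stated above are illustrative.]

    from trl import GRPOConfig

    config = GRPOConfig(
        output_dir="checkpoints",  # illustrative path
        save_strategy="steps",
        save_steps=10,             # checkpoint every 10 steps, as above
        logging_steps=1,
    )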
commit 643ccc5d7e1442d0b36c2ea67b85eb8d88190357
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 18:04:34 2026 +0530
Fix TRL reward function naming compatibility.
Wrap callable reward objects with a stable __name__ fallback so GRPOTrainer initialization works on TRL versions that introspect reward function names.
Made-with: Cursor
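
[Editor's note: a sketch of the kind of wrapper this implies; the wrapper name is an assumption.]

    def with_stable_name(reward_fn, fallback_name=None):
        """Ensure a callable reward object exposes __name__, since some TRL
        versions introspect reward function names during GRPOTrainer init."""
        if not hasattr(reward_fn, "__name__"):
            reward_fn.__name__ = fallback_name or type(reward_fn).__name__
        return reward_fn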
commit b15aba7ef45c829a4b1ec5f4dec406b8446a181c
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:54:08 2026 +0530
Add HF training deps for Qwen3.5 startup runs.
Include pillow and torchvision in train extras and document HF_TOKEN secret for Space startup training to avoid Qwen image-processor import errors and unauthenticated Hub warnings.
Made-with: Cursor
commit 032760fc2bb24593ee44f29f76ecd1687b2f7235
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:48:36 2026 +0530
Fix GRPO smoke config to satisfy generation constraints.
Set per-device train batch size back to 2 and num_generations to 2 so TRL can compute grouped advantages and pass generation batch divisibility checks.
Made-with: Cursor
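
[Editor's note: the constraint being satisfied here (and in the commit below) is TRL's check that the effective generation batch splits evenly into groups of ``num_generations``; a minimal illustration, with values from this commit.]

    from trl import GRPOConfig

    # per_device_train_batch_size * gradient_accumulation_steps * world_size
    # must be divisible by num_generations, so that each prompt's group of
    # completions stays together for grouped-advantage computation.
    config = GRPOConfig(
        output_dir="out",
        per_device_train_batch_size=2,
        num_generations=2,  # 2 % 2 == 0 on a single device
    )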
commit dbffa252456cc0d04938114855c6e8db3a1494ef
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:44:10 2026 +0530
Fix smoke config generation divisibility for TRL GRPO.
Reduce num_generations to 1 in the A10G smoke config so generation_batch_size is divisible and startup training can run without GRPOConfig validation failure.
Made-with: Cursor
commit 905eb0b15ef1ba1efad8643d79efefa94a491c75
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:34:23 2026 +0530
Add startup-triggered Space training flow for non-Dev Mode.
Install train dependencies in the image and add a startup script that can run self-play training from Space env flags before launching the API server, enabling Qwen 0.8B smoke runs without terminal access.
Made-with: Cursor
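
[Editor's note: a rough Python sketch of such a startup flow; the env-var and module names are hypothetical, and the actual Space may use a shell script instead. Port 7860 is the Hugging Face Spaces default.]

    import os
    import subprocess

    # Hypothetical entrypoint: optionally run self-play training from a
    # Space env flag before launching the API server.
    if os.environ.get("TRAIN_ON_STARTUP") == "1":
        subprocess.run(
            ["python", "-m", "selfplay.train",
             "--config", os.environ.get("TRAIN_CONFIG", "configs/a10g_smoke.yaml")],
            check=True,
        )
    subprocess.run(["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"], check=True)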
commit f5157c4a8cf1dfdd59f65d66a0ee74e4ad924f16
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:31:26 2026 +0530
Add fixed-vs-generate canonical graph training mode.
Introduce canonical_graph_mode in self-play configs so swarm_v2 can either keep canonical candidates fixed (question/answer focus) or allow canonical graph generation, and document/test both paths.
Made-with: Cursor
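
[Editor's note: conceptually, the new switch is a single config field; a hypothetical excerpt.]

    # Hypothetical self-play config excerpt: "fixed" keeps canonical
    # candidates frozen (question/answer focus), while "generate" lets
    # the policy produce canonical graphs as well.
    self_play_config = {
        "canonical_graph_mode": "fixed",  # or "generate"
    }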
commit ddf51664f256cfbfd49803b422a26ac0b488bbca
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:09:14 2026 +0530
Ignore osint dashboard html file.
Add an explicit ignore rule for osint_dashboard.html to avoid accidentally tracking generated dashboard artifacts.
Made-with: Cursor
commit af9a157dafe3c4a903db97831452395560613109
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:06:11 2026 +0530
Add W&B reporting support and HF A10G smoke config.
Enable optional wandb logging for self-play phases and add a short Qwen 3.5 0.8B LoRA smoke config/documentation so training can be validated quickly on Hugging Face A10G before scaling.
Made-with: Cursor
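
[Editor's note: with a Transformers-style config, optional W&B reporting typically reduces to the ``report_to`` field; an illustrative sketch, not the repository's code.]

    import os

    from trl import GRPOConfig

    # Report to W&B only when credentials are present; "none" disables reporting.
    config = GRPOConfig(
        output_dir="out",
        report_to="wandb" if os.environ.get("WANDB_API_KEY") else "none",
    )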
commit 3b5458194ef58c10ffa03546b7d1d29081bdd3d1
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 16:43:19 2026 +0530
Updated readme with reward function
commit b8680deeff5392b23a7799f04893496596d10736
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sat Apr 25 14:13:54 2026 +0530
Updated adversarial self-play training
commit b39c179b1aae65aa0175e84fef1cb30f75eca16a
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Tue Apr 21 20:26:37 2026 +0530
Added self-play training with TRL
commit 6ff87397b484622e694f26cd9d1d4f37631244d0
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 21 01:17:39 2026 +0530
removed large file
commit 0085e65b365a9589116dbe4a2f64635be9cdf909
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 21 01:16:28 2026 +0530
added visualisation
commit 04ff268418d790ff38365542aaa09dea38e72a58
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Tue Apr 21 00:51:38 2026 +0530
added MetaQA dataset
commit 4c3b3f34987bd4d6b0946ae2b589e4b34c3a24eb
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Wed Apr 8 10:32:45 2026 +0530
made diff tasks print properly
commit a5d62379e99db950271891a79c4331a3c3bd580d
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Wed Apr 8 05:25:54 2026 +0530
tp
commit ea695ac172448fd2360494f0f2d63d6dfdc61493
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Wed Apr 8 02:41:59 2026 +0530
Update inference
commit 02fe199c43be1bd7b69330a043b0f0ac767ff085
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Tue Apr 7 23:50:17 2026 +0530
Update openenv
commit 87f856209f9aa06bcf366b07cc93306d1d50c054
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 21:31:19 2026 +0530
Removed unintended embedded repo and added to gitignore
commit 4de4725e858d474dbcbaadcd1f8cb4a36cbe41fb
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 16:36:33 2026 +0530
fixed tasks
commit 515f8c042e27b98c515a6927a2dcaf30d6e5b4ad
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 16:13:40 2026 +0530
fixed tasks
commit f4a4c0487cc097972f4798862ae2548e451cb924
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 15:59:58 2026
@@ -953,7 +953,7 @@ class GeneratorRewardFunction:
         context_pressure = self._context_pressure_score(validation_result)
         parl_parallel, parl_finish = self._parl_scores(candidate)
         hardness_component = max(0.0, min(1.0, (hardness + 0.4) / 1.4))
-        consistency_component = max(
+        consistency_component = max(
             0.0,
             min(
                 1.0,