ritishshrirao committed
Commit d822755 · 1 Parent(s): fe1f842

test hf space commit

commit 8ddb8edbeb82f8d1b73954dbdee4854f4792210b
Merge: 20d6040 fe1f842
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sun Apr 26 02:33:06 2026 +0530

Merge HF Space history into local main

commit 20d6040f8ddacfbd24251186cea73b05b4514f84
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sun Apr 26 01:32:55 2026 +0530

Update rewards, training scripts, add HF jobs for training

commit 6ea85753f94f96755595dfe5d9652437b0d4c0b7
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 22:34:30 2026 +0530

fix(rewards): never crash GRPO on malformed completions

Trainer crashed with ``ValueError: invalid literal for int() with base
10: 'none'`` because the model emitted ``"none"`` as a string for an
orchestrator integer field. A single bad completion takes down the
whole RL run, so harden parsing end-to-end:

- Add ``_coerce_int`` that gracefully handles ``None``, booleans,
floats (including NaN/inf), and arbitrary strings (``"none"``,
``"N/A"``, ``"2 agents"``). Use it for every orchestrator integer.
- Wrap ``decode_completion_text`` and ``parse_generated_task_completion``
in defensive ``try/except`` inside ``GeneratorRewardFunction.__call__``;
on any parsing/reward exception, fall through to a graded floor
reward so GRPO keeps training and the failure is logged via
``[reward_debug][parse_error]`` / ``[reward_debug][reward_error]``.
- Add regression tests for both behaviours.

Made-with: Cursor

commit df21aa96732b8627747707210288c1ffeadce1b7
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 22:19:27 2026 +0530

fix(self_play): break GRPO reward collapse by switching to instruct model and spreading the invalid-output reward floor

- Switch generator+answerer to Qwen2.5-0.5B-Instruct so the policy can
actually follow the JSON contract instead of collapsing into an
entropy-0 repetition loop that hits max_completion_length every step.
- Bump num_generations to 4 (with batch_size 4) and raise the
generator's max_completion_length to 768 so EOS can occur naturally
and the GRPO group has a real chance of intra-group reward variance.
- Plumb a generation-time repetition_penalty (1.1) through the phase
config and GRPOConfig kwargs (filtered by signature) to push the
policy out of collapsed loops.
- Drastically shrink the generator/answerer prompts: drop the
redundant canonical_graph echo from the required schema, trim the
shared-context dump, and replace the verbose few-shot example with
a compact, schema-only one. Saves hundreds of prompt tokens and
removes the implicit "always emit very long JSON" pattern.
- Spread the invalid-output reward floor: lower per-reason penalties,
raise partial-credit weights, and add a bounded text-level signal
(brace/keyword/diversity/length cues) so even fully invalid outputs
in the same GRPO group differ in reward, killing the
frac_reward_zero_std=1 plateau that locked loss/grad_norm/kl at 0.

Made-with: Cursor
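The bounded text-level signal from the last bullet might look like the sketch below. The cue names and weights are assumptions made for illustration (the keyword strings in particular are invented); the point is only that two fully invalid completions in the same GRPO group receive different rewards inside a bounded band above the floor.

```python
def text_floor_signal(text, floor=-1.0, span=0.3):
    """Graded floor reward for invalid outputs (hypothetical sketch).

    Combines brace/keyword/diversity/length cues into a score in [0, 1],
    then maps it into [floor, floor + span] so even invalid completions
    differ in reward and intra-group variance is nonzero.
    """
    cues = 0.0
    cues += 0.25 * min(text.count("{"), 4) / 4                       # brace cue
    cues += 0.25 * float(any(k in text for k in ('"question"', '"answer"')))  # keyword cue (assumed keys)
    tokens = text.split()
    cues += 0.25 * (len(set(tokens)) / max(len(tokens), 1))          # diversity cue
    cues += 0.25 * min(len(text), 512) / 512                         # length cue
    return floor + span * cues
```

Because every reward stays strictly below the valid-output range, the gradient still points toward emitting valid JSON while the frac_reward_zero_std=1 plateau is broken.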

commit cf98a1272aef617423519fa48bf28b697d9b1254
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sat Apr 25 21:32:27 2026 +0530

Training fix attempt

commit 120a52aff8f5dd1150b4229f19650b87c5eef139
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 20:00:24 2026 +0530

Improve RL output validity and reward signal shaping.

Tighten JSON prompts with grounded examples, add retry-backed validation for generated task materialization, widen/grade invalid-output rewards, and tune generation settings so GRPO receives meaningful reward variance instead of mostly flat penalties.

Made-with: Cursor

commit 1f589920c8e019523792e773ab5844e06fcf94ae
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 19:02:02 2026 +0530

Add full RL diagnostics and reward-variance safeguards.

Replace constant invalid swarm_v2 penalty with graded penalties, add per-batch reward/reason debug output, and log/assert key GRPO learning signals (reward variance, KL, loss, grads, parameter fingerprint changes) to catch stalled training early.

Made-with: Cursor
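A minimal version of the reward-variance safeguard could be sketched like this; the message format and `strict` switch are illustrative assumptions, not the repo's actual diagnostics code.

```python
import statistics


def assert_reward_variance(rewards, min_std=1e-6, strict=False):
    """Flag the stalled-training signature where a whole GRPO group gets
    identical rewards, so advantages (and hence gradients) are all zero.

    Hypothetical sketch: logs by default, raises when strict=True.
    """
    std = statistics.pstdev(rewards)
    if std < min_std:
        msg = f"[reward_debug] zero reward std ({std:.3g}) across group of {len(rewards)}"
        if strict:
            raise AssertionError(msg)
        print(msg)
    return std
```

Checking this per batch catches a collapsed run within a few steps instead of after hours of flat loss/grad_norm/kl curves.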

commit af288b9f0240a67235c0b871e95f3ae3555cf5e3
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 18:36:56 2026 +0530

Increase A10G training budget and improve completion logging.

Raise rounds and per-phase max steps, increase task volume, save checkpoints every 10 steps, and add explicit training start/end phase logs so Space output clearly shows when training is done.

Made-with: Cursor

commit 643ccc5d7e1442d0b36c2ea67b85eb8d88190357
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 18:04:34 2026 +0530

Fix TRL reward function naming compatibility.

Wrap callable reward objects with a stable __name__ fallback so GRPOTrainer initialization works on TRL versions that introspect reward function names.

Made-with: Cursor
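The `__name__` fallback described above can be sketched as a thin wrapper. The helper name is hypothetical; the idea is just that some TRL versions read `reward_func.__name__` during `GRPOTrainer` init, which plain callable objects (instances defining `__call__`) do not expose.

```python
def named_reward(reward_obj, name=None):
    """Wrap a callable reward object so it always exposes a stable __name__.

    Hypothetical sketch: falls back to the object's own __name__ if present,
    otherwise to its class name.
    """
    resolved = name or getattr(reward_obj, "__name__", None) or type(reward_obj).__name__

    def wrapper(*args, **kwargs):
        return reward_obj(*args, **kwargs)

    wrapper.__name__ = resolved
    return wrapper
```

Plain functions pass through unchanged in behavior, while class-based rewards like `GeneratorRewardFunction` pick up their class name.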

commit b15aba7ef45c829a4b1ec5f4dec406b8446a181c
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:54:08 2026 +0530

Add HF training deps for Qwen3.5 startup runs.

Include pillow and torchvision in train extras and document HF_TOKEN secret for Space startup training to avoid Qwen image-processor import errors and unauthenticated Hub warnings.

Made-with: Cursor

commit 032760fc2bb24593ee44f29f76ecd1687b2f7235
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:48:36 2026 +0530

Fix GRPO smoke config to satisfy generation constraints.

Set per-device train batch size back to 2 and num_generations to 2 so TRL can compute grouped advantages and pass generation batch divisibility checks.

Made-with: Cursor
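The divisibility constraint behind this fix and the next one can be sketched as a quick sanity check. The exact formula TRL validates varies by version, so treat this as an illustrative assumption rather than TRL's actual validation code.

```python
# Hypothetical sketch of the GRPO generation-batch constraint: the
# effective generation batch must split evenly into groups of
# num_generations completions per prompt.
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
num_processes = 1
num_generations = 2

generation_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
assert generation_batch_size % num_generations == 0, (
    f"generation_batch_size={generation_batch_size} is not divisible "
    f"by num_generations={num_generations}"
)
```

Running this check before launching a Space smoke run surfaces the `GRPOConfig` validation failure locally instead of at startup.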

commit dbffa252456cc0d04938114855c6e8db3a1494ef
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:44:10 2026 +0530

Fix smoke config generation divisibility for TRL GRPO.

Reduce num_generations to 1 in the A10G smoke config so generation_batch_size is divisible and startup training can run without GRPOConfig validation failure.

Made-with: Cursor

commit 905eb0b15ef1ba1efad8643d79efefa94a491c75
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:34:23 2026 +0530

Add startup-triggered Space training flow for non-Dev Mode.

Install train dependencies in the image and add a startup script that can run self-play training from Space env flags before launching the API server, enabling Qwen 0.8B smoke runs without terminal access.

Made-with: Cursor

commit f5157c4a8cf1dfdd59f65d66a0ee74e4ad924f16
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:31:26 2026 +0530

Add fixed-vs-generate canonical graph training mode.

Introduce canonical_graph_mode in self-play configs so swarm_v2 can either keep canonical candidates fixed (question/answer focus) or allow canonical graph generation, and document/test both paths.

Made-with: Cursor

commit ddf51664f256cfbfd49803b422a26ac0b488bbca
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:09:14 2026 +0530

Ignore osint dashboard html file.

Add an explicit ignore rule for osint_dashboard.html to avoid accidentally tracking generated dashboard artifacts.

Made-with: Cursor

commit af9a157dafe3c4a903db97831452395560613109
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 17:06:11 2026 +0530

Add W&B reporting support and HF A10G smoke config.

Enable optional wandb logging for self-play phases and add a short Qwen 3.5 0.8B LoRA smoke config/documentation so training can be validated quickly on Hugging Face A10G before scaling.

Made-with: Cursor

commit 3b5458194ef58c10ffa03546b7d1d29081bdd3d1
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Sat Apr 25 16:43:19 2026 +0530

updated readme with reward function

commit b8680deeff5392b23a7799f04893496596d10736
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Sat Apr 25 14:13:54 2026 +0530

Updated adversarial self play training

commit b39c179b1aae65aa0175e84fef1cb30f75eca16a
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Tue Apr 21 20:26:37 2026 +0530

Added self-play training with TRL

commit 6ff87397b484622e694f26cd9d1d4f37631244d0
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 21 01:17:39 2026 +0530

removed large file

commit 0085e65b365a9589116dbe4a2f64635be9cdf909
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 21 01:16:28 2026 +0530

added visualisation

commit 04ff268418d790ff38365542aaa09dea38e72a58
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Tue Apr 21 00:51:38 2026 +0530

added MetaQA dataset

commit 4c3b3f34987bd4d6b0946ae2b589e4b34c3a24eb
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Wed Apr 8 10:32:45 2026 +0530

made diff tasks print properly

commit a5d62379e99db950271891a79c4331a3c3bd580d
Author: siddeshwar-kagatikar <siddeshwar2004@gamil.com>
Date: Wed Apr 8 05:25:54 2026 +0530

tp

commit ea695ac172448fd2360494f0f2d63d6dfdc61493
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Wed Apr 8 02:41:59 2026 +0530

Update inference

commit 02fe199c43be1bd7b69330a043b0f0ac767ff085
Author: Ritish Shrirao <ritishshrirao@gmail.com>
Date: Tue Apr 7 23:50:17 2026 +0530

Update openenv

commit 87f856209f9aa06bcf366b07cc93306d1d50c054
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 21:31:19 2026 +0530

Removed unintended embedded repo and added to gitignore

commit 4de4725e858d474dbcbaadcd1f8cb4a36cbe41fb
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 16:36:33 2026 +0530

fixed tasks

commit 515f8c042e27b98c515a6927a2dcaf30d6e5b4ad
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 16:13:40 2026 +0530

fixed tasks

commit f4a4c0487cc097972f4798862ae2548e451cb924
Author: siddeshwar-kagatikar <siddeshwar2004@gmail.com>
Date: Tue Apr 7 15:59:58 2026

Files changed (1)
  1. src/osint_env/training/rewards.py +1 -1

src/osint_env/training/rewards.py CHANGED
@@ -953,7 +953,7 @@ class GeneratorRewardFunction:
         context_pressure = self._context_pressure_score(validation_result)
         parl_parallel, parl_finish = self._parl_scores(candidate)
         hardness_component = max(0.0, min(1.0, (hardness + 0.4) / 1.4))
-        consistency_component = max(
+        consistency_component = max(
             0.0,
             min(
                 1.0,