Spaces: InosLihka/rhythm_env

rhythm_env / training

Commit History

Clarify documentation: anomaly signal explainer, GRPO scope notes

361aed7

InosLihka committed 7 days ago

Move blog to root as BLOG.md (per Meta mentor guidance)

eccca42

InosLihka committed 12 days ago

Fix prompt truncation in inference_eval.py: max_seq_length 768 -> 2048

1217c1d

InosLihka committed 12 days ago

Fix max_new_tokens for CoT format + add eval-only HF Jobs script

b9c9b8f

InosLihka committed 12 days ago

Algorithm Distillation: grader v2 with belief_accuracy + SFT pipeline

ece0bbe

InosLihka committed 12 days ago

client: surface ALL observation fields (was dropping deltas, anomalies, last_action, step_history)

105973d

InosLihka, Claude Opus 4.7 (1M context) committed 12 days ago

iter4: fix the 'constant belief = free reward' bug + 6 other deep issues

bb2a9c7

InosLihka, Claude Opus 4.7 (1M context) committed 12 days ago

iter3: align reward with grader + belief-first format + exploration shaping

64d24b3

InosLihka, Claude Opus 4.7 (1M context) committed 12 days ago

iter2: fix mode collapse + 3 deeper bugs from code review

e21a960

InosLihka, Claude Opus 4.7 (1M context) committed 13 days ago

tune: GRPO hyperparameter fixes from ML reviewer

dc0186f

InosLihka, Claude Opus 4.7 (1M context) committed 13 days ago

feat: HF Jobs training script + plot generator

73c7ea0

InosLihka committed 13 days ago

fix: notebook plot cell syntax error (newline in string literal)

7340206

InosLihka committed 13 days ago

notebook: add belief-accuracy + reward-components plots

b5ac530

InosLihka, Claude Opus 4.7 (1M context) committed 13 days ago

env: meta-RL refactor (continuous profiles, action+belief, adaptation grader)

ecbe0d8

InosLihka, Claude Opus 4.7 (1M context) committed 13 days ago

fix: correct GRPO training hyperparameters to prevent KL explosion

fb112e4

InosLihka, Claude Sonnet 4.6 committed 13 days ago

fix: rename kl_coef to beta (correct param name in TRL GRPOConfig)

2c6ee11

InosLihka, Claude Sonnet 4.6 committed 13 days ago

fix: reduce kl_coef to prevent training instability

0bdfeaa

InosLihka, Claude Sonnet 4.6 committed 13 days ago

Rebuild as Life Simulator: 5 meters, 3 hidden profiles, GRPO training pipeline

cc6473a

InosLihka, Claude Sonnet 4.6 committed 14 days ago
