tucano2-commerce / notebooks / v4_2_instruct_grpo.ipynb

Commit History

feat(rewards): add sentiment mismatch penalty to prevent extraction reward hacking
b1be31c

rtferraz Claude Haiku 4.5 committed on

fix(probe): use TRL 0.24.0 log keys - rewards/commerce_reward_fn/mean, grad_norm (not train/ prefix)
080fd9a
verified

rtferraz committed on

fix(classifier): reorder _classify_task_type - insights before push to prevent reengajamento misclassification
63b1c86
verified

rtferraz committed on

fix(rewards): 3 bugs from Cell 8 audit - push length/formal, SQL domain, extraction int check
41eb15f
verified

rtferraz committed on

Fix V4.2 audit: show INPUT REVIEW alongside MODEL OUTPUT for proper human scoring
71422f3
verified

rtferraz committed on

Fix V4.2: task weights 40/40/10/10, full audit completions, interactive input() scoring
0fc9042
verified

rtferraz committed on

Fix V4.2: GDPO + IWU now active in training reward path (not just monitoring)
c95e44c
verified

rtferraz committed on

Add V4.2 GRPO training notebook (Gold Standard, 0.5B)
c5f1d2d
verified

rtferraz committed on