Commit History
Fix notebook: remove hardcoded 384, use defaults (640, lr=5e-5, beta=0.04) cb53788
Fix stagnation: LR 5e-5, beta 0.04, default 640 001f0ed
max_completion=640 temp=0.9 fix truncation 537d8f4
num_gen=4 for T4 + bump steps to 340 1fa41e5
3-level 2c93c00
Make unsloth import lazy c1f1ab3
3-level curriculum + 7B + reward fixes 9868dfb
Add training evidence: curriculum results, plots (LFS), parse_job_log helper 805b735
changed model d745c55
balanced 5365e54
fixed mismatches d1f8afa
Shabista Sehar committed on
hf job 409ca4d
improved 3f2e418
python notebook da5d6b0
protection against reward hacking improved 9adca2d
training script 1272145
feat: implement GRPO training script with environment health checks and structured reward functions for bail assessment 46d6990
---- aa1acaa
feat: implement GRPO training pipeline for bail assessment model and update README credits 472a28c
feat: implement dataset loader, environment, and GRPO training pipeline for undertrial bail prediction bf8f1ff
modified a085ad1
implemented d8f8a45
Fix A3 (OOM eval), B9 (NDPS eligibility), B3 (direction-gated computation bonus), A8-pt2 (episode_id case lookup) 4855450
Fix 8 compliance gaps: repeat-action dedup+cache, min-steps hard block, criminal history tool (12th action), efficiency removed from training formula, circular import cleaned, yaml formula synced 898bc18
Reward overhaul: add compute_reasoning_quality (anchoring+arithmetic+specificity+consistency), parity-grounds penalty, reduce outcome 40%->30%, add 10% reasoning quality signal ca62faa
Fix 5 bugs: inference mode reset, step_counts in curriculum, adapter-only save (x3), DEMO001 false defence claim, episode_id in /reset 37edd09
import fixed c1adced