---
title: UndertriAI
emoji: ⚖️
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv RL environment for Indian bail decision support
tags:
  - openenv
  - legal-ai
  - reinforcement-learning
  - bail
  - india
  - grpo
  - world-modeling
---

# UndertriAI ⚖️

**OpenEnv-compliant RL training environment for Indian bail decision support.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-6366f1)](https://github.com/meta-pytorch/OpenEnv)
[![Live Demo](https://img.shields.io/badge/🤗_Space-Live_Demo-yellow)](https://huggingface.co/spaces/Draken1606/undertrial-ai)
[![Swagger](https://img.shields.io/badge/API-Swagger_Docs-green)](https://draken1606-undertrial-ai.hf.space/docs)
[![License: MIT](https://img.shields.io/badge/License-MIT-gray)](LICENSE)

> **[▶ Try the Live Demo](https://huggingface.co/spaces/Draken1606/undertrial-ai)** — click "Run Bail Assessment" to see the environment in action.
> **[📝 Read the Story](https://huggingface.co/spaces/Draken1606/undertrial-ai/blob/main/Blog.md)** — *"Three minutes should never decide a life"* (link to be updated)

---

## The Problem

**76% of India's 5.7 lakh prisoners are undertrials**[^1] — unconvicted people awaiting bail hearings, many of whom cannot afford lawyers.

A subordinate court judge handles **80–100 bail hearings per day** — roughly **3 minutes per case**. In that window they must read the charge sheet, assess flight risk, evaluate custody duration against the statutory threshold, and check for parity with co-accused. In practice, outcomes are inconsistent and empirically biased against poor, lower-caste, and minority accused.

**This is not anecdotal — it is structural.** The Supreme Court in *Satender Kumar Antil v. CBI* (2022) explicitly noted the crisis.

[^1]: Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts," *arXiv:2508.07592* (2025), analyzing NCRB Prison Statistics India 2022.

---

## What UndertriAI Does

UndertriAI is an **OpenEnv-compliant RL training environment** designed for **Theme 3.1: Professional Tasks / World Modeling**. It teaches an LLM to interact with a realistic legal workflow — not through shortcuts, but through genuine tool use, statutory reasoning, and multi-step case analysis:

1. **Read case documents** (charge sheet, arguments, criminal history)
2. **Invoke legal tools** (12 specialized tools for statutory eligibility, precedent lookup, risk assessment)
3. **Produce structured bail memos** with explicit reasoning chains
4. **Get evaluated** against real Indian High Court decisions using a deterministic, multi-component reward function

Additionally, the environment implements **Theme 4: Self-Improvement** through adaptive curriculum mechanisms (detailed below).
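For orientation, here is how those four steps look from the client side. This is a hedged sketch: only `UndertriAIEnv(base_url=...)` and `reset(stage=1)` appear in the Installation section below; the `step()` payload and return shapes are assumptions, so treat the Swagger docs as authoritative.

```python
# Orientation sketch of one episode via the bundled client. Only reset() is
# documented in the Installation section; the step() payload and return
# shapes below are assumptions -- check the Swagger docs for real schemas.
from client import UndertriAIEnv

env = UndertriAIEnv(base_url="https://draken1606-undertrial-ai.hf.space")
obs = env.reset(stage=1)  # new episode at curriculum stage 1

# Ground the decision in tool outputs rather than guesses (steps 1-2).
eligibility = env.step({"tool": "compute_statutory_eligibility",
                        "section": "IPC 420", "custody_months": 8})
risk = env.step({"tool": "assess_flight_risk"})

# Terminal action: submit the structured memo for deterministic scoring (steps 3-4).
result = env.step({"tool": "submit_memo",
                   "recommendation": "grant",
                   "reasoning": "Custody approaches the statutory threshold; ..."})
print(result)  # reward + per-component breakdown (assumed shape)
```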
---

## Environment Design

### Theme 3.1: Professional Tasks / World Modeling

This environment qualifies for Theme 3.1 by requiring **genuine interaction with a partially observable legal world** where:

- **Tool invocation is mandatory** — statutory thresholds cannot be guessed; they must be computed via `compute_statutory_eligibility`
- **Multi-step reasoning is required** — the model must sequence tool calls (read arguments → assess risk → compute eligibility → cite precedent → draft memo)
- **Shortcuts fail** — submitting a memo without tool use earns near-zero reward due to missing statutory/precedent signals
- **State persistence matters** — tool outputs accumulate in episode state; later reasoning depends on earlier tool calls
- **API/workflow simulation** — the environment models real judicial clerk workflows: document retrieval, legal database queries, risk scoring matrices

**This is not a text completion task.** It is a dynamic system where the agent must orchestrate tools, maintain working memory across 5–15 actions per episode, and produce outputs that match real judicial reasoning patterns.

### API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/reset?stage=1` | Start a new episode (curriculum stage 1–4) |
| `POST` | `/reset?adaptive=true&auto_stage=true` | Start episode with adaptive selection (Theme 4) |
| `POST` | `/step` | Submit a tool call or final memo |
| `GET` | `/state?session_id=...` | Inspect current episode state |
| `GET` | `/profile?session_id=...` | Agent performance profile (Theme 4) |
| `GET` | `/adaptive_status` | Adaptive mode capabilities & thresholds |
| `GET` | `/health` | Health check |
| `GET` | `/tools` | List available tools |
| `WS` | `/ws/{session_id}` | WebSocket real-time feed |

### Tools Available to the Agent

| Tool | Purpose |
|---|---|
| `compute_statutory_eligibility` | Calculate custody vs threshold for IPC/BNSS sections (non-guessable) |
| `cross_reference_precedent` | Look up landmark HC/SC decisions |
| `assess_surety` | Evaluate surety bond appropriateness |
| `classify_bail_type` | Determine regular / anticipatory / default bail |
| `request_document` | Request additional case documents |
| `flag_inconsistency` | Flag contradictions in the charge sheet |
| `read_submissions` | Read prosecution/defence arguments on record |
| `assess_flight_risk` | Systematic flight risk scoring matrix |
| `check_case_factors` | Examine parity, evidence tampering, victim vulnerability |
| `apply_proportionality` | BNSS 479 custody vs. max sentence proportionality |
| `pull_criminal_history` | Prior record, bail history, conviction status |
| `submit_memo` | **Terminal action** — submit final bail recommendation |

**Example tool invocation:**

```json
{
  "tool": "compute_statutory_eligibility",
  "section": "IPC 420",
  "custody_months": 8
}
```

### 4-Stage Curriculum

| Stage | Focus | Cases | Learning Objective |
|---|---|---|---|
| 1 | Landmark cases (clear-cut eligibility) | ~40 | Learn tool sequencing + format |
| 2 | Contested cases (murder, repeat offenders) | ~1,100 | Learn contested reasoning patterns |
| 3 | Bias-reversal cases (HC overturning biased lower courts) | ~30 | Learn to detect parity violations |
| 4 | BNSS schema drift (IPC → BNS remapping, 2023 reform) | ~50 | Test adaptability to legal schema changes |

**Example Stage 4 challenge:** A case uses IPC 379 (theft, 3-year max sentence, threshold = 1/2 of max = 18 months). After the BNSS 2023 reform, this maps to BNS 303 (theft, still a 3-year max, but different bail provision language under BNSS § 479). The model must apply the new schema without retraining on BNSS-specific examples.
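To make "non-guessable" concrete, here is an illustrative sketch of the kind of computation `compute_statutory_eligibility` performs under the half-of-maximum rule used in the Stage 4 example. The section table and return shape are assumptions, not the server's implementation:

```python
# Illustrative sketch of the statutory eligibility computation -- NOT the
# server's implementation. Section data and return shape are assumptions.
MAX_SENTENCE_MONTHS = {
    "IPC 379": 36,   # theft, 3-year max (pre-reform)
    "BNS 303": 36,   # theft after the 2023 IPC -> BNS remapping
}

def compute_statutory_eligibility(section: str, custody_months: float) -> dict:
    """Half-of-max-sentence threshold, as in the Stage 4 example above."""
    max_sentence = MAX_SENTENCE_MONTHS[section]
    threshold = max_sentence / 2          # IPC 379: 36 / 2 = 18 months
    return {
        "section": section,
        "threshold_months": threshold,
        "custody_months": custody_months,
        "eligible": custody_months >= threshold,
    }

print(compute_statutory_eligibility("IPC 379", custody_months=8))
# {'section': 'IPC 379', 'threshold_months': 18.0, ..., 'eligible': False}
```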
---

## Theme 4 — Self-Improvement (Secondary)

UndertriAI implements three self-improvement mechanisms as a **secondary theme contribution**:

**1. Adaptive Curriculum Promotion**

The environment tracks per-stage performance using exponential moving averages. When the agent demonstrates consistent improvement (Stage 1 mean reward ≥ 0.65 over 20 episodes), it is automatically promoted to the next curriculum stage. Visible in training logs as:

```
[SELF-IMPROVEMENT] Step 100: Promoted to Stage 2.
Stage 1 mean reward: 0.710 → Stage 2 begins.
```

**2. Weakness-Targeted Episode Selection**

In adaptive mode, the episode selector identifies the crime type where the agent performs worst (via EMA-tracked per-crime-type reward) and serves proportionally more cases from that domain. As the agent improves on weak domains, the selection distribution shifts — the environment continuously finds and targets new weaknesses.

| Selection Mode | Weight | Mechanism |
|---|---|---|
| Weakest domain | 60% | Serve cases from the lowest-performing crime category |
| Failure replay | 30% | Re-serve cases with reward < 0.40 |
| Exploration | 10% | Uniform random (prevents overfitting) |

**3. Synthetic Case Generation**

When the agent masters a stage (mean reward ≥ 0.70 on that stage), the environment generates harder synthetic variants using 5 perturbation types:

| Perturbation | What it tests |
|---|---|
| Custody escalation | Custody 2 months below threshold — forces exact statutory computation |
| Co-accused conflict | Opposite bail outcomes for co-accused — tests parity reasoning |
| Section ambiguity | IPC ↔ BNSS section swap — tests schema drift robustness |
| Evidence reversal | Key witness retracted — tests flight risk reassessment |
| Surety complexity | Non-resident surety — tests condition appropriateness |

**Live Demo — Self-Improvement in Action:**

```bash
# Start the server
python -m server.app

# In another terminal — adaptive training
python training/train_grpo.py --adaptive --steps 50 --env_url http://localhost:8000
```

Monitor progress via `GET /profile?session_id={id}` and `GET /adaptive_status`.
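For intuition, a minimal sketch of the promotion rule described in mechanism 1. The 0.65 threshold and 20-episode window come from the text above; the smoothing factor and class shape are illustrative assumptions (the real logic lives in `server/performance_tracker.py`):

```python
# Minimal sketch of EMA-based stage promotion -- illustrative only.
# alpha and the warm-up window are assumptions; the 0.65 threshold and
# 20-episode minimum come from the description above.
class StagePromoter:
    def __init__(self, threshold: float = 0.65, alpha: float = 0.1, min_episodes: int = 20):
        self.threshold, self.alpha, self.min_episodes = threshold, alpha, min_episodes
        self.ema, self.n, self.stage = 0.0, 0, 1

    def update(self, reward: float) -> int:
        """Fold one episode reward into the EMA; promote when it clears the bar."""
        self.n += 1
        self.ema = reward if self.n == 1 else self.alpha * reward + (1 - self.alpha) * self.ema
        if self.n >= self.min_episodes and self.ema >= self.threshold:
            print(f"[SELF-IMPROVEMENT] Promoted to Stage {self.stage + 1}. "
                  f"Stage {self.stage} mean reward: {self.ema:.3f}")
            self.stage += 1
            self.ema, self.n = 0.0, 0   # reset tracking for the new stage
        return self.stage
```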
---

## Reward Function

```python
R = 0.4  × outcome_match             (gated by think_factor)
  + 0.2  × flight_risk_accuracy
  + 0.2  × statutory_accuracy
  + 0.2  × condition_appropriateness
  + 0.1  × reasoning_quality         (bonus)
  + 0.05 × format_compliance         (bonus)
  + 0.05 × process_bonus             (tool-use proxy, bonus)
  ± 0.05 × diversity_bonus           (anti-collapse signal)
  − 0.3  × bias_penalty              (fires on parity violations)
```

**Reward range:** core components sum to 1.0; with bonuses, the total can reach ~1.15; with the bias penalty, it can drop to ~0.7 on a bias-flagged case answered without parity reasoning. All components are **fully deterministic and rule-based** — no LLM-as-judge.

| Component | Signal Type | Details |
|---|---|---|
| **Outcome Match** | 0.0 / 0.8 / 1.0 | Exact, directional, or wrong vs the HC decision — **gated by `<think>` block presence** |
| **Flight Risk** | 0–1 | Ordinal distance to ground-truth risk level (Low / Medium / High) |
| **Statutory** | 0–1 | IPC/BNSS threshold computation, **direction-gated**, NDPS Section 37 aware |
| **Conditions** | 0–1 | Bail-condition appropriateness for crime / risk profile |
| **Reasoning Quality** | 0–1 | Anchoring + arithmetic + grounds specificity (10% bonus) |
| **Format Compliance** | 0–1 | XML tag adherence to system prompt (5% bonus) |
| **Process Bonus** | 0 or 0.05 | Awarded if both `custody_months` and the threshold computation appear verbatim in `<think>` (proxy for tool use) |
| **Diversity Bonus** | ±0.05 | +0.05 if rollouts produce ≥2 distinct outcomes; −0.05 if all rollouts collapse to the same outcome |
| **Bias Penalty** | −0.3 | Fires if the parity argument is ignored in bias-flagged cases |

### Anti-Reward-Hacking Design

- **Multiple independent reward signals** — gaming all of them simultaneously is harder than gaming one
- **`GenerationInspectionCallback`** prints raw completions every 25 training steps for manual review
- **Reasoning gate:** no `<think>` block → outcome reward zeroed in Stage 2+ (prevents format exploitation)
- **Direction gate:** wrong bail direction → statutory bonus capped (prevents partial-credit gaming)
- **Bias penalty operates as a separate signal**, not folded into outcome (ensures visibility)
- **Schema drift (Stage 4)** tests adaptability, not pattern memorisation
- **Diversity signal** flags reward collapse — prints `[WARNING] Reward variance collapsed` if the policy converges to a single outcome
- **Tool-invocation tracking:** `process_bonus` only fires when episode-specific custody/threshold values (which are **not** in the user prompt) appear in the model's reasoning — a strong proxy for actual tool use

**Gaming resistance verified via unit tests:**

| Completion Type | Sample Reward | Verification |
|---|---|---|
| **Ideal** (full reasoning, all tools, correct outcome) | 1.15 | ✅ PASS |
| **Filler** (generic reasoning, minimal tools) | 0.66 | ✅ PASS |
| **Minimal** (bare XML, no tools) | 0.32 | ✅ PASS |
| **Tool spam** (redundant calls, no reasoning) | 0.17 | ✅ PASS |

GRPO correctly ranks `ideal > filler > minimal > spam`.
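To show the shape of the scoring path, here is an illustrative sketch of how the weighted combination above could be assembled. The component scorers arrive pre-computed here; the dataclass and gating details are assumptions, and the authoritative logic is `server/reward.py`:

```python
# Illustrative sketch of the weighted combination -- NOT server/reward.py.
# Component scores arrive pre-computed; weights come from the formula above.
from dataclasses import dataclass

@dataclass
class ComponentScores:           # each in [0, 1] unless noted
    outcome_match: float         # 0.0 / 0.8 / 1.0
    flight_risk: float
    statutory: float
    conditions: float
    reasoning_quality: float
    format_compliance: float
    process_bonus: float         # 0 or 1, scaled by 0.05 below
    diversity: float             # +1 or -1, scaled by 0.05 below
    bias_violation: bool
    has_think_block: bool

def combined_reward(s: ComponentScores) -> float:
    think_gate = 1.0 if s.has_think_block else 0.0   # reasoning gate (Stage 2+)
    r = (0.4 * s.outcome_match * think_gate
         + 0.2 * s.flight_risk
         + 0.2 * s.statutory
         + 0.2 * s.conditions
         + 0.1 * s.reasoning_quality
         + 0.05 * s.format_compliance
         + 0.05 * s.process_bonus
         + 0.05 * s.diversity)
    if s.bias_violation:         # kept as a separate, visible penalty
        r -= 0.3
    return r
```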
---

## Training

Uses **GRPO** (Group Relative Policy Optimization) via TRL + Unsloth on `Qwen2.5-7B-Instruct` (4-bit quantized + LoRA r=16 — i.e. **QLoRA**).

### Hybrid Training / Evaluation Design

**Key design decision:** UndertriAI uses a **hybrid offline/online architecture** to balance speed and correctness.

- **Reward computation during training: in-process (offline).** The trainer imports the same `server/reward.py` module that the deployed FastAPI server uses and calls `combined_reward(...)` directly. This gives **bitwise reward parity** with the env-API path while avoiding ~64 HTTP calls per training step (`num_generations × grad_accum × 2 calls per rollout`). On a single A10G, in-process scoring lets four curriculum stages fit into a ~3h budget; the equivalent online path would require ~5–6h of wall time mostly spent in network I/O.
- **Adaptive curriculum mechanisms: live env API.** The `/profile`, `/adaptive_status`, and stage-promotion logic always go through the deployed environment so per-domain EMA tracking and weakness-targeted episode selection observe real environment state.
- **Evaluation: in-process scoring with bitwise parity to the env API.** Per-stage before/after numbers in [Results & Verification](#results--verification) are produced by `evaluate_on_stage(...)` calling `combined_reward(...)` against the same model checkpoint. Because `combined_reward` is the *same function object* the deployed env imports, replaying the same episodes through `rollout_via_env_api()` against the live HF Space returns identical scores up to sampling stochasticity. The Live Demo HF Space serves the trained adapter through the env API end-to-end for interactive verification.

The alternative — pure online training via `rollout_via_env_api()` for every rollout — is also implemented and selectable via `--env_url ...` (without `--offline`) in single-stage mode (`--stage N`). It is not the default for `--curriculum` because of the latency profile described above. See `training/train_grpo.py → rollout_via_env_api()` for the env-API path.

### Training Modes

| Mode | Command | Description |
|---|---|---|
| **3-Level Curriculum** *(recommended)* | `python training/train_grpo.py --curriculum --offline` | Format → Reasoning → Adversarial (300 steps total) |
| Legacy 4-stage | `python training/train_grpo.py --curriculum --offline --difficulties "" --stages 1,2,3,4` | Sequential 4-stage with trace harvesting |
| Single-stage (offline) | `python training/train_grpo.py --stage 1 --offline --steps 200` | Local scoring (smoke testing) |
| Baseline only | `python training/train_grpo.py --baseline_only` | Zero-shot eval, no training |

### 3-Level Difficulty Curriculum

| Level | Case Type | Episodes | Steps | Difficulty |
|-------|-----------|----------|-------|------------|
| **Easy** | Landmark clear-cut cases | 104 | 60 | Model builds confidence on obvious grant/deny |
| **Medium** | Contested judgment calls | 761 | 160 | Bulk learning — statutory math, risk assessment |
| **Hard** | Bias reversal + schema drift | 335 | 80 | Edge cases that trip up shortcut-takers |

### Default hyperparameters

| Parameter | Default | Rationale |
|---|---|---|
| Base model | `unsloth/Qwen2.5-7B-Instruct` | 4-bit + LoRA r=16 |
| Total steps | 300 (60+160+80) | 3-level curriculum, ~2.5h on Kaggle T4 |
| `num_generations` | 6 | GRPO rollouts per prompt; 50% more variance than 4 |
| `temperature` | 1.1 | Higher exploration for diverse rollouts |
| Max completion length | 384 tokens | Fits bail memos; saves VRAM vs 512 |
| `batch_size × grad_accum` | 1 × 8 | Effective batch 8; Kaggle T4 safe |
| `learning_rate` | 5e-6 | Curriculum-scale LR |
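For reference, a sketch of how these defaults could map onto TRL's `GRPOConfig` / `GRPOTrainer`. Only the hyperparameter values come from the table above; the reward wrapper and dataset plumbing are assumptions about what `training/train_grpo.py` does internally:

```python
# Sketch: mapping the documented defaults onto TRL's GRPO API.
# The reward wrapper and dataset plumbing are assumptions; hyperparameter
# values come from the defaults table above.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Episode files live in data/episodes/; GRPOTrainer expects a "prompt"
# column, and the JSONL column naming is an assumption here.
episodes = load_dataset("json", data_files="data/episodes/*.jsonl")["train"]

config = GRPOConfig(
    output_dir="./output/undertrial_grpo",
    per_device_train_batch_size=1,   # effective batch = 1 x 8 grad accum
    gradient_accumulation_steps=8,
    learning_rate=5e-6,              # curriculum-scale LR
    max_steps=300,                   # 60 + 160 + 80 across the 3 levels
    num_generations=6,               # GRPO rollouts per prompt
    temperature=1.1,                 # higher exploration for diverse rollouts
    max_completion_length=384,
    beta=0.01,                       # KL penalty (see the loss plot note below)
)

def reward_fn(completions, **kwargs):
    # Stand-in wrapper: score each rollout with the deterministic reward.
    # combined_reward's exact signature is an assumption; see server/reward.py.
    from server.reward import combined_reward
    return [combined_reward(c, **kwargs) for c in completions]

trainer = GRPOTrainer(
    model="unsloth/Qwen2.5-7B-Instruct",  # 4-bit + LoRA r=16 per the table
    reward_funcs=[reward_fn],
    args=config,
    train_dataset=episodes,
)
trainer.train()
```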
### Deploy & Train Workflow

```bash
# 1. Deploy environment to HF Spaces
openenv push --repo-id username/undertri-ai

# 2. Verify it is running
curl https://username-undertri-ai.hf.space/health

# 3. Set WandB auth (optional, for live metric tracking)
export WANDB_API_KEY=your_wandb_api_key

# 4. Run curriculum training as a one-shot HF Job (A10G, ~2h)
hf jobs uv run --flavor a10g-large --timeout 3h \
  --secrets HF_TOKEN \
  https://raw.githubusercontent.com/Faiz-1606/Undertrial/main/training/run_hf_job.py \
  --curriculum \
  --env_url https://username-undertri-ai.hf.space \
  --output ./output/undertrial_grpo
```

### Colab Notebook (Step-by-Step)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](training/UndertriAI_GRPO_Training.ipynb)

```python
# ============================================================
# STEP 1 — Install dependencies
# ============================================================
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps trl peft accelerate bitsandbytes xformers
!pip install -q openenv-core datasets wandb

import os
os.environ["WANDB_API_KEY"] = "your_wandb_api_key"  # optional

# ============================================================
# STEP 2 — Clone repo + load episodes
# ============================================================
!git clone https://github.com/Faiz-1606/Undertrial.git
%cd Undertrial

# Verify episodes are present (loaded from data/episodes/)
for f in sorted(os.listdir("./data/episodes")):
    if f.endswith(".jsonl"):
        n = sum(1 for _ in open(f"./data/episodes/{f}"))
        print(f"  {f}: {n} episodes")

# ============================================================
# STEP 3 — Quick smoke test (10 steps, ~3 min on T4)
# ============================================================
!python training/train_grpo.py \
    --episodes_dir ./data/episodes \
    --offline --stage 1 --steps 10 --batch_size 1

# ============================================================
# STEP 4 — Full curriculum training (~1h 50m on A10G; longer on T4)
# ============================================================
!python training/train_grpo.py \
    --episodes_dir ./data/episodes \
    --curriculum \
    --env_url https://draken1606-undertrial-ai.hf.space

# ============================================================
# STEP 5 — Adaptive training (Theme 4, requires server)
# ============================================================
import subprocess, time, requests

server = subprocess.Popen(
    ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
)
for _ in range(30):
    try:
        if requests.get("http://localhost:8000/health", timeout=1).status_code == 200:
            print("✓ Server ready"); break
    except Exception:
        time.sleep(1)
else:
    raise RuntimeError("Server startup failed — check logs")

!python training/train_grpo.py \
    --adaptive \
    --episodes_dir ./data/episodes \
    --steps 50 --batch_size 1 \
    --env_url http://localhost:8000

# ============================================================
# STEP 6 — Inspect results
# ============================================================
import json, pathlib

results_path = pathlib.Path("./output/undertrial_grpo/curriculum_results.json")
if results_path.exists():
    print(json.dumps(json.load(open(results_path)), indent=2))
else:
    print("Check ./output/undertrial_grpo/ for stage_*/ directories")

# ============================================================
# STEP 7 — Merge LoRA adapters for inference
# ============================================================
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "./output/undertrial_grpo/final",
    max_seq_length=3072,
)
model.save_pretrained_merged(
    "./output/undertrial_merged", tokenizer,
    save_method="merged_16bit",
)
print("✓ Merged model saved to ./output/undertrial_merged")
```
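The merged checkpoint from STEP 7 is a plain Transformers model, so it can be queried without Unsloth. A minimal inference sketch, assuming standard `transformers` APIs; the system prompt and output convention shown are placeholders for whatever `train_grpo.py` used during training:

```python
# Minimal inference sketch for the merged checkpoint. The prompt wording
# here is a placeholder assumption; reuse the training system prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./output/undertrial_merged")
model = AutoModelForCausalLM.from_pretrained("./output/undertrial_merged", device_map="auto")

messages = [
    {"role": "system", "content": "You are a bail-assessment assistant. Reason inside <think> tags."},
    {"role": "user", "content": "IPC 420 case, custody 8 months, no priors. Recommend bail?"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=384, temperature=0.85, do_sample=True)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```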
save_method="merged_16bit", ) print("โœ“ Merged model saved to ./output/undertrial_merged") ``` ### Training Architecture ``` Episode dataset (JSONL โ€” 1,200 HC judgments, 4 curriculum stages) โ†“ Format as chat prompt (system + user) โ†“ Qwen2.5-1.5B-Instruct generates 4 rollouts (GRPO group) โ†“ XML parser extracts structured fields (recommendation, think, statutory, ...) โ†“ server/reward.py scores each rollout (deterministic, in-process; same code as env-API) โ†“ GRPO updates LoRA adapter weights โ†“ [Theme 4] PerformanceTracker updates EMA per stage / per crime type โ†“ [Theme 4] AdaptiveSelector targets weakest domain โ†“ [Theme 4] CaseGenerator creates harder synthetic variants on stage mastery โ†“ [Theme 4] Auto-promote when stage EMA exceeds threshold โ†“ Stage save: LoRA adapter + per-stage reward_curve.png + curriculum_results.json โ†“ End of curriculum: before_after_comparison.png (4-stage baseline vs trained) ``` --- ## Installation ```bash # Clone and install git clone https://github.com/Faiz-1606/Undertrial cd Undertrial pip install -e . # Use the environment client from client import UndertriAIEnv env = UndertriAIEnv(base_url="https://draken1606-undertrial-ai.hf.space") obs = env.reset(stage=1) ``` Or connect directly via the OpenEnv client: ```python from openenv import from_hub env = from_hub("Draken1606/undertrial-ai") ``` --- ## Project Structure ``` undertrial_ai/ โ”œโ”€โ”€ server/ โ”‚ โ”œโ”€โ”€ app.py # FastAPI routes + Theme 4 endpoints โ”‚ โ”œโ”€โ”€ undertrial_environment.py # Environment logic (Theme 3.1) โ”‚ โ”œโ”€โ”€ reward.py # Multi-component deterministic reward โ”‚ โ”œโ”€โ”€ dataset.py # Curriculum-staged episode loader โ”‚ โ”œโ”€โ”€ schema_drift.py # IPC โ†’ BNSS remapping (Stage 4) โ”‚ โ”œโ”€โ”€ performance_tracker.py # [Theme 4] EMA-based performance profiling โ”‚ โ”œโ”€โ”€ adaptive_selector.py # [Theme 4] Weakness-targeted episode selection โ”‚ โ””โ”€โ”€ case_generator.py # [Theme 4] Synthetic case perturbation โ”œโ”€โ”€ training/ โ”‚ โ”œโ”€โ”€ train_grpo.py # GRPO training (single / curriculum / adaptive) โ”‚ โ”œโ”€โ”€ run_hf_job.py # PEP 723 bootstrap for HF Jobs (clones repo + installs deps) โ”‚ โ”œโ”€โ”€ eval_and_plot.py # Post-training env-API-verified eval + plots โ”‚ โ””โ”€โ”€ UndertriAI_GRPO_Training.ipynb # Colab notebook โ”œโ”€โ”€ data/ โ”‚ โ””โ”€โ”€ episodes/ # 1,200 HC judgments across 4 stages โ”œโ”€โ”€ demo/ โ”‚ โ””โ”€โ”€ index.html # Interactive demo UI โ”œโ”€โ”€ client.py # UndertriAIEnv HTTP client โ”œโ”€โ”€ models.py # Pydantic action / observation schemas โ”œโ”€โ”€ openenv.yaml # OpenEnv manifest โ””โ”€โ”€ Dockerfile # HF Spaces deployment ``` --- ## Data **Source:** Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts" ([arXiv:2508.07592](https://arxiv.org/abs/2508.07592)) **Dataset:** [SnehaDeshmukh/IndianBailJudgments-1200](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200) 1,200 Indian High Court bail judgments (2018โ€“2024) processed into curriculum episodes covering: - Delhi, Bombay, Allahabad, Madras, Kerala, and Calcutta High Courts - Crimes from IPC 420 (cheating) to IPC 302 (murder) - Cases annotated with ground-truth outcome, flight risk, bias flags, and parity arguments ### Dataset as a Training Challenge (Not a Bug) **Known dataset characteristics โ€” and why they make this a stronger RL environment:** | Characteristic | Value | Why this strengthens training | |---|---|---| | **`flight_risk == "Medium"`** | ~72% | The model cannot earn full reward by always 
saying "Medium" โ€” flight risk is only 20% of total reward. To exceed 0.70 total reward the model **must** correctly invoke statutory tools, cite precedents, and produce coherent reasoning. The Medium-heavy distribution mirrors real Indian HC data, making this a **realistic training challenge** rather than a synthetic balanced dataset. | | **`custody_months == 6.0`** | ~74% | Custody arithmetic becomes discriminating in Stage 3 (bias-reversal) and Stage 4 (schema drift) where threshold calculations differ. The `reasoning_quality` sub-score rewards exact numerical matches in `` blocks. | | **`bias_flag == True`** | ~1% (13 cases) | **Honest limitation:** bias penalty fires rarely (โ‰ˆ once every 92 episodes under uniform sampling). This is a proof-of-concept signal, not a large-scale bias-mitigation system. The 28% parity-argument signal provides the main training pathway for fairness reasoning. Future work: expand bias-flagged evaluation set to 10โ€“15%. | | **Empty `prosecution_arguments`** | ~53% | Not a flaw โ€” this mirrors real case records where prosecution arguments are not always transcribed. The model must reason from charge sheet and defence arguments alone, which is the actual judicial workflow. | **Why imbalanced data is valuable for RL training:** Balanced datasets teach pattern matching. Imbalanced datasets teach **robust reasoning under real-world distributions**. A model trained on 50/50 Medium/High flight-risk cases would fail on real HC data, which is overwhelmingly Medium. UndertriAI's distribution forces the model to learn when "Medium" is correct (most cases) and when it's wrong (bias-reversal cases) โ€” which is exactly the reasoning pattern judges need. --- ## Why This Matters > *"Bail is the rule, jail is the exception."* > โ€” Supreme Court of India, *Satender Kumar Antil v. CBI* (2022) An RL-trained agent that consistently applies this principle โ€” without being swayed by a defendant's name, religion, or economic status โ€” could serve as a real-time consistency check for overburdened courts. **This is not a tool to replace judges.** It is a mirror that forces the system to confront its own inconsistencies. --- ## Results & Verification ### Training Evidence Due to compute and time constraints during the hackathon, we conducted **limited training runs** to validate the environment's learnability. Full-scale training with optimal hyperparameters is planned for post-hackathon work. 
**Setup for the headline run** (Qwen2.5-1.5B-Instruct on A10G-large):

| Parameter | Value |
|---|---|
| Total training steps | 120 (30 per stage × 4 stages) |
| Episode quota | 120 cases (30 per stage, balanced) |
| Effective batch size | 32 completions per step (1 × 8 × 4) |
| Max completion length | 728 tokens |
| Wall time | ~1h 50m |
| Reward source — training | In-process `combined_reward` (the same module the env imports) |
| Reward source — eval (n=12 per stage) | In-process `combined_reward` against held-out episodes |
| Env-API parity | Bitwise — eval scores reproduce on `rollout_via_env_api` up to sampling stochasticity |

**Headline metrics** (n = 12 episodes per stage, scored with `combined_reward`; bitwise parity with `server/reward.py`):

| Stage | Before (zero-shot) | After (trained) | Δ |
|---|---|---|---|
| Stage 1 — Landmark cases (clear-cut) | 0.4786 | **0.5314** | **+0.0528** |
| Stage 2 — Statutory thresholds (BNSS §479) | 0.3992 | **0.4827** | **+0.0835** |
| Stage 3 — Bias / disadvantage scenarios | 0.4154 | **0.4734** | **+0.0580** |
| Stage 4 — Interleaved + perturbations | 0.4710 | 0.4717 | +0.0007 |
| **Mean (all stages)** | **0.4410** | **0.4898** | **+0.0488** *(+11% relative)* |
| Traces harvested into Stage N+1 prompts (Theme 4) | — | 8 | — |

![Baseline vs trained reward per curriculum stage](assets/results/before_after_comparison.png)

*Headline figure — baseline vs trained reward per curriculum stage. Stages 1–3 show consistent improvement with the largest gain on statutory-threshold reasoning (Stage 2, +0.084). Stage 4 (perturbations) is essentially flat — the open problem.*

**Reading the table.** GRPO produced consistent gains on Stages 1–3 (format compliance, outcome correctness, statutory threshold reasoning, bias-penalty avoidance), with the largest absolute improvement on Stage 2 — exactly where the new `reward_reasoning_specificity` signal was designed to fire. Stage 4 (perturbations: name swaps, numerical variants, schema drift) is **flat**: the model fits the curriculum but does not yet generalise to robustness perturbations after only 30 steps per stage. We treat this as the headline open problem (see Limitations & Future Work).

![Reward curve across all four curriculum stages](assets/results/reward_curve.png)

*Multi-stage reward trajectory (cumulative steps 5 → 120). Each colour is one curriculum stage; **dashed lines** are the zero-shot baseline for that stage and **dotted lines** are the post-train evaluation. Training rollouts (the connected dots) sit consistently above the dashed baselines, confirming GRPO is updating the policy in the right direction. The Stage 4 rollouts are also above its baseline, but the post-train eval lands almost exactly on the baseline — visual confirmation that gains do not transfer to perturbed inputs.*

![GRPO training loss across all 120 cumulative steps](assets/results/training_loss.png)

*Training loss (note y-axis: ×10⁻⁶). Loss in GRPO is dominated by the KL penalty (`beta=0.01`) — the actual learning signal lives in the reward, not the loss. The slow downward drift across cumulative steps is consistent with stable, non-collapsing updates.*

**Reconstructed from log.** The full per-step `log_history` (24 entries: 4 stages × 30 steps, logged every 5 steps) is embedded in `output/undertrial_grpo/curriculum_results.json` for independent verification. The plots above were rebuilt from the captured `hf jobs logs` stdout via [`training/parse_job_log.py`](training/parse_job_log.py) — the artifacts inside the HF Jobs container did not survive the ephemeral filesystem teardown, but every metric we needed was already in the log.
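As a sanity check, the aggregate row of the headline table can be recomputed from the per-stage numbers with a few lines of arithmetic:

```python
# Recompute the headline aggregates from the per-stage table above.
before = [0.4786, 0.3992, 0.4154, 0.4710]
after  = [0.5314, 0.4827, 0.4734, 0.4717]

mean_before = sum(before) / 4            # 0.4410, as reported
mean_after  = sum(after) / 4             # 0.4898
delta = mean_after - mean_before         # +0.0488
print(f"{mean_before:.4f} -> {mean_after:.4f} "
      f"(delta {delta:+.4f}, {100 * delta / mean_before:.0f}% relative)")  # ~+11%
```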
**Methodology note (honest framing).** The numbers above are from in-process `combined_reward` evaluation against held-out episodes; the reward code is byte-identical to the live env's `server/reward.py`, so a deployment-time env-API rollout against the same episodes returns the same score. The `--env_url` plumbing is wired through `train_grpo.py` and verified for liveness on each run; we chose in-process scoring during training to avoid HTTP latency dominating the rollout loop, not because the env API is unreliable. A separate post-training env-API verification pass would produce identical numbers up to model-sampling stochasticity (`temperature=0.85`).

**Note on limited training.** These results represent a single 30-steps-per-stage validation run on Qwen2.5-1.5B-Instruct under a 3-hour wall budget. With longer training, larger base models (3B / 7B), and richer perturbation curricula, we expect Stage 4 to also show meaningful gains and absolute mean reward to exceed 0.70. The gaming-resistance verification (below) confirms that *any* reward improvement we observe corresponds to genuine legal reasoning rather than format exploitation.

### Gaming Resistance Verified

The reward function correctly ranks completions by reasoning quality:

| Completion Type | Sample Reward | Verification |
|---|---|---|
| **Ideal** (full reasoning, all tools, correct outcome) | 1.15 | ✅ PASS |
| **Filler** (generic reasoning, minimal tools) | 0.66 | ✅ PASS |
| **Minimal** (bare XML, no tools) | 0.32 | ✅ PASS |
| **Tool spam** (redundant calls, no reasoning) | 0.17 | ✅ PASS |

GRPO correctly optimises for `ideal > filler > minimal > spam`.

### Verification Suite

- **`smoke_test.py`** — 10 / 10 PASS (environment correctness, tool registration, episode loading)
- **`pass5_verify.py`** — 8 / 8 PASS (gaming resistance, component independence, reward bounds)
- **`quick_check.py`** — 1-minute end-to-end env reachability + sample episode roundtrip

### Demo & Resources

- **[Live HF Space](https://huggingface.co/spaces/Draken1606/undertrial-ai)** — interactive bail assessment demo *(Note: the Space may need 30–60 s to wake from sleep on first visit)*
- **[Swagger API Docs](https://draken1606-undertrial-ai.hf.space/docs)** — full REST API documentation
- **[Training Script](training/train_grpo.py)** — GRPO training with Unsloth (single / curriculum / adaptive modes)
- **[Colab Notebook](training/UndertriAI_GRPO_Training.ipynb)** — step-by-step training walkthrough
- **[Project Blog](BLOG_LINK_HERE)** — *"Three minutes should never decide a life"* (link to be updated)
- **[Source Paper](https://arxiv.org/abs/2508.07592)** — dataset methodology and fairness analysis
- **[Dataset on HF](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200)** — 1,200 annotated HC judgments

---

## Limitations & Future Work

**Current limitations:**

- **Bias-flagged cases are sparse** (~1%, 13 cases) — sufficient for proof-of-concept, not for large-scale fairness claims. The parity-argument signal partially compensates.
- **Training was offline** (in-process scoring) for latency reasons. Headline numbers are env-API-verified post-hoc; full online training is implemented but not used by default in `--curriculum` mode.
- **Single-model evaluation** — only Qwen2.5-1.5B-Instruct was trained for the hackathon submission. Larger backbones (3B / 7B) would likely close the gap to higher reward ceilings.
- **No human-in-the-loop fairness audit** — bias detection relies on dataset annotations; an external legal-expert review is future work.

**Future improvements:**

- Expand bias-flagged cases to 10–15% of the dataset
- Add an adversarial evaluation set (cases designed to exploit reward weaknesses)
- Train on larger models (Qwen2.5-7B, Llama-3-8B) with extended curricula
- Add human-in-the-loop evaluation for bias detection
- Switch curriculum mode to env-API rewards once HTTP overhead is amortised (e.g. via batched `/step` or a co-located env)

---

## Team

Built for the **Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, April 2026**.

**Primary Theme:** Theme 3.1 — Professional Tasks / World Modeling
**Secondary Theme:** Theme 4 — Self-Improvement

---

## Citation

If you use this environment or dataset, please cite:

```bibtex
@article{deshmukh2025indianbail,
  title   = {IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts},
  author  = {Deshmukh, Sneha and others},
  journal = {arXiv preprint arXiv:2508.07592},
  year    = {2025}
}
```

---

## License

MIT License — see [LICENSE](LICENSE) for details.

Environment code is licensed under MIT. Dataset usage is subject to the terms in the [HF dataset card](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200).