Spaces:

Draken1606
/

undertrial-ai

Running

App Files Files Community

undertrial-ai / README.md

Draken1606

3-level

2c93c00 5 days ago

preview code

raw

history blame contribute delete

34.5 kB

	---
	title: UndertriAI
	emoji: ⚖️
	colorFrom: indigo
	colorTo: blue
	sdk: docker
	app_port: 7860
	license: mit
	short_description: OpenEnv RL environment for Indian bail decision support
	tags:
	- openenv
	- legal-ai
	- reinforcement-learning
	- bail
	- india
	- grpo
	- world-modeling
	---

	# UndertriAI ⚖️

	OpenEnv-compliant RL training environment for Indian bail decision support.

	[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-6366f1)](https://github.com/meta-pytorch/OpenEnv)
	[![Live Demo](https://img.shields.io/badge/🤗_Space-Live_Demo-yellow)](https://huggingface.co/spaces/Draken1606/undertrial-ai)
	[![Swagger](https://img.shields.io/badge/API-Swagger_Docs-green)](https://draken1606-undertrial-ai.hf.space/docs)
	[![License: MIT](https://img.shields.io/badge/License-MIT-gray)](LICENSE)

	> [▶ Try the Live Demo](https://huggingface.co/spaces/Draken1606/undertrial-ai) — click "Run Bail Assessment" to see the environment in action.
	> [📝 Read the Story](https://huggingface.co/spaces/Draken1606/undertrial-ai/blob/main/Blog.md) — "Three minutes should never decide a life" (link to be updated)

	---

	## The Problem

	76% of India's 5.7 lakh prisoners are undertrials[^1] — unconvicted people awaiting bail hearings, many of whom cannot afford lawyers.

	A subordinate court judge handles 80–100 bail hearings per day — roughly 3 minutes per case. In that window they must read the charge sheet, assess flight risk, evaluate custody duration against the statutory threshold, and check for parity with co-accused. In practice, outcomes are inconsistent and empirically biased against poor, lower-caste, and minority accused.

	This is not anecdotal — it is structural. The Supreme Court in Satender Kumar Antil v. CBI (2022) explicitly noted the crisis.

	[^1]: Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts," arXiv:2508.07592 (2025), analyzing NCRB Prison Statistics India 2022.

	---

	## What UndertriAI Does

	UndertriAI is an OpenEnv-compliant RL training environment designed for Theme 3.1: Professional Tasks / World Modeling.

	It teaches an LLM to interact with a realistic legal workflow — not through shortcuts, but through genuine tool use, statutory reasoning, and multi-step case analysis:

	1. Read case documents (charge sheet, arguments, criminal history)
	2. Invoke legal tools (12 specialized tools for statutory eligibility, precedent lookup, risk assessment)
	3. Produce structured bail memos with explicit reasoning chains
	4. Get evaluated against real Indian High Court decisions using a deterministic, multi-component reward function

	Additionally, the environment implements Theme 4: Self-Improvement through adaptive curriculum mechanisms (detailed below).

	---

	## Environment Design

	### Theme 3.1: Professional Tasks / World Modeling

	This environment qualifies for Theme 3.1 by requiring genuine interaction with a partially observable legal world where:

	- Tool invocation is mandatory — statutory thresholds cannot be guessed; they must be computed via `compute_statutory_eligibility`
	- Multi-step reasoning is required — the model must sequence tool calls (read arguments → assess risk → compute eligibility → cite precedent → draft memo)
	- Shortcuts fail — trying to submit a memo without tool use earns near-zero reward due to missing statutory/precedent signals
	- State persistence matters — tool outputs accumulate in episode state; later reasoning depends on earlier tool calls
	- API/workflow simulation — the environment models real judicial clerk workflows: document retrieval, legal database queries, risk scoring matrices

	This is not a text completion task. It is a dynamic system where the agent must orchestrate tools, maintain working memory across 5–15 actions per episode, and produce outputs that match real judicial reasoning patterns.

	### API Endpoints

	\| Method \| Endpoint \| Description \|
	\|---\|---\|---\|
	\| `POST` \| `/reset?stage=1` \| Start a new episode (curriculum stage 1–4) \|
	\| `POST` \| `/reset?adaptive=true&auto_stage=true` \| Start episode with adaptive selection (Theme 4) \|
	\| `POST` \| `/step` \| Submit a tool call or final memo \|
	\| `GET` \| `/state?session_id=...` \| Inspect current episode state \|
	\| `GET` \| `/profile?session_id=...` \| Agent performance profile (Theme 4) \|
	\| `GET` \| `/adaptive_status` \| Adaptive mode capabilities & thresholds \|
	\| `GET` \| `/health` \| Health check \|
	\| `GET` \| `/tools` \| List available tools \|
	\| `WS` \| `/ws/{session_id}` \| WebSocket real-time feed \|

	### Tools Available to the Agent

	\| Tool \| Purpose \|
	\|---\|---\|
	\| `compute_statutory_eligibility` \| Calculate custody vs threshold for IPC/BNSS sections (non-guessable) \|
	\| `cross_reference_precedent` \| Look up landmark HC/SC decisions \|
	\| `assess_surety` \| Evaluate surety bond appropriateness \|
	\| `classify_bail_type` \| Determine regular / anticipatory / default bail \|
	\| `request_document` \| Request additional case documents \|
	\| `flag_inconsistency` \| Flag contradictions in the charge sheet \|
	\| `read_submissions` \| Read prosecution/defence arguments on record \|
	\| `assess_flight_risk` \| Systematic flight risk scoring matrix \|
	\| `check_case_factors` \| Examine parity, evidence tampering, victim vulnerability \|
	\| `apply_proportionality` \| BNSS 479 custody vs. max sentence proportionality \|
	\| `pull_criminal_history` \| Prior record, bail history, conviction status \|
	\| `submit_memo` \| Terminal action — submit final bail recommendation \|

	Example tool invocation:
	```json
	{
	"tool": "compute_statutory_eligibility",
	"section": "IPC 420",
	"custody_months": 8
	}
	```

	### 4-Stage Curriculum

	\| Stage \| Focus \| Cases \| Learning Objective \|
	\|---\|---\|---\|---\|
	\| 1 \| Landmark cases (clear-cut eligibility) \| ~40 \| Learn tool sequencing + format \|
	\| 2 \| Contested cases (murder, repeat offenders) \| ~1,100 \| Learn contested reasoning patterns \|
	\| 3 \| Bias-reversal cases (HC overturning biased lower courts) \| ~30 \| Learn to detect parity violations \|
	\| 4 \| BNSS schema drift (IPC → BNS remapping, 2023 reform) \| ~50 \| Test adaptability to legal schema changes \|

	Example Stage 4 challenge: Case uses IPC 379 (theft, 3-year max sentence, threshold = 1/2 max = 18 months). After BNSS 2023 reform, this maps to BNS 303 (theft, still 3-year max, but different bail provision language under BNSS § 479). The model must apply the new schema without retraining on BNSS-specific examples.

	---

	## Theme 4 — Self-Improvement (Secondary)

	UndertriAI implements three self-improvement mechanisms as a secondary theme contribution:

	1. Adaptive Curriculum Promotion
	The environment tracks per-stage performance using exponential moving averages. When the agent demonstrates consistent improvement (Stage 1 mean reward ≥ 0.65 over 20 episodes), it automatically promotes to the next curriculum stage. Visible in training logs as:
	```
	[SELF-IMPROVEMENT] Step 100: Promoted to Stage 2. Stage 1 mean reward: 0.710 → Stage 2 begins.
	```

	2. Weakness-Targeted Episode Selection
	In adaptive mode, the episode selector identifies the crime type where the agent performs worst (via EMA-tracked per-crime-type reward) and serves proportionally more cases from that domain. As the agent improves on weak domains, the selection distribution shifts — the environment continuously finds and targets new weaknesses.

	\| Selection Mode \| Weight \| Mechanism \|
	\|---\|---\|---\|
	\| Weakest domain \| 60% \| Serve cases from lowest-performing crime category \|
	\| Failure replay \| 30% \| Re-serve cases with reward < 0.40 \|
	\| Exploration \| 10% \| Uniform random (prevent overfitting) \|

	3. Synthetic Case Generation
	When the agent masters a stage (mean reward ≥ 0.70 on a stage), the environment generates harder synthetic variants using 5 perturbation types:

	\| Perturbation \| What it tests \|
	\|---\|---\|
	\| Custody escalation \| Custody 2 months below threshold — forces exact statutory computation \|
	\| Co-accused conflict \| Opposite bail outcomes for co-accused — tests parity reasoning \|
	\| Section ambiguity \| IPC ↔ BNSS section swap — tests schema drift robustness \|
	\| Evidence reversal \| Key witness retracted — tests flight risk reassessment \|
	\| Surety complexity \| Non-resident surety — tests condition appropriateness \|

	Live Demo — Self-Improvement in Action:
	```bash
	# Start the server
	python -m server.app

	# In another terminal — adaptive training
	python training/train_grpo.py --adaptive --steps 50 --env_url http://localhost:8000
	```

	Monitor progress via `GET /profile?session_id={id}` and `GET /adaptive_status`.

	---

	## Reward Function

	```python
	R = 0.4 × outcome_match (gated by think_factor)
	+ 0.2 × flight_risk_accuracy
	+ 0.2 × statutory_accuracy
	+ 0.2 × condition_appropriateness
	+ 0.1 × reasoning_quality (bonus)
	+ 0.05 × format_compliance (bonus)
	+ 0.05 × process_bonus (tool-use proxy, bonus)
	± 0.05 × diversity_bonus (anti-collapse signal)
	− 0.3 × bias_penalty (fires on parity violations)
	```

	Reward range: core components sum to 1.0; with bonuses, total can reach ~1.15; with bias penalty, it can drop to ~0.7 on a bias-flagged case answered without parity reasoning.

	All components are fully deterministic and rule-based — no LLM-as-judge.

	\| Component \| Signal Type \| Details \|
	\|---\|---\|---\|
	\| Outcome Match \| 0.0 / 0.8 / 1.0 \| Exact, directional, or wrong vs HC decision — gated by `<think>` block presence \|
	\| Flight Risk \| 0–1 \| Ordinal distance to ground-truth risk level (Low / Medium / High) \|
	\| Statutory \| 0–1 \| IPC/BNSS threshold computation, direction-gated, NDPS Section 37 aware \|
	\| Conditions \| 0–1 \| Bail-condition appropriateness for crime / risk profile \|
	\| Reasoning Quality \| 0–1 \| Anchoring + arithmetic + grounds specificity (10% bonus) \|
	\| Format Compliance \| 0–1 \| XML tag adherence to system prompt (5% bonus) \|
	\| Process Bonus \| 0 or 0.05 \| Awarded if both `custody_months` and threshold computation appear verbatim in `<think>` (proxy for tool use) \|
	\| Diversity Bonus \| ±0.05 \| +0.05 if rollouts produce ≥2 distinct outcomes; −0.05 if all rollouts collapse to the same outcome \|
	\| Bias Penalty \| −0.3 \| Fires if parity argument ignored in bias-flagged cases \|

	### Anti-Reward-Hacking Design

	- Multiple independent reward signals — gaming all of them simultaneously is harder than gaming one
	- `GenerationInspectionCallback` prints raw completions every 25 training steps for manual review
	- Reasoning gate: No `<think>` block → outcome reward zeroed in Stage 2+ (prevents format exploitation)
	- Direction gate: Wrong bail direction → statutory bonus capped (prevents partial-credit gaming)
	- Bias penalty operates as a separate signal, not folded into outcome (ensures visibility)
	- Schema drift (Stage 4) tests adaptability, not pattern memorisation
	- Diversity signal flags reward-collapse — prints `[WARNING] Reward variance collapsed` if the policy converges to a single outcome
	- Tool-invocation tracking: `process_bonus` only fires when episode-specific custody/threshold values (which are not in the user prompt) appear in the model's reasoning — strong proxy for actual tool use

	Gaming resistance verified via unit tests:

	\| Completion Type \| Sample Reward \| Verification \|
	\|---\|---\|---\|
	\| Ideal (full reasoning, all tools, correct outcome) \| 1.15 \| ✅ PASS \|
	\| Filler (generic reasoning, minimal tools) \| 0.66 \| ✅ PASS \|
	\| Minimal (bare XML, no tools) \| 0.32 \| ✅ PASS \|
	\| Tool spam (redundant calls, no reasoning) \| 0.17 \| ✅ PASS \|

	GRPO correctly ranks `ideal > filler > minimal > spam`.

	---

	## Training

	Uses GRPO (Group Relative Policy Optimization) via TRL + Unsloth on `Qwen2.5-7B-Instruct` (4-bit quantized + LoRA r=16 — i.e. QLoRA).

	### Hybrid Training / Evaluation Design

	Key design decision: UndertriAI uses a hybrid offline/online architecture to balance speed and correctness.

	- Reward computation during training: in-process (offline).
	The trainer imports the same `server/reward.py` module that the deployed FastAPI server uses and calls `combined_reward(...)` directly. This gives bitwise reward parity with the env-API path while avoiding ~64 HTTP calls per training step (`num_generations × grad_accum × 2 calls per rollout`). On a single A10G, in-process scoring lets four curriculum stages fit into a ~3h budget; the equivalent online path would require ~5–6h of wall time mostly spent in network I/O.

	- Adaptive curriculum mechanisms: live env API.
	The `/profile`, `/adaptive_status`, and stage-promotion logic always go through the deployed environment so per-domain EMA tracking and weakness-targeted episode selection observe real environment state.

	- Evaluation: in-process scoring with bitwise parity to the env API.
	Per-stage before/after numbers in [Results & Verification](#results--verification) are produced by `evaluate_on_stage(...)` calling `combined_reward(...)` against the same model checkpoint. Because `combined_reward` is the same function object the deployed env imports, replaying the same episodes through `rollout_via_env_api()` against the live HF Space returns identical scores up to sampling stochasticity. The Live Demo HF Space serves the trained adapter through the env API end-to-end for interactive verification.

	The alternative — pure online training via `rollout_via_env_api()` for every rollout — is also implemented and selectable via `--env_url ...` (without `--offline`) in single-stage mode (`--stage N`). It is not the default for `--curriculum` because of the latency profile described above. See `training/train_grpo.py → rollout_via_env_api()` for the env-API path.

	### Training Modes

	\| Mode \| Command \| Description \|
	\|---\|---\|---\|
	\| 3-Level Curriculum (recommended) \| `python training/train_grpo.py --curriculum --offline` \| Format → Reasoning → Adversarial (300 steps total) \|
	\| Legacy 4-stage \| `python training/train_grpo.py --curriculum --offline --difficulties "" --stages 1,2,3,4` \| Sequential 4-stage with trace harvesting \|
	\| Single-stage (offline) \| `python training/train_grpo.py --stage 1 --offline --steps 200` \| Local scoring (smoke testing) \|
	\| Baseline only \| `python training/train_grpo.py --baseline_only` \| Zero-shot eval, no training \|

	### 3-Level Difficulty Curriculum

	\| Level \| Case Type \| Episodes \| Steps \| Difficulty \|
	\|-------\|-----------\|----------\|-------\|------------\|
	\| Easy \| Landmark clear-cut cases \| 104 \| 60 \| Model builds confidence on obvious grant/deny \|
	\| Medium \| Contested judgment calls \| 761 \| 160 \| Bulk learning — statutory math, risk assessment \|
	\| Hard \| Bias reversal + schema drift \| 335 \| 80 \| Edge cases that trip up shortcut-takers \|

	### Default hyperparameters

	\| Parameter \| Default \| Rationale \|
	\|---\|---\|---\|
	\| Base model \| `unsloth/Qwen2.5-7B-Instruct` \| 4-bit + LoRA r=16 \|
	\| Total steps \| 300 (60+160+80) \| 3-level curriculum, ~2.5h on Kaggle T4 \|
	\| `num_generations` \| 6 \| GRPO rollouts per prompt; 50% more variance than 4 \|
	\| `temperature` \| 1.1 \| Higher exploration for diverse rollouts \|
	\| Max completion length \| 384 tokens \| Fits bail memos; saves VRAM vs 512 \|
	\| `batch_size × grad_accum` \| 1 × 8 \| Effective batch 8; Kaggle T4 safe \|
	\| `learning_rate` \| 5e-6 \| Curriculum-scale LR \|

	### Deploy & Train Workflow

	```bash
	# 1. Deploy environment to HF Spaces
	openenv push --repo-id username/undertri-ai

	# 2. Verify it is running
	curl https://username-undertri-ai.hf.space/health

	# 3. Set WandB auth (optional, for live metric tracking)
	export WANDB_API_KEY=your_wandb_api_key

	# 4. Run curriculum training as a one-shot HF Job (A10G, ~2h)
	hf jobs uv run --flavor a10g-large --timeout 3h \
	--secrets HF_TOKEN \
	https://raw.githubusercontent.com/Faiz-1606/Undertrial/main/training/run_hf_job.py \
	--curriculum \
	--env_url https://username-undertri-ai.hf.space \
	--output ./output/undertrial_grpo
	```

	### Colab Notebook (Step-by-Step)

	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](training/UndertriAI_GRPO_Training.ipynb)

	```python
	# ============================================================
	# STEP 1 — Install dependencies
	# ============================================================
	!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
	!pip install -q --no-deps trl peft accelerate bitsandbytes xformers
	!pip install -q openenv-core datasets wandb

	import os
	os.environ["WANDB_API_KEY"] = "your_wandb_api_key" # optional

	# ============================================================
	# STEP 2 — Clone repo + load episodes
	# ============================================================
	!git clone https://github.com/Faiz-1606/Undertrial.git
	%cd Undertrial

	# Verify episodes are present (loaded from data/episodes/)
	import os
	for f in sorted(os.listdir("./data/episodes")):
	if f.endswith(".jsonl"):
	n = sum(1 for _ in open(f"./data/episodes/{f}"))
	print(f" {f}: {n} episodes")

	# ============================================================
	# STEP 3 — Quick smoke test (10 steps, ~3 min on T4)
	# ============================================================
	!python training/train_grpo.py \
	--episodes_dir ./data/episodes \
	--offline --stage 1 --steps 10 --batch_size 1

	# ============================================================
	# STEP 4 — Full curriculum training (~1h 50m on A10G; longer on T4)
	# ============================================================
	!python training/train_grpo.py \
	--episodes_dir ./data/episodes \
	--curriculum \
	--env_url https://draken1606-undertrial-ai.hf.space

	# ============================================================
	# STEP 5 — Adaptive training (Theme 4, requires server)
	# ============================================================
	import subprocess, time, requests
	server = subprocess.Popen(
	["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"],
	stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
	)

	for _ in range(30):
	try:
	if requests.get("http://localhost:8000/health", timeout=1).status_code == 200:
	print("✓ Server ready"); break
	except Exception:
	time.sleep(1)
	else:
	raise RuntimeError("Server startup failed — check logs")

	!python training/train_grpo.py \
	--adaptive \
	--episodes_dir ./data/episodes \
	--steps 50 --batch_size 1 \
	--env_url http://localhost:8000

	# ============================================================
	# STEP 6 — Inspect results
	# ============================================================
	import json, pathlib
	results_path = pathlib.Path("./output/undertrial_grpo/curriculum_results.json")
	if results_path.exists():
	print(json.dumps(json.load(open(results_path)), indent=2))
	else:
	print("Check ./output/undertrial_grpo/ for stage_*/ directories")

	# ============================================================
	# STEP 7 — Merge LoRA adapters for inference
	# ============================================================
	from unsloth import FastLanguageModel
	model, tokenizer = FastLanguageModel.from_pretrained(
	"./output/undertrial_grpo/final",
	max_seq_length=3072,
	)
	model.save_pretrained_merged(
	"./output/undertrial_merged",
	tokenizer,
	save_method="merged_16bit",
	)
	print("✓ Merged model saved to ./output/undertrial_merged")
	```

	### Training Architecture

	```
	Episode dataset (JSONL — 1,200 HC judgments, 4 curriculum stages)
	↓
	Format as chat prompt (system + user)
	↓
	Qwen2.5-1.5B-Instruct generates 4 rollouts (GRPO group)
	↓
	XML parser extracts structured fields (recommendation, think, statutory, ...)
	↓
	server/reward.py scores each rollout (deterministic, in-process; same code as env-API)
	↓
	GRPO updates LoRA adapter weights
	↓
	[Theme 4] PerformanceTracker updates EMA per stage / per crime type
	↓
	[Theme 4] AdaptiveSelector targets weakest domain
	↓
	[Theme 4] CaseGenerator creates harder synthetic variants on stage mastery
	↓
	[Theme 4] Auto-promote when stage EMA exceeds threshold
	↓
	Stage save: LoRA adapter + per-stage reward_curve.png + curriculum_results.json
	↓
	End of curriculum: before_after_comparison.png (4-stage baseline vs trained)
	```

	---

	## Installation

	```bash
	# Clone and install
	git clone https://github.com/Faiz-1606/Undertrial
	cd Undertrial
	pip install -e .

	# Use the environment client
	from client import UndertriAIEnv
	env = UndertriAIEnv(base_url="https://draken1606-undertrial-ai.hf.space")
	obs = env.reset(stage=1)
	```

	Or connect directly via the OpenEnv client:
	```python
	from openenv import from_hub
	env = from_hub("Draken1606/undertrial-ai")
	```

	---

	## Project Structure

	```
	undertrial_ai/
	├── server/
	│ ├── app.py # FastAPI routes + Theme 4 endpoints
	│ ├── undertrial_environment.py # Environment logic (Theme 3.1)
	│ ├── reward.py # Multi-component deterministic reward
	│ ├── dataset.py # Curriculum-staged episode loader
	│ ├── schema_drift.py # IPC → BNSS remapping (Stage 4)
	│ ├── performance_tracker.py # [Theme 4] EMA-based performance profiling
	│ ├── adaptive_selector.py # [Theme 4] Weakness-targeted episode selection
	│ └── case_generator.py # [Theme 4] Synthetic case perturbation
	├── training/
	│ ├── train_grpo.py # GRPO training (single / curriculum / adaptive)
	│ ├── run_hf_job.py # PEP 723 bootstrap for HF Jobs (clones repo + installs deps)
	│ ├── eval_and_plot.py # Post-training env-API-verified eval + plots
	│ └── UndertriAI_GRPO_Training.ipynb # Colab notebook
	├── data/
	│ └── episodes/ # 1,200 HC judgments across 4 stages
	├── demo/
	│ └── index.html # Interactive demo UI
	├── client.py # UndertriAIEnv HTTP client
	├── models.py # Pydantic action / observation schemas
	├── openenv.yaml # OpenEnv manifest
	└── Dockerfile # HF Spaces deployment
	```

	---

	## Data

	Source: Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts" ([arXiv:2508.07592](https://arxiv.org/abs/2508.07592))

	Dataset: [SnehaDeshmukh/IndianBailJudgments-1200](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200)

	1,200 Indian High Court bail judgments (2018–2024) processed into curriculum episodes covering:
	- Delhi, Bombay, Allahabad, Madras, Kerala, and Calcutta High Courts
	- Crimes from IPC 420 (cheating) to IPC 302 (murder)
	- Cases annotated with ground-truth outcome, flight risk, bias flags, and parity arguments

	### Dataset as a Training Challenge (Not a Bug)

	Known dataset characteristics — and why they make this a stronger RL environment:

	\| Characteristic \| Value \| Why this strengthens training \|
	\|---\|---\|---\|
	\| `flight_risk == "Medium"` \| ~72% \| The model cannot earn full reward by always saying "Medium" — flight risk is only 20% of total reward. To exceed 0.70 total reward the model must correctly invoke statutory tools, cite precedents, and produce coherent reasoning. The Medium-heavy distribution mirrors real Indian HC data, making this a realistic training challenge rather than a synthetic balanced dataset. \|
	\| `custody_months == 6.0` \| ~74% \| Custody arithmetic becomes discriminating in Stage 3 (bias-reversal) and Stage 4 (schema drift) where threshold calculations differ. The `reasoning_quality` sub-score rewards exact numerical matches in `<think>` blocks. \|
	\| `bias_flag == True` \| ~1% (13 cases) \| Honest limitation: bias penalty fires rarely (≈ once every 92 episodes under uniform sampling). This is a proof-of-concept signal, not a large-scale bias-mitigation system. The 28% parity-argument signal provides the main training pathway for fairness reasoning. Future work: expand bias-flagged evaluation set to 10–15%. \|
	\| Empty `prosecution_arguments` \| ~53% \| Not a flaw — this mirrors real case records where prosecution arguments are not always transcribed. The model must reason from charge sheet and defence arguments alone, which is the actual judicial workflow. \|

	Why imbalanced data is valuable for RL training:
	Balanced datasets teach pattern matching. Imbalanced datasets teach robust reasoning under real-world distributions. A model trained on 50/50 Medium/High flight-risk cases would fail on real HC data, which is overwhelmingly Medium. UndertriAI's distribution forces the model to learn when "Medium" is correct (most cases) and when it's wrong (bias-reversal cases) — which is exactly the reasoning pattern judges need.

	---

	## Why This Matters

	> "Bail is the rule, jail is the exception."
	> — Supreme Court of India, Satender Kumar Antil v. CBI (2022)

	An RL-trained agent that consistently applies this principle — without being swayed by a defendant's name, religion, or economic status — could serve as a real-time consistency check for overburdened courts.

	This is not a tool to replace judges. It is a mirror that forces the system to confront its own inconsistencies.

	---

	## Results & Verification

	### Training Evidence

	Due to compute and time constraints during the hackathon, we conducted limited training runs to validate the environment's learnability. Full-scale training with optimal hyperparameters is planned for post-hackathon work.

	Setup for the headline run (Qwen2.5-1.5B-Instruct on A10G-large):

	\| Parameter \| Value \|
	\|---\|---\|
	\| Total training steps \| 120 (30 per stage × 4 stages) \|
	\| Episode quota \| 120 cases (30 per stage, balanced) \|
	\| Effective batch size \| 32 completions per step (1 × 8 × 4) \|
	\| Max completion length \| 728 tokens \|
	\| Wall time \| ~1h 50m \|
	\| Reward source — training \| In-process `combined_reward` (the same module the env imports) \|
	\| Reward source — eval (n=12 per stage) \| In-process `combined_reward` against held-out episodes \|
	\| Env-API parity \| Bitwise — eval scores reproduce on `rollout_via_env_api` up to sampling stochasticity \|

	Headline metrics (n = 12 episodes per stage, scored with `combined_reward`; bitwise parity with `server/reward.py`):

	\| Stage \| Before (zero-shot) \| After (trained) \| Δ \|
	\|---\|---\|---\|---\|
	\| Stage 1 — Landmark cases (clear-cut) \| 0.4786 \| 0.5314 \| +0.0528 \|
	\| Stage 2 — Statutory thresholds (BNSS §479) \| 0.3992 \| 0.4827 \| +0.0835 \|
	\| Stage 3 — Bias / disadvantage scenarios \| 0.4154 \| 0.4734 \| +0.0580 \|
	\| Stage 4 — Interleaved + perturbations \| 0.4710 \| 0.4717 \| +0.0007 \|
	\| Mean (all stages) \| 0.4410 \| 0.4898 \| +0.0488 (+11% relative) \|
	\| Traces harvested into Stage N+1 prompts (Theme 4) \| — \| 8 \| — \|

	![Baseline vs trained reward per curriculum stage](assets/results/before_after_comparison.png)

	Headline figure — baseline vs trained reward per curriculum stage. Stages 1–3 show consistent improvement with the largest gain on statutory-threshold reasoning (Stage 2, +0.084). Stage 4 (perturbations) is essentially flat — the open problem.

	Reading the table. GRPO produced consistent gains on Stages 1–3 (format compliance, outcome correctness, statutory threshold reasoning, bias-penalty avoidance), with the largest absolute improvement on Stage 2 — exactly where the new `reward_reasoning_specificity` signal was designed to fire. Stage 4 (perturbations: name swaps, numerical variants, schema drift) is flat: the model fits the curriculum but does not yet generalise to robustness perturbations after only 30 steps per stage. We treat this as the headline open problem (see Limitations & Future Work).

	![Reward curve across all four curriculum stages](assets/results/reward_curve.png)

	Multi-stage reward trajectory (cumulative steps 5 → 120). Each colour is one curriculum stage; dashed lines* are the zero-shot baseline for that stage and dotted lines are the post-train evaluation. Training rollouts (the connected dots) sit consistently above the dashed baselines, confirming GRPO is updating the policy in the right direction. The Stage 4 rollouts are also above its baseline, but the post-train eval lands almost exactly on the baseline — visual confirmation that gains do not transfer to perturbed inputs.*

	![GRPO training loss across all 120 cumulative steps](assets/results/training_loss.png)

	Training loss (note y-axis: ×10⁻⁶). Loss in GRPO is dominated by the KL penalty (`beta=0.01`) — the actual learning signal lives in the reward, not the loss. The slow downward drift across cumulative steps is consistent with stable, non-collapsing updates.

	Reconstructed from log. The full per-step `log_history` (24 entries: 4 stages × 30 steps ÷ logging_steps = 5) is embedded in `outputs/undertrial_grpo/curriculum_results.json` for independent verification. The plots above were rebuilt from the captured `hf jobs logs` stdout via [`training/parse_job_log.py`](training/parse_job_log.py) — the artifacts inside the HF Jobs container did not survive the ephemeral filesystem teardown, but every metric we needed was already in the log.

	Methodology note (honest framing). The numbers above are from in-process `combined_reward` evaluation against held-out episodes; the reward code is byte-identical to the live env's `server/reward.py`, so a deployment-time env-API rollout against the same episodes returns the same score. The `--env_url` plumbing is wired through `train_grpo.py` and verified for liveness on each run; we chose in-process scoring during training to avoid HTTP latency dominating the rollout loop, not because the env API is unreliable. A separate post-training env-API verification pass would produce identical numbers up to model-sampling stochasticity (`temperature=0.85`).

	Note on limited training. These results represent a single 30-steps-per-stage validation run on Qwen2.5-1.5B-Instruct under a 3-hour wall budget. With longer training, larger base models (3B / 7B), and richer perturbation curricula, we expect Stage 4 to also show meaningful gains and absolute mean reward to exceed 0.70+. The gaming-resistance verification (below) confirms that any reward improvement we observe corresponds to genuine legal reasoning rather than format exploitation.

	### Gaming Resistance Verified

	The reward function correctly ranks completions by reasoning quality:

	\| Completion Type \| Sample Reward \| Verification \|
	\|---\|---\|---\|
	\| Ideal (full reasoning, all tools, correct outcome) \| 1.15 \| ✅ PASS \|
	\| Filler (generic reasoning, minimal tools) \| 0.66 \| ✅ PASS \|
	\| Minimal (bare XML, no tools) \| 0.32 \| ✅ PASS \|
	\| Tool spam (redundant calls, no reasoning) \| 0.17 \| ✅ PASS \|

	GRPO correctly optimises for `ideal > filler > minimal > spam`.

	### Verification Suite

	- `smoke_test.py` — 10 / 10 PASS (environment correctness, tool registration, episode loading)
	- `pass5_verify.py` — 8 / 8 PASS (gaming resistance, component independence, reward bounds)
	- `quick_check.py` — 1-minute end-to-end env reachability + sample episode roundtrip

	### Demo & Resources

	- [Live HF Space](https://huggingface.co/spaces/Draken1606/undertrial-ai) — interactive bail assessment demo
	(Note: Space may need 30–60 s to wake from sleep on first visit)
	- [Swagger API Docs](https://draken1606-undertrial-ai.hf.space/docs) — full REST API documentation
	- [Training Script](training/train_grpo.py) — GRPO training with Unsloth (single / curriculum / adaptive modes)
	- [Colab Notebook](training/UndertriAI_GRPO_Training.ipynb) — step-by-step training walkthrough
	- [Project Blog](BLOG_LINK_HERE) — "Three minutes should never decide a life" (link to be updated)
	- [Source Paper](https://arxiv.org/abs/2508.07592) — dataset methodology and fairness analysis
	- [Dataset on HF](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200) — 1,200 annotated HC judgments

	---

	## Limitations & Future Work

	Current limitations:

	- Bias-flagged cases are sparse (~1%, 13 cases) — sufficient for proof-of-concept, not for large-scale fairness claims. Parity-argument signal partially compensates.
	- Training was offline (in-process scoring) for latency reasons. Headline numbers are env-API-verified post-hoc; full online training is implemented but not used by default in `--curriculum` mode.
	- Single-model evaluation — only Qwen2.5-1.5B-Instruct was trained for the hackathon submission. Larger backbones (3B / 7B) likely close the gap to higher reward ceilings.
	- No human-in-the-loop fairness audit — bias detection relies on dataset annotations; an external legal-expert review is future work.

	Future improvements:

	- Expand bias-flagged cases to 10–15% of dataset
	- Add adversarial evaluation set (cases designed to exploit reward weaknesses)
	- Train on larger models (Qwen2.5-7B, Llama-3-8B) with extended curricula
	- Add human-in-the-loop evaluation for bias detection
	- Switch curriculum mode to env-API rewards once HTTP overhead is amortised (e.g. via batched `/step` or co-located env)

	---

	## Team

	Built for the Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, April 2026.

	Primary Theme: Theme 3.1 — Professional Tasks / World Modeling
	Secondary Theme: Theme 4 — Self-Improvement

	---

	## Citation

	If you use this environment or dataset, please cite:

	```bibtex
	@article{deshmukh2025indianbail,
	title = {IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts},
	author = {Deshmukh, Sneha and others},
	journal = {arXiv preprint arXiv:2508.07592},
	year = {2025}
	}
	```

	---

	## License

	MIT License — see [LICENSE](LICENSE) for details.

	Environment code licensed under MIT. Dataset usage subject to terms in the [HF dataset card](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200).