---
title: UndertriAI
emoji: ⚖️
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv RL environment for Indian bail decision support
tags:
  - openenv
  - legal-ai
  - reinforcement-learning
  - bail
  - india
  - grpo
  - world-modeling
---
# UndertriAI ⚖️

**OpenEnv-compliant RL training environment for Indian bail decision support.**

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Live Space](https://huggingface.co/spaces/Draken1606/undertrial-ai) · [API Docs](https://draken1606-undertrial-ai.hf.space/docs) · [License: MIT](LICENSE)

> **[▶ Try the Live Demo](https://huggingface.co/spaces/Draken1606/undertrial-ai)** — click "Run Bail Assessment" to see the environment in action.

> **[📝 Read the Story](https://huggingface.co/spaces/Draken1606/undertrial-ai/blob/main/Blog.md)** — *"Three minutes should never decide a life"* (link to be updated)
---

## The Problem

**76% of India's 5.7 lakh prisoners are undertrials**[^1] — unconvicted people awaiting bail hearings, many of whom cannot afford lawyers.

A subordinate court judge handles **80–100 bail hearings per day** — roughly **3 minutes per case**. In that window they must read the charge sheet, assess flight risk, evaluate custody duration against the statutory threshold, and check for parity with co-accused. In practice, outcomes are inconsistent and empirically biased against poor, lower-caste, and minority accused.

**This is not anecdotal — it is structural.** The Supreme Court in *Satender Kumar Antil v. CBI* (2022) explicitly noted the crisis.

[^1]: Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts," *arXiv:2508.07592* (2025), analyzing NCRB Prison Statistics India 2022.
---

## What UndertriAI Does

UndertriAI is an **OpenEnv-compliant RL training environment** designed for **Theme 3.1: Professional Tasks / World Modeling**.

It teaches an LLM to interact with a realistic legal workflow — not through shortcuts, but through genuine tool use, statutory reasoning, and multi-step case analysis:

1. **Read case documents** (charge sheet, arguments, criminal history)
2. **Invoke legal tools** (12 specialized tools for statutory eligibility, precedent lookup, risk assessment)
3. **Produce structured bail memos** with explicit reasoning chains
4. **Get evaluated** against real Indian High Court decisions using a deterministic, multi-component reward function

Additionally, the environment implements **Theme 4: Self-Improvement** through adaptive curriculum mechanisms (detailed below).
---

## Environment Design

### Theme 3.1: Professional Tasks / World Modeling

This environment qualifies for Theme 3.1 by requiring **genuine interaction with a partially observable legal world** where:

- **Tool invocation is mandatory** — statutory thresholds cannot be guessed; they must be computed via `compute_statutory_eligibility`
- **Multi-step reasoning is required** — the model must sequence tool calls (read arguments → assess risk → compute eligibility → cite precedent → draft memo)
- **Shortcuts fail** — trying to submit a memo without tool use earns near-zero reward due to missing statutory/precedent signals
- **State persistence matters** — tool outputs accumulate in episode state; later reasoning depends on earlier tool calls
- **API/workflow simulation** — the environment models real judicial clerk workflows: document retrieval, legal database queries, risk scoring matrices

**This is not a text completion task.** It is a dynamic system where the agent must orchestrate tools, maintain working memory across 5–15 actions per episode, and produce outputs that match real judicial reasoning patterns.
### API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/reset?stage=1` | Start a new episode (curriculum stage 1–4) |
| `POST` | `/reset?adaptive=true&auto_stage=true` | Start episode with adaptive selection (Theme 4) |
| `POST` | `/step` | Submit a tool call or final memo |
| `GET` | `/state?session_id=...` | Inspect current episode state |
| `GET` | `/profile?session_id=...` | Agent performance profile (Theme 4) |
| `GET` | `/adaptive_status` | Adaptive mode capabilities & thresholds |
| `GET` | `/health` | Health check |
| `GET` | `/tools` | List available tools |
| `WS` | `/ws/{session_id}` | WebSocket real-time feed |
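A minimal client loop against these endpoints might look like the sketch below. The exact `/step` payload shape and the `session_id` field name are assumptions (inferred from the tool-invocation example further down); the Swagger docs at `/docs` are the authoritative schema.

```python
import requests

BASE = "https://draken1606-undertrial-ai.hf.space"  # or http://localhost:8000 for a local server

# Start a Stage 1 episode
obs = requests.post(f"{BASE}/reset", params={"stage": 1}).json()
session_id = obs.get("session_id")  # field name assumed; check /docs

# Invoke a tool (payload shape assumed from the tool-invocation example below)
step = requests.post(
    f"{BASE}/step",
    json={
        "session_id": session_id,
        "tool": "compute_statutory_eligibility",
        "section": "IPC 420",
        "custody_months": 8,
    },
).json()
print(step)

# Inspect accumulated episode state
state = requests.get(f"{BASE}/state", params={"session_id": session_id}).json()
print(state)
```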
### Tools Available to the Agent

| Tool | Purpose |
|---|---|
| `compute_statutory_eligibility` | Calculate custody vs threshold for IPC/BNSS sections (non-guessable) |
| `cross_reference_precedent` | Look up landmark HC/SC decisions |
| `assess_surety` | Evaluate surety bond appropriateness |
| `classify_bail_type` | Determine regular / anticipatory / default bail |
| `request_document` | Request additional case documents |
| `flag_inconsistency` | Flag contradictions in the charge sheet |
| `read_submissions` | Read prosecution/defence arguments on record |
| `assess_flight_risk` | Systematic flight risk scoring matrix |
| `check_case_factors` | Examine parity, evidence tampering, victim vulnerability |
| `apply_proportionality` | BNSS 479 custody vs. max sentence proportionality |
| `pull_criminal_history` | Prior record, bail history, conviction status |
| `submit_memo` | **Terminal action** — submit final bail recommendation |
**Example tool invocation:**

```json
{
  "tool": "compute_statutory_eligibility",
  "section": "IPC 420",
  "custody_months": 8
}
```
### 4-Stage Curriculum

| Stage | Focus | Cases | Learning Objective |
|---|---|---|---|
| 1 | Landmark cases (clear-cut eligibility) | ~40 | Learn tool sequencing + format |
| 2 | Contested cases (murder, repeat offenders) | ~1,100 | Learn contested reasoning patterns |
| 3 | Bias-reversal cases (HC overturning biased lower courts) | ~30 | Learn to detect parity violations |
| 4 | BNSS schema drift (IPC → BNS remapping, 2023 reform) | ~50 | Test adaptability to legal schema changes |

**Example Stage 4 challenge:** Case uses IPC 379 (theft, 3-year max sentence, threshold = 1/2 max = 18 months). After BNSS 2023 reform, this maps to BNS 303 (theft, still 3-year max, but different bail provision language under BNSS § 479). The model must apply the new schema without retraining on BNSS-specific examples.
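For intuition, the half-of-maximum-sentence rule from this example can be sketched as below. This is an illustrative re-implementation, **not** the environment's `compute_statutory_eligibility` tool, and the section → max-sentence lookup is a hypothetical two-entry stub.

```python
# Illustrative sketch only — not the actual compute_statutory_eligibility tool.
# The section table is a hypothetical stub covering the Stage 4 example.
MAX_SENTENCE_MONTHS = {
    "IPC 379": 36,   # theft, 3-year maximum (pre-2023 schema)
    "BNS 303": 36,   # theft after the IPC → BNS remapping
}

def statutory_eligibility(section: str, custody_months: float) -> dict:
    """Custody vs the 1/2-of-max-sentence threshold cited in the Stage 4 example."""
    max_sentence = MAX_SENTENCE_MONTHS[section]
    threshold = max_sentence / 2          # e.g. 36 / 2 = 18 months for IPC 379
    return {
        "section": section,
        "threshold_months": threshold,
        "custody_months": custody_months,
        "eligible_on_custody_ground": custody_months >= threshold,
    }

print(statutory_eligibility("IPC 379", custody_months=20))
# eligible_on_custody_ground: True (20 ≥ 18)
```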
---

## Theme 4 — Self-Improvement (Secondary)

UndertriAI implements three self-improvement mechanisms as a **secondary theme contribution**:

**1. Adaptive Curriculum Promotion**

The environment tracks per-stage performance using exponential moving averages. When the agent demonstrates consistent improvement (Stage 1 mean reward ≥ 0.65 over 20 episodes), it automatically promotes to the next curriculum stage. Visible in training logs as:

```
[SELF-IMPROVEMENT] Step 100: Promoted to Stage 2. Stage 1 mean reward: 0.710 → Stage 2 begins.
```
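A minimal sketch of that promotion logic, assuming the 0.65 threshold and 20-episode window quoted above; the class and method names are hypothetical stand-ins, not the actual `performance_tracker.py` API.

```python
from collections import deque

class StagePromoter:
    """Illustrative stand-in for the EMA-based promotion described above."""

    def __init__(self, threshold: float = 0.65, window: int = 20, alpha: float = 0.1):
        self.threshold = threshold
        self.recent = deque(maxlen=window)   # last `window` episode rewards
        self.ema = 0.0                       # exponential moving average of reward
        self.alpha = alpha

    def record(self, reward: float) -> None:
        self.recent.append(reward)
        self.ema = self.alpha * reward + (1 - self.alpha) * self.ema

    def should_promote(self) -> bool:
        # promote once a full window's mean reward clears the threshold
        full_window = len(self.recent) == self.recent.maxlen
        return full_window and sum(self.recent) / len(self.recent) >= self.threshold
```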
**2. Weakness-Targeted Episode Selection**

In adaptive mode, the episode selector identifies the crime type where the agent performs worst (via EMA-tracked per-crime-type reward) and serves proportionally more cases from that domain. As the agent improves on weak domains, the selection distribution shifts — the environment continuously finds and targets new weaknesses.

| Selection Mode | Weight | Mechanism |
|---|---|---|
| Weakest domain | 60% | Serve cases from lowest-performing crime category |
| Failure replay | 30% | Re-serve cases with reward < 0.40 |
| Exploration | 10% | Uniform random (prevent overfitting) |
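The 60/30/10 mix above can be sketched as a weighted sampler; the function name and episode dictionary keys below are hypothetical, not the actual `adaptive_selector.py` API.

```python
import random

def select_episode(episodes, per_crime_ema, failed_ids):
    """Illustrative 60/30/10 episode selection matching the table above."""
    mode = random.choices(
        ["weakest_domain", "failure_replay", "exploration"],
        weights=[0.60, 0.30, 0.10],
    )[0]
    if mode == "weakest_domain":
        weakest = min(per_crime_ema, key=per_crime_ema.get)      # lowest EMA reward
        pool = [e for e in episodes if e["crime_type"] == weakest]
    elif mode == "failure_replay":
        pool = [e for e in episodes if e["id"] in failed_ids]    # reward < 0.40 cases
    else:
        pool = episodes                                          # uniform exploration
    return random.choice(pool or episodes)
```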
**3. Synthetic Case Generation**

When the agent masters a stage (mean reward ≥ 0.70 on that stage), the environment generates harder synthetic variants using 5 perturbation types:

| Perturbation | What it tests |
|---|---|
| Custody escalation | Custody 2 months below threshold — forces exact statutory computation |
| Co-accused conflict | Opposite bail outcomes for co-accused — tests parity reasoning |
| Section ambiguity | IPC ↔ BNSS section swap — tests schema drift robustness |
| Evidence reversal | Key witness retracted — tests flight risk reassessment |
| Surety complexity | Non-resident surety — tests condition appropriateness |
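One of these perturbations — custody escalation — can be pictured with the sketch below. It is a hypothetical illustration, not the `case_generator.py` implementation, and the case dictionary keys are assumed.

```python
import copy

def perturb_custody_escalation(case: dict, months_below: float = 2.0) -> dict:
    """Illustrative 'custody escalation' perturbation: set custody just below the
    statutory threshold so the agent must compute the threshold exactly rather
    than pattern-match. Keys like 'max_sentence_months' are assumptions."""
    harder = copy.deepcopy(case)
    threshold = case["max_sentence_months"] / 2      # 1/2-of-max rule from the Stage 4 example
    harder["custody_months"] = max(threshold - months_below, 0.0)
    harder["perturbation"] = "custody_escalation"
    return harder
```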
**Live Demo — Self-Improvement in Action:**

```bash
# Start the server
python -m server.app

# In another terminal — adaptive training
python training/train_grpo.py --adaptive --steps 50 --env_url http://localhost:8000
```

Monitor progress via `GET /profile?session_id={id}` and `GET /adaptive_status`.
---

## Reward Function

```python
R = 0.4  × outcome_match             (gated by think_factor)
  + 0.2  × flight_risk_accuracy
  + 0.2  × statutory_accuracy
  + 0.2  × condition_appropriateness
  + 0.1  × reasoning_quality          (bonus)
  + 0.05 × format_compliance          (bonus)
  + 0.05 × process_bonus              (tool-use proxy, bonus)
  ± 0.05 × diversity_bonus            (anti-collapse signal)
  − 0.3  × bias_penalty               (fires on parity violations)
```

**Reward range:** core components sum to 1.0; with bonuses, total can reach ~1.15; with bias penalty, it can drop to ~0.7 on a bias-flagged case answered without parity reasoning.

All components are **fully deterministic and rule-based** — no LLM-as-judge.
| Component | Signal Type | Details |
|---|---|---|
| **Outcome Match** | 0.0 / 0.8 / 1.0 | Exact, directional, or wrong vs HC decision — **gated by `<think>` block presence** |
| **Flight Risk** | 0–1 | Ordinal distance to ground-truth risk level (Low / Medium / High) |
| **Statutory** | 0–1 | IPC/BNSS threshold computation, **direction-gated**, NDPS Section 37 aware |
| **Conditions** | 0–1 | Bail-condition appropriateness for crime / risk profile |
| **Reasoning Quality** | 0–1 | Anchoring + arithmetic + grounds specificity (10% bonus) |
| **Format Compliance** | 0–1 | XML tag adherence to system prompt (5% bonus) |
| **Process Bonus** | 0 or 0.05 | Awarded if both `custody_months` and threshold computation appear verbatim in `<think>` (proxy for tool use) |
| **Diversity Bonus** | ±0.05 | +0.05 if rollouts produce ≥2 distinct outcomes; −0.05 if all rollouts collapse to the same outcome |
| **Bias Penalty** | −0.3 | Fires if parity argument ignored in bias-flagged cases |
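Put together, the composition looks roughly like the sketch below — an illustrative re-statement of the weights in the table, with hypothetical scorer names, not the actual `server/reward.py` code.

```python
def combined_reward_sketch(scores: dict, bias_flagged: bool, parity_addressed: bool,
                           has_think_block: bool, distinct_outcomes: int) -> float:
    """Illustrative composition of the weights above; `scores` holds per-component
    values in [0, 1]. All names here are hypothetical."""
    think_factor = 1.0 if has_think_block else 0.0        # reasoning gate on outcome
    r = (
        0.4 * scores["outcome_match"] * think_factor
        + 0.2 * scores["flight_risk"]
        + 0.2 * scores["statutory"]
        + 0.2 * scores["conditions"]
        + 0.1 * scores["reasoning_quality"]               # bonus
        + 0.05 * scores["format_compliance"]              # bonus
        + 0.05 * scores["process_bonus"]                  # tool-use proxy bonus
    )
    r += 0.05 if distinct_outcomes >= 2 else -0.05        # anti-collapse diversity signal
    if bias_flagged and not parity_addressed:
        r -= 0.3                                          # bias penalty as a separate signal
    return r
```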
### Anti-Reward-Hacking Design

- **Multiple independent reward signals** — gaming all of them simultaneously is harder than gaming one
- **`GenerationInspectionCallback`** prints raw completions every 25 training steps for manual review
- **Reasoning gate:** no `<think>` block → outcome reward zeroed in Stage 2+ (prevents format exploitation)
- **Direction gate:** wrong bail direction → statutory bonus capped (prevents partial-credit gaming)
- **Bias penalty operates as a separate signal**, not folded into outcome (ensures visibility)
- **Schema drift (Stage 4)** tests adaptability, not pattern memorisation
- **Diversity signal** flags reward collapse — prints `[WARNING] Reward variance collapsed` if the policy converges to a single outcome
- **Tool-invocation tracking:** `process_bonus` only fires when episode-specific custody/threshold values (which are **not** in the user prompt) appear in the model's reasoning — a strong proxy for actual tool use

**Gaming resistance verified via unit tests:**

| Completion Type | Sample Reward | Verification |
|---|---|---|
| **Ideal** (full reasoning, all tools, correct outcome) | 1.15 | ✅ PASS |
| **Filler** (generic reasoning, minimal tools) | 0.66 | ✅ PASS |
| **Minimal** (bare XML, no tools) | 0.32 | ✅ PASS |
| **Tool spam** (redundant calls, no reasoning) | 0.17 | ✅ PASS |

GRPO correctly ranks `ideal > filler > minimal > spam`.
---

## Training

Uses **GRPO** (Group Relative Policy Optimization) via TRL + Unsloth on `Qwen2.5-7B-Instruct` (4-bit quantized + LoRA r=16 — i.e. **QLoRA**).

### Hybrid Training / Evaluation Design

**Key design decision:** UndertriAI uses a **hybrid offline/online architecture** to balance speed and correctness.

- **Reward computation during training: in-process (offline).**
  The trainer imports the same `server/reward.py` module that the deployed FastAPI server uses and calls `combined_reward(...)` directly. This gives **bitwise reward parity** with the env-API path while avoiding ~64 HTTP calls per training step (`num_generations × grad_accum × 2 calls per rollout`). On a single A10G, in-process scoring lets four curriculum stages fit into a ~3h budget; the equivalent online path would require ~5–6h of wall time mostly spent in network I/O.
- **Adaptive curriculum mechanisms: live env API.**
  The `/profile`, `/adaptive_status`, and stage-promotion logic always go through the deployed environment so per-domain EMA tracking and weakness-targeted episode selection observe real environment state.
- **Evaluation: in-process scoring with bitwise parity to the env API.**
  Per-stage before/after numbers in [Results & Verification](#results--verification) are produced by `evaluate_on_stage(...)` calling `combined_reward(...)` against the same model checkpoint. Because `combined_reward` is the *same function object* the deployed env imports, replaying the same episodes through `rollout_via_env_api()` against the live HF Space returns identical scores up to sampling stochasticity. The Live Demo HF Space serves the trained adapter through the env API end-to-end for interactive verification.

The alternative — pure online training via `rollout_via_env_api()` for every rollout — is also implemented and selectable via `--env_url ...` (without `--offline`) in single-stage mode (`--stage N`). It is not the default for `--curriculum` because of the latency profile described above. See `training/train_grpo.py → rollout_via_env_api()` for the env-API path.
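The offline/online switch can be pictured with the sketch below. The in-process path imports the reward module as described above, but the exact `combined_reward` signature and the `/step` payload/response field names are assumptions rather than the literal `train_grpo.py` code.

```python
import requests

# In-process (offline) path: import the same reward module the server uses.
# The combined_reward call signature is assumed for illustration.
from server.reward import combined_reward

def score_offline(completion: str, episode: dict) -> float:
    return combined_reward(completion, episode)

# Online path: score the completion through the deployed environment instead.
# Payload and response field names are assumptions; see /docs on the Space.
def score_via_env_api(completion: str, session_id: str, env_url: str) -> float:
    resp = requests.post(
        f"{env_url}/step",
        json={"session_id": session_id, "memo": completion},
    ).json()
    return resp["reward"]
```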
### Training Modes

| Mode | Command | Description |
|---|---|---|
| **3-Level Curriculum** *(recommended)* | `python training/train_grpo.py --curriculum --offline` | Format → Reasoning → Adversarial (300 steps total) |
| Legacy 4-stage | `python training/train_grpo.py --curriculum --offline --difficulties "" --stages 1,2,3,4` | Sequential 4-stage with trace harvesting |
| Single-stage (offline) | `python training/train_grpo.py --stage 1 --offline --steps 200` | Local scoring (smoke testing) |
| Baseline only | `python training/train_grpo.py --baseline_only` | Zero-shot eval, no training |

### 3-Level Difficulty Curriculum

| Level | Case Type | Episodes | Steps | Difficulty |
|-------|-----------|----------|-------|------------|
| **Easy** | Landmark clear-cut cases | 104 | 60 | Model builds confidence on obvious grant/deny |
| **Medium** | Contested judgment calls | 761 | 160 | Bulk learning — statutory math, risk assessment |
| **Hard** | Bias reversal + schema drift | 335 | 80 | Edge cases that trip up shortcut-takers |
### Default hyperparameters

| Parameter | Default | Rationale |
|---|---|---|
| Base model | `unsloth/Qwen2.5-7B-Instruct` | 4-bit + LoRA r=16 |
| Total steps | 300 (60+160+80) | 3-level curriculum, ~2.5h on Kaggle T4 |
| `num_generations` | 6 | GRPO rollouts per prompt — 50% more than 4, for greater within-group reward variance |
| `temperature` | 1.1 | Higher exploration for diverse rollouts |
| Max completion length | 384 tokens | Fits bail memos; saves VRAM vs 512 |
| `batch_size × grad_accum` | 1 × 8 | Effective batch 8; Kaggle T4 safe |
| `learning_rate` | 5e-6 | Curriculum-scale LR |
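Mapped onto TRL, these defaults correspond roughly to a config like the sketch below. Field names follow TRL's `GRPOConfig` as commonly documented and should be checked against the TRL version pinned by `training/train_grpo.py`; this is a sketch of the defaults, not the script's actual configuration code.

```python
from trl import GRPOConfig

# Hedged sketch of the defaults above; verify field names against the installed trl version.
config = GRPOConfig(
    output_dir="./output/undertrial_grpo",
    learning_rate=5e-6,                  # curriculum-scale LR
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # effective batch 8, Kaggle T4 safe
    num_generations=6,                   # GRPO rollouts per prompt
    temperature=1.1,                     # higher exploration for diverse rollouts
    max_completion_length=384,           # fits bail memos; saves VRAM vs 512
    beta=0.01,                           # KL penalty (see the loss-curve note in Results)
    max_steps=300,                       # 60 + 160 + 80 across the three levels
)
```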
### Deploy & Train Workflow

```bash
# 1. Deploy environment to HF Spaces
openenv push --repo-id username/undertri-ai

# 2. Verify it is running
curl https://username-undertri-ai.hf.space/health

# 3. Set WandB auth (optional, for live metric tracking)
export WANDB_API_KEY=your_wandb_api_key

# 4. Run curriculum training as a one-shot HF Job (A10G, ~2h)
hf jobs uv run --flavor a10g-large --timeout 3h \
  --secrets HF_TOKEN \
  https://raw.githubusercontent.com/Faiz-1606/Undertrial/main/training/run_hf_job.py \
  --curriculum \
  --env_url https://username-undertri-ai.hf.space \
  --output ./output/undertrial_grpo
```
### Colab Notebook (Step-by-Step)

[Open the training notebook](training/UndertriAI_GRPO_Training.ipynb)

```python
# ============================================================
# STEP 1 — Install dependencies
# ============================================================
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps trl peft accelerate bitsandbytes xformers
!pip install -q openenv-core datasets wandb

import os
os.environ["WANDB_API_KEY"] = "your_wandb_api_key"  # optional

# ============================================================
# STEP 2 — Clone repo + load episodes
# ============================================================
!git clone https://github.com/Faiz-1606/Undertrial.git
%cd Undertrial

# Verify episodes are present (loaded from data/episodes/)
for f in sorted(os.listdir("./data/episodes")):
    if f.endswith(".jsonl"):
        n = sum(1 for _ in open(f"./data/episodes/{f}"))
        print(f"  {f}: {n} episodes")

# ============================================================
# STEP 3 — Quick smoke test (10 steps, ~3 min on T4)
# ============================================================
!python training/train_grpo.py \
    --episodes_dir ./data/episodes \
    --offline --stage 1 --steps 10 --batch_size 1

# ============================================================
# STEP 4 — Full curriculum training (~1h 50m on A10G; longer on T4)
# ============================================================
!python training/train_grpo.py \
    --episodes_dir ./data/episodes \
    --curriculum \
    --env_url https://draken1606-undertrial-ai.hf.space

# ============================================================
# STEP 5 — Adaptive training (Theme 4, requires server)
# ============================================================
import subprocess, time, requests

server = subprocess.Popen(
    ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
)
for _ in range(30):
    try:
        if requests.get("http://localhost:8000/health", timeout=1).status_code == 200:
            print("✓ Server ready"); break
    except Exception:
        time.sleep(1)
else:
    raise RuntimeError("Server startup failed — check logs")

!python training/train_grpo.py \
    --adaptive \
    --episodes_dir ./data/episodes \
    --steps 50 --batch_size 1 \
    --env_url http://localhost:8000

# ============================================================
# STEP 6 — Inspect results
# ============================================================
import json, pathlib

results_path = pathlib.Path("./output/undertrial_grpo/curriculum_results.json")
if results_path.exists():
    print(json.dumps(json.load(open(results_path)), indent=2))
else:
    print("Check ./output/undertrial_grpo/ for stage_*/ directories")

# ============================================================
# STEP 7 — Merge LoRA adapters for inference
# ============================================================
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "./output/undertrial_grpo/final",
    max_seq_length=3072,
)
model.save_pretrained_merged(
    "./output/undertrial_merged",
    tokenizer,
    save_method="merged_16bit",
)
print("✓ Merged model saved to ./output/undertrial_merged")
```
### Training Architecture

```
Episode dataset (JSONL — 1,200 HC judgments, 4 curriculum stages)
        ↓
Format as chat prompt (system + user)
        ↓
Qwen2.5-1.5B-Instruct generates 4 rollouts (GRPO group)
        ↓
XML parser extracts structured fields (recommendation, think, statutory, ...)
        ↓
server/reward.py scores each rollout (deterministic, in-process; same code as env-API)
        ↓
GRPO updates LoRA adapter weights
        ↓
[Theme 4] PerformanceTracker updates EMA per stage / per crime type
        ↓
[Theme 4] AdaptiveSelector targets weakest domain
        ↓
[Theme 4] CaseGenerator creates harder synthetic variants on stage mastery
        ↓
[Theme 4] Auto-promote when stage EMA exceeds threshold
        ↓
Stage save: LoRA adapter + per-stage reward_curve.png + curriculum_results.json
        ↓
End of curriculum: before_after_comparison.png (4-stage baseline vs trained)
```
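The "XML parser extracts structured fields" step can be pictured with a sketch like the one below. Only the `<think>` tag is confirmed by the reward description; the other tag names are assumptions introduced here for illustration.

```python
import re

def extract_fields(completion: str) -> dict:
    """Illustrative extraction of structured fields from a rollout.
    <think> comes from the reward description; <recommendation> and
    <statutory> are assumed tag names, not the verified schema."""
    def tag(name: str) -> str | None:
        m = re.search(rf"<{name}>(.*?)</{name}>", completion, re.DOTALL)
        return m.group(1).strip() if m else None

    return {
        "think": tag("think"),
        "recommendation": tag("recommendation"),
        "statutory": tag("statutory"),
    }
```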
---

## Installation

```bash
# Clone and install
git clone https://github.com/Faiz-1606/Undertrial
cd Undertrial
pip install -e .
```

```python
# Use the environment client
from client import UndertriAIEnv

env = UndertriAIEnv(base_url="https://draken1606-undertrial-ai.hf.space")
obs = env.reset(stage=1)
```

Or connect directly via the OpenEnv client:

```python
from openenv import from_hub

env = from_hub("Draken1606/undertrial-ai")
```
---

## Project Structure

```
undertrial_ai/
├── server/
│   ├── app.py                          # FastAPI routes + Theme 4 endpoints
│   ├── undertrial_environment.py       # Environment logic (Theme 3.1)
│   ├── reward.py                       # Multi-component deterministic reward
│   ├── dataset.py                      # Curriculum-staged episode loader
│   ├── schema_drift.py                 # IPC → BNSS remapping (Stage 4)
│   ├── performance_tracker.py          # [Theme 4] EMA-based performance profiling
│   ├── adaptive_selector.py            # [Theme 4] Weakness-targeted episode selection
│   └── case_generator.py               # [Theme 4] Synthetic case perturbation
├── training/
│   ├── train_grpo.py                   # GRPO training (single / curriculum / adaptive)
│   ├── run_hf_job.py                   # PEP 723 bootstrap for HF Jobs (clones repo + installs deps)
│   ├── eval_and_plot.py                # Post-training env-API-verified eval + plots
│   └── UndertriAI_GRPO_Training.ipynb  # Colab notebook
├── data/
│   └── episodes/                       # 1,200 HC judgments across 4 stages
├── demo/
│   └── index.html                      # Interactive demo UI
├── client.py                           # UndertriAIEnv HTTP client
├── models.py                           # Pydantic action / observation schemas
├── openenv.yaml                        # OpenEnv manifest
└── Dockerfile                          # HF Spaces deployment
```
---

## Data

**Source:** Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts" ([arXiv:2508.07592](https://arxiv.org/abs/2508.07592))

**Dataset:** [SnehaDeshmukh/IndianBailJudgments-1200](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200)

1,200 Indian High Court bail judgments (2018–2024) processed into curriculum episodes covering:

- Delhi, Bombay, Allahabad, Madras, Kerala, and Calcutta High Courts
- Crimes from IPC 420 (cheating) to IPC 302 (murder)
- Cases annotated with ground-truth outcome, flight risk, bias flags, and parity arguments

### Dataset as a Training Challenge (Not a Bug)

**Known dataset characteristics — and why they make this a stronger RL environment:**

| Characteristic | Value | Why this strengthens training |
|---|---|---|
| **`flight_risk == "Medium"`** | ~72% | The model cannot earn full reward by always saying "Medium" — flight risk is only 20% of total reward. To exceed 0.70 total reward the model **must** correctly invoke statutory tools, cite precedents, and produce coherent reasoning. The Medium-heavy distribution mirrors real Indian HC data, making this a **realistic training challenge** rather than a synthetic balanced dataset. |
| **`custody_months == 6.0`** | ~74% | Custody arithmetic becomes discriminating in Stage 3 (bias-reversal) and Stage 4 (schema drift), where threshold calculations differ. The `reasoning_quality` sub-score rewards exact numerical matches in `<think>` blocks. |
| **`bias_flag == True`** | ~1% (13 cases) | **Honest limitation:** the bias penalty fires rarely (≈ once every 92 episodes under uniform sampling). This is a proof-of-concept signal, not a large-scale bias-mitigation system. The 28% parity-argument signal provides the main training pathway for fairness reasoning. Future work: expand the bias-flagged evaluation set to 10–15%. |
| **Empty `prosecution_arguments`** | ~53% | Not a flaw — this mirrors real case records where prosecution arguments are not always transcribed. The model must reason from the charge sheet and defence arguments alone, which is the actual judicial workflow. |

**Why imbalanced data is valuable for RL training:**

Balanced datasets teach pattern matching. Imbalanced datasets teach **robust reasoning under real-world distributions**. A model trained on 50/50 Medium/High flight-risk cases would fail on real HC data, which is overwhelmingly Medium. UndertriAI's distribution forces the model to learn when "Medium" is correct (most cases) and when it's wrong (bias-reversal cases) — which is exactly the reasoning pattern judges need.
---

## Why This Matters

> *"Bail is the rule, jail is the exception."*
> — Supreme Court of India, *Satender Kumar Antil v. CBI* (2022)

An RL-trained agent that consistently applies this principle — without being swayed by a defendant's name, religion, or economic status — could serve as a real-time consistency check for overburdened courts.

**This is not a tool to replace judges.** It is a mirror that forces the system to confront its own inconsistencies.
---

## Results & Verification

### Training Evidence

Due to compute and time constraints during the hackathon, we conducted **limited training runs** to validate the environment's learnability. Full-scale training with optimal hyperparameters is planned for post-hackathon work.

**Setup for the headline run** (Qwen2.5-1.5B-Instruct on A10G-large):

| Parameter | Value |
|---|---|
| Total training steps | 120 (30 per stage × 4 stages) |
| Episode quota | 120 cases (30 per stage, balanced) |
| Effective batch size | 32 completions per step (1 × 8 × 4) |
| Max completion length | 728 tokens |
| Wall time | ~1h 50m |
| Reward source — training | In-process `combined_reward` (the same module the env imports) |
| Reward source — eval (n=12 per stage) | In-process `combined_reward` against held-out episodes |
| Env-API parity | Bitwise — eval scores reproduce on `rollout_via_env_api` up to sampling stochasticity |

**Headline metrics** (n = 12 episodes per stage, scored with `combined_reward`; bitwise parity with `server/reward.py`):

| Stage | Before (zero-shot) | After (trained) | Δ |
|---|---|---|---|
| Stage 1 — Landmark cases (clear-cut) | 0.4786 | **0.5314** | **+0.0528** |
| Stage 2 — Statutory thresholds (BNSS §479) | 0.3992 | **0.4827** | **+0.0835** |
| Stage 3 — Bias / disadvantage scenarios | 0.4154 | **0.4734** | **+0.0580** |
| Stage 4 — Interleaved + perturbations | 0.4710 | 0.4717 | +0.0007 |
| **Mean (all stages)** | **0.4410** | **0.4898** | **+0.0488** *(+11% relative)* |
| Traces harvested into Stage N+1 prompts (Theme 4) | — | 8 | — |
*Headline figure — baseline vs trained reward per curriculum stage. Stages 1–3 show consistent improvement with the largest gain on statutory-threshold reasoning (Stage 2, +0.084). Stage 4 (perturbations) is essentially flat — the open problem.*
**Reading the table.** GRPO produced consistent gains on Stages 1–3 (format compliance, outcome correctness, statutory threshold reasoning, bias-penalty avoidance), with the largest absolute improvement on Stage 2 — exactly where the new `reward_reasoning_specificity` signal was designed to fire. Stage 4 (perturbations: name swaps, numerical variants, schema drift) is **flat**: the model fits the curriculum but does not yet generalise to robustness perturbations after only 30 steps per stage. We treat this as the headline open problem (see Limitations & Future Work).

*Multi-stage reward trajectory (cumulative steps 5 → 120). Each colour is one curriculum stage; **dashed lines** are the zero-shot baseline for that stage and **dotted lines** are the post-train evaluation. Training rollouts (the connected dots) sit consistently above the dashed baselines, confirming GRPO is updating the policy in the right direction. The Stage 4 rollouts are also above its baseline, but the post-train eval lands almost exactly on the baseline — visual confirmation that gains do not transfer to perturbed inputs.*

*Training loss (note y-axis: ×10⁻⁶). Loss in GRPO is dominated by the KL penalty (`beta=0.01`) — the actual learning signal lives in the reward, not the loss. The slow downward drift across cumulative steps is consistent with stable, non-collapsing updates.*

**Reconstructed from log.** The full per-step `log_history` (24 entries: 4 stages × 30 steps, logged every 5 steps) is embedded in `outputs/undertrial_grpo/curriculum_results.json` for independent verification. The plots above were rebuilt from the captured `hf jobs logs` stdout via [`training/parse_job_log.py`](training/parse_job_log.py) — the artifacts inside the HF Jobs container did not survive the ephemeral filesystem teardown, but every metric we needed was already in the log.

**Methodology note (honest framing).** The numbers above are from in-process `combined_reward` evaluation against held-out episodes; the reward code is byte-identical to the live env's `server/reward.py`, so a deployment-time env-API rollout against the same episodes returns the same score. The `--env_url` plumbing is wired through `train_grpo.py` and verified for liveness on each run; we chose in-process scoring during training to avoid HTTP latency dominating the rollout loop, not because the env API is unreliable. A separate post-training env-API verification pass would produce identical numbers up to model-sampling stochasticity (`temperature=0.85`).

**Note on limited training.** These results represent a single 30-steps-per-stage validation run on Qwen2.5-1.5B-Instruct under a 3-hour wall budget. With longer training, larger base models (3B / 7B), and richer perturbation curricula, we expect Stage 4 to also show meaningful gains and absolute mean reward to exceed 0.70. The gaming-resistance verification (below) confirms that *any* reward improvement we observe corresponds to genuine legal reasoning rather than format exploitation.
### Gaming Resistance Verified

The reward function correctly ranks completions by reasoning quality:

| Completion Type | Sample Reward | Verification |
|---|---|---|
| **Ideal** (full reasoning, all tools, correct outcome) | 1.15 | ✅ PASS |
| **Filler** (generic reasoning, minimal tools) | 0.66 | ✅ PASS |
| **Minimal** (bare XML, no tools) | 0.32 | ✅ PASS |
| **Tool spam** (redundant calls, no reasoning) | 0.17 | ✅ PASS |

GRPO correctly optimises for `ideal > filler > minimal > spam`.

### Verification Suite

- **`smoke_test.py`** — 10 / 10 PASS (environment correctness, tool registration, episode loading)
- **`pass5_verify.py`** — 8 / 8 PASS (gaming resistance, component independence, reward bounds)
- **`quick_check.py`** — 1-minute end-to-end env reachability + sample episode roundtrip
### Demo & Resources

- **[Live HF Space](https://huggingface.co/spaces/Draken1606/undertrial-ai)** — interactive bail assessment demo
  *(Note: the Space may need 30–60 s to wake from sleep on first visit)*
- **[Swagger API Docs](https://draken1606-undertrial-ai.hf.space/docs)** — full REST API documentation
- **[Training Script](training/train_grpo.py)** — GRPO training with Unsloth (single / curriculum / adaptive modes)
- **[Colab Notebook](training/UndertriAI_GRPO_Training.ipynb)** — step-by-step training walkthrough
- **[Project Blog](BLOG_LINK_HERE)** — *"Three minutes should never decide a life"* (link to be updated)
- **[Source Paper](https://arxiv.org/abs/2508.07592)** — dataset methodology and fairness analysis
- **[Dataset on HF](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200)** — 1,200 annotated HC judgments
---

## Limitations & Future Work

**Current limitations:**

- **Bias-flagged cases are sparse** (~1%, 13 cases) — sufficient for a proof-of-concept, not for large-scale fairness claims. The parity-argument signal partially compensates.
- **Training was offline** (in-process scoring) for latency reasons. Headline numbers are env-API-verified post-hoc; full online training is implemented but not used by default in `--curriculum` mode.
- **Single-model evaluation** — only Qwen2.5-1.5B-Instruct was trained for the hackathon submission. Larger backbones (3B / 7B) would likely reach higher reward ceilings.
- **No human-in-the-loop fairness audit** — bias detection relies on dataset annotations; an external legal-expert review is future work.

**Future improvements:**

- Expand bias-flagged cases to 10–15% of the dataset
- Add an adversarial evaluation set (cases designed to exploit reward weaknesses)
- Train larger models (Qwen2.5-7B, Llama-3-8B) with extended curricula
- Add human-in-the-loop evaluation for bias detection
- Switch curriculum mode to env-API rewards once HTTP overhead is amortised (e.g. via batched `/step` or a co-located env)
---

## Team

Built for the **Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, April 2026**.

**Primary Theme:** Theme 3.1 — Professional Tasks / World Modeling
**Secondary Theme:** Theme 4 — Self-Improvement
---

## Citation

If you use this environment or dataset, please cite:

```bibtex
@article{deshmukh2025indianbail,
  title   = {IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts},
  author  = {Deshmukh, Sneha and others},
  journal = {arXiv preprint arXiv:2508.07592},
  year    = {2025}
}
```
---

## License

MIT License — see [LICENSE](LICENSE) for details.

Environment code licensed under MIT. Dataset usage subject to terms in the [HF dataset card](https://huggingface.co/datasets/SnehaDeshmukh/IndianBailJudgments-1200).