Space: Draken1606/undertrial-ai


Shabista Sehar committed 29 days ago

Commit

d8f8a45

1 Parent(s): 4855450

implemented


Files changed (9)

README.md +198 -26
openenv.yaml +72 -9
server/adaptive_selector.py +140 -0
server/app.py +145 -13
server/case_generator.py +312 -0
server/performance_tracker.py +251 -0
server/reward.py +139 -7
server/undertrial_environment.py +16 -1
training/train_grpo.py +407 -37

README.md CHANGED

@@ -58,8 +58,11 @@ UndertriAI is an **OpenEnv-compliant RL training environment** that teaches an L
 | Method | Endpoint | Description |
 |---|---|---|
 | `POST` | `/reset?stage=1` | Start a new episode (curriculum stage 1–4) |
 | `POST` | `/step` | Submit a tool call or final memo |
 | `GET` | `/state?session_id=...` | Inspect current episode state |
 | `GET` | `/health` | Health check |
 | `GET` | `/tools` | List available tools |
 | `WS` | `/ws/{session_id}` | WebSocket real-time feed |
@@ -74,6 +77,11 @@ UndertriAI is an **OpenEnv-compliant RL training environment** that teaches an L
 | `classify_bail_type` | Determine regular / anticipatory / default bail |
 | `request_document` | Request additional case documents |
 | `flag_inconsistency` | Flag contradictions in the charge sheet |
 | `submit_memo` | **Terminal action** — submit final bail recommendation |
 ### 4-Stage Curriculum
@@ -87,13 +95,71 @@ UndertriAI is an **OpenEnv-compliant RL training environment** that teaches an L
 ---
 ## Reward Function
 ```
-R = 0.4 × outcome_match
   + 0.2 × flight_risk_accuracy
   + 0.2 × statutory_accuracy
   + 0.2 × condition_appropriateness
   − 0.3 × bias_penalty
 ```
@@ -101,16 +167,20 @@ All components are **fully deterministic and rule-based** — no LLM-as-judge.
 | Component | Signal | Details |
 |---|---|---|
-| **Outcome Match** | 0.0 / 0.8 / 1.0 | Exact, directional, or wrong vs HC decision |
 | **Flight Risk** | 0–1 | Ordinal distance to ground-truth risk level |
-| **Statutory** | 0–1 | IPC/BNSS section, sentence threshold, custody duration |
 | **Conditions** | 0–1 | Appropriate bail conditions for crime/risk profile |
 | **Bias Penalty** | −0.3 | Fired if parity argument ignored in bias-flagged cases |
 ### Anti-Reward-Hacking Design
-- 5 independent reward signals (harder to simultaneously game all)
 - `GenerationInspectionCallback` prints raw completions every 25 training steps
 - Bias penalty operates as a separate signal, not folded into outcome
 - Schema drift (Stage 4) tests adaptability, not pattern memorisation
@@ -118,18 +188,110 @@ All components are **fully deterministic and rule-based** — no LLM-as-judge.
 ## Training
-Uses **GRPO** (Group Relative Policy Optimization) via TRL + Unsloth on `Qwen2.5-3B-Instruct`.
-```bash
-# Run with before/after eval and results.json
-python training/train_grpo.py \
-  --episodes_dir ./data/episodes \
-  --stage 1 \
-  --steps 200 \
-  --eval_after
-```
-Or use the Colab notebook: [`training/UndertriAI_GRPO_Training.ipynb`](training/UndertriAI_GRPO_Training.ipynb)
 ### Training Architecture
@@ -146,7 +308,13 @@ Episode Dataset (JSONL)
         ↓
   GRPO updates model weights
         ↓
-  GenerationInspectionCallback logs samples every 25 steps
 ```
 ---
@@ -178,21 +346,25 @@ env = from_hub("Draken1606/undertrial-ai")
 ```
 undertrial_ai/
 ├── server/
-│   ├── app.py                  # FastAPI routes
-│   ├── undertrial_environment.py  # Environment logic
-│   ├── reward.py               # 5-component deterministic reward
-│   ├── dataset.py              # Curriculum-staged episode loader
-│   └── schema_drift.py         # IPC → BNSS remapping (Stage 4)
 ├── training/
-│   ├── train_grpo.py           # GRPO training script
 │   └── UndertriAI_GRPO_Training.ipynb  # Colab notebook
 ├── data/
-│   └── episodes/               # 1,200 HC judgments across 4 stages
 ├── demo/
-│   └── index.html              # Interactive demo UI
-├── client.py                   # UndertriAIEnv HTTP client
-├── models.py                   # Pydantic action/observation schemas
-└── Dockerfile                  # HF Spaces deployment
 ```
 ---

 | Method | Endpoint | Description |
 |---|---|---|
 | `POST` | `/reset?stage=1` | Start a new episode (curriculum stage 1–4) |
+| `POST` | `/reset?adaptive=true&auto_stage=true` | Start episode with adaptive selection (Theme 4) |
 | `POST` | `/step` | Submit a tool call or final memo |
 | `GET` | `/state?session_id=...` | Inspect current episode state |
+| `GET` | `/profile?session_id=...` | Agent performance profile (Theme 4) |
+| `GET` | `/adaptive_status` | Adaptive mode capabilities & thresholds |
 | `GET` | `/health` | Health check |
 | `GET` | `/tools` | List available tools |
 | `WS` | `/ws/{session_id}` | WebSocket real-time feed |
 | `classify_bail_type` | Determine regular / anticipatory / default bail |
 | `request_document` | Request additional case documents |
 | `flag_inconsistency` | Flag contradictions in the charge sheet |
+| `read_submissions` | Read prosecution/defence arguments on record |
+| `assess_flight_risk` | Systematic flight risk scoring matrix |
+| `check_case_factors` | Examine parity, evidence tampering, victim vulnerability |
+| `apply_proportionality` | BNSS 479 custody vs. max sentence proportionality |
+| `pull_criminal_history` | Prior record, bail history, conviction status |
 | `submit_memo` | **Terminal action** — submit final bail recommendation |
 ### 4-Stage Curriculum
 ---
+## Theme 4 — Self-Improvement
+UndertriAI qualifies for Theme 4 through three mechanisms:
+**1. Adaptive Curriculum Promotion**
+The environment tracks per-domain and per-stage performance using exponential
+moving averages. When the agent's per-stage EMA reward clears the promotion
+threshold (for Stage 1: ≥ 0.65 after at least 20 episodes), it is automatically
+promoted to the next curriculum stage. This is visible in training logs as:
+```
+[SELF-IMPROVEMENT] Step 100: Promoted to Stage 2. Stage 1 mean reward: 0.710 → Stage 2 begins.
+```
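The promotion rule described above can be sketched as follows. This is a minimal illustration, not the code in `server/performance_tracker.py`: the `StageTracker` class, its smoothing factor `alpha`, and the choice to reset statistics after a promotion are all assumptions. Only the thresholds are taken from `openenv.yaml`.

```python
from dataclasses import dataclass

# Promotion thresholds from openenv.yaml: stage -> (min_reward, min_episodes).
THRESHOLDS = {1: (0.65, 20), 2: (0.55, 50), 3: (0.50, 20)}


@dataclass
class StageTracker:
    """Hypothetical stand-in for the per-stage part of PerformanceTracker."""
    stage: int = 1
    alpha: float = 0.1   # assumed EMA smoothing factor
    ema: float = 0.0
    episodes: int = 0

    def update(self, reward: float) -> bool:
        """Record one episode reward; return True if this update promoted the stage."""
        # Seed the EMA with the first reward, then blend later rewards in.
        if self.episodes == 0:
            self.ema = reward
        else:
            self.ema = self.alpha * reward + (1 - self.alpha) * self.ema
        self.episodes += 1

        threshold = THRESHOLDS.get(self.stage)
        if threshold is None:  # Stage 4 has no further promotion
            return False
        min_reward, min_episodes = threshold
        if self.episodes >= min_episodes and self.ema >= min_reward:
            self.stage += 1
            # Assumption: statistics restart for the new stage.
            self.ema, self.episodes = 0.0, 0
            return True
        return False


tracker = StageTracker()
promoted_at = None
for step in range(1, 41):
    if tracker.update(0.71) and promoted_at is None:
        promoted_at = step
print(promoted_at, tracker.stage)  # 20 2
```

With a constant reward of 0.71, the Stage 1 gate opens at episode 20 because the episode minimum, not the reward level, is the binding constraint. The next 20 episodes do not promote again because Stage 2 requires 50 episodes.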
+**2. Weakness-Targeted Episode Selection**
+In adaptive mode, the episode selector identifies the crime type where the
+agent performs worst and serves proportionally more cases from that domain.
+As the agent improves on weak domains, the selection distribution shifts —
+the environment continuously finds and targets new weaknesses.
+| Selection | Weight | Mechanism |
+|---|---|---|
+| Weakest domain | 60% | EMA-tracked per-crime-type reward |
+| Failure replay | 30% | Re-serve cases with reward < 0.40 |
+| Exploration | 10% | Uniform random (prevent overfitting) |
+**3. Synthetic Case Generation**
+When the agent masters a domain (mean reward > 0.70), the environment
+generates harder synthetic variants using 5 perturbation types:
+| Perturbation | What it tests |
+|---|---|
+| Custody escalation | Custody 2 months below threshold — forces careful statutory computation |
+| Co-accused conflict | Opposite bail outcome for co-accused — tests parity reasoning |
+| Section ambiguity | IPC ↔ BNSS section swap — tests schema drift adaptability |
+| Evidence reversal | Key witness retracted — tests flight risk reassessment |
+| Surety complexity | Non-resident surety — tests condition appropriateness |
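As an illustration of the first perturbation in the table, the sketch below places custody two months short of a release threshold. It is not the code in `server/case_generator.py`. The field names (`max_sentence_months`, `custody_months`, `synthetic`) are hypothetical, and the threshold of half the maximum sentence is an assumption about how the project applies BNSS 479.

```python
import copy


def custody_escalation(episode: dict, gap_months: int = 2) -> dict:
    """Return a copy of the episode with custody set gap_months below the threshold."""
    variant = copy.deepcopy(episode)  # never mutate the source episode
    # Assumed rule: the release threshold is half the maximum sentence, in months.
    threshold = episode["max_sentence_months"] // 2
    variant["custody_months"] = max(0, threshold - gap_months)
    # Hypothetical bookkeeping so synthetic cases stay distinguishable.
    variant["case_id"] = f"{episode['case_id']}-syn-custody"
    variant["synthetic"] = True
    return variant


base = {
    "case_id": "HC-001",
    "crime_type": "example",
    "max_sentence_months": 36,
    "custody_months": 20,
}
variant = custody_escalation(base)
print(variant["custody_months"])  # 16
```

A variant like this sits just on the wrong side of the threshold, so an agent that pattern-matches on "long custody" instead of computing the ratio would recommend release incorrectly.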
+**Live Demo — Self-Improvement in Action**
+```bash
+# Start the server
+python -m server.app
+# In another terminal — start adaptive training
+python training/train_grpo.py --adaptive --steps 50 --env_url http://localhost:8000
+```
+Monitor progress via:
+```
+GET /profile?session_id={id}
+GET /adaptive_status
+```
+Watch stage promotions in the training log.
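For polling these endpoints from a script, a small standard-library client could look like the sketch below. The base URL matches the `--env_url` value used above. The helper names are hypothetical, and the shape of the JSON returned by `/profile` is not shown in this commit, so the code returns it as an opaque dict.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # same value as --env_url above


def profile_url(session_id: str, base_url: str = BASE_URL) -> str:
    """Build the /profile URL with a safely encoded session_id."""
    return f"{base_url}/profile?{urlencode({'session_id': session_id})}"


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires the server to be running):
#   profile = fetch_json(profile_url("<session_id from /reset>"))
#   status = fetch_json(f"{BASE_URL}/adaptive_status")
```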
+---
 ## Reward Function
 ```
+R = 0.4 × outcome_match (gated by reasoning quality)
   + 0.2 × flight_risk_accuracy
   + 0.2 × statutory_accuracy
   + 0.2 × condition_appropriateness
+  + 0.1 × reasoning_quality (bonus)
+  + 0.05 × format_compliance (bonus)
   − 0.3 × bias_penalty
 ```
 | Component | Signal | Details |
 |---|---|---|
+| **Outcome Match** | 0.0 / 0.8 / 1.0 | Wrong, directional, or exact match vs HC decision; gated by `<think>` block |
 | **Flight Risk** | 0–1 | Ordinal distance to ground-truth risk level |
+| **Statutory** | 0–1 | IPC/BNSS threshold computation, direction-gated, NDPS Section 37 aware |
 | **Conditions** | 0–1 | Appropriate bail conditions for crime/risk profile |
+| **Reasoning Quality** | 0–1 | Anchoring + arithmetic + grounds specificity (10% bonus) |
+| **Format Compliance** | 0–1 | XML tag adherence to system prompt (5% bonus) |
 | **Bias Penalty** | −0.3 | Fired if parity argument ignored in bias-flagged cases |
 ### Anti-Reward-Hacking Design
+- 7 independent reward signals (harder to simultaneously game all)
 - `GenerationInspectionCallback` prints raw completions every 25 training steps
+- Reasoning gate: no `<think>` block → outcome reward zeroed in Stage 2+
+- Direction gate: wrong bail direction → statutory bonus capped
 - Bias penalty operates as a separate signal, not folded into outcome
 - Schema drift (Stage 4) tests adaptability, not pattern memorisation
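To make the weights and gates concrete, here is a minimal sketch of how the components could combine. It follows the formula and the two gates listed above but is not the code in `server/reward.py`. The function signature is hypothetical, and `statutory_cap` is an assumed value because the README does not say how far the statutory score is capped.

```python
def total_reward(
    outcome: float,
    flight_risk: float,
    statutory: float,
    conditions: float,
    reasoning: float,
    format_ok: float,
    bias_violation: bool,
    *,
    stage: int = 1,
    has_think_block: bool = True,
    direction_correct: bool = True,
    statutory_cap: float = 0.5,  # assumed cap; not specified in the README
) -> float:
    """Combine component scores (each in [0, 1]) into a single reward."""
    # Reasoning gate: from Stage 2 onward, no <think> block zeroes the outcome score.
    if stage >= 2 and not has_think_block:
        outcome = 0.0
    # Direction gate: a wrong bail direction limits the statutory score.
    if not direction_correct:
        statutory = min(statutory, statutory_cap)

    reward = (
        0.4 * outcome
        + 0.2 * flight_risk
        + 0.2 * statutory
        + 0.2 * conditions
        + 0.1 * reasoning
        + 0.05 * format_ok
    )
    if bias_violation:
        reward -= 0.3
    return reward


print(round(total_reward(1, 1, 1, 1, 1, 1, False), 2))  # 1.15
print(round(total_reward(1, 1, 1, 1, 1, 1, False, stage=2, has_think_block=False), 2))  # 0.75
```

The first result matches the upper bound of the `range` declared in `openenv.yaml`. The second shows why the reasoning gate matters: skipping the `<think>` block costs the full 0.4 outcome weight even when the recommendation itself is correct.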
 ## Training
+Uses **GRPO** (Group Relative Policy Optimization) via TRL + Unsloth on `Qwen2.5-7B-Instruct`.
+### Training Modes
+| Mode | Command | Description |
+|---|---|---|
+| Single stage | `python training/train_grpo.py --stage 1 --steps 200` | Train on one stage |
+| Curriculum | `python training/train_grpo.py --curriculum --steps 150` | Sequential 4-stage with trace harvesting |
+| **Adaptive** | `python training/train_grpo.py --adaptive --steps 50` | **Theme 4** — self-directed with auto-promotion |
+### Google Colab Training Walkthrough
+```python
+# ============================================================
+# STEP 1 — Install dependencies (run in first cell)
+# ============================================================
+!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+!pip install -q --no-deps trl peft accelerate bitsandbytes xformers
+!pip install -q openenv-core datasets
+# ============================================================
+# STEP 2 — Clone the repository
+# ============================================================
+!git clone https://github.com/Faiz-1606/Undertrial.git
+%cd Undertrial
+# ============================================================
+# STEP 3 — Verify episodes are available
+# ============================================================
+import os
+episodes_dir = "./data/episodes"
+if not os.path.exists(episodes_dir):
+    print("No episodes directory — will use built-in demo episodes")
+else:
+    for f in os.listdir(episodes_dir):
+        if f.endswith('.jsonl'):
+            # Use a context manager so each file handle is closed after counting
+            with open(os.path.join(episodes_dir, f)) as fh:
+                count = sum(1 for _ in fh)
+            print(f"  {f}: {count} episodes")
+# ============================================================
+# STEP 4 — Option A: Single-stage training (quick, ~20 min on T4)
+# ============================================================
+!python training/train_grpo.py \
+    --episodes_dir ./data/episodes \
+    --stage 1 \
+    --steps 200 \
+    --batch_size 4 \
+    --eval_after
+# ============================================================
+# STEP 4 — Option B: Curriculum training (full, ~90 min on T4)
+# ============================================================
+!python training/train_grpo.py \
+    --episodes_dir ./data/episodes \
+    --curriculum \
+    --steps 150 \
+    --batch_size 4
+# ============================================================
+# STEP 4 — Option C: Adaptive training (Theme 4, ~60 min on T4)
+# (Requires server running — start in a background cell first)
+# ============================================================
+# Background cell: start the server
+import subprocess
+# Log to a file: an unread PIPE can fill up and block the server process
+server_log = open("server.log", "w")
+server = subprocess.Popen(
+    ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"],
+    stdout=server_log, stderr=subprocess.STDOUT
+)
+import time; time.sleep(5)  # Wait for server startup
+# Then run adaptive training
+!python training/train_grpo.py \
+    --adaptive \
+    --episodes_dir ./data/episodes \
+    --steps 50 \
+    --batch_size 4 \
+    --env_url http://localhost:8000
+# ============================================================
+# STEP 5 — View results
+# ============================================================
+import json
+# For single/curriculum:
+with open("./output/undertrial_grpo/results.json") as fh:
+    results = json.load(fh)
+print(json.dumps(results, indent=2))
+# For adaptive:
+# results = json.load(open("./output/undertrial_grpo/results_adaptive.json"))
+# ============================================================
+# STEP 6 — (Optional) Merge LoRA adapters for inference
+# ============================================================
+from unsloth import FastLanguageModel
+model, tokenizer = FastLanguageModel.from_pretrained(
+    "./output/undertrial_grpo/final",
+    max_seq_length=3072,
+)
+model.save_pretrained_merged(
+    "./output/undertrial_merged",
+    tokenizer,
+    save_method="merged_16bit",
+)
+print("Merged model saved to ./output/undertrial_merged")
+```
 ### Training Architecture
         ↓
   GRPO updates model weights
         ↓
+  [Theme 4] PerformanceTracker updates EMA per domain/stage
+        ↓
+  [Theme 4] AdaptiveSelector targets weakest domain
+        ↓
+  [Theme 4] CaseGenerator creates harder synthetic variants
+        ↓
+  [Theme 4] Auto-promote when stage EMA exceeds threshold
 ```
 ---
 ```
 undertrial_ai/
 ├── server/
+│   ├── app.py                    # FastAPI routes + Theme 4 endpoints
+│   ├── undertrial_environment.py # Environment logic
+│   ├── reward.py                 # 7-component deterministic reward
+│   ├── dataset.py                # Curriculum-staged episode loader
+│   ├── schema_drift.py           # IPC → BNSS remapping (Stage 4)
+│   ├── performance_tracker.py    # [Theme 4] EMA-based performance profiling
+│   ├── adaptive_selector.py      # [Theme 4] Weakness-targeted episode selection
+│   └── case_generator.py         # [Theme 4] Synthetic case perturbation
 ├── training/
+│   ├── train_grpo.py             # GRPO training (single/curriculum/adaptive)
 │   └── UndertriAI_GRPO_Training.ipynb  # Colab notebook
 ├── data/
+│   └── episodes/                 # 1,200 HC judgments across 4 stages
 ├── demo/
+│   └── index.html                # Interactive demo UI
+├── client.py                     # UndertriAIEnv HTTP client
+├── models.py                     # Pydantic action/observation schemas
+├── openenv.yaml                  # OpenEnv manifest
+└── Dockerfile                    # HF Spaces deployment
 ```
 ---

openenv.yaml CHANGED

@@ -1,10 +1,12 @@
 name: undertrial-ai
-version: "1.0.0"
 description: >
-  OpenEnv-compliant RL training environment for Indian bail decision support.
-  An LLM agent reads High Court bail cases, invokes legal tools, and submits
-  structured bail recommendations. Reward computed deterministically against
-  real HC judgments with an explicit bias penalty (lambda=0.3).
 author: Draken1606
 license: MIT
@@ -19,6 +21,8 @@ tags:
   - world-modeling
   - bias-mitigation
   - bnss-2023
 environment:
   class: undertrial_ai.server.undertrial_environment.UndertriAIEnvironment
@@ -52,17 +56,18 @@ actions:
     description: "TERMINAL — Submit structured bail assessment memo"
 reward:
-  formula: "0.3*outcome + 0.2*flight_risk + 0.2*statutory + 0.2*conditions + 0.1*reasoning_quality + 0.1*efficiency + 0.05*process_bonus - 0.3*bias"
   range: [-0.7, 1.15]
   terminal_action: submit_memo
   deterministic: true
   llm_as_judge: false
   components:
-    - outcome_match: "Agreement with real High Court decision (30%)"
     - flight_risk_accuracy: "Flight risk classification accuracy (20%)"
-    - statutory_accuracy: "IPC/BNSS threshold computation (20%)"
     - condition_appropriateness: "Bail condition quality (20%)"
-    - reasoning_quality: "Justification anchoring + arithmetic verification + grounds specificity (10%)"
     - bias_penalty: "Penalty for ignoring parity in bias cases (-30%)"
 curriculum:
@@ -72,12 +77,70 @@ curriculum:
   stage_3: "Bias reversal / parity cases"
   stage_4: "Schema drift (IPC→BNSS, regional FIR formats)"
 training:
   method: GRPO
   framework: TRL + Unsloth
   model: unsloth/Qwen2.5-7B-Instruct
   notebook: training/UndertriAI_GRPO_Training.ipynb
   script: training/train_grpo.py
 deployment:
   platform: huggingface-spaces

 name: undertrial-ai
+version: "1.1.0"
 description: >
+  OpenEnv-compliant RL training environment for Indian bail decision support
+  with adaptive self-improvement (Theme 4). An LLM agent reads High Court bail
+  cases, invokes legal tools, and submits structured bail recommendations.
+  Reward computed deterministically against real HC judgments with an explicit
+  bias penalty (lambda=0.3). Features performance-aware episode selection,
+  stage-gated curriculum promotion, and synthetic case generation.
 author: Draken1606
 license: MIT
   - world-modeling
   - bias-mitigation
   - bnss-2023
+  - self-improvement
+  - adaptive-curriculum
 environment:
   class: undertrial_ai.server.undertrial_environment.UndertriAIEnvironment
     description: "TERMINAL — Submit structured bail assessment memo"
 reward:
+  formula: "0.4*outcome_gated + 0.2*flight_risk + 0.2*statutory + 0.2*conditions + 0.1*reasoning_quality + 0.05*format - 0.3*bias"
   range: [-0.7, 1.15]
   terminal_action: submit_memo
   deterministic: true
   llm_as_judge: false
   components:
+    - outcome_match: "Agreement with real High Court decision, gated by reasoning quality (40%)"
     - flight_risk_accuracy: "Flight risk classification accuracy (20%)"
+    - statutory_accuracy: "IPC/BNSS threshold computation with direction gate (20%)"
     - condition_appropriateness: "Bail condition quality (20%)"
+    - reasoning_quality: "Justification anchoring + arithmetic verification + grounds specificity (10% bonus)"
+    - format_compliance: "XML tag adherence matching system prompt structure (5% bonus)"
     - bias_penalty: "Penalty for ignoring parity in bias cases (-30%)"
 curriculum:
   stage_3: "Bias reversal / parity cases"
   stage_4: "Schema drift (IPC→BNSS, regional FIR formats)"
+self_improvement:
+  adaptive_curriculum:
+    description: >
+      Performance-gated stage promotion using exponential moving averages.
+      Agent auto-promotes when per-stage EMA exceeds threshold.
+    thresholds:
+      stage_1_to_2: {min_reward: 0.65, min_episodes: 20}
+      stage_2_to_3: {min_reward: 0.55, min_episodes: 50}
+      stage_3_to_4: {min_reward: 0.50, min_episodes: 20}
+  weakness_targeting:
+    description: >
+      Adaptive episode selection identifies the crime type with lowest EMA
+      reward and serves proportionally more cases from that domain.
+    strategy: "60% weakest domain / 30% failure replay / 10% exploration"
+  synthetic_generation:
+    description: >
+      When agent masters a domain (EMA > 0.70), generates harder synthetic
+      variants using 5 perturbation types.
+    perturbation_types:
+      - custody_escalation
+      - co_accused_conflict
+      - section_ambiguity
+      - evidence_reversal
+      - surety_complexity
+endpoints:
+  - path: /reset
+    method: POST
+    description: "Start a new episode. Supports adaptive=true and auto_stage=true for Theme 4."
+  - path: /step
+    method: POST
+    description: "Submit a tool call or final memo. Updates performance tracker when done."
+  - path: /state
+    method: GET
+    description: "Inspect current episode state."
+  - path: /health
+    method: GET
+    description: "Health check."
+  - path: /tools
+    method: GET
+    description: "List available tools."
+  - path: /profile
+    method: GET
+    description: "Get agent performance profile for a session (Theme 4)."
+  - path: /adaptive_status
+    method: GET
+    description: "Get global adaptive mode capabilities and thresholds."
+  - path: /ws/{session_id}
+    method: WS
+    description: "WebSocket real-time feed."
 training:
   method: GRPO
   framework: TRL + Unsloth
   model: unsloth/Qwen2.5-7B-Instruct
   notebook: training/UndertriAI_GRPO_Training.ipynb
   script: training/train_grpo.py
+  modes:
+    - name: single_stage
+      command: "python training/train_grpo.py --stage 1 --steps 200"
+    - name: curriculum
+      command: "python training/train_grpo.py --curriculum --steps 150"
+    - name: adaptive
+      command: "python training/train_grpo.py --adaptive --steps 50 --env_url http://localhost:8000"
 deployment:
   platform: huggingface-spaces

server/adaptive_selector.py ADDED

	@@ -0,0 +1,140 @@

+"""
+UndertriAI — Adaptive Episode Selector (Theme 4: Self-Improvement)
+Wraps the existing BailDataset to provide performance-aware episode
+selection when adaptive mode is enabled. Falls back to uniform random
+(identical to existing behavior) when adaptive=False.
+"""
+import random
+from typing import Any, Dict, List, Optional
+from .performance_tracker import PerformanceTracker
+class AdaptiveSelector:
+    """
+    Performance-aware episode selector.
+    Selection strategy (applied in order when adaptive=True):
+      60%: sample from the weakest crime-type domain in current_stage
+      30%: replay cases where recent performance was poor (reward < 0.40)
+      10%: uniform random from current_stage (exploration)
+    Always returns a valid episode dict. Never raises.
+    """
+    def __init__(self, dataset, tracker: PerformanceTracker):
+        """
+        Args:
+            dataset: BailDataset instance (has _episodes, sample_episode)
+            tracker: PerformanceTracker instance driving selection
+        """
+        self.dataset = dataset
+        self.tracker = tracker
+    # ------------------------------------------------------------------
+    # Public API
+    # ------------------------------------------------------------------
+    def select_episode(self, current_stage: int) -> Dict[str, Any]:
+        """
+        Performance-aware selection for adaptive mode.
+        60% weakest domain → 30% failure replay → 10% exploration.
+        Falls back to uniform on any failure.
+        """
+        try:
+            roll = random.random()
+            if roll < 0.60:
+                # Try weakest domain
+                ep = self._select_weakest_domain(current_stage)
+                if ep is not None:
+                    return ep
+            if roll < 0.90:
+                # Try failure replay
+                ep = self._select_failure_replay(current_stage)
+                if ep is not None:
+                    return ep
+            # 10% exploration or fallback
+            return self.select_episode_uniform(current_stage)
+        except Exception:
+            # Absolute fallback — never crash
+            return self.select_episode_uniform(current_stage)
+    def select_episode_uniform(self, current_stage: int) -> Dict[str, Any]:
+        """
+        Pure random selection from current_stage.
+        Identical to existing BailDataset.sample_episode() behavior.
+        """
+        return self.dataset.sample_episode(stage=current_stage)
+    # ------------------------------------------------------------------
+    # Internal strategies
+    # ------------------------------------------------------------------
+    def _select_weakest_domain(
+        self, current_stage: int
+    ) -> Optional[Dict[str, Any]]:
+        """
+        Select an episode from the weakest crime-type domain.
+        Returns None if no weak domain identified or no matching episodes.
+        """
+        weak_domain = self.tracker.weakest_domain()
+        if weak_domain is None:
+            return None
+        # Find episodes matching this crime type in the current stage
+        episodes = self._get_stage_episodes(current_stage)
+        matches = [
+            ep for ep in episodes
+            if str(ep.get("crime_type", "")).strip() == weak_domain
+        ]
+        if not matches:
+            return None
+        return random.choice(matches)
+    def _select_failure_replay(
+        self, current_stage: int
+    ) -> Optional[Dict[str, Any]]:
+        """
+        Replay a case where the agent recently scored below 0.40.
+        Returns None if no recent failures or no matching episodes.
+        """
+        failed_ids = self.tracker.get_recent_failures(threshold=0.40)
+        if not failed_ids:
+            return None
+        # Find episodes matching failed case_ids in current stage
+        episodes = self._get_stage_episodes(current_stage)
+        matches = [
+            ep for ep in episodes
+            if ep.get("case_id", "") in failed_ids
+        ]
+        if not matches:
+            return None
+        return random.choice(matches)
+    def _get_stage_episodes(self, stage: int) -> List[Dict[str, Any]]:
+        """Get all episodes for a given stage from the dataset."""
+        try:
+            eps = self.dataset._episodes.get(stage, [])
+            if eps:
+                return eps
+            # Fallback chain matching BailDataset.sample_episode
+            for candidate in [stage - 1, stage + 1, 1, 2, 3, 4]:
+                if 1 <= candidate <= 4:
+                    eps = self.dataset._episodes.get(candidate, [])
+                    if eps:
+                        return eps
+        except Exception:
+            pass
+        return []
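Review note on the selector above: because a failed weakest-domain lookup falls through to failure replay and then to uniform sampling, the 60/30/10 split only holds when both targeted strategies can return an episode. The sketch below mirrors the branch order of `select_episode` with boolean stand-ins for "the strategy returned an episode". The `pick_strategy` and `simulate` helpers are hypothetical and exist only to show the effective shares.

```python
import random
from collections import Counter


def pick_strategy(roll: float, weakest_available: bool, replay_available: bool) -> str:
    """Mirror the branch order of AdaptiveSelector.select_episode."""
    if roll < 0.60 and weakest_available:
        return "weakest_domain"
    if roll < 0.90 and replay_available:
        return "failure_replay"
    return "uniform"


def simulate(n: int, weakest_available: bool, replay_available: bool, seed: int = 0) -> dict:
    """Estimate the share of episodes served by each strategy."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    counts = Counter(
        pick_strategy(rng.random(), weakest_available, replay_available)
        for _ in range(n)
    )
    return {k: counts[k] / n for k in ("weakest_domain", "failure_replay", "uniform")}


# Both strategies available: shares approach 60/30/10.
print(simulate(100_000, weakest_available=True, replay_available=True))
# No weakest domain yet (e.g. a fresh tracker): replay absorbs its share, roughly 90/10.
print(simulate(100_000, weakest_available=False, replay_available=True))
```

If the fall-through is intended, the docstring could say so explicitly. If strict 60/30/10 proportions are wanted, the 60% branch would need to fall back to uniform sampling directly instead of continuing into the replay branch.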

server/app.py CHANGED

@@ -5,16 +5,40 @@ Wraps UndertriAIEnvironment as an OpenEnv-compatible HTTP + WebSocket server.
 import os
 from pathlib import Path
 from fastapi import FastAPI, WebSocket, WebSocketDisconnect
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import JSONResponse, HTMLResponse
 import json
 import uuid
 from .undertrial_environment import UndertriAIEnvironment
-# Session store: episode_id → environment instance
 _sessions: dict = {}
 app = FastAPI(
@@ -33,14 +57,16 @@ app.add_middleware(
 EPISODES_DIR = os.environ.get("UNDERTRIAL_EPISODES_DIR", None)
-def get_or_create_env(session_id: str) -> UndertriAIEnvironment:
     if session_id not in _sessions:
-        _sessions[session_id] = UndertriAIEnvironment(episodes_dir=EPISODES_DIR)
     return _sessions[session_id]
 # ------------------------------------------------------------------
-# REST endpoints
 # ------------------------------------------------------------------
 @app.get("/", response_class=HTMLResponse)
@@ -69,12 +95,46 @@ def health():
 @app.post("/reset")
-def reset(stage: int = 1, session_id: str = None, seed: int = None, episode_id: str = None):
     if session_id is None:
         session_id = str(uuid.uuid4())
-    env = get_or_create_env(session_id)
-    env.set_stage(stage)
-    obs = env.reset(stage=stage, seed=seed, episode_id=episode_id)
     return {
         "session_id": session_id,
         "observation": obs.model_dump(),
@@ -91,7 +151,8 @@ def step(payload: dict):
     if not session_id or session_id not in _sessions:
         return JSONResponse(status_code=400, content={"error": "Invalid session_id. Call /reset first."})
-    env = _sessions[session_id]
     # Deserialize action by tool_name
     tool_name = action_data.get("tool_name", "")
@@ -126,7 +187,35 @@ def step(payload: dict):
     except Exception as e:
         return JSONResponse(status_code=422, content={"error": str(e)})
     result = env.step(action)
     return {
         "session_id": session_id,
         "observation": result.observation.model_dump(),
@@ -140,7 +229,7 @@ def step(payload: dict):
 def state(session_id: str):
     if session_id not in _sessions:
         return JSONResponse(status_code=400, content={"error": "Invalid session_id."})
-    return _sessions[session_id].state
 @app.get("/observation")
@@ -148,7 +237,7 @@ def observation(session_id: str):
     """OpenEnv spec alias for /state — returns current episode observation."""
     if session_id not in _sessions:
         return JSONResponse(status_code=400, content={"error": "Invalid session_id."})
-    return _sessions[session_id].state
 @app.get("/tools")
@@ -171,6 +260,48 @@ def list_tools():
     }
 # ------------------------------------------------------------------
 # WebSocket endpoint (OpenEnv standard)
 # ------------------------------------------------------------------
@@ -178,7 +309,8 @@ def list_tools():
 @app.websocket("/ws/{session_id}")
 async def websocket_endpoint(websocket: WebSocket, session_id: str):
     await websocket.accept()
-    env = get_or_create_env(session_id)
     try:
         while True:
             data = await websocket.receive_text()

 import os
 from pathlib import Path
+from dataclasses import dataclass, field
 from fastapi import FastAPI, WebSocket, WebSocketDisconnect
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import JSONResponse, HTMLResponse
 import json
 import uuid
+from typing import List, Optional
 from .undertrial_environment import UndertriAIEnvironment
+from .performance_tracker import PerformanceTracker
+from .adaptive_selector import AdaptiveSelector
+from .case_generator import generate_variants
+# ------------------------------------------------------------------
+# Session state
+# ------------------------------------------------------------------
+@dataclass
+class SessionState:
+    """Per-session state wrapping the environment + Theme 4 components."""
+    env: UndertriAIEnvironment
+    tracker: PerformanceTracker = field(default_factory=PerformanceTracker)
+    adaptive: bool = False
+    selector: Optional[AdaptiveSelector] = None
+    tools_used: List[str] = field(default_factory=list)
+    synthetic_cases_generated: int = 0
+    def __post_init__(self):
+        if self.selector is None:
+            self.selector = AdaptiveSelector(self.env.dataset, self.tracker)
+# Session store: session_id → SessionState
 _sessions: dict = {}
 app = FastAPI(
 EPISODES_DIR = os.environ.get("UNDERTRIAL_EPISODES_DIR", None)
+def get_or_create_session(session_id: str) -> SessionState:
+    """Get existing session or create new one with all Theme 4 components."""
     if session_id not in _sessions:
+        env = UndertriAIEnvironment(episodes_dir=EPISODES_DIR)
+        _sessions[session_id] = SessionState(env=env)
     return _sessions[session_id]
 # ------------------------------------------------------------------
+# REST endpoints (existing — preserved exactly)
 # ------------------------------------------------------------------
 @app.get("/", response_class=HTMLResponse)
 @app.post("/reset")
+def reset(
+    stage: int = 1,
+    session_id: Optional[str] = None,
+    seed: Optional[int] = None,
+    episode_id: Optional[str] = None,
+    adaptive: bool = False,
+    auto_stage: bool = False,
+):
     if session_id is None:
         session_id = str(uuid.uuid4())
+    session = get_or_create_session(session_id)
+    env = session.env
+    session.adaptive = adaptive
+    session.tools_used = []  # Reset tools tracking
+    # Auto-stage: use tracker's suggestion
+    effective_stage = stage
+    if auto_stage:
+        effective_stage = session.tracker.suggest_next_stage()
+    env.set_stage(effective_stage)
+    # Adaptive episode selection
+    if adaptive and episode_id is None and seed is None:
+        # Use adaptive selector instead of uniform random
+        selected_ep = session.selector.select_episode(effective_stage)
+        # Inject the selected episode directly into the environment
+        env._episode = selected_ep
+        env._episode_id = str(uuid.uuid4())
+        env._step_count = 0
+        env._flags = []
+        env._retrieved_precedents = []
+        env._action_history = []
+        env._statutory_tool_called = False
+        env._tools_called = set()
+        obs = env._make_observation(action_result=None)
+    else:
+        obs = env.reset(stage=effective_stage, seed=seed, episode_id=episode_id)
     return {
         "session_id": session_id,
         "observation": obs.model_dump(),
     if not session_id or session_id not in _sessions:
         return JSONResponse(status_code=400, content={"error": "Invalid session_id. Call /reset first."})
+    session = _sessions[session_id]
+    env = session.env
     # Deserialize action by tool_name
     tool_name = action_data.get("tool_name", "")
     except Exception as e:
         return JSONResponse(status_code=422, content={"error": str(e)})
+    # Track tool usage for this session
+    if tool_name != "submit_memo":
+        session.tools_used.append(tool_name)
     result = env.step(action)
+    # Theme 4: Update tracker after terminal action (reward available)
+    if result.done and hasattr(result, "info") and isinstance(result.info, dict):
+        reward_components = result.info
+        episode = env._episode or {}
+        session.tracker.update(
+            episode=episode,
+            reward_components=reward_components,
+            tools_used=list(session.tools_used),
+        )
+        # Generate synthetic cases if agent mastered this domain
+        if session.adaptive:
+            crime_type = episode.get("crime_type", "")
+            if crime_type and session.tracker.should_generate_synthetic(crime_type):
+                variants = generate_variants(episode, n=3)
+                if variants:
+                    # Inject synthetic cases into the dataset
+                    stage = episode.get("curriculum_stage", 1)
+                    for v in variants:
+                        v["curriculum_stage"] = stage
+                        env.dataset._episodes.setdefault(stage, []).append(v)
+                    session.synthetic_cases_generated += len(variants)
     return {
         "session_id": session_id,
         "observation": result.observation.model_dump(),
 def state(session_id: str):
     if session_id not in _sessions:
         return JSONResponse(status_code=400, content={"error": "Invalid session_id."})
+    return _sessions[session_id].env.state
 @app.get("/observation")
     """OpenEnv spec alias for /state — returns current episode observation."""
     if session_id not in _sessions:
         return JSONResponse(status_code=400, content={"error": "Invalid session_id."})
+    return _sessions[session_id].env.state
 @app.get("/tools")
     }
+# ------------------------------------------------------------------
+# Theme 4: New API endpoints (additive — do not replace existing)
+# ------------------------------------------------------------------
+@app.get("/profile")
+def get_profile(session_id: str):
+    """Returns the current PerformanceTracker profile for the session."""
+    if session_id not in _sessions:
+        return JSONResponse(
+            status_code=404,
+            content={"error": f"Session '{session_id}' not found. Call /reset first."},
+        )
+    session = _sessions[session_id]
+    return {
+        "session_id": session_id,
+        "profile": session.tracker.get_profile(),
+        "adaptive_mode": session.adaptive,
+        "synthetic_cases_generated": session.synthetic_cases_generated,
+    }
+@app.get("/adaptive_status")
+def adaptive_status():
+    """Returns global adaptive mode capabilities (not session-specific)."""
+    return {
+        "adaptive_available": True,
+        "description": "Performance-aware episode selection and synthetic case generation",
+        "promotion_thresholds": {
+            "stage_1_to_2": {"min_reward": 0.65, "min_episodes": 20},
+            "stage_2_to_3": {"min_reward": 0.55, "min_episodes": 50},
+            "stage_3_to_4": {"min_reward": 0.50, "min_episodes": 20},
+        },
+        "perturbation_types": [
+            "custody_escalation",
+            "co_accused_conflict",
+            "section_ambiguity",
+            "evidence_reversal",
+            "surety_complexity",
+        ],
+    }
 # ------------------------------------------------------------------
 # WebSocket endpoint (OpenEnv standard)
 # ------------------------------------------------------------------
 @app.websocket("/ws/{session_id}")
 async def websocket_endpoint(websocket: WebSocket, session_id: str):
     await websocket.accept()
+    session = get_or_create_session(session_id)
+    env = session.env
     try:
         while True:
             data = await websocket.receive_text()

server/case_generator.py ADDED Viewed

	@@ -0,0 +1,312 @@

+"""
+UndertriAI — Synthetic Case Generator (Theme 4: Self-Improvement)
+When the agent masters a domain, this generates harder synthetic variants
+of existing cases. All generation is deterministic string manipulation —
+no LLM calls.
+5 perturbation types:
+  1. custody_escalation  — custody just below statutory threshold
+  2. co_accused_conflict — co-accused with opposite bail outcome
+  3. section_ambiguity   — IPC ↔ BNSS section swap
+  4. evidence_reversal   — retracted witness / unreliable evidence
+  5. surety_complexity   — non-resident surety complication
+"""
+import copy
+import re
+from typing import Any, Dict, List, Optional
+# IPC → BNSS mapping (subset used by the environment)
+IPC_TO_BNSS = {
+    "302": "103", "307": "109", "376": "64",  "304B": "80",  "395": "310",
+    "392": "309", "420": "318", "498A": "85", "406":  "316", "465": "336",
+    "323": "115", "354": "74",  "120B": "61", "506":  "351", "121": "147",
+    "379": "303", "324": "117", "354A": "75",
+}
+BNSS_TO_IPC = {v: k for k, v in IPC_TO_BNSS.items()}
+# ── Required fields for schema validation ────────────────────────────
+REQUIRED_FIELDS = {
+    "case_id": str,
+    "crime_type": str,
+    "ipc_sections": list,
+    "custody_months": (int, float),
+    "charge_sheet": str,
+    "ground_truth": dict,
+    "curriculum_stage": (int, float),
+}
+def is_schema_valid(episode: Dict[str, Any]) -> bool:
+    """
+    Check that all required fields are present and correct types.
+    Returns True/False — used to filter out malformed synthetic cases.
+    """
+    for field, expected_type in REQUIRED_FIELDS.items():
+        if field not in episode:
+            return False
+        if not isinstance(episode[field], expected_type):
+            return False
+    # ground_truth must have 'outcome'
+    gt = episode.get("ground_truth", {})
+    if "outcome" not in gt:
+        return False
+    return True
+def generate_variants(
+    source_episode: Dict[str, Any],
+    n: int = 5,
+) -> List[Dict[str, Any]]:
+    """
+    Generate up to n synthetic harder variants of a real episode.
+    Each variant applies exactly ONE perturbation.
+    Returns only valid variants (may be fewer than n if some
+    perturbations can't be applied cleanly).
+    """
+    if not is_schema_valid(source_episode):
+        return []
+    perturbations = [
+        _custody_escalation,
+        _co_accused_conflict,
+        _section_ambiguity,
+        _evidence_reversal,
+        _surety_complexity,
+    ]
+    variants = []
+    for perturb_fn in perturbations[:n]:
+        try:
+            variant = perturb_fn(source_episode)
+            if variant is not None and is_schema_valid(variant):
+                variants.append(variant)
+        except Exception:
+            # Skip perturbation on any error
+            continue
+    return variants
+# ── Perturbation 1: Custody Escalation ───────────────────────────────
+def _custody_escalation(episode: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    """
+    Set custody_months to 2 months below the statutory threshold (floored at 1 month).
+    Forces careful computation — case is NOT yet eligible for default bail.
+    """
+    ep = copy.deepcopy(episode)
+    max_sent = ep.get("max_sentence_years", 5.0)
+    # Threshold is 50% of max sentence in months
+    threshold_months = (max_sent * 12) / 2
+    new_custody = max(1.0, threshold_months - 2.0)
+    old_custody = ep.get("custody_months", 0)
+    ep["custody_months"] = round(new_custody, 1)
+    # Update charge sheet text if it mentions custody duration
+    charge = ep.get("charge_sheet", "")
+    if str(int(old_custody)) in charge:
+        charge = charge.replace(
+            f"{int(old_custody)} months",
+            f"{int(new_custody)} months",
+        )
+    ep["charge_sheet"] = charge
+    # Metadata
+    parent_id = ep.get("case_id", "UNKNOWN")
+    ep["case_id"] = f"SYN_{parent_id}_CUST"
+    ep["source"] = "synthetic"
+    ep["parent_case_id"] = parent_id
+    ep["perturbation_type"] = "custody_escalation"
+    ep["difficulty"] = "hard"
+    return ep
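
The threshold arithmetic in `_custody_escalation` can be checked in isolation. The sketch below is a standalone illustration rather than part of this commit; `escalated_custody_months` and its `margin_months` parameter are hypothetical names, and the only logic assumed is the half-of-maximum-sentence rule and the one-month floor visible in the diff.

```python
def escalated_custody_months(max_sentence_years: float,
                             margin_months: float = 2.0) -> float:
    """Hypothetical helper mirroring the arithmetic in _custody_escalation."""
    # Default-bail threshold: half of the maximum sentence, expressed in months.
    threshold_months = (max_sentence_years * 12) / 2
    # Place custody just below the threshold, but never below one month.
    return round(max(1.0, threshold_months - margin_months), 1)


if __name__ == "__main__":
    for years in (7.0, 3.0, 0.25, 0.1):
        threshold = (years * 12) / 2
        custody = escalated_custody_months(years)
        # Last column: would this custody already meet the threshold?
        print(years, threshold, custody, custody >= threshold)
```

One consequence of the floor: when the maximum sentence is about two months or less, the threshold is at most one month, so the floored custody value meets or exceeds it and the variant is no longer a "not yet eligible" case.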
+# ── Perturbation 2: Co-Accused Conflict ──────────────────────────────
+def _co_accused_conflict(episode: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    """
+    Add a co-accused with the OPPOSITE bail outcome.
+    Forces the agent to make a parity argument.
+    """
+    ep = copy.deepcopy(episode)
+    gt = ep.get("ground_truth", {})
+    gt_outcome = gt.get("outcome", "Bail Granted")
+    # Opposite outcome
+    if "grant" in gt_outcome.lower():
+        co_outcome = "Bail Denied"
+    else:
+        co_outcome = "Bail Granted"
+    ep["co_accused"] = [{
+        "name": "Co-Accused A",
+        "bail_outcome": co_outcome,
+        "sections": ep.get("ipc_sections", []),
+    }]
+    gt["parity_argument_used"] = True
+    ep["ground_truth"] = gt
+    # Add parity context to defence arguments
+    defence = ep.get("defence_arguments", [])
+    defence.append(
+        f"Co-accused was {'granted' if 'grant' in co_outcome.lower() else 'denied'} "
+        f"bail under identical charges — parity principle applies."
+    )
+    ep["defence_arguments"] = defence
+    # Metadata
+    parent_id = ep.get("case_id", "UNKNOWN")
+    ep["case_id"] = f"SYN_{parent_id}_COAC"
+    ep["source"] = "synthetic"
+    ep["parent_case_id"] = parent_id
+    ep["perturbation_type"] = "co_accused_conflict"
+    ep["difficulty"] = "hard"
+    return ep
+# ── Perturbation 3: Section Ambiguity (IPC ↔ BNSS) ──────────────────
+def _section_ambiguity(episode: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    """
+    Swap IPC sections to BNSS equivalents (or vice versa).
+    Tests schema drift adaptability.
+    """
+    ep = copy.deepcopy(episode)
+    sections = ep.get("ipc_sections", [])
+    if not sections:
+        return None
+    new_sections = []
+    swapped = False
+    for sec in sections:
+        sec_clean = sec.strip()
+        if sec_clean in IPC_TO_BNSS:
+            new_sections.append(IPC_TO_BNSS[sec_clean])
+            swapped = True
+        elif sec_clean in BNSS_TO_IPC:
+            new_sections.append(BNSS_TO_IPC[sec_clean])
+            swapped = True
+        else:
+            new_sections.append(sec_clean)
+    if not swapped:
+        return None
+    ep["ipc_sections"] = new_sections
+    # Update charge sheet references
+    charge = ep.get("charge_sheet", "")
+    for old_sec, new_sec in zip(sections, new_sections):
+        if old_sec != new_sec:
+            charge = charge.replace(f"Section {old_sec}", f"Section {new_sec}")
+            charge = charge.replace(f"section {old_sec}", f"section {new_sec}")
+    ep["charge_sheet"] = charge
+    # Metadata
+    parent_id = ep.get("case_id", "UNKNOWN")
+    ep["case_id"] = f"SYN_{parent_id}_SECT"
+    ep["source"] = "synthetic"
+    ep["parent_case_id"] = parent_id
+    ep["perturbation_type"] = "section_ambiguity"
+    ep["difficulty"] = "hard"
+    return ep
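
The section swap can be exercised without a full episode. This is a minimal sketch that assumes only three pairs copied from `IPC_TO_BNSS` above; `swap_sections` is a hypothetical helper, not a function in the commit.

```python
from typing import Dict, List, Tuple

# Three pairs copied from IPC_TO_BNSS in the diff; the remaining pairs are omitted.
IPC_TO_BNSS: Dict[str, str] = {"302": "103", "420": "318", "498A": "85"}
BNSS_TO_IPC: Dict[str, str] = {v: k for k, v in IPC_TO_BNSS.items()}


def swap_sections(sections: List[str]) -> Tuple[List[str], bool]:
    """Swap each known section to its counterpart; report whether any changed."""
    swapped = False
    out: List[str] = []
    for sec in sections:
        key = sec.strip()
        if key in IPC_TO_BNSS:
            out.append(IPC_TO_BNSS[key])
            swapped = True
        elif key in BNSS_TO_IPC:
            out.append(BNSS_TO_IPC[key])
            swapped = True
        else:
            # Unknown sections pass through unchanged.
            out.append(key)
    return out, swapped


if __name__ == "__main__":
    once, changed = swap_sections(["302", " 420 ", "34"])
    print(once, changed)
    print(swap_sections(once)[0])
```

Applying the swap twice returns the original list here because no section number in this subset appears as both an IPC key and a BNSS value. A list with no known sections reports `swapped = False`, which is the case where `_section_ambiguity` returns `None`.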
+# ── Perturbation 4: Evidence Reversal ────────────────────────────────
+def _evidence_reversal(episode: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    """
+    Append a defence argument that undermines the prosecution evidence.
+    Tests whether the agent updates assessment when evidence weakens.
+    """
+    ep = copy.deepcopy(episode)
+    # Find the first evidence mention (used only to confirm there is evidence to reverse)
+    evidence_keywords = ["witness", "evidence", "testimony", "eyewitness"]
+    pros_args = ep.get("prosecution_arguments", [])
+    charge = ep.get("charge_sheet", "")
+    # Check prosecution arguments first
+    target_arg = None
+    for arg in pros_args:
+        if any(kw in arg.lower() for kw in evidence_keywords):
+            target_arg = arg
+            break
+    if target_arg is None:
+        # Check charge sheet sentences
+        sentences = [s.strip() for s in charge.split('.') if s.strip()]
+        for sent in sentences:
+            if any(kw in sent.lower() for kw in evidence_keywords):
+                target_arg = sent
+                break
+    if target_arg is None:
+        return None  # No evidence to reverse
+    # Add reversal to defence arguments
+    defence = ep.get("defence_arguments", [])
+    defence.append(
+        "However, the key prosecution evidence was subsequently found "
+        "unreliable — the primary witness retracted their statement and "
+        "forensic analysis raised doubts about the physical evidence."
+    )
+    ep["defence_arguments"] = defence
+    # Metadata
+    parent_id = ep.get("case_id", "UNKNOWN")
+    ep["case_id"] = f"SYN_{parent_id}_EVID"
+    ep["source"] = "synthetic"
+    ep["parent_case_id"] = parent_id
+    ep["perturbation_type"] = "evidence_reversal"
+    ep["difficulty"] = "hard"
+    return ep
+# ── Perturbation 5: Surety Complexity ────────────────────────────────
+def _surety_complexity(episode: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    """
+    Add a surety complication forcing careful condition assessment.
+    """
+    ep = copy.deepcopy(episode)
+    # Add surety complication to defence arguments
+    defence = ep.get("defence_arguments", [])
+    defence.append(
+        "Proposed surety is a non-resident relative with no verifiable "
+        "local assets or employment in the jurisdiction. Surety bond "
+        "amount of Rs. 5,00,000 proposed."
+    )
+    ep["defence_arguments"] = defence
+    # Add surety info to accused profile
+    profile = ep.get("accused_profile", {})
+    profile["surety_status"] = "non-resident, unverified assets"
+    ep["accused_profile"] = profile
+    # Metadata
+    parent_id = ep.get("case_id", "UNKNOWN")
+    ep["case_id"] = f"SYN_{parent_id}_SURE"
+    ep["source"] = "synthetic"
+    ep["parent_case_id"] = parent_id
+    ep["perturbation_type"] = "surety_complexity"
+    ep["difficulty"] = "hard"
+    return ep

server/performance_tracker.py ADDED Viewed

	@@ -0,0 +1,251 @@

+"""
+UndertriAI — Performance Tracker (Theme 4: Self-Improvement)
+Tracks the agent's running performance profile across dimensions
+and uses it to drive adaptive curriculum decisions.
+Pure Python — no server/training/FastAPI dependencies.
+"""
+import warnings
+from collections import deque
+from typing import Any, Dict, List, Optional
+class ExponentialMean:
+    """Exponential moving average with configurable decay."""
+    __slots__ = ("alpha", "value", "count")
+    def __init__(self, alpha: float = 0.1, initial: float = 0.5):
+        self.alpha = alpha
+        self.value = initial
+        self.count = 0
+    def update(self, x: float) -> None:
+        self.value = self.alpha * x + (1 - self.alpha) * self.value
+        self.count += 1
+    def get(self) -> float:
+        return self.value
+class PerformanceTracker:
+    """
+    Tracks agent performance across crime types, stages, and reward
+    components. Drives adaptive episode selection and stage promotion.
+    Not synchronized: intended for use by one session at a time (no locks).
+    All public methods handle missing/malformed input gracefully.
+    """
+    def __init__(self, alpha: float = 0.1):
+        self._alpha = alpha
+        # Per-crime-type EMA of total reward
+        self.per_crime_type: Dict[str, ExponentialMean] = {}
+        # Per-stage EMA of total reward
+        self.per_stage: Dict[int, ExponentialMean] = {
+            s: ExponentialMean(alpha=alpha) for s in range(1, 5)
+        }
+        # Last 50 total rewards (for stage promotion smoothing)
+        self.recent_rewards: deque = deque(maxlen=50)
+        # Bias fire rate: 1.0 when penalty fired, 0.0 when not
+        self.bias_fire_rate: ExponentialMean = ExponentialMean(alpha=alpha)
+        # Tool usage counts (cumulative per session)
+        self.tool_usage: Dict[str, int] = {}
+        # Episode counters
+        self.episodes_seen: int = 0
+        self.stage_episodes: Dict[int, int] = {1: 0, 2: 0, 3: 0, 4: 0}
+        # Recent case performance for failure-replay
+        self._recent_case_rewards: deque = deque(maxlen=30)
+    # ------------------------------------------------------------------
+    # Core update
+    # ------------------------------------------------------------------
+    def update(
+        self,
+        episode: Dict[str, Any],
+        reward_components: Dict[str, Any],
+        tools_used: Optional[List[str]] = None,
+    ) -> None:
+        """
+        Update all internal state from a completed episode.
+        Handles missing keys gracefully — never raises on malformed input.
+        """
+        try:
+            total = float(reward_components.get("total_reward",
+                          reward_components.get("total", 0.0)))
+        except (TypeError, ValueError):
+            total = 0.0
+        # Update recent rewards
+        self.recent_rewards.append(total)
+        self.episodes_seen += 1
+        # Per-crime-type tracking
+        crime_type = ""
+        try:
+            crime_type = str(episode.get("crime_type", "")).strip()
+        except Exception:
+            pass
+        if crime_type:
+            if crime_type not in self.per_crime_type:
+                self.per_crime_type[crime_type] = ExponentialMean(
+                    alpha=self._alpha
+                )
+            self.per_crime_type[crime_type].update(total)
+        # Per-stage tracking
+        stage = 1
+        try:
+            stage = int(episode.get("curriculum_stage", 1))
+        except (TypeError, ValueError):
+            stage = 1
+        if 1 <= stage <= 4:
+            self.per_stage[stage].update(total)
+            self.stage_episodes[stage] = self.stage_episodes.get(stage, 0) + 1
+        # Bias fire rate
+        try:
+            bias_val = float(reward_components.get("bias_penalty", 0.0))
+            self.bias_fire_rate.update(1.0 if bias_val > 0.01 else 0.0)
+        except (TypeError, ValueError):
+            pass
+        # Tool usage
+        if tools_used:
+            for tool in tools_used:
+                t = str(tool)
+                self.tool_usage[t] = self.tool_usage.get(t, 0) + 1
+        # Track case_id → reward for failure-replay
+        case_id = ""
+        try:
+            case_id = str(episode.get("case_id", ""))
+        except Exception:
+            pass
+        if case_id:
+            self._recent_case_rewards.append((case_id, total, stage))
+    # ------------------------------------------------------------------
+    # Queries
+    # ------------------------------------------------------------------
+    def weakest_domain(self) -> Optional[str]:
+        """
+        Returns the crime_type with the lowest EMA reward.
+        Returns None if fewer than 5 episodes seen total or no crime type
+        has at least 3 observations.
+        """
+        if self.episodes_seen < 5:
+            return None
+        candidates = [
+            (ct, ema.get())
+            for ct, ema in self.per_crime_type.items()
+            if ema.count >= 3
+        ]
+        if not candidates:
+            return None
+        return min(candidates, key=lambda x: x[1])[0]
+    def suggest_next_stage(self) -> int:
+        """
+        Returns the recommended stage (1-4) based on readiness thresholds.
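+        Returns the highest stage whose promotion chain is currently satisfied.
+        The result is recomputed from the current EMAs on each call and is not
+        latched, so it can drop if an earlier stage's EMA later falls below
+        its threshold.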
+        """
+        current = 1
+        # Stage 1 → 2: EMA >= 0.65 AND at least 20 episodes
+        if (self.per_stage[1].get() >= 0.65
+                and self.stage_episodes.get(1, 0) >= 20):
+            current = 2
+        # Stage 2 → 3: EMA >= 0.55 AND at least 50 episodes
+        if (current >= 2
+                and self.per_stage[2].get() >= 0.55
+                and self.stage_episodes.get(2, 0) >= 50):
+            current = 3
+        # Stage 3 → 4: EMA >= 0.50 AND at least 20 episodes
+        if (current >= 3
+                and self.per_stage[3].get() >= 0.50
+                and self.stage_episodes.get(3, 0) >= 20):
+            current = 4
+        return current
+    def should_generate_synthetic(self, crime_type: str) -> bool:
+        """
+        Returns True if the agent has mastered this crime type domain
+        (EMA > 0.70 with at least 10 observations).
+        """
+        ema = self.per_crime_type.get(crime_type)
+        if ema is None:
+            return False
+        return ema.get() > 0.70 and ema.count >= 10
+    def get_recent_failures(self, threshold: float = 0.40) -> List[str]:
+        """
+        Returns case_ids from recent episodes where reward was below threshold.
+        Used by AdaptiveSelector for failure-replay.
+        """
+        return [
+            case_id
+            for case_id, reward, _ in self._recent_case_rewards
+            if reward < threshold
+        ]
+    # ------------------------------------------------------------------
+    # Serialization
+    # ------------------------------------------------------------------
+    def get_profile(self) -> Dict[str, Any]:
+        """
+        Returns a fully JSON-serializable profile dict.
+        No class instances — all values are primitive types.
+        """
+        recent = list(self.recent_rewards)
+        recent_mean = sum(recent) / len(recent) if recent else 0.0
+        return {
+            "per_crime_type": {
+                ct: round(ema.get(), 4)
+                for ct, ema in self.per_crime_type.items()
+            },
+            "per_stage": {
+                str(s): round(ema.get(), 4)
+                for s, ema in self.per_stage.items()
+            },
+            "bias_fire_rate": round(self.bias_fire_rate.get(), 4),
+            "tool_usage": dict(self.tool_usage),
+            "episodes_seen": self.episodes_seen,
+            "stage_episodes": dict(self.stage_episodes),
+            "weakest_domain": self.weakest_domain(),
+            "suggested_stage": self.suggest_next_stage(),
+            "recent_mean_reward": round(recent_mean, 4),
+        }
+    # ------------------------------------------------------------------
+    # Session management
+    # ------------------------------------------------------------------
+    def reset_session(self) -> None:
+        """
+        Clears transient session state but preserves accumulated
+        per-crime-type and per-stage learning.
+        """
+        self.recent_rewards.clear()
+        self.tool_usage.clear()
+        self._recent_case_rewards.clear()

server/reward.py CHANGED Viewed

@@ -11,6 +11,27 @@ import re
 from typing import Any, Dict, List, Optional
 # ---------------------------------------------------------------------------
 # 1. Outcome Match  (40%)
 # ---------------------------------------------------------------------------
@@ -170,6 +191,23 @@ def compute_statutory_accuracy(
     if not special_laws and any(t in crime_type_lower for t in CRIME_TYPE_SPECIAL_LAWS):
         special_laws = "INFERRED"  # Treat as special-law-restricted for eligibility
     # Compute ground-truth eligibility for cases with known custody duration
     half_sent_months = (max_sent * 12) / 2
     truly_eligible   = (custody_mo >= half_sent_months) and not special_laws
@@ -463,6 +501,72 @@ def compute_reasoning_quality(
     return round(max(0.0, min(1.0, base - consistency_deduction)), 4)
 # ---------------------------------------------------------------------------
 # Master reward function
 # ---------------------------------------------------------------------------
@@ -480,22 +584,28 @@ def compute_reward(
     agent_flight_risk_justification: str = "",
     agent_grounds_for: Optional[List[str]] = None,
     agent_grounds_against: Optional[List[str]] = None,
 ) -> Dict[str, float]:
     """
     Computes the full reward for a submitted bail assessment memo.
-    Formula:
-        R = 0.3*outcome_match                (was 0.4 — reduced to reward reasoning)
           + 0.2*flight_risk_accuracy
           + 0.2*statutory_accuracy
           + 0.2*condition_appropriateness
-          + 0.1*reasoning_quality             (NEW — anchoring + arithmetic + specificity)
-          + 0.1*efficiency_bonus              (only when outcome is correct)
-          + 0.05*process_bonus
           - 0.3*bias_penalty
     Returns a dict with all component scores + total_reward.
-    Range: approx [-0.4, 1.1].
     """
     gt = episode["ground_truth"]
@@ -515,6 +625,18 @@ def compute_reward(
         episode                   = episode,
     )
     # Efficiency bonus: reward finishing faster when the answer is correct.
     # Only fires on directionally-correct outcomes (om >= 0.8) to prevent
     # rewarding efficient-but-wrong agents.
@@ -526,15 +648,25 @@ def compute_reward(
     # Process reward: +0.05 if agent actually used the statutory tool.
     process_bonus = 0.05 if statutory_tool_used else 0.0
     lam   = 0.3
-    total = 0.3*om + 0.2*fr + 0.2*sa + 0.2*ca + 0.1*rq + 0.1*efficiency + process_bonus - lam*bias
     return {
         "outcome_match":             round(om,           4),
         "flight_risk_accuracy":      round(fr,           4),
         "statutory_accuracy":        round(sa,           4),
         "condition_appropriateness": round(ca,           4),
         "reasoning_quality":         round(rq,           4),
         "efficiency_bonus":          round(efficiency,   4),
         "process_bonus":             round(process_bonus,4),
         "bias_penalty":              round(bias,         4),

 from typing import Any, Dict, List, Optional
+# ---------------------------------------------------------------------------
+# Shared helper: NDPS detection (canonical definition — import this elsewhere)
+# ---------------------------------------------------------------------------
+def _is_ndps_case(episode: dict) -> bool:
+    """
+    Detect narcotics cases even when special_laws field is empty.
+    Checks ipc_sections and crime_type for NDPS indicators.
+    This is the SINGLE canonical definition — import from server.reward
+    in undertrial_environment.py and training/train_grpo.py.
+    """
+    sections = " ".join(str(s) for s in episode.get("ipc_sections", [])).lower()
+    crime = str(episode.get("crime_type", "")).lower()
+    narcotics_indicators = [
+        "ndps", "narcotic", "drug", "psychotropic",
+        "20(b)", "22(b)", "27a", "section 37",
+    ]
+    return any(ind in sections or ind in crime for ind in narcotics_indicators)
 # ---------------------------------------------------------------------------
 # 1. Outcome Match  (40%)
 # ---------------------------------------------------------------------------
     if not special_laws and any(t in crime_type_lower for t in CRIME_TYPE_SPECIAL_LAWS):
         special_laws = "INFERRED"  # Treat as special-law-restricted for eligibility
+    # ── B9: NDPS-specific statutory scoring ──────────────────────────────
+    # NDPS Section 37 twin conditions override standard threshold logic.
+    # Reward the agent for recognizing NDPS applies, not for arithmetic.
+    if _is_ndps_case(episode):
+        gt_granted = "grant" in gt_outcome.lower()
+        direction_correct = (agent_eligible == gt_granted)
+        ndps_recognized = any(
+            t in comp for t in ["section 37", "twin condition", "ndps", "37(1)(b)"]
+        )
+        if ndps_recognized and direction_correct:
+            return 1.0
+        elif direction_correct:
+            return 0.5
+        else:
+            return 0.0
+    # ── Standard IPC/BNSS statutory scoring ──────────────────────────────
     # Compute ground-truth eligibility for cases with known custody duration
     half_sent_months = (max_sent * 12) / 2
     truly_eligible   = (custody_mo >= half_sent_months) and not special_laws
     return round(max(0.0, min(1.0, base - consistency_deduction)), 4)
+# ---------------------------------------------------------------------------
+# 7. Think-block reasoning gate (B6)
+# ---------------------------------------------------------------------------
+def compute_think_factor(completion: str, current_stage: int) -> float:
+    """
+    Gate outcome credit on reasoning quality.
+    Stage 1: soft floor of 0.3 minimum (model still learning format).
+    Stage 2+: hard gate — no reasoning = no outcome credit.
+    Threshold: 120 words for full credit.
+    """
+    if not completion:
+        return 0.3 if current_stage == 1 else 0.0
+    think_match = re.search(r'<think>(.*?)</think>', completion, re.DOTALL)
+    think_text = think_match.group(1).strip() if think_match else ""
+    think_len = len(think_text.split())
+    raw_factor = min(1.0, think_len / 120.0)
+    if current_stage == 1:
+        # Soft floor: minimum 0.3 credit even with no think block
+        # Ensures GRPO has non-zero gradient signal in early training
+        return 0.3 + 0.7 * raw_factor
+    else:
+        # Hard gate: must reason to earn outcome credit
+        return raw_factor
+# ---------------------------------------------------------------------------
+# 8. Format compliance (B8)
+# ---------------------------------------------------------------------------
+def reward_format(completion: str) -> float:
+    """
+    Score structural compliance of the bail memo.
+    Checks for required XML tags matching the system prompt and valid outcome.
+    Returns 0.0–1.0 (fraction of required elements present).
+    """
+    if not completion:
+        return 0.0
+    # Tags must match exactly what SYSTEM_PROMPT instructs the model to produce
+    required_tags = [
+        r'<think>',
+        r'<memo>',
+        r'<flight_risk>',
+        r'<statutory_eligible>',
+        r'<recommended_outcome>',
+        r'<statutory_computation>',
+    ]
+    valid_outcomes = [
+        'bail granted', 'bail denied',
+        'conditional bail', 'default bail',
+    ]
+    checks = [
+        bool(re.search(tag, completion, re.IGNORECASE))
+        for tag in required_tags
+    ]
+    checks.append(
+        any(outcome in completion.lower() for outcome in valid_outcomes)
+    )
+    return sum(checks) / len(checks)
 # ---------------------------------------------------------------------------
 # Master reward function
 # ---------------------------------------------------------------------------
     agent_flight_risk_justification: str = "",
     agent_grounds_for: Optional[List[str]] = None,
     agent_grounds_against: Optional[List[str]] = None,
+    completion_text: Optional[str] = None,
+    current_stage: int = 1,
 ) -> Dict[str, float]:
     """
     Computes the full reward for a submitted bail assessment memo.
+    Formula (B6/B8 update):
+        R = 0.4*outcome_gated                 (gated by think_factor)
           + 0.2*flight_risk_accuracy
           + 0.2*statutory_accuracy
           + 0.2*condition_appropriateness
+          + 0.1*reasoning_quality
+          + 0.05*efficiency_bonus
+          + 0.05*format_score
+          + process_bonus
           - 0.3*bias_penalty
+    Core components: 0.4+0.2+0.2+0.2 = 1.0
+    Bonuses: rq(0.1) + eff(0.05) + fmt(0.05) + process(0.05)
+    Penalty: -0.3*bias
     Returns a dict with all component scores + total_reward.
     """
     gt = episode["ground_truth"]
         episode                   = episode,
     )
+    # B6: Gate outcome credit on reasoning quality (think block)
+    # In server path, completion_text may be None (structured memo submission)
+    # — default to think_factor=1.0 (no gating; env already enforces min tools).
+    if completion_text:
+        think_factor = compute_think_factor(completion_text, current_stage)
+    else:
+        think_factor = 1.0
+    om_gated = om * think_factor
+    # B8: Format compliance score
+    fmt = reward_format(completion_text) if completion_text else 0.5
     # Efficiency bonus: reward finishing faster when the answer is correct.
     # Only fires on directionally-correct outcomes (om >= 0.8) to prevent
     # rewarding efficient-but-wrong agents.
     # Process reward: +0.05 if agent actually used the statutory tool.
     process_bonus = 0.05 if statutory_tool_used else 0.0
+    # Reward formula:
+    # Core (sum=1.0): 0.4*outcome_gated + 0.2*flight + 0.2*statutory + 0.2*conditions
+    # Bonuses:        0.1*reasoning_quality + 0.05*efficiency + 0.05*format
+    # Process:        +0.05 if statutory tool used
+    # Penalty:        -0.3*bias
     lam   = 0.3
+    total = (0.4*om_gated + 0.2*fr + 0.2*sa + 0.2*ca
+             + 0.1*rq + 0.05*efficiency + 0.05*fmt
+             + process_bonus - lam*bias)
     return {
         "outcome_match":             round(om,           4),
+        "outcome_match_gated":       round(om_gated,     4),
+        "think_factor":              round(think_factor,  4),
         "flight_risk_accuracy":      round(fr,           4),
         "statutory_accuracy":        round(sa,           4),
         "condition_appropriateness": round(ca,           4),
         "reasoning_quality":         round(rq,           4),
+        "format_score":              round(fmt,          4),
         "efficiency_bonus":          round(efficiency,   4),
         "process_bonus":             round(process_bonus,4),
         "bias_penalty":              round(bias,         4),

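The updated formula in `server/reward.py` can be checked by hand. The following is a minimal sketch, not the repository's implementation: `total_reward` is a hypothetical helper that restates the weighted sum from the diff, and `compute_think_factor` copies the local fallback shown in `training/train_grpo.py`, which that file says mirrors the server version.

```python
import re


def compute_think_factor(completion: str, current_stage: int) -> float:
    """Scale outcome credit by the word count of the <think> block (saturates at 120 words)."""
    if not completion:
        return 0.3 if current_stage == 1 else 0.0
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    think_text = match.group(1).strip() if match else ""
    raw_factor = min(1.0, len(think_text.split()) / 120.0)
    # Stage 1 keeps a 0.3 floor so early training still sees outcome signal.
    return 0.3 + 0.7 * raw_factor if current_stage == 1 else raw_factor


def total_reward(om, fr, sa, ca, rq, efficiency, fmt, process_bonus, bias, think_factor):
    """Hypothetical helper: the weighted sum from the diff, with lam = 0.3."""
    om_gated = om * think_factor
    return (0.4 * om_gated + 0.2 * fr + 0.2 * sa + 0.2 * ca
            + 0.1 * rq + 0.05 * efficiency + 0.05 * fmt
            + process_bonus - 0.3 * bias)


# A 60-word think block earns half credit in stage 2 and 0.3 + 0.7 * 0.5 in stage 1.
completion = "<think>" + " ".join(["w"] * 60) + "</think>"
print(compute_think_factor(completion, current_stage=2))
print(compute_think_factor(completion, current_stage=1))

# Perfect memo, statutory tool used, no bias: core 1.0 plus 0.25 in bonuses.
print(total_reward(1, 1, 1, 1, 1, 1, 1, process_bonus=0.05, bias=0, think_factor=1.0))
```

Because the bonuses stack on top of the core weights, the total is not bounded by 1.0; with every component at its maximum it comes to about 1.25.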
server/undertrial_environment.py CHANGED

@@ -8,7 +8,7 @@ import uuid
 from typing import Any, Dict, List, Optional
 from .dataset import BailDataset
-from .reward import compute_reward
 from .schema_drift import maybe_apply_drift
 try:
@@ -277,6 +277,21 @@ class UndertriAIEnvironment(Environment):
             return "No directly applicable precedents found in database."
         elif isinstance(action, ComputeStatutoryEligibilityAction):
             half_months = (action.max_sentence_years * 12) / 2
             eligible = action.custody_months >= half_months and not action.special_law_applicable
             pct = round((action.custody_months / (action.max_sentence_years * 12)) * 100, 1) if action.max_sentence_years else 0

 from typing import Any, Dict, List, Optional
 from .dataset import BailDataset
+from .reward import compute_reward, _is_ndps_case
 from .schema_drift import maybe_apply_drift
 try:
             return "No directly applicable precedents found in database."
         elif isinstance(action, ComputeStatutoryEligibilityAction):
+            # B9: NDPS cases get Section 37 response instead of threshold arithmetic
+            if _is_ndps_case(self._episode):
+                return (
+                    f"Statutory Eligibility Analysis:\n"
+                    f"  Sections: {', '.join(action.sections_invoked)}\n"
+                    f"  Special Law: NDPS Act applies\n"
+                    f"  Section: Section 37 NDPS Act\n"
+                    f"  Message: NDPS Section 37 applies. Standard custody threshold not applicable. "
+                    f"Bail requires twin conditions under Section 37(1)(b): "
+                    f"(i) reasonable grounds to believe accused is not guilty, "
+                    f"(ii) accused is not likely to commit any offence while on bail. "
+                    f"These are matters for judicial discretion, not statutory calculation.\n"
+                    f"  → ELIGIBLE FOR DEFAULT BAIL: NOT APPLICABLE (NDPS twin conditions govern)"
+                )
             half_months = (action.max_sentence_years * 12) / 2
             eligible = action.custody_months >= half_months and not action.special_law_applicable
             pct = round((action.custody_months / (action.max_sentence_years * 12)) * 100, 1) if action.max_sentence_years else 0

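For non-NDPS cases the handler falls through to the half-sentence threshold arithmetic. A minimal sketch of that calculation follows, assuming a hypothetical standalone function `default_bail_eligibility` in place of the action object used in the environment; the three expressions themselves are taken from the diff.

```python
def default_bail_eligibility(custody_months: float,
                             max_sentence_years: float,
                             special_law_applicable: bool = False) -> dict:
    """Hypothetical wrapper around the threshold arithmetic shown in the diff."""
    # Half of the maximum sentence, expressed in months.
    half_months = (max_sentence_years * 12) / 2
    # Eligible only when custody reaches the threshold and no special law applies.
    eligible = custody_months >= half_months and not special_law_applicable
    # Share of the maximum sentence already served; guarded against a zero sentence.
    pct = (round((custody_months / (max_sentence_years * 12)) * 100, 1)
           if max_sentence_years else 0)
    return {"half_months": half_months, "eligible": eligible, "pct_served": pct}


# 45 months served against a 7-year maximum: the threshold is 42 months.
print(default_bail_eligibility(45, 7))
# The same custody under a special law is not eligible by this arithmetic.
print(default_bail_eligibility(45, 7, special_law_applicable=True))
```

The NDPS branch added in this commit returns before this arithmetic runs, so the threshold result is never reported for those cases.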
training/train_grpo.py CHANGED

@@ -52,12 +52,41 @@ try:
         compute_condition_score,
         compute_bias_penalty as _server_bias,
         compute_reasoning_quality,
     )
     _USE_SERVER_REWARDS = True
     print("[reward] Using authoritative server/reward.py functions.")
 except ImportError:
     _USE_SERVER_REWARDS = False
     print("[reward] server/reward.py not found — using local fallback functions.")
 from datasets import Dataset
 # ============================================================
@@ -188,16 +217,39 @@ def parse_model_output(output: str) -> Dict[str, Any]:
 def reward_format(completions: List[str], **kwargs) -> List[float]:
-    """Reward well-formed XML output structure."""
-    scores = []
-    for c in completions:
-        score = 0.0
-        if "<think>" in c and "</think>" in c: score += 0.15
-        if "<memo>" in c and "</memo>" in c:   score += 0.15
-        for tag in ["flight_risk","statutory_eligible","recommended_outcome","statutory_computation"]:
-            if f"<{tag}>" in c: score += 0.05
-        scores.append(min(1.0, score))
-    return scores
 def reward_outcome_match(completions: List[str], episode_batch: List[Dict], **kwargs) -> List[float]:
@@ -234,7 +286,12 @@ def reward_flight_risk(completions: List[str], episode_batch: List[Dict], **kwar
 def reward_statutory(completions: List[str], episode_batch: List[Dict], **kwargs) -> List[float]:
-    """20% weight: correct statutory eligibility computation."""
     scores = []
     for comp, ep in zip(completions, episode_batch):
         parsed    = parse_model_output(comp)
@@ -242,19 +299,61 @@ def reward_statutory(completions: List[str], episode_batch: List[Dict], **kwargs
         sections  = ep.get("ipc_sections", [])
         max_sent  = ep.get("max_sentence_years", 5.0)
         custody   = ep.get("custody_months", 0.0)
         score = 0.0
-        # Mentions relevant sections
-        for sec in sections:
-            if sec.strip().lower() in comp_text or sec.strip() in comp:
-                score += 0.2
-        score = min(0.4, score)
-        # Mentions numbers
-        if re.search(r'\d+', comp_text): score += 0.3
-        # Mentions time-related words
-        if any(w in comp_text for w in ["month","year","sentence","custody","half","served","threshold"]):
-            score += 0.3
         scores.append(min(1.0, score))
     return scores
@@ -309,14 +408,24 @@ def reward_no_bias(completions: List[str], episode_batch: List[Dict], **kwargs)
 def combined_reward(
     completions: List[str],
     episode_batch: List[Dict],
     **kwargs
 ) -> List[float]:
     """
     Master reward combining all components.
-    R = 0.4*outcome + 0.2*flight_risk + 0.2*statutory + 0.2*condition - 0.3*bias
     Uses server/reward.py functions when available (Fix 1).
-    Condition appropriateness replaces format score (Fix 2).
     """
     rewards = []
@@ -359,12 +468,22 @@ def combined_reward(
             b  = reward_no_bias([comp], [ep])[0]
             rq = 0.5  # Neutral when server functions unavailable
-        # NOTE: Efficiency is NOT computed in GRPO training because step_count=1
-        # always (single-shot generation), making eff=1.0 a constant non-signal.
-        # Efficiency is preserved in the environment's compute_reward for live inference.
-        eff = 0.0
-        total = 0.3*o + 0.2*fr + 0.2*s + 0.2*ca + 0.1*rq - 0.3*b
         rewards.append(round(total, 4))  # No max(0.0) clamp — bias can go negative
     return rewards
@@ -593,7 +712,7 @@ def train(
     # Reward wrapper that unpacks the stored JSON episode
     def reward_fn(completions: List[str], episode: List[str], **kwargs) -> List[float]:
         ep_objs = [json.loads(e) for e in episode]
-        return combined_reward(completions, ep_objs)
     # ── GRPO Config ──────────────────────────────────────────
     from trl import GRPOConfig, GRPOTrainer  # type: ignore
@@ -711,7 +830,7 @@ def evaluate_baseline(episodes_dir: str, n_samples: int = 20):
             out = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
         completion = tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)
-        r = combined_reward([completion], [ep])[0]
         rewards.append(r)
         print(f"  Case {ep['case_id']}: reward={r:.3f} | GT={ep['ground_truth']['outcome']}")
@@ -770,7 +889,7 @@ def evaluate_on_stage(
             out = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
         completion = tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)
-        r = combined_reward([completion], [ep])[0]
         rewards.append(r)
         results.append({"episode": ep, "completion": completion, "reward": r})
@@ -916,10 +1035,7 @@ def train_curriculum(
         def reward_fn(completions: List[str], episode: List[str], **kwargs) -> List[float]:
             ep_objs = [json.loads(e) for e in episode]
-            # Pass step_count=1 for curriculum training (single-shot XML, no multi-step env loop)
-            # This keeps efficiency contribution honest rather than silently 0.0
-            step_counts = [1] * len(completions)
-            return combined_reward(completions, ep_objs, step_counts=step_counts)
         stage_output = f"{output_dir}/stage_{stage}"
         config = GRPOConfig(
@@ -1019,7 +1135,248 @@ def train_curriculum(
 # ============================================================
-# CELL 9 — Entry point
 # ============================================================
 if __name__ == "__main__":
@@ -1035,6 +1392,10 @@ if __name__ == "__main__":
                         help="Run evaluation after training to measure improvement")
     parser.add_argument("--curriculum",    action="store_true",
                         help="Run self-improving curriculum training (all 4 stages)")
     args = parser.parse_args()
@@ -1047,6 +1408,15 @@ if __name__ == "__main__":
             max_steps_per_stage=args.steps,
             batch_size=args.batch_size,
         )
     else:
         train(
             episodes_dir = args.episodes_dir,

         compute_condition_score,
         compute_bias_penalty as _server_bias,
         compute_reasoning_quality,
+        compute_think_factor,
+        reward_format as server_reward_format,
+        _is_ndps_case,
     )
     _USE_SERVER_REWARDS = True
     print("[reward] Using authoritative server/reward.py functions.")
 except ImportError:
     _USE_SERVER_REWARDS = False
     print("[reward] server/reward.py not found — using local fallback functions.")
+    # Local fallback definition of _is_ndps_case (mirrors server/reward.py)
+    def _is_ndps_case(episode: dict) -> bool:
+        sections = " ".join(str(s) for s in episode.get("ipc_sections", [])).lower()
+        crime = str(episode.get("crime_type", "")).lower()
+        narcotics_indicators = [
+            "ndps", "narcotic", "drug", "psychotropic",
+            "20(b)", "22(b)", "27a", "section 37",
+        ]
+        return any(ind in sections or ind in crime for ind in narcotics_indicators)
+    # Local fallback definition of compute_think_factor (mirrors server/reward.py)
+    def compute_think_factor(completion: str, current_stage: int) -> float:
+        if not completion:
+            return 0.3 if current_stage == 1 else 0.0
+        think_match = re.search(r'<think>(.*?)</think>', completion, re.DOTALL)
+        think_text = think_match.group(1).strip() if think_match else ""
+        think_len = len(think_text.split())
+        raw_factor = min(1.0, think_len / 120.0)
+        if current_stage == 1:
+            return 0.3 + 0.7 * raw_factor
+        else:
+            return raw_factor
+    # Local fallback server_reward_format
+    server_reward_format = None  # combined_reward falls back to local reward_format_single
 from datasets import Dataset
 # ============================================================
 def reward_format(completions: List[str], **kwargs) -> List[float]:
+    """Reward well-formed XML output structure (batch API for GRPO compatibility)."""
+    return [reward_format_single(c) for c in completions]
+def reward_format_single(completion: str) -> float:
+    """
+    Score structural compliance of the bail memo.
+    Checks for required XML tags matching the system prompt and valid outcome.
+    Returns 0.0–1.0 (fraction of required elements present).
+    """
+    if not completion:
+        return 0.0
+    # Tags match exactly what SYSTEM_PROMPT instructs the model to produce
+    required_tags = [
+        r'<think>',
+        r'<memo>',
+        r'<flight_risk>',
+        r'<statutory_eligible>',
+        r'<recommended_outcome>',
+        r'<statutory_computation>',
+    ]
+    valid_outcomes = [
+        'bail granted', 'bail denied',
+        'conditional bail', 'default bail',
+    ]
+    checks = [
+        bool(re.search(tag, completion, re.IGNORECASE))
+        for tag in required_tags
+    ]
+    checks.append(
+        any(outcome in completion.lower() for outcome in valid_outcomes)
+    )
+    return sum(checks) / len(checks)
 def reward_outcome_match(completions: List[str], episode_batch: List[Dict], **kwargs) -> List[float]:
 def reward_statutory(completions: List[str], episode_batch: List[Dict], **kwargs) -> List[float]:
+    """20% weight: correct statutory eligibility computation.
+    B3: Direction-gated computation bonus — wrong direction gets 0.10 not 0.30.
+    B9: NDPS cases use crime_type detection and reward Section 37 recognition.
+    """
+    TIME_WORDS = ["month", "year", "sentence", "custody", "half", "served", "threshold"]
     scores = []
     for comp, ep in zip(completions, episode_batch):
         parsed    = parse_model_output(comp)
         sections  = ep.get("ipc_sections", [])
         max_sent  = ep.get("max_sentence_years", 5.0)
         custody   = ep.get("custody_months", 0.0)
+        special_laws = ep.get("special_laws", "").strip()
+        gt_outcome = ep.get("ground_truth", {}).get("outcome", "")
+        agent_eligible = parsed["statutory_eligible"]
+        # B9: NDPS-specific scoring
+        if _is_ndps_case(ep):
+            gt_granted = "grant" in gt_outcome.lower()
+            direction_correct = (agent_eligible == gt_granted)
+            ndps_recognized = any(
+                t in comp_text for t in ["section 37", "twin condition", "ndps", "37(1)(b)"]
+            )
+            if ndps_recognized and direction_correct:
+                scores.append(1.0)
+            elif direction_correct:
+                scores.append(0.5)
+            else:
+                scores.append(0.0)
+            continue
+        # Infer special law from crime_type
+        CRIME_TYPE_SPECIAL_LAWS = [
+            "narcotics", "ndps", "pocso", "uapa", "pmla",
+            "terrorism", "organised crime", "money laundering",
+        ]
+        crime_type_lower = ep.get("crime_type", "").lower()
+        if not special_laws and any(t in crime_type_lower for t in CRIME_TYPE_SPECIAL_LAWS):
+            special_laws = "INFERRED"
+        # Standard IPC/BNSS threshold computation
+        half_sent_months = (max_sent * 12) / 2
+        truly_eligible = (custody >= half_sent_months) and not special_laws
         score = 0.0
+        # 40%: eligibility direction
+        direction_correct = (agent_eligible == truly_eligible)
+        if direction_correct:
+            score += 0.4
+        elif (agent_eligible and "grant" in gt_outcome.lower()) or \
+             (not agent_eligible and "deni" in gt_outcome.lower()):
+            score += 0.2
+        # 30%: cited relevant sections
+        if sections:
+            hits = sum(1 for sec in sections if sec.strip().lower() in comp_text or sec.strip() in comp)
+            score += 0.3 * min(1.0, hits / len(sections))
+        # 30%: numeric computation (B3: direction-gated)
+        has_numbers = bool(re.search(r'\d+', comp_text))
+        has_time_ref = any(w in comp_text for w in TIME_WORDS)
+        if has_numbers and has_time_ref:
+            score += 0.3 if direction_correct else 0.10
+        elif has_numbers or has_time_ref:
+            score += 0.15 if direction_correct else 0.05
         scores.append(min(1.0, score))
     return scores
 def combined_reward(
     completions: List[str],
     episode_batch: List[Dict],
+    current_stage: int = 1,
     **kwargs
 ) -> List[float]:
     """
     Master reward combining all components.
+    Formula (B6/B8 update):
+        R = 0.4*outcome_gated + 0.2*flight_risk + 0.2*statutory + 0.2*condition
+          + 0.1*reasoning_quality + 0.05*format
+          - 0.3*bias
+    Core (sum=1.0): 0.4*om_gated + 0.2*fr + 0.2*s + 0.2*ca
+    Bonuses:        0.1*rq + 0.05*fmt
+    Penalty:        -0.3*bias
     Uses server/reward.py functions when available (Fix 1).
+    B6: Outcome gated by think_factor (stage-aware).
+    B8: Format compliance score included with 0.05 weight.
     """
     rewards = []
             b  = reward_no_bias([comp], [ep])[0]
             rq = 0.5  # Neutral when server functions unavailable
+        # B6: Gate outcome credit on reasoning quality (think block)
+        think_factor = compute_think_factor(comp, current_stage)
+        om_gated = o * think_factor
+        # B8: Format compliance score
+        if _USE_SERVER_REWARDS and server_reward_format is not None:
+            fmt = server_reward_format(comp)
+        else:
+            fmt = reward_format_single(comp)
+        # Reward formula:
+        # Core (sum=1.0): 0.4*outcome_gated + 0.2*flight + 0.2*statutory + 0.2*conditions
+        # Bonuses:        0.1*reasoning_quality + 0.05*format
+        # Penalty:        -0.3*bias
+        total = (0.4*om_gated + 0.2*fr + 0.2*s + 0.2*ca
+                 + 0.1*rq + 0.05*fmt - 0.3*b)
         rewards.append(round(total, 4))  # No max(0.0) clamp — bias can go negative
     return rewards
     # Reward wrapper that unpacks the stored JSON episode
     def reward_fn(completions: List[str], episode: List[str], **kwargs) -> List[float]:
         ep_objs = [json.loads(e) for e in episode]
+        return combined_reward(completions, ep_objs, current_stage=stage)
     # ── GRPO Config ──────────────────────────────────────────
     from trl import GRPOConfig, GRPOTrainer  # type: ignore
             out = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
         completion = tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)
+        r = combined_reward([completion], [ep], current_stage=1)[0]
         rewards.append(r)
         print(f"  Case {ep['case_id']}: reward={r:.3f} | GT={ep['ground_truth']['outcome']}")
             out = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
         completion = tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)
+        r = combined_reward([completion], [ep], current_stage=stage)[0]
         rewards.append(r)
         results.append({"episode": ep, "completion": completion, "reward": r})
         def reward_fn(completions: List[str], episode: List[str], **kwargs) -> List[float]:
             ep_objs = [json.loads(e) for e in episode]
+            return combined_reward(completions, ep_objs, current_stage=stage)
         stage_output = f"{output_dir}/stage_{stage}"
         config = GRPOConfig(
 # ============================================================
+# CELL 9 — Adaptive Training (Theme 4: Self-Improvement)
+# ============================================================
+def train_adaptive(
+    episodes_dir: str = "./data/episodes",
+    output_dir: str = "./output/undertrial_adaptive",
+    steps_per_assessment: int = 50,
+    max_total_steps: int = 2000,
+    batch_size: int = 4,
+    grad_accum: int = 4,
+    lr: float = 5e-6,
+    base_url: str = "http://localhost:8000",
+):
+    """
+    Self-directed curriculum training (Theme 4).
+    Uses the /profile endpoint to check stage readiness every
+    steps_per_assessment steps and promotes automatically.
+    This function communicates with the server via HTTP — it does NOT
+    import server internals. OpenEnv client/server separation is preserved.
+    Training loop:
+      1. Start at stage 1
+      2. Train for steps_per_assessment steps
+      3. Query /profile for suggested_stage
+      4. If suggested_stage > current_stage, promote
+      5. Repeat until max_total_steps or stage 4 mastered
+    """
+    print("=" * 60)
+    print("  UndertriAI — Adaptive Self-Improvement Training")
+    print(f"  Assessment every {steps_per_assessment} steps | Max {max_total_steps} steps")
+    print(f"  Server: {base_url}")
+    print("=" * 60)
+    from unsloth import FastLanguageModel  # type: ignore
+    from trl import GRPOConfig, GRPOTrainer  # type: ignore
+    # Load model once
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name="unsloth/Qwen2.5-7B-Instruct",
+        max_seq_length=3072,
+        load_in_4bit=True,
+        fast_inference=False,
+    )
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=16,
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                        "gate_proj", "up_proj", "down_proj"],
+        lora_alpha=16, lora_dropout=0, bias="none",
+        use_gradient_checkpointing="unsloth", random_state=42,
+    )
+    # HTTP helper for server communication
+    def query_profile(session_id: str) -> Optional[Dict]:
+        """Query the performance profile from the server via HTTP."""
+        try:
+            url = f"{base_url}/profile?session_id={urllib.parse.quote(session_id)}"
+            req = urllib.request.Request(url)
+            with urllib.request.urlopen(req, timeout=5.0) as resp:
+                return json.loads(resp.read())
+        except Exception as e:
+            print(f"  [adaptive] Could not reach server profile: {e}")
+            return None
+    def notify_reset(session_id: str, stage: int) -> Optional[str]:
+        """Call /reset with adaptive=true on the server."""
+        try:
+            url = f"{base_url}/reset?session_id={urllib.parse.quote(session_id)}&stage={stage}&adaptive=true"
+            req = urllib.request.Request(url, method="POST")
+            with urllib.request.urlopen(req, timeout=5.0) as resp:
+                data = json.loads(resp.read())
+                return data.get("session_id", session_id)
+        except Exception as e:
+            print(f"  [adaptive] Could not reset server session: {e}")
+            return None
+    current_stage = 1
+    total_steps = 0
+    # Generate a unique session id, then try to initialise the session on the server
+    import uuid as _uuid_mod
+    session_id = f"adaptive_{_uuid_mod.uuid4().hex[:8]}"
+    notify_reset(session_id, current_stage)
+    # Tracking
+    stage_promotion_steps = []
+    reward_curve = []
+    stage_rewards = {1: [], 2: [], 3: [], 4: []}
+    while total_steps < max_total_steps:
+        print(f"\n{'━' * 60}")
+        print(f"  ADAPTIVE BLOCK: Steps {total_steps}–{total_steps + steps_per_assessment}")
+        print(f"  Current Stage: {current_stage} — {STAGE_NAMES.get(current_stage, '?')}")
+        print(f"{'━' * 60}")
+        # Load episodes for current stage
+        try:
+            episodes = load_episodes(episodes_dir, stage=current_stage, split="train")
+        except FileNotFoundError:
+            print(f"  No episodes for stage {current_stage} — breaking")
+            break
+        if not episodes:
+            print(f"  Empty episode list for stage {current_stage} — breaking")
+            break
+        # Build dataset
+        dataset = build_hf_dataset(episodes, tokenizer)
+        stage_for_closure = current_stage  # Capture for closure
+        def reward_fn(completions: List[str], episode: List[str], **kwargs) -> List[float]:
+            ep_objs = [json.loads(e) for e in episode]
+            return combined_reward(completions, ep_objs, current_stage=stage_for_closure)
+        block_output = f"{output_dir}/block_{total_steps}"
+        config = GRPOConfig(
+            output_dir=block_output,
+            learning_rate=lr,
+            per_device_train_batch_size=batch_size,
+            gradient_accumulation_steps=grad_accum,
+            num_train_epochs=1,
+            max_steps=steps_per_assessment,
+            num_generations=6,
+            max_completion_length=1024,
+            temperature=0.7,
+            beta=0.01,
+            logging_steps=5,
+            save_steps=steps_per_assessment,
+            report_to="none",
+            remove_unused_columns=False,
+        )
+        FastLanguageModel.for_training(model)
+        trainer = GRPOTrainer(
+            model=model,
+            processing_class=tokenizer,
+            args=config,
+            train_dataset=dataset,
+            reward_funcs=[reward_fn],
+        )
+        trainer.train()
+        total_steps += steps_per_assessment
+        # Evaluate current performance
+        eval_reward, _ = evaluate_on_stage(
+            model, tokenizer, episodes_dir, stage=current_stage, n_samples=15
+        )
+        stage_rewards[current_stage].append(eval_reward)
+        reward_curve.append((total_steps, round(eval_reward, 4)))
+        print(f"  Stage {current_stage} eval reward: {eval_reward:.4f}")
+        # Query server for stage promotion suggestion
+        profile_data = query_profile(session_id)
+        suggested_stage = current_stage
+        if profile_data and "profile" in profile_data:
+            suggested_stage = profile_data["profile"].get(
+                "suggested_stage", current_stage
+            )
+        else:
+            # Fallback: use local heuristic
+            if eval_reward >= 0.65 and current_stage == 1:
+                suggested_stage = 2
+            elif eval_reward >= 0.55 and current_stage == 2:
+                suggested_stage = 3
+            elif eval_reward >= 0.50 and current_stage == 3:
+                suggested_stage = 4
+        if suggested_stage > current_stage:
+            old_stage = current_stage
+            old_reward = eval_reward
+            current_stage = suggested_stage
+            stage_promotion_steps.append(
+                (total_steps, old_stage, current_stage, round(old_reward, 4))
+            )
+            print(
+                f"[SELF-IMPROVEMENT] Step {total_steps}: "
+                f"Promoted to Stage {current_stage}. "
+                f"Stage {old_stage} mean reward: {old_reward:.3f} → "
+                f"Stage {current_stage} begins."
+            )
+            # Notify server of promotion
+            notify_reset(session_id, current_stage)
+        # Check completion
+        if current_stage == 4:
+            s4_rewards = stage_rewards.get(4, [])
+            if s4_rewards and s4_rewards[-1] >= 0.50:
+                print(
+                    f"\n[SELF-IMPROVEMENT] Stage 4 mastered at step {total_steps}! "
+                    f"Reward: {s4_rewards[-1]:.3f}"
+                )
+                break
+        # Save checkpoint
+        model.save_pretrained(block_output, save_adapters_only=True)
+        tokenizer.save_pretrained(block_output)
+    # ── Final summary ──
+    print(f"\n{'═' * 60}")
+    print("  ADAPTIVE TRAINING COMPLETE")
+    print(f"{'═' * 60}")
+    print(f"  Total steps: {total_steps}")
+    print(f"  Stage promotions: {len(stage_promotion_steps)}")
+    for step_n, from_s, to_s, reward in stage_promotion_steps:
+        print(f"    Step {step_n}: Stage {from_s} → {to_s} (reward {reward:.3f})")
+    print(f"  Final stage: {current_stage}")
+    # Compute final reward per stage
+    final_reward_per_stage = {}
+    for s, rewards_list in stage_rewards.items():
+        if rewards_list:
+            final_reward_per_stage[str(s)] = round(rewards_list[-1], 4)
+    # Save results
+    results = {
+        "stage_promotion_steps": [
+            {"step": s, "from_stage": f, "to_stage": t, "reward": r}
+            for s, f, t, r in stage_promotion_steps
+        ],
+        "final_reward_per_stage": final_reward_per_stage,
+        "total_steps_completed": total_steps,
+        "reward_curve": [{"step": s, "reward": r} for s, r in reward_curve],
+    }
+    results_path = Path(output_dir) / "results_adaptive.json"
+    results_path.parent.mkdir(parents=True, exist_ok=True)
+    results_path.write_text(json.dumps(results, indent=2))
+    print(f"\n  Results saved: {results_path}")
+    # Save final model
+    final_dir = f"{output_dir}/final"
+    model.save_pretrained(final_dir, save_adapters_only=True)
+    tokenizer.save_pretrained(final_dir)
+    print(f"  Final model saved: {final_dir}")
+    return results
+# ============================================================
+# CELL 10 — Entry point
 # ============================================================
 if __name__ == "__main__":
                         help="Run evaluation after training to measure improvement")
     parser.add_argument("--curriculum",    action="store_true",
                         help="Run self-improving curriculum training (all 4 stages)")
+    parser.add_argument("--adaptive",      action="store_true",
+                        help="Run adaptive self-improvement training (Theme 4)")
+    parser.add_argument("--env_url",       default="http://localhost:8000",
+                        help="Server URL for adaptive training")
     args = parser.parse_args()
             max_steps_per_stage=args.steps,
             batch_size=args.batch_size,
         )
+    elif args.adaptive:
+        train_adaptive(
+            episodes_dir=args.episodes_dir,
+            output_dir=args.output,
+            steps_per_assessment=args.steps,
+            max_total_steps=2000,
+            batch_size=args.batch_size,
+            base_url=args.env_url,
+        )
     else:
         train(
             episodes_dir = args.episodes_dir,
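
When the `/profile` endpoint is unreachable, `train_adaptive` falls back to fixed reward thresholds to decide on promotion. The sketch below restates that fallback as a lookup table. The names `FALLBACK_THRESHOLDS` and `suggest_stage` are hypothetical and do not appear in the commit; the threshold values are the ones in the diff.

```python
# Minimum eval reward required to leave each stage (values from the diff's fallback branch).
FALLBACK_THRESHOLDS = {1: 0.65, 2: 0.55, 3: 0.50}


def suggest_stage(eval_reward: float, current_stage: int) -> int:
    """Return the next stage if the reward clears the current stage's threshold."""
    threshold = FALLBACK_THRESHOLDS.get(current_stage)
    if threshold is not None and eval_reward >= threshold:
        return current_stage + 1
    # Stage 4 has no entry in the table, so it is never promoted further.
    return current_stage


print(suggest_stage(0.70, 1))  # clears 0.65, moves to stage 2
print(suggest_stage(0.60, 1))  # below 0.65, stays at stage 1
print(suggest_stage(0.90, 4))  # final stage, stays at 4
```

Like the original `if`/`elif` chain, this promotes by at most one stage per assessment block, whereas a `suggested_stage` returned by the server could in principle skip stages.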