name: undertrial-ai
version: "1.1.0"
description: >
  OpenEnv-compliant RL training environment for Indian bail decision support
  with adaptive self-improvement (Theme 4). An LLM agent reads High Court bail
  cases, invokes legal tools, and submits structured bail recommendations.
  Reward is computed deterministically against real HC judgments with an
  explicit bias penalty (lambda=0.3). Features performance-aware episode
  selection, stage-gated curriculum promotion, and synthetic case generation.
author: Draken1606
license: MIT
repository: https://github.com/Faiz-1606/Undertrial
space: https://huggingface.co/spaces/Draken1606/undertrial-ai

tags:
  - legal-ai
  - india
  - bail
  - grpo
  - world-modeling
  - bias-mitigation
  - bnss-2023
  - self-improvement
  - adaptive-curriculum

environment:
  class: undertrial_ai.server.undertrial_environment.UndertriAIEnvironment
  supports_concurrent_sessions: true
  max_steps_per_episode: 10

actions:
  - name: request_document
    description: Request a missing case document (FIR, charge sheet, prior judgment)
  - name: flag_inconsistency
    description: Flag a legal inconsistency in the charge or prosecution argument
  - name: cross_reference_precedent
    description: Retrieve a relevant landmark SC/HC precedent
  - name: compute_statutory_eligibility
    description: Check BNSS 479 default bail eligibility (custody vs. max sentence)
  - name: assess_surety
    description: Evaluate financial viability of a proposed surety
  - name: classify_bail_type
    description: Determine bail type from grounds for/against
  - name: read_submissions
    description: Read and summarise prosecution or defence submissions on record
  - name: assess_flight_risk
    description: Systematic flight risk assessment using a structured scoring matrix
  - name: check_case_factors
    description: Examine specific case factors (parity, evidence tampering, victim vulnerability)
  - name: apply_proportionality
    description: Apply BNSS 479 proportionality — custody vs. max sentence vs. trial timeline
  - name: pull_criminal_history
    description: Pull the accused's prior criminal record, bail history, and conviction status
  - name: submit_memo
    description: "TERMINAL — Submit structured bail assessment memo"

reward:
  formula: "0.4*outcome_gated + 0.2*flight_risk + 0.2*statutory + 0.2*conditions + 0.1*reasoning_quality + 0.05*efficiency + 0.05*format + 0.05*process_bonus - 0.3*bias"
  range: [-0.7, 1.15]
  terminal_action: submit_memo
  deterministic: true
  llm_as_judge: false
  components:
    - outcome_match: "Agreement with real High Court decision, gated by reasoning quality (40%)"
    - flight_risk_accuracy: "Flight risk classification accuracy (20%)"
    - statutory_accuracy: "IPC/BNSS threshold computation with direction gate (20%)"
    - condition_appropriateness: "Bail condition quality (20%)"
    - reasoning_quality: "Justification anchoring + arithmetic verification + grounds specificity (10% bonus)"
    - format_compliance: "XML tag adherence matching system prompt structure (5% bonus)"
    - bias_penalty: "Penalty for ignoring parity in bias cases (-30%)"
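# Worked example of the reward formula (illustrative component scores, not an
# official test vector): a memo that matches the HC outcome (outcome_gated = 1.0),
# scores 0.5 on flight risk, 1.0 on statutory and conditions, earns the
# reasoning and format bonuses but no efficiency or process bonus, and triggers
# no bias penalty:
#   0.4*1.0 + 0.2*0.5 + 0.2*1.0 + 0.2*1.0 + 0.1*1.0 + 0.05*0 + 0.05*1.0 + 0.05*0 - 0.3*0 = 1.05
# (The positive weights sum to 1.25, above the declared ceiling of 1.15, so the
# scorer presumably clips or rescales the raw sum; read the formula as the
# component weighting rather than exact bounds.)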
curriculum:
  levels: 3
  easy: "Easy — landmark clear-cut cases (104 episodes, 60 steps)"
  medium: "Medium — contested judgment calls (761 episodes, 160 steps)"
  hard: "Hard — bias reversal + schema drift (335 episodes, 80 steps)"

self_improvement:
  adaptive_curriculum:
    description: >
      Performance-gated stage promotion using exponential moving averages.
      The agent auto-promotes when its per-stage EMA exceeds the threshold.
    thresholds:
      stage_1_to_2: {min_reward: 0.65, min_episodes: 20}
      stage_2_to_3: {min_reward: 0.55, min_episodes: 50}
      stage_3_to_4: {min_reward: 0.50, min_episodes: 20}
  weakness_targeting:
    description: >
      Adaptive episode selection identifies the crime type with the lowest EMA
      reward and serves proportionally more cases from that domain.
    strategy: "60% weakest domain / 30% failure replay / 10% exploration"
  synthetic_generation:
    description: >
      When the agent masters a domain (EMA > 0.70), the environment generates
      harder synthetic variants using 5 perturbation types.
    perturbation_types:
      - custody_escalation
      - co_accused_conflict
      - section_ambiguity
      - evidence_reversal
      - surety_complexity

endpoints:
  - path: /reset
    method: POST
    description: "Start a new episode. Supports adaptive=true and auto_stage=true for Theme 4."
  - path: /step
    method: POST
    description: "Submit a tool call or final memo. Updates the performance tracker when done."
  - path: /state
    method: GET
    description: "Inspect current episode state."
  - path: /health
    method: GET
    description: "Health check."
  - path: /tools
    method: GET
    description: "List available tools."
  - path: /profile
    method: GET
    description: "Get the agent performance profile for a session (Theme 4)."
  - path: /adaptive_status
    method: GET
    description: "Get global adaptive-mode capabilities and thresholds."
  - path: /ws/{session_id}
    method: WS
    description: "WebSocket real-time feed."

training:
  method: GRPO
  framework: TRL + Unsloth
  model: unsloth/Qwen2.5-7B-Instruct
  notebook: training/UndertriAI_GRPO_Training.ipynb
  script: training/train_grpo.py
  total_steps: 300
  num_generations: 6
  temperature: 1.1
  modes:
    - name: curriculum_3level
      command: "python training/train_grpo.py --curriculum --offline"
    - name: single_stage
      command: "python training/train_grpo.py --stage 1 --offline --steps 200"

deployment:
  platform: huggingface-spaces
  sdk: docker
  port: 7860
  url: https://draken1606-undertrial-ai.hf.space
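
# Promotion arithmetic (a sketch under an assumed smoothing factor; the
# tracker's actual alpha is not pinned in this manifest). With the standard
# exponential moving average
#   ema_new = alpha * reward + (1 - alpha) * ema_old
# and alpha = 0.1, a stage-1 agent with ema_old = 0.62 that scores 0.90 on the
# next episode moves to
#   ema_new = 0.1*0.90 + 0.9*0.62 = 0.648
# which is still below the stage_1_to_2 gate (min_reward: 0.65), so no
# promotion occurs even if min_episodes: 20 is already satisfied.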
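
# Episode selection under the 60/30/10 strategy (hypothetical domain EMAs for
# illustration only): given per-crime-type EMAs of, say,
#   {ndps: 0.41, ipc_murder: 0.58, economic_offence: 0.71}
# the weakest domain is ndps, so roughly 60% of served episodes come from NDPS
# cases, 30% replay previously failed episodes, and 10% are drawn from the
# remaining pool for exploration. Once a domain's EMA clears 0.70 (as
# economic_offence has here), synthetic_generation can serve perturbed
# variants of its mastered cases instead.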
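
# Example session against the HTTP API (payload field names are assumptions
# inferred from the endpoint descriptions above, not a pinned schema):
#
#   POST /reset
#     request:  {adaptive: true, auto_stage: true}
#     response: {session_id: "abc123", observation: "<case facts>", step: 0}
#
#   POST /step
#     request:  {session_id: "abc123", action: compute_statutory_eligibility}
#     response: {observation: "<tool output>", reward: 0.0, done: false}
#
#   POST /step   (terminal action)
#     request:  {session_id: "abc123", action: submit_memo, memo: "<xml memo>"}
#     response: {reward: 0.87, done: true}
#
# Live episode events can also be streamed over /ws/{session_id}.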