# Product Requirements Document: CricketCaptain-LLM
**Version:** 1.0
**Date:** April 2025
**Author:** CricketCaptain Team

---

## 1. Executive Summary

CricketCaptain-LLM is a reinforcement learning benchmark built on the [OpenEnv](https://github.com/openenv-ai/openenv) framework for **adaptive strategic decision-making**. The environment uses cricket as its domain, where an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.

### Problem Statement

The [WDCT benchmark (2025)](https://arxiv.org/abs/2503.07003) demonstrates that LLMs score 0.49–0.76 on strategic coherence — they routinely say one thing and do another. But practical strategic agents face a harder problem: they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions.

In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.

### Solution

A two-sided cricket captaincy simulator where an LLM agent must:

1. **Plan before the ball**: select batter/bowler, set field, choose delivery or shot plan
2. **Execute actions** that are scored against the declared tactical plan
3. **Observe outcomes** and reflect after each ball
4. **Model the opponent** through fixed heuristic, live LLM, or cached LLM opponent policies
5. **Adapt strategies** across powerplay, middle, death overs, wickets, target pressure, and role switches

---

## 2. Target Users

| User | Use Case |
|------|----------|
| **ML Researchers** | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| **RL Engineers** | Benchmarking agents on long-horizon, multi-phase decision tasks |
| **Cricket Analytics** | Exploring AI-driven match strategy optimization |
| **Hackathon Participants** | OpenEnv hackathon — demonstrating tool-use RL environments |

---

## 3. Product Goals

| # | Goal | Success Metric |
|---|------|----------------|
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + `inference.py` + `eval.py` run end-to-end |

---

## 4. Match Flow & Game States

```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```

### Innings Termination Conditions

- All overs bowled (20 in T20, 50 in ODI)
- 10 wickets fallen
- Target chased (2nd innings only)
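Taken together, the three conditions reduce to a single predicate. A minimal sketch, assuming a T20 default and illustrative field names (not the environment's actual API):

```python
from typing import Optional

def innings_over(balls_bowled: int, wickets: int, score: int,
                 max_overs: int = 20, target: Optional[int] = None) -> bool:
    """Illustrative innings-termination check; names are assumptions."""
    if balls_bowled >= max_overs * 6:   # all overs bowled (20 in T20, 50 in ODI)
        return True
    if wickets >= 10:                   # side all out
        return True
    # target chased (2nd innings only; target is None in the 1st innings)
    if target is not None and score >= target:
        return True
    return False
```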
---

## 5. Functional Requirements

### 5.1 Tools (Agent Actions)

| ID | Tool | Phase | Parameters | Description |
|----|------|-------|------------|-------------|
| T1 | `call_toss` | TOSS | `call` (heads/tails), `decision` (bat/bowl) | Coin toss with 50/50 outcome |
| T2 | `select_batter` | BATTING | `name`, `style`, `aggression`, `rationale` | Choose batter profile for the situation |
| T3 | `set_strategy` | BATTING | `phase_intent`, `aggression` (0–1), `rationale` | Declare batting strategy |
| T4 | `plan_shot` | BATTING | `shot_intent`, `target_area`, `risk`, `rationale` | Pre-ball batting plan |
| T5 | `play_delivery` | BATTING | `shot_intent`, `explanation` | Execute a shot against the bowler |
| T6 | `choose_bowler` | BOWLING | `name`, `bowler_type`, `style`, `rationale` | Choose bowler profile for the over/situation |
| T7 | `set_bowling_strategy` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Declare bowling plan |
| T8 | `plan_delivery` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Pre-ball delivery plan |
| T9 | `set_field_setting` | BOWLING | `setting` (Aggressive/Balanced/Defensive) | Set field configuration |
| T10 | `bowl_delivery` | BOWLING | — | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | `reflect_after_ball` | BAT/BOWL | `reflection` | Post-ball tactical adjustment |
| T12 | `analyze_situation` | ANY | `query_type` | Query match intel (costs −0.02 reward) |
| T13 | `set_match_plan` | ANY | — | Establish long-term strategic match plan |
| T14 | `update_match_plan` | ANY | `reason` | Partially update match plan with justification |
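Since the `r_format` rubric in §5.3 rewards valid JSON tool calls, a pre-ball bowling sequence (T8 followed by T10) might be emitted as payloads like the ones below. The JSON envelope shown here is an assumption for illustration; the authoritative schema is whatever the environment's rendered prompt specifies.

```python
import json

# Hypothetical T8 (plan_delivery) call; argument names mirror the table above,
# but this exact envelope is an assumption, not the env's schema.
plan = {
    "tool": "plan_delivery",
    "arguments": {
        "bowler_type": "fast",
        "line": "off_stump",
        "length": "good",
        "delivery_type": "outswinger",
        "rationale": "New batter: attack the outside edge before he settles.",
    },
}
# T10 (bowl_delivery) takes no parameters; the outcome is simulated server-side.
bowl = {"tool": "bowl_delivery", "arguments": {}}

payload = json.dumps(plan)  # what the agent would actually send
```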
### 5.2 Observation Space

Each step returns a `CricketObservation` containing:

| Field | Type | Description |
|-------|------|-------------|
| `game_state` | str | Current phase: toss / batting / bowling / finished |
| `game_context` | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| `declared_strategy` | dict | Active batting strategy |
| `bowling_strategy` | dict | Active bowling strategy |
| `field_setting` | str | Current field configuration |
| `strategic_phase` | str | pre_over / pre_ball / ball_resolution / post_ball |
| `current_batter` | dict | Active batter profile |
| `current_bowler` | dict | Active bowler profile |
| `opponent_plan` | dict | Last visible opponent policy plan |
| `eval_pack_id` | str | Frozen benchmark pack identifier |
| `available_tools` | list | Valid tools for current phase |
| `last_ball_result` | str | Narrative of previous delivery |
| `prompt_text` | str | Rendered prompt for LLM consumption |
| `target` | int/null | Chase target (2nd innings only) |
| `innings_type` | str | "first" or "second" |
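A consumer-side sketch using the field names above (the real `CricketObservation` is a Pydantic model; the dict and helper here are illustrative):

```python
# Illustrative bowling-phase observation; values are made up.
obs = {
    "game_state": "bowling",
    "game_context": {"over": 14, "ball": 3, "score": 112, "wickets": 4,
                     "phase": "middle", "run_rate": 7.8, "bowler_type": "spin"},
    "strategic_phase": "pre_ball",
    "available_tools": ["plan_delivery", "set_field_setting", "bowl_delivery",
                        "reflect_after_ball", "analyze_situation"],
    "target": None,           # first innings, so no chase target yet
    "innings_type": "first",
}

def is_legal(tool: str, observation: dict) -> bool:
    """The env only accepts tools listed in available_tools for the phase."""
    return tool in observation["available_tools"]
```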
### 5.3 Reward System

| Rubric | Weight | Timing | Computation |
|--------|--------|--------|-------------|
| `r_cric` | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| `r_dream11` | 10% | Innings end | Dense auxiliary signal approximating individual player contributions (fantasy-points style) |
| `r_strategy` | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| `r_tools` | 20% | Per tool call | Information value of `analyze_situation` queries |
| `r_format` | 15% | Per step | Valid JSON structure compliance |

The top-level objective remains long-horizon match success over many simulated matches. The Dream11-style reward is auxiliary shaping, not the primary benchmark target.
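As a sanity check on the table (the weights sum to 100%), the composite reward presumably reduces to a weighted sum over the five rubrics. A sketch, with the caveat that `RewardCalculator`'s actual aggregation may differ:

```python
# Weights copied from the rubric table; the combiner itself is an assumption.
WEIGHTS = {"r_cric": 0.25, "r_dream11": 0.10, "r_strategy": 0.30,
           "r_tools": 0.20, "r_format": 0.15}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%

def composite_reward(rubrics: dict) -> float:
    """Weighted sum over the five rubrics; missing rubrics contribute 0."""
    return sum(w * rubrics.get(name, 0.0) for name, w in WEIGHTS.items())
```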
**Tool budget (operational constraint during play and training):** per over, the environment allows **3 no-fine “overhead” tool calls** among `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation`. Each additional overhead call in that over applies a **−0.04** step reward. `plan_shot`, `set_match_plan`, `update_match_plan`, and ball-advancing tools do **not** count against this limit. Training via `train.py` (TRL GRPO with `CricketEnvironment`) uses the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.
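The over-level budget can be expressed as a small penalty rule (a sketch; the constant and function names are illustrative, not the environment's code):

```python
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}
FREE_OVERHEAD_PER_OVER = 3
OVERHEAD_FINE = -0.04

def overhead_penalty(tool_calls_this_over: list) -> float:
    """Fine applied for each overhead call beyond the over's 3 free ones.

    plan_shot, set_match_plan, update_match_plan, and ball-advancing tools
    are not in OVERHEAD_TOOLS, so they never incur the fine.
    """
    overhead = sum(1 for t in tool_calls_this_over if t in OVERHEAD_TOOLS)
    return OVERHEAD_FINE * max(0, overhead - FREE_OVERHEAD_PER_OVER)
```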
### 5.4 Curriculum Stages

| Stage | Episodes | Active Rubrics | Objective |
|-------|----------|----------------|-----------|
| Stage 1 | 0–100 | `r_format` only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |
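Stage gating reduces to a simple episode-count switch (the 100-episode threshold comes from the table; the function itself is illustrative):

```python
ALL_RUBRICS = {"r_cric", "r_dream11", "r_strategy", "r_tools", "r_format"}

def active_rubrics(episode: int) -> set:
    """Stage 1 (< 100 episodes): format compliance only; Stage 2: all rubrics."""
    return {"r_format"} if episode < 100 else ALL_RUBRICS
```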
---

## 6. Non-Functional Requirements

| ID | Requirement | Specification |
|----|-------------|---------------|
| NF1 | **Latency** | Single step < 10 ms (no external API calls in the env) |
| NF2 | **Concurrency** | Multiple simultaneous WebSocket sessions (`SUPPORTS_CONCURRENT_SESSIONS = True`) |
| NF3 | **Determinism** | Reproducible with the `seed` parameter of `reset()` |
| NF4 | **Extensibility** | Mid-match starts via `reset(options={over, score, wickets, ...})` |
| NF5 | **Framework** | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |
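NF3 and NF4 together imply a reset contract along these lines. The stub below is a stand-in for the real server-side environment; all names here are assumptions:

```python
import random

class StubCricketEnv:
    """Illustrates the NF3/NF4 reset contract; not the real environment."""

    def reset(self, seed=None, options=None):
        rng = random.Random(seed)  # NF3: same seed -> same simulated match
        state = {"over": 0, "ball": 0, "score": 0, "wickets": 0,
                 "rng_probe": rng.random()}
        if options:                # NF4: overlay a mid-match starting state
            state.update(options)
        return state

env = StubCricketEnv()
fresh = env.reset(seed=42)                                     # reproducible full match
death = env.reset(seed=42, options={"over": 15, "score": 120, "wickets": 4})
```

Mid-match starts like `death` above are what make targeted phase training (e.g. death-overs drills) possible without replaying the first fourteen overs every episode.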
---

## 7. Architecture

```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```

### Key Components

| Component | Responsibility |
|-----------|---------------|
| `CricketEnvironment` | State machine managing TOSS → BAT → BOWL → FINISHED transitions |
| `MarkovCricketEngine` | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| `CoherenceGrader` | Scores alignment between declared strategy and executed actions (batting + bowling) |
| `RewardCalculator` | Computes composite five-rubric rewards with DLS par and target-based scoring |
| `CricketCaptainEnv` (client) | OpenEnv WebSocket client for remote agent communication |
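Putting the client side together, an episode loop over the WebSocket client might look like the following. The loop is a generic Gym/OpenEnv-style sketch; the exact return shape of `step()` on `CricketCaptainEnv` is an assumption:

```python
def run_episode(env, policy, max_steps=400):
    """Drive one match: policy maps an observation to a JSON tool call.

    Assumes a Gym-like step() -> (obs, reward, done); the real
    CricketCaptainEnv client may expose a different signature.
    """
    obs = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = policy(obs)        # e.g. LLM turns prompt_text into a tool call
        obs, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done or obs.get("game_state") == "finished":
            break
    return total_reward, steps
```

The `max_steps` guard matters because full matches run ~350+ steps (see §10) and a malformed policy could otherwise loop forever on no-op tool calls.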
---

## 8. Data Dependencies

| Data | Source | Location | Required? |
|------|--------|----------|-----------|
| Synthetic transitions | Hand-calibrated | `data/transition_probs.json` | Yes (fallback) |
| Cricsheet transitions | `scripts/curate_transitions.py` | `data/processed/cricket_transitions_v1.pkl` | No (enhances realism) |
| DLS par scores | Pre-computed | `data/dls_par_scores.json` | Yes |

---

## 9. Success Criteria

| Criterion | Measurement | Target |
|-----------|-------------|--------|
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |

### Current Baseline (Random Agent, 5 episodes)

```
total_score      : mean=87.6  std=12.4
wickets_lost     : mean=1.6   std=0.5
total_reward     : mean=1.511 std=0.158
mean_coherence   : mean=0.502 std=0.016
parse_error_rate : 0.0%
```

---

## 10. Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Bowling reward too easy/hard | Agent ignores bowling phase | Tune defensive reward scaling; evaluate bat vs bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episode length (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in a suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |

---

## 11. Future Enhancements

| Priority | Enhancement | Description |
|----------|-------------|-------------|
| P1 | **Multi-format support** | T20 (20 ov), ODI (50 ov), Test (unlimited) via `max_overs` |
| P1 | **LLM evaluation suite** | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | **Player-specific simulation** | Individual batter/bowler skill profiles |
| P2 | **Weather & pitch degradation** | Dynamic conditions affecting strategy |
| P3 | **Multi-agent matches** | Two LLM captains competing head-to-head |
| P3 | **Commentary generation** | Natural language match commentary from game state |