# Product Requirements Document: CricketCaptain-LLM

**Version:** 1.0
**Date:** April 2025
**Author:** CricketCaptain Team

---

## 1. Executive Summary

CricketCaptain-LLM is a reinforcement learning benchmark built on the [OpenEnv](https://github.com/openenv-ai/openenv) framework for **adaptive strategic decision-making**. The environment uses cricket as its domain: an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.

### Problem Statement

The [WDCT benchmark (2025)](https://arxiv.org/abs/2503.07003) demonstrates that LLMs score only 0.49–0.76 on strategic coherence: they routinely say one thing and do another. Practical strategic agents face a harder problem still; they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions. In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.

### Solution

A two-sided cricket captaincy simulator in which an LLM agent must:

1. **Plan before the ball**: select the batter/bowler, set the field, and choose a delivery or shot plan
2. **Execute actions** that are scored against the declared tactical plan
3. **Observe outcomes** and reflect after each ball
4. **Model the opponent** through fixed-heuristic, live-LLM, or cached-LLM opponent policies
5. **Adapt strategies** across powerplay, middle, and death overs, wickets, target pressure, and role switches

---
## 2. Target Users

| User | Use Case |
|------|----------|
| **ML Researchers** | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| **RL Engineers** | Benchmarking agents on long-horizon, multi-phase decision tasks |
| **Cricket Analytics** | Exploring AI-driven match strategy optimization |
| **Hackathon Participants** | OpenEnv hackathon: demonstrating tool-use RL environments |

---

## 3. Product Goals

| # | Goal | Success Metric |
|---|------|----------------|
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + `inference.py` + `eval.py` run end-to-end |

---

## 4. Match Flow & Game States

```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```

### Innings Termination Conditions

- All overs bowled (20 in T20, 50 in ODI)
- 10 wickets fallen
- Target chased (2nd innings only)
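The three termination rules can be sketched as a small predicate. This is a minimal sketch under assumed state fields (`overs_bowled`, `wickets`, `score`, `target`, `max_overs`); it is not the environment's actual state schema.

```python
# Illustrative sketch of the innings-termination rules above.
# Field names are hypothetical, not CricketEnvironment's internal schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class InningsState:
    overs_bowled: int
    wickets: int
    score: int
    target: Optional[int] = None   # set only in the 2nd innings
    max_overs: int = 20            # 20 for T20, 50 for ODI


def innings_over(s: InningsState) -> bool:
    if s.overs_bowled >= s.max_overs:                  # all overs bowled
        return True
    if s.wickets >= 10:                                # all out
        return True
    if s.target is not None and s.score >= s.target:   # target chased
        return True
    return False
```

The `target` check applies only when a target exists, which is how the "2nd innings only" condition falls out naturally.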
---

## 5. Functional Requirements

### 5.1 Tools (Agent Actions)

| ID | Tool | Phase | Parameters | Description |
|----|------|-------|------------|-------------|
| T1 | `call_toss` | TOSS | `call` (heads/tails), `decision` (bat/bowl) | Coin toss with a 50/50 outcome |
| T2 | `select_batter` | BATTING | `name`, `style`, `aggression`, `rationale` | Choose a batter profile for the situation |
| T3 | `set_strategy` | BATTING | `phase_intent`, `aggression` (0–1), `rationale` | Declare the batting strategy |
| T4 | `plan_shot` | BATTING | `shot_intent`, `target_area`, `risk`, `rationale` | Pre-ball batting plan |
| T5 | `play_delivery` | BATTING | `shot_intent`, `explanation` | Execute a shot against the bowler |
| T6 | `choose_bowler` | BOWLING | `name`, `bowler_type`, `style`, `rationale` | Choose a bowler profile for the over/situation |
| T7 | `set_bowling_strategy` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Declare the bowling plan |
| T8 | `plan_delivery` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Pre-ball delivery plan |
| T9 | `set_field_setting` | BOWLING | `setting` (Aggressive/Balanced/Defensive) | Set the field configuration |
| T10 | `bowl_delivery` | BOWLING | — | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | `reflect_after_ball` | BAT/BOWL | `reflection` | Post-ball tactical adjustment |
| T12 | `analyze_situation` | ANY | `query_type` | Query match intel (costs −0.02 reward) |
| T13 | `set_match_plan` | ANY | — | Establish a long-term strategic match plan |
| T14 | `update_match_plan` | ANY | `reason` | Partially update the match plan with justification |

### 5.2 Observation Space

Each step returns a `CricketObservation` containing:

| Field | Type | Description |
|-------|------|-------------|
| `game_state` | str | Current phase: toss / batting / bowling / finished |
| `game_context` | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| `declared_strategy` | dict | Active batting strategy |
| `bowling_strategy` | dict | Active bowling strategy |
| `field_setting` | str | Current field configuration |
| `strategic_phase` | str | pre_over / pre_ball / ball_resolution / post_ball |
| `current_batter` | dict | Active batter profile |
| `current_bowler` | dict | Active bowler profile |
| `opponent_plan` | dict | Last visible opponent policy plan |
| `eval_pack_id` | str | Frozen benchmark pack identifier |
| `available_tools` | list | Valid tools for the current phase |
| `last_ball_result` | str | Narrative of the previous delivery |
| `prompt_text` | str | Rendered prompt for LLM consumption |
| `target` | int/null | Chase target (2nd innings only) |
| `innings_type` | str | "first" or "second" |

### 5.3 Reward System

| Rubric | Weight | Timing | Computation |
|--------|--------|--------|-------------|
| `r_cric` | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| `r_dream11` | 10% | Innings end | Dense auxiliary cricket-contribution proxy |
| `r_strategy` | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| `r_tools` | 20% | Per tool call | Information value of `analyze_situation` queries |
| `r_format` | 15% | Per step | Valid JSON structure compliance |

The top-level objective remains long-horizon match success over many simulated matches; the Dream11-style reward is auxiliary shaping, not the primary benchmark target.

**Tool budget (operational constraint during play and training):** per over, the environment allows **3 no-fine "overhead" tool calls** among `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation`. Each additional overhead call in that over applies a **−0.04** step-reward penalty. `plan_shot`, `set_match_plan`, `update_match_plan`, and the ball-advancing tools do **not** count against this limit.
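The budget rule can be sketched as a simple per-over counter. Names such as `OVERHEAD_TOOLS` and `overhead_penalty` are illustrative assumptions, not the environment's internals; the sketch also leaves out the separate −0.02 query cost that `analyze_situation` carries on its own.

```python
# Illustrative sketch of the per-over overhead-tool budget described above.
# Constant and function names are hypothetical, not the actual implementation.
OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}
FREE_CALLS_PER_OVER = 3
OVERAGE_PENALTY = -0.04


def overhead_penalty(tool: str, overhead_calls_this_over: int) -> float:
    """Step-reward penalty for one tool call.

    overhead_calls_this_over counts overhead calls already made this over,
    before this call; the counter resets at the start of each over.
    """
    if tool not in OVERHEAD_TOOLS:
        return 0.0  # plan_shot, match-plan, and ball-advancing tools are exempt
    if overhead_calls_this_over >= FREE_CALLS_PER_OVER:
        return OVERAGE_PENALTY
    return 0.0
```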
Training via `train.py` (TRL GRPO with `CricketEnvironment`) uses the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.

### 5.4 Curriculum Stages

| Stage | Episodes | Active Rubrics | Objective |
|-------|----------|----------------|-----------|
| Stage 1 | 0–100 | `r_format` only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |

---

## 6. Non-Functional Requirements

| ID | Requirement | Specification |
|----|-------------|---------------|
| NF1 | **Latency** | Single step < 10 ms (no external API calls in the env) |
| NF2 | **Concurrency** | Multiple simultaneous WebSocket sessions (`SUPPORTS_CONCURRENT_SESSIONS = True`) |
| NF3 | **Determinism** | Reproducible via the seed parameter in `reset()` |
| NF4 | **Extensibility** | Mid-match starts via `reset(options={over, score, wickets, ...})` |
| NF5 | **Framework** | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |

---

## 7. Architecture

```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```

### Key Components

| Component | Responsibility |
|-----------|---------------|
| `CricketEnvironment` | State machine managing TOSS → BAT → BOWL → FINISHED transitions |
| `MarkovCricketEngine` | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| `CoherenceGrader` | Scores alignment between declared strategy and executed actions (batting + bowling) |
| `RewardCalculator` | Computes composite five-rubric rewards with DLS par and target-based scoring |
| `CricketCaptainEnv` (client) | OpenEnv WebSocket client for remote agent communication |

---

## 8. Data Dependencies

| Data | Source | Location | Required? |
|------|--------|----------|-----------|
| Synthetic transitions | Hand-calibrated | `data/transition_probs.json` | Yes (fallback) |
| Cricsheet transitions | `scripts/curate_transitions.py` | `data/processed/cricket_transitions_v1.pkl` | No (enhances realism) |
| DLS par scores | Pre-computed | `data/dls_par_scores.json` | Yes |

---

## 9. Success Criteria

| Criterion | Measurement | Target |
|-----------|-------------|--------|
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |

### Current Baseline (Random Agent, 5 episodes)

```
total_score      : mean=87.6   std=12.4
wickets_lost     : mean=1.6    std=0.5
total_reward     : mean=1.511  std=0.158
mean_coherence   : mean=0.502  std=0.016
parse_error_rate : 0.0%
```
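For reference, the rubric mix from Section 5.3 can be sketched as a weighted sum. This is only a sketch of the mixing weights: the actual `RewardCalculator` applies each rubric at a different timing (episode end, innings end, per step), and the function name here is an assumption.

```python
# Illustrative weighted combination of the five reward rubrics (Section 5.3).
# The real RewardCalculator aggregates these at different timings; this
# sketch only shows the declared 25/10/30/20/15 mix.
RUBRIC_WEIGHTS = {
    "r_cric": 0.25,
    "r_dream11": 0.10,
    "r_strategy": 0.30,
    "r_tools": 0.20,
    "r_format": 0.15,
}


def composite_reward(rubric_scores: dict) -> float:
    """Weighted sum of per-rubric scores; missing rubrics count as 0."""
    return sum(w * rubric_scores.get(k, 0.0) for k, w in RUBRIC_WEIGHTS.items())


# The weights cover the full reward: they sum to 100%.
assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
```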
---

## 10. Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Bowling reward too easy/hard | Agent ignores the bowling phase | Tune defensive reward scaling; evaluate bat-vs-bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episode length (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in the suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |

---

## 11. Future Enhancements

| Priority | Enhancement | Description |
|----------|-------------|-------------|
| P1 | **Multi-format support** | T20 (20 ov), ODI (50 ov), Test (unlimited) via `max_overs` |
| P1 | **LLM evaluation suite** | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | **Player-specific simulation** | Individual batter/bowler skill profiles |
| P2 | **Weather & pitch degradation** | Dynamic conditions affecting strategy |
| P3 | **Multi-agent matches** | Two LLM captains competing head-to-head |
| P3 | **Commentary generation** | Natural-language match commentary from game state |