
Product Requirements Document: CricketCaptain-LLM

Version: 1.0
Date: April 2025
Author: CricketCaptain Team


1. Executive Summary

CricketCaptain-LLM is a reinforcement learning benchmark built on the OpenEnv framework for adaptive strategic decision-making. The environment uses cricket as its domain, where an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.

Problem Statement

The WDCT benchmark (2025) demonstrates that LLMs score 0.49–0.76 on strategic coherence — they routinely say one thing and do another. But practical strategic agents face a harder problem: they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions.

In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.

Solution

A two-sided cricket captaincy simulator where an LLM agent must:

  1. Plan before the ball: select batter/bowler, set field, choose delivery or shot plan
  2. Execute actions that are scored against the declared tactical plan
  3. Observe outcomes and reflect after each ball
  4. Model the opponent through fixed heuristic, live LLM, or cached LLM opponent policies
  5. Adapt strategies across powerplay, middle, death overs, wickets, target pressure, and role switches
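
Concretely, steps 1–5 form a per-ball loop. The sketch below illustrates it with hypothetical `env` and `agent` objects; the method names and the `step()` return shape are assumptions, not the actual client API:

```python
# Illustrative per-ball captaincy loop. `env` and `agent` are hypothetical
# stand-ins; observation field names come from Section 5.2.
obs = env.reset(seed=7)
while obs.game_state != "finished":
    # Plan: read the rendered prompt and pick a phase-legal tool call.
    tool_call = agent.decide(obs.prompt_text, obs.available_tools)

    # Execute and observe: the environment grades the call against the
    # declared plan and returns the next observation plus a step reward.
    obs, reward, done, info = env.step(tool_call)

    # Reflect and adapt: after a resolved ball the agent may revise its plan.
    if obs.strategic_phase == "post_ball":
        agent.note_outcome(obs.last_ball_result, reward)
```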

2. Target Users

| User | Use Case |
| --- | --- |
| ML Researchers | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| RL Engineers | Benchmarking agents on long-horizon, multi-phase decision tasks |
| Cricket Analytics | Exploring AI-driven match strategy optimization |
| Hackathon Participants | OpenEnv hackathon: demonstrating tool-use RL environments |

3. Product Goals

| # | Goal | Success Metric |
| --- | --- | --- |
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + inference.py + eval.py run end-to-end |

4. Match Flow & Game States

```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```

Innings Termination Conditions

  • All overs bowled (20 in T20, 50 in ODI)
  • 10 wickets fallen
  • Target chased (2nd innings only)
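
These conditions translate directly into a termination predicate. A minimal sketch, with illustrative names:

```python
def innings_over(balls_bowled: int, wickets: int, score: int,
                 max_overs: int = 20, target: int | None = None) -> bool:
    """Illustrative check for the three termination conditions above."""
    if balls_bowled >= max_overs * 6:            # all overs bowled (20 T20 / 50 ODI)
        return True
    if wickets >= 10:                            # all ten wickets have fallen
        return True
    if target is not None and score >= target:   # target chased (2nd innings only)
        return True
    return False
```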

5. Functional Requirements

5.1 Tools (Agent Actions)

| ID | Tool | Phase | Parameters | Description |
| --- | --- | --- | --- | --- |
| T1 | call_toss | TOSS | call (heads/tails), decision (bat/bowl) | Coin toss with 50/50 outcome |
| T2 | select_batter | BATTING | name, style, aggression, rationale | Choose batter profile for the situation |
| T3 | set_strategy | BATTING | phase_intent, aggression (0–1), rationale | Declare batting strategy |
| T4 | plan_shot | BATTING | shot_intent, target_area, risk, rationale | Pre-ball batting plan |
| T5 | play_delivery | BATTING | shot_intent, explanation | Execute a shot against the bowler |
| T6 | choose_bowler | BOWLING | name, bowler_type, style, rationale | Choose bowler profile for the over/situation |
| T7 | set_bowling_strategy | BOWLING | bowler_type, line, length, delivery_type, rationale | Declare bowling plan |
| T8 | plan_delivery | BOWLING | bowler_type, line, length, delivery_type, rationale | Pre-ball delivery plan |
| T9 | set_field_setting | BOWLING | setting (Aggressive/Balanced/Defensive) | Set field configuration |
| T10 | bowl_delivery | BOWLING | (none) | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | reflect_after_ball | BAT/BOWL | reflection | Post-ball tactical adjustment |
| T12 | analyze_situation | ANY | query_type | Query match intel (costs −0.02 reward) |
| T13 | set_match_plan | ANY | (none) | Establish long-term strategic match plan |
| T14 | update_match_plan | ANY | reason | Partially update match plan with justification |
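
Agents emit tool calls as JSON (graded by r_format, Section 5.3). The example below shows a plausible T8 plan_delivery call; the argument names come from the Parameters column, while the "tool"/"arguments" envelope is an assumed shape, not the documented schema:

```python
import json

# Hypothetical T8 plan_delivery call; only the argument names are taken
# from the tool table, the envelope keys are illustrative.
call = {
    "tool": "plan_delivery",
    "arguments": {
        "bowler_type": "fast",
        "line": "off_stump",
        "length": "good",
        "delivery_type": "outswinger",
        "rationale": "New batter and early swing; attack the outside edge.",
    },
}
print(json.dumps(call, indent=2))
```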

5.2 Observation Space

Each step returns a CricketObservation containing:

| Field | Type | Description |
| --- | --- | --- |
| game_state | str | Current phase: toss / batting / bowling / finished |
| game_context | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| declared_strategy | dict | Active batting strategy |
| bowling_strategy | dict | Active bowling strategy |
| field_setting | str | Current field configuration |
| strategic_phase | str | pre_over / pre_ball / ball_resolution / post_ball |
| current_batter | dict | Active batter profile |
| current_bowler | dict | Active bowler profile |
| opponent_plan | dict | Last visible opponent policy plan |
| eval_pack_id | str | Frozen benchmark pack identifier |
| available_tools | list | Valid tools for the current phase |
| last_ball_result | str | Narrative of the previous delivery |
| prompt_text | str | Rendered prompt for LLM consumption |
| target | int/null | Chase target (2nd innings only) |
| innings_type | str | "first" or "second" |
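
Since NF5 specifies Pydantic v2 models, the observation can be pictured as the model below. This is a best-effort reading of the table, not the repo's actual definition:

```python
from typing import Any, Optional
from pydantic import BaseModel

class CricketObservation(BaseModel):
    """Sketch mirroring the field table above; types are assumptions."""
    game_state: str                 # toss / batting / bowling / finished
    game_context: dict[str, Any]    # over, ball, score, wickets, phase, ...
    declared_strategy: dict[str, Any]
    bowling_strategy: dict[str, Any]
    field_setting: str
    strategic_phase: str            # pre_over / pre_ball / ball_resolution / post_ball
    current_batter: dict[str, Any]
    current_bowler: dict[str, Any]
    opponent_plan: dict[str, Any]
    eval_pack_id: str
    available_tools: list[str]
    last_ball_result: str
    prompt_text: str
    target: Optional[int] = None    # chase target (2nd innings only)
    innings_type: str               # "first" or "second"
```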

5.3 Reward System

| Rubric | Weight | Timing | Computation |
| --- | --- | --- | --- |
| r_cric | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| r_dream11 | 10% | Innings end | Dense auxiliary cricket contribution proxy |
| r_strategy | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| r_tools | 20% | Per tool call | Information value of analyze_situation queries |
| r_format | 15% | Per step | Valid JSON structure compliance |

The top-level objective remains long-horizon match success over many simulated matches. Dream11-style reward is auxiliary shaping, not the primary benchmark target.
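
Read as a single scalar, the mix is a weighted sum over the five rubrics. A minimal sketch of that combination (not the RewardCalculator implementation):

```python
# Rubric weights from the table above.
WEIGHTS = {
    "r_cric": 0.25,     # episode end: win/loss vs DLS par or target
    "r_dream11": 0.10,  # innings end: auxiliary contribution proxy
    "r_strategy": 0.30, # per decision: coherence + adaptation + awareness + regret
    "r_tools": 0.20,    # per tool call: information value of queries
    "r_format": 0.15,   # per step: valid JSON compliance
}

def composite_reward(rubrics: dict[str, float]) -> float:
    """Weighted sum of whichever rubric scores are active at this timing."""
    return sum(WEIGHTS[name] * rubrics.get(name, 0.0) for name in WEIGHTS)
```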

Tool budget (operational constraint during play and training): per over, the environment allows three penalty-free "overhead" tool calls among set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, and analyze_situation; each additional overhead call in the same over incurs a −0.04 step reward. plan_shot, set_match_plan, update_match_plan, and ball-advancing tools do not count against this limit. Training via train.py (TRL GRPO with CricketEnvironment) applies the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.
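
A sketch of that budget rule, with illustrative names (not the environment's code):

```python
# Per-over overhead budget as described above.
OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}
FREE_CALLS_PER_OVER = 3
OVERAGE_PENALTY = -0.04

class OverheadBudget:
    def __init__(self) -> None:
        self.calls_this_over = 0

    def on_new_over(self) -> None:
        self.calls_this_over = 0

    def charge(self, tool: str) -> float:
        """Return the step-reward penalty for this call (0.0 while in budget)."""
        if tool not in OVERHEAD_TOOLS:
            return 0.0
        self.calls_this_over += 1
        return OVERAGE_PENALTY if self.calls_this_over > FREE_CALLS_PER_OVER else 0.0
```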

5.4 Curriculum Stages

| Stage | Episodes | Active Rubrics | Objective |
| --- | --- | --- | --- |
| Stage 1 | 0–100 | r_format only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |
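
The staging rule amounts to swapping the active weight vector at the episode threshold; a minimal sketch:

```python
def active_rubric_weights(episode: int) -> dict[str, float]:
    """Illustrative stage gate: format-only warm-up, then the full mix."""
    if episode < 100:  # Stage 1: learn valid JSON tool calls first
        return {"r_format": 1.0}
    # Stage 2: the full five-rubric mix from Section 5.3
    return {"r_cric": 0.25, "r_dream11": 0.10, "r_strategy": 0.30,
            "r_tools": 0.20, "r_format": 0.15}
```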

6. Non-Functional Requirements

| ID | Requirement | Specification |
| --- | --- | --- |
| NF1 | Latency | Single step < 10 ms (no external API calls in the env) |
| NF2 | Concurrency | Multiple simultaneous WebSocket sessions (SUPPORTS_CONCURRENT_SESSIONS = True) |
| NF3 | Determinism | Reproducible with the seed parameter in reset() |
| NF4 | Extensibility | Mid-match starts via reset(options={over, score, wickets, ...}) |
| NF5 | Framework | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |
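
NF3 and NF4 together imply reset calls like the following; a hypothetical usage sketch, since the exact client signature may differ:

```python
# Deterministic full match from the toss (NF3).
obs = env.reset(seed=42)

# Mid-match start at the death overs for targeted phase training (NF4).
obs = env.reset(seed=42, options={"over": 16, "score": 142, "wickets": 4})
```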

7. Architecture

```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```

Key Components

| Component | Responsibility |
| --- | --- |
| CricketEnvironment | State machine managing TOSS → BAT → BOWL → FINISHED transitions |
| MarkovCricketEngine | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| CoherenceGrader | Scores alignment between declared strategy and executed actions (batting + bowling) |
| RewardCalculator | Computes the composite rubric-weighted reward with DLS par and target-based scoring |
| CricketCaptainEnv (client) | OpenEnv WebSocket client for remote agent communication |
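
A hypothetical client session tying the pieces together; the import path and constructor arguments are assumptions, not the documented API:

```python
from cricket_env import CricketCaptainEnv  # assumed import path

# Connect to the OpenEnv server (app.py) over WebSocket and start a match.
env = CricketCaptainEnv(base_url="ws://localhost:8000")
obs = env.reset(seed=0)
print(obs.game_state, obs.available_tools)
```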

8. Data Dependencies

| Data | Source | Location | Required? |
| --- | --- | --- | --- |
| Synthetic transitions | Hand-calibrated | data/transition_probs.json | Yes (fallback) |
| Cricsheet transitions | scripts/curate_transitions.py | data/processed/cricket_transitions_v1.pkl | No (enhances realism) |
| DLS par scores | Pre-computed | data/dls_par_scores.json | Yes |
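
The fallback order implied by the table can be sketched as follows; the paths come from the table, the loader itself is illustrative:

```python
import json
import pickle
from pathlib import Path

def load_transitions() -> dict:
    """Prefer curated Cricsheet transitions, fall back to synthetic probs."""
    curated = Path("data/processed/cricket_transitions_v1.pkl")
    if curated.exists():
        with curated.open("rb") as f:
            return pickle.load(f)
    with open("data/transition_probs.json") as f:
        return json.load(f)
```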

9. Success Criteria

| Criterion | Measurement | Target |
| --- | --- | --- |
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |

Current Baseline (Random Agent, 5 episodes)

```
total_score      : mean=87.6   std=12.4
wickets_lost     : mean=1.6    std=0.5
total_reward     : mean=1.511  std=0.158
mean_coherence   : mean=0.502  std=0.016
parse_error_rate : 0.0%
```

10. Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Bowling reward too easy/hard | Agent ignores the bowling phase | Tune defensive reward scaling; evaluate bat vs bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episodes (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in the suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |

11. Future Enhancements

| Priority | Enhancement | Description |
| --- | --- | --- |
| P1 | Multi-format support | T20 (20 ov), ODI (50 ov), Test (unlimited) via max_overs |
| P1 | LLM evaluation suite | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | Player-specific simulation | Individual batter/bowler skill profiles |
| P2 | Weather & pitch degradation | Dynamic conditions affecting strategy |
| P3 | Multi-agent matches | Two LLM captains competing head-to-head |
| P3 | Commentary generation | Natural language match commentary from game state |