Product Requirements Document: CricketCaptain-LLM
Version: 1.0
Date: April 2025
Author: CricketCaptain Team
1. Executive Summary
CricketCaptain-LLM is a reinforcement learning benchmark built on the OpenEnv framework for adaptive strategic decision-making. The environment uses cricket as its domain, where an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.
Problem Statement
The WDCT benchmark (2025) demonstrates that LLMs score 0.49–0.76 on strategic coherence — they routinely say one thing and do another. But practical strategic agents face a harder problem: they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions.
In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.
Solution
A two-sided cricket captaincy simulator where an LLM agent must:
- Plan before the ball: select batter/bowler, set field, choose delivery or shot plan
- Execute actions that are scored against the declared tactical plan
- Observe outcomes and reflect after each ball
- Model the opponent through fixed heuristic, live LLM, or cached LLM opponent policies
- Adapt strategies across powerplay, middle, death overs, wickets, target pressure, and role switches
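The per-ball loop implied by these bullets can be sketched with stub objects; every class and method name below (`StubEnv`, `StubAgent`, `ball_cycle`) is illustrative, not the project's actual API:

```python
# Stub environment/agent pair illustrating the plan -> execute -> reflect cycle.
class StubEnv:
    def step(self, action):
        # The real environment returns a CricketObservation; a dict stands in here.
        return {"last_action": action["tool"]}

class StubAgent:
    def call(self, tool, obs):
        # A real agent would fill tool parameters from the observation.
        return {"tool": tool}

def ball_cycle(agent, env, obs):
    """One batting ball: plan the shot, execute it, then reflect."""
    for tool in ("plan_shot", "play_delivery", "reflect_after_ball"):
        obs = env.step(agent.call(tool, obs))
    return obs

final_obs = ball_cycle(StubAgent(), StubEnv(), {})  # cycle ends after reflection
```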
2. Target Users
| User | Use Case |
| --- | --- |
| ML Researchers | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| RL Engineers | Benchmarking agents on long-horizon, multi-phase decision tasks |
| Cricket Analytics | Exploring AI-driven match strategy optimization |
| Hackathon Participants | OpenEnv hackathon — demonstrating tool-use RL environments |
3. Product Goals
| # | Goal | Success Metric |
| --- | --- | --- |
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + inference.py + eval.py run end-to-end |
4. Match Flow & Game States
```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```
Innings Termination Conditions
- All overs bowled (20 in T20, 50 in ODI)
- 10 wickets fallen
- Target chased (2nd innings only)
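The three conditions above can be sketched as one predicate; the function name and signature are illustrative, not the engine's real API:

```python
from typing import Optional

def innings_over(balls_bowled: int, wickets: int, max_overs: int = 20,
                 score: Optional[int] = None, target: Optional[int] = None) -> bool:
    """Illustrative check of the three innings-termination conditions."""
    if balls_bowled >= max_overs * 6:   # all overs bowled (20 in T20, 50 in ODI)
        return True
    if wickets >= 10:                   # all ten wickets fallen
        return True
    if target is not None and score is not None and score >= target:
        return True                     # target chased (2nd innings only)
    return False
```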
5. Functional Requirements
5.1 Tools (Agent Actions)
| ID | Tool | Phase | Parameters | Description |
| --- | --- | --- | --- | --- |
| T1 | call_toss | TOSS | call (heads/tails), decision (bat/bowl) | Coin toss with 50/50 outcome |
| T2 | select_batter | BATTING | name, style, aggression, rationale | Choose batter profile for the situation |
| T3 | set_strategy | BATTING | phase_intent, aggression (0–1), rationale | Declare batting strategy |
| T4 | plan_shot | BATTING | shot_intent, target_area, risk, rationale | Pre-ball batting plan |
| T5 | play_delivery | BATTING | shot_intent, explanation | Execute a shot against the bowler |
| T6 | choose_bowler | BOWLING | name, bowler_type, style, rationale | Choose bowler profile for the over/situation |
| T7 | set_bowling_strategy | BOWLING | bowler_type, line, length, delivery_type, rationale | Declare bowling plan |
| T8 | plan_delivery | BOWLING | bowler_type, line, length, delivery_type, rationale | Pre-ball delivery plan |
| T9 | set_field_setting | BOWLING | setting (Aggressive/Balanced/Defensive) | Set field configuration |
| T10 | bowl_delivery | BOWLING | — | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | reflect_after_ball | BAT/BOWL | reflection | Post-ball tactical adjustment |
| T12 | analyze_situation | ANY | query_type | Query match intel (costs −0.02 reward) |
| T13 | set_match_plan | ANY | — | Establish long-term strategic match plan |
| T14 | update_match_plan | ANY | reason | Partially update match plan with justification |
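A well-formed tool call might look like the payload below. Only the tool name and parameter names come from the table; the `tool`/`parameters` envelope keys and the sample values are assumptions:

```python
import json

# Hypothetical T8 (plan_delivery) call; envelope keys are assumed, parameter
# names are taken from the tools table above.
tool_call = {
    "tool": "plan_delivery",
    "parameters": {
        "bowler_type": "leg_spin",
        "line": "outside_off",
        "length": "good",
        "delivery_type": "googly",
        "rationale": "New batter has struggled against wrist spin this phase.",
    },
}
payload = json.dumps(tool_call)   # r_format rewards valid JSON like this
```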
5.2 Observation Space
Each step returns a CricketObservation containing:
| Field | Type | Description |
| --- | --- | --- |
| game_state | str | Current phase: toss / batting / bowling / finished |
| game_context | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| declared_strategy | dict | Active batting strategy |
| bowling_strategy | dict | Active bowling strategy |
| field_setting | str | Current field configuration |
| strategic_phase | str | pre_over / pre_ball / ball_resolution / post_ball |
| current_batter | dict | Active batter profile |
| current_bowler | dict | Active bowler profile |
| opponent_plan | dict | Last visible opponent policy plan |
| eval_pack_id | str | Frozen benchmark pack identifier |
| available_tools | list | Valid tools for current phase |
| last_ball_result | str | Narrative of previous delivery |
| prompt_text | str | Rendered prompt for LLM consumption |
| target | int or null | Chase target (2nd innings only) |
| innings_type | str | "first" or "second" |
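A trimmed stand-in for the observation can clarify the shape; the real CricketObservation is a Pydantic v2 model carrying every field in the table, while this dataclass keeps only a few for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative subset of CricketObservation; field names match the table above,
# but the class itself is a sketch, not the environment's model.
@dataclass
class ObservationSketch:
    game_state: str                 # toss / batting / bowling / finished
    strategic_phase: str            # pre_over / pre_ball / ball_resolution / post_ball
    game_context: dict = field(default_factory=dict)
    available_tools: list = field(default_factory=list)
    target: Optional[int] = None    # only set in the 2nd innings
    innings_type: str = "first"

obs = ObservationSketch(
    game_state="batting",
    strategic_phase="pre_ball",
    game_context={"over": 7, "ball": 3, "score": 58, "wickets": 2},
    available_tools=["plan_shot", "play_delivery", "analyze_situation"],
)
```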
5.3 Reward System
| Rubric | Weight | Timing | Computation |
| --- | --- | --- | --- |
| r_cric | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| r_dream11 | 10% | Innings end | Dense auxiliary cricket contribution proxy |
| r_strategy | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| r_tools | 20% | Per tool call | Information value of analyze_situation queries |
| r_format | 15% | Per step | Valid JSON structure compliance |
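The rubric weights combine as a plain weighted sum; the sample rubric values below are illustrative, not from a real episode:

```python
# Weights taken from the rubric table above.
WEIGHTS = {"r_cric": 0.25, "r_dream11": 0.10, "r_strategy": 0.30,
           "r_tools": 0.20, "r_format": 0.15}

def composite_reward(rubrics):
    # Missing rubrics contribute 0 (e.g. r_cric before episode end).
    return sum(w * rubrics.get(name, 0.0) for name, w in WEIGHTS.items())

r = composite_reward({"r_cric": 0.8, "r_dream11": 0.5, "r_strategy": 0.7,
                      "r_tools": 0.6, "r_format": 1.0})   # 0.73
```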
The top-level objective remains long-horizon match success over many simulated matches. Dream11-style reward is auxiliary shaping, not the primary benchmark target.
Tool budget (operational constraint during play and training): each over, the environment allows three penalty-free "overhead" tool calls among set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, and analyze_situation. Each additional overhead call in that over incurs a −0.04 step reward. plan_shot, set_match_plan, update_match_plan, and the ball-advancing tools do not count against this limit. Training via train.py (TRL GRPO with CricketEnvironment) applies the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.
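The budget rule above reduces to a small penalty function; the function name is illustrative, while the tool set, the 3-call allowance, and the −0.04 fine come from the constraint as stated:

```python
# Overhead tools listed in the budget constraint above.
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

def overhead_penalty(calls_this_over, free_calls=3, fine=0.04):
    """Penalty for overhead tool calls beyond the per-over allowance."""
    overhead = sum(1 for tool in calls_this_over if tool in OVERHEAD_TOOLS)
    return -fine * max(0, overhead - free_calls)

# Four overhead calls in one over (plan_shot is exempt) -> one over budget -> -0.04.
p = overhead_penalty(["plan_delivery", "analyze_situation", "analyze_situation",
                      "reflect_after_ball", "plan_shot"])
```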
5.4 Curriculum Stages
| Stage | Episodes | Active Rubrics | Objective |
| --- | --- | --- | --- |
| Stage 1 | 0–100 | r_format only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |
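The stage gate is a single threshold; the function name is illustrative, and the 100-episode boundary and rubric sets come from the curriculum table:

```python
# All rubrics from the reward table in §5.3.
ALL_RUBRICS = {"r_cric", "r_dream11", "r_strategy", "r_tools", "r_format"}

def active_rubrics(episode):
    """Stage 1 (< 100 episodes): format only; Stage 2: every rubric."""
    return {"r_format"} if episode < 100 else ALL_RUBRICS

stage1 = active_rubrics(50)
stage2 = active_rubrics(250)
```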
6. Non-Functional Requirements
| ID | Requirement | Specification |
| --- | --- | --- |
| NF1 | Latency | Single step < 10 ms (no external API calls in env) |
| NF2 | Concurrency | Multiple WebSocket sessions simultaneously (SUPPORTS_CONCURRENT_SESSIONS = True) |
| NF3 | Determinism | Reproducible with seed parameter in reset() |
| NF4 | Extensibility | Mid-match starts via reset(options={over, score, wickets, ...}) |
| NF5 | Framework | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |
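NF3's contract is that two engines built from the same seed replay identically. The toy engine below only illustrates that contract; the real environment threads the seed through reset() and `ToyEngine` is not part of the project:

```python
import random

class ToyEngine:
    """Hypothetical seeded ball simulator used only to illustrate NF3."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def bowl(self):
        # Simplified outcome alphabet: runs or a wicket.
        return self.rng.choice([0, 1, 2, 3, 4, 6, "W"])

e1, e2 = ToyEngine(seed=7), ToyEngine(seed=7)
outcomes1 = [e1.bowl() for _ in range(12)]
outcomes2 = [e2.bowl() for _ in range(12)]   # same seed -> same over sequence
```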
7. Architecture
```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```
Key Components
| Component | Responsibility |
| --- | --- |
| CricketEnvironment | State machine managing TOSS→BAT→BOWL→FINISHED transitions |
| MarkovCricketEngine | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| CoherenceGrader | Scores alignment between declared strategy and executed actions (batting + bowling) |
| RewardCalculator | Computes composite 4-rubric rewards with DLS par and target-based scoring |
| CricketCaptainEnv (client) | OpenEnv WebSocket client for remote agent communication |
8. Data Dependencies
| Data | Source | Location | Required? |
| --- | --- | --- | --- |
| Synthetic transitions | Hand-calibrated | data/transition_probs.json | Yes (fallback) |
| Cricsheet transitions | scripts/curate_transitions.py | data/processed/cricket_transitions_v1.pkl | No (enhances realism) |
| DLS par scores | Pre-computed | data/dls_par_scores.json | Yes |
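The fallback order in the table can be sketched as a loader that prefers the curated Cricsheet tables and falls back to the required synthetic file; the function name is illustrative, the paths come from the table:

```python
import json
import os
import pickle

def load_transitions(processed="data/processed/cricket_transitions_v1.pkl",
                     synthetic="data/transition_probs.json"):
    """Prefer curated Cricsheet transitions; fall back to synthetic probs."""
    if os.path.exists(processed):
        with open(processed, "rb") as f:
            return pickle.load(f), "cricsheet"
    with open(synthetic) as f:
        return json.load(f), "synthetic"
```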
9. Success Criteria
| Criterion | Measurement | Target |
| --- | --- | --- |
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |
Current Baseline (Random Agent, 5 episodes)
```
total_score      : mean=87.6  std=12.4
wickets_lost     : mean=1.6   std=0.5
total_reward     : mean=1.511 std=0.158
mean_coherence   : mean=0.502 std=0.016
parse_error_rate : 0.0%
```
10. Risks & Mitigations
| Risk | Impact | Mitigation |
| --- | --- | --- |
| Bowling reward too easy/hard | Agent ignores bowling phase | Tune defensive reward scaling; evaluate bat vs bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episode length (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |
11. Future Enhancements
| Priority | Enhancement | Description |
| --- | --- | --- |
| P1 | Multi-format support | T20 (20 ov), ODI (50 ov), Test (unlimited) via max_overs |
| P1 | LLM evaluation suite | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | Player-specific simulation | Individual batter/bowler skill profiles |
| P2 | Weather & pitch degradation | Dynamic conditions affecting strategy |
| P3 | Multi-agent matches | Two LLM captains competing head-to-head |
| P3 | Commentary generation | Natural language match commentary from game state |