# Product Requirements Document: CricketCaptain-LLM

**Version:** 1.0
**Date:** April 2025
**Author:** CricketCaptain Team

---

## 1. Executive Summary

CricketCaptain-LLM is a reinforcement learning benchmark built on the [OpenEnv](https://github.com/openenv-ai/openenv) framework for **adaptive strategic decision-making**. The environment uses cricket as its domain: an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.

### Problem Statement

The [WDCT benchmark (2025)](https://arxiv.org/abs/2503.07003) demonstrates that LLMs score only 0.49–0.76 on strategic coherence: they routinely say one thing and do another. Practical strategic agents face a harder problem still; they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions. In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.

### Solution

A two-sided cricket captaincy simulator in which an LLM agent must:

1. **Plan before the ball**: select the batter/bowler, set the field, and choose a delivery or shot plan
2. **Execute actions** that are scored against the declared tactical plan
3. **Observe outcomes** and reflect after each ball
4. **Model the opponent** through fixed-heuristic, live-LLM, or cached-LLM opponent policies
5. **Adapt strategies** across powerplay, middle, and death overs, wickets, target pressure, and role switches

---
## 2. Target Users

| User | Use Case |
|------|----------|
| **ML Researchers** | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| **RL Engineers** | Benchmarking agents on long-horizon, multi-phase decision tasks |
| **Cricket Analytics** | Exploring AI-driven match strategy optimization |
| **Hackathon Participants** | OpenEnv hackathon: demonstrating tool-use RL environments |

---

## 3. Product Goals

| # | Goal | Success Metric |
|---|------|----------------|
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + `inference.py` + `eval.py` run end-to-end |

---

## 4. Match Flow & Game States

```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```

### Innings Termination Conditions

- All overs bowled (20 in T20, 50 in ODI)
- 10 wickets fallen
- Target chased (2nd innings only)
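The three termination rules can be sketched as a small predicate. This is a minimal sketch under assumed state fields (`overs_bowled`, `wickets`, `score`, `target`, `max_overs`); it is not the environment's actual state schema.

```python
# Illustrative sketch of the innings-termination rules above.
# Field names are hypothetical, not CricketEnvironment's internal schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class InningsState:
    overs_bowled: int
    wickets: int
    score: int
    target: Optional[int] = None   # set only in the 2nd innings
    max_overs: int = 20            # 20 for T20, 50 for ODI


def innings_over(s: InningsState) -> bool:
    if s.overs_bowled >= s.max_overs:                  # all overs bowled
        return True
    if s.wickets >= 10:                                # all out
        return True
    if s.target is not None and s.score >= s.target:   # target chased
        return True
    return False
```

The `target` check applies only when a target exists, which is how the "2nd innings only" condition falls out naturally.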
---

## 5. Functional Requirements

### 5.1 Tools (Agent Actions)

| ID | Tool | Phase | Parameters | Description |
|----|------|-------|------------|-------------|
| T1 | `call_toss` | TOSS | `call` (heads/tails), `decision` (bat/bowl) | Coin toss with a 50/50 outcome |
| T2 | `select_batter` | BATTING | `name`, `style`, `aggression`, `rationale` | Choose a batter profile for the situation |
| T3 | `set_strategy` | BATTING | `phase_intent`, `aggression` (0–1), `rationale` | Declare the batting strategy |
| T4 | `plan_shot` | BATTING | `shot_intent`, `target_area`, `risk`, `rationale` | Pre-ball batting plan |
| T5 | `play_delivery` | BATTING | `shot_intent`, `explanation` | Execute a shot against the bowler |
| T6 | `choose_bowler` | BOWLING | `name`, `bowler_type`, `style`, `rationale` | Choose a bowler profile for the over/situation |
| T7 | `set_bowling_strategy` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Declare the bowling plan |
| T8 | `plan_delivery` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Pre-ball delivery plan |
| T9 | `set_field_setting` | BOWLING | `setting` (Aggressive/Balanced/Defensive) | Set the field configuration |
| T10 | `bowl_delivery` | BOWLING | — | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | `reflect_after_ball` | BAT/BOWL | `reflection` | Post-ball tactical adjustment |
| T12 | `analyze_situation` | ANY | `query_type` | Query match intel (costs −0.02 reward) |
| T13 | `set_match_plan` | ANY | — | Establish a long-term strategic match plan |
| T14 | `update_match_plan` | ANY | `reason` | Partially update the match plan with justification |

### 5.2 Observation Space

Each step returns a `CricketObservation` containing:

| Field | Type | Description |
|-------|------|-------------|
| `game_state` | str | Current phase: toss / batting / bowling / finished |
| `game_context` | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| `declared_strategy` | dict | Active batting strategy |
| `bowling_strategy` | dict | Active bowling strategy |
| `field_setting` | str | Current field configuration |
| `strategic_phase` | str | pre_over / pre_ball / ball_resolution / post_ball |
| `current_batter` | dict | Active batter profile |
| `current_bowler` | dict | Active bowler profile |
| `opponent_plan` | dict | Last visible opponent policy plan |
| `eval_pack_id` | str | Frozen benchmark pack identifier |
| `available_tools` | list | Valid tools for the current phase |
| `last_ball_result` | str | Narrative of the previous delivery |
| `prompt_text` | str | Rendered prompt for LLM consumption |
| `target` | int/null | Chase target (2nd innings only) |
| `innings_type` | str | "first" or "second" |

### 5.3 Reward System

| Rubric | Weight | Timing | Computation |
|--------|--------|--------|-------------|
| `r_cric` | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| `r_dream11` | 10% | Innings end | Dense auxiliary cricket-contribution proxy |
| `r_strategy` | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| `r_tools` | 20% | Per tool call | Information value of `analyze_situation` queries |
| `r_format` | 15% | Per step | Valid JSON structure compliance |

The top-level objective remains long-horizon match success over many simulated matches; the Dream11-style reward is auxiliary shaping, not the primary benchmark target.

**Tool budget (operational constraint during play and training):** per over, the environment allows **3 no-fine "overhead" tool calls** among `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation`. Each additional overhead call in that over applies a **−0.04** step-reward penalty. `plan_shot`, `set_match_plan`, `update_match_plan`, and the ball-advancing tools do **not** count against this limit.
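The budget rule can be sketched as a simple per-over counter. Names such as `OVERHEAD_TOOLS` and `overhead_penalty` are illustrative assumptions, not the environment's internals; the sketch also leaves out the separate −0.02 query cost that `analyze_situation` carries on its own.

```python
# Illustrative sketch of the per-over overhead-tool budget described above.
# Constant and function names are hypothetical, not the actual implementation.
OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}
FREE_CALLS_PER_OVER = 3
OVERAGE_PENALTY = -0.04


def overhead_penalty(tool: str, overhead_calls_this_over: int) -> float:
    """Step-reward penalty for one tool call.

    overhead_calls_this_over counts overhead calls already made this over,
    before this call; the counter resets at the start of each over.
    """
    if tool not in OVERHEAD_TOOLS:
        return 0.0  # plan_shot, match-plan, and ball-advancing tools are exempt
    if overhead_calls_this_over >= FREE_CALLS_PER_OVER:
        return OVERAGE_PENALTY
    return 0.0
```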
Training via `train.py` (TRL GRPO with `CricketEnvironment`) uses the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.

### 5.4 Curriculum Stages

| Stage | Episodes | Active Rubrics | Objective |
|-------|----------|----------------|-----------|
| Stage 1 | 0–100 | `r_format` only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |

---

## 6. Non-Functional Requirements

| ID | Requirement | Specification |
|----|-------------|---------------|
| NF1 | **Latency** | Single step < 10 ms (no external API calls in the env) |
| NF2 | **Concurrency** | Multiple simultaneous WebSocket sessions (`SUPPORTS_CONCURRENT_SESSIONS = True`) |
| NF3 | **Determinism** | Reproducible via the seed parameter in `reset()` |
| NF4 | **Extensibility** | Mid-match starts via `reset(options={over, score, wickets, ...})` |
| NF5 | **Framework** | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |

---

## 7. Architecture

```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```

### Key Components

| Component | Responsibility |
|-----------|---------------|
| `CricketEnvironment` | State machine managing TOSS → BAT → BOWL → FINISHED transitions |
| `MarkovCricketEngine` | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| `CoherenceGrader` | Scores alignment between declared strategy and executed actions (batting + bowling) |
| `RewardCalculator` | Computes composite five-rubric rewards with DLS par and target-based scoring |
| `CricketCaptainEnv` (client) | OpenEnv WebSocket client for remote agent communication |

---

## 8. Data Dependencies

| Data | Source | Location | Required? |
|------|--------|----------|-----------|
| Synthetic transitions | Hand-calibrated | `data/transition_probs.json` | Yes (fallback) |
| Cricsheet transitions | `scripts/curate_transitions.py` | `data/processed/cricket_transitions_v1.pkl` | No (enhances realism) |
| DLS par scores | Pre-computed | `data/dls_par_scores.json` | Yes |

---

## 9. Success Criteria

| Criterion | Measurement | Target |
|-----------|-------------|--------|
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |

### Current Baseline (Random Agent, 5 episodes)

```
total_score      : mean=87.6   std=12.4
wickets_lost     : mean=1.6    std=0.5
total_reward     : mean=1.511  std=0.158
mean_coherence   : mean=0.502  std=0.016
parse_error_rate : 0.0%
```
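For reference, the rubric mix from Section 5.3 can be sketched as a weighted sum. This is only a sketch of the mixing weights: the actual `RewardCalculator` applies each rubric at a different timing (episode end, innings end, per step), and the function name here is an assumption.

```python
# Illustrative weighted combination of the five reward rubrics (Section 5.3).
# The real RewardCalculator aggregates these at different timings; this
# sketch only shows the declared 25/10/30/20/15 mix.
RUBRIC_WEIGHTS = {
    "r_cric": 0.25,
    "r_dream11": 0.10,
    "r_strategy": 0.30,
    "r_tools": 0.20,
    "r_format": 0.15,
}


def composite_reward(rubric_scores: dict) -> float:
    """Weighted sum of per-rubric scores; missing rubrics count as 0."""
    return sum(w * rubric_scores.get(k, 0.0) for k, w in RUBRIC_WEIGHTS.items())


# The weights cover the full reward: they sum to 100%.
assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
```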
---

## 10. Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Bowling reward too easy/hard | Agent ignores the bowling phase | Tune defensive reward scaling; evaluate bat-vs-bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episode length (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in the suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |

---

## 11. Future Enhancements

| Priority | Enhancement | Description |
|----------|-------------|-------------|
| P1 | **Multi-format support** | T20 (20 ov), ODI (50 ov), Test (unlimited) via `max_overs` |
| P1 | **LLM evaluation suite** | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | **Player-specific simulation** | Individual batter/bowler skill profiles |
| P2 | **Weather & pitch degradation** | Dynamic conditions affecting strategy |
| P3 | **Multi-agent matches** | Two LLM captains competing head-to-head |
| P3 | **Commentary generation** | Natural-language match commentary from game state |