# Product Requirements Document: CricketCaptain-LLM
**Version:** 1.0
**Date:** April 2025
**Author:** CricketCaptain Team

---

## 1. Executive Summary

CricketCaptain-LLM is a reinforcement learning benchmark built on the [OpenEnv](https://github.com/openenv-ai/openenv) framework for **adaptive strategic decision-making**. The environment uses cricket as its domain, where an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.

### Problem Statement

The [WDCT benchmark (2025)](https://arxiv.org/abs/2503.07003) demonstrates that LLMs score 0.49–0.76 on strategic coherence — they routinely say one thing and do another. But practical strategic agents face a harder problem: they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions.

In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.

### Solution

A two-sided cricket captaincy simulator where an LLM agent must:

1. **Plan before the ball**: select batter/bowler, set field, choose delivery or shot plan
2. **Execute actions** that are scored against the declared tactical plan
3. **Observe outcomes** and reflect after each ball
4. **Model the opponent** through fixed heuristic, live LLM, or cached LLM opponent policies
5. **Adapt strategies** across powerplay, middle, death overs, wickets, target pressure, and role switches

---

## 2. Target Users

| User | Use Case |
|------|----------|
| **ML Researchers** | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| **RL Engineers** | Benchmarking agents on long-horizon, multi-phase decision tasks |
| **Cricket Analytics** | Exploring AI-driven match strategy optimization |
| **Hackathon Participants** | OpenEnv hackathon — demonstrating tool-use RL environments |

---

## 3. Product Goals

| # | Goal | Success Metric |
|---|------|----------------|
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + `inference.py` + `eval.py` run end-to-end |

---

## 4. Match Flow & Game States

```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```

### Innings Termination Conditions

- All overs bowled (20 in T20, 50 in ODI)
- 10 wickets fallen
- Target chased (2nd innings only)
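Taken together, the three conditions reduce to a single predicate. A minimal sketch, assuming a T20 default and illustrative field names (not the environment's actual API):

```python
from typing import Optional

def innings_over(balls_bowled: int, wickets: int, score: int,
                 max_overs: int = 20, target: Optional[int] = None) -> bool:
    """Illustrative innings-termination check; names are assumptions."""
    if balls_bowled >= max_overs * 6:   # all overs bowled (20 in T20, 50 in ODI)
        return True
    if wickets >= 10:                   # side all out
        return True
    # target chased (2nd innings only; target is None in the 1st innings)
    if target is not None and score >= target:
        return True
    return False
```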
---

## 5. Functional Requirements

### 5.1 Tools (Agent Actions)

| ID | Tool | Phase | Parameters | Description |
|----|------|-------|------------|-------------|
| T1 | `call_toss` | TOSS | `call` (heads/tails), `decision` (bat/bowl) | Coin toss with 50/50 outcome |
| T2 | `select_batter` | BATTING | `name`, `style`, `aggression`, `rationale` | Choose batter profile for the situation |
| T3 | `set_strategy` | BATTING | `phase_intent`, `aggression` (0–1), `rationale` | Declare batting strategy |
| T4 | `plan_shot` | BATTING | `shot_intent`, `target_area`, `risk`, `rationale` | Pre-ball batting plan |
| T5 | `play_delivery` | BATTING | `shot_intent`, `explanation` | Execute a shot against the bowler |
| T6 | `choose_bowler` | BOWLING | `name`, `bowler_type`, `style`, `rationale` | Choose bowler profile for the over/situation |
| T7 | `set_bowling_strategy` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Declare bowling plan |
| T8 | `plan_delivery` | BOWLING | `bowler_type`, `line`, `length`, `delivery_type`, `rationale` | Pre-ball delivery plan |
| T9 | `set_field_setting` | BOWLING | `setting` (Aggressive/Balanced/Defensive) | Set field configuration |
| T10 | `bowl_delivery` | BOWLING | — | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | `reflect_after_ball` | BAT/BOWL | `reflection` | Post-ball tactical adjustment |
| T12 | `analyze_situation` | ANY | `query_type` | Query match intel (costs −0.02 reward) |
| T13 | `set_match_plan` | ANY | — | Establish long-term strategic match plan |
| T14 | `update_match_plan` | ANY | `reason` | Partially update match plan with justification |
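Since the `r_format` rubric in §5.3 rewards valid JSON tool calls, a pre-ball bowling sequence (T8 followed by T10) might be emitted as payloads like the ones below. The JSON envelope shown here is an assumption for illustration; the authoritative schema is whatever the environment's rendered prompt specifies.

```python
import json

# Hypothetical T8 (plan_delivery) call; argument names mirror the table above,
# but this exact envelope is an assumption, not the env's schema.
plan = {
    "tool": "plan_delivery",
    "arguments": {
        "bowler_type": "fast",
        "line": "off_stump",
        "length": "good",
        "delivery_type": "outswinger",
        "rationale": "New batter: attack the outside edge before he settles.",
    },
}
# T10 (bowl_delivery) takes no parameters; the outcome is simulated server-side.
bowl = {"tool": "bowl_delivery", "arguments": {}}

payload = json.dumps(plan)  # what the agent would actually send
```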
### 5.2 Observation Space

Each step returns a `CricketObservation` containing:

| Field | Type | Description |
|-------|------|-------------|
| `game_state` | str | Current phase: toss / batting / bowling / finished |
| `game_context` | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| `declared_strategy` | dict | Active batting strategy |
| `bowling_strategy` | dict | Active bowling strategy |
| `field_setting` | str | Current field configuration |
| `strategic_phase` | str | pre_over / pre_ball / ball_resolution / post_ball |
| `current_batter` | dict | Active batter profile |
| `current_bowler` | dict | Active bowler profile |
| `opponent_plan` | dict | Last visible opponent policy plan |
| `eval_pack_id` | str | Frozen benchmark pack identifier |
| `available_tools` | list | Valid tools for current phase |
| `last_ball_result` | str | Narrative of previous delivery |
| `prompt_text` | str | Rendered prompt for LLM consumption |
| `target` | int/null | Chase target (2nd innings only) |
| `innings_type` | str | "first" or "second" |
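A consumer-side sketch using the field names above (the real `CricketObservation` is a Pydantic model; the dict and helper here are illustrative):

```python
# Illustrative bowling-phase observation; values are made up.
obs = {
    "game_state": "bowling",
    "game_context": {"over": 14, "ball": 3, "score": 112, "wickets": 4,
                     "phase": "middle", "run_rate": 7.8, "bowler_type": "spin"},
    "strategic_phase": "pre_ball",
    "available_tools": ["plan_delivery", "set_field_setting", "bowl_delivery",
                        "reflect_after_ball", "analyze_situation"],
    "target": None,           # first innings, so no chase target yet
    "innings_type": "first",
}

def is_legal(tool: str, observation: dict) -> bool:
    """The env only accepts tools listed in available_tools for the phase."""
    return tool in observation["available_tools"]
```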
### 5.3 Reward System

| Rubric | Weight | Timing | Computation |
|--------|--------|--------|-------------|
| `r_cric` | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| `r_dream11` | 10% | Innings end | Dense auxiliary signal approximating individual player contributions (fantasy-points style) |
| `r_strategy` | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| `r_tools` | 20% | Per tool call | Information value of `analyze_situation` queries |
| `r_format` | 15% | Per step | Valid JSON structure compliance |

The top-level objective remains long-horizon match success over many simulated matches. The Dream11-style reward is auxiliary shaping, not the primary benchmark target.
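As a sanity check on the table (the weights sum to 100%), the composite reward presumably reduces to a weighted sum over the five rubrics. A sketch, with the caveat that `RewardCalculator`'s actual aggregation may differ:

```python
# Weights copied from the rubric table; the combiner itself is an assumption.
WEIGHTS = {"r_cric": 0.25, "r_dream11": 0.10, "r_strategy": 0.30,
           "r_tools": 0.20, "r_format": 0.15}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%

def composite_reward(rubrics: dict) -> float:
    """Weighted sum over the five rubrics; missing rubrics contribute 0."""
    return sum(w * rubrics.get(name, 0.0) for name, w in WEIGHTS.items())
```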
**Tool budget (operational constraint during play and training):** per over, the environment allows **3 no-fine “overhead” tool calls** among `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation`. Each additional overhead call in that over applies a **−0.04** step reward. `plan_shot`, `set_match_plan`, `update_match_plan`, and ball-advancing tools do **not** count against this limit. Training via `train.py` (TRL GRPO with `CricketEnvironment`) uses the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.
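The over-level budget can be expressed as a small penalty rule (a sketch; the constant and function names are illustrative, not the environment's code):

```python
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}
FREE_OVERHEAD_PER_OVER = 3
OVERHEAD_FINE = -0.04

def overhead_penalty(tool_calls_this_over: list) -> float:
    """Fine applied for each overhead call beyond the over's 3 free ones.

    plan_shot, set_match_plan, update_match_plan, and ball-advancing tools
    are not in OVERHEAD_TOOLS, so they never incur the fine.
    """
    overhead = sum(1 for t in tool_calls_this_over if t in OVERHEAD_TOOLS)
    return OVERHEAD_FINE * max(0, overhead - FREE_OVERHEAD_PER_OVER)
```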
### 5.4 Curriculum Stages

| Stage | Episodes | Active Rubrics | Objective |
|-------|----------|----------------|-----------|
| Stage 1 | 0–100 | `r_format` only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |
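Stage gating reduces to a simple episode-count switch (the 100-episode threshold comes from the table; the function itself is illustrative):

```python
ALL_RUBRICS = {"r_cric", "r_dream11", "r_strategy", "r_tools", "r_format"}

def active_rubrics(episode: int) -> set:
    """Stage 1 (< 100 episodes): format compliance only; Stage 2: all rubrics."""
    return {"r_format"} if episode < 100 else ALL_RUBRICS
```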
---

## 6. Non-Functional Requirements

| ID | Requirement | Specification |
|----|-------------|---------------|
| NF1 | **Latency** | Single step < 10 ms (no external API calls in the env) |
| NF2 | **Concurrency** | Multiple simultaneous WebSocket sessions (`SUPPORTS_CONCURRENT_SESSIONS = True`) |
| NF3 | **Determinism** | Reproducible with the `seed` parameter of `reset()` |
| NF4 | **Extensibility** | Mid-match starts via `reset(options={over, score, wickets, ...})` |
| NF5 | **Framework** | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |
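NF3 and NF4 together imply a reset contract along these lines. The stub below is a stand-in for the real server-side environment; all names here are assumptions:

```python
import random

class StubCricketEnv:
    """Illustrates the NF3/NF4 reset contract; not the real environment."""

    def reset(self, seed=None, options=None):
        rng = random.Random(seed)  # NF3: same seed -> same simulated match
        state = {"over": 0, "ball": 0, "score": 0, "wickets": 0,
                 "rng_probe": rng.random()}
        if options:                # NF4: overlay a mid-match starting state
            state.update(options)
        return state

env = StubCricketEnv()
fresh = env.reset(seed=42)                                     # reproducible full match
death = env.reset(seed=42, options={"over": 15, "score": 120, "wickets": 4})
```

Mid-match starts like `death` above are what make targeted phase training (e.g. death-overs drills) possible without replaying the first fourteen overs every episode.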
---

## 7. Architecture

```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```

### Key Components

| Component | Responsibility |
|-----------|---------------|
| `CricketEnvironment` | State machine managing TOSS → BAT → BOWL → FINISHED transitions |
| `MarkovCricketEngine` | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| `CoherenceGrader` | Scores alignment between declared strategy and executed actions (batting + bowling) |
| `RewardCalculator` | Computes composite five-rubric rewards with DLS par and target-based scoring |
| `CricketCaptainEnv` (client) | OpenEnv WebSocket client for remote agent communication |
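Putting the client side together, an episode loop over the WebSocket client might look like the following. The loop is a generic Gym/OpenEnv-style sketch; the exact return shape of `step()` on `CricketCaptainEnv` is an assumption:

```python
def run_episode(env, policy, max_steps=400):
    """Drive one match: policy maps an observation to a JSON tool call.

    Assumes a Gym-like step() -> (obs, reward, done); the real
    CricketCaptainEnv client may expose a different signature.
    """
    obs = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = policy(obs)        # e.g. LLM turns prompt_text into a tool call
        obs, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done or obs.get("game_state") == "finished":
            break
    return total_reward, steps
```

The `max_steps` guard matters because full matches run ~350+ steps (see §10) and a malformed policy could otherwise loop forever on no-op tool calls.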
---

## 8. Data Dependencies

| Data | Source | Location | Required? |
|------|--------|----------|-----------|
| Synthetic transitions | Hand-calibrated | `data/transition_probs.json` | Yes (fallback) |
| Cricsheet transitions | `scripts/curate_transitions.py` | `data/processed/cricket_transitions_v1.pkl` | No (enhances realism) |
| DLS par scores | Pre-computed | `data/dls_par_scores.json` | Yes |

---

## 9. Success Criteria

| Criterion | Measurement | Target |
|-----------|-------------|--------|
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |

### Current Baseline (Random Agent, 5 episodes)

```
total_score      : mean=87.6  std=12.4
wickets_lost     : mean=1.6   std=0.5
total_reward     : mean=1.511 std=0.158
mean_coherence   : mean=0.502 std=0.016
parse_error_rate : 0.0%
```

---

## 10. Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Bowling reward too easy/hard | Agent ignores bowling phase | Tune defensive reward scaling; evaluate bat vs bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episode length (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in a suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |

---

## 11. Future Enhancements

| Priority | Enhancement | Description |
|----------|-------------|-------------|
| P1 | **Multi-format support** | T20 (20 ov), ODI (50 ov), Test (unlimited) via `max_overs` |
| P1 | **LLM evaluation suite** | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | **Player-specific simulation** | Individual batter/bowler skill profiles |
| P2 | **Weather & pitch degradation** | Dynamic conditions affecting strategy |
| P3 | **Multi-agent matches** | Two LLM captains competing head-to-head |
| P3 | **Commentary generation** | Natural language match commentary from game state |