
Product Requirements Document: CricketCaptain-LLM

Version: 1.0
Date: April 2025
Author: CricketCaptain Team


1. Executive Summary

CricketCaptain-LLM is a reinforcement learning benchmark built on the OpenEnv framework for adaptive strategic decision-making. The environment uses cricket as its domain, where an LLM agent acts as a team captain managing a full match across batting, bowling, fielding, player selection, opponent modeling, and post-ball tactical adjustment.

Problem Statement

The WDCT benchmark (2025) demonstrates that LLMs score 0.49–0.76 on strategic coherence — they routinely say one thing and do another. But practical strategic agents face a harder problem: they must keep plans coherent while adapting to adversarial, changing, long-horizon conditions.

In cricket, a captain must react to wickets, target pressure, batter/bowler matchups, field changes, phase transitions, and the opponent's visible strategy. A good LLM captain should not merely produce a plausible plan; it should revise that plan when the match state demands it.

Solution

A two-sided cricket captaincy simulator where an LLM agent must:

  1. Plan before the ball: select batter/bowler, set field, choose delivery or shot plan
  2. Execute actions that are scored against the declared tactical plan
  3. Observe outcomes and reflect after each ball
  4. Model the opponent through fixed heuristic, live LLM, or cached LLM opponent policies
  5. Adapt strategies across powerplay, middle, death overs, wickets, target pressure, and role switches
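
Concretely, steps 1–5 form a per-ball loop. The sketch below illustrates it with hypothetical `env` and `agent` objects; the method names and the `step()` return shape are assumptions, not the actual client API:

```python
# Illustrative per-ball captaincy loop. `env` and `agent` are hypothetical
# stand-ins; observation field names come from Section 5.2.
obs = env.reset(seed=7)
while obs.game_state != "finished":
    # Plan: read the rendered prompt and pick a phase-legal tool call.
    tool_call = agent.decide(obs.prompt_text, obs.available_tools)

    # Execute and observe: the environment grades the call against the
    # declared plan and returns the next observation plus a step reward.
    obs, reward, done, info = env.step(tool_call)

    # Reflect and adapt: after a resolved ball the agent may revise its plan.
    if obs.strategic_phase == "post_ball":
        agent.note_outcome(obs.last_ball_result, reward)
```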

2. Target Users

| User | Use Case |
| --- | --- |
| ML Researchers | Training LLMs for adaptive strategic decision-making via RL (GRPO/PPO) |
| RL Engineers | Benchmarking agents on long-horizon, multi-phase decision tasks |
| Cricket Analytics | Exploring AI-driven match strategy optimization |
| Hackathon Participants | OpenEnv hackathon: demonstrating tool-use RL environments |

3. Product Goals

| # | Goal | Success Metric |
| --- | --- | --- |
| G1 | Train LLMs to align declared strategy with executed actions | Coherence score ≥ 0.75 |
| G2 | Measure adaptation after outcomes and phase changes | Adaptation score ≥ 0.70 on held-out scenarios |
| G3 | Measure opponent-aware captaincy | Opponent-awareness score ≥ 0.65 |
| G4 | Full-match simulation covering captain decisions | Agent plays TOSS → BAT → BOWL → FINISHED autonomously |
| G5 | Compatible with OpenEnv for WebSocket-based training | Server + inference.py + eval.py run end-to-end |

4. Match Flow & Game States

```mermaid
stateDiagram-v2
    [*] --> TOSS
    TOSS --> BATTING: Agent wins toss & bats
    TOSS --> BOWLING: Agent wins toss & bowls
    TOSS --> BATTING: Opponent wins & bowls
    TOSS --> BOWLING: Opponent wins & bats
    BATTING --> BOWLING: 1st innings ends → role swap
    BOWLING --> BATTING: 1st innings ends → role swap
    BATTING --> FINISHED: 2nd innings ends
    BOWLING --> FINISHED: 2nd innings ends
    FINISHED --> [*]
```

Innings Termination Conditions

  • All overs bowled (20 in T20, 50 in ODI)
  • 10 wickets fallen
  • Target chased (2nd innings only)
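
These conditions translate directly into a termination predicate. A minimal sketch, with illustrative names:

```python
def innings_over(balls_bowled: int, wickets: int, score: int,
                 max_overs: int = 20, target: int | None = None) -> bool:
    """Illustrative check for the three termination conditions above."""
    if balls_bowled >= max_overs * 6:            # all overs bowled (20 T20 / 50 ODI)
        return True
    if wickets >= 10:                            # all ten wickets have fallen
        return True
    if target is not None and score >= target:   # target chased (2nd innings only)
        return True
    return False
```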

5. Functional Requirements

5.1 Tools (Agent Actions)

| ID | Tool | Phase | Parameters | Description |
| --- | --- | --- | --- | --- |
| T1 | call_toss | TOSS | call (heads/tails), decision (bat/bowl) | Coin toss with 50/50 outcome |
| T2 | select_batter | BATTING | name, style, aggression, rationale | Choose batter profile for the situation |
| T3 | set_strategy | BATTING | phase_intent, aggression (0–1), rationale | Declare batting strategy |
| T4 | plan_shot | BATTING | shot_intent, target_area, risk, rationale | Pre-ball batting plan |
| T5 | play_delivery | BATTING | shot_intent, explanation | Execute a shot against the bowler |
| T6 | choose_bowler | BOWLING | name, bowler_type, style, rationale | Choose bowler profile for the over/situation |
| T7 | set_bowling_strategy | BOWLING | bowler_type, line, length, delivery_type, rationale | Declare bowling plan |
| T8 | plan_delivery | BOWLING | bowler_type, line, length, delivery_type, rationale | Pre-ball delivery plan |
| T9 | set_field_setting | BOWLING | setting (Aggressive/Balanced/Defensive) | Set field configuration |
| T10 | bowl_delivery | BOWLING | (none) | Bowl the next ball; outcome simulated from both sides' plans |
| T11 | reflect_after_ball | BAT/BOWL | reflection | Post-ball tactical adjustment |
| T12 | analyze_situation | ANY | query_type | Query match intel (costs −0.02 reward) |
| T13 | set_match_plan | ANY | (none) | Establish long-term strategic match plan |
| T14 | update_match_plan | ANY | reason | Partially update match plan with justification |
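
Agents emit tool calls as JSON (graded by r_format, Section 5.3). The example below shows a plausible T8 plan_delivery call; the argument names come from the Parameters column, while the "tool"/"arguments" envelope is an assumed shape, not the documented schema:

```python
import json

# Hypothetical T8 plan_delivery call; only the argument names are taken
# from the tool table, the envelope keys are illustrative.
call = {
    "tool": "plan_delivery",
    "arguments": {
        "bowler_type": "fast",
        "line": "off_stump",
        "length": "good",
        "delivery_type": "outswinger",
        "rationale": "New batter and early swing; attack the outside edge.",
    },
}
print(json.dumps(call, indent=2))
```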

5.2 Observation Space

Each step returns a CricketObservation containing:

| Field | Type | Description |
| --- | --- | --- |
| game_state | str | Current phase: toss / batting / bowling / finished |
| game_context | dict | Over, ball, score, wickets, phase, run_rate, bowler_type |
| declared_strategy | dict | Active batting strategy |
| bowling_strategy | dict | Active bowling strategy |
| field_setting | str | Current field configuration |
| strategic_phase | str | pre_over / pre_ball / ball_resolution / post_ball |
| current_batter | dict | Active batter profile |
| current_bowler | dict | Active bowler profile |
| opponent_plan | dict | Last visible opponent policy plan |
| eval_pack_id | str | Frozen benchmark pack identifier |
| available_tools | list | Valid tools for the current phase |
| last_ball_result | str | Narrative of the previous delivery |
| prompt_text | str | Rendered prompt for LLM consumption |
| target | int/null | Chase target (2nd innings only) |
| innings_type | str | "first" or "second" |
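
Since NF5 specifies Pydantic v2 models, the observation can be pictured as the model below. This is a best-effort reading of the table, not the repo's actual definition:

```python
from typing import Any, Optional
from pydantic import BaseModel

class CricketObservation(BaseModel):
    """Sketch mirroring the field table above; types are assumptions."""
    game_state: str                 # toss / batting / bowling / finished
    game_context: dict[str, Any]    # over, ball, score, wickets, phase, ...
    declared_strategy: dict[str, Any]
    bowling_strategy: dict[str, Any]
    field_setting: str
    strategic_phase: str            # pre_over / pre_ball / ball_resolution / post_ball
    current_batter: dict[str, Any]
    current_bowler: dict[str, Any]
    opponent_plan: dict[str, Any]
    eval_pack_id: str
    available_tools: list[str]
    last_ball_result: str
    prompt_text: str
    target: Optional[int] = None    # chase target (2nd innings only)
    innings_type: str               # "first" or "second"
```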

5.3 Reward System

| Rubric | Weight | Timing | Computation |
| --- | --- | --- | --- |
| r_cric | 25% | Episode end | Win/loss plus score vs DLS par (1st inn) or target (2nd inn) |
| r_dream11 | 10% | Innings end | Dense auxiliary cricket contribution proxy |
| r_strategy | 30% | Per decision/delivery | Coherence + adaptation + opponent awareness + regret score |
| r_tools | 20% | Per tool call | Information value of analyze_situation queries |
| r_format | 15% | Per step | Valid JSON structure compliance |

The top-level objective remains long-horizon match success over many simulated matches. Dream11-style reward is auxiliary shaping, not the primary benchmark target.
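
Read as a single scalar, the mix is a weighted sum over the five rubrics. A minimal sketch of that combination (not the RewardCalculator implementation):

```python
# Rubric weights from the table above.
WEIGHTS = {
    "r_cric": 0.25,     # episode end: win/loss vs DLS par or target
    "r_dream11": 0.10,  # innings end: auxiliary contribution proxy
    "r_strategy": 0.30, # per decision: coherence + adaptation + awareness + regret
    "r_tools": 0.20,    # per tool call: information value of queries
    "r_format": 0.15,   # per step: valid JSON compliance
}

def composite_reward(rubrics: dict[str, float]) -> float:
    """Weighted sum of whichever rubric scores are active at this timing."""
    return sum(WEIGHTS[name] * rubrics.get(name, 0.0) for name in WEIGHTS)
```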

Tool budget (operational constraint during play and training): per over, the environment allows three penalty-free "overhead" tool calls among set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, and analyze_situation; each additional overhead call in the same over incurs a −0.04 step reward. plan_shot, set_match_plan, update_match_plan, and ball-advancing tools do not count against this limit. Training via train.py (TRL GRPO with CricketEnvironment) applies the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.
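
A sketch of that budget rule, with illustrative names (not the environment's code):

```python
# Per-over overhead budget as described above.
OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}
FREE_CALLS_PER_OVER = 3
OVERAGE_PENALTY = -0.04

class OverheadBudget:
    def __init__(self) -> None:
        self.calls_this_over = 0

    def on_new_over(self) -> None:
        self.calls_this_over = 0

    def charge(self, tool: str) -> float:
        """Return the step-reward penalty for this call (0.0 while in budget)."""
        if tool not in OVERHEAD_TOOLS:
            return 0.0
        self.calls_this_over += 1
        return OVERAGE_PENALTY if self.calls_this_over > FREE_CALLS_PER_OVER else 0.0
```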

5.4 Curriculum Stages

| Stage | Episodes | Active Rubrics | Objective |
| --- | --- | --- | --- |
| Stage 1 | 0–100 | r_format only | Learn valid JSON tool calls |
| Stage 2 | 100+ | All rubrics | Learn strategic coherence and improve match win/loss outcomes |
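
The staging rule amounts to swapping the active weight vector at the episode threshold; a minimal sketch:

```python
def active_rubric_weights(episode: int) -> dict[str, float]:
    """Illustrative stage gate: format-only warm-up, then the full mix."""
    if episode < 100:  # Stage 1: learn valid JSON tool calls first
        return {"r_format": 1.0}
    # Stage 2: the full five-rubric mix from Section 5.3
    return {"r_cric": 0.25, "r_dream11": 0.10, "r_strategy": 0.30,
            "r_tools": 0.20, "r_format": 0.15}
```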

6. Non-Functional Requirements

| ID | Requirement | Specification |
| --- | --- | --- |
| NF1 | Latency | Single step < 10 ms (no external API calls in the env) |
| NF2 | Concurrency | Multiple simultaneous WebSocket sessions (SUPPORTS_CONCURRENT_SESSIONS = True) |
| NF3 | Determinism | Reproducible with the seed parameter in reset() |
| NF4 | Extensibility | Mid-match starts via reset(options={over, score, wickets, ...}) |
| NF5 | Framework | OpenEnv >= 0.2.2, Pydantic v2 models, FastAPI/Uvicorn server |
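
NF3 and NF4 together imply reset calls like the following; a hypothetical usage sketch, since the exact client signature may differ:

```python
# Deterministic full match from the toss (NF3).
obs = env.reset(seed=42)

# Mid-match start at the death overs for targeted phase training (NF4).
obs = env.reset(seed=42, options={"over": 16, "score": 142, "wickets": 4})
```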

7. Architecture

```mermaid
graph TD
    A["LLM Agent<br/>(inference.py / train.py)"] -->|WebSocket| B["OpenEnv Server<br/>(app.py)"]
    B --> C["CricketEnvironment<br/>(cricket_environment.py)"]
    C --> D["MarkovCricketEngine<br/>(markov_engine.py)"]
    C --> E["CoherenceGrader<br/>(coherence_grader.py)"]
    C --> F["RewardCalculator<br/>(reward_calculator.py)"]
    D -->|Transition Tables| G["Cricsheet Data or<br/>Synthetic Probs"]
    H["Gradio UI<br/>(ui.py)"] -->|Direct Python| C
```

Key Components

| Component | Responsibility |
| --- | --- |
| CricketEnvironment | State machine managing TOSS → BAT → BOWL → FINISHED transitions |
| MarkovCricketEngine | Ball-by-ball outcome simulation from transition tables; AI batter simulation during bowling |
| CoherenceGrader | Scores alignment between declared strategy and executed actions (batting + bowling) |
| RewardCalculator | Computes the composite rubric-weighted reward with DLS par and target-based scoring |
| CricketCaptainEnv (client) | OpenEnv WebSocket client for remote agent communication |
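
A hypothetical client session tying the pieces together; the import path and constructor arguments are assumptions, not the documented API:

```python
from cricket_env import CricketCaptainEnv  # assumed import path

# Connect to the OpenEnv server (app.py) over WebSocket and start a match.
env = CricketCaptainEnv(base_url="ws://localhost:8000")
obs = env.reset(seed=0)
print(obs.game_state, obs.available_tools)
```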

8. Data Dependencies

| Data | Source | Location | Required? |
| --- | --- | --- | --- |
| Synthetic transitions | Hand-calibrated | data/transition_probs.json | Yes (fallback) |
| Cricsheet transitions | scripts/curate_transitions.py | data/processed/cricket_transitions_v1.pkl | No (enhances realism) |
| DLS par scores | Pre-computed | data/dls_par_scores.json | Yes |
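
The fallback order implied by the table can be sketched as follows; the paths come from the table, the loader itself is illustrative:

```python
import json
import pickle
from pathlib import Path

def load_transitions() -> dict:
    """Prefer curated Cricsheet transitions, fall back to synthetic probs."""
    curated = Path("data/processed/cricket_transitions_v1.pkl")
    if curated.exists():
        with curated.open("rb") as f:
            return pickle.load(f)
    with open("data/transition_probs.json") as f:
        return json.load(f)
```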

9. Success Criteria

| Criterion | Measurement | Target |
| --- | --- | --- |
| Environment runs end-to-end | Random agent completes 5 full matches | ✅ Verified |
| All phases functional | TOSS → BAT → BOWL → FINISHED | ✅ Verified |
| OpenEnv compatible | Server + client communicate via WebSocket | ✅ Verified |
| Parse error rate | Random agent across 5 episodes | 0.0% |
| Baseline coherence | Random agent (untrained) | ~0.50 |
| Trained coherence | After Stage 2 GRPO training | ≥ 0.75 |

Current Baseline (Random Agent, 5 episodes)

```
total_score      : mean=87.6   std=12.4
wickets_lost     : mean=1.6    std=0.5
total_reward     : mean=1.511  std=0.158
mean_coherence   : mean=0.502  std=0.016
parse_error_rate : 0.0%
```

10. Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Bowling reward too easy/hard | Agent ignores the bowling phase | Tune defensive reward scaling; evaluate bat vs bowl reward balance |
| Coherence score gaming | Agent finds shortcuts to high coherence without real strategy | Multi-component scoring prevents single-dimension exploitation |
| Long episodes (~350+ steps) | Training instability with GRPO | Curriculum stages; mid-match starts for targeted phase training |
| Toss randomness | 50% of matches start in the suboptimal role | Agent learns to adapt to both bat-first and bowl-first scenarios |

11. Future Enhancements

| Priority | Enhancement | Description |
| --- | --- | --- |
| P1 | Multi-format support | T20 (20 ov), ODI (50 ov), Test (unlimited) via max_overs |
| P1 | LLM evaluation suite | Benchmark GPT-4o, Claude, Gemini on coherence |
| P2 | Player-specific simulation | Individual batter/bowler skill profiles |
| P2 | Weather & pitch degradation | Dynamic conditions affecting strategy |
| P3 | Multi-agent matches | Two LLM captains competing head-to-head |
| P3 | Commentary generation | Natural language match commentary from game state |