OpenEnv · Extension of Melady TS Green Agent

TemporalBenchEnv

An OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning — built on the four datasets from our Melady TS Green Agent submission.


Multi-step TS reasoning as a verifiable environment

Most time-series LLM benchmarks grade a single prompt at a time. TemporalBenchEnv grades an episode: nine multiple-choice questions drawn from four time-series datasets, answered one per step, with a terminal bonus that rewards both accuracy and cross-domain coverage.

Every reward signal here is ground-truth arithmetic, not a judge. Labels are produced by the TS-Benchmark construction pipeline (trend / volatility / seasonality / outlier thresholds; S1–S5 family rules), so the environment can score an answer with a normalized string match against the stored ground truth. There is no LLM judge in the loop.

The falsifiable hypothesis this environment is built to test: whether a GRPO-trained LLM, post-trained on sequential episodes sampled from our Green Agent’s own benchmark, outperforms strong zero-shot baselines on per-domain MCQ accuracy while hitting the cross-domain coverage bonus. Empirical adjudication is contingent on the training runs described under Architecture & training pipeline.

Extension of our Melady TS Green Agent submission

TemporalBenchEnv is a direct extension of our AgentBeats Melady TS Green Agent submission. The Green Agent is an A2A-protocol evaluator that grades purple agents on 764 TS-Benchmark tasks (see the Green Agent GitHub repository). TemporalBenchEnv takes the same datasets and task taxonomy and re-exposes them as a sequential OpenEnv environment consumable by TRL’s rollout_func — turning the benchmark into a training target.

| Artifact | Melady TS Green Agent | TemporalBenchEnv (this submission) |
| --- | --- | --- |
| Role | A2A evaluator of purple agents | OpenEnv RL environment for post-training LLMs |
| Datasets | PSML, freshretailnet, MIMIC, causal_chambers | Same four datasets |
| Tasks | T1/T3 (accuracy) + T2/T4 (regression + MCQ), 764 total | MCQ subset: T1U, T3, T2_MCQ — 2,775 TSQuestion records |
| Per-domain bank sizes | Packaged in the Docker image | PSML 750 · freshretailnet 616 · MIMIC 709 · causal_chambers 700 |
| Protocol | A2A messaging, one-shot prompts | WebSocket OpenEnv contract, sequential 9-step MDP |
| Reward | MSE / MAE / RMSE / MASE / accuracy (eval metrics) | Per-step correctness + terminal episode bonus w/ coverage multiplier |
| Consumer | AgentBeats leaderboard | TRL 1.0 rollout_func, vLLM colocate / server, GRPO |

The ETL from the Green Agent’s labeled JSONL into the per-domain TSQuestion banks the environment consumes lives in TS-benchmark/scripts/build_temporal_bench_openenv_banks.py; banks ship at openenv-ts/TemporalBenchEnv/data/banks/ and are loaded via the TEMPORALBENCH_QUESTION_BANK_DIR environment variable.

The Green Agent answered: which purple agent is best at TS reasoning right now?  TemporalBenchEnv answers: can we post-train an LLM, on that exact benchmark, to be the next best purple agent?

Purple agent harnesses: evaluating mainstream TS-capable agent stacks

The Green Agent scores purple agents over the A2A protocol — any stack that speaks A2A is a valid participant. To make the benchmark diagnostic for modern agentic time-series practice, we target four of the most popular open-source agent harnesses as purple agents on AgentBeats: two are implemented and live (AgentScope, CAMEL) and two are planned (MetaGPT, TimeSeriesScientist). Each harness stays unchanged internally; a thin A2A adapter feeds the Green Agent’s TS-Benchmark MCQs into the harness’s own reasoning / tool-use loop and returns a final label. These four frameworks are the de facto “agentic harnesses” TS practitioners actually wrap around LLMs today, so they anchor the eval to real downstream usage.

The feedback loop we are instrumenting. For every harness we expose a swappable backbone LLM (e.g. Qwen2.5-7B-Instruct or GPT-4o-mini). The plan is: (1) score the base backbone inside each harness via the Melady TS Green Agent, (2) post-train that same backbone with this OpenEnv — TemporalBenchEnv’s randomized-domain / randomized-task 9-step MCQ episodes with per-step + terminal verifiable rewards, (3) re-score every harness with the post-trained backbone in place of the base, and (4) attribute any delta specifically to the RL post-training rather than to harness architecture. The panel makes the research question concrete: does randomized-domain verifiable-reward post-training on TemporalBenchEnv actually transfer to agentic TS reasoning under mainstream orchestration frameworks?

Implemented — live purple agents on AgentBeats

| Harness | What it is | Why it matters for TS reasoning | AgentBeats listing |
| --- | --- | --- | --- |
| AgentScope (Gao et al., “AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications”, arXiv:2508.16279, 2025; Gao et al., “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv:2402.14034, 2024) | Production-ready ReAct agent framework with built-in tools, memory, planning, MCP / A2A interop, and an agentic-RL tuner (Trinity-RFT). Apache-2.0. | The most common Python-native ReAct / tool-use stack; A2A support makes the Green Agent wiring mechanical. Serves as our canonical “single-ReAct-loop” purple baseline over TS MCQs. | Melady TS Purple (AgentScope) |
| CAMEL (Li et al., “CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society”, NeurIPS 2023, arXiv:2303.17760) | Role-playing multi-agent framework with stateful memory, structured messages, societies / workforce pipelines, and a strong focus on scaling laws of agents. | Role-play + critic loop tests whether a trend / seasonality / outlier answer survives multi-turn discussion without drifting; complements AgentScope’s single-ReAct shape with an inter-agent communication surface. | Melady TS Base Purple Agent |

Planned — purple agents under development

| Harness | What it is | Why it matters for TS reasoning | Status |
| --- | --- | --- | --- |
| MetaGPT (Hong et al., “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, ICLR 2024 oral, arXiv:2308.00352) | SOP-driven multi-agent system — explicit role decomposition (product manager / architect / engineer / QA) orchestrated by a meta-programming layer. One of the most-starred multi-agent frameworks in the wild. | Gives us a decomposition-heavy purple: a “data analyst” + “forecaster” / “reviewer” role split answering the same MCQs, isolating whether explicit SOPs help TS reasoning vs. a flat ReAct loop. | Planned. A2A wrapper not yet published on AgentBeats. |
| TimeSeriesScientist (TSci) (Zhao et al., “TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis”, arXiv:2510.01538, 2025) | Domain-specific LangGraph agent purpose-built for TS: Curator → Planner → Forecaster → Reporter, with statistical / ML / DL model selection and ensembling (ARIMA, Prophet, LSTM, XGBoost, Transformer, …). | The strongest TS-specialized agent we can find; a natural “ceiling” for agentic TS reasoning and a direct yardstick for whether an RL-post-trained generalist LLM inside a simpler harness closes the gap to a purpose-built TS pipeline. | Planned. A2A adapter not yet published on AgentBeats. |
Green Agent + TemporalBenchEnv + harness panel = a closed evaluation loop. The Green Agent scores every harness on a fixed TS-Benchmark MCQ set; TemporalBenchEnv uses the same datasets and task taxonomy to post-train a backbone with randomized-domain / randomized-task episodes under verifiable rewards; we then drop the post-trained backbone back into AgentScope, CAMEL, MetaGPT, and TSci and re-score. A delta from the RL post-training should show up as a uniform lift across all four harnesses — not just in one framework, and not just in the zero-shot MCQ row of the baselines table.

Why this benchmark matters

Time-series reasoning is one of the few areas where frontier LLMs still look visibly weak, but also one where the verifiable signal is clean: trend / volatility / seasonality / outlier labels are constructed from thresholded statistics of the series, so exact-match grading is unambiguous. That makes MCQ episodes a near-ideal RL target before touching noisier numeric forecasting rewards.

The design is transferable. Any benchmark that produces labeled MCQ records over a set of domains — medical diagnostics, power-grid anomaly tagging, retail demand regimes — fits the same 9-question cross-domain template. Datasets are the proxy; the capability is multi-step, multi-domain, verifiable TS reasoning.

Every reward component is ground-truth arithmetic. The environment samples questions from pre-built banks, scores answers via normalized string equality, and aggregates with a closed-form episode bonus. No LLM judge, no circular reward.

Prior work & novelty

Prior “LLMs + time series” work lands in one of three buckets. None occupies the cell we target:

| Prior work bucket | What it does | What it does not |
| --- | --- | --- |
| Static TS benchmarks: TS-Benchmark (ours, OpenReview); FreshRetailNet (arXiv:2505.16319); PSML (Nat. Sci. Data 2022); MIMIC-IV (Nat. Sci. Data 2022); Causal Chambers (Nat. MI 2024) | Construct labeled MCQ / regression tasks over TS datasets, graded by fixed rules | No RL-native environment contract, no sequential episodes, no post-training loop |
| TS-LLM composite RL rewards: TimeMaster (arXiv:2506.13705); COUNTS (arXiv:2510.01116); SenTSR-Bench (arXiv:2602.19455) | Composite reward shaping (MSE + DTW + direction + quantile + ArcTan) for TS multimodal LLMs | No OpenEnv contract, no sequential MCQ episodes, reward is forecasting-centric |
| A2A evaluators: Melady TS Green Agent (AgentBeats); other AgentBeats green agents | Score deployed purple agents on the benchmark through the A2A protocol | Not a training environment; no per-step RL signal, no environment state |
| TemporalBenchEnv (ours) | Sequential 9-step MCQ MDP over four TS datasets, structured Pydantic actions, terminal reward from ground-truth labels + coverage multiplier, OpenEnv + GRPO contract | Does not (yet) train numeric forecasting — T2/T4 reward is stubbed for future work |
To our knowledge, no prior work exposes TS-Benchmark’s multi-dataset MCQ suite as an OpenEnv-native sequential MDP with verifiable terminal rewards suitable for GRPO post-training. The action space, reward decomposition, and sibling-env architecture follow the LotteryElicitationEnv lineage; the domain and task taxonomy follow our own Melady TS Green Agent.

What TemporalBenchEnv is

An OpenEnv-native sequential MDP in which an LLM agent answers nine MCQ questions — six from a primary domain and one each from the other three — earning per-step correctness and a terminal bonus that rewards cross-domain coverage.

Each episode proceeds like this:

1. reset() samples a primary domain and draws nine questions: six from the primary domain and one from each of the remaining three (the 6+1+1+1 stratified split).
2. At each step the agent receives one MCQ observation and submits a single answer string.
3. The answer is graded by normalized string match against the stored ground truth, earning the per-step reward.
4. After the ninth answer, a terminal bonus scales episode accuracy by the domain-coverage multiplier.

The agent’s interface is deliberately minimal: a single answer string per step, no tool-call protocol. Optional confidence and reasoning fields exist on the action for future reward shaping.

Environment design

The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (see openenv-ts/TemporalBenchEnv/env/models.py):

# Action / Observation / State are the OpenEnv base Pydantic types
# (see § Why OpenEnv); only typing is imported here.
from typing import Optional

# Action (agent → env)
class TemporalBenchAction(Action):
    answer: str                          # MCQ label matching an option
    confidence: Optional[float] = None   # in [0, 1], unused in reward for now
    reasoning: Optional[str] = None      # optional CoT, unused in reward for now

# Observation (env → agent)
class TemporalBenchObservation(Observation):
    step_idx: int
    steps_remaining: int
    max_steps: int
    question: str                        # current MCQ prompt
    options: list[str]                   # 2+ answer choices
    task_type: str                       # "T1U" | "T3" | "T2_MCQ"
    dataset: str                         # "PSML" | "freshretailnet" | "MIMIC" | "causal_chambers"
    history: list[dict]                  # [{question, answer, correct, dataset, ...}, ...]
    accuracy_so_far: float
    done: bool
    reward: Optional[float] = None
    metadata: dict = {}

# State (serializable snapshot)
class TemporalBenchState(State):
    episode_id: Optional[str] = None
    step_count: int
    total_correct: int
    total_questions: int
    current_accuracy: float
    primary_domain: str
    per_task_type_accuracy: dict[str, float]
    total_reward: float

Four datasets, three MCQ task types, and a three-stage curriculum shape the training distribution (see env/config.py and env/episode_sampler.py):

| Stage | Allowed task types | Purpose |
| --- | --- | --- |
| Stage 1 | T1U only (non-contextual understanding MCQ) | Shorten credit assignment; learn trend / volatility / seasonality / outliers first |
| Stage 2 | T1U + T3 (contextual understanding, S1–S5 families) | Add context-conditioned reasoning; maintain verifiable labels |
| Stage 3 | T1U + T3 + T2_MCQ (prediction-as-classification) | Full MCQ track; adds direction-of-change / volatility-change / seasonality-alignment |

Curriculum is honored both at EnvConfig(curriculum_stage=...) construction and at env.reset(curriculum_stage=...), so a single server can serve multiple stages to different sessions concurrently.
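
A minimal sketch of stage pinning from the client side, assuming the client wrapper from the Quick start forwards curriculum_stage through reset() as described above:

from client import TemporalBenchEnvClient

# Stage 1 session: T1U-only episodes (short credit assignment).
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    out = env.reset(curriculum_stage=1)

# A second session against the same server can reset with curriculum_stage=3
# without disturbing the first: per-session state keeps the stages isolated.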

Why OpenEnv

OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL’s rollout_func, (2) per-session state with max_concurrent_envs=64 in our create_app factory — each WebSocket session gets a fresh TemporalBenchEnvironment via _env_factory so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development (server/Dockerfile, with TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks), and as a Hugging Face Space during training and evaluation.
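
A sketch of that per-session wiring, using the names quoted above; the exact keyword the OpenEnv factory accepts is an assumption:

def _env_factory() -> TemporalBenchEnvironment:
    # Each WebSocket session gets its own environment instance, so DDP
    # ranks hammering the same Space never share episode state.
    return TemporalBenchEnvironment(config=EnvConfig())

app = create_app(env_factory=_env_factory, max_concurrent_envs=64)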

No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation / State. All extensions (history, per-domain coverage, per-task accuracy) ride on metadata or the serialized state. No new method signatures, no fork. The env ships with openenv.yaml and a Dockerfile, and passes uv run openenv validate.

Hygiene note: OpenEnv’s CLI validator does a naive substring check for main() in server/app.py. We match the reference pattern from LotteryElicitationEnv — an explicit main() call under if __name__ == "__main__" with CLI flags parsed via parse_known_args.

Scoring: per-step correctness + episode bonus

Reward is decomposed into a per-step term and a terminal bonus (see env/reward.py).

\[ r_t \;=\; \alpha \cdot \mathbf{1}\!\left[\hat a_t = a_t^\ast\right] \]

Here \(\hat a_t\) is the agent’s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches).
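
A sketch of that comparison (the shipped logic lives in env/reward.py; the helpers below are illustrative, with the option-matching branch summarized in a comment rather than reproduced):

def _norm(s: str) -> str:
    # Case-fold and collapse whitespace before comparing labels.
    return " ".join(s.strip().lower().split())

def grade_answer(answer: str, ground_truth: str) -> bool:
    # env/reward.py additionally accepts an answer equal to an option
    # whose normalized text matches the ground truth.
    return _norm(answer) == _norm(ground_truth)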

\[ B \;=\; \lambda_{\mathrm{ep}} \,\cdot\, \frac{C}{N} \,\cdot\, m, \qquad m \;=\; \begin{cases} 1.0 & \text{all 4 domains have } \ge 1 \text{ correct} \\ 0.8 & \text{otherwise} \end{cases} \]

\[ R \;=\; \sum_{t=1}^{N} r_t \;+\; B \]

\(C\) is the total correct count in the episode, \(N = 9\) is the episode length, \(\alpha, \lambda_{\mathrm{ep}}\) are alpha and lambda_ep in EnvConfig.

Defaults live in EnvConfig:

| Component | Weight | What it rewards |
| --- | --- | --- |
| Per-step correctness | α = 1.0 | Normalized-string match against the MCQ ground truth |
| Episode bonus weight | λ_ep = 0.5 | Scales the terminal accuracy × coverage term |
| Coverage multiplier | m ∈ {0.8, 1.0} | 1.0 iff every domain in EnvConfig.all_domains has ≥ 1 correct answer this episode |
| Forecasting reward | Stubbed | compute_forecasting_reward raises NotImplementedError; future work |
Why a coverage multiplier: per-step accuracy alone lets the agent ace the six primary questions while guessing on the other three domains, collapsing to a single-domain policy. The 0.8 penalty forces the policy to treat the three cross-domain questions as first-class signal — the very thing that distinguishes a TS-generalist from a PSML-only memorizer.
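
A self-contained sketch of the whole decomposition under the defaults above (env/reward.py is the real source; episode_return and its argument shapes are illustrative):

ALL_DOMAINS = ("PSML", "freshretailnet", "MIMIC", "causal_chambers")

def episode_return(correct: list[bool], domains: list[str],
                   alpha: float = 1.0, lambda_ep: float = 0.5) -> float:
    per_step = alpha * sum(correct)                         # sum of r_t
    covered = {d for d, ok in zip(domains, correct) if ok}  # domains with >=1 correct
    m = 1.0 if all(d in covered for d in ALL_DOMAINS) else 0.8
    bonus = lambda_ep * (sum(correct) / len(correct)) * m   # B
    return per_step + bonus                                 # R

For the 7/9, four-domain-covered pattern in the episode trace below, this returns 7 + 0.5 · (7/9) · 1.0 ≈ 7.389, matching the walkthrough.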

Architecture & training pipeline

Following the LotteryElicitationEnv / ReasoningEconomicsEnv lineage, TemporalBenchEnv (this page) ships the OpenEnv side: reset, step, rewards, and question banks over WebSocket. A separate trainer process — same env/trainer separation as LotteryElicitationEnv and LotteryElicitationPT — would drive GRPO with TRL’s rollout_func and vLLM against that socket, without in-process imports of env-side types.

Training client status. The companion GRPO / TRL package that runs rollouts against this environment is under active internal testing. We will release it on GitHub once we secure additional compute so we can stress-test and validate models at a scale we consider reasonable. This blog documents the shipped TemporalBenchEnv only until that release.
Purple agents on AgentBeats. The same TS-Benchmark task surface is also exercised by purple agents scored through the A2A green agent: live listings are Melady TS Purple (AgentScope) and Melady TS Base Purple Agent. Planned additions are documented in § Purple agent harnesses.
flowchart LR
    subgraph TRN ["Trainer (TRL + GRPO, WebSocket client)"]
        GRPO["GRPOTrainer<br/>TRL 1.0"]
        RF["rollout_func"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["action_parser<br/>MCQ label guardrails"]
    end
    subgraph ENV ["TemporalBenchEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        SAMP["EpisodeSampler<br/>6+1+1+1 stratified"]
        GRADE["grade_answer<br/>normalized match"]
        REW["Reward<br/>per-step + bonus + coverage"]
    end
    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"answer string"| WS
    WS --> SAMP
    SAMP -->|"next question"| WS
    WS --> GRADE
    GRADE --> REW
    REW -->|"step + terminal reward"| WS
    WS -->|"observation"| RF

Figure 1. System architecture. The trainer never imports env-side types — everything crosses the WebSocket, exactly like our Lottery / Reasoning sibling envs.

Training uses GRPO (Group Relative Policy Optimization), which is a natural fit for per-step verifiable rewards with an additive terminal bonus. The training scaffolding is directly inherited from LotteryElicitationPT: TRL 1.0’s rollout_func contract, vLLM colocate/server, chat-template tokenization with enable_thinking=False, think-block stripping, null-safe MCQ-label parsing, and episode logging to reward_logs.jsonl.
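
The training client is not public yet, but the shape of one episode inside a rollout_func is easy to sketch; generate_fn below stands in for the vLLM generation plus MCQ-label guardrails and is an assumption, while the client types come from the Quick start:

from client import TemporalBenchAction, TemporalBenchEnvClient

def rollout_episode(env: TemporalBenchEnvClient, generate_fn) -> float:
    # Drive one 9-step episode over the WebSocket; return the total reward.
    out = env.reset()
    total = 0.0
    while not out.done:
        obs = out.observation
        prompt = f"{obs.question}\nOptions: {', '.join(obs.options)}"
        label = generate_fn(prompt)          # model's answer after guardrails
        out = env.step(TemporalBenchAction(answer=label))
        total += out.observation.reward or 0.0
    return total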

Episode trace (Ideal, Illustrative)

Here is what a high-reward episode would look like — five representative steps out of nine, spanning all four domains. Primary domain is PSML; the trace shows the three cross-domain picks plus two primary turns, culminating in the terminal step where the coverage multiplier decides the shape of the bonus. Turns 2, 4, 6, 8 are elided (all PSML T1U; two of the four are assumed correct, giving the 7 / 9 total) so the walkthrough stays focused on the cross-domain structure.

Illustrative, not captured. This is a hand-constructed walkthrough intended to explain the per-step reward, the 6 + 3 domain split, and the terminal coverage multiplier — not a real rollout from a trained (or even zero-shot) policy against the live environment. The prompts and agent answers below are author-written. The reward arithmetic (r_t, B, R) is computed exactly as env/reward.py would compute it for the stated correctness pattern. A captured trace from a real model will replace this once the training client is released and run at scale (see Architecture & training pipeline).
Turn 1 · dataset=PSML · task=T1U:trend
Prompt: “Based on the array (length=336), report trend: upward / downward / constant.”
Series rises monotonically over the tail window.
Agent answer: "upward"  →  correct (r_1 = 1.0)
Turn 3 · dataset=freshretailnet · task=T1U:seasonality
Prompt: daily demand for a fresh-retail SKU, peaks repeat every 7 steps with stable amplitude.
Agent answer: "fixed"  →  correct (r_3 = 1.0)
Turn 5 · dataset=MIMIC · task=T1U:outliers
Prompt: ICU vital trace with a single spike above q95 + 3·MAD.
Agent answer: "sudden_spike"  →  correct (r_5 = 1.0)
Turn 7 · dataset=causal_chambers · task=T3:S2
Prompt: wind-chamber actuator trace, contextual question on lagged response to a step input (capability C4).
Agent answer: "delayed_response"  →  correct (r_7 = 1.0)
Turn 9 · dataset=PSML · task=T2_MCQ (terminal)
Prompt: “Median demand level change (forecast horizon vs history)?”
Agent answer: "Higher"  →  correct (r_9 = 1.0)
Episode totals: C = 7 / 9 correct, all four domains covered → m = 1.0.
Terminal bonus: B = 0.5 · (7/9) · 1.0 ≈ 0.389.
Total return: R ≈ 7.389.
Four-of-four coverage → full m = 1.0 multiplier. Contrast: the same 7 / 9 accuracy with a missed cross-domain question (say, zero correct in MIMIC) would give B ≈ 0.311 and R ≈ 7.311. The coverage term is the whole reason a PSML-only policy loses to a generalist.

Evaluation protocol & projected targets

No trained-policy numbers yet. Available compute was exhausted before GRPO could be run to convergence and validated at a scale we consider meaningful. The “Trained HF policy” column therefore stays blank, and every number in the “Projected targets” table below is an analytical projection from the environment’s structure, not a measurement. GRPO runs against this Space will populate the empty column once the public training client is released and run at scale — see the canonical training-status notice under Architecture & training pipeline. No fabricated telemetry is shown.
Purple baselines (AgentBeats). For deployed purple policies on the same benchmark lineage, see Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.

Baselines

The environment ships two scripted baseline policies out of the box (random and majority-class), plus an eval harness for zero-shot and trained LLMs. All run against the live OpenEnv WebSocket, so numbers are directly comparable with the trained policy.

| Baseline | Policy | What it isolates |
| --- | --- | --- |
| Random MCQ | Uniform sample over observation.options | Chance-level lower bound (accuracy ≈ 1 / number of options) |
| Majority-class | Always pick the per-task_type modal label from the bank | How much accuracy is available from label priors alone |
| Zero-shot API LLM | GPT / Claude / Gemini via the eval harness | Strong “off-the-shelf” ceiling before any post-training |
| Zero-shot local LLM | Qwen2.5-7B-Instruct served via vLLM | Planned backbone for GRPO fine-tuning against this environment once training runs land |
| Trained HF policy | GRPO checkpoint from rollouts on this OpenEnv | Whether post-training on TS-Benchmark episodes beats zero-shot |
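
The random baseline is a one-liner over the observation (sketch; the packaged baseline may be structured differently):

import random
from client import TemporalBenchAction

def random_mcq_policy(obs) -> TemporalBenchAction:
    # Uniform draw over the current options: the chance-level lower bound.
    return TemporalBenchAction(answer=random.choice(obs.options))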

Projected targets

The numbers below are analytical projections from the environment’s structure, not empirical results. They exist to anchor what “good” looks like once training runs land.

| Metric | Random MCQ | Zero-shot LLM (expected) | Target for trained policy |
| --- | --- | --- | --- |
| Per-step accuracy (T1U & T3, 3–4 options) | ≈ 0.25–0.33 | >> 0.33 | ≥ strong zero-shot |
| Per-step accuracy (T2_MCQ, 4 options) | ≈ 0.25 | > random | ≥ strong zero-shot |
| Coverage multiplier m | usually 0.8 | 0.8–1.0 | 1.0 consistently |
| Episode bonus B (λ_ep = 0.5, N = 9) | ≈ 0.10 | 0.20–0.35 | ≥ 0.40 |
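
Both endpoints of the bonus row follow from the reward definition: a 4-option random guesser sits near 0.25 per-step accuracy and rarely covers all four domains, while the ≥ 0.40 target implies roughly 0.8 accuracy with full coverage.

\[ B_{\text{random}} \approx 0.5 \cdot 0.25 \cdot 0.8 = 0.10, \qquad B_{\text{target}} \;\ge\; 0.5 \cdot 0.8 \cdot 1.0 = 0.40 \]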

Current status

What is already in place:

- The environment server (reset, step, rewards, and question banks over the OpenEnv WebSocket contract), deployed as a Hugging Face Space and as a Docker image.
- Pre-built question banks for all four domains (2,775 TSQuestion records), loaded via TEMPORALBENCH_QUESTION_BANK_DIR.
- The three-stage curriculum, honored at both EnvConfig construction and env.reset.
- Scripted baselines (random, majority-class) and the eval harness for zero-shot LLMs.
- A passing test suite (uv run pytest tests/ -q) and a clean uv run openenv validate.

Engineering lessons (inherited)

Because TemporalBenchEnv intentionally rides on the same deployment + training pattern as LotteryElicitationEnv and ReasoningEconomicsEnv, most of the hard infra lessons are inherited rather than rediscovered. We document them briefly so the next submitter does not need to learn them from scratch.

| Issue | Root cause | Fix (inherited) |
| --- | --- | --- |
| NCCL desync under variable-length episodes | In vllm_mode=server, different DDP ranks make different numbers of generate() calls per episode → sequence-numbered NCCL collectives go out of sync. | Fixed-count generate() padding per episode; dummy generates discarded via _temporary_vllm_max_tokens(..., 1). Gated on world_size > 1. |
| max_completion_length drift over multi-turn rollouts | The rollout function appends per-turn generations + observation suffixes each step; a 9-turn MCQ episode can easily exceed the nominal completion budget. | Hard-cap completion_ids to max_completion_length; PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. |
| Null-safe MCQ parsing | LLMs sometimes emit {"answer": null} or malformed strings; one rank crashing kills all DDP ranks via gloo cascade. | Inherited _safe_float / _safe_int / null-string fallbacks; fallback action rather than crash. The same regression pattern from LotteryElicitationPT applies directly to MCQ labels. |
| openenv validate hygiene | OpenEnv’s CLI does a naive substring check for main() in server/app.py; an entrypoint like main(port=args.port) fails validation. | Match the Lottery pattern: if __name__ == "__main__": main(), with flags parsed via parse_known_args inside main(). |
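
A sketch of the null-safe parsing pattern from the third row (the real trainer's helpers differ; this shows the fallback-not-crash shape only):

import json

def parse_mcq_label(raw: str, options: list[str]) -> str:
    # Never let a malformed generation crash a DDP rank.
    try:
        answer = json.loads(raw).get("answer")
    except (json.JSONDecodeError, AttributeError):
        answer = None
    if isinstance(answer, str) and answer.strip():
        return answer.strip()
    return options[0]  # deterministic fallback action instead of a crash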

Foundations & citations

| Foundation | Role in this project | Citation |
| --- | --- | --- |
| TS-Benchmark (ours) | Task taxonomy (T1/T2/T3/T4), per-dataset label construction, MCQ question shape | OpenReview · in-repo: TS-benchmark/TS-Benchmark.md |
| Melady TS Green Agent | A2A evaluator whose task set TemporalBenchEnv re-exposes as an OpenEnv environment | AgentBeats · GitHub repository |
| AgentScope | Purple-agent harness — ReAct / MCP / A2A framework wrapping the backbone we post-train (live on AgentBeats) | Gao et al., arXiv:2508.16279 (2025) · arXiv:2402.14034 (2024) · GitHub |
| CAMEL | Purple-agent harness — role-playing multi-agent society, CAMEL-backed baseline on AgentBeats | Li et al., NeurIPS 2023 (arXiv:2303.17760) · GitHub |
| MetaGPT | Purple-agent harness (fast-follow-up) — SOP-driven multi-agent system, role-decomposition purple | Hong et al., ICLR 2024 oral (arXiv:2308.00352) · GitHub |
| TimeSeriesScientist (TSci) | Purple-agent harness (fast-follow-up) — TS-specialized Curator/Planner/Forecaster/Reporter agent; “ceiling” for agentic TS reasoning | Zhao et al., arXiv:2510.01538 (2025) · GitHub |
| FreshRetailNet-50K | Retail demand dataset; T1/T2/T3/T4 MCQ questions | Ding et al., arXiv:2505.16319 (2025) |
| PSML | Power-system load dataset; primary domain for the default episode | Nature Sci. Data 2022 |
| MIMIC-IV | ICU/EHR time-series dataset; medical-domain MCQ | Nature Sci. Data 2022 |
| Causal Chambers | Physical-testbed TS dataset; contextual T3 and wind-chamber T1U | Nature MI 2024 |
| TimeMaster / COUNTS / SenTSR-Bench | Prior art on composite RL rewards for TS-LLMs; motivates T2/T4 forecasting future work | TimeMaster (arXiv:2506.13705) · COUNTS (arXiv:2510.01116) · SenTSR-Bench (arXiv:2602.19455) |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment | HF Blog: Introducing OpenEnv |
| TRL + GRPO | GRPOTrainer, custom rollout_func, remote env rollouts | Shao et al., arXiv:2402.03300 (DeepSeekMath) · TRL × OpenEnv |
| LotteryElicitationEnv / PT | Sibling project — structural template for env / PT split, rollout_func, DDP padding, validation hygiene | HF Space (Env) · GitHub (PT) |

Quick start

# 1. Install and run the env locally
cd openenv-ts/TemporalBenchEnv
uv sync --extra dev
uv run pytest tests/ -q
uv run openenv validate

# 2. Run the server (uvicorn, port 8000)
uv run server
# or
uvicorn server.app:app --reload

# 3. Build & run the Docker image
docker build -t temporalbenchenv:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 \
    -e TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks \
    temporalbenchenv:latest

# 4. Or pull / push a HF Space
export ENV_BASE_URL="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv"
openenv push   # from the TemporalBenchEnv/ directory

# 5. Minimal client usage
python - <<'PY'
from client import TemporalBenchAction, TemporalBenchEnvClient
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    out = env.reset()
    while not out.done:
        q = out.observation
        out = env.step(TemporalBenchAction(answer=q.options[0]))
    print("total reward:", out.observation.reward)
PY

# 6. GRPO / TRL training against this Space (training client under test; not on GitHub yet)
# Release is planned after additional compute for reasonable model validation.

Purple track (AgentBeats): Melady TS Purple (AgentScope) · Melady TS Base Purple Agent.

Banks are reproducible from (env_seed, curriculum_stage, primary_domain). No external fixtures, no live API, no human labels — the same flat tiered JSONL that drives the Melady TS Green Agent drives this environment.

Can an LLM post-trained on our Green Agent’s own benchmark outperform zero-shot baselines at cross-domain TS reasoning?
The environment is built and deployed. Training-client release and empirical GRPO numbers are contingent on compute availability — see Architecture.
On AgentBeats, compare purple baselines such as Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.

Future work

- Unstub the numeric forecasting reward for T2/T4: compute_forecasting_reward currently raises NotImplementedError; the composite-reward lineage of TimeMaster / COUNTS is the natural starting point.
- Release the GRPO / TRL training client and replace the projected targets with measured numbers in the “Trained HF policy” column.
- Publish the MetaGPT and TimeSeriesScientist A2A adapters on AgentBeats and re-score all four harnesses with the post-trained backbone.
- Put the action's optional confidence and reasoning fields to work for reward shaping.

Conclusion

TemporalBenchEnv is the training-side companion to our Melady TS Green Agent’s eval-side role. The Green Agent ranks purple agents on TS-Benchmark through the A2A protocol; TemporalBenchEnv re-exposes the same four-dataset MCQ suite as an OpenEnv-native sequential MDP, so an LLM can be post-trained on the very benchmark it will later be scored against.

Every design choice — 9-question episodes with 6 primary + 3 cross-domain, per-step correctness + terminal bonus with a domain-coverage multiplier, stagewise curriculum T1U → +T3 → +T2_MCQ, and a strict separation of env server and trainer over WebSocket — is aimed at preserving the verifiability that made the Green Agent’s signal trustworthy, while turning it into gradient. The infrastructure contributions are mostly inherited from our Lottery / Reasoning siblings; the novelty is the domain binding and the Green-Agent → OpenEnv pipeline.

The research question is open: can a GRPO-trained LLM, post-trained on our own benchmark, outperform strong zero-shot baselines at cross-domain TS reasoning? The environment, banks, and deployment path are shipped; full training results await the public training client and the compute budget described under Architecture. The purple track stays live on AgentBeats via Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.