OpenEnv · Extension of Melady TS Green Agent

TemporalBenchEnv

An OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning — built on the four datasets from our Melady TS Green Agent submission.


Multi-step TS reasoning as a verifiable environment

Most time-series LLM benchmarks grade a single prompt at a time. TemporalBenchEnv grades an episode: nine multiple-choice questions drawn from four time-series datasets, answered one per step, with a terminal bonus that rewards both accuracy and cross-domain coverage.

Every reward signal here is ground-truth arithmetic, not a judge. Labels are produced by the TS-Benchmark construction pipeline (trend / volatility / seasonality / outlier thresholds; S1–S5 family rules), so the environment can score an answer with a normalized string match against the stored ground truth. There is no LLM judge in the loop.

The falsifiable hypothesis this environment is built to test: whether a GRPO-trained LLM, post-trained on sequential episodes sampled from our Green Agent’s own benchmark, outperforms strong zero-shot baselines on per-domain MCQ accuracy while hitting the cross-domain coverage bonus. Empirical adjudication is contingent on the training runs described under Architecture & training pipeline.

Extension of our Melady TS Green Agent submission

TemporalBenchEnv is a direct extension of our AgentBeats Melady TS Green Agent submission. The Green Agent is an A2A-protocol evaluator that grades purple agents on 764 TS-Benchmark tasks (see the Green Agent GitHub repository). TemporalBenchEnv takes the same datasets and task taxonomy and re-exposes them as a sequential OpenEnv environment consumable by TRL’s rollout_func — turning the benchmark into a training target.

| Artifact | Melady TS Green Agent | TemporalBenchEnv (this submission) |
| --- | --- | --- |
| Role | A2A evaluator of purple agents | OpenEnv RL environment for post-training LLMs |
| Datasets | PSML, freshretailnet, MIMIC, causal_chambers | Same four datasets |
| Tasks | T1/T3 (accuracy) + T2/T4 (regression + MCQ), 764 total | MCQ subset: T1U, T3, T2_MCQ — 2,775 TSQuestion records |
| Per-domain bank sizes | Packaged in the Docker image | PSML 750 · freshretailnet 616 · MIMIC 709 · causal_chambers 700 |
| Protocol | A2A messaging, one-shot prompts | WebSocket OpenEnv contract, sequential 9-step MDP |
| Reward | MSE / MAE / RMSE / MASE / accuracy (eval metrics) | Per-step correctness + terminal episode bonus w/ coverage multiplier |
| Consumer | AgentBeats leaderboard | TRL 1.0 rollout_func, vLLM colocate / server, GRPO |

The ETL from the Green Agent’s labeled JSONL into the per-domain TSQuestion banks the environment consumes lives in TS-benchmark/scripts/build_temporal_bench_openenv_banks.py; banks ship at openenv-ts/TemporalBenchEnv/data/banks/ and are loaded via the TEMPORALBENCH_QUESTION_BANK_DIR environment variable.

The Green Agent answered: which purple agent is best at TS reasoning right now?  TemporalBenchEnv answers: can we post-train an LLM, on that exact benchmark, to be the next best purple agent?

Purple agent harnesses: evaluating mainstream TS-capable agent stacks

The Green Agent scores purple agents over the A2A protocol — any stack that speaks A2A is a valid participant. To make the benchmark diagnostic for modern agentic time-series practice, we target four of the most popular open-source agent harnesses as purple agents on AgentBeats: two are implemented and live (AgentScope, CAMEL) and two are planned (MetaGPT, TimeSeriesScientist). Each harness stays unchanged internally; a thin A2A adapter feeds the Green Agent’s TS-Benchmark MCQs into the harness’s own reasoning / tool-use loop and returns a final label. These four frameworks are the de facto “agentic harnesses” TS practitioners actually wrap around LLMs today, so they anchor the eval to real downstream usage.

The feedback loop we are instrumenting. For every harness we expose a swappable backbone LLM (e.g. Qwen2.5-7B-Instruct or GPT-4o-mini). The plan is: (1) score the base backbone inside each harness via the Melady TS Green Agent, (2) post-train that same backbone with this OpenEnv — TemporalBenchEnv’s randomized-domain / randomized-task 9-step MCQ episodes with per-step + terminal verifiable rewards, (3) re-score every harness with the post-trained backbone in place of the base, and (4) attribute any delta specifically to the RL post-training rather than to harness architecture. The panel makes the research question concrete: does randomized-domain verifiable-reward post-training on TemporalBenchEnv actually transfer to agentic TS reasoning under mainstream orchestration frameworks?

Implemented — live purple agents on AgentBeats

| Harness | What it is | Why it matters for TS reasoning | AgentBeats listing |
| --- | --- | --- | --- |
| AgentScope (Gao et al., “AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications”, arXiv:2508.16279, 2025; Gao et al., “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv:2402.14034, 2024) | Production-ready ReAct agent framework with built-in tools, memory, planning, MCP / A2A interop, and an agentic-RL tuner (Trinity-RFT). Apache-2.0. | The most common Python-native ReAct / tool-use stack; A2A support makes the Green Agent wiring mechanical. Serves as our canonical “single-ReAct-loop” purple baseline over TS MCQs. | Melady TS Purple (AgentScope) |
| CAMEL (Li et al., “CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society”, NeurIPS 2023, arXiv:2303.17760) | Role-playing multi-agent framework with stateful memory, structured messages, societies / workforce pipelines, and a strong focus on scaling laws of agents. | Role-play + critic loop tests whether a trend / seasonality / outlier answer survives multi-turn discussion without drifting; complements AgentScope’s single-ReAct shape with an inter-agent communication surface. | Melady TS Base Purple Agent |

Planned — purple agents under development

| Harness | What it is | Why it matters for TS reasoning | Status |
| --- | --- | --- | --- |
| MetaGPT (Hong et al., “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, ICLR 2024 oral, arXiv:2308.00352) | SOP-driven multi-agent system — explicit role decomposition (product manager / architect / engineer / QA) orchestrated by a meta-programming layer. One of the most-starred multi-agent frameworks in the wild. | Gives us a decomposition-heavy purple: a “data analyst” + “forecaster” / “reviewer” role split answering the same MCQs, isolating whether explicit SOPs help TS reasoning vs. a flat ReAct loop. | Planned. A2A wrapper not yet published on AgentBeats. |
| TimeSeriesScientist (TSci) (Zhao et al., “TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis”, arXiv:2510.01538, 2025) | Domain-specific LangGraph agent purpose-built for TS: Curator → Planner → Forecaster → Reporter, with statistical / ML / DL model selection and ensembling (ARIMA, Prophet, LSTM, XGBoost, Transformer, …). | The strongest TS-specialized agent we can find; a natural “ceiling” for agentic TS reasoning and a direct yardstick for whether an RL-post-trained generalist LLM inside a simpler harness closes the gap to a purpose-built TS pipeline. | Planned. A2A adapter not yet published on AgentBeats. |
Green Agent + TemporalBenchEnv + harness panel = a closed evaluation loop. The Green Agent scores every harness on a fixed TS-Benchmark MCQ set; TemporalBenchEnv uses the same datasets and task taxonomy to post-train a backbone with randomized-domain / randomized-task episodes under verifiable rewards; we then drop the post-trained backbone back into AgentScope, CAMEL, MetaGPT, and TSci and re-score. A delta from the RL post-training should show up as a uniform lift across all four harnesses — not just in one framework, and not just in the zero-shot MCQ row of the baselines table.

Why this benchmark matters

Time-series reasoning is one of the few areas where frontier LLMs still look visibly weak, but also one where the verifiable signal is clean: trend / volatility / seasonality / outlier labels are constructed from thresholded statistics of the series, so exact-match grading is unambiguous. That makes MCQ episodes a near-ideal RL target before touching noisier numeric forecasting rewards.

The design is transferable. Any benchmark that produces labeled MCQ records over a set of domains — medical diagnostics, power-grid anomaly tagging, retail demand regimes — fits the same 9-question cross-domain template. Datasets are the proxy; the capability is multi-step, multi-domain, verifiable TS reasoning.

Every reward component is ground-truth arithmetic. The environment samples questions from pre-built banks, scores answers via normalized string equality, and aggregates with a closed-form episode bonus. No LLM judge, no circular reward.

Prior work & novelty

Prior “LLMs + time series” work lands in one of three buckets. None occupies the cell we target:

| Prior work bucket | What it does | What it does not |
| --- | --- | --- |
| Static TS benchmarks: TS-Benchmark (ours, OpenReview); FreshRetailNet (arXiv:2505.16319); PSML (Nat. Sci. Data 2022); MIMIC-IV (Nat. Sci. Data 2022); Causal Chambers (Nat. MI 2024) | Construct labeled MCQ / regression tasks over TS datasets, graded by fixed rules | No RL-native environment contract, no sequential episodes, no post-training loop |
| TS-LLM composite RL rewards: TimeMaster (arXiv:2506.13705); COUNTS (arXiv:2510.01116); SenTSR-Bench (arXiv:2602.19455) | Composite reward shaping (MSE + DTW + direction + quantile + ArcTan) for TS multimodal LLMs | No OpenEnv contract, no sequential MCQ episodes, reward is forecasting-centric |
| A2A evaluators: Melady TS Green Agent (AgentBeats); other AgentBeats green agents | Score deployed purple agents on the benchmark through the A2A protocol | Not a training environment; no per-step RL signal, no environment state |
| TemporalBenchEnv (ours) | Sequential 9-step MCQ MDP over four TS datasets, structured Pydantic actions, terminal reward from ground-truth labels + coverage multiplier, OpenEnv + GRPO contract | Does not (yet) train numeric forecasting — T2/T4 reward is stubbed for future work |
To our knowledge, no prior work exposes TS-Benchmark’s multi-dataset MCQ suite as an OpenEnv-native sequential MDP with verifiable terminal rewards suitable for GRPO post-training. The action space, reward decomposition, and sibling-env architecture follow the LotteryElicitationEnv lineage; the domain and task taxonomy follow our own Melady TS Green Agent.

What TemporalBenchEnv is

An OpenEnv-native sequential MDP in which an LLM agent answers nine MCQ questions — six from a primary domain and one each from the other three — earning per-step correctness and a terminal bonus that rewards cross-domain coverage.

Each episode proceeds like this:

1. reset() samples a primary domain and draws nine questions: six from the primary domain and one from each of the remaining three (the 6+1+1+1 stratified split).
2. At each step the agent receives one MCQ observation and submits a single answer string.
3. The answer is graded by normalized string match against the stored ground truth, earning the per-step reward.
4. After the ninth answer, a terminal bonus scales episode accuracy by the domain-coverage multiplier.

The agent’s interface is deliberately minimal: a single answer string per step, no tool-call protocol. Optional confidence and reasoning fields exist on the action for future reward shaping.

Environment design

The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (see openenv-ts/TemporalBenchEnv/env/models.py):

# Action / Observation / State are the OpenEnv base Pydantic types
# (see § Why OpenEnv); only typing is imported here.
from typing import Optional

# Action (agent → env)
class TemporalBenchAction(Action):
    answer: str                          # MCQ label matching an option
    confidence: Optional[float] = None   # in [0, 1], unused in reward for now
    reasoning: Optional[str] = None      # optional CoT, unused in reward for now

# Observation (env → agent)
class TemporalBenchObservation(Observation):
    step_idx: int
    steps_remaining: int
    max_steps: int
    question: str                        # current MCQ prompt
    options: list[str]                   # 2+ answer choices
    task_type: str                       # "T1U" | "T3" | "T2_MCQ"
    dataset: str                         # "PSML" | "freshretailnet" | "MIMIC" | "causal_chambers"
    history: list[dict]                  # [{question, answer, correct, dataset, ...}, ...]
    accuracy_so_far: float
    done: bool
    reward: Optional[float] = None
    metadata: dict = {}

# State (serializable snapshot)
class TemporalBenchState(State):
    episode_id: Optional[str] = None
    step_count: int
    total_correct: int
    total_questions: int
    current_accuracy: float
    primary_domain: str
    per_task_type_accuracy: dict[str, float]
    total_reward: float

Four datasets, three MCQ task types, and a three-stage curriculum shape the training distribution (see env/config.py and env/episode_sampler.py):

| Stage | Allowed task types | Purpose |
| --- | --- | --- |
| Stage 1 | T1U only (non-contextual understanding MCQ) | Shorten credit assignment; learn trend / volatility / seasonality / outliers first |
| Stage 2 | T1U + T3 (contextual understanding, S1–S5 families) | Add context-conditioned reasoning; maintain verifiable labels |
| Stage 3 | T1U + T3 + T2_MCQ (prediction-as-classification) | Full MCQ track; adds direction-of-change / volatility-change / seasonality-alignment |

Curriculum is honored both at EnvConfig(curriculum_stage=...) construction and at env.reset(curriculum_stage=...), so a single server can serve multiple stages to different sessions concurrently.
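
A minimal sketch of stage pinning from the client side, assuming the client wrapper from the Quick start forwards curriculum_stage through reset() as described above:

from client import TemporalBenchEnvClient

# Stage 1 session: T1U-only episodes (short credit assignment).
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    out = env.reset(curriculum_stage=1)

# A second session against the same server can reset with curriculum_stage=3
# without disturbing the first: per-session state keeps the stages isolated.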

Why OpenEnv

OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL’s rollout_func, (2) per-session state with max_concurrent_envs=64 in our create_app factory — each WebSocket session gets a fresh TemporalBenchEnvironment via _env_factory so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development (server/Dockerfile, with TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks), and as a Hugging Face Space during training and evaluation.
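
A sketch of that per-session wiring, using the names quoted above; the exact keyword the OpenEnv factory accepts is an assumption:

def _env_factory() -> TemporalBenchEnvironment:
    # Each WebSocket session gets its own environment instance, so DDP
    # ranks hammering the same Space never share episode state.
    return TemporalBenchEnvironment(config=EnvConfig())

app = create_app(env_factory=_env_factory, max_concurrent_envs=64)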

No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation / State. All extensions (history, per-domain coverage, per-task accuracy) ride on metadata or the serialized state. No new method signatures, no fork. The env ships with openenv.yaml and a Dockerfile, and passes uv run openenv validate.

Hygiene note: OpenEnv’s CLI validator does a naive substring check for main() in server/app.py. We match the reference pattern from LotteryElicitationEnv — an explicit main() call under if __name__ == "__main__" with CLI flags parsed via parse_known_args.

Scoring: per-step correctness + episode bonus

Reward is decomposed into a per-step term and a terminal bonus (see env/reward.py).

\[ r_t \;=\; \alpha \cdot \mathbf{1}\!\left[\hat a_t = a_t^\ast\right] \]

Here \(\hat a_t\) is the agent’s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches).
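
A sketch of that comparison (the shipped logic lives in env/reward.py; the helpers below are illustrative, with the option-matching branch summarized in a comment rather than reproduced):

def _norm(s: str) -> str:
    # Case-fold and collapse whitespace before comparing labels.
    return " ".join(s.strip().lower().split())

def grade_answer(answer: str, ground_truth: str) -> bool:
    # env/reward.py additionally accepts an answer equal to an option
    # whose normalized text matches the ground truth.
    return _norm(answer) == _norm(ground_truth)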

\[ B \;=\; \lambda_{\mathrm{ep}} \,\cdot\, \frac{C}{N} \,\cdot\, m, \qquad m \;=\; \begin{cases} 1.0 & \text{all 4 domains have } \ge 1 \text{ correct} \\ 0.8 & \text{otherwise} \end{cases} \]

\[ R \;=\; \sum_{t=1}^{N} r_t \;+\; B \]

\(C\) is the total correct count in the episode, \(N = 9\) is the episode length, \(\alpha, \lambda_{\mathrm{ep}}\) are alpha and lambda_ep in EnvConfig.

Defaults live in EnvConfig:

| Component | Weight | What it rewards |
| --- | --- | --- |
| Per-step correctness | α = 1.0 | Normalized-string match against the MCQ ground truth |
| Episode bonus weight | λ_ep = 0.5 | Scales the terminal accuracy × coverage term |
| Coverage multiplier | m ∈ {0.8, 1.0} | 1.0 iff every domain in EnvConfig.all_domains has ≥ 1 correct answer this episode |
| Forecasting reward | Stubbed | compute_forecasting_reward raises NotImplementedError; future work |
Why a coverage multiplier: per-step accuracy alone lets the agent ace the six primary questions while guessing on the other three domains, collapsing to a single-domain policy. The 0.8 penalty forces the policy to treat the three cross-domain questions as first-class signal — the very thing that distinguishes a TS-generalist from a PSML-only memorizer.
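
A self-contained sketch of the whole decomposition under the defaults above (env/reward.py is the real source; episode_return and its argument shapes are illustrative):

ALL_DOMAINS = ("PSML", "freshretailnet", "MIMIC", "causal_chambers")

def episode_return(correct: list[bool], domains: list[str],
                   alpha: float = 1.0, lambda_ep: float = 0.5) -> float:
    per_step = alpha * sum(correct)                         # sum of r_t
    covered = {d for d, ok in zip(domains, correct) if ok}  # domains with >=1 correct
    m = 1.0 if all(d in covered for d in ALL_DOMAINS) else 0.8
    bonus = lambda_ep * (sum(correct) / len(correct)) * m   # B
    return per_step + bonus                                 # R

For the 7/9, four-domain-covered pattern in the episode trace below, this returns 7 + 0.5 · (7/9) · 1.0 ≈ 7.389, matching the walkthrough.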

Architecture & training pipeline

Following the LotteryElicitationEnv / ReasoningEconomicsEnv lineage, TemporalBenchEnv (this page) ships the OpenEnv side: reset, step, rewards, and question banks over WebSocket. A separate trainer process — same env/trainer separation as LotteryElicitationEnv and LotteryElicitationPT — would drive GRPO with TRL’s rollout_func and vLLM against that socket, without in-process imports of env-side types.

Training client status. The companion GRPO / TRL package that runs rollouts against this environment is under active internal testing. We will release it on GitHub once we secure additional compute so we can stress-test and validate models at a scale we consider reasonable. This blog documents the shipped TemporalBenchEnv only until that release.
Purple agents on AgentBeats. The same TS-Benchmark task surface is also exercised by purple agents scored through the A2A green agent: live listings are Melady TS Purple (AgentScope) and Melady TS Base Purple Agent. Planned additions are documented in § Purple agent harnesses.
flowchart LR
    subgraph TRN ["Trainer (TRL + GRPO, WebSocket client)"]
        GRPO["GRPOTrainer<br/>TRL 1.0"]
        RF["rollout_func"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["action_parser<br/>MCQ label guardrails"]
    end
    subgraph ENV ["TemporalBenchEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        SAMP["EpisodeSampler<br/>6+1+1+1 stratified"]
        GRADE["grade_answer<br/>normalized match"]
        REW["Reward<br/>per-step + bonus + coverage"]
    end
    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"answer string"| WS
    WS --> SAMP
    SAMP -->|"next question"| WS
    WS --> GRADE
    GRADE --> REW
    REW -->|"step + terminal reward"| WS
    WS -->|"observation"| RF

Figure 1. System architecture. The trainer never imports env-side types — everything crosses the WebSocket, exactly like our Lottery / Reasoning sibling envs.

Training uses GRPO (Group Relative Policy Optimization), which is a natural fit for per-step verifiable rewards with an additive terminal bonus. The training scaffolding is directly inherited from LotteryElicitationPT: TRL 1.0’s rollout_func contract, vLLM colocate/server, chat-template tokenization with enable_thinking=False, think-block stripping, null-safe MCQ-label parsing, and episode logging to reward_logs.jsonl.
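
The training client is not public yet, but the shape of one episode inside a rollout_func is easy to sketch; generate_fn below stands in for the vLLM generation plus MCQ-label guardrails and is an assumption, while the client types come from the Quick start:

from client import TemporalBenchAction, TemporalBenchEnvClient

def rollout_episode(env: TemporalBenchEnvClient, generate_fn) -> float:
    # Drive one 9-step episode over the WebSocket; return the total reward.
    out = env.reset()
    total = 0.0
    while not out.done:
        obs = out.observation
        prompt = f"{obs.question}\nOptions: {', '.join(obs.options)}"
        label = generate_fn(prompt)          # model's answer after guardrails
        out = env.step(TemporalBenchAction(answer=label))
        total += out.observation.reward or 0.0
    return total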

Episode trace (Ideal, Illustrative)

Here is what a high-reward episode would look like — five representative steps out of nine, spanning all four domains. Primary domain is PSML; the trace shows the three cross-domain picks plus two primary turns, culminating in the terminal step where the coverage multiplier decides the shape of the bonus. Turns 2, 4, 6, 8 are elided (all PSML T1U; two of the four are assumed correct, giving the 7 / 9 total) so the walkthrough stays focused on the cross-domain structure.

Illustrative, not captured. This is a hand-constructed walkthrough intended to explain the per-step reward, the 6 + 3 domain split, and the terminal coverage multiplier — not a real rollout from a trained (or even zero-shot) policy against the live environment. The prompts and agent answers below are author-written. The reward arithmetic (r_t, B, R) is computed exactly as env/reward.py would compute it for the stated correctness pattern. A captured trace from a real model will replace this once the training client is released and run at scale (see Architecture & training pipeline).
Turn 1 · dataset=PSML · task=T1U:trend
Prompt: “Based on the array (length=336), report trend: upward / downward / constant.”
Series rises monotonically over the tail window.
Agent answer: "upward"  →  correct (r_1 = 1.0)
Turn 3 · dataset=freshretailnet · task=T1U:seasonality
Prompt: daily demand for a fresh-retail SKU, peaks repeat every 7 steps with stable amplitude.
Agent answer: "fixed"  →  correct (r_3 = 1.0)
Turn 5 · dataset=MIMIC · task=T1U:outliers
Prompt: ICU vital trace with a single spike above q95 + 3·MAD.
Agent answer: "sudden_spike"  →  correct (r_5 = 1.0)
Turn 7 · dataset=causal_chambers · task=T3:S2
Prompt: wind-chamber actuator trace, contextual question on lagged response to a step input (capability C4).
Agent answer: "delayed_response"  →  correct (r_7 = 1.0)
Turn 9 · dataset=PSML · task=T2_MCQ (terminal)
Prompt: “Median demand level change (forecast horizon vs history)?”
Agent answer: "Higher"  →  correct (r_9 = 1.0)
Episode totals: C = 7 / 9 correct, all four domains covered → m = 1.0.
Terminal bonus: B = 0.5 · (7/9) · 1.0 ≈ 0.389.
Total return: R ≈ 7.389.
Four-of-four coverage → full m = 1.0 multiplier. Contrast: the same 7 / 9 accuracy with a missed cross-domain question (say, zero correct in MIMIC) would give B ≈ 0.311 and R ≈ 7.311. The coverage term is the whole reason a PSML-only policy loses to a generalist.

Evaluation protocol & projected targets

No trained-policy numbers yet. Available compute was exhausted before GRPO could be run to convergence and validated at a scale we consider meaningful. The “Trained HF policy” column therefore stays blank, and every number in the “Projected targets” table below is an analytical projection from the environment’s structure, not a measurement. GRPO runs against this Space will populate the empty column once the public training client is released and run at scale — see the canonical training-status notice under Architecture & training pipeline. No fabricated telemetry is shown.
Purple baselines (AgentBeats). For deployed purple policies on the same benchmark lineage, see Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.

Baselines

The environment ships two scripted baseline policies out of the box (random and majority-class), plus an eval harness for zero-shot and trained LLMs. All run against the live OpenEnv WebSocket, so numbers are directly comparable with the trained policy.

| Baseline | Policy | What it isolates |
| --- | --- | --- |
| Random MCQ | Uniform sample over observation.options | Chance-level lower bound (accuracy ≈ 1 / number of options) |
| Majority-class | Always pick the per-task_type modal label from the bank | How much accuracy is available from label priors alone |
| Zero-shot API LLM | GPT / Claude / Gemini via the eval harness | Strong “off-the-shelf” ceiling before any post-training |
| Zero-shot local LLM | Qwen2.5-7B-Instruct served via vLLM | Planned backbone for GRPO fine-tuning against this environment once training runs land |
| Trained HF policy | GRPO checkpoint from rollouts on this OpenEnv | Whether post-training on TS-Benchmark episodes beats zero-shot |
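
The random baseline is a one-liner over the observation (sketch; the packaged baseline may be structured differently):

import random
from client import TemporalBenchAction

def random_mcq_policy(obs) -> TemporalBenchAction:
    # Uniform draw over the current options: the chance-level lower bound.
    return TemporalBenchAction(answer=random.choice(obs.options))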

Projected targets

The numbers below are analytical projections from the environment’s structure, not empirical results. They exist to anchor what “good” looks like once training runs land.

| Metric | Random MCQ | Zero-shot LLM (expected) | Target for trained policy |
| --- | --- | --- | --- |
| Per-step accuracy (T1U & T3, 3–4 options) | ≈ 0.25–0.33 | >> 0.33 | ≥ strong zero-shot |
| Per-step accuracy (T2_MCQ, 4 options) | ≈ 0.25 | > random | ≥ strong zero-shot |
| Coverage multiplier m | usually 0.8 | 0.8–1.0 | 1.0 consistently |
| Episode bonus B (λ_ep = 0.5, N = 9) | ≈ 0.10 | 0.20–0.35 | ≥ 0.40 |
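
Both endpoints of the bonus row follow from the reward definition: a 4-option random guesser sits near 0.25 per-step accuracy and rarely covers all four domains, while the ≥ 0.40 target implies roughly 0.8 accuracy with full coverage.

\[ B_{\text{random}} \approx 0.5 \cdot 0.25 \cdot 0.8 = 0.10, \qquad B_{\text{target}} \;\ge\; 0.5 \cdot 0.8 \cdot 1.0 = 0.40 \]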

Current status

What is already in place:

- The environment server (reset, step, rewards, and question banks over the OpenEnv WebSocket contract), deployed as a Hugging Face Space and as a Docker image.
- Pre-built question banks for all four domains (2,775 TSQuestion records), loaded via TEMPORALBENCH_QUESTION_BANK_DIR.
- The three-stage curriculum, honored at both EnvConfig construction and env.reset.
- Scripted baselines (random, majority-class) and the eval harness for zero-shot LLMs.
- A passing test suite (uv run pytest tests/ -q) and a clean uv run openenv validate.

Engineering lessons (inherited)

Because TemporalBenchEnv intentionally rides on the same deployment + training pattern as LotteryElicitationEnv and ReasoningEconomicsEnv, most of the hard infra lessons are inherited rather than rediscovered. We document them briefly so the next submitter does not need to learn them from scratch.

| Issue | Root cause | Fix (inherited) |
| --- | --- | --- |
| NCCL desync under variable-length episodes | In vllm_mode=server, different DDP ranks make different numbers of generate() calls per episode → sequence-numbered NCCL collectives go out of sync. | Fixed-count generate() padding per episode; dummy generates discarded via _temporary_vllm_max_tokens(..., 1). Gated on world_size > 1. |
| max_completion_length drift over multi-turn rollouts | The rollout function appends per-turn generations + observation suffixes each step; a 9-turn MCQ episode can easily exceed the nominal completion budget. | Hard-cap completion_ids to max_completion_length; PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. |
| Null-safe MCQ parsing | LLMs sometimes emit {"answer": null} or malformed strings; one rank crashing kills all DDP ranks via gloo cascade. | Inherited _safe_float / _safe_int / null-string fallbacks; fallback action rather than crash. The same regression pattern from LotteryElicitationPT applies directly to MCQ labels. |
| openenv validate hygiene | OpenEnv’s CLI does a naive substring check for main() in server/app.py; an entrypoint like main(port=args.port) fails validation. | Match the Lottery pattern: if __name__ == "__main__": main(), with flags parsed via parse_known_args inside main(). |
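
A sketch of the null-safe parsing pattern from the third row (the real trainer's helpers differ; this shows the fallback-not-crash shape only):

import json

def parse_mcq_label(raw: str, options: list[str]) -> str:
    # Never let a malformed generation crash a DDP rank.
    try:
        answer = json.loads(raw).get("answer")
    except (json.JSONDecodeError, AttributeError):
        answer = None
    if isinstance(answer, str) and answer.strip():
        return answer.strip()
    return options[0]  # deterministic fallback action instead of a crash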

Foundations & citations

| Foundation | Role in this project | Citation |
| --- | --- | --- |
| TS-Benchmark (ours) | Task taxonomy (T1/T2/T3/T4), per-dataset label construction, MCQ question shape | OpenReview · in-repo: TS-benchmark/TS-Benchmark.md |
| Melady TS Green Agent | A2A evaluator whose task set TemporalBenchEnv re-exposes as an OpenEnv environment | AgentBeats · GitHub repository |
| AgentScope | Purple-agent harness — ReAct / MCP / A2A framework wrapping the backbone we post-train (live on AgentBeats) | Gao et al., arXiv:2508.16279 (2025) · arXiv:2402.14034 (2024) · GitHub |
| CAMEL | Purple-agent harness — role-playing multi-agent society, CAMEL-backed baseline on AgentBeats | Li et al., NeurIPS 2023 (arXiv:2303.17760) · GitHub |
| MetaGPT | Purple-agent harness (fast-follow-up) — SOP-driven multi-agent system, role-decomposition purple | Hong et al., ICLR 2024 oral (arXiv:2308.00352) · GitHub |
| TimeSeriesScientist (TSci) | Purple-agent harness (fast-follow-up) — TS-specialized Curator/Planner/Forecaster/Reporter agent; “ceiling” for agentic TS reasoning | Zhao et al., arXiv:2510.01538 (2025) · GitHub |
| FreshRetailNet-50K | Retail demand dataset; T1/T2/T3/T4 MCQ questions | Ding et al., arXiv:2505.16319 (2025) |
| PSML | Power-system load dataset; primary domain for the default episode | Nature Sci. Data 2022 |
| MIMIC-IV | ICU/EHR time-series dataset; medical-domain MCQ | Nature Sci. Data 2022 |
| Causal Chambers | Physical-testbed TS dataset; contextual T3 and wind-chamber T1U | Nature MI 2024 |
| TimeMaster / COUNTS / SenTSR-Bench | Prior art on composite RL rewards for TS-LLMs; motivates T2/T4 forecasting future work | TimeMaster (arXiv:2506.13705) · COUNTS (arXiv:2510.01116) · SenTSR-Bench (arXiv:2602.19455) |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment | HF Blog: Introducing OpenEnv |
| TRL + GRPO | GRPOTrainer, custom rollout_func, remote env rollouts | Shao et al., arXiv:2402.03300 (DeepSeekMath) · TRL × OpenEnv |
| LotteryElicitationEnv / PT | Sibling project — structural template for env / PT split, rollout_func, DDP padding, validation hygiene | HF Space (Env) · GitHub (PT) |

Quick start

# 1. Install and run the env locally
cd openenv-ts/TemporalBenchEnv
uv sync --extra dev
uv run pytest tests/ -q
uv run openenv validate

# 2. Run the server (uvicorn, port 8000)
uv run server
# or
uvicorn server.app:app --reload

# 3. Build & run the Docker image
docker build -t temporalbenchenv:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 \
    -e TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks \
    temporalbenchenv:latest

# 4. Or pull / push a HF Space
export ENV_BASE_URL="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv"
openenv push   # from the TemporalBenchEnv/ directory

# 5. Minimal client usage
python - <<'PY'
from client import TemporalBenchAction, TemporalBenchEnvClient
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    out = env.reset()
    while not out.done:
        q = out.observation
        out = env.step(TemporalBenchAction(answer=q.options[0]))
    print("total reward:", out.observation.reward)
PY

# 6. GRPO / TRL training against this Space (training client under test; not on GitHub yet)
# Release is planned after additional compute for reasonable model validation.

Purple track (AgentBeats): Melady TS Purple (AgentScope) · Melady TS Base Purple Agent.

Banks are reproducible from (env_seed, curriculum_stage, primary_domain). No external fixtures, no live API, no human labels — the same flat tiered JSONL that drives the Melady TS Green Agent drives this environment.

Can an LLM post-trained on our Green Agent’s own benchmark outperform zero-shot baselines at cross-domain TS reasoning?
The environment is built and deployed. Training-client release and empirical GRPO numbers are contingent on compute availability — see Architecture.
On AgentBeats, compare purple baselines such as Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.

Future work

- Unstub the numeric forecasting reward for T2/T4: compute_forecasting_reward currently raises NotImplementedError; the composite-reward lineage of TimeMaster / COUNTS is the natural starting point.
- Release the GRPO / TRL training client and replace the projected targets with measured numbers in the “Trained HF policy” column.
- Publish the MetaGPT and TimeSeriesScientist A2A adapters on AgentBeats and re-score all four harnesses with the post-trained backbone.
- Put the action's optional confidence and reasoning fields to work for reward shaping.

Conclusion

TemporalBenchEnv is the training-side companion to our Melady TS Green Agent’s eval-side role. The Green Agent ranks purple agents on TS-Benchmark through the A2A protocol; TemporalBenchEnv re-exposes the same four-dataset MCQ suite as an OpenEnv-native sequential MDP, so an LLM can be post-trained on the very benchmark it will later be scored against.

Every design choice — 9-question episodes with 6 primary + 3 cross-domain, per-step correctness + terminal bonus with a domain-coverage multiplier, stagewise curriculum T1U → +T3 → +T2_MCQ, and a strict separation of env server and trainer over WebSocket — is aimed at preserving the verifiability that made the Green Agent’s signal trustworthy, while turning it into gradient. The infrastructure contributions are mostly inherited from our Lottery / Reasoning siblings; the novelty is the domain binding and the Green-Agent → OpenEnv pipeline.

The research question is open: can a GRPO-trained LLM, post-trained on our own benchmark, outperform strong zero-shot baselines at cross-domain TS reasoning? The environment, banks, and deployment path are shipped; full training results await the public training client and the compute budget described under Architecture. The purple track stays live on AgentBeats via Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.