An OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning — built on the four datasets from our Melady TS Green Agent submission.
Most time-series LLM benchmarks grade a single prompt at a time. TemporalBenchEnv grades an episode: nine multiple-choice questions drawn from four time-series datasets, answered one per step, with a terminal bonus that rewards both accuracy and cross-domain coverage.
Every reward signal here is ground-truth arithmetic, not a judge. Labels are produced by the TS-Benchmark construction pipeline (trend / volatility / seasonality / outlier thresholds; S1–S5 family rules), so the environment can score an answer with a normalized string match against the stored ground truth. There is no LLM judge in the loop.
The falsifiable hypothesis this environment is built to test: whether a GRPO-trained LLM, post-trained on sequential episodes sampled from our Green Agent’s own benchmark, outperforms strong zero-shot baselines on per-domain MCQ accuracy while hitting the cross-domain coverage bonus. Empirical adjudication is contingent on the training runs described under Architecture & training pipeline.
TemporalBenchEnv is a direct extension of our AgentBeats Melady TS Green Agent submission. The Green Agent is an A2A-protocol evaluator that grades purple agents on 764 TS-Benchmark tasks (see the Green Agent GitHub repository). TemporalBenchEnv takes the same datasets and task taxonomy and re-exposes them as a sequential OpenEnv environment consumable by TRL’s rollout_func — turning the benchmark into a training target.
| Artifact | Melady TS Green Agent | TemporalBenchEnv (this submission) |
|---|---|---|
| Role | A2A evaluator of purple agents | OpenEnv RL environment for post-training LLMs |
| Datasets | PSML, freshretailnet, MIMIC, causal_chambers | Same four datasets |
| Tasks | T1/T3 (accuracy) + T2/T4 (regression + MCQ), 764 total | MCQ subset: T1U, T3, T2_MCQ — 2,775 TSQuestion records |
| Per-domain bank sizes | Packaged in the Docker image | PSML 750 · freshretailnet 616 · MIMIC 709 · causal_chambers 700 |
| Protocol | A2A messaging, one-shot prompts | WebSocket OpenEnv contract, sequential 9-step MDP |
| Reward | MSE / MAE / RMSE / MASE / accuracy (eval metrics) | Per-step correctness + terminal episode bonus w/ coverage multiplier |
| Consumer | AgentBeats leaderboard | TRL 1.0 rollout_func, vLLM colocate / server, GRPO |
The ETL from the Green Agent’s labeled JSONL into the per-domain TSQuestion banks the environment consumes lives in TS-benchmark/scripts/build_temporal_bench_openenv_banks.py; banks ship at openenv-ts/TemporalBenchEnv/data/banks/ and are loaded via the TEMPORALBENCH_QUESTION_BANK_DIR environment variable.
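For orientation, a bank record carries at least the fields the observation later exposes (question, options, answer, task_type, dataset); the exact schema is defined by the build script, and the record below is purely illustrative, not copied from the banks.

```python
# Hypothetical TSQuestion-style record — field names mirror the observation contract;
# the real schema lives in build_temporal_bench_openenv_banks.py and env/models.py.
example_record = {
    "dataset": "PSML",
    "task_type": "T1U",
    "question": "Over the shown window, what is the dominant trend of the load series?",
    "options": ["upward", "downward", "flat", "oscillating"],
    "answer": "upward",   # ground-truth label from the thresholded-statistics pipeline
}
```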
The Green Agent answered: which purple agent is best at TS reasoning right now? TemporalBenchEnv answers: can we post-train an LLM, on that exact benchmark, to be the next best purple agent?
The Green Agent scores purple agents over the A2A protocol — any stack that speaks A2A is a valid participant. To make the benchmark diagnostic for modern agentic time-series practice, we target four of the most popular open-source agent harnesses as purple agents on AgentBeats: two are implemented and live (AgentScope, CAMEL) and two are planned (MetaGPT, TimeSeriesScientist). Each harness stays unchanged internally; a thin A2A adapter feeds the Green Agent’s TS-Benchmark MCQs into the harness’s own reasoning / tool-use loop and returns a final label. These four frameworks are the de-facto “agentic harnesses” TS practitioners actually wrap around LLMs today, so they anchor the eval to real downstream usage.
| Harness | What it is | Why it matters for TS reasoning | AgentBeats listing |
|---|---|---|---|
| AgentScope (Gao et al., “AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications”, arXiv:2508.16279, 2025; Gao et al., “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv:2402.14034, 2024) | Production-ready ReAct agent framework with built-in tools, memory, planning, MCP / A2A interop, and an agentic-RL tuner (Trinity-RFT). Apache-2.0. | The most common Python-native ReAct / tool-use stack; A2A support makes the Green Agent wiring mechanical. Serves as our canonical “single-ReAct-loop” purple baseline over TS MCQs. | Melady TS Purple (AgentScope) |
| CAMEL (Li et al., “CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society”, NeurIPS 2023, arXiv:2303.17760) | Role-playing multi-agent framework with stateful memory, structured messages, societies / workforce pipelines, and a strong focus on scaling laws of agents. | Role-play + critic loop tests whether a trend / seasonality / outlier answer survives multi-turn discussion without drifting; complements AgentScope’s single-ReAct shape with an inter-agent communication surface. | Melady TS Base Purple Agent |
| Harness | What it is | Why it matters for TS reasoning | Status |
|---|---|---|---|
| MetaGPT (Hong et al., “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, ICLR 2024 oral, arXiv:2308.00352) | SOP-driven multi-agent system with explicit role decomposition (product manager / architect / engineer / QA) orchestrated by a meta-programming layer. One of the most-starred multi-agent frameworks in the wild. | Gives us a decomposition-heavy purple: a “data analyst” + “forecaster” / “reviewer” role split answering the same MCQs, isolating whether explicit SOPs help TS reasoning vs. a flat ReAct loop. | Planned. A2A wrapper not yet published on AgentBeats. |
| TimeSeriesScientist (TSci) (Zhao et al., “TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis”, arXiv:2510.01538, 2025) | Domain-specific LangGraph agent purpose-built for TS: Curator → Planner → Forecaster → Reporter, with statistical / ML / DL model selection and ensembling (ARIMA, Prophet, LSTM, XGBoost, Transformer, …). | The strongest TS-specialized agent we can find; a natural “ceiling” for agentic TS reasoning and a direct yardstick for whether an RL-post-trained generalist LLM inside a simpler harness closes the gap to a purpose-built TS pipeline. | Planned. A2A adapter not yet published on AgentBeats. |
Green Agent + TemporalBenchEnv + harness panel = a closed evaluation loop. The Green Agent scores every harness on a fixed TS-Benchmark MCQ set; TemporalBenchEnv uses the same datasets and task taxonomy to post-train a backbone with randomized-domain / randomized-task episodes under verifiable rewards; we then drop the post-trained backbone back into AgentScope, CAMEL, MetaGPT, and TSci and re-score. A delta from the RL post-training should show up as a uniform lift across all four harnesses — not just in one framework, and not just in the zero-shot MCQ row of the baselines table.
Time-series reasoning is one of the few areas where frontier LLMs still look visibly weak, but also one where the verifiable signal is clean: trend / volatility / seasonality / outlier labels are constructed from thresholded statistics of the series, so exact-match grading is unambiguous. That makes MCQ episodes a near-ideal RL target before touching noisier numeric forecasting rewards.
The design is transferable. Any benchmark that produces labeled MCQ records over a set of domains — medical diagnostics, power-grid anomaly tagging, retail demand regimes — fits the same 9-question cross-domain template. Datasets are the proxy; the capability is multi-step, multi-domain, verifiable TS reasoning.
Every reward component is ground-truth arithmetic. The environment samples questions from pre-built banks, scores answers via normalized string equality, and aggregates with a closed-form episode bonus. No LLM judge, no circular reward.
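To make “normalized string equality” concrete, the sketch below shows the comparison at the level of detail the text describes. It is not the actual grader: grade_answer in env/reward.py additionally accepts an option whose normalized text matches the stored ground truth.

```python
# Minimal sketch of normalized-equality grading — not the grade_answer implementation.
def _normalize(text: str) -> str:
    # Lowercase, trim, and collapse internal whitespace before comparing.
    return " ".join(text.strip().lower().split())

def is_correct(submitted: str, ground_truth: str) -> bool:
    return _normalize(submitted) == _normalize(ground_truth)

assert is_correct("  Upward ", "upward")       # label match survives case / whitespace noise
assert not is_correct("downward", "upward")    # anything else scores zero
```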
Prior “LLMs + time series” work lands in one of three buckets. None occupies the cell we target:
| Prior work bucket | What it does | What it does not |
|---|---|---|
| Static TS benchmarks TS-Benchmark (ours, OpenReview); FreshRetailNet (arXiv:2505.16319); PSML (Nat. Sci. Data 2022); MIMIC-IV (Nat. Sci. Data 2022); Causal Chambers (Nat. MI 2024) | Construct labeled MCQ / regression tasks over TS datasets, graded by fixed rules | No RL-native environment contract, no sequential episodes, no post-training loop |
| TS-LLM composite RL rewards TimeMaster (arXiv:2506.13705); COUNTS (arXiv:2510.01116); SenTSR-Bench (arXiv:2602.19455) | Composite reward shaping (MSE + DTW + direction + quantile + ArcTan) for TS multimodal LLMs | No OpenEnv contract, no sequential MCQ episodes, reward is forecasting-centric |
| A2A evaluators Melady TS Green Agent (AgentBeats); other AgentBeats green agents | Score deployed purple agents on the benchmark through the A2A protocol | Not a training environment; no per-step RL signal, no environment state |
| TemporalBenchEnv (ours) | Sequential 9-step MCQ MDP over four TS datasets, structured Pydantic actions, terminal reward from ground-truth labels + coverage multiplier, OpenEnv + GRPO contract | Does not (yet) train numeric forecasting — T2/T4 reward is stubbed for future work |
To our knowledge, no prior work exposes TS-Benchmark’s multi-dataset MCQ suite as an OpenEnv-native sequential MDP with verifiable terminal rewards suitable for GRPO post-training. The action space, reward decomposition, and sibling-env architecture follow the LotteryElicitationEnv lineage; the domain and task taxonomy follow our own Melady TS Green Agent.
An OpenEnv-native sequential MDP in which an LLM agent answers nine MCQ questions — six from a primary domain and one each from the other three — earning per-step correctness and a terminal bonus that rewards cross-domain coverage.
Each episode proceeds like this:
- reset() samples nine TSQuestion records: 6 from the primary domain (default PSML), with T3 families round-robined for diversity, and 1 from each of the three non-primary domains. Final order is shuffled (a schematic sketch of this 6 + 1 + 1 + 1 pattern follows below).
- At each step the agent submits a TemporalBenchAction containing the MCQ label, plus optional confidence and reasoning fields.
- The environment grades the answer against question.answer (also accepting an option whose normalized text matches the ground truth) and returns the next question.
- Each step is rewarded with alpha * correctness. Mid-episode bonuses are zero.
- The terminal step adds lambda_ep * (total_correct / N) * coverage_multiplier, where the multiplier is 1.0 if every one of the four domains contributed at least one correct answer, else 0.8.

The agent’s interface is deliberately minimal: a single answer string per step, no tool-call protocol. Optional confidence and reasoning fields exist on the action for future reward shaping.
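A schematic sketch of the sampling pattern (not the actual EpisodeSampler, which also round-robins T3 families within the primary domain and honors the curriculum stage):

```python
import random

DOMAINS = ["PSML", "freshretailnet", "MIMIC", "causal_chambers"]

def sample_episode(banks: dict[str, list[dict]], primary: str = "PSML",
                   seed: int = 0) -> list[dict]:
    """Draw six primary-domain questions plus one from each other domain, then shuffle."""
    rng = random.Random(seed)
    episode = rng.sample(banks[primary], 6)                              # 6 primary questions
    episode += [rng.choice(banks[d]) for d in DOMAINS if d != primary]  # 1 from each other domain
    rng.shuffle(episode)                                                 # final order is shuffled
    return episode
```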
The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (see openenv-ts/TemporalBenchEnv/env/models.py):
# Action (agent → env)
class TemporalBenchAction(Action):
    answer: str                          # MCQ label matching an option
    confidence: Optional[float] = None   # in [0, 1], unused in reward for now
    reasoning: Optional[str] = None      # optional CoT, unused in reward for now

# Observation (env → agent)
class TemporalBenchObservation(Observation):
    step_idx: int
    steps_remaining: int
    max_steps: int
    question: str                        # current MCQ prompt
    options: list[str]                   # 2+ answer choices
    task_type: str                       # "T1U" | "T3" | "T2_MCQ"
    dataset: str                         # "PSML" | "freshretailnet" | "MIMIC" | "causal_chambers"
    history: list[dict]                  # [{question, answer, correct, dataset, ...}, ...]
    accuracy_so_far: float
    done: bool
    reward: Optional[float] = None
    metadata: dict

# State (serializable snapshot)
class TemporalBenchState(State):
    episode_id: Optional[str] = None
    step_count: int
    total_correct: int
    total_questions: int
    current_accuracy: float
    primary_domain: str
    per_task_type_accuracy: dict[str, float]
    total_reward: float
Four datasets, three MCQ task types, and a three-stage curriculum shape the training distribution (see env/config.py and env/episode_sampler.py):
| Stage | Allowed task types | Purpose |
|---|---|---|
| Stage 1 | T1U only (non-contextual understanding MCQ) | Shorten credit assignment; learn trend / volatility / seasonality / outliers first |
| Stage 2 | T1U + T3 (contextual understanding, S1–S5 families) | Add context-conditioned reasoning; maintain verifiable labels |
| Stage 3 | T1U + T3 + T2_MCQ (prediction-as-classification) | Full MCQ track; adds direction-of-change / volatility-change / seasonality-alignment |
Curriculum is honored both at EnvConfig(curriculum_stage=...) construction and at env.reset(curriculum_stage=...), so a single server can serve multiple stages to different sessions concurrently.
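For illustration, two sessions against the same server could request different stages. The sketch below assumes the client’s reset() forwards curriculum_stage to the per-session environment, mirroring the env.reset(curriculum_stage=...) path described above; stage integers follow the table.

```python
# Sketch only — assumes TemporalBenchEnvClient.reset() accepts curriculum_stage.
from client import TemporalBenchEnvClient

with TemporalBenchEnvClient(base_url="http://localhost:8000") as stage1, \
     TemporalBenchEnvClient(base_url="http://localhost:8000") as stage3:
    obs1 = stage1.reset(curriculum_stage=1).observation   # Stage 1: T1U only
    obs3 = stage3.reset(curriculum_stage=3).observation   # Stage 3: T1U + T3 + T2_MCQ
    print(obs1.task_type, obs3.task_type)
```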
OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL’s rollout_func, (2) per-session state with max_concurrent_envs=64 in our create_app factory — each WebSocket session gets a fresh TemporalBenchEnvironment via _env_factory so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development (server/Dockerfile, with TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks), and as a Hugging Face Space during training and evaluation.
No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation / State. All extensions (history, per-domain coverage, per-task accuracy) ride on metadata or the serialized state. No new method signatures, no fork. The env ships with openenv.yaml and a Dockerfile, and passes uv run openenv validate.
Hygiene note: OpenEnv’s CLI validator does a naive substring check for main() in server/app.py. We match the reference pattern from LotteryElicitationEnv — an explicit main() call under if __name__ == "__main__" with CLI flags parsed via parse_known_args.
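A sketch of that entrypoint shape (the --port flag is illustrative; the real flags live in server/app.py):

```python
# Validator-friendly entrypoint pattern: an explicit main() call under __main__,
# with unknown launcher flags tolerated via parse_known_args.
import argparse
import uvicorn

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)   # illustrative flag
    args, _unknown = parser.parse_known_args()
    uvicorn.run("server.app:app", host="0.0.0.0", port=args.port)

if __name__ == "__main__":
    main()
```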
Reward is decomposed into a per-step term and a terminal bonus (see env/reward.py):

\[
r_t = \alpha \cdot \mathbb{1}\!\left[\hat a_t = a_t^\ast\right],
\qquad
B = \lambda_{\mathrm{ep}} \cdot \frac{C}{N} \cdot m,
\qquad
R = \sum_{t=1}^{N} r_t + B .
\]

Here \(\hat a_t\) is the agent’s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches). \(C\) is the total correct count in the episode, \(N = 9\) is the episode length, \(m \in \{0.8, 1.0\}\) is the domain-coverage multiplier, and \(\alpha, \lambda_{\mathrm{ep}}\) are alpha and lambda_ep in EnvConfig.
Defaults live in EnvConfig:
| Component | Weight | What it rewards |
|---|---|---|
| Per-step correctness | α = 1.0 | Normalized-string match against the MCQ ground truth |
| Episode bonus weight | λep = 0.5 | Scales the terminal accuracy×coverage term |
| Coverage multiplier | {0.8, 1.0} | 1.0 iff every domain in EnvConfig.all_domains has ≥1 correct answer this episode |
| Forecasting reward | — | Stubbed (compute_forecasting_reward raises NotImplementedError); future work |
Why a coverage multiplier: per-step accuracy alone lets the agent ace the six primary questions while guessing on the other three domains, collapsing to a single-domain policy. The 0.8 penalty forces the policy to treat the three cross-domain questions as first-class signal — the very thing that distinguishes a TS-generalist from a PSML-only memorizer.
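The bonus arithmetic is small enough to spell out. The sketch below simply restates the defaults from the table above (it is not env/reward.py):

```python
def episode_bonus(total_correct: int, n_questions: int = 9,
                  all_domains_covered: bool = True, lambda_ep: float = 0.5) -> float:
    """Terminal bonus = lambda_ep * accuracy * coverage multiplier (EnvConfig defaults)."""
    coverage = 1.0 if all_domains_covered else 0.8
    return lambda_ep * (total_correct / n_questions) * coverage

print(round(episode_bonus(7, 9, True), 3))    # 0.389 — all four domains covered
print(round(episode_bonus(7, 9, False), 3))   # 0.311 — same accuracy, one domain missed
```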
Following the LotteryElicitationEnv / ReasoningEconomicsEnv lineage, TemporalBenchEnv (this page) ships the OpenEnv side: reset, step, rewards, and question banks over WebSocket. A separate trainer process — same env/trainer separation as LotteryElicitationEnv and LotteryElicitationPT — would drive GRPO with TRL’s rollout_func and vLLM against that socket, without in-process imports of env-side types.
Purple agents on AgentBeats. The same TS-Benchmark task surface is also exercised by purple agents scored through the A2A green agent: live listings are Melady TS Purple (AgentScope) and Melady TS Base Purple Agent. Planned additions are documented in § Purple agent harnesses.
flowchart LR
subgraph TRN ["Trainer (TRL + GRPO, WebSocket client)"]
GRPO["GRPOTrainer
TRL 1.0"]
RF["rollout_func"]
VLLM["vLLM
colocate/server"]
PARSE["action_parser
MCQ label guardrails"]
end
subgraph ENV ["TemporalBenchEnv (OpenEnv)"]
WS["FastAPI
WebSocket"]
SAMP["EpisodeSampler
6+1+1+1 stratified"]
GRADE["grade_answer
normalized match"]
REW["Reward
per-step + bonus + coverage"]
end
GRPO --> RF
RF --> VLLM
VLLM -->|"generate"| PARSE
PARSE -->|"answer string"| WS
WS --> SAMP
SAMP -->|"next question"| WS
WS --> GRADE
GRADE --> REW
REW -->|"step + terminal reward"| WS
WS -->|"observation"| RF
Figure 1. System architecture. The trainer never imports env-side types — everything crosses the WebSocket, exactly like our Lottery / Reasoning sibling envs.
Training uses GRPO (Group Relative Policy Optimization), which is a natural fit for per-step verifiable rewards with an additive terminal bonus. The training scaffolding is directly inherited from LotteryElicitationPT: TRL 1.0’s rollout_func contract, vLLM colocate/server, chat-template tokenization with enable_thinking=False, think-block stripping, null-safe MCQ-label parsing, and episode logging to reward_logs.jsonl.
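Of those guardrails, the null-safe MCQ-label parsing is the one most specific to this environment. The sketch below illustrates the idea; the real action_parser ships with the training client and is not reproduced here.

```python
import json
import re

# Sketch of null-safe MCQ-label parsing: strip think blocks, tolerate {"answer": null}
# or malformed JSON, and fall back to a legal option instead of crashing a DDP rank.
def parse_mcq_answer(raw: str, options: list[str]) -> str:
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    answer = None
    try:
        payload = json.loads(text)
        if isinstance(payload, dict):
            answer = payload.get("answer")
    except json.JSONDecodeError:
        answer = text                                # treat raw text as the label
    if not isinstance(answer, str) or not answer.strip():
        return options[0]                            # null / malformed → fallback action
    for opt in options:
        if answer.strip().lower() == opt.strip().lower():
            return opt                               # case-insensitive option match
    return options[0]                                # no match → fallback rather than crash
```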
Here is what a high-reward episode would look like — five representative steps out of nine, spanning all four domains. Primary domain is PSML; the trace shows the three cross-domain picks plus two primary turns, culminating in the terminal step where the coverage multiplier decides the shape of the bonus. Turns 2, 4, 6, 8 (all PSML T1U) are elided so the walkthrough stays focused on the cross-domain structure; two of them are assumed correct and two incorrect, giving the 7 / 9 total.
The trace is constructed by hand: every number (r_t, B, R) is computed exactly as env/reward.py would compute it for the stated correctness pattern. A captured trace from a real model will replace this once the training client is released and run at scale (see Architecture & training pipeline).
"upward" → correct (r_1 = 1.0)
"fixed" → correct (r_3 = 1.0)
q95 + 3·MAD."sudden_spike" → correct (r_5 = 1.0)
C4)."delayed_response" → correct (r_7 = 1.0)
"Higher" → correct (r_9 = 1.0)C = 7 / 9 correct, all four domains covered → m = 1.0.B = 0.5 · (7/9) · 1.0 ≈ 0.389.R ≈ 7.389.
B ≈ 0.311 and R ≈ 7.311. The coverage term is the whole reason a PSML-only policy loses to a generalist.
Purple baselines (AgentBeats). For deployed purple policies on the same benchmark lineage, see Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.
The environment supports the following baseline policies out of the box (random + majority) plus an eval harness for zero-shot and trained LLMs. All run against the live OpenEnv WebSocket so numbers are directly comparable with the trained policy.
| Baseline | Policy | What it isolates |
|---|---|---|
| Random MCQ | Uniform sample over observation.options | Lower bound; beats zero only if options are imbalanced |
| Majority-class | Always pick the per-task_type modal label from the bank | Isolates how much accuracy is available from priors alone |
| Zero-shot API LLM | GPT / Claude / Gemini via the eval harness | Strong “off-the-shelf” ceiling before any post-training |
| Zero-shot local LLM | Qwen2.5-7B-Instruct served via vLLM | Planned backbone for GRPO fine-tuning against this environment once training runs land |
| Trained HF policy | GRPO checkpoint from rollouts on this OpenEnv | Tests whether post-training on TS-Benchmark episodes beats zero-shot |
The numbers below are analytical projections from the environment’s structure, not empirical results. They exist to anchor what “good” looks like once training runs land.
| Metric | Random MCQ | Zero-shot LLM (expected) | Target for trained policy |
|---|---|---|---|
| Per-step accuracy (T1U & T3, 3–4 options) | ≈ 0.25–0.33 | >> 0.33 | ≥ strong zero-shot |
| Per-step accuracy (T2_MCQ, 4 options) | ≈ 0.25 | > random | ≥ strong zero-shot |
| Coverage multiplier m | usually 0.8 | 0.8–1.0 | 1.0 consistently |
| Episode bonus B (λep = 0.5, N = 9) | ≈ 0.10 | 0.20–0.35 | ≥ 0.40 |
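The bonus rows follow directly from the terminal-bonus formula: a random policy at chance accuracy with coverage usually stuck at 0.8 gets \(B \approx 0.5 \times 0.25 \times 0.8 = 0.10\), while the ≥ 0.40 target corresponds to roughly 0.8 accuracy at full coverage (\(0.5 \times 0.8 \times 1.0 = 0.40\)).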
What is already in place:
- Environment: reset / step / state, nine-question stratified sampling across four domains, per-step + terminal reward with coverage multiplier.
- Question banks: 2,775 TSQuestion records built from TS-benchmark/task_merged_dev_with_labels_tiers.jsonl via TS-benchmark/scripts/build_temporal_bench_openenv_banks.py and vendored under openenv-ts/TemporalBenchEnv/data/banks/ (PSML 750 / freshretailnet 616 / MIMIC 709 / causal_chambers 700).
- Packaging: openenv.yaml, Docker image (server/Dockerfile, TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks), openenv validate passes, openenv push ready for HF Space.
- Training scaffolding: inherited from LotteryElicitationPT (GRPO + TRL 1.0 rollout_func + vLLM). Training-client release status: see Architecture.

Because TemporalBenchEnv intentionally rides on the same deployment + training pattern as LotteryElicitationEnv and ReasoningEconomicsEnv, most of the hard infra lessons are inherited rather than rediscovered. We document them briefly so the next submitter does not need to learn them from scratch.
| Issue | Root cause | Fix (inherited) |
|---|---|---|
| NCCL desync under variable-length episodes | In vllm_mode=server, different DDP ranks make different numbers of generate() calls per episode → sequence-numbered NCCL collectives go out of sync. | Fixed-count generate() padding per episode; dummy generates discarded via _temporary_vllm_max_tokens(..., 1). Gated on world_size > 1. |
| max_completion_length drift over multi-turn rollouts | The rollout function appends per-turn generations + observation suffixes each step; a 9-turn MCQ episode can easily exceed the nominal completion budget. | Hard-cap completion_ids to max_completion_length; PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. |
| Null-safe MCQ parsing | LLMs sometimes emit {"answer": null} or malformed strings; one rank crashing kills all DDP ranks via gloo cascade. | Inherited _safe_float / _safe_int / null-string fallbacks; fall back to a default action rather than crash. The same regression pattern from LotteryElicitationPT applies directly to MCQ labels. |
| openenv validate hygiene | OpenEnv’s CLI does a naive substring check for main() in server/app.py; an entrypoint like main(port=args.port) fails validation. | Match the Lottery pattern: if __name__ == "__main__": main(), with flags parsed via parse_known_args inside main(). |
| Foundation | Role in this project | Citation |
|---|---|---|
| TS-Benchmark (ours) | Task taxonomy (T1/T2/T3/T4), per-dataset label construction, MCQ question shape | OpenReview · in-repo: TS-benchmark/TS-Benchmark.md |
| Melady TS Green Agent | A2A evaluator whose task set TemporalBenchEnv re-exposes as an OpenEnv environment | AgentBeats · GitHub repository |
| AgentScope | Purple-agent harness — ReAct / MCP / A2A framework wrapping the backbone we post-train (live on AgentBeats) | Gao et al., arXiv:2508.16279 (2025) · arXiv:2402.14034 (2024) · GitHub |
| CAMEL | Purple-agent harness — role-playing multi-agent society, CAMEL-backed baseline on AgentBeats | Li et al., NeurIPS 2023 (arXiv:2303.17760) · GitHub |
| MetaGPT | Purple-agent harness (fast-follow-up) — SOP-driven multi-agent system, role-decomposition purple | Hong et al., ICLR 2024, oral (arXiv:2308.00352) · GitHub |
| TimeSeriesScientist (TSci) | Purple-agent harness (fast-follow-up) — TS-specialized Curator/Planner/Forecaster/Reporter agent; “ceiling” for agentic TS reasoning | Zhao et al., arXiv:2510.01538 (2025) · GitHub |
| FreshRetailNet-50K | Retail demand dataset; T1/T2/T3/T4 MCQ questions | Ding et al., arXiv:2505.16319, 2025 |
| PSML | Power-system load dataset; primary domain for the default episode | Nature Sci. Data 2022 |
| MIMIC-IV | ICU/EHR time-series dataset; medical-domain MCQ | Nature Sci. Data 2022 |
| Causal Chambers | Physical-testbed TS dataset; contextual T3 and wind-chamber T1U | Nature MI 2024 |
| TimeMaster / COUNTS / SenTSR-Bench | Prior art on composite RL rewards for TS-LLMs; motivates T2/T4 forecasting future work | TimeMaster (arXiv:2506.13705) · COUNTS (arXiv:2510.01116) · SenTSR-Bench (arXiv:2602.19455) |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment | HF Blog: Introducing OpenEnv |
| TRL + GRPO | GRPOTrainer, custom rollout_func, remote env rollouts | Shao et al., arXiv:2402.03300 (DeepSeekMath) · TRL × OpenEnv |
| LotteryElicitationEnv / PT | Sibling project — structural template for env / PT split, rollout_func, DDP padding, validation hygiene | HF Space (Env) · GitHub (PT) |
# 1. Install and run the env locally
cd openenv-ts/TemporalBenchEnv
uv sync --extra dev
uv run pytest tests/ -q
uv run openenv validate
# 2. Run the server (uvicorn, port 8000)
uv run server
# or
uvicorn server.app:app --reload
# 3. Build & run the Docker image
docker build -t temporalbenchenv:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 \
-e TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks \
temporalbenchenv:latest
# 4. Or pull / push a HF Space
export ENV_BASE_URL="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv"
openenv push # from the TemporalBenchEnv/ directory
# 5. Minimal client usage
python - <<'PY'
from client import TemporalBenchAction, TemporalBenchEnvClient
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
out = env.reset()
while not out.done:
q = out.observation
out = env.step(TemporalBenchAction(answer=q.options[0]))
print("total reward:", out.observation.reward)
PY
# 6. GRPO / TRL training against this Space (training client under test; not on GitHub yet)
# Release is planned after additional compute for reasonable model validation.
Purple track (AgentBeats): Melady TS Purple (AgentScope) · Melady TS Base Purple Agent.
Banks are reproducible from (env_seed, curriculum_stage, primary_domain). No external fixtures, no live API, no human labels — the same flat tiered JSONL that drives the Melady TS Green Agent drives this environment.
Planned extensions:

- Forecasting rewards: replace the compute_forecasting_reward stub with the composite sketched in our v1 plan (normalized MSE + DTW-shape + direction + quantile + ArcTan-smoothed MAE). Opens T2 numeric and T4 contextual forecasting tasks without touching the step loop.
- Calibration shaping: TemporalBenchAction.confidence already exists; shape rewards around Brier-style calibration so the policy is incentivized to know when it knows.
- Reasoning shaping: TemporalBenchAction.reasoning is captured per step but currently unscored; a light format-plus-consistency shaping mirrors the Lottery format-weight technique.
- Domain weighting: expose the per-domain question mix as a configurable knob in EpisodeSampler; today it emerges from bank sizes.

TemporalBenchEnv is the training-side companion to our Melady TS Green Agent’s eval-side role. The Green Agent ranks purple agents on TS-Benchmark through the A2A protocol; TemporalBenchEnv re-exposes the same four-dataset MCQ suite as an OpenEnv-native sequential MDP, so an LLM can be post-trained on the very benchmark it will later be scored against.
Every design choice — 9-question episodes with 6 primary + 3 cross-domain, per-step correctness + terminal bonus with a domain-coverage multiplier, stagewise curriculum T1U → +T3 → +T2_MCQ, and a strict separation of env server and trainer over WebSocket — is aimed at preserving the verifiability that made the Green Agent’s signal trustworthy, while turning it into gradient. The infrastructure contributions are mostly inherited from our Lottery / Reasoning siblings; the novelty is the domain binding and the Green-Agent → OpenEnv pipeline.
The research question is open: can a GRPO-trained LLM, post-trained on our own benchmark, outperform strong zero-shot baselines at cross-domain TS reasoning? The environment, banks, and deployment path are shipped; full training results await the public training client and the compute budget described under Architecture. The purple track stays live on AgentBeats via Melady TS Purple (AgentScope) and Melady TS Base Purple Agent.