Updating blog with proper figures
- .gitattributes +1 -0
- README.md +79 -6
- banner.png +3 -0
- index.html +815 -18
.gitattributes CHANGED

```diff
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+banner.png filter=lfs diff=lfs merge=lfs -text
```
README.md CHANGED

```diff
@@ -1,12 +1,85 @@
 ---
 title: TemporalBenchEnv Blog
-emoji:
-colorFrom:
-colorTo:
+emoji: 🥁
+colorFrom: indigo
+colorTo: yellow
 sdk: static
 pinned: false
-license: apache-2.0
-short_description: Blog for TemporalBench for OpenEnv Comp in AgentX
 ---
 
-
+# TemporalBenchEnv Blog
+
+Static Hugging Face Space hosting the blog write-up for
+[`openenv-ts/TemporalBenchEnv`](../TemporalBenchEnv) — the OpenEnv-native
+multi-step MCQ environment that extends our Melady TS Green Agent
+submission into a GRPO training target.
+
+## Files
+
+| File | Purpose |
+| --- | --- |
+| `index.html` | Main blog page. Self-contained; pulls Inter + JetBrains Mono from Google Fonts, and Mermaid and MathJax from jsDelivr. |
+| `banner.png` | Hero image (3-panel: Question → Answer → Score). Generated by `build_banner.py`. |
+| `build_banner.py` | Matplotlib script that renders `banner.png`. Port of [`openenv-finance/impl-context/build_blog_images.py`](../../openenv-finance/impl-context/build_blog_images.py), palette-matched to the blog CSS variables. |
+| `README.md` | This file (also doubles as HF Space frontmatter). |
+
+## External links to fill in
+
+All external URLs in `index.html` that are not yet final are marked with
+`<TODO: ...>` / `<INSERT: ...>` placeholders. Search for `<TODO` to find
+them all. The current inventory:
+
+- GitHub repo URL for TemporalBenchEnv (hero badges, buttons, footer, quick start).
+- HF Space URL for TemporalBenchEnv (hero badges, buttons, topnav, footer, quick start).
+- Sibling `TemporalBenchPT` training-repo URL (architecture section, future work, quick start).
+- TS-Benchmark arXiv link (prior work, foundations).
+- TimeMaster / COUNTS / SenTSR-Bench arXiv links.
+- Melady TS Green Agent public repo / demo URL (green-agent section, foundations, footer).
+- Post-training numbers for the "Trained HF policy" row (results section).
+- LotteryElicitationEnv / ReasoningEconomicsEnv public URLs (prior work, engineering).
```
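The "search for `<TODO`" step above can also be done mechanically. A minimal sketch of such a check (a hypothetical helper, not part of the repo; it assumes placeholders literally use the `<TODO: ...>` / `<INSERT: ...>` syntax described above):

```python
import re

# Hypothetical helper: list every unfilled <TODO: ...> / <INSERT: ...>
# placeholder in the page source, so none slip through before deploying.
PLACEHOLDER = re.compile(r"<(TODO|INSERT):\s*([^>]*)>")

def find_placeholders(html: str) -> list[tuple[str, str]]:
    """Return (kind, description) pairs for each placeholder marker."""
    return PLACEHOLDER.findall(html)

# Example on a snippet of page source:
snippet = '<a href="<TODO: GitHub repo URL>">Repo</a> <INSERT: HF Space URL>'
print(find_placeholders(snippet))
# → [('TODO', 'GitHub repo URL'), ('INSERT', 'HF Space URL')]
```

Running it over `index.html` before a deploy gives a quick zero-placeholder sanity check.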
````diff
+
+## Regenerating `banner.png`
+
+Requires `matplotlib` and `numpy`:
+
+```bash
+pip install matplotlib numpy
+python build_banner.py                # writes ./banner.png
+python build_banner.py -o /tmp/b.png  # custom path
+python build_banner.py --dpi 300      # higher DPI for print
+```
+
+The script uses the same theme constants (`BG`, `SURFACE`, `ACCENT`, `ACCENT2`,
+`GREEN`, `RED`, `ORANGE`) as `index.html`'s `:root` CSS variables so the
+banner stays visually consistent with the page.
````
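As a sketch of what that palette mirroring might look like in `build_banner.py` (hypothetical layout; the hex values are the ones defined in `index.html`'s `:root` block):

```python
# Theme constants mirrored from index.html's :root CSS variables, so the
# matplotlib banner and the page share one palette. (Illustrative sketch;
# see build_banner.py in the repo for the actual definitions.)
BG      = "#0f172a"  # --bg
SURFACE = "#1e293b"  # --surface
ACCENT  = "#6366f1"  # --accent
ACCENT2 = "#818cf8"  # --accent2
GREEN   = "#22c55e"  # --green
RED     = "#ef4444"  # --red
ORANGE  = "#f59e0b"  # --orange

# Typical use in the renderer, e.g.:
#   fig.patch.set_facecolor(BG)
#   ax.plot(xs, ys, color=ACCENT)
```

Keeping a single source of hex values like this is what prevents the banner and the CSS from drifting apart.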
````diff
+
+## Local preview
+
+```bash
+cd openenv-ts/TemporalBenchEnv_Blog
+python -m http.server 8000
+# open http://localhost:8000/index.html
+```
+
+Mermaid and MathJax render client-side over CDN, exactly as they do on the
+static HF Space, so local preview matches production.
+
+## Deploying as a Hugging Face static Space
+
+1. Create a new static Space (SDK: `static`) under the target namespace.
+2. Copy `index.html`, `banner.png`, and `README.md` to the Space repo.
+3. Commit and push — the Space serves `index.html` directly.
+
+The YAML frontmatter at the top of this README is the HF Space manifest;
+adjust `title` (and `sdk: static`, if ever needed) to match the target Space.
+
+## Style / structural source
+
+The blog structure, CSS palette, and component library (topnav, hero, TOC,
+episode trace, tables, mermaid-wrap, math-display, callout, notice panel) are
+ported from the sibling
+[`openenv-finance/LotteryElicitationEnv_Blog/index.html`](../../openenv-finance/LotteryElicitationEnv_Blog/index.html)
+per the shared
+[blog-creation-guidelines.md](../../openenv-finance/impl-context/blog-creation-guidelines.md).
+Keeping the structural drift between the two blogs near zero makes the
+Melady OpenEnv submissions read as a coherent series.
````
banner.png ADDED (Git LFS)
index.html CHANGED

```diff
@@ -1,19 +1,816 @@
-<!
-<html>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
 </html>
```
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="utf-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1">
|
| 6 |
+
<title>TemporalBenchEnv: An OpenEnv Benchmark for Post-training LLMs on Multi-step Time-Series Reasoning</title>
|
| 7 |
+
<meta name="description" content="TemporalBenchEnv: an OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning, extending our Melady TS Green Agent submission with a verifiable-reward GRPO training path.">
|
| 8 |
+
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 9 |
+
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| 10 |
+
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
|
| 11 |
+
<!-- Mermaid for inline diagrams -->
|
| 12 |
+
<script type="module">
|
| 13 |
+
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
|
| 14 |
+
mermaid.initialize({
|
| 15 |
+
startOnLoad: true,
|
| 16 |
+
theme: 'dark',
|
| 17 |
+
themeVariables: {
|
| 18 |
+
primaryColor: '#6366f1',
|
| 19 |
+
primaryTextColor: '#e2e8f0',
|
| 20 |
+
primaryBorderColor: '#818cf8',
|
| 21 |
+
lineColor: '#818cf8',
|
| 22 |
+
secondaryColor: '#1e293b',
|
| 23 |
+
tertiaryColor: '#172033',
|
| 24 |
+
background: '#0f172a',
|
| 25 |
+
mainBkg: '#1e293b',
|
| 26 |
+
nodeBorder: '#818cf8',
|
| 27 |
+
clusterBkg: '#172033',
|
| 28 |
+
clusterBorder: '#334155',
|
| 29 |
+
titleColor: '#e2e8f0',
|
| 30 |
+
edgeLabelBackground: '#1e293b',
|
| 31 |
+
nodeTextColor: '#e2e8f0'
|
| 32 |
+
},
|
| 33 |
+
flowchart: { curve: 'basis', htmlLabels: true },
|
| 34 |
+
fontFamily: 'Inter, sans-serif'
|
| 35 |
+
});
|
| 36 |
+
</script>
|
| 37 |
+
<style>
|
| 38 |
+
:root {
|
| 39 |
+
--bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155;
|
| 40 |
+
--text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1;
|
| 41 |
+
--accent2: #818cf8; --green: #22c55e; --red: #ef4444;
|
| 42 |
+
--orange: #f59e0b; --radius: 12px;
|
| 43 |
+
}
|
| 44 |
+
* { margin: 0; padding: 0; box-sizing: border-box; }
|
| 45 |
+
html { scroll-behavior: smooth; }
|
| 46 |
+
body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
|
| 47 |
+
background: var(--bg); color: var(--text); line-height: 1.7;
|
| 48 |
+
-webkit-font-smoothing: antialiased; }
|
| 49 |
+
.container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; }
|
| 50 |
+
/* Top nav */
|
| 51 |
+
.topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85);
|
| 52 |
+
backdrop-filter: blur(10px); border-bottom: 1px solid var(--border);
|
| 53 |
+
padding: .9rem 1.5rem; display: flex; justify-content: space-between;
|
| 54 |
+
align-items: center; font-size: .88rem; }
|
| 55 |
+
.topnav .brand { font-weight: 700; color: var(--text); text-decoration: none;
|
| 56 |
+
display: flex; align-items: center; gap: .5rem; }
|
| 57 |
+
.topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%;
|
| 58 |
+
background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); }
|
| 59 |
+
.topnav .links { display: flex; gap: 1.25rem; }
|
| 60 |
+
.topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; }
|
| 61 |
+
.topnav .links a:hover { color: var(--accent2); }
|
| 62 |
+
/* Hero */
|
| 63 |
+
.hero { text-align: center; padding: 4rem 0 2.5rem; }
|
| 64 |
+
.hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2);
|
| 65 |
+
padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600;
|
| 66 |
+
letter-spacing: .08em; margin-bottom: 1.25rem;
|
| 67 |
+
border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; }
|
| 68 |
+
.hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em;
|
| 69 |
+
line-height: 1.15;
|
| 70 |
+
background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%);
|
| 71 |
+
-webkit-background-clip: text; -webkit-text-fill-color: transparent;
|
| 72 |
+
background-clip: text; }
|
| 73 |
+
.hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 640px;
|
| 74 |
+
margin: 1rem auto 0; }
|
| 75 |
+
.hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem;
|
| 76 |
+
font-style: italic; }
|
| 77 |
+
.banner { width: 100%; border-radius: var(--radius); margin: 2rem 0 3rem;
|
| 78 |
+
border: 1px solid var(--border); }
|
| 79 |
+
/* Badges row */
|
| 80 |
+
.badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap;
|
| 81 |
+
margin: 1.5rem 0; }
|
| 82 |
+
.badges img { height: 22px; }
|
| 83 |
+
/* Button group */
|
| 84 |
+
.btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0;
|
| 85 |
+
flex-wrap: wrap; }
|
| 86 |
+
.btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem;
|
| 87 |
+
background: var(--accent); color: white; border-radius: 8px; font-size: .88rem;
|
| 88 |
+
font-weight: 600; text-decoration: none; transition: all .2s; }
|
| 89 |
+
.btn:hover { background: var(--accent2); transform: translateY(-1px); }
|
| 90 |
+
.btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
|
| 91 |
+
.btn-outline:hover { border-color: var(--accent); color: var(--accent2);
|
| 92 |
+
background: rgba(99,102,241,.08); }
|
| 93 |
+
/* TOC */
|
| 94 |
+
.toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius);
|
| 95 |
+
padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; }
|
| 96 |
+
.toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase;
|
| 97 |
+
color: var(--accent2); margin-bottom: .85rem; }
|
| 98 |
+
.toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem;
|
| 99 |
+
margin: 0; padding: 0; }
|
| 100 |
+
.toc ol li { counter-increment: toc; font-size: .88rem; }
|
| 101 |
+
.toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700;
|
| 102 |
+
font-size: .8rem; margin-right: .3rem; }
|
| 103 |
+
.toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; }
|
| 104 |
+
.toc ol li a:hover { color: var(--accent2); }
|
| 105 |
+
/* Sections */
|
| 106 |
+
section { margin: 3.5rem 0; }
|
| 107 |
+
section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em;
|
| 108 |
+
margin-bottom: 1rem; color: var(--text);
|
| 109 |
+
border-left: 3px solid var(--accent); padding-left: .9rem; }
|
| 110 |
+
section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem;
|
| 111 |
+
color: var(--accent2); }
|
| 112 |
+
section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; }
|
| 113 |
+
section p strong { color: var(--text); }
|
| 114 |
+
section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; }
|
| 115 |
+
section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; }
|
| 116 |
+
section ul li strong, section ol li strong { color: var(--text); }
|
| 117 |
+
/* Pull-quote */
|
| 118 |
+
blockquote { border-left: 3px solid var(--accent2);
|
| 119 |
+
background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem;
|
| 120 |
+
margin: 1.5rem 0; border-radius: 0 8px 8px 0;
|
| 121 |
+
color: #e2e8f0; font-size: 1.02rem; }
|
| 122 |
+
/* Tables */
|
| 123 |
+
.table-wrap { margin: 1.5rem 0; overflow-x: auto;
|
| 124 |
+
background: var(--surface); border: 1px solid var(--border);
|
| 125 |
+
border-radius: var(--radius); }
|
| 126 |
+
table { width: 100%; border-collapse: collapse; font-size: .92rem; }
|
| 127 |
+
th { background: rgba(99,102,241,.1); color: var(--accent2);
|
| 128 |
+
font-size: .72rem; font-weight: 700; letter-spacing: .06em;
|
| 129 |
+
text-transform: uppercase; padding: .85rem 1rem; text-align: left; }
|
| 130 |
+
td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; }
|
| 131 |
+
td.num { text-align: right; font-variant-numeric: tabular-nums;
|
| 132 |
+
font-family: 'JetBrains Mono', monospace; font-size: .88rem; }
|
| 133 |
+
tr:hover td { background: rgba(99,102,241,.04); }
|
| 134 |
+
td strong, th strong { color: var(--text); }
|
| 135 |
+
.task-id { font-family: 'JetBrains Mono', monospace; font-weight: 700;
|
| 136 |
+
color: var(--accent2); font-size: .85rem; }
|
| 137 |
+
tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700;
|
| 138 |
+
color: var(--text); }
|
| 139 |
+
tr.novel td:first-child { color: #fca5a5; }
|
| 140 |
+
/* Code */
|
| 141 |
+
pre { background: #0b1120; border: 1px solid var(--border);
|
| 142 |
+
border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto;
|
| 143 |
+
margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace;
|
| 144 |
+
font-size: .85rem; line-height: 1.6; color: #d1d5db; }
|
| 145 |
+
pre .c { color: #64748b; }
|
| 146 |
+
code { font-family: 'JetBrains Mono', monospace; font-size: .88em;
|
| 147 |
+
background: rgba(99,102,241,.12); color: var(--accent2);
|
| 148 |
+
padding: .1em .35em; border-radius: 4px; }
|
| 149 |
+
pre code { background: none; color: inherit; padding: 0; font-size: 1em; }
|
| 150 |
+
/* Figure */
|
| 151 |
+
figure { margin: 2rem 0; }
|
| 152 |
+
figure img { width: 100%; border-radius: var(--radius);
|
| 153 |
+
border: 1px solid var(--border); }
|
| 154 |
+
figcaption { text-align: center; color: var(--muted); font-size: .85rem;
|
| 155 |
+
margin-top: .75rem; }
|
| 156 |
+
/* Mermaid diagram wrapper */
|
| 157 |
+
.mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border);
|
| 158 |
+
border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; }
|
| 159 |
+
.mermaid-wrap .mermaid { display: flex; justify-content: center; }
|
| 160 |
+
.mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem;
|
| 161 |
+
margin-top: .75rem; }
|
| 162 |
+
/* Episode trace */
|
| 163 |
+
.episode-trace { background: var(--surface); border: 1px solid var(--border);
|
| 164 |
+
border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0;
|
| 165 |
+
position: relative; }
|
| 166 |
+
.episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem;
|
| 167 |
+
bottom: 1.25rem; width: 2px; background: var(--border); }
|
| 168 |
+
.trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; }
|
| 169 |
+
.trace-step:last-child { margin-bottom: 0; }
|
| 170 |
+
.trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px;
|
| 171 |
+
height: 12px; border-radius: 50%; border: 2px solid var(--accent);
|
| 172 |
+
background: var(--bg); z-index: 1; }
|
| 173 |
+
.trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); }
|
| 174 |
+
.trace-step .step-marker.good { background: var(--green); border-color: var(--green); }
|
| 175 |
+
.trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem;
|
| 176 |
+
color: var(--accent2); font-weight: 700; margin-bottom: .25rem; }
|
| 177 |
+
.trace-step .step-content { font-size: .9rem; color: #cbd5e1; }
|
| 178 |
+
.trace-step .step-content code { font-size: .82em; }
|
| 179 |
+
.trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px;
|
| 180 |
+
font-size: .9rem; font-weight: 600; }
|
| 181 |
+
.trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3);
|
| 182 |
+
color: #fca5a5; }
|
| 183 |
+
.trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3);
|
| 184 |
+
color: #86efac; }
|
| 185 |
+
/* Callout for the closing question */
|
| 186 |
+
.callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0;
|
| 187 |
+
background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04));
|
| 188 |
+
border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); }
|
| 189 |
+
.callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text);
|
| 190 |
+
font-style: italic; margin-bottom: .5rem; }
|
| 191 |
+
.callout .sub { color: var(--muted); font-size: .95rem; }
|
| 192 |
+
/* Notice panel (for "empirical results pending" etc.) */
|
| 193 |
+
.notice { padding: 1.1rem 1.25rem; margin: 1.5rem 0;
|
| 194 |
+
background: rgba(245,158,11,.08); border: 1px solid rgba(245,158,11,.3);
|
| 195 |
+
border-radius: var(--radius); color: #fde68a; font-size: .95rem; }
|
| 196 |
+
.notice strong { color: #fbbf24; }
|
| 197 |
+
/* Footer */
|
| 198 |
+
.footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted);
|
| 199 |
+
font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; }
|
| 200 |
+
.footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; }
|
| 201 |
+
.footer a:hover { text-decoration: underline; }
|
| 202 |
+
@media (max-width: 640px) {
|
| 203 |
+
.container { padding: 1rem 1rem 3rem; }
|
| 204 |
+
.hero { padding: 2.5rem 0 1.5rem; }
|
| 205 |
+
.topnav .links { display: none; }
|
| 206 |
+
section h2 { font-size: 1.3rem; }
|
| 207 |
+
table { font-size: .82rem; }
|
| 208 |
+
th, td { padding: .55rem .6rem; }
|
| 209 |
+
.toc ol { flex-direction: column; }
|
| 210 |
+
.episode-trace { padding: 1rem; }
|
| 211 |
+
.episode-trace::before { left: 1rem; }
|
| 212 |
+
}
|
| 213 |
+
/* MathJax display blocks */
|
| 214 |
+
.math-display {
|
| 215 |
+
margin: 1.25rem 0;
|
| 216 |
+
padding: 1rem 1.25rem 1.15rem;
|
| 217 |
+
overflow-x: auto;
|
| 218 |
+
background: var(--surface);
|
| 219 |
+
border: 1px solid var(--border);
|
| 220 |
+
border-radius: var(--radius);
|
| 221 |
+
text-align: center;
|
| 222 |
+
}
|
| 223 |
+
.math-display mjx-container[jax="CHTML"][display="true"] { margin: 0.65em 0 !important; }
|
| 224 |
+
.math-display mjx-container { color: #e2e8f0 !important; }
|
| 225 |
+
.math-note { font-size: .9rem; color: var(--muted); margin-top: .35rem; margin-bottom: 0; }
|
| 226 |
+
</style>
|
| 227 |
+
<!-- MathJax 3: client-side TeX like static HF Spaces (sdk: static) serve plain HTML. -->
|
| 228 |
+
<script>
|
| 229 |
+
window.MathJax = {
|
| 230 |
+
tex: {
|
| 231 |
+
inlineMath: [['\\(', '\\)']],
|
| 232 |
+
displayMath: [['\\[', '\\]']]
|
| 233 |
+
},
|
| 234 |
+
options: {
|
| 235 |
+
renderActions: {
|
| 236 |
+
addMenu: [0, '', '']
|
| 237 |
+
}
|
| 238 |
+
}
|
| 239 |
+
};
|
| 240 |
+
</script>
|
| 241 |
+
<script defer src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" id="MathJax-script"></script>
|
| 242 |
+
</head>
|
| 243 |
+
<body>
|
| 244 |
+
|
| 245 |
+
<nav class="topnav">
|
| 246 |
+
<a href="#top" class="brand"><span class="dot"></span> TemporalBenchEnv Blog</a>
|
| 247 |
+
<div class="links">
|
| 248 |
+
<a href="#why">Why</a>
|
| 249 |
+
<a href="#green-agent">Green Agent</a>
|
| 250 |
+
<a href="#design">Design</a>
|
| 251 |
+
<a href="#scoring">Scoring</a>
|
| 252 |
+
<a href="#traces">Traces</a>
|
| 253 |
+
<a href="#results">Results</a>
|
| 254 |
+
<a href="#engineering">Engineering</a>
|
| 255 |
+
<a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">Live Space ↗</a>
|
| 256 |
+
</div>
|
| 257 |
+
</nav>
|
| 258 |
+
|
| 259 |
+
<div class="container" id="top">
|
| 260 |
+
|
| 261 |
+
<div class="hero">
|
| 262 |
+
<div class="hero-badge">OpenEnv · Extension of Melady TS Green Agent</div>
|
| 263 |
+
<h1>TemporalBenchEnv</h1>
|
| 264 |
+
<p class="subtitle">An OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning — built on the four datasets from our Melady TS Green Agent submission.</p>
|
| 265 |
+
<div class="badges">
|
| 266 |
+
<a href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"/></a>
|
| 267 |
+
<a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a>
|
| 268 |
+
<img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
|
| 269 |
+
<img src="https://img.shields.io/badge/Datasets-4-brightgreen" alt="4 Datasets"/>
|
| 270 |
+
<img src="https://img.shields.io/badge/Episodes-9%20MCQ-blueviolet" alt="9-step MCQ"/>
|
| 271 |
+
<img src="https://img.shields.io/badge/Training-GRPO%20(planned)-orange" alt="GRPO (planned)"/>
|
| 272 |
+
</div>
|
| 273 |
+
<div class="byline">Melady / AgentX OpenEnv Track | Muyan Weng (USC) · Defu Cao (USC) · Yashaswi Sharma (USC) · Yan Liu (USC)</div>
|
| 274 |
+
</div>
|
| 275 |
+
|
| 276 |
+
<img src="banner.png" alt="TemporalBenchEnv: question → answer → score across four time-series datasets" class="banner"/>
|
| 277 |
+
|
| 278 |
+
<div class="btn-group">
|
| 279 |
+
<a class="btn" href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">Live Environment Space →</a>
|
| 280 |
+
<a class="btn btn-outline" href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank">GitHub Repo</a>
|
| 281 |
+
</div>
|
| 282 |
+
|
| 283 |
+
<!-- Table of Contents -->
|
| 284 |
+
<nav class="toc" id="toc">
|
| 285 |
+
<h3>Contents</h3>
|
| 286 |
+
<ol>
|
| 287 |
+
<li><a href="#why">Multi-step TS Reasoning as a Verifiable Env</a></li>
|
| 288 |
+
<li><a href="#green-agent">Extension of Our Green Agent Submission</a></li>
|
| 289 |
+
<li><a href="#purple-harnesses">Purple Agent Harnesses (Eval Targets)</a></li>
|
| 290 |
+
<li><a href="#matters">Why This Benchmark Matters</a></li>
|
| 291 |
+
<li><a href="#prior-work">Prior Work & Novelty</a></li>
|
| 292 |
+
<li><a href="#design">What TemporalBenchEnv Is</a></li>
|
| 293 |
+
<li><a href="#env-design">Environment Design</a></li>
|
| 294 |
+
<li><a href="#openenv">Why OpenEnv</a></li>
|
| 295 |
+
<li><a href="#scoring">Scoring: Per-step + Episode Bonus</a></li>
|
| 296 |
+
<li><a href="#architecture">Architecture & Training Pipeline</a></li>
|
| 297 |
+
<li><a href="#traces">Episode Trace (Ideal, Illustrative)</a></li>
|
| 298 |
+
<li><a href="#results">Results: Baselines & Projected Targets</a></li>
|
| 299 |
+
<li><a href="#engineering">Engineering Lessons (Inherited)</a></li>
|
| 300 |
+
<li><a href="#foundations">Foundations & Citations</a></li>
|
| 301 |
+
<li><a href="#quickstart">Quick Start</a></li>
|
| 302 |
+
<li><a href="#future">Future Work</a></li>
|
| 303 |
+
</ol>
|
| 304 |
+
</nav>
|
| 305 |
+
|
| 306 |
+
<!-- 1. WHY -->
|
| 307 |
+
<section id="why">
|
| 308 |
+
<h2>Multi-step TS reasoning as a verifiable environment</h2>
|
| 309 |
+
<p>Most time-series LLM benchmarks grade a single prompt at a time. <strong>TemporalBenchEnv</strong> grades an <strong>episode</strong>: nine multiple-choice questions drawn from four time-series datasets, answered one per step, with a terminal bonus that rewards both accuracy and <strong>cross-domain coverage</strong>.</p>
|
| 310 |
+
<!-- TODO: Replace OpenReview link with arXiv link when available -->
|
| 311 |
+
<p>Every reward signal here is <strong>ground-truth arithmetic</strong>, not a judge. Labels are produced by the <a href="https://openreview.net/forum?id=rp5qCsxXca¬eId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">TS-Benchmark</a> construction pipeline (trend / volatility / seasonality / outlier thresholds; S1–S5 family rules), so the environment can score an answer with a normalized string match against the stored ground truth. There is no LLM judge in the loop.</p>
|
| 312 |
+
<p>The falsifiable hypothesis this environment is built to test: <em>whether a GRPO-trained LLM, post-trained on sequential episodes sampled from our Green Agent’s own benchmark, outperforms strong zero-shot baselines on per-domain MCQ accuracy while hitting the cross-domain coverage bonus.</em> Empirical adjudication is contingent on the training runs described under <a href="#architecture" style="color:var(--accent2)">Architecture & training pipeline</a>.</p>
|
| 313 |
+
</section>
|
| 314 |
+
|
| 315 |
+
<!-- 2. EXTENSION OF GREEN AGENT -->
|
| 316 |
+
<section id="green-agent">
|
| 317 |
+
<h2>Extension of our Melady TS Green Agent submission</h2>
|
| 318 |
+
<p>TemporalBenchEnv is a <strong>direct extension</strong> of our AgentBeats <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)"><strong>Melady TS Green Agent</strong></a> submission. The Green Agent is an A2A-protocol evaluator that grades purple agents on 764 TS-Benchmark tasks (see the <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank" style="color:var(--accent2)">Green Agent GitHub repository</a>). TemporalBenchEnv takes the <em>same datasets and task taxonomy</em> and re-exposes them as a sequential <strong>OpenEnv</strong> environment consumable by TRL’s <code>rollout_func</code> — turning the benchmark into a <strong>training target</strong>.</p>
|
| 319 |
+
<div class="table-wrap">
<table>
<thead><tr><th>Artifact</th><th>Melady TS Green Agent</th><th>TemporalBenchEnv (this submission)</th></tr></thead>
<tbody>
<tr><td><strong>Role</strong></td><td>A2A evaluator of purple agents</td><td>OpenEnv RL environment for post-training LLMs</td></tr>
<tr><td><strong>Datasets</strong></td><td>PSML, freshretailnet, MIMIC, causal_chambers</td><td>Same four datasets</td></tr>
<tr><td><strong>Tasks</strong></td><td>T1/T3 (accuracy) + T2/T4 (regression + MCQ), 764 total</td><td>MCQ subset: T1U, T3, T2_MCQ — 2,775 <code>TSQuestion</code> records</td></tr>
<tr><td><strong>Per-domain bank sizes</strong></td><td>Packaged in the Docker image</td><td>PSML 750 · freshretailnet 616 · MIMIC 709 · causal_chambers 700</td></tr>
<tr><td><strong>Protocol</strong></td><td>A2A messaging, one-shot prompts</td><td>WebSocket OpenEnv contract, sequential 9-step MDP</td></tr>
<tr><td><strong>Reward</strong></td><td>MSE / MAE / RMSE / MASE / accuracy (eval metrics)</td><td>Per-step correctness + terminal episode bonus w/ coverage multiplier</td></tr>
<tr><td><strong>Consumer</strong></td><td>AgentBeats leaderboard</td><td>TRL 1.0 <code>rollout_func</code>, vLLM colocate / server, GRPO</td></tr>
</tbody>
</table>
</div>
<p>The ETL from the Green Agent’s labeled JSONL into the per-domain <code>TSQuestion</code> banks the environment consumes lives in <code>TS-benchmark/scripts/build_temporal_bench_openenv_banks.py</code>; banks ship at <code>openenv-ts/TemporalBenchEnv/data/banks/</code> and are loaded via the <code>TEMPORALBENCH_QUESTION_BANK_DIR</code> environment variable.</p>
<blockquote>The Green Agent answered: <em>which purple agent is best at TS reasoning right now?</em> TemporalBenchEnv answers: <em>can we post-train an LLM, on that exact benchmark, to be the next best purple agent?</em></blockquote>
</section>

<!-- 2b. PURPLE AGENT HARNESSES -->
<section id="purple-harnesses">
<h2>Purple agent harnesses: evaluating mainstream TS-capable agent stacks</h2>
<p>The Green Agent scores <em>purple agents</em> over the A2A protocol — any stack that speaks A2A is a valid participant. To make the benchmark diagnostic for modern agentic time-series practice, we target four of the most popular open-source agent harnesses as purple agents on AgentBeats: <strong>two are implemented and live</strong> (AgentScope, CAMEL) and <strong>two are planned</strong> (MetaGPT, TimeSeriesScientist). Each harness stays <em>unchanged internally</em>; a thin A2A adapter feeds the Green Agent’s TS-Benchmark MCQs into the harness’s own reasoning / tool-use loop and returns a final label. These four frameworks are the de-facto “agentic harnesses” TS practitioners actually wrap around LLMs today, so they anchor the eval to real downstream usage.</p>

<div class="notice">
<strong>The feedback loop we are instrumenting.</strong> For every harness we expose a swappable backbone LLM (e.g. Qwen2.5-7B-Instruct or GPT-4o-mini). The plan is: <strong>(1)</strong> score the base backbone inside each harness via the <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">Melady TS Green Agent</a>, <strong>(2)</strong> <strong>post-train that same backbone with this OpenEnv</strong> — TemporalBenchEnv’s randomized-domain / randomized-task 9-step MCQ episodes with per-step + terminal verifiable rewards, <strong>(3)</strong> <strong>re-score every harness</strong> with the post-trained backbone in place of the base, and <strong>(4)</strong> attribute any delta specifically to the RL post-training rather than to harness architecture. The panel makes the research question concrete: <em>does randomized-domain verifiable-reward post-training on TemporalBenchEnv actually transfer to agentic TS reasoning under mainstream orchestration frameworks?</em>
</div>

<h3>Implemented — live purple agents on AgentBeats</h3>
<div class="table-wrap">
<table>
<thead><tr><th>Harness</th><th>What it is</th><th>Why it matters for TS reasoning</th><th>AgentBeats listing</th></tr></thead>
<tbody>
<tr>
<td><strong><a href="https://github.com/agentscope-ai/agentscope" target="_blank" style="color:var(--accent2)">AgentScope</a></strong><br><span style="font-size:.85em;color:var(--muted)">Gao et al., <a href="https://arxiv.org/abs/2508.16279" target="_blank" style="color:var(--accent2)">“AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications” (arXiv:2508.16279, 2025)</a>; Gao et al., <a href="https://arxiv.org/abs/2402.14034" target="_blank" style="color:var(--accent2)">“AgentScope: A Flexible yet Robust Multi-Agent Platform” (arXiv:2402.14034, 2024)</a></span></td>
<td>Production-ready ReAct agent framework with built-in tools, memory, planning, MCP / A2A interop, and an agentic-RL tuner (Trinity-RFT). Apache-2.0.</td>
<td>The most common Python-native ReAct / tool-use stack; A2A support makes the Green Agent wiring mechanical. Serves as our canonical “single-ReAct-loop” purple baseline over TS MCQs.</td>
<td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a></td>
</tr>
<tr>
<td><strong><a href="https://github.com/camel-ai/camel" target="_blank" style="color:var(--accent2)">CAMEL</a></strong><br><span style="font-size:.85em;color:var(--muted)">Li et al., <a href="https://arxiv.org/abs/2303.17760" target="_blank" style="color:var(--accent2)">“CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society” (NeurIPS 2023, arXiv:2303.17760)</a></span></td>
<td>Role-playing multi-agent framework with stateful memory, structured messages, societies / workforce pipelines, and a strong focus on <em>scaling laws of agents</em>.</td>
<td>Role-play + critic loop tests whether a trend / seasonality / outlier answer survives multi-turn discussion without drifting; complements AgentScope’s single-ReAct shape with an inter-agent communication surface.</td>
<td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a></td>
</tr>
</tbody>
</table>
</div>

<h3>Planned — purple agents under development</h3>
<div class="table-wrap">
<table>
<thead><tr><th>Harness</th><th>What it is</th><th>Why it matters for TS reasoning</th><th>Status</th></tr></thead>
<tbody>
<tr>
<td><strong><a href="https://github.com/FoundationAgents/MetaGPT" target="_blank" style="color:var(--accent2)">MetaGPT</a></strong><br><span style="font-size:.85em;color:var(--muted)">Hong et al., <a href="https://arxiv.org/abs/2308.00352" target="_blank" style="color:var(--accent2)">“MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework” (ICLR 2024, oral; arXiv:2308.00352)</a></span></td>
<td>SOP-driven multi-agent system — explicit role decomposition (product manager / architect / engineer / QA) orchestrated by a meta-programming layer. One of the most-starred multi-agent frameworks in the wild.</td>
<td>Gives us a decomposition-heavy purple: a “data analyst” + “forecaster” / “reviewer” role split answering the same MCQs, isolating whether explicit SOPs help TS reasoning vs. a flat ReAct loop.</td>
<td><strong>Planned.</strong> A2A wrapper not yet published on AgentBeats.</td>
</tr>
<tr>
<td><strong><a href="https://github.com/Y-Research-SBU/TimeSeriesScientist" target="_blank" style="color:var(--accent2)">TimeSeriesScientist (TSci)</a></strong><br><span style="font-size:.85em;color:var(--muted)">Zhao et al., <a href="https://arxiv.org/abs/2510.01538" target="_blank" style="color:var(--accent2)">“TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis” (arXiv:2510.01538, 2025)</a></span></td>
<td>Domain-specific LangGraph agent purpose-built for TS: <em>Curator → Planner → Forecaster → Reporter</em>, with statistical / ML / DL model selection and ensembling (ARIMA, Prophet, LSTM, XGBoost, Transformer, …).</td>
<td>The strongest TS-specialized agent we can find; a natural “ceiling” for agentic TS reasoning and a direct yardstick for whether an RL-post-trained generalist LLM inside a simpler harness closes the gap to a purpose-built TS pipeline.</td>
<td><strong>Planned.</strong> A2A adapter not yet published on AgentBeats.</td>
</tr>
</tbody>
</table>
</div>
<blockquote><strong>Green Agent + TemporalBenchEnv + harness panel = a closed evaluation loop.</strong> The Green Agent scores every harness on a fixed TS-Benchmark MCQ set; TemporalBenchEnv uses the <em>same datasets and task taxonomy</em> to post-train a backbone with randomized-domain / randomized-task episodes under verifiable rewards; we then drop the post-trained backbone back into AgentScope, CAMEL, MetaGPT, and TSci and re-score. A delta from the RL post-training should show up as a <em>uniform</em> lift across all four harnesses — not just in one framework, and not just in the zero-shot MCQ row of the baselines table.</blockquote>
</section>

<!-- 3. WHY IT MATTERS -->
<section id="matters">
<h2>Why this benchmark matters</h2>
<p>Time-series reasoning is one of the few areas where frontier LLMs still look visibly weak, but also one where the <strong>verifiable signal is clean</strong>: trend / volatility / seasonality / outlier labels are constructed from thresholded statistics of the series, so exact-match grading is unambiguous. That makes MCQ episodes a near-ideal RL target before touching noisier numeric forecasting rewards.</p>
<p>The design is <strong>transferable</strong>. Any benchmark that produces labeled MCQ records over a set of domains — medical diagnostics, power-grid anomaly tagging, retail demand regimes — fits the same 9-question cross-domain template. Datasets are the proxy; the capability is <strong>multi-step, multi-domain, verifiable TS reasoning</strong>.</p>
<p>Every reward component is <strong>ground-truth arithmetic</strong>. The environment samples questions from pre-built banks, scores answers via normalized string equality, and aggregates with a closed-form episode bonus. No LLM judge, no circular reward.</p>
</section>

<!-- 4. PRIOR WORK & NOVELTY -->
<section id="prior-work">
<h2>Prior work & novelty</h2>
<p>Prior “LLMs + time series” work lands in one of three buckets. None occupies the cell we target:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Prior work bucket</th><th>What it does</th><th>What it lacks</th></tr></thead>
<tbody>
<tr><td><strong>Static TS benchmarks</strong><br><span style="font-size:.85em;color:var(--muted)">TS-Benchmark (ours, <a href="https://openreview.net/forum?id=rp5qCsxXca¬eId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">OpenReview</a>); FreshRetailNet (<a href="https://arxiv.org/abs/2505.16319" target="_blank" style="color:var(--accent2)">arXiv:2505.16319</a>); PSML (<a href="https://www.nature.com/articles/s41597-022-01455-7" target="_blank" style="color:var(--accent2)">Nat. Sci. Data 2022</a>); MIMIC-IV (<a href="https://www.nature.com/articles/s41597-022-01899-x" target="_blank" style="color:var(--accent2)">Nat. Sci. Data 2022</a>); Causal Chambers (<a href="https://www.nature.com/articles/s42256-024-00964-x" target="_blank" style="color:var(--accent2)">Nat. MI 2024</a>)</span></td><td>Construct labeled MCQ / regression tasks over TS datasets, graded by fixed rules</td><td>No RL-native environment contract, no sequential episodes, no post-training loop</td></tr>
<tr><td><strong>TS-LLM composite RL rewards</strong><br><span style="font-size:.85em;color:var(--muted)">TimeMaster (<a href="https://arxiv.org/abs/2506.13705" target="_blank" style="color:var(--accent2)">arXiv:2506.13705</a>); COUNTS (<a href="https://arxiv.org/abs/2510.01116" target="_blank" style="color:var(--accent2)">arXiv:2510.01116</a>); SenTSR-Bench (<a href="https://arxiv.org/abs/2602.19455" target="_blank" style="color:var(--accent2)">arXiv:2602.19455</a>)</span></td><td>Composite reward shaping (MSE + DTW + direction + quantile + ArcTan) for TS multimodal LLMs</td><td>No OpenEnv contract, no sequential MCQ episodes, reward is forecasting-centric</td></tr>
<tr><td><strong>A2A evaluators</strong><br><span style="font-size:.85em;color:var(--muted)">Melady TS Green Agent (<a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">AgentBeats</a>); other AgentBeats green agents</span></td><td>Score deployed purple agents on the benchmark through the A2A protocol</td><td>Not a training environment; no per-step RL signal, no environment state</td></tr>
<tr class="novel"><td><strong>TemporalBenchEnv (ours)</strong></td><td>Sequential 9-step MCQ MDP over four TS datasets, structured Pydantic actions, terminal reward from ground-truth labels + coverage multiplier, OpenEnv + GRPO contract</td><td>Does not (yet) train numeric forecasting — T2/T4 reward is stubbed for future work</td></tr>
</tbody>
</table>
</div>
<blockquote>To our knowledge, no prior work exposes TS-Benchmark’s multi-dataset MCQ suite as an <strong>OpenEnv-native sequential MDP with verifiable terminal rewards</strong> suitable for GRPO post-training. The action space, reward decomposition, and sibling-env architecture follow the <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> lineage; the domain and task taxonomy follow our own Melady TS Green Agent.</blockquote>
</section>

<!-- 5. WHAT IT IS -->
<section id="design">
<h2>What TemporalBenchEnv is</h2>
<blockquote>An OpenEnv-native sequential MDP in which an LLM agent answers nine MCQ questions — six from a primary domain and one each from the other three — earning a per-step correctness reward plus a terminal bonus that rewards cross-domain coverage.</blockquote>
<p>Each episode proceeds like this:</p>
<ul>
<li><code>reset()</code> samples nine <code>TSQuestion</code> records: <strong>6 from the primary domain</strong> (default PSML), with T3 families round-robined for diversity, and <strong>1 from each of the three non-primary domains</strong>. Final order is shuffled.</li>
<li>On turn <em>t</em>, the agent emits a <code>TemporalBenchAction</code> containing the MCQ label, plus optional <code>confidence</code> and <code>reasoning</code> fields.</li>
<li>The environment <strong>grades</strong> the answer via normalized string equality against <code>question.answer</code> (also accepting an option whose normalized text matches the ground truth), and returns the next question.</li>
<li>Per-step reward is <code>alpha * correctness</code>. Mid-episode bonuses are zero.</li>
<li>On the final step, the environment adds the <strong>terminal episode bonus</strong>: <code>lambda_ep * (total_correct / N) * coverage_multiplier</code>, where the multiplier is <code>1.0</code> if every one of the four domains contributed at least one correct answer, else <code>0.8</code>.</li>
</ul>
<p>The agent’s interface is deliberately minimal: a single answer string per step, no tool-call protocol. Optional <code>confidence</code> and <code>reasoning</code> fields exist on the action for future reward shaping.</p>
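<p>A minimal sketch of the grading rule used for each answer, assuming plain normalized string equality; the function names and normalization details are illustrative rather than the exact <code>env/</code> implementation, which additionally accepts an option whose normalized text matches the ground truth:</p>

```python
import re

def normalize(text: str) -> str:
    # Lowercase, trim, and collapse internal whitespace before comparing.
    return re.sub(r"\s+", " ", text.strip().lower())

def grade_answer(submitted: str, ground_truth: str) -> bool:
    # Correct iff the normalized submission equals the normalized label.
    return normalize(submitted) == normalize(ground_truth)
```

Because grading is exact-match arithmetic, the per-step reward needs no judge model of any kind.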
</section>

<!-- 6. ENVIRONMENT DESIGN -->
<section id="env-design">
<h2>Environment design</h2>
<p>The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (see <code>openenv-ts/TemporalBenchEnv/env/models.py</code>):</p>
<pre><code><span class="c"># Action (agent → env)</span>
class TemporalBenchAction(Action):
    answer: str                   <span class="c"># MCQ label matching an option</span>
    confidence: Optional[float]   <span class="c"># in [0, 1], unused in reward for now</span>
    reasoning: Optional[str]      <span class="c"># optional CoT, unused in reward for now</span>

<span class="c"># Observation (env → agent)</span>
class TemporalBenchObservation(Observation):
    step_idx, steps_remaining, max_steps: int
    question: str                 <span class="c"># current MCQ prompt</span>
    options: list[str]            <span class="c"># 2+ answer choices</span>
    task_type: str                <span class="c"># "T1U" | "T3" | "T2_MCQ"</span>
    dataset: str                  <span class="c"># "PSML" | "freshretailnet" | "MIMIC" | "causal_chambers"</span>
    history: list[dict]           <span class="c"># [{question, answer, correct, dataset, ...}, ...]</span>
    accuracy_so_far: float
    done: bool; reward: Optional[float]; metadata: dict

<span class="c"># State (serializable snapshot)</span>
class TemporalBenchState(State):
    episode_id: Optional[str]
    step_count, total_correct, total_questions: int
    current_accuracy: float
    primary_domain: str
    per_task_type_accuracy: dict[str, float]
    total_reward: float</code></pre>
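<p>For intuition, the action contract above can be mirrored with a plain dataclass. This is a simplified stand-in for the Pydantic model, not the shipped type; it enforces the <code>[0, 1]</code> bound on <code>confidence</code> that Pydantic would express as a field constraint:</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemporalBenchAction:
    # Simplified stand-in for the Pydantic Action; same three fields.
    answer: str
    confidence: Optional[float] = None
    reasoning: Optional[str] = None

    def __post_init__(self) -> None:
        # Pydantic would enforce this bound with a field validator.
        if self.confidence is not None and not (0.0 <= self.confidence <= 1.0):
            raise ValueError("confidence must lie in [0, 1]")
```

The `answer` string is the only field the reward currently reads; `confidence` and `reasoning` travel with the action for future shaping.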
<p>Four datasets, three MCQ task types, and a three-stage curriculum shape the training distribution (see <code>env/config.py</code> and <code>env/episode_sampler.py</code>):</p>
<div class="table-wrap">
<table>
<thead><tr><th>Stage</th><th>Allowed task types</th><th>Purpose</th></tr></thead>
<tbody>
<tr><td><strong>Stage 1</strong></td><td><code>T1U</code> only (non-contextual understanding MCQ)</td><td>Shorten credit assignment; learn trend / volatility / seasonality / outliers first</td></tr>
<tr><td><strong>Stage 2</strong></td><td><code>T1U</code> + <code>T3</code> (contextual understanding, S1–S5 families)</td><td>Add context-conditioned reasoning; maintain verifiable labels</td></tr>
<tr><td><strong>Stage 3</strong></td><td><code>T1U</code> + <code>T3</code> + <code>T2_MCQ</code> (prediction-as-classification)</td><td>Full MCQ track; adds direction-of-change / volatility-change / seasonality-alignment</td></tr>
</tbody>
</table>
</div>
<p>Curriculum is honored both at <code>EnvConfig(curriculum_stage=...)</code> construction and at <code>env.reset(curriculum_stage=...)</code>, so a single server can serve multiple stages to different sessions concurrently.</p>
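<p>The stage gating can be sketched as a filter over a question bank. The mapping mirrors the curriculum table above; the names are illustrative, not the actual <code>env/config.py</code> API:</p>

```python
# Stage -> allowed task types, as in the curriculum table above.
CURRICULUM = {
    1: {"T1U"},
    2: {"T1U", "T3"},
    3: {"T1U", "T3", "T2_MCQ"},
}

def filter_bank(bank: list[dict], stage: int) -> list[dict]:
    # Keep only questions whose task_type is unlocked at this stage.
    allowed = CURRICULUM[stage]
    return [q for q in bank if q["task_type"] in allowed]
```

Because the filter is stateless per call, passing a different `stage` at each `reset()` is what lets one server serve multiple curriculum stages concurrently.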
</section>

<!-- 7. WHY OPENENV -->
<section id="openenv">
<h2>Why OpenEnv</h2>
<p>OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL’s <code>rollout_func</code>, (2) per-session state with <code>max_concurrent_envs=64</code> in our <code>create_app</code> factory — each WebSocket session gets a fresh <code>TemporalBenchEnvironment</code> via <code>_env_factory</code> so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development (<code>server/Dockerfile</code>, with <code>TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks</code>), and as a Hugging Face Space during training and evaluation.</p>
<p>No new abstractions were invented. Base types only: <code>EnvClient</code>, <code>Environment</code>, Pydantic <code>Action</code> / <code>Observation</code> / <code>State</code>. All extensions (history, per-domain coverage, per-task accuracy) ride on <code>metadata</code> or the serialized state. No new method signatures, no fork. The env ships with <code>openenv.yaml</code> and a <code>Dockerfile</code>, and passes <code>uv run openenv validate</code>.</p>
<p class="math-note"><strong>Hygiene note:</strong> OpenEnv’s CLI validator does a naive substring check for <code>main()</code> in <code>server/app.py</code>. We match the reference pattern from <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> — an explicit <code>main()</code> call under <code>if __name__ == "__main__"</code> with CLI flags parsed via <code>parse_known_args</code>.</p>
</section>

<!-- 8. SCORING -->
<section id="scoring">
<h2>Scoring: per-step correctness + episode bonus</h2>
<p>Reward is decomposed into a per-step term and a terminal bonus (see <code>env/reward.py</code>).</p>
<div class="math-display" aria-label="Per-step correctness reward">
\[
r_t \;=\; \alpha \cdot \mathbf{1}\!\left[\hat a_t = a_t^\ast\right]
\]
</div>
<p class="math-note">Here \(\hat a_t\) is the agent’s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches).</p>
<p class="math-note">Here \(\hat a_t\) is the agent’s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches).</p>
<div class="math-display" aria-label="Terminal episode bonus with coverage multiplier">
\[
B \;=\; \lambda_{\mathrm{ep}} \,\cdot\, \frac{C}{N} \,\cdot\, m,
\qquad
m \;=\; \begin{cases} 1.0 & \text{all 4 domains have } \ge 1 \text{ correct} \\ 0.8 & \text{otherwise} \end{cases}
\]
\[
R \;=\; \sum_{t=1}^{N} r_t \;+\; B
\]
</div>
<p class="math-note">\(C\) is the total correct count in the episode, \(N = 9\) is the episode length, \(\alpha, \lambda_{\mathrm{ep}}\) are <code>alpha</code> and <code>lambda_ep</code> in <code>EnvConfig</code>.</p>
<p>Defaults live in <code>EnvConfig</code>:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Component</th><th>Weight</th><th>What it rewards</th></tr></thead>
<tbody>
<tr><td><strong>Per-step correctness</strong></td><td class="num">α = 1.0</td><td>Normalized-string match against the MCQ ground truth</td></tr>
<tr><td><strong>Episode bonus weight</strong></td><td class="num">λ<sub>ep</sub> = 0.5</td><td>Scales the terminal accuracy×coverage term</td></tr>
<tr><td><strong>Coverage multiplier</strong></td><td class="num">{0.8, 1.0}</td><td>1.0 iff every domain in <code>EnvConfig.all_domains</code> has ≥1 correct answer this episode</td></tr>
<tr><td><strong>Forecasting reward</strong></td><td class="num">—</td><td>Stubbed (<code>compute_forecasting_reward</code> raises <code>NotImplementedError</code>); future work</td></tr>
</tbody>
</table>
</div>
<blockquote><strong>Why a coverage multiplier:</strong> per-step accuracy alone lets the agent ace the six primary questions while guessing on the other three domains, collapsing to a single-domain policy. The <code>0.8</code> penalty forces the policy to treat the three cross-domain questions as first-class signal — the very thing that distinguishes a TS-generalist from a PSML-only memorizer.</blockquote>
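<p>The full decomposition is small enough to write out. The sketch below follows the formulas above (function names are illustrative, not the exact <code>env/reward.py</code> signatures) and reproduces the coverage contrast just discussed:</p>

```python
def step_reward(correct: bool, alpha: float = 1.0) -> float:
    # r_t = alpha * 1[answer == ground truth]
    return alpha if correct else 0.0

def episode_bonus(total_correct: int, n: int, covered: set, all_domains: set,
                  lambda_ep: float = 0.5) -> float:
    # B = lambda_ep * (C / N) * m, with m = 1.0 on full domain coverage, else 0.8.
    m = 1.0 if all_domains <= covered else 0.8
    return lambda_ep * (total_correct / n) * m

DOMAINS = {"PSML", "freshretailnet", "MIMIC", "causal_chambers"}
full = episode_bonus(7, 9, DOMAINS, DOMAINS)                 # full coverage
partial = episode_bonus(7, 9, DOMAINS - {"MIMIC"}, DOMAINS)  # one domain missed
```

<p>With 7 of 9 correct, full coverage gives <code>B ≈ 0.389</code>, while missing one domain drops it to <code>B ≈ 0.311</code> — the 0.8 multiplier at work.</p>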
</section>

<!-- 9. ARCHITECTURE & TRAINING PIPELINE -->
<section id="architecture">
<h2>Architecture & training pipeline</h2>
<p>Following the <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> / <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank" style="color:var(--accent2)">ReasoningEconomicsEnv</a> lineage, <strong>TemporalBenchEnv</strong> (this page) ships the OpenEnv side: reset, step, rewards, and question banks over WebSocket. A separate trainer process — same env/trainer separation as <code>LotteryElicitationEnv</code> and <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a> — would drive GRPO with TRL’s <code>rollout_func</code> and vLLM against that socket, without in-process imports of env-side types.</p>
<div class="notice">
<strong>Training client status.</strong> The companion GRPO / TRL package that runs rollouts against this environment is under <strong>active internal testing</strong>. We will release it on <strong>GitHub</strong> once we secure <strong>additional compute</strong> so we can stress-test and validate models at a scale we consider reasonable. This blog documents the shipped <strong>TemporalBenchEnv</strong> only until that release.
</div>
<blockquote><strong>Purple agents on AgentBeats.</strong> The same TS-Benchmark task surface is also exercised by <strong>purple agents</strong> scored through the A2A green agent: live listings are <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>. Planned additions are documented in <a href="#purple-harnesses" style="color:var(--accent2)">§ Purple agent harnesses</a>.</blockquote>
<div class="mermaid-wrap">
<pre class="mermaid">
flowchart LR
  subgraph TRN ["Trainer (TRL + GRPO, WebSocket client)"]
    GRPO["GRPOTrainer<br/>TRL 1.0"]
    RF["rollout_func"]
    VLLM["vLLM<br/>colocate/server"]
    PARSE["action_parser<br/>MCQ label guardrails"]
  end
  subgraph ENV ["TemporalBenchEnv (OpenEnv)"]
    WS["FastAPI<br/>WebSocket"]
    SAMP["EpisodeSampler<br/>6+1+1+1 stratified"]
    GRADE["grade_answer<br/>normalized match"]
    REW["Reward<br/>per-step + bonus + coverage"]
  end

  GRPO --> RF
  RF --> VLLM
  VLLM -->|"generate"| PARSE
  PARSE -->|"answer string"| WS
  WS --> SAMP
  SAMP -->|"next question"| WS
  WS --> GRADE
  GRADE --> REW
  REW -->|"step + terminal reward"| WS
  WS -->|"observation"| RF
</pre>
<p class="mermaid-caption">Figure 1. System architecture. The trainer never imports env-side types — everything crosses the WebSocket, exactly like our Lottery / Reasoning sibling envs.</p>
</div>
<p>Training uses <strong>GRPO</strong> (Group Relative Policy Optimization), which is a natural fit for per-step verifiable rewards with an additive terminal bonus. The training scaffolding is directly inherited from <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a>: TRL 1.0’s <code>rollout_func</code> contract, vLLM colocate/server, chat-template tokenization with <code>enable_thinking=False</code>, think-block stripping, null-safe MCQ-label parsing, and episode logging to <code>reward_logs.jsonl</code>.</p>
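<p>For concreteness, here is the shape of the episode loop a rollout driver runs against the socket. <code>FakeClient</code> is a scripted stand-in invented for this sketch; the real client speaks the WebSocket contract described above, and this is not the TRL <code>rollout_func</code> code:</p>

```python
class FakeClient:
    """Scripted two-question stand-in for the WebSocket env client."""
    def __init__(self):
        self._qs = [
            {"question": "trend?", "options": ["upward", "downward"], "answer": "upward"},
            {"question": "seasonality?", "options": ["fixed", "shifting"], "answer": "fixed"},
        ]
        self._i = 0

    def reset(self) -> dict:
        self._i = 0
        return {**self._qs[0], "done": False, "reward": None}

    def step(self, action: dict) -> dict:
        correct = action["answer"] == self._qs[self._i]["answer"]
        self._i += 1
        done = self._i >= len(self._qs)
        obs = {"done": done, "reward": 1.0 if correct else 0.0}
        if not done:
            obs.update(self._qs[self._i])
        return obs

def run_episode(client, policy) -> float:
    # Drive one episode: observe, answer, accumulate step + terminal rewards.
    obs, total = client.reset(), 0.0
    while not obs["done"]:
        answer = policy(obs["question"], obs["options"])
        obs = client.step({"answer": answer})
        if obs["reward"] is not None:
            total += obs["reward"]
    return total
```

In the real pipeline the same loop runs once per GRPO group member, with vLLM supplying `policy` and the environment's terminal observation carrying the episode bonus inside `reward`.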
</section>

<!-- 10. EPISODE TRACE (IDEAL, ILLUSTRATIVE) -->
<section id="traces">
<h2>Episode trace (ideal, illustrative)</h2>
<p>Here is what a high-reward episode <em>would</em> look like — five representative steps out of nine, spanning all four domains. Primary domain is PSML; the trace shows the three cross-domain picks plus two primary turns, culminating in the terminal step where the coverage multiplier decides the shape of the bonus. <strong>Turns 2, 4, 6, 8 are elided</strong> (all PSML T1U, assumed correct for the 7 / 9 total) so the walkthrough stays focused on the cross-domain structure.</p>
<div class="notice">
<strong>Illustrative, not captured.</strong> This is a hand-constructed walkthrough intended to explain the per-step reward, the 6 + 3 domain split, and the terminal coverage multiplier — <em>not</em> a real rollout from a trained (or even zero-shot) policy against the live environment. The prompts and agent answers below are author-written. The reward arithmetic (<code>r_t</code>, <code>B</code>, <code>R</code>) is computed exactly as <code>env/reward.py</code> would compute it for the stated correctness pattern. A captured trace from a real model will replace this once the training client is released and run at scale (see <a href="#architecture" style="color:var(--accent2)">Architecture & training pipeline</a>).
</div>
<div class="episode-trace">
|
| 574 |
+
<div class="trace-step">
|
| 575 |
+
<div class="step-marker good"></div>
|
| 576 |
+
<div class="step-label">Turn 1 · dataset=PSML · task=T1U:trend</div>
|
| 577 |
+
<div class="step-content">
|
| 578 |
+
<strong>Prompt:</strong> <em>“Based on the array (length=336), report trend: upward / downward / constant.”</em><br>
Series rises monotonically over the tail window.<br>
<strong>Agent answer:</strong> <code>"upward"</code> → <strong>correct</strong> (<code>r_1 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 3 · dataset=freshretailnet · task=T1U:seasonality</div>
<div class="step-content">
<strong>Prompt:</strong> daily demand for a fresh-retail SKU, peaks repeat every 7 steps with stable amplitude.<br>
<strong>Agent answer:</strong> <code>"fixed"</code> → <strong>correct</strong> (<code>r_3 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 5 · dataset=MIMIC · task=T1U:outliers</div>
<div class="step-content">
<strong>Prompt:</strong> ICU vital trace with a single spike above <code>q95 + 3·MAD</code>.<br>
<strong>Agent answer:</strong> <code>"sudden_spike"</code> → <strong>correct</strong> (<code>r_5 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 7 · dataset=causal_chambers · task=T3:S2</div>
<div class="step-content">
<strong>Prompt:</strong> wind-chamber actuator trace, contextual question on lagged response to a step input (capability <code>C4</code>).<br>
<strong>Agent answer:</strong> <code>"delayed_response"</code> → <strong>correct</strong> (<code>r_7 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 9 · dataset=PSML · task=T2_MCQ (terminal)</div>
<div class="step-content">
<strong>Prompt:</strong> “Median demand level change (forecast horizon vs history)?”<br>
<strong>Agent answer:</strong> <code>"Higher"</code> → <strong>correct</strong> (<code>r_9 = 1.0</code>)<br>
<strong>Episode totals:</strong> <code>C = 7 / 9</code> correct, all four domains covered → <code>m = 1.0</code>.<br>
<strong>Terminal bonus:</strong> <code>B = 0.5 · (7/9) · 1.0 ≈ 0.389</code>.<br>
<strong>Total return:</strong> <code>R ≈ 7.389</code>.
</div>
</div>
<div class="trace-verdict good">
Four-of-four coverage → full <strong>m = 1.0</strong> multiplier. Contrast: the same 7 / 9 accuracy with a missed cross-domain question (say, zero correct in MIMIC) would give <code>B ≈ 0.311</code> and <code>R ≈ 7.311</code>. The coverage term is the whole reason a PSML-only policy loses to a generalist.
</div>
</div>
</section>

<!-- 11. EVALUATION PROTOCOL & PROJECTED TARGETS -->
<section id="results">
<h2>Evaluation protocol & projected targets</h2>

<div class="notice">
<strong>No trained-policy numbers yet.</strong> Available compute was exhausted before GRPO could be run to convergence and validated at a scale we consider meaningful. The “Trained HF policy” column therefore stays blank, and every number in the “Projected targets” table below is an <em>analytical</em> projection from the environment’s structure, not a measurement. GRPO runs against this Space will populate the empty column once the public training client is released and run at scale — see the canonical training-status notice under <a href="#architecture">Architecture & training pipeline</a>. No fabricated telemetry is shown.
</div>

<blockquote><strong>Purple baselines (AgentBeats).</strong> For deployed purple policies on the same benchmark lineage, see <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</blockquote>

<h3>Baselines</h3>
<p>The environment ships two baseline policies out of the box (random and majority-class) plus an eval harness for zero-shot and trained LLMs. All of them run against the live OpenEnv WebSocket, so their numbers are directly comparable with the trained policy's.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Baseline</th><th>Policy</th><th>What it isolates</th></tr></thead>
<tbody>
<tr><td><strong>Random MCQ</strong></td><td>Uniform sample over <code>observation.options</code></td><td>Lower bound; beats zero only if options are imbalanced</td></tr>
<tr><td><strong>Majority-class</strong></td><td>Always pick the per-<code>task_type</code> modal label from the bank</td><td>Isolates how much accuracy is available from priors alone</td></tr>
<tr><td><strong>Zero-shot API LLM</strong></td><td>GPT / Claude / Gemini via the eval harness</td><td>Strong “off-the-shelf” ceiling before any post-training</td></tr>
<tr><td><strong>Zero-shot local LLM</strong></td><td>Qwen2.5-7B-Instruct served via vLLM</td><td>Planned backbone for GRPO fine-tuning against this environment once training runs land</td></tr>
<tr><td><strong>Trained HF policy</strong></td><td>GRPO checkpoint from rollouts on this OpenEnv</td><td>Tests whether post-training on TS-Benchmark episodes beats zero-shot</td></tr>
</tbody>
</table>
</div>

<h3>Projected targets</h3>
<p>The numbers below are <strong>analytical projections</strong> from the environment’s structure, not empirical results. They exist to anchor what “good” looks like once training runs land.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Metric</th><th>Random MCQ</th><th>Zero-shot LLM (expected)</th><th>Target for trained policy</th></tr></thead>
<tbody>
<tr><td>Per-step accuracy (T1U & T3, 3–4 options)</td><td class="num">≈ 0.25–0.33</td><td class="num">≫ 0.33</td><td class="num">≥ strong zero-shot</td></tr>
<tr><td>Per-step accuracy (T2_MCQ, 4 options)</td><td class="num">≈ 0.25</td><td class="num">> random</td><td class="num">≥ strong zero-shot</td></tr>
<tr><td>Coverage multiplier <code>m</code></td><td class="num">usually 0.8</td><td class="num">0.8–1.0</td><td class="num">1.0 consistently</td></tr>
<tr><td>Episode bonus <code>B</code> (λ<sub>ep</sub>=0.5, N=9)</td><td class="num">≈ 0.10</td><td class="num">0.20–0.35</td><td class="num">≥ 0.40</td></tr>
</tbody>
</table>
</div>
<h3>Current status</h3>
<p>What is already in place:</p>
<ul>
<li><strong>Full environment</strong>: <code>reset</code> / <code>step</code> / <code>state</code>, nine-question stratified sampling across four domains, per-step + terminal reward with coverage multiplier.</li>
<li><strong>Production banks</strong>: 2,775 <code>TSQuestion</code> records built from <code>TS-benchmark/task_merged_dev_with_labels_tiers.jsonl</code> via <code>TS-benchmark/scripts/build_temporal_bench_openenv_banks.py</code> and vendored under <code>openenv-ts/TemporalBenchEnv/data/banks/</code> (PSML 750 / freshretailnet 616 / MIMIC 709 / causal_chambers 700).</li>
<li><strong>Deployment path</strong>: <code>openenv.yaml</code>, Docker image (<code>server/Dockerfile</code>, <code>TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks</code>), <code>openenv validate</code> passes, <code>openenv push</code> ready for HF Space.</li>
<li><strong>Training scaffolding</strong>: the intended path mirrors <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a> (GRPO + TRL 1.0 <code>rollout_func</code> + vLLM). Training-client release status: see <a href="#architecture">Architecture</a>.</li>
<li><strong>Purple agents (AgentBeats)</strong>: live eval targets include <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>. Planned additions: see <a href="#purple-harnesses">§ Purple agent harnesses</a>.</li>
</ul>
</section>

<!-- 12. ENGINEERING LESSONS (INHERITED) -->
<section id="engineering">
<h2>Engineering lessons (inherited)</h2>
<p>Because TemporalBenchEnv intentionally rides on the same deployment + training pattern as <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> and <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank" style="color:var(--accent2)">ReasoningEconomicsEnv</a>, most of the hard infra lessons are <em>inherited</em> rather than rediscovered. We document them briefly so the next submitter does not need to learn them from scratch.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Issue</th><th>Root cause</th><th>Fix (inherited)</th></tr></thead>
<tbody>
<tr>
<td><strong>NCCL desync</strong> under variable-length episodes</td>
<td>In <code>vllm_mode=server</code>, different DDP ranks make different numbers of <code>generate()</code> calls per episode → sequence-numbered NCCL collectives go out of sync.</td>
<td>Fixed-count <code>generate()</code> padding per episode; dummy generates discarded via <code>_temporary_vllm_max_tokens(..., 1)</code>. Gated on <code>world_size > 1</code>.</td>
</tr>
<tr>
<td><strong><code>max_completion_length</code></strong> drift over multi-turn rollouts</td>
<td>The rollout function appends per-turn generations + observation suffixes each step; a 9-turn MCQ episode can easily exceed the nominal completion budget.</td>
<td>Hard-cap <code>completion_ids</code> to <code>max_completion_length</code>; <code>PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True</code>.</td>
</tr>
<tr>
<td><strong>Null-safe MCQ parsing</strong></td>
<td>LLMs sometimes emit <code>{"answer": null}</code> or malformed strings; one rank crashing kills all DDP ranks via gloo cascade.</td>
<td>Inherited <code>_safe_float</code> / <code>_safe_int</code> / null-string fallbacks; fallback action rather than crash. The same regression pattern from <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)">LotteryElicitationPT</a> applies directly to MCQ labels.</td>
</tr>
<tr>
<td><strong><code>openenv validate</code> hygiene</strong></td>
<td>OpenEnv’s CLI does a naive substring check for <code>main()</code> in <code>server/app.py</code>; an entrypoint like <code>main(port=args.port)</code> fails validation.</td>
<td>Match the Lottery pattern: <code>if __name__ == "__main__": main()</code>, with flags parsed via <code>parse_known_args</code> inside <code>main()</code>.</td>
</tr>
</tbody>
</table>
</div>
</section>

<!-- 13. FOUNDATIONS & CITATIONS -->
<section id="foundations">
<h2>Foundations & citations</h2>
<div class="table-wrap">
<table>
<thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead>
<tbody>
<tr><td><strong>TS-Benchmark (ours)</strong></td><td>Task taxonomy (T1/T2/T3/T4), per-dataset label construction, MCQ question shape</td><td><a href="https://openreview.net/forum?id=rp5qCsxXca&amp;noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">OpenReview</a> · in-repo: <code>TS-benchmark/TS-Benchmark.md</code></td></tr>
<tr><td><strong>Melady TS Green Agent</strong></td><td>A2A evaluator whose task set TemporalBenchEnv re-exposes as an OpenEnv environment</td><td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">AgentBeats</a> · <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank" style="color:var(--accent2)">GitHub repository</a></td></tr>
<tr><td><strong>AgentScope</strong></td><td>Purple-agent harness — ReAct / MCP / A2A framework wrapping the backbone we post-train (<a href="#purple-harnesses">live on AgentBeats</a>)</td><td>Gao et al., <a href="https://arxiv.org/abs/2508.16279" target="_blank" style="color:var(--accent2)">arXiv:2508.16279</a> (2025) · <a href="https://arxiv.org/abs/2402.14034" target="_blank" style="color:var(--accent2)">arXiv:2402.14034</a> (2024) · <a href="https://github.com/agentscope-ai/agentscope" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>CAMEL</strong></td><td>Purple-agent harness — role-playing multi-agent society, CAMEL-backed baseline on AgentBeats</td><td>Li et al., <a href="https://arxiv.org/abs/2303.17760" target="_blank" style="color:var(--accent2)">NeurIPS 2023 (arXiv:2303.17760)</a> · <a href="https://github.com/camel-ai/camel" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>MetaGPT</strong></td><td>Purple-agent harness (fast-follow-up) — SOP-driven multi-agent system, role-decomposition purple</td><td>Hong et al., <a href="https://arxiv.org/abs/2308.00352" target="_blank" style="color:var(--accent2)">ICLR 2024, oral (arXiv:2308.00352)</a> · <a href="https://github.com/FoundationAgents/MetaGPT" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>TimeSeriesScientist (TSci)</strong></td><td>Purple-agent harness (fast-follow-up) — TS-specialized Curator/Planner/Forecaster/Reporter agent; “ceiling” for agentic TS reasoning</td><td>Zhao et al., <a href="https://arxiv.org/abs/2510.01538" target="_blank" style="color:var(--accent2)">arXiv:2510.01538</a> (2025) · <a href="https://github.com/Y-Research-SBU/TimeSeriesScientist" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>FreshRetailNet-50K</strong></td><td>Retail demand dataset; T1/T2/T3/T4 MCQ questions</td><td>Ding et al., <a href="https://arxiv.org/abs/2505.16319" target="_blank" style="color:var(--accent2)">arXiv:2505.16319</a>, 2025</td></tr>
<tr><td><strong>PSML</strong></td><td>Power-system load dataset; primary domain for the default episode</td><td><a href="https://www.nature.com/articles/s41597-022-01455-7" target="_blank" style="color:var(--accent2)">Nature Sci. Data 2022</a></td></tr>
<tr><td><strong>MIMIC-IV</strong></td><td>ICU/EHR time-series dataset; medical-domain MCQ</td><td><a href="https://www.nature.com/articles/s41597-022-01899-x" target="_blank" style="color:var(--accent2)">Nature Sci. Data 2022</a></td></tr>
<tr><td><strong>Causal Chambers</strong></td><td>Physical-testbed TS dataset; contextual T3 and wind-chamber T1U</td><td><a href="https://www.nature.com/articles/s42256-024-00964-x" target="_blank" style="color:var(--accent2)">Nature MI 2024</a></td></tr>
<tr><td><strong>TimeMaster / COUNTS / SenTSR-Bench</strong></td><td>Prior art on composite RL rewards for TS-LLMs; motivates T2/T4 forecasting future work</td><td><a href="https://arxiv.org/abs/2506.13705" target="_blank" style="color:var(--accent2)">TimeMaster (arXiv:2506.13705)</a> · <a href="https://arxiv.org/abs/2510.01116" target="_blank" style="color:var(--accent2)">COUNTS (arXiv:2510.01116)</a> · <a href="https://arxiv.org/abs/2602.19455" target="_blank" style="color:var(--accent2)">SenTSR-Bench (arXiv:2602.19455)</a></td></tr>
<tr><td><strong>OpenEnv</strong></td><td>Gym-style reset/step, WebSocket transport, HF Space deployment</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr>
<tr><td><strong>TRL + GRPO</strong></td><td>GRPOTrainer, custom <code>rollout_func</code>, remote env rollouts</td><td>Shao et al., <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a> (DeepSeekMath) · <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL × OpenEnv</a></td></tr>
<tr><td><strong>LotteryElicitationEnv / PT</strong></td><td>Sibling project — structural template for env / PT split, <code>rollout_func</code>, DDP padding, validation hygiene</td><td><a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">HF Space (Env)</a> · <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)">GitHub (PT)</a></td></tr>
</tbody>
</table>
</div>
</section>

<!-- 14. QUICK START -->
<section id="quickstart">
<h2>Quick start</h2>
<pre><code><span class="c"># 1. Install and run the env locally</span>
cd openenv-ts/TemporalBenchEnv
uv sync --extra dev
uv run pytest tests/ -q
uv run openenv validate

<span class="c"># 2. Run the server (uvicorn, port 8000)</span>
uv run server
<span class="c"># or</span>
uvicorn server.app:app --reload

<span class="c"># 3. Build & run the Docker image</span>
docker build -t temporalbenchenv:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 \
  -e TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks \
  temporalbenchenv:latest

<span class="c"># 4. Or pull / push a HF Space</span>
export ENV_BASE_URL="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv"
openenv push <span class="c"># from the TemporalBenchEnv/ directory</span>

<span class="c"># 5. Minimal client usage</span>
python - &lt;&lt;'PY'
from client import TemporalBenchAction, TemporalBenchEnvClient
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    out = env.reset()
    while not out.done:
        q = out.observation
        out = env.step(TemporalBenchAction(answer=q.options[0]))
    print("total reward:", out.observation.reward)
PY

<span class="c"># 6. GRPO / TRL training against this Space (training client under test; not on GitHub yet)</span>
<span class="c"># Release is planned once additional compute is available for model validation.</span></code></pre>
<p><strong>Purple track (AgentBeats):</strong> <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> · <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</p>
<p>Banks are reproducible from <code>(env_seed, curriculum_stage, primary_domain)</code>. No external fixtures, no live API, no human labels — the same flat tiered JSONL that drives the Melady TS Green Agent drives this environment.</p>
</section>

<div class="callout">
<div class="q">Can an LLM post-trained on our Green Agent’s own benchmark outperform zero-shot baselines at cross-domain TS reasoning?</div>
<div class="sub">The environment is built and deployed. Training-client release and empirical GRPO numbers are contingent on compute availability — see <a href="#architecture" style="color:var(--accent2)">Architecture</a>.</div>
<div class="sub" style="margin-top:.85rem;">On AgentBeats, compare purple baselines such as <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</div>
</div>

<!-- 15. FUTURE WORK -->
<section id="future">
<h2>Future work</h2>
<ul>
<li><strong>Run GRPO to convergence</strong> on the Stage 1 → Stage 2 → Stage 3 curriculum against this environment and fill in the “Trained HF policy” column of the baselines table (blocked on releasing the training client; see Architecture). Relate outcomes to AgentBeats purple baselines (<a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">AgentScope</a> / <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">CAMEL</a>) and, once landed, the planned MetaGPT / TSci harnesses.</li>
<li><strong>T2 / T4 forecasting reward</strong> — replace the <code>compute_forecasting_reward</code> stub with the composite sketched in our v1 plan (normalized MSE + DTW-shape + direction + quantile + ArcTan-smoothed MAE). Opens T2 numeric and T4 contextual forecasting tasks without touching the step loop.</li>
<li><strong>Confidence-calibration reward</strong> — <code>TemporalBenchAction.confidence</code> already exists; shape rewards around Brier-style calibration so the policy is incentivized to <em>know when it knows</em>.</li>
<li><strong>CoT / reasoning reward</strong> — <code>TemporalBenchAction.reasoning</code> is captured per step but currently unscored; a light format-plus-consistency shaping mirrors the Lottery format-weight technique.</li>
<li><strong>Token-budget curriculum</strong> (ReasoningEconomics-style) — constrain per-episode reasoning length to force terseness at higher stages.</li>
<li><strong>Soft 60 / 40 task-type mix</strong> — enforce the v1-plan ratio of understanding (T1U+T3) to prediction (T2_MCQ) explicitly in <code>EpisodeSampler</code>; today it emerges from bank sizes.</li>
<li><strong>SFT warm-up</strong> on valid MCQ JSON before GRPO — skips the cold-start formatting phase we hit in Lottery.</li>
<li><strong>Human-subjects transfer</strong> — once the policy beats zero-shot on our bank, measure sim-to-real with held-out TS data outside the Green Agent’s training split.</li>
</ul>
</section>

<!-- 16. CONCLUSION -->
<section>
<h2>Conclusion</h2>
<p><strong>TemporalBenchEnv</strong> is the training-side companion to our Melady TS Green Agent’s eval-side role. The Green Agent ranks purple agents on TS-Benchmark through the A2A protocol; TemporalBenchEnv re-exposes the same four-dataset MCQ suite as an OpenEnv-native sequential MDP, so an LLM can be <em>post-trained</em> on the very benchmark it will later be scored against.</p>
<p>Every design choice — 9-question episodes with 6 primary + 3 cross-domain, per-step correctness + terminal bonus with a domain-coverage multiplier, stagewise curriculum T1U → +T3 → +T2_MCQ, and a strict separation of env server and trainer over WebSocket — is aimed at preserving the verifiability that made the Green Agent’s signal trustworthy, while turning it into gradient. The infrastructure contributions are mostly <em>inherited</em> from our Lottery / Reasoning siblings; the novelty is the domain binding and the Green-Agent → OpenEnv pipeline.</p>
<p>The research question is open: <em>can a GRPO-trained LLM, post-trained on our own benchmark, outperform strong zero-shot baselines at cross-domain TS reasoning?</em> The environment, banks, and deployment path are shipped; full training results await the public training client and the compute budget described under Architecture. The purple track stays live on AgentBeats via <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</p>
</section>

<div class="footer">
<p>TemporalBenchEnv · Melady / AgentX OpenEnv Track · USC</p>
<p style="margin-top:.5rem;">
<a href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank">GitHub</a> ·
<a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">HF Space</a> ·
<a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank">Melady TS Green Agent</a> ·
<a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank">Green Agent GitHub</a> ·
<a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> ·
<a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL × OpenEnv</a>
</p>
</div>

</div>
</body>
</html>