<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>TemporalBenchEnv: An OpenEnv Benchmark for Post-training LLMs on Multi-step Time-Series Reasoning</title>
<meta name="description" content="TemporalBenchEnv: an OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning, extending our Melady TS Green Agent submission with a verifiable-reward GRPO training path.">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
<!-- Mermaid for inline diagrams -->
<script type="module">
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
mermaid.initialize({
startOnLoad: true,
theme: 'dark',
themeVariables: {
primaryColor: '#6366f1',
primaryTextColor: '#e2e8f0',
primaryBorderColor: '#818cf8',
lineColor: '#818cf8',
secondaryColor: '#1e293b',
tertiaryColor: '#172033',
background: '#0f172a',
mainBkg: '#1e293b',
nodeBorder: '#818cf8',
clusterBkg: '#172033',
clusterBorder: '#334155',
titleColor: '#e2e8f0',
edgeLabelBackground: '#1e293b',
nodeTextColor: '#e2e8f0'
},
flowchart: { curve: 'basis', htmlLabels: true },
fontFamily: 'Inter, sans-serif'
});
</script>
<style>
:root {
--bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155;
--text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1;
--accent2: #818cf8; --green: #22c55e; --red: #ef4444;
--orange: #f59e0b; --radius: 12px;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
html { scroll-behavior: smooth; }
body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
background: var(--bg); color: var(--text); line-height: 1.7;
-webkit-font-smoothing: antialiased; }
.container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; }
/* Top nav */
.topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85);
backdrop-filter: blur(10px); border-bottom: 1px solid var(--border);
padding: .9rem 1.5rem; display: flex; justify-content: space-between;
align-items: center; font-size: .88rem; }
.topnav .brand { font-weight: 700; color: var(--text); text-decoration: none;
display: flex; align-items: center; gap: .5rem; }
.topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%;
background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); }
.topnav .links { display: flex; gap: 1.25rem; }
.topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; }
.topnav .links a:hover { color: var(--accent2); }
/* Hero */
.hero { text-align: center; padding: 4rem 0 2.5rem; }
.hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2);
padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600;
letter-spacing: .08em; margin-bottom: 1.25rem;
border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; }
.hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em;
line-height: 1.15;
background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%);
-webkit-background-clip: text; -webkit-text-fill-color: transparent;
background-clip: text; }
.hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 640px;
margin: 1rem auto 0; }
.hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem;
font-style: italic; }
.banner { width: 100%; border-radius: var(--radius); margin: 2rem 0 3rem;
border: 1px solid var(--border); }
/* Badges row */
.badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap;
margin: 1.5rem 0; }
.badges img { height: 22px; }
/* Button group */
.btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0;
flex-wrap: wrap; }
.btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem;
background: var(--accent); color: white; border-radius: 8px; font-size: .88rem;
font-weight: 600; text-decoration: none; transition: all .2s; }
.btn:hover { background: var(--accent2); transform: translateY(-1px); }
.btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
.btn-outline:hover { border-color: var(--accent); color: var(--accent2);
background: rgba(99,102,241,.08); }
/* TOC */
.toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius);
padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; }
.toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase;
color: var(--accent2); margin-bottom: .85rem; }
.toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem;
margin: 0; padding: 0; }
.toc ol li { counter-increment: toc; font-size: .88rem; }
.toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700;
font-size: .8rem; margin-right: .3rem; }
.toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; }
.toc ol li a:hover { color: var(--accent2); }
/* Sections */
section { margin: 3.5rem 0; }
section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em;
margin-bottom: 1rem; color: var(--text);
border-left: 3px solid var(--accent); padding-left: .9rem; }
section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem;
color: var(--accent2); }
section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; }
section p strong { color: var(--text); }
section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; }
section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; }
section ul li strong, section ol li strong { color: var(--text); }
/* Pull-quote */
blockquote { border-left: 3px solid var(--accent2);
background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem;
margin: 1.5rem 0; border-radius: 0 8px 8px 0;
color: #e2e8f0; font-size: 1.02rem; }
/* Tables */
.table-wrap { margin: 1.5rem 0; overflow-x: auto;
background: var(--surface); border: 1px solid var(--border);
border-radius: var(--radius); }
table { width: 100%; border-collapse: collapse; font-size: .92rem; }
th { background: rgba(99,102,241,.1); color: var(--accent2);
font-size: .72rem; font-weight: 700; letter-spacing: .06em;
text-transform: uppercase; padding: .85rem 1rem; text-align: left; }
td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; }
td.num { text-align: right; font-variant-numeric: tabular-nums;
font-family: 'JetBrains Mono', monospace; font-size: .88rem; }
tr:hover td { background: rgba(99,102,241,.04); }
td strong, th strong { color: var(--text); }
.task-id { font-family: 'JetBrains Mono', monospace; font-weight: 700;
color: var(--accent2); font-size: .85rem; }
tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700;
color: var(--text); }
tr.novel td:first-child { color: #fca5a5; }
/* Code */
pre { background: #0b1120; border: 1px solid var(--border);
border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto;
margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace;
font-size: .85rem; line-height: 1.6; color: #d1d5db; }
pre .c { color: #64748b; }
code { font-family: 'JetBrains Mono', monospace; font-size: .88em;
background: rgba(99,102,241,.12); color: var(--accent2);
padding: .1em .35em; border-radius: 4px; }
pre code { background: none; color: inherit; padding: 0; font-size: 1em; }
/* Figure */
figure { margin: 2rem 0; }
figure img { width: 100%; border-radius: var(--radius);
border: 1px solid var(--border); }
figcaption { text-align: center; color: var(--muted); font-size: .85rem;
margin-top: .75rem; }
/* Mermaid diagram wrapper */
.mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border);
border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; }
.mermaid-wrap .mermaid { display: flex; justify-content: center; }
.mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem;
margin-top: .75rem; }
/* Episode trace */
.episode-trace { background: var(--surface); border: 1px solid var(--border);
border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0;
position: relative; }
.episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem;
bottom: 1.25rem; width: 2px; background: var(--border); }
.trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; }
.trace-step:last-child { margin-bottom: 0; }
.trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px;
height: 12px; border-radius: 50%; border: 2px solid var(--accent);
background: var(--bg); z-index: 1; }
.trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); }
.trace-step .step-marker.good { background: var(--green); border-color: var(--green); }
.trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem;
color: var(--accent2); font-weight: 700; margin-bottom: .25rem; }
.trace-step .step-content { font-size: .9rem; color: #cbd5e1; }
.trace-step .step-content code { font-size: .82em; }
.trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px;
font-size: .9rem; font-weight: 600; }
.trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3);
color: #fca5a5; }
.trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3);
color: #86efac; }
/* Callout for the closing question */
.callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0;
background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04));
border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); }
.callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text);
font-style: italic; margin-bottom: .5rem; }
.callout .sub { color: var(--muted); font-size: .95rem; }
/* Notice panel (for "empirical results pending" etc.) */
.notice { padding: 1.1rem 1.25rem; margin: 1.5rem 0;
background: rgba(245,158,11,.08); border: 1px solid rgba(245,158,11,.3);
border-radius: var(--radius); color: #fde68a; font-size: .95rem; }
.notice strong { color: #fbbf24; }
/* Footer */
.footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted);
font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; }
.footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; }
.footer a:hover { text-decoration: underline; }
@media (max-width: 640px) {
.container { padding: 1rem 1rem 3rem; }
.hero { padding: 2.5rem 0 1.5rem; }
.topnav .links { display: none; }
section h2 { font-size: 1.3rem; }
table { font-size: .82rem; }
th, td { padding: .55rem .6rem; }
.toc ol { flex-direction: column; }
.episode-trace { padding: 1rem; }
.episode-trace::before { left: 1rem; }
}
/* MathJax display blocks */
.math-display {
margin: 1.25rem 0;
padding: 1rem 1.25rem 1.15rem;
overflow-x: auto;
background: var(--surface);
border: 1px solid var(--border);
border-radius: var(--radius);
text-align: center;
}
.math-display mjx-container[jax="CHTML"][display="true"] { margin: 0.65em 0 !important; }
.math-display mjx-container { color: #e2e8f0 !important; }
.math-note { font-size: .9rem; color: var(--muted); margin-top: .35rem; margin-bottom: 0; }
</style>
<!-- MathJax 3: client-side TeX like static HF Spaces (sdk: static) serve plain HTML. -->
<script>
window.MathJax = {
tex: {
inlineMath: [['\\(', '\\)']],
displayMath: [['\\[', '\\]']]
},
options: {
renderActions: {
addMenu: [0, '', '']
}
}
};
</script>
<script defer src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" id="MathJax-script"></script>
</head>
<body>
<nav class="topnav">
<a href="#top" class="brand"><span class="dot"></span> TemporalBenchEnv Blog</a>
<div class="links">
<a href="#why">Why</a>
<a href="#green-agent">Green Agent</a>
<a href="#design">Design</a>
<a href="#scoring">Scoring</a>
<a href="#traces">Traces</a>
<a href="#results">Results</a>
<a href="#engineering">Engineering</a>
<a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">Live Space ↗</a>
</div>
</nav>
<div class="container" id="top">
<div class="hero">
<div class="hero-badge">OpenEnv · Extension of Melady TS Green Agent</div>
<h1>TemporalBenchEnv</h1>
<p class="subtitle">An OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning — built on the four datasets from our Melady TS Green Agent submission.</p>
<div class="badges">
<a href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"/></a>
<a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a>
<img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
<img src="https://img.shields.io/badge/Datasets-4-brightgreen" alt="4 Datasets"/>
<img src="https://img.shields.io/badge/Episodes-9%20MCQ-blueviolet" alt="9-step MCQ"/>
<img src="https://img.shields.io/badge/Training-GRPO%20(planned)-orange" alt="GRPO (planned)"/>
</div>
<div class="byline">Melady / AgentX OpenEnv Track | Muyan Weng (USC) · Defu Cao (USC) · Yashaswi Sharma (USC) · Yan Liu (USC)</div>
</div>
<img src="banner.png" alt="TemporalBenchEnv: question → answer → score across four time-series datasets" class="banner"/>
<div class="btn-group">
<a class="btn" href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">Live Environment Space →</a>
<a class="btn btn-outline" href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank">GitHub Repo</a>
</div>
<!-- Table of Contents -->
<nav class="toc" id="toc">
<h3>Contents</h3>
<ol>
<li><a href="#why">Multi-step TS Reasoning as a Verifiable Env</a></li>
<li><a href="#green-agent">Extension of Our Green Agent Submission</a></li>
<li><a href="#purple-harnesses">Purple Agent Harnesses (Eval Targets)</a></li>
<li><a href="#matters">Why This Benchmark Matters</a></li>
<li><a href="#prior-work">Prior Work & Novelty</a></li>
<li><a href="#design">What TemporalBenchEnv Is</a></li>
<li><a href="#env-design">Environment Design</a></li>
<li><a href="#openenv">Why OpenEnv</a></li>
<li><a href="#scoring">Scoring: Per-step + Episode Bonus</a></li>
<li><a href="#architecture">Architecture & Training Pipeline</a></li>
<li><a href="#traces">Episode Trace (Ideal, Illustrative)</a></li>
<li><a href="#results">Results: Baselines & Projected Targets</a></li>
<li><a href="#engineering">Engineering Lessons (Inherited)</a></li>
<li><a href="#foundations">Foundations & Citations</a></li>
<li><a href="#quickstart">Quick Start</a></li>
<li><a href="#future">Future Work</a></li>
</ol>
</nav>
<!-- 1. WHY -->
<section id="why">
<h2>Multi-step TS reasoning as a verifiable environment</h2>
<p>Most time-series LLM benchmarks grade a single prompt at a time. <strong>TemporalBenchEnv</strong> grades an <strong>episode</strong>: nine multiple-choice questions drawn from four time-series datasets, answered one per step, with a terminal bonus that rewards both accuracy and <strong>cross-domain coverage</strong>.</p>
<!-- TODO: Replace OpenReview link with arXiv link when available -->
<p>Every reward signal here is <strong>ground-truth arithmetic</strong>, not a judge. Labels are produced by the <a href="https://openreview.net/forum?id=rp5qCsxXca&amp;noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">TS-Benchmark</a> construction pipeline (trend / volatility / seasonality / outlier thresholds; S1–S5 family rules), so the environment can score an answer with a normalized string match against the stored ground truth. There is no LLM judge in the loop.</p>
<p>The falsifiable hypothesis this environment is built to test: <em>whether a GRPO-trained LLM, post-trained on sequential episodes sampled from our Green Agent’s own benchmark, outperforms strong zero-shot baselines on per-domain MCQ accuracy while hitting the cross-domain coverage bonus.</em> Empirical adjudication is contingent on the training runs described under <a href="#architecture" style="color:var(--accent2)">Architecture & training pipeline</a>.</p>
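<p>To make the grading rule concrete, here is a minimal sketch of normalized exact-match scoring. The <code>grade_answer</code> name matches the component in Figure 1 below; the specific normalization (case-fold, trim, collapse whitespace) and the option-acceptance details are assumptions, and the shipped logic lives in <code>env/reward.py</code>.</p>
<pre><code><span class="c"># Sketch only: assumed normalization and acceptance rule.</span>
def _normalize(text: str) -> str:
    return " ".join(text.strip().lower().split())

def grade_answer(answer: str, ground_truth: str, options: list[str]) -> bool:
    a = _normalize(answer)
    if a == _normalize(ground_truth):
        return True
    <span class="c"># Also accept the agent echoing a full option whose normalized</span>
    <span class="c"># text matches the ground truth (e.g. "b) upward" vs "upward").</span>
    return any(a == _normalize(opt) and _normalize(ground_truth) in a
               for opt in options)</code></pre>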
</section>
<!-- 2. EXTENSION OF GREEN AGENT -->
<section id="green-agent">
<h2>Extension of our Melady TS Green Agent submission</h2>
<p>TemporalBenchEnv is a <strong>direct extension</strong> of our AgentBeats <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)"><strong>Melady TS Green Agent</strong></a> submission. The Green Agent is an A2A-protocol evaluator that grades purple agents on 764 TS-Benchmark tasks (see the <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank" style="color:var(--accent2)">Green Agent GitHub repository</a>). TemporalBenchEnv takes the <em>same datasets and task taxonomy</em> and re-exposes them as a sequential <strong>OpenEnv</strong> environment consumable by TRL’s <code>rollout_func</code> — turning the benchmark into a <strong>training target</strong>.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Artifact</th><th>Melady TS Green Agent</th><th>TemporalBenchEnv (this submission)</th></tr></thead>
<tbody>
<tr><td><strong>Role</strong></td><td>A2A evaluator of purple agents</td><td>OpenEnv RL environment for post-training LLMs</td></tr>
<tr><td><strong>Datasets</strong></td><td>PSML, freshretailnet, MIMIC, causal_chambers</td><td>Same four datasets</td></tr>
<tr><td><strong>Tasks</strong></td><td>T1/T3 (accuracy) + T2/T4 (regression + MCQ), 764 total</td><td>MCQ subset: T1U, T3, T2_MCQ — 2,775 <code>TSQuestion</code> records</td></tr>
<tr><td><strong>Per-domain bank sizes</strong></td><td>Packaged in the Docker image</td><td>PSML 750 · freshretailnet 616 · MIMIC 709 · causal_chambers 700</td></tr>
<tr><td><strong>Protocol</strong></td><td>A2A messaging, one-shot prompts</td><td>WebSocket OpenEnv contract, sequential 9-step MDP</td></tr>
<tr><td><strong>Reward</strong></td><td>MSE / MAE / RMSE / MASE / accuracy (eval metrics)</td><td>Per-step correctness + terminal episode bonus w/ coverage multiplier</td></tr>
<tr><td><strong>Consumer</strong></td><td>AgentBeats leaderboard</td><td>TRL 1.0 <code>rollout_func</code>, vLLM colocate / server, GRPO</td></tr>
</tbody>
</table>
</div>
<p>The ETL from the Green Agent’s labeled JSONL into the per-domain <code>TSQuestion</code> banks the environment consumes lives in <code>TS-benchmark/scripts/build_temporal_bench_openenv_banks.py</code>; banks ship at <code>openenv-ts/TemporalBenchEnv/data/banks/</code> and are loaded via the <code>TEMPORALBENCH_QUESTION_BANK_DIR</code> environment variable.</p>
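<p>For orientation, a sketch of loading banks of that shape. The directory variable is the one named above; the one-JSONL-file-per-domain layout is an assumption, and the shipped loader may differ.</p>
<pre><code>import json, os
from pathlib import Path

BANK_DIR = Path(os.environ.get(
    "TEMPORALBENCH_QUESTION_BANK_DIR",
    "openenv-ts/TemporalBenchEnv/data/banks"))

def load_bank(domain: str) -> list[dict]:
    <span class="c"># Assumed layout: one JSONL file of TSQuestion records per domain.</span>
    with (BANK_DIR / f"{domain}.jsonl").open() as f:
        return [json.loads(line) for line in f if line.strip()]

banks = {d: load_bank(d) for d in
         ("PSML", "freshretailnet", "MIMIC", "causal_chambers")}
<span class="c"># Expected sizes per the table above: 750 / 616 / 709 / 700 = 2,775.</span></code></pre>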
<blockquote>The Green Agent answered: <em>which purple agent is best at TS reasoning right now?</em> TemporalBenchEnv answers: <em>can we post-train an LLM, on that exact benchmark, to be the next best purple agent?</em></blockquote>
</section>
<!-- 2b. PURPLE AGENT HARNESSES -->
<section id="purple-harnesses">
<h2>Purple agent harnesses: evaluating mainstream TS-capable agent stacks</h2>
<p>The Green Agent scores <em>purple agents</em> over the A2A protocol — any stack that speaks A2A is a valid participant. To make the benchmark diagnostic for modern agentic time-series practice, we target four of the most popular open-source agent harnesses as purple agents on AgentBeats: <strong>two are implemented and live</strong> (AgentScope, CAMEL) and <strong>two are planned</strong> (MetaGPT, TimeSeriesScientist). Each harness stays <em>unchanged internally</em>; a thin A2A adapter feeds the Green Agent’s TS-Benchmark MCQs into the harness’s own reasoning / tool-use loop and returns a final label. These four frameworks are the de facto “agentic harnesses” TS practitioners actually wrap around LLMs today, so they anchor the eval to real downstream usage.</p>
<div class="notice">
<strong>The feedback loop we are instrumenting.</strong> For every harness we expose a swappable backbone LLM (e.g. Qwen2.5-7B-Instruct or GPT-4o-mini). The plan is: <strong>(1)</strong> score the base backbone inside each harness via the <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">Melady TS Green Agent</a>, <strong>(2)</strong> <strong>post-train that same backbone with this OpenEnv</strong> — TemporalBenchEnv’s randomized-domain / randomized-task 9-step MCQ episodes with per-step + terminal verifiable rewards, <strong>(3)</strong> <strong>re-score every harness</strong> with the post-trained backbone in place of the base, and <strong>(4)</strong> attribute any delta specifically to the RL post-training rather than to harness architecture. The panel makes the research question concrete: <em>does randomized-domain verifiable-reward post-training on TemporalBenchEnv actually transfer to agentic TS reasoning under mainstream orchestration frameworks?</em>
</div>
<h3>Implemented — live purple agents on AgentBeats</h3>
<div class="table-wrap">
<table>
<thead><tr><th>Harness</th><th>What it is</th><th>Why it matters for TS reasoning</th><th>AgentBeats listing</th></tr></thead>
<tbody>
<tr>
<td><strong><a href="https://github.com/agentscope-ai/agentscope" target="_blank" style="color:var(--accent2)">AgentScope</a></strong><br><span style="font-size:.85em;color:var(--muted)">Gao et al., <a href="https://arxiv.org/abs/2508.16279" target="_blank" style="color:var(--accent2)">“AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications” (arXiv:2508.16279, 2025)</a>; Gao et al., <a href="https://arxiv.org/abs/2402.14034" target="_blank" style="color:var(--accent2)">“AgentScope: A Flexible yet Robust Multi-Agent Platform” (arXiv:2402.14034, 2024)</a></span></td>
<td>Production-ready ReAct agent framework with built-in tools, memory, planning, MCP / A2A interop, and an agentic-RL tuner (Trinity-RFT). Apache-2.0.</td>
<td>The most common Python-native ReAct / tool-use stack; A2A support makes the Green Agent wiring mechanical. Serves as our canonical “single-ReAct-loop” purple baseline over TS MCQs.</td>
<td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a></td>
</tr>
<tr>
<td><strong><a href="https://github.com/camel-ai/camel" target="_blank" style="color:var(--accent2)">CAMEL</a></strong><br><span style="font-size:.85em;color:var(--muted)">Li et al., <a href="https://arxiv.org/abs/2303.17760" target="_blank" style="color:var(--accent2)">“CAMEL: Communicative Agents for ‘Mind’ Exploration of Large Language Model Society” (NeurIPS 2023, arXiv:2303.17760)</a></span></td>
<td>Role-playing multi-agent framework with stateful memory, structured messages, societies / workforce pipelines, and a strong focus on <em>scaling laws of agents</em>.</td>
<td>Role-play + critic loop tests whether a trend / seasonality / outlier answer survives multi-turn discussion without drifting; complements AgentScope’s single-ReAct shape with an inter-agent communication surface.</td>
<td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a></td>
</tr>
</tbody>
</table>
</div>
<h3>Planned — purple agents under development</h3>
<div class="table-wrap">
<table>
<thead><tr><th>Harness</th><th>What it is</th><th>Why it matters for TS reasoning</th><th>Status</th></tr></thead>
<tbody>
<tr>
<td><strong><a href="https://github.com/FoundationAgents/MetaGPT" target="_blank" style="color:var(--accent2)">MetaGPT</a></strong><br><span style="font-size:.85em;color:var(--muted)">Hong et al., <a href="https://arxiv.org/abs/2308.00352" target="_blank" style="color:var(--accent2)">“MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework” (ICLR 2024, oral; arXiv:2308.00352)</a></span></td>
<td>SOP-driven multi-agent system — explicit role decomposition (product manager / architect / engineer / QA) orchestrated by a meta-programming layer. One of the most-starred multi-agent frameworks in the wild.</td>
<td>Gives us a decomposition-heavy purple: a “data analyst” + “forecaster” / “reviewer” role split answering the same MCQs, isolating whether explicit SOPs help TS reasoning vs. a flat ReAct loop.</td>
<td><strong>Planned.</strong> A2A wrapper not yet published on AgentBeats.</td>
</tr>
<tr>
<td><strong><a href="https://github.com/Y-Research-SBU/TimeSeriesScientist" target="_blank" style="color:var(--accent2)">TimeSeriesScientist (TSci)</a></strong><br><span style="font-size:.85em;color:var(--muted)">Zhao et al., <a href="https://arxiv.org/abs/2510.01538" target="_blank" style="color:var(--accent2)">“TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis” (arXiv:2510.01538, 2025)</a></span></td>
<td>Domain-specific LangGraph agent purpose-built for TS: <em>Curator → Planner → Forecaster → Reporter</em>, with statistical / ML / DL model selection and ensembling (ARIMA, Prophet, LSTM, XGBoost, Transformer, …).</td>
<td>The strongest TS-specialized agent we can find; a natural “ceiling” for agentic TS reasoning and a direct yardstick for whether an RL-post-trained generalist LLM inside a simpler harness closes the gap to a purpose-built TS pipeline.</td>
<td><strong>Planned.</strong> A2A adapter not yet published on AgentBeats.</td>
</tr>
</tbody>
</table>
</div>
<blockquote><strong>Green Agent + TemporalBenchEnv + harness panel = a closed evaluation loop.</strong> The Green Agent scores every harness on a fixed TS-Benchmark MCQ set; TemporalBenchEnv uses the <em>same datasets and task taxonomy</em> to post-train a backbone with randomized-domain / randomized-task episodes under verifiable rewards; we then drop the post-trained backbone back into AgentScope, CAMEL, MetaGPT, and TSci and re-score. A delta from the RL post-training should show up as a <em>uniform</em> lift across all four harnesses — not just in one framework, and not just in the zero-shot MCQ row of the baselines table.</blockquote>
</section>
<!-- 3. WHY IT MATTERS -->
<section id="matters">
<h2>Why this benchmark matters</h2>
<p>Time-series reasoning is one of the few areas where frontier LLMs still look visibly weak, but also one where the <strong>verifiable signal is clean</strong>: trend / volatility / seasonality / outlier labels are constructed from thresholded statistics of the series, so exact-match grading is unambiguous. That makes MCQ episodes a near-ideal RL target before touching noisier numeric forecasting rewards.</p>
<p>The design is <strong>transferable</strong>. Any benchmark that produces labeled MCQ records over a set of domains — medical diagnostics, power-grid anomaly tagging, retail demand regimes — fits the same 9-question cross-domain template. Datasets are the proxy; the capability is <strong>multi-step, multi-domain, verifiable TS reasoning</strong>.</p>
<p>Every reward component is <strong>ground-truth arithmetic</strong>. The environment samples questions from pre-built banks, scores answers via normalized string equality, and aggregates with a closed-form episode bonus. No LLM judge, no circular reward.</p>
</section>
<!-- 4. PRIOR WORK & NOVELTY -->
<section id="prior-work">
<h2>Prior work & novelty</h2>
<p>Prior “LLMs + time series” work lands in one of three buckets. None occupies the cell we target:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Prior work bucket</th><th>What it does</th><th>What it does not</th></tr></thead>
<tbody>
<tr><td><strong>Static TS benchmarks</strong><br><span style="font-size:.85em;color:var(--muted)">TS-Benchmark (ours, <a href="https://openreview.net/forum?id=rp5qCsxXca&amp;noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">OpenReview</a>); FreshRetailNet (<a href="https://arxiv.org/abs/2505.16319" target="_blank" style="color:var(--accent2)">arXiv:2505.16319</a>); PSML (<a href="https://www.nature.com/articles/s41597-022-01455-7" target="_blank" style="color:var(--accent2)">Nat. Sci. Data 2022</a>); MIMIC-IV (<a href="https://www.nature.com/articles/s41597-022-01899-x" target="_blank" style="color:var(--accent2)">Nat. Sci. Data 2022</a>); Causal Chambers (<a href="https://www.nature.com/articles/s42256-024-00964-x" target="_blank" style="color:var(--accent2)">Nat. MI 2024</a>)</span></td><td>Construct labeled MCQ / regression tasks over TS datasets, graded by fixed rules</td><td>No RL-native environment contract, no sequential episodes, no post-training loop</td></tr>
<tr><td><strong>TS-LLM composite RL rewards</strong><br><span style="font-size:.85em;color:var(--muted)">TimeMaster (<a href="https://arxiv.org/abs/2506.13705" target="_blank" style="color:var(--accent2)">arXiv:2506.13705</a>); COUNTS (<a href="https://arxiv.org/abs/2510.01116" target="_blank" style="color:var(--accent2)">arXiv:2510.01116</a>); SenTSR-Bench (<a href="https://arxiv.org/abs/2602.19455" target="_blank" style="color:var(--accent2)">arXiv:2602.19455</a>)</span></td><td>Composite reward shaping (MSE + DTW + direction + quantile + ArcTan) for TS multimodal LLMs</td><td>No OpenEnv contract, no sequential MCQ episodes, reward is forecasting-centric</td></tr>
<tr><td><strong>A2A evaluators</strong><br><span style="font-size:.85em;color:var(--muted)">Melady TS Green Agent (<a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">AgentBeats</a>); other AgentBeats green agents</span></td><td>Score deployed purple agents on the benchmark through the A2A protocol</td><td>Not a training environment; no per-step RL signal, no environment state</td></tr>
<tr class="novel"><td><strong>TemporalBenchEnv (ours)</strong></td><td>Sequential 9-step MCQ MDP over four TS datasets, structured Pydantic actions, terminal reward from ground-truth labels + coverage multiplier, OpenEnv + GRPO contract</td><td>Does not (yet) train numeric forecasting — T2/T4 reward is stubbed for future work</td></tr>
</tbody>
</table>
</div>
<blockquote>To our knowledge, no prior work exposes TS-Benchmark’s multi-dataset MCQ suite as an <strong>OpenEnv-native sequential MDP with verifiable terminal rewards</strong> suitable for GRPO post-training. The action space, reward decomposition, and sibling-env architecture follow the <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> lineage; the domain and task taxonomy follow our own Melady TS Green Agent.</blockquote>
</section>
<!-- 5. WHAT IT IS -->
<section id="design">
<h2>What TemporalBenchEnv is</h2>
<blockquote>An OpenEnv-native sequential MDP in which an LLM agent answers nine MCQ questions — six from a primary domain and one each from the other three — earning per-step correctness and a terminal bonus that rewards cross-domain coverage.</blockquote>
<p>Each episode proceeds like this:</p>
<ul>
<li><code>reset()</code> samples nine <code>TSQuestion</code> records: <strong>6 from the primary domain</strong> (default PSML), with T3 families round-robined for diversity, and <strong>1 from each of the three non-primary domains</strong>. Final order is shuffled.</li>
<li>On turn <em>t</em>, the agent emits a <code>TemporalBenchAction</code> containing the MCQ label, plus optional <code>confidence</code> and <code>reasoning</code> fields.</li>
<li>The environment <strong>grades</strong> the answer via normalized string equality against <code>question.answer</code> (also accepting an option whose normalized text matches the ground truth), and returns the next question.</li>
<li>Per-step reward is <code>alpha * correctness</code>. Mid-episode bonuses are zero.</li>
<li>On the final step, the environment adds the <strong>terminal episode bonus</strong>: <code>lambda_ep * (total_correct / N) * coverage_multiplier</code>, where the multiplier is <code>1.0</code> if every one of the four domains contributed at least one correct answer, else <code>0.8</code>.</li>
</ul>
<p>The agent’s interface is deliberately minimal: a single answer string per step, no tool-call protocol. Optional <code>confidence</code> and <code>reasoning</code> fields exist on the action for future reward shaping.</p>
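<p>The whole agent-side contract fits in a few lines. A hedged sketch of one episode over the client (the client class name and constructor follow generic OpenEnv <code>EnvClient</code> conventions and are assumptions; <code>my_policy</code> is a hypothetical stand-in for the LLM call):</p>
<pre><code><span class="c"># Sketch: one 9-step episode against a locally served environment.</span>
from temporal_bench_env import TemporalBenchEnv, TemporalBenchAction  <span class="c"># assumed import path</span>

env = TemporalBenchEnv(base_url="http://localhost:8000")
result, total = env.reset(), 0.0
while not result.done:
    obs = result.observation
    label = my_policy(obs.question, obs.options)   <span class="c"># your LLM call here</span>
    result = env.step(TemporalBenchAction(answer=label))
    total += result.reward or 0.0                  <span class="c"># per-step α·correct; bonus lands on step 9</span>
print("episode return:", total)</code></pre>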
</section>
<!-- 6. ENVIRONMENT DESIGN -->
<section id="env-design">
<h2>Environment design</h2>
<p>The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (see <code>openenv-ts/TemporalBenchEnv/env/models.py</code>):</p>
<pre><code><span class="c"># Action (agent → env)</span>
class TemporalBenchAction(Action):
    answer: str                      <span class="c"># MCQ label matching an option</span>
    confidence: Optional[float]      <span class="c"># in [0, 1], unused in reward for now</span>
    reasoning: Optional[str]         <span class="c"># optional CoT, unused in reward for now</span>

<span class="c"># Observation (env → agent)</span>
class TemporalBenchObservation(Observation):
    step_idx: int; steps_remaining: int; max_steps: int
    question: str                    <span class="c"># current MCQ prompt</span>
    options: list[str]               <span class="c"># 2+ answer choices</span>
    task_type: str                   <span class="c"># "T1U" | "T3" | "T2_MCQ"</span>
    dataset: str                     <span class="c"># "PSML" | "freshretailnet" | "MIMIC" | "causal_chambers"</span>
    history: list[dict]              <span class="c"># [{question, answer, correct, dataset, ...}, ...]</span>
    accuracy_so_far: float
    done: bool; reward: Optional[float]; metadata: dict

<span class="c"># State (serializable snapshot)</span>
class TemporalBenchState(State):
    episode_id: Optional[str]
    step_count: int; total_correct: int; total_questions: int
    current_accuracy: float
    primary_domain: str
    per_task_type_accuracy: dict[str, float]
    total_reward: float</code></pre>
<p>Four datasets, three MCQ task types, and a three-stage curriculum shape the training distribution (see <code>env/config.py</code> and <code>env/episode_sampler.py</code>):</p>
<div class="table-wrap">
<table>
<thead><tr><th>Stage</th><th>Allowed task types</th><th>Purpose</th></tr></thead>
<tbody>
<tr><td><strong>Stage 1</strong></td><td><code>T1U</code> only (non-contextual understanding MCQ)</td><td>Shorten credit assignment; learn trend / volatility / seasonality / outliers first</td></tr>
<tr><td><strong>Stage 2</strong></td><td><code>T1U</code> + <code>T3</code> (contextual understanding, S1–S5 families)</td><td>Add context-conditioned reasoning; maintain verifiable labels</td></tr>
<tr><td><strong>Stage 3</strong></td><td><code>T1U</code> + <code>T3</code> + <code>T2_MCQ</code> (prediction-as-classification)</td><td>Full MCQ track; adds direction-of-change / volatility-change / seasonality-alignment</td></tr>
</tbody>
</table>
</div>
<p>Curriculum is honored both at <code>EnvConfig(curriculum_stage=...)</code> construction and at <code>env.reset(curriculum_stage=...)</code>, so a single server can serve multiple stages to different sessions concurrently.</p>
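<p>A minimal sketch of both entry points (the <code>curriculum_stage</code> kwarg follows the text above; the import paths and any other <code>EnvConfig</code> fields are assumptions):</p>
<pre><code>from env.config import EnvConfig                      <span class="c"># assumed module paths</span>
from env.environment import TemporalBenchEnvironment

<span class="c"># Stage fixed at construction for every session this env serves...</span>
cfg = EnvConfig(curriculum_stage=1)            <span class="c"># Stage 1: T1U only</span>
env = TemporalBenchEnvironment(config=cfg)

<span class="c"># ...or overridden per episode, so one server can hand different</span>
<span class="c"># stages to different concurrent sessions.</span>
obs = env.reset(curriculum_stage=3)            <span class="c"># Stage 3: full MCQ track</span></code></pre>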
</section>
<!-- 7. WHY OPENENV -->
<section id="openenv">
<h2>Why OpenEnv</h2>
<p>OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL’s <code>rollout_func</code>, (2) per-session state with <code>max_concurrent_envs=64</code> in our <code>create_app</code> factory — each WebSocket session gets a fresh <code>TemporalBenchEnvironment</code> via <code>_env_factory</code> so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development (<code>server/Dockerfile</code>, with <code>TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks</code>), and as a Hugging Face Space during training and evaluation.</p>
<p>No new abstractions were invented. Base types only: <code>EnvClient</code>, <code>Environment</code>, Pydantic <code>Action</code> / <code>Observation</code> / <code>State</code>. All extensions (history, per-domain coverage, per-task accuracy) ride on <code>metadata</code> or the serialized state. No new method signatures, no fork. The env ships with <code>openenv.yaml</code> and a <code>Dockerfile</code>, and passes <code>uv run openenv validate</code>.</p>
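<p>The resulting <code>server/app.py</code> shape, sketched. The names <code>create_app</code>, <code>_env_factory</code>, and <code>max_concurrent_envs</code> are from the text above; the exact OpenEnv <code>create_app</code> signature is an assumption.</p>
<pre><code>import argparse
import uvicorn

from env.config import EnvConfig                      <span class="c"># assumed module paths</span>
from env.environment import TemporalBenchEnvironment

def _env_factory():
    <span class="c"># Fresh environment per WebSocket session → no cross-talk</span>
    <span class="c"># between DDP ranks hammering the same Space.</span>
    return TemporalBenchEnvironment(config=EnvConfig())

app = create_app(env_factory=_env_factory, max_concurrent_envs=64)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
    args, _ = parser.parse_known_args()               <span class="c"># tolerate launcher extras</span>
    uvicorn.run(app, host="0.0.0.0", port=args.port)

if __name__ == "__main__":
    main()  <span class="c"># explicit call: see the hygiene note below</span></code></pre>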
<p class="math-note"><strong>Hygiene note:</strong> OpenEnv’s CLI validator does a naive substring check for <code>main()</code> in <code>server/app.py</code>. We match the reference pattern from <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> — an explicit <code>main()</code> call under <code>if __name__ == "__main__"</code> with CLI flags parsed via <code>parse_known_args</code>.</p>
</section>
<!-- 8. SCORING -->
<section id="scoring">
<h2>Scoring: per-step correctness + episode bonus</h2>
<p>Reward is decomposed into a per-step term and a terminal bonus (see <code>env/reward.py</code>).</p>
<div class="math-display" aria-label="Per-step correctness reward">
\[
r_t \;=\; \alpha \cdot \mathbf{1}\!\left[\hat a_t = a_t^\ast\right]
\]
</div>
<p class="math-note">Here \(\hat a_t\) is the agent’s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches).</p>
<div class="math-display" aria-label="Terminal episode bonus with coverage multiplier">
\[
B \;=\; \lambda_{\mathrm{ep}} \,\cdot\, \frac{C}{N} \,\cdot\, m,
\qquad
m \;=\; \begin{cases} 1.0 & \text{all 4 domains have } \ge 1 \text{ correct} \\ 0.8 & \text{otherwise} \end{cases}
\]
\[
R \;=\; \sum_{t=1}^{N} r_t \;+\; B
\]
</div>
<p class="math-note">\(C\) is the total correct count in the episode, \(N = 9\) is the episode length, \(\alpha, \lambda_{\mathrm{ep}}\) are <code>alpha</code> and <code>lambda_ep</code> in <code>EnvConfig</code>.</p>
<p>Defaults live in <code>EnvConfig</code>:</p>
<div class="table-wrap">
<table>
<thead><tr><th>Component</th><th>Weight</th><th>What it rewards</th></tr></thead>
<tbody>
<tr><td><strong>Per-step correctness</strong></td><td class="num">α = 1.0</td><td>Normalized-string match against the MCQ ground truth</td></tr>
<tr><td><strong>Episode bonus weight</strong></td><td class="num">λ<sub>ep</sub> = 0.5</td><td>Scales the terminal accuracy×coverage term</td></tr>
<tr><td><strong>Coverage multiplier</strong></td><td class="num">{0.8, 1.0}</td><td>1.0 iff every domain in <code>EnvConfig.all_domains</code> has ≥1 correct answer this episode</td></tr>
<tr><td><strong>Forecasting reward</strong></td><td class="num">—</td><td>Stubbed (<code>compute_forecasting_reward</code> raises <code>NotImplementedError</code>); future work</td></tr>
</tbody>
</table>
</div>
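<p>The whole decomposition is a few lines of arithmetic. A sketch mirroring the formulas above with the <code>EnvConfig</code> defaults (function names are illustrative, not the shipped <code>env/reward.py</code> API):</p>
<pre><code>def step_reward(correct: bool, alpha: float = 1.0) -> float:
    return alpha * float(correct)

def episode_bonus(correct_by_domain: dict[str, int],
                  n: int = 9, lambda_ep: float = 0.5) -> float:
    <span class="c"># All four domains must appear as keys, zeros included.</span>
    c = sum(correct_by_domain.values())
    m = 1.0 if all(v >= 1 for v in correct_by_domain.values()) else 0.8
    return lambda_ep * (c / n) * m

<span class="c"># 7/9 correct with all four domains covered → B ≈ 0.389</span>
print(episode_bonus({"PSML": 4, "freshretailnet": 1,
                     "MIMIC": 1, "causal_chambers": 1}))</code></pre>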
<blockquote><strong>Why a coverage multiplier:</strong> per-step accuracy alone lets the agent ace the six primary questions while guessing on the other three domains, collapsing to a single-domain policy. The <code>0.8</code> penalty forces the policy to treat the three cross-domain questions as first-class signal — the very thing that distinguishes a TS-generalist from a PSML-only memorizer.</blockquote>
</section>
<!-- 9. ARCHITECTURE & TRAINING PIPELINE -->
<section id="architecture">
<h2>Architecture & training pipeline</h2>
<p>Following the <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> / <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank" style="color:var(--accent2)">ReasoningEconomicsEnv</a> lineage, <strong>TemporalBenchEnv</strong> (this page) ships the OpenEnv side: reset, step, rewards, and question banks over WebSocket. A separate trainer process — same env/trainer separation as <code>LotteryElicitationEnv</code> and <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a> — would drive GRPO with TRL’s <code>rollout_func</code> and vLLM against that socket, without in-process imports of env-side types.</p>
<div class="notice">
<strong>Training client status.</strong> The companion GRPO / TRL package that runs rollouts against this environment is under <strong>active internal testing</strong>. We will release it on <strong>GitHub</strong> once we secure <strong>additional compute</strong> so we can stress-test and validate models at a scale we consider reasonable. This blog documents the shipped <strong>TemporalBenchEnv</strong> only until that release.
</div>
<blockquote><strong>Purple agents on AgentBeats.</strong> The same TS-Benchmark task surface is also exercised by <strong>purple agents</strong> scored through the A2A green agent: live listings are <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>. Planned additions are documented in <a href="#purple-harnesses" style="color:var(--accent2)">§ Purple agent harnesses</a>.</blockquote>
<div class="mermaid-wrap">
<pre class="mermaid">
flowchart LR
subgraph TRN ["Trainer (TRL + GRPO, WebSocket client)"]
GRPO["GRPOTrainer<br/>TRL 1.0"]
RF["rollout_func"]
VLLM["vLLM<br/>colocate/server"]
PARSE["action_parser<br/>MCQ label guardrails"]
end
subgraph ENV ["TemporalBenchEnv (OpenEnv)"]
WS["FastAPI<br/>WebSocket"]
SAMP["EpisodeSampler<br/>6+1+1+1 stratified"]
GRADE["grade_answer<br/>normalized match"]
REW["Reward<br/>per-step + bonus + coverage"]
end
GRPO --> RF
RF --> VLLM
VLLM -->|"generate"| PARSE
PARSE -->|"answer string"| WS
WS --> SAMP
SAMP -->|"next question"| WS
WS --> GRADE
GRADE --> REW
REW -->|"step + terminal reward"| WS
WS -->|"observation"| RF
</pre>
<p class="mermaid-caption">Figure 1. System architecture. The trainer never imports env-side types — everything crosses the WebSocket, exactly like our Lottery / Reasoning sibling envs.</p>
</div>
<p>The planned training recipe uses <strong>GRPO</strong> (Group Relative Policy Optimization), a natural fit for per-step verifiable rewards with an additive terminal bonus. The training scaffolding is directly inherited from <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a>: TRL 1.0’s <code>rollout_func</code> contract, vLLM colocate/server, chat-template tokenization with <code>enable_thinking=False</code>, think-block stripping, null-safe MCQ-label parsing, and episode logging to <code>reward_logs.jsonl</code>.</p>
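<p>For concreteness, a trainer-side skeleton of that contract: a sketch of what the unreleased client would do, assuming TRL 1.0’s <code>rollout_func</code> shape (prompts in; a dict of <code>prompt_ids</code> / <code>completion_ids</code> / <code>logprobs</code> out; extra keys such as <code>env_rewards</code> reaching the reward function as kwargs). <code>generate_with_vllm</code> and <code>parse_mcq_label</code> are hypothetical helpers, and <code>env</code> is the client from the earlier episode-loop sketch.</p>
<pre><code>def rollout_func(prompts: list[str], args, processing_class) -> dict[str, list]:
    out = {"prompt_ids": [], "completion_ids": [], "logprobs": [], "env_rewards": []}
    for prompt in prompts:
        result, completion, total = env.reset(), "", 0.0
        while not result.done:
            turn = generate_with_vllm(prompt, result.observation)   <span class="c"># vLLM server call</span>
            label = parse_mcq_label(turn)                           <span class="c"># null-safe guardrails</span>
            result = env.step(TemporalBenchAction(answer=label))
            completion += turn
            total += result.reward or 0.0
        out["prompt_ids"].append(processing_class(prompt)["input_ids"])
        out["completion_ids"].append(processing_class(completion)["input_ids"])
        out["logprobs"].append([])   <span class="c"># filled from the vLLM response in the real client</span>
        out["env_rewards"].append(total)
    return out   <span class="c"># the GRPO reward fn reads env_rewards and returns them as rewards</span></code></pre>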
</section>
<!-- 10. EPISODE TRACE (IDEAL, ILLUSTRATIVE) -->
<section id="traces">
<h2>Episode trace (ideal, illustrative)</h2>
<p>Here is what a high-reward episode <em>would</em> look like — five representative steps out of nine, spanning all four domains. Primary domain is PSML; the trace shows the three cross-domain picks plus two primary turns, culminating in the terminal step where the coverage multiplier decides the shape of the bonus. <strong>Turns 2, 4, 6, 8 are elided</strong> (all PSML T1U; two assumed correct and two incorrect, which yields the 7 / 9 total) so the walkthrough stays focused on the cross-domain structure.</p>
<div class="notice">
<strong>Illustrative, not captured.</strong> This is a hand-constructed walkthrough intended to explain the per-step reward, the 6 + 3 domain split, and the terminal coverage multiplier — <em>not</em> a real rollout from a trained (or even zero-shot) policy against the live environment. The prompts and agent answers below are author-written. The reward arithmetic (<code>r_t</code>, <code>B</code>, <code>R</code>) is computed exactly as <code>env/reward.py</code> would compute it for the stated correctness pattern. A captured trace from a real model will replace this once the training client is released and run at scale (see <a href="#architecture" style="color:var(--accent2)">Architecture & training pipeline</a>).
</div>
<div class="episode-trace">
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 1 · dataset=PSML · task=T1U:trend</div>
<div class="step-content">
<strong>Prompt:</strong> <em>“Based on the array (length=336), report trend: upward / downward / constant.”</em><br>
Series rises monotonically over the tail window.<br>
<strong>Agent answer:</strong> <code>"upward"</code> → <strong>correct</strong> (<code>r_1 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 3 · dataset=freshretailnet · task=T1U:seasonality</div>
<div class="step-content">
<strong>Prompt:</strong> daily demand for a fresh-retail SKU, peaks repeat every 7 steps with stable amplitude.<br>
<strong>Agent answer:</strong> <code>"fixed"</code> → <strong>correct</strong> (<code>r_3 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 5 · dataset=MIMIC · task=T1U:outliers</div>
<div class="step-content">
<strong>Prompt:</strong> ICU vital trace with a single spike above <code>q95 + 3·MAD</code>.<br>
<strong>Agent answer:</strong> <code>"sudden_spike"</code> → <strong>correct</strong> (<code>r_5 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 7 · dataset=causal_chambers · task=T3:S2</div>
<div class="step-content">
<strong>Prompt:</strong> wind-chamber actuator trace, contextual question on lagged response to a step input (capability <code>C4</code>).<br>
<strong>Agent answer:</strong> <code>"delayed_response"</code> → <strong>correct</strong> (<code>r_7 = 1.0</code>)
</div>
</div>
<div class="trace-step">
<div class="step-marker good"></div>
<div class="step-label">Turn 9 · dataset=PSML · task=T2_MCQ (terminal)</div>
<div class="step-content">
<strong>Prompt:</strong> “Median demand level change (forecast horizon vs history)?”<br>
<strong>Agent answer:</strong> <code>"Higher"</code> → <strong>correct</strong> (<code>r_9 = 1.0</code>)<br>
<strong>Episode totals:</strong> <code>C = 7 / 9</code> correct, all four domains covered → <code>m = 1.0</code>.<br>
<strong>Terminal bonus:</strong> <code>B = 0.5 · (7/9) · 1.0 ≈ 0.389</code>.<br>
<strong>Total return:</strong> <code>R ≈ 7.389</code>.
</div>
</div>
<div class="trace-verdict good">
Four-of-four coverage → full <strong>m = 1.0</strong> multiplier. Contrast: the same 7 / 9 accuracy with a missed cross-domain question (say, zero correct in MIMIC) would give <code>B ≈ 0.311</code> and <code>R ≈ 7.311</code>. The coverage term is the whole reason a PSML-only policy loses to a generalist.
</div>
</div>
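<p>The arithmetic in the trace is easy to verify mechanically under the stated correctness pattern:</p>
<pre><code>alpha, lam, N, C = 1.0, 0.5, 9, 7
R_full = C * alpha + lam * (C / N) * 1.0   <span class="c"># all four domains covered</span>
R_miss = C * alpha + lam * (C / N) * 0.8   <span class="c"># one domain missed</span>
print(round(R_full, 3), round(R_miss, 3))  <span class="c"># 7.389 7.311</span></code></pre>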
</section>
<!-- 11. EVALUATION PROTOCOL & PROJECTED TARGETS -->
<section id="results">
<h2>Evaluation protocol & projected targets</h2>
<div class="notice">
<strong>No trained-policy numbers yet.</strong> Available compute was exhausted before GRPO could be run to convergence and validated at a scale we consider meaningful. The “Trained HF policy” row therefore stays unpopulated, and every number in the “Projected targets” table below is an <em>analytical</em> projection from the environment’s structure, not a measurement. GRPO runs against this Space will populate that row once the public training client is released and run at scale — see the canonical training-status notice under <a href="#architecture">Architecture & training pipeline</a>. No fabricated telemetry is shown.
</div>
<blockquote><strong>Purple baselines (AgentBeats).</strong> For deployed purple policies on the same benchmark lineage, see <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</blockquote>
<h3>Baselines</h3>
<p>The environment ships two scripted baselines out of the box (random and majority-class), plus an eval harness for zero-shot and trained LLMs. All of them run against the live OpenEnv WebSocket, so their numbers are directly comparable with a trained policy’s.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Baseline</th><th>Policy</th><th>What it isolates</th></tr></thead>
<tbody>
<tr><td><strong>Random MCQ</strong></td><td>Uniform sample over <code>observation.options</code></td><td>Lower bound; beats zero only if options are imbalanced</td></tr>
<tr><td><strong>Majority-class</strong></td><td>Always pick the per-<code>task_type</code> modal label from the bank</td><td>Isolates how much accuracy is available from priors alone</td></tr>
<tr><td><strong>Zero-shot API LLM</strong></td><td>GPT / Claude / Gemini via the eval harness</td><td>Strong “off-the-shelf” ceiling before any post-training</td></tr>
<tr><td><strong>Zero-shot local LLM</strong></td><td>Qwen2.5-7B-Instruct served via vLLM</td><td>Planned backbone for GRPO fine-tuning against this environment once training runs land</td></tr>
<tr><td><strong>Trained HF policy</strong></td><td>GRPO checkpoint from rollouts on this OpenEnv</td><td>Tests whether post-training on TS-Benchmark episodes beats zero-shot</td></tr>
</tbody>
</table>
</div>
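<p>The scripted baselines need no model at all. A sketch of the random-MCQ policy against the live socket (client construction as in the episode-loop sketch in <a href="#design" style="color:var(--accent2)">§ What TemporalBenchEnv is</a>):</p>
<pre><code>import random

def run_random_baseline(env, episodes: int = 100) -> float:
    <span class="c"># Mean episode return of uniform guessing over observation.options.</span>
    returns = []
    for _ in range(episodes):
        result, total = env.reset(), 0.0
        while not result.done:
            pick = random.choice(result.observation.options)
            result = env.step(TemporalBenchAction(answer=pick))
            total += result.reward or 0.0
        returns.append(total)
    return sum(returns) / len(returns)</code></pre>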
<h3>Projected targets</h3>
<p>The numbers below are <strong>analytical projections</strong> from the environment’s structure, not empirical results. They exist to anchor what “good” looks like once training runs land.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Metric</th><th>Random MCQ</th><th>Zero-shot LLM (expected)</th><th>Target for trained policy</th></tr></thead>
<tbody>
<tr><td>Per-step accuracy (T1U & T3, 3–4 options)</td><td class="num">≈ 0.25–0.33</td><td class="num">≫ 0.33</td><td class="num">≥ strong zero-shot</td></tr>
<tr><td>Per-step accuracy (T2_MCQ, 4 options)</td><td class="num">≈ 0.25</td><td class="num">> random</td><td class="num">≥ strong zero-shot</td></tr>
<tr><td>Coverage multiplier <code>m</code></td><td class="num">usually 0.8</td><td class="num">0.8–1.0</td><td class="num">1.0 consistently</td></tr>
<tr><td>Episode bonus <code>B</code> (λ<sub>ep</sub>=0.5, N=9)</td><td class="num">≈ 0.10</td><td class="num">0.20–0.35</td><td class="num">≥ 0.40</td></tr>
</tbody>
</table>
</div>
<h3>Current status</h3>
<p>What is already in place:</p>
<ul>
<li><strong>Full environment</strong>: <code>reset</code> / <code>step</code> / <code>state</code>, nine-question stratified sampling across four domains, per-step + terminal reward with coverage multiplier.</li>
<li><strong>Production banks</strong>: 2,775 <code>TSQuestion</code> records built from <code>TS-benchmark/task_merged_dev_with_labels_tiers.jsonl</code> via <code>TS-benchmark/scripts/build_temporal_bench_openenv_banks.py</code> and vendored under <code>openenv-ts/TemporalBenchEnv/data/banks/</code> (PSML 750 / freshretailnet 616 / MIMIC 709 / causal_chambers 700).</li>
<li><strong>Deployment path</strong>: <code>openenv.yaml</code>, Docker image (<code>server/Dockerfile</code>, <code>TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks</code>), <code>openenv validate</code> passes, <code>openenv push</code> ready for HF Space.</li>
<li><strong>Training scaffolding</strong>: the intended path mirrors <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a> (GRPO + TRL 1.0 <code>rollout_func</code> + vLLM). Training-client release status: see <a href="#architecture">Architecture</a>.</li>
<li><strong>Purple agents (AgentBeats)</strong>: live eval targets include <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>. Planned additions: see <a href="#purple-harnesses">§ Purple agent harnesses</a>.</li>
</ul>
</section>
<!-- 12. ENGINEERING LESSONS (INHERITED) -->
<section id="engineering">
<h2>Engineering lessons (inherited)</h2>
<p>Because TemporalBenchEnv intentionally rides on the same deployment + training pattern as <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> and <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank" style="color:var(--accent2)">ReasoningEconomicsEnv</a>, most of the hard infra lessons are <em>inherited</em> rather than rediscovered. We document them briefly so the next submitter does not need to learn them from scratch.</p>
<div class="table-wrap">
<table>
<thead><tr><th>Issue</th><th>Root cause</th><th>Fix (inherited)</th></tr></thead>
<tbody>
<tr>
<td><strong>NCCL desync</strong> under variable-length episodes</td>
<td>In <code>vllm_mode=server</code>, different DDP ranks make different numbers of <code>generate()</code> calls per episode → sequence-numbered NCCL collectives go out of sync.</td>
<td>Fixed-count <code>generate()</code> padding per episode; dummy generates discarded via <code>_temporary_vllm_max_tokens(..., 1)</code>. Gated on <code>world_size > 1</code>.</td>
</tr>
<tr>
<td><strong><code>max_completion_length</code></strong> drift over multi-turn rollouts</td>
<td>The rollout function appends per-turn generations + observation suffixes each step; a 9-turn MCQ episode can easily exceed the nominal completion budget.</td>
<td>Hard-cap <code>completion_ids</code> to <code>max_completion_length</code>; <code>PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True</code>.</td>
</tr>
<tr>
<td><strong>Null-safe MCQ parsing</strong></td>
<td>LLMs sometimes emit <code>{"answer": null}</code> or malformed strings; one rank crashing kills all DDP ranks via gloo cascade.</td>
<td>Inherited <code>_safe_float</code> / <code>_safe_int</code> / null-string fallbacks; fallback action rather than crash. The same regression pattern from <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)">LotteryElicitationPT</a> applies directly to MCQ labels (a toy version of the pattern follows this table).</td>
</tr>
<tr>
<td><strong><code>openenv validate</code> hygiene</strong></td>
<td>OpenEnv’s CLI does a naive substring check for <code>main()</code> in <code>server/app.py</code>; an entrypoint like <code>main(port=args.port)</code> fails validation.</td>
<td>Match the Lottery pattern: <code>if __name__ == "__main__": main()</code>, with flags parsed via <code>parse_known_args</code> inside <code>main()</code>.</td>
</tr>
</tbody>
</table>
</div>
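<p>To make the null-safe parsing row concrete, a toy version of the pattern in plain Python. The helper name <code>_safe_answer</code> and the fallback choice are illustrative; the inherited helpers in LotteryElicitationPT are <code>_safe_float</code> / <code>_safe_int</code>, and the real code applies the same idea to MCQ labels.</p>
<pre><code>import json

def _safe_answer(raw: str, options: list[str]) -> str:
    <span class="c"># Never let one rank crash on malformed model output: a single
    # raising rank kills every DDP rank via the gloo cascade.</span>
    try:
        payload = json.loads(raw)
        answer = payload.get("answer")
    except (json.JSONDecodeError, AttributeError, TypeError):
        answer = None
    <span class="c"># {"answer": null}, a wrong type, or an unknown label all fall back
    # to a deterministic default action instead of an exception.</span>
    if not isinstance(answer, str) or answer not in options:
        return options[0]
    return answer</code></pre>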
</section>
<!-- 13. FOUNDATIONS & CITATIONS -->
<section id="foundations">
<h2>Foundations & citations</h2>
<div class="table-wrap">
<table>
<thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead>
<tbody>
<tr><td><strong>TS-Benchmark (ours)</strong></td><td>Task taxonomy (T1/T2/T3/T4), per-dataset label construction, MCQ question shape</td><td><a href="https://openreview.net/forum?id=rp5qCsxXca&amp;noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">OpenReview</a> · in-repo: <code>TS-benchmark/TS-Benchmark.md</code></td></tr>
<tr><td><strong>Melady TS Green Agent</strong></td><td>A2A evaluator whose task set TemporalBenchEnv re-exposes as an OpenEnv environment</td><td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">AgentBeats</a> · <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank" style="color:var(--accent2)">GitHub repository</a></td></tr>
<tr><td><strong>AgentScope</strong></td><td>Purple-agent harness — ReAct / MCP / A2A framework wrapping the backbone we post-train (<a href="#purple-harnesses">live on AgentBeats</a>)</td><td>Gao et al., <a href="https://arxiv.org/abs/2508.16279" target="_blank" style="color:var(--accent2)">arXiv:2508.16279</a> (2025) · <a href="https://arxiv.org/abs/2402.14034" target="_blank" style="color:var(--accent2)">arXiv:2402.14034</a> (2024) · <a href="https://github.com/agentscope-ai/agentscope" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>CAMEL</strong></td><td>Purple-agent harness — role-playing multi-agent society, CAMEL-backed baseline on AgentBeats</td><td>Li et al., <a href="https://arxiv.org/abs/2303.17760" target="_blank" style="color:var(--accent2)">NeurIPS 2023 (arXiv:2303.17760)</a> · <a href="https://github.com/camel-ai/camel" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>MetaGPT</strong></td><td>Purple-agent harness (fast-follow-up) — SOP-driven multi-agent system, role-decomposition purple</td><td>Hong et al., <a href="https://arxiv.org/abs/2308.00352" target="_blank" style="color:var(--accent2)">ICLR 2024, oral (arXiv:2308.00352)</a> · <a href="https://github.com/FoundationAgents/MetaGPT" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>TimeSeriesScientist (TSci)</strong></td><td>Purple-agent harness (fast-follow-up) — TS-specialized Curator/Planner/Forecaster/Reporter agent; “ceiling” for agentic TS reasoning</td><td>Zhao et al., <a href="https://arxiv.org/abs/2510.01538" target="_blank" style="color:var(--accent2)">arXiv:2510.01538</a> (2025) · <a href="https://github.com/Y-Research-SBU/TimeSeriesScientist" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
<tr><td><strong>FreshRetailNet-50K</strong></td><td>Retail demand dataset; T1/T2/T3/T4 MCQ questions</td><td>Ding et al., <a href="https://arxiv.org/abs/2505.16319" target="_blank" style="color:var(--accent2)">arXiv:2505.16319</a>, 2025</td></tr>
<tr><td><strong>PSML</strong></td><td>Power-system load dataset; primary domain for the default episode</td><td><a href="https://www.nature.com/articles/s41597-022-01455-7" target="_blank" style="color:var(--accent2)">Nature Sci. Data 2022</a></td></tr>
<tr><td><strong>MIMIC-IV</strong></td><td>ICU/EHR time-series dataset; medical-domain MCQ</td><td><a href="https://www.nature.com/articles/s41597-022-01899-x" target="_blank" style="color:var(--accent2)">Nature Sci. Data 2022</a></td></tr>
<tr><td><strong>Causal Chambers</strong></td><td>Physical-testbed TS dataset; contextual T3 and wind-chamber T1U</td><td><a href="https://www.nature.com/articles/s42256-024-00964-x" target="_blank" style="color:var(--accent2)">Nature MI 2024</a></td></tr>
<tr><td><strong>TimeMaster / COUNTS / SenTSR-Bench</strong></td><td>Prior art on composite RL rewards for TS-LLMs; motivates T2/T4 forecasting future work</td><td><a href="https://arxiv.org/abs/2506.13705" target="_blank" style="color:var(--accent2)">TimeMaster (arXiv:2506.13705)</a> · <a href="https://arxiv.org/abs/2510.01116" target="_blank" style="color:var(--accent2)">COUNTS (arXiv:2510.01116)</a> · <a href="https://arxiv.org/abs/2602.19455" target="_blank" style="color:var(--accent2)">SenTSR-Bench (arXiv:2602.19455)</a></td></tr>
<tr><td><strong>OpenEnv</strong></td><td>Gym-style reset/step, WebSocket transport, HF Space deployment</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr>
<tr><td><strong>TRL + GRPO</strong></td><td>GRPOTrainer, custom <code>rollout_func</code>, remote env rollouts</td><td>Shao et al., <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a> (DeepSeekMath) · <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL × OpenEnv</a></td></tr>
<tr><td><strong>LotteryElicitationEnv / PT</strong></td><td>Sibling project — structural template for env / PT split, <code>rollout_func</code>, DDP padding, validation hygiene</td><td><a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">HF Space (Env)</a> · <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)">GitHub (PT)</a></td></tr>
</tbody>
</table>
</div>
</section>
<!-- 14. QUICK START -->
<section id="quickstart">
<h2>Quick start</h2>
<pre><code><span class="c"># 1. Install and run the env locally</span>
cd openenv-ts/TemporalBenchEnv
uv sync --extra dev
uv run pytest tests/ -q
uv run openenv validate
<span class="c"># 2. Run the server (uvicorn, port 8000)</span>
uv run server
<span class="c"># or</span>
uvicorn server.app:app --reload
<span class="c"># 3. Build & run the Docker image</span>
docker build -t temporalbenchenv:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 \
  -e TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks \
  temporalbenchenv:latest
<span class="c"># 4. Or pull / push a HF Space</span>
export ENV_BASE_URL="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv"
openenv push <span class="c"># from the TemporalBenchEnv/ directory</span>
<span class="c"># 5. Minimal client usage</span>
python - <<'PY'
from client import TemporalBenchAction, TemporalBenchEnvClient
with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    out = env.reset()
    while not out.done:
        q = out.observation
        out = env.step(TemporalBenchAction(answer=q.options[0]))
    print("total reward:", out.observation.reward)
PY
<span class="c"># 6. GRPO / TRL training against this Space (training client under test; not on GitHub yet)</span>
<span class="c"># Release is planned after additional compute for reasonable model validation.</span></code></pre>
<p><strong>Purple track (AgentBeats):</strong> <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> · <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</p>
<p>Banks are reproducible from <code>(env_seed, curriculum_stage, primary_domain)</code>. No external fixtures, no live API, no human labels — the same flat tiered JSONL that drives the Melady TS Green Agent drives this environment.</p>
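<p>That reproducibility is plain seed-tuple determinism. A minimal sketch of the idea, assuming the sampler derives its RNG by hashing the tuple — the hashing scheme and function name below are illustrative, not the environment's actual code:</p>
<pre><code>import hashlib
import random

def episode_rng(env_seed: int, curriculum_stage: int, primary_domain: str) -> random.Random:
    <span class="c"># Deterministic RNG from the episode-identifying tuple: the same tuple
    # always yields the same 6-primary + 3-cross-domain question draw,
    # with no external state.</span>
    key = f"{env_seed}|{curriculum_stage}|{primary_domain}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)

rng = episode_rng(0, 1, "PSML")   <span class="c"># same call → same episode, every time</span></code></pre>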
</section>
<div class="callout">
<div class="q">Can an LLM post-trained on our Green Agent’s own benchmark outperform zero-shot baselines at cross-domain TS reasoning?</div>
<div class="sub">The environment is built and deployed. Training-client release and empirical GRPO numbers are contingent on compute availability — see <a href="#architecture" style="color:var(--accent2)">Architecture</a>.</div>
<div class="sub" style="margin-top:.85rem;">On AgentBeats, compare purple baselines such as <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</div>
</div>
<!-- 15. FUTURE WORK -->
<section id="future">
<h2>Future work</h2>
<ul>
<li><strong>Run GRPO to convergence</strong> on Stage 1 → Stage 2 → Stage 3 curriculum against this environment and fill in the “Trained HF policy” column of the baselines table (blocked on releasing the training client; see Architecture). Relate outcomes to AgentBeats purple baselines (<a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">AgentScope</a> / <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">CAMEL</a>) and, once landed, the planned MetaGPT / TSci harnesses.</li>
<li><strong>T2 / T4 forecasting reward</strong> — replace the <code>compute_forecasting_reward</code> stub with the composite sketched in our v1 plan (normalized MSE + DTW-shape + direction + quantile + ArcTan-smoothed MAE). Opens T2 numeric and T4 contextual forecasting tasks without touching the step loop (a toy decomposition follows this list).</li>
<li><strong>Confidence-calibration reward</strong> — <code>TemporalBenchAction.confidence</code> already exists; shape rewards around Brier-style calibration so the policy is incentivized to <em>know when it knows</em>.</li>
<li><strong>CoT / reasoning reward</strong> — <code>TemporalBenchAction.reasoning</code> is captured per step but currently unscored; a light format-plus-consistency shaping mirrors the Lottery format-weight technique.</li>
<li><strong>Token-budget curriculum</strong> (ReasoningEconomics-style) — constrain per-episode reasoning length to force terseness at higher stages.</li>
<li><strong>Soft 60 / 40 task-type mix</strong> — enforce the v1-plan ratio of understanding (T1U+T3) to prediction (T2_MCQ) explicitly in <code>EpisodeSampler</code>; today it emerges from bank sizes.</li>
<li><strong>SFT warm-up</strong> on valid MCQ JSON before GRPO — skips the cold-start formatting phase we hit in Lottery.</li>
<li><strong>Human-subjects transfer</strong> — once the policy beats zero-shot on our bank, measure sim-to-real with held-out TS data outside the Green Agent’s training split.</li>
</ul>
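<p>A toy decomposition of the T2 / T4 composite from the first bullet, in pure NumPy. The five-term recipe (normalized MSE + DTW-shape + direction + quantile + ArcTan-smoothed MAE) is from the v1 plan; the equal weighting, the [0, 1] squashing of each term, and the omission of the DTW and quantile terms here are placeholder simplifications.</p>
<pre><code>import numpy as np

def compute_forecasting_reward(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    <span class="c"># Placeholder composite for the T2/T4 stub: each term maps an error
    # into [0, 1] before averaging; all constants are illustrative.</span>
    eps = 1e-8
    scale = np.std(y_true) + eps
    nmse = np.mean(((y_true - y_pred) / scale) ** 2)
    direction = np.mean(np.sign(np.diff(y_true)) == np.sign(np.diff(y_pred)))
    atan_mae = 1.0 - (2.0 / np.pi) * np.arctan(np.mean(np.abs(y_true - y_pred)) / scale)
    terms = [
        1.0 / (1.0 + nmse),   <span class="c"># normalized-MSE term</span>
        direction,            <span class="c"># directional-accuracy term</span>
        atan_mae,             <span class="c"># ArcTan-smoothed MAE term</span>
        <span class="c"># DTW-shape and quantile (pinball) terms omitted for brevity;
        # both would likewise be squashed into [0, 1].</span>
    ]
    return float(np.mean(terms))</code></pre>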
</section>
<!-- 16. CONCLUSION -->
<section>
<h2>Conclusion</h2>
<p><strong>TemporalBenchEnv</strong> is the training-side companion to our Melady TS Green Agent’s eval-side role. The Green Agent ranks purple agents on TS-Benchmark through the A2A protocol; TemporalBenchEnv re-exposes the same four-dataset MCQ suite as an OpenEnv-native sequential MDP, so an LLM can be <em>post-trained</em> on the very benchmark it will later be scored against.</p>
<p>Every design choice — 9-question episodes with 6 primary + 3 cross-domain, per-step correctness + terminal bonus with a domain-coverage multiplier, stagewise curriculum T1U → +T3 → +T2_MCQ, and a strict separation of env server and trainer over WebSocket — is aimed at preserving the verifiability that made the Green Agent’s signal trustworthy, while turning it into gradient. The infrastructure contributions are mostly <em>inherited</em> from our Lottery / Reasoning siblings; the novelty is the domain binding and the Green-Agent → OpenEnv pipeline.</p>
<p>The research question is open: <em>can a GRPO-trained LLM, post-trained on our own benchmark, outperform strong zero-shot baselines at cross-domain TS reasoning?</em> The environment, banks, and deployment path are shipped; full training results await the public training client and the compute budget described under Architecture. The purple track stays live on AgentBeats via <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</p>
</section>
<div class="footer">
<p>TemporalBenchEnv · Melady / AgentX OpenEnv Track · USC</p>
<p style="margin-top:.5rem;">
<a href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank">GitHub</a> ·
<a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">HF Space</a> ·
<a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank">Melady TS Green Agent</a> ·
<a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank">Green Agent GitHub</a> ·
<a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> ·
<a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL × OpenEnv</a>
</p>
</div>
</div>
</body>
</html>