<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>TemporalBenchEnv: An OpenEnv Benchmark for Post-training LLMs on Multi-step Time-Series Reasoning</title>
<meta name="description" content="TemporalBenchEnv: an OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning, extending our Melady TS Green Agent submission with a verifiable-reward GRPO training path.">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
<!-- Mermaid for inline diagrams -->
<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({
    startOnLoad: true,
    theme: 'dark',
    themeVariables: {
      primaryColor: '#6366f1',
      primaryTextColor: '#e2e8f0',
      primaryBorderColor: '#818cf8',
      lineColor: '#818cf8',
      secondaryColor: '#1e293b',
      tertiaryColor: '#172033',
      background: '#0f172a',
      mainBkg: '#1e293b',
      nodeBorder: '#818cf8',
      clusterBkg: '#172033',
      clusterBorder: '#334155',
      titleColor: '#e2e8f0',
      edgeLabelBackground: '#1e293b',
      nodeTextColor: '#e2e8f0'
    },
    flowchart: { curve: 'basis', htmlLabels: true },
    fontFamily: 'Inter, sans-serif'
  });
</script>
<style>
  :root {
    --bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155;
    --text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1;
    --accent2: #818cf8; --green: #22c55e; --red: #ef4444;
    --orange: #f59e0b; --radius: 12px;
  }
  * { margin: 0; padding: 0; box-sizing: border-box; }
  html { scroll-behavior: smooth; }
  body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
         background: var(--bg); color: var(--text); line-height: 1.7;
         -webkit-font-smoothing: antialiased; }
  .container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; }
  /* Top nav */
  .topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85);
            backdrop-filter: blur(10px); border-bottom: 1px solid var(--border);
            padding: .9rem 1.5rem; display: flex; justify-content: space-between;
            align-items: center; font-size: .88rem; }
  .topnav .brand { font-weight: 700; color: var(--text); text-decoration: none;
                   display: flex; align-items: center; gap: .5rem; }
  .topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%;
                        background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); }
  .topnav .links { display: flex; gap: 1.25rem; }
  .topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; }
  .topnav .links a:hover { color: var(--accent2); }
  /* Hero */
  .hero { text-align: center; padding: 4rem 0 2.5rem; }
  .hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2);
                padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600;
                letter-spacing: .08em; margin-bottom: 1.25rem;
                border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; }
  .hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em;
             line-height: 1.15;
             background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%);
             -webkit-background-clip: text; -webkit-text-fill-color: transparent;
             background-clip: text; }
  .hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 640px;
                    margin: 1rem auto 0; }
  .hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem;
                  font-style: italic; }
  .banner { width: 100%; border-radius: var(--radius); margin: 2rem 0 3rem;
            border: 1px solid var(--border); }
  /* Badges row */
  .badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap;
            margin: 1.5rem 0; }
  .badges img { height: 22px; }
  /* Button group */
  .btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0;
               flex-wrap: wrap; }
  .btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem;
         background: var(--accent); color: white; border-radius: 8px; font-size: .88rem;
         font-weight: 600; text-decoration: none; transition: all .2s; }
  .btn:hover { background: var(--accent2); transform: translateY(-1px); }
  .btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
  .btn-outline:hover { border-color: var(--accent); color: var(--accent2);
                       background: rgba(99,102,241,.08); }
  /* TOC */
  .toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius);
         padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; }
  .toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase;
            color: var(--accent2); margin-bottom: .85rem; }
  .toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem;
            margin: 0; padding: 0; }
  .toc ol li { counter-increment: toc; font-size: .88rem; }
  .toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700;
                       font-size: .8rem; margin-right: .3rem; }
  .toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; }
  .toc ol li a:hover { color: var(--accent2); }
  /* Sections */
  section { margin: 3.5rem 0; }
  section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em;
               margin-bottom: 1rem; color: var(--text);
               border-left: 3px solid var(--accent); padding-left: .9rem; }
  section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem;
               color: var(--accent2); }
  section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; }
  section p strong { color: var(--text); }
  section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; }
  section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; }
  section ul li strong, section ol li strong { color: var(--text); }
  /* Pull-quote */
  blockquote { border-left: 3px solid var(--accent2);
               background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem;
               margin: 1.5rem 0; border-radius: 0 8px 8px 0;
               color: #e2e8f0; font-size: 1.02rem; }
  /* Tables */
  .table-wrap { margin: 1.5rem 0; overflow-x: auto;
                background: var(--surface); border: 1px solid var(--border);
                border-radius: var(--radius); }
  table { width: 100%; border-collapse: collapse; font-size: .92rem; }
  th { background: rgba(99,102,241,.1); color: var(--accent2);
       font-size: .72rem; font-weight: 700; letter-spacing: .06em;
       text-transform: uppercase; padding: .85rem 1rem; text-align: left; }
  td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; }
  td.num { text-align: right; font-variant-numeric: tabular-nums;
           font-family: 'JetBrains Mono', monospace; font-size: .88rem; }
  tr:hover td { background: rgba(99,102,241,.04); }
  td strong, th strong { color: var(--text); }
  .task-id { font-family: 'JetBrains Mono', monospace; font-weight: 700;
             color: var(--accent2); font-size: .85rem; }
  tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700;
                  color: var(--text); }
  tr.novel td:first-child { color: #fca5a5; }
  /* Code */
  pre { background: #0b1120; border: 1px solid var(--border);
        border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto;
        margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace;
        font-size: .85rem; line-height: 1.6; color: #d1d5db; }
  pre .c { color: #64748b; }
  code { font-family: 'JetBrains Mono', monospace; font-size: .88em;
         background: rgba(99,102,241,.12); color: var(--accent2);
         padding: .1em .35em; border-radius: 4px; }
  pre code { background: none; color: inherit; padding: 0; font-size: 1em; }
  /* Figure */
  figure { margin: 2rem 0; }
  figure img { width: 100%; border-radius: var(--radius);
               border: 1px solid var(--border); }
  figcaption { text-align: center; color: var(--muted); font-size: .85rem;
               margin-top: .75rem; }
  /* Mermaid diagram wrapper */
  .mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border);
                  border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; }
  .mermaid-wrap .mermaid { display: flex; justify-content: center; }
  .mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem;
                     margin-top: .75rem; }
  /* Episode trace */
  .episode-trace { background: var(--surface); border: 1px solid var(--border);
                   border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0;
                   position: relative; }
  .episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem;
                           bottom: 1.25rem; width: 2px; background: var(--border); }
  .trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; }
  .trace-step:last-child { margin-bottom: 0; }
  .trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px;
                             height: 12px; border-radius: 50%; border: 2px solid var(--accent);
                             background: var(--bg); z-index: 1; }
  .trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); }
  .trace-step .step-marker.good { background: var(--green); border-color: var(--green); }
  .trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem;
                            color: var(--accent2); font-weight: 700; margin-bottom: .25rem; }
  .trace-step .step-content { font-size: .9rem; color: #cbd5e1; }
  .trace-step .step-content code { font-size: .82em; }
  .trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px;
                   font-size: .9rem; font-weight: 600; }
  .trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3);
                       color: #fca5a5; }
  .trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3);
                        color: #86efac; }
  /* Callout for the closing question */
  .callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0;
             background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04));
             border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); }
  .callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text);
                font-style: italic; margin-bottom: .5rem; }
  .callout .sub { color: var(--muted); font-size: .95rem; }
  /* Notice panel (for "empirical results pending" etc.) */
  .notice { padding: 1.1rem 1.25rem; margin: 1.5rem 0;
            background: rgba(245,158,11,.08); border: 1px solid rgba(245,158,11,.3);
            border-radius: var(--radius); color: #fde68a; font-size: .95rem; }
  .notice strong { color: #fbbf24; }
  /* Footer */
  .footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted);
            font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; }
  .footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; }
  .footer a:hover { text-decoration: underline; }
  @media (max-width: 640px) {
    .container { padding: 1rem 1rem 3rem; }
    .hero { padding: 2.5rem 0 1.5rem; }
    .topnav .links { display: none; }
    section h2 { font-size: 1.3rem; }
    table { font-size: .82rem; }
    th, td { padding: .55rem .6rem; }
    .toc ol { flex-direction: column; }
    .episode-trace { padding: 1rem; }
    .episode-trace::before { left: 1rem; }
  }
  /* MathJax display blocks */
  .math-display {
    margin: 1.25rem 0;
    padding: 1rem 1.25rem 1.15rem;
    overflow-x: auto;
    background: var(--surface);
    border: 1px solid var(--border);
    border-radius: var(--radius);
    text-align: center;
  }
  .math-display mjx-container[jax="CHTML"][display="true"] { margin: 0.65em 0 !important; }
  .math-display mjx-container { color: #e2e8f0 !important; }
  .math-note { font-size: .9rem; color: var(--muted); margin-top: .35rem; margin-bottom: 0; }
</style>
<!-- MathJax 3: client-side TeX like static HF Spaces (sdk: static) serve plain HTML. -->
<script>
window.MathJax = {
  tex: {
    inlineMath: [['\\(', '\\)']],
    displayMath: [['\\[', '\\]']]
  },
  options: {
    renderActions: {
      addMenu: [0, '', '']
    }
  }
};
</script>
<script defer src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" id="MathJax-script"></script>
</head>
<body>

<nav class="topnav">
  <a href="#top" class="brand"><span class="dot"></span> TemporalBenchEnv Blog</a>
  <div class="links">
    <a href="#why">Why</a>
    <a href="#green-agent">Green Agent</a>
    <a href="#design">Design</a>
    <a href="#scoring">Scoring</a>
    <a href="#traces">Traces</a>
    <a href="#results">Results</a>
    <a href="#engineering">Engineering</a>
    <a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">Live Space &#x2197;</a>
  </div>
</nav>

<div class="container" id="top">

  <div class="hero">
    <div class="hero-badge">OpenEnv &middot; Extension of Melady TS Green Agent</div>
    <h1>TemporalBenchEnv</h1>
    <p class="subtitle">An OpenEnv-native multi-step MCQ environment for post-training LLMs on time-series reasoning &mdash; built on the four datasets from our Melady TS Green Agent submission.</p>
    <div class="badges">
      <a href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"/></a>
      <a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a>
      <img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
      <img src="https://img.shields.io/badge/Datasets-4-brightgreen" alt="4 Datasets"/>
      <img src="https://img.shields.io/badge/Episodes-9%20MCQ-blueviolet" alt="9-step MCQ"/>
      <img src="https://img.shields.io/badge/Training-GRPO%20(planned)-orange" alt="GRPO (planned)"/>
    </div>
    <div class="byline">Melady / AgentX OpenEnv Track &nbsp;|&nbsp; Muyan Weng (USC) &middot; Defu Cao (USC) &middot; Yashaswi Sharma (USC) &middot; Yan Liu (USC)</div>
  </div>

  <img src="banner.png" alt="TemporalBenchEnv: question &rarr; answer &rarr; score across four time-series datasets" class="banner"/>

  <div class="btn-group">
    <a class="btn" href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">Live Environment Space &rarr;</a>
    <a class="btn btn-outline" href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank">GitHub Repo</a>
  </div>

  <!-- Table of Contents -->
  <nav class="toc" id="toc">
    <h3>Contents</h3>
    <ol>
      <li><a href="#why">Multi-step TS Reasoning as a Verifiable Env</a></li>
      <li><a href="#green-agent">Extension of Our Green Agent Submission</a></li>
      <li><a href="#purple-harnesses">Purple Agent Harnesses (Eval Targets)</a></li>
      <li><a href="#matters">Why This Benchmark Matters</a></li>
      <li><a href="#prior-work">Prior Work &amp; Novelty</a></li>
      <li><a href="#design">What TemporalBenchEnv Is</a></li>
      <li><a href="#env-design">Environment Design</a></li>
      <li><a href="#openenv">Why OpenEnv</a></li>
      <li><a href="#scoring">Scoring: Per-step + Episode Bonus</a></li>
      <li><a href="#architecture">Architecture &amp; Training Pipeline</a></li>
      <li><a href="#traces">Episode Trace (Ideal, Illustrative)</a></li>
      <li><a href="#results">Results: Baselines &amp; Projected Targets</a></li>
      <li><a href="#engineering">Engineering Lessons (Inherited)</a></li>
      <li><a href="#foundations">Foundations &amp; Citations</a></li>
      <li><a href="#quickstart">Quick Start</a></li>
      <li><a href="#future">Future Work</a></li>
    </ol>
  </nav>

  <!-- 1. WHY -->
  <section id="why">
    <h2>Multi-step TS reasoning as a verifiable environment</h2>
    <p>Most time-series LLM benchmarks grade a single prompt at a time. <strong>TemporalBenchEnv</strong> grades an <strong>episode</strong>: nine multiple-choice questions drawn from four time-series datasets, answered one per step, with a terminal bonus that rewards both accuracy and <strong>cross-domain coverage</strong>.</p>
    <!-- TODO: Replace OpenReview link with arXiv link when available -->
    <p>Every reward signal here is <strong>ground-truth arithmetic</strong>, not a judge. Labels are produced by the <a href="https://openreview.net/forum?id=rp5qCsxXca&noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">TS-Benchmark</a> construction pipeline (trend / volatility / seasonality / outlier thresholds; S1&ndash;S5 family rules), so the environment can score an answer with a normalized string match against the stored ground truth. There is no LLM judge in the loop.</p>
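    <p>Because grading is pure string arithmetic, the core check fits in a few lines. The sketch below is illustrative, not the repo's actual scorer; the <code>normalize</code> and <code>score_step</code> names and the exact normalization rules are assumptions.</p>

```python
def normalize(answer: str) -> str:
    """Collapse case, surrounding whitespace, and trailing punctuation
    so that answers like ' B) ' and 'b' compare equal."""
    return answer.strip().strip(".)").lower()

def score_step(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 on a normalized exact match, else 0.0 -- no LLM judge."""
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0
```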
    <p>The falsifiable hypothesis this environment is built to test: <em>whether a GRPO-trained LLM, post-trained on sequential episodes sampled from our Green Agent&rsquo;s own benchmark, outperforms strong zero-shot baselines on per-domain MCQ accuracy while hitting the cross-domain coverage bonus.</em> Empirical adjudication is contingent on the training runs described under <a href="#architecture" style="color:var(--accent2)">Architecture &amp; training pipeline</a>.</p>
  </section>

  <!-- 2. EXTENSION OF GREEN AGENT -->
  <section id="green-agent">
    <h2>Extension of our Melady TS Green Agent submission</h2>
    <p>TemporalBenchEnv is a <strong>direct extension</strong> of our AgentBeats <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)"><strong>Melady TS Green Agent</strong></a> submission. The Green Agent is an A2A-protocol evaluator that grades purple agents on 764 TS-Benchmark tasks (see the <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank" style="color:var(--accent2)">Green Agent GitHub repository</a>). TemporalBenchEnv takes the <em>same datasets and task taxonomy</em> and re-exposes them as a sequential <strong>OpenEnv</strong> environment consumable by TRL&rsquo;s <code>rollout_func</code> &mdash; turning the benchmark into a <strong>training target</strong>.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Artifact</th><th>Melady TS Green Agent</th><th>TemporalBenchEnv (this submission)</th></tr></thead>
        <tbody>
          <tr><td><strong>Role</strong></td><td>A2A evaluator of purple agents</td><td>OpenEnv RL environment for post-training LLMs</td></tr>
          <tr><td><strong>Datasets</strong></td><td>PSML, freshretailnet, MIMIC, causal_chambers</td><td>Same four datasets</td></tr>
          <tr><td><strong>Tasks</strong></td><td>T1/T3 (accuracy) + T2/T4 (regression + MCQ), 764 total</td><td>MCQ subset: T1U, T3, T2_MCQ &mdash; 2,775 <code>TSQuestion</code> records</td></tr>
          <tr><td><strong>Per-domain bank sizes</strong></td><td>Packaged in the Docker image</td><td>PSML 750 &middot; freshretailnet 616 &middot; MIMIC 709 &middot; causal_chambers 700</td></tr>
          <tr><td><strong>Protocol</strong></td><td>A2A messaging, one-shot prompts</td><td>WebSocket OpenEnv contract, sequential 9-step MDP</td></tr>
          <tr><td><strong>Reward</strong></td><td>MSE / MAE / RMSE / MASE / accuracy (eval metrics)</td><td>Per-step correctness + terminal episode bonus w/ coverage multiplier</td></tr>
          <tr><td><strong>Consumer</strong></td><td>AgentBeats leaderboard</td><td>TRL 1.0 <code>rollout_func</code>, vLLM colocate / server, GRPO</td></tr>
        </tbody>
      </table>
    </div>
    <p>The ETL from the Green Agent&rsquo;s labeled JSONL into the per-domain <code>TSQuestion</code> banks the environment consumes lives in <code>TS-benchmark/scripts/build_temporal_bench_openenv_banks.py</code>; banks ship at <code>openenv-ts/TemporalBenchEnv/data/banks/</code> and are loaded via the <code>TEMPORALBENCH_QUESTION_BANK_DIR</code> environment variable.</p>
    <blockquote>The Green Agent answered: <em>which purple agent is best at TS reasoning right now?</em> &nbsp;TemporalBenchEnv answers: <em>can we post-train an LLM, on that exact benchmark, to be the next best purple agent?</em></blockquote>
  </section>

  <!-- 2b. PURPLE AGENT HARNESSES -->
  <section id="purple-harnesses">
    <h2>Purple agent harnesses: evaluating mainstream TS-capable agent stacks</h2>
    <p>The Green Agent scores <em>purple agents</em> over the A2A protocol &mdash; any stack that speaks A2A is a valid participant. To make the benchmark diagnostic for modern agentic time-series practice, we target four of the most popular open-source agent harnesses as purple agents on AgentBeats: <strong>two are implemented and live</strong> (AgentScope, CAMEL) and <strong>two are planned</strong> (MetaGPT, TimeSeriesScientist). Each harness stays <em>unchanged internally</em>; a thin A2A adapter feeds the Green Agent&rsquo;s TS-Benchmark MCQs into the harness&rsquo;s own reasoning / tool-use loop and returns a final label. These four frameworks are the de facto &ldquo;agentic harnesses&rdquo; TS practitioners actually wrap around LLMs today, so they anchor the eval to real downstream usage.</p>

    <div class="notice">
      <strong>The feedback loop we are instrumenting.</strong> For every harness we expose a swappable backbone LLM (e.g. Qwen2.5-7B-Instruct or GPT-4o-mini). The plan is: <strong>(1)</strong> score the base backbone inside each harness via the <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">Melady TS Green Agent</a>, <strong>(2)</strong> <strong>post-train that same backbone with this OpenEnv</strong> &mdash; TemporalBenchEnv&rsquo;s randomized-domain / randomized-task 9-step MCQ episodes with per-step + terminal verifiable rewards, <strong>(3)</strong> <strong>re-score every harness</strong> with the post-trained backbone in place of the base, and <strong>(4)</strong> attribute any delta specifically to the RL post-training rather than to harness architecture. The panel makes the research question concrete: <em>does randomized-domain verifiable-reward post-training on TemporalBenchEnv actually transfer to agentic TS reasoning under mainstream orchestration frameworks?</em>
    </div>

    <h3>Implemented &mdash; live purple agents on AgentBeats</h3>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Harness</th><th>What it is</th><th>Why it matters for TS reasoning</th><th>AgentBeats listing</th></tr></thead>
        <tbody>
          <tr>
            <td><strong><a href="https://github.com/agentscope-ai/agentscope" target="_blank" style="color:var(--accent2)">AgentScope</a></strong><br><span style="font-size:.85em;color:var(--muted)">Gao et al., <a href="https://arxiv.org/abs/2508.16279" target="_blank" style="color:var(--accent2)">&ldquo;AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications&rdquo; (arXiv:2508.16279, 2025)</a>; Gao et al., <a href="https://arxiv.org/abs/2402.14034" target="_blank" style="color:var(--accent2)">&ldquo;AgentScope: A Flexible yet Robust Multi-Agent Platform&rdquo; (arXiv:2402.14034, 2024)</a></span></td>
            <td>Production-ready ReAct agent framework with built-in tools, memory, planning, MCP / A2A interop, and an agentic-RL tuner (Trinity-RFT). Apache-2.0.</td>
            <td>The most common Python-native ReAct / tool-use stack; A2A support makes the Green Agent wiring mechanical. Serves as our canonical &ldquo;single-ReAct-loop&rdquo; purple baseline over TS MCQs.</td>
            <td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a></td>
          </tr>
          <tr>
            <td><strong><a href="https://github.com/camel-ai/camel" target="_blank" style="color:var(--accent2)">CAMEL</a></strong><br><span style="font-size:.85em;color:var(--muted)">Li et al., <a href="https://arxiv.org/abs/2303.17760" target="_blank" style="color:var(--accent2)">&ldquo;CAMEL: Communicative Agents for &lsquo;Mind&rsquo; Exploration of Large Language Model Society&rdquo; (NeurIPS 2023, arXiv:2303.17760)</a></span></td>
            <td>Role-playing multi-agent framework with stateful memory, structured messages, societies / workforce pipelines, and a strong focus on <em>scaling laws of agents</em>.</td>
            <td>Role-play + critic loop tests whether a trend / seasonality / outlier answer survives multi-turn discussion without drifting; complements AgentScope&rsquo;s single-ReAct shape with an inter-agent communication surface.</td>
            <td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a></td>
          </tr>
        </tbody>
      </table>
    </div>

    <h3>Planned &mdash; purple agents under development</h3>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Harness</th><th>What it is</th><th>Why it matters for TS reasoning</th><th>Status</th></tr></thead>
        <tbody>
          <tr>
            <td><strong><a href="https://github.com/FoundationAgents/MetaGPT" target="_blank" style="color:var(--accent2)">MetaGPT</a></strong><br><span style="font-size:.85em;color:var(--muted)">Hong et al., <a href="https://arxiv.org/abs/2308.00352" target="_blank" style="color:var(--accent2)">&ldquo;MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework&rdquo; (ICLR 2024, oral; arXiv:2308.00352)</a></span></td>
            <td>SOP-driven multi-agent system &mdash; explicit role decomposition (product manager / architect / engineer / QA) orchestrated by a meta-programming layer. One of the most-starred multi-agent frameworks in the wild.</td>
            <td>Gives us a decomposition-heavy purple: a &ldquo;data analyst&rdquo; + &ldquo;forecaster&rdquo; / &ldquo;reviewer&rdquo; role split answering the same MCQs, isolating whether explicit SOPs help TS reasoning vs. a flat ReAct loop.</td>
            <td><strong>Planned.</strong> A2A wrapper not yet published on AgentBeats.</td>
          </tr>
          <tr>
            <td><strong><a href="https://github.com/Y-Research-SBU/TimeSeriesScientist" target="_blank" style="color:var(--accent2)">TimeSeriesScientist (TSci)</a></strong><br><span style="font-size:.85em;color:var(--muted)">Zhao et al., <a href="https://arxiv.org/abs/2510.01538" target="_blank" style="color:var(--accent2)">&ldquo;TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis&rdquo; (arXiv:2510.01538, 2025)</a></span></td>
            <td>Domain-specific LangGraph agent purpose-built for TS: <em>Curator &rarr; Planner &rarr; Forecaster &rarr; Reporter</em>, with statistical / ML / DL model selection and ensembling (ARIMA, Prophet, LSTM, XGBoost, Transformer, &hellip;).</td>
            <td>The strongest TS-specialized agent we can find; a natural &ldquo;ceiling&rdquo; for agentic TS reasoning and a direct yardstick for whether an RL-post-trained generalist LLM inside a simpler harness closes the gap to a purpose-built TS pipeline.</td>
            <td><strong>Planned.</strong> A2A adapter not yet published on AgentBeats.</td>
          </tr>
        </tbody>
      </table>
    </div>

    <blockquote><strong>Green Agent + TemporalBenchEnv + harness panel = a closed evaluation loop.</strong> The Green Agent scores every harness on a fixed TS-Benchmark MCQ set; TemporalBenchEnv uses the <em>same datasets and task taxonomy</em> to post-train a backbone with randomized-domain / randomized-task episodes under verifiable rewards; we then drop the post-trained backbone back into AgentScope, CAMEL, MetaGPT, and TSci and re-score. A delta from the RL post-training should show up as a <em>uniform</em> lift across all four harnesses &mdash; not just in one framework, and not just in the zero-shot MCQ row of the baselines table.</blockquote>
  </section>

  <!-- 3. WHY IT MATTERS -->
  <section id="matters">
    <h2>Why this benchmark matters</h2>
    <p>Time-series reasoning is one of the few areas where frontier LLMs still look visibly weak, but also one where the <strong>verifiable signal is clean</strong>: trend / volatility / seasonality / outlier labels are constructed from thresholded statistics of the series, so exact-match grading is unambiguous. That makes MCQ episodes a near-ideal RL target before touching noisier numeric forecasting rewards.</p>
    <p>The design is <strong>transferable</strong>. Any benchmark that produces labeled MCQ records over a set of domains &mdash; medical diagnostics, power-grid anomaly tagging, retail demand regimes &mdash; fits the same 9-question cross-domain template. Datasets are the proxy; the capability is <strong>multi-step, multi-domain, verifiable TS reasoning</strong>.</p>
    <p>Every reward component is <strong>ground-truth arithmetic</strong>. The environment samples questions from pre-built banks, scores answers via normalized string equality, and aggregates with a closed-form episode bonus. No LLM judge, no circular reward.</p>
  </section>

  <!-- 4. PRIOR WORK & NOVELTY -->
  <section id="prior-work">
    <h2>Prior work &amp; novelty</h2>
    <p>Prior &ldquo;LLMs + time series&rdquo; work lands in one of three buckets. None occupies the cell we target:</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Prior work bucket</th><th>What it does</th><th>What it does not</th></tr></thead>
        <tbody>
          <tr><td><strong>Static TS benchmarks</strong><br><span style="font-size:.85em;color:var(--muted)">TS-Benchmark (ours, <a href="https://openreview.net/forum?id=rp5qCsxXca&noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">OpenReview</a>); FreshRetailNet (<a href="https://arxiv.org/abs/2505.16319" target="_blank" style="color:var(--accent2)">arXiv:2505.16319</a>); PSML (<a href="https://www.nature.com/articles/s41597-022-01455-7" target="_blank" style="color:var(--accent2)">Nat. Sci. Data 2022</a>); MIMIC-IV (<a href="https://www.nature.com/articles/s41597-022-01899-x" target="_blank" style="color:var(--accent2)">Nat. Sci. Data 2022</a>); Causal Chambers (<a href="https://www.nature.com/articles/s42256-024-00964-x" target="_blank" style="color:var(--accent2)">Nat. MI 2024</a>)</span></td><td>Construct labeled MCQ / regression tasks over TS datasets, graded by fixed rules</td><td>No RL-native environment contract, no sequential episodes, no post-training loop</td></tr>
          <tr><td><strong>TS-LLM composite RL rewards</strong><br><span style="font-size:.85em;color:var(--muted)">TimeMaster (<a href="https://arxiv.org/abs/2506.13705" target="_blank" style="color:var(--accent2)">arXiv:2506.13705</a>); COUNTS (<a href="https://arxiv.org/abs/2510.01116" target="_blank" style="color:var(--accent2)">arXiv:2510.01116</a>); SenTSR-Bench (<a href="https://arxiv.org/abs/2602.19455" target="_blank" style="color:var(--accent2)">arXiv:2602.19455</a>)</span></td><td>Composite reward shaping (MSE + DTW + direction + quantile + ArcTan) for TS multimodal LLMs</td><td>No OpenEnv contract, no sequential MCQ episodes, reward is forecasting-centric</td></tr>
          <tr><td><strong>A2A evaluators</strong><br><span style="font-size:.85em;color:var(--muted)">Melady TS Green Agent (<a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">AgentBeats</a>); other AgentBeats green agents</span></td><td>Score deployed purple agents on the benchmark through the A2A protocol</td><td>Not a training environment; no per-step RL signal, no environment state</td></tr>
          <tr class="novel"><td><strong>TemporalBenchEnv (ours)</strong></td><td>Sequential 9-step MCQ MDP over four TS datasets, structured Pydantic actions, terminal reward from ground-truth labels + coverage multiplier, OpenEnv + GRPO contract</td><td>Does not (yet) train numeric forecasting &mdash; T2/T4 reward is stubbed for future work</td></tr>
        </tbody>
      </table>
    </div>
    <blockquote>To our knowledge, no prior work exposes TS-Benchmark&rsquo;s multi-dataset MCQ suite as an <strong>OpenEnv-native sequential MDP with verifiable terminal rewards</strong> suitable for GRPO post-training. The action space, reward decomposition, and sibling-env architecture follow the <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> lineage; the domain and task taxonomy follow our own Melady TS Green Agent.</blockquote>
  </section>

  <!-- 5. WHAT IT IS -->
  <section id="design">
    <h2>What TemporalBenchEnv is</h2>
    <blockquote>An OpenEnv-native sequential MDP in which an LLM agent answers nine MCQ questions &mdash; six from a primary domain and one each from the other three &mdash; earning per-step correctness and a terminal bonus that rewards cross-domain coverage.</blockquote>
    <p>Each episode proceeds like this:</p>
    <ul>
      <li><code>reset()</code> samples nine <code>TSQuestion</code> records: <strong>6 from the primary domain</strong> (default PSML), with T3 families round-robined for diversity, and <strong>1 from each of the three non-primary domains</strong>. Final order is shuffled.</li>
      <li>On turn <em>t</em>, the agent emits a <code>TemporalBenchAction</code> containing the MCQ label, plus optional <code>confidence</code> and <code>reasoning</code> fields.</li>
      <li>The environment <strong>grades</strong> the answer via normalized string equality against <code>question.answer</code> (also accepting an option whose normalized text matches the ground truth), and returns the next question.</li>
      <li>Per-step reward is <code>alpha * correctness</code>. Mid-episode bonuses are zero.</li>
      <li>On the final step, the environment adds the <strong>terminal episode bonus</strong>: <code>lambda_ep * (total_correct / N) * coverage_multiplier</code>, where the multiplier is <code>1.0</code> if every one of the four domains contributed at least one correct answer, else <code>0.8</code>.</li>
    </ul>
    <p>The agent&rsquo;s interface is deliberately minimal: a single answer string per step, no tool-call protocol. Optional <code>confidence</code> and <code>reasoning</code> fields exist on the action for future reward shaping.</p>
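    <p>The episode loop above can be sketched in-process. This is an illustrative stand-in, not the real client: the actual agent speaks to the Space over OpenEnv&rsquo;s WebSocket via <code>EnvClient</code>, and the stub below hard-codes a toy bank and omits the coverage multiplier.</p>

```python
# Toy stand-in for the nine-step MCQ episode loop described above.
# The real env samples stratified questions across four domains and adds a
# coverage-aware terminal bonus; this stub only mirrors the reset/step/reward shape.
BANK = [
    {"question": f"Q{i}", "options": ["upward", "downward", "constant"], "answer": "upward"}
    for i in range(9)
]

class StubEnv:
    def reset(self) -> dict:
        self.idx, self.correct = 0, 0
        q = BANK[self.idx]
        return {"question": q["question"], "options": q["options"], "done": False}

    def step(self, answer: str) -> dict:
        q = BANK[self.idx]
        hit = answer.strip().lower() == q["answer"]
        self.correct += int(hit)
        self.idx += 1
        reward = 1.0 if hit else 0.0                    # alpha * correctness
        if self.idx == len(BANK):                       # terminal step
            reward += 0.5 * (self.correct / len(BANK))  # bonus (coverage m omitted here)
            return {"done": True, "reward": reward}
        nxt = BANK[self.idx]
        return {"question": nxt["question"], "options": nxt["options"],
                "done": False, "reward": reward}

def run_episode(env, policy) -> float:
    obs, total = env.reset(), 0.0
    while not obs["done"]:
        obs = env.step(policy(obs))
        total += obs["reward"]
    return total

total = run_episode(StubEnv(), lambda obs: obs["options"][0])  # always answers "upward"
print(total)  # 9.5 = 9 correct steps + 0.5 * (9/9) bonus
```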
  </section>

  <!-- 6. ENVIRONMENT DESIGN -->
  <section id="env-design">
    <h2>Environment design</h2>
    <p>The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (see <code>openenv-ts/TemporalBenchEnv/env/models.py</code>):</p>
    <pre><code><span class="c"># Action (agent &rarr; env)</span>
class TemporalBenchAction(Action):
    answer: str                         <span class="c"># MCQ label matching an option</span>
    confidence: Optional[float]         <span class="c"># in [0, 1], unused in reward for now</span>
    reasoning: Optional[str]            <span class="c"># optional CoT, unused in reward for now</span>

<span class="c"># Observation (env &rarr; agent)</span>
class TemporalBenchObservation(Observation):
    step_idx, steps_remaining, max_steps: int
    question: str                       <span class="c"># current MCQ prompt</span>
    options: list[str]                  <span class="c"># 2+ answer choices</span>
    task_type: str                      <span class="c"># "T1U" | "T3" | "T2_MCQ"</span>
    dataset: str                        <span class="c"># "PSML" | "freshretailnet" | "MIMIC" | "causal_chambers"</span>
    history: list[dict]                 <span class="c"># [{question, answer, correct, dataset, ...}, ...]</span>
    accuracy_so_far: float
    done: bool; reward: Optional[float]; metadata: dict

<span class="c"># State (serializable snapshot)</span>
class TemporalBenchState(State):
    episode_id: Optional[str]
    step_count, total_correct, total_questions: int
    current_accuracy: float
    primary_domain: str
    per_task_type_accuracy: dict[str, float]
    total_reward: float</code></pre>

    <p>Four datasets, three MCQ task types, and a three-stage curriculum shape the training distribution (see <code>env/config.py</code> and <code>env/episode_sampler.py</code>):</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Stage</th><th>Allowed task types</th><th>Purpose</th></tr></thead>
        <tbody>
          <tr><td><strong>Stage 1</strong></td><td><code>T1U</code> only (non-contextual understanding MCQ)</td><td>Shorten credit assignment; learn trend / volatility / seasonality / outliers first</td></tr>
          <tr><td><strong>Stage 2</strong></td><td><code>T1U</code> + <code>T3</code> (contextual understanding, S1&ndash;S5 families)</td><td>Add context-conditioned reasoning; maintain verifiable labels</td></tr>
          <tr><td><strong>Stage 3</strong></td><td><code>T1U</code> + <code>T3</code> + <code>T2_MCQ</code> (prediction-as-classification)</td><td>Full MCQ track; adds direction-of-change / volatility-change / seasonality-alignment</td></tr>
        </tbody>
      </table>
    </div>
    <p>Curriculum is honored both at <code>EnvConfig(curriculum_stage=...)</code> construction and at <code>env.reset(curriculum_stage=...)</code>, so a single server can serve multiple stages to different sessions concurrently.</p>
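    <p>A minimal sketch of that dual entry point, with illustrative names (the real gating lives in <code>env/config.py</code> and <code>env/episode_sampler.py</code>):</p>

```python
from typing import Optional

# Stage -> allowed task types, mirroring the curriculum table above.
STAGE_TASK_TYPES = {
    1: ["T1U"],
    2: ["T1U", "T3"],
    3: ["T1U", "T3", "T2_MCQ"],
}

class CurriculumGate:
    """Illustrative stand-in for the EnvConfig / reset() stage handling."""

    def __init__(self, curriculum_stage: int = 1):
        self.stage = curriculum_stage  # construction-time default

    def allowed(self, curriculum_stage: Optional[int] = None) -> list:
        # reset(curriculum_stage=...) overrides the constructor default, so one
        # server can serve different stages to concurrent sessions.
        stage = curriculum_stage if curriculum_stage is not None else self.stage
        return STAGE_TASK_TYPES[stage]

gate = CurriculumGate(curriculum_stage=1)
print(gate.allowed())                    # ['T1U']
print(gate.allowed(curriculum_stage=3))  # ['T1U', 'T3', 'T2_MCQ']
```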
  </section>

  <!-- 7. WHY OPENENV -->
  <section id="openenv">
    <h2>Why OpenEnv</h2>
    <p>OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL&rsquo;s <code>rollout_func</code>, (2) per-session state with <code>max_concurrent_envs=64</code> in our <code>create_app</code> factory &mdash; each WebSocket session gets a fresh <code>TemporalBenchEnvironment</code> via <code>_env_factory</code> so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development (<code>server/Dockerfile</code>, with <code>TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks</code>), and as a Hugging Face Space during training and evaluation.</p>
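    <p>The per-session isolation in (2) reduces to a factory closure: each new WebSocket session calls the factory and receives its own environment object. A minimal sketch, with illustrative class internals:</p>

```python
# Each WebSocket session gets a fresh environment instance, so concurrent
# DDP ranks hammering the same Space never share mutable episode state.
class TemporalBenchEnvironment:
    def __init__(self):
        self.step_count = 0    # session-local state
        self.total_correct = 0

def _env_factory() -> TemporalBenchEnvironment:
    # Called once per session by the app factory (bounded by max_concurrent_envs).
    return TemporalBenchEnvironment()

a, b = _env_factory(), _env_factory()
a.step_count = 5
print(b.step_count)  # 0 -- no cross-talk between sessions
```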
    <p>No new abstractions were invented. Base types only: <code>EnvClient</code>, <code>Environment</code>, Pydantic <code>Action</code> / <code>Observation</code> / <code>State</code>. All extensions (history, per-domain coverage, per-task accuracy) ride on <code>metadata</code> or the serialized state. No new method signatures, no fork. The env ships with <code>openenv.yaml</code> and a <code>Dockerfile</code>, and passes <code>uv run openenv validate</code>.</p>
    <p class="math-note"><strong>Hygiene note:</strong> OpenEnv&rsquo;s CLI validator does a naive substring check for <code>main()</code> in <code>server/app.py</code>. We match the reference pattern from <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> &mdash; an explicit <code>main()</code> call under <code>if __name__ == "__main__"</code> with CLI flags parsed via <code>parse_known_args</code>.</p>
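    <p>A sketch of that entrypoint shape (stdlib <code>argparse</code> is real; the server start is commented out, and the empty argv list is only so the sketch runs standalone):</p>

```python
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
    # parse_known_args tolerates extra flags injected by the Space runtime;
    # [] here is only so the sketch is self-contained (the real entrypoint
    # parses sys.argv).
    args, _unknown = parser.parse_known_args([])
    # uvicorn.run(app, host="0.0.0.0", port=args.port)  # real server start
    return args.port

if __name__ == "__main__":  # the substring the CLI validator looks for
    main()
```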
  </section>

  <!-- 8. SCORING -->
  <section id="scoring">
    <h2>Scoring: per-step correctness + episode bonus</h2>
    <p>Reward is decomposed into a per-step term and a terminal bonus (see <code>env/reward.py</code>).</p>
    <div class="math-display" aria-label="Per-step correctness reward">
\[
r_t \;=\; \alpha \cdot \mathbf{1}\!\left[\hat a_t = a_t^\ast\right]
\]
    </div>
    <p class="math-note">Here \(\hat a_t\) is the agent&rsquo;s submitted label and \(a_t^\ast\) is the stored ground truth, compared under normalized string equality (also accepting an option whose normalized text matches).</p>
    <div class="math-display" aria-label="Terminal episode bonus with coverage multiplier">
\[
B \;=\; \lambda_{\mathrm{ep}} \,\cdot\, \frac{C}{N} \,\cdot\, m,
\qquad
m \;=\; \begin{cases} 1.0 & \text{all 4 domains have } \ge 1 \text{ correct} \\ 0.8 & \text{otherwise} \end{cases}
\]
\[
R \;=\; \sum_{t=1}^{N} r_t \;+\; B
\]
    </div>
    <p class="math-note">\(C\) is the total correct count in the episode, \(N = 9\) is the episode length, \(\alpha, \lambda_{\mathrm{ep}}\) are <code>alpha</code> and <code>lambda_ep</code> in <code>EnvConfig</code>.</p>
    <p>Defaults live in <code>EnvConfig</code>:</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Component</th><th>Weight</th><th>What it rewards</th></tr></thead>
        <tbody>
          <tr><td><strong>Per-step correctness</strong></td><td class="num">&alpha; = 1.0</td><td>Normalized-string match against the MCQ ground truth</td></tr>
          <tr><td><strong>Episode bonus weight</strong></td><td class="num">&lambda;<sub>ep</sub> = 0.5</td><td>Scales the terminal accuracy&times;coverage term</td></tr>
          <tr><td><strong>Coverage multiplier</strong></td><td class="num">{0.8, 1.0}</td><td>1.0 iff every domain in <code>EnvConfig.all_domains</code> has &ge;1 correct answer this episode</td></tr>
          <tr><td><strong>Forecasting reward</strong></td><td class="num">&mdash;</td><td>Stubbed (<code>compute_forecasting_reward</code> raises <code>NotImplementedError</code>); future work</td></tr>
        </tbody>
      </table>
    </div>
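    <p>The full decomposition fits in a few lines. This is a sketch of the arithmetic as specified above, not a copy of <code>env/reward.py</code>; the helper names are illustrative.</p>

```python
ALL_DOMAINS = {"PSML", "freshretailnet", "MIMIC", "causal_chambers"}

def normalize(s: str) -> str:
    return " ".join(s.strip().lower().split())

def grade_answer(answer: str, ground_truth: str) -> bool:
    # normalized string equality; the env additionally accepts an option
    # whose normalized text matches the ground truth
    return normalize(answer) == normalize(ground_truth)

def episode_return(correct_flags, domains, alpha=1.0, lambda_ep=0.5) -> float:
    per_step = sum(alpha * float(c) for c in correct_flags)            # sum of r_t
    covered = {d for d, c in zip(domains, correct_flags) if c}
    m = 1.0 if covered == ALL_DOMAINS else 0.8                         # coverage multiplier
    bonus = lambda_ep * (sum(correct_flags) / len(correct_flags)) * m  # B
    return per_step + bonus                                            # R

# 7-of-9 episode in which every domain contributes at least one correct answer
flags = [1, 1, 1, 0, 1, 0, 1, 1, 1]
domains = ["PSML", "PSML", "freshretailnet", "PSML", "MIMIC",
           "PSML", "causal_chambers", "PSML", "PSML"]
print(round(episode_return(flags, domains), 3))  # 7.389
```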
    <blockquote><strong>Why a coverage multiplier:</strong> per-step accuracy alone lets the agent ace the six primary questions while guessing on the other three domains, collapsing to a single-domain policy. The <code>0.8</code> penalty forces the policy to treat the three cross-domain questions as first-class signal &mdash; the very thing that distinguishes a TS-generalist from a PSML-only memorizer.</blockquote>
  </section>

  <!-- 9. ARCHITECTURE & TRAINING PIPELINE -->
  <section id="architecture">
    <h2>Architecture &amp; training pipeline</h2>
    <p>Following the <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> / <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank" style="color:var(--accent2)">ReasoningEconomicsEnv</a> lineage, <strong>TemporalBenchEnv</strong> (this page) ships the OpenEnv side: reset, step, rewards, and question banks over WebSocket. A separate trainer process &mdash; same env/trainer separation as <code>LotteryElicitationEnv</code> and <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a> &mdash; would drive GRPO with TRL&rsquo;s <code>rollout_func</code> and vLLM against that socket, without in-process imports of env-side types.</p>

    <div class="notice">
      <strong>Training client status.</strong> The companion GRPO / TRL package that runs rollouts against this environment is under <strong>active internal testing</strong>. We will release it on <strong>GitHub</strong> once we secure <strong>additional compute</strong> so we can stress-test and validate models at a scale we consider reasonable. This blog documents the shipped <strong>TemporalBenchEnv</strong> only until that release.
    </div>

    <blockquote><strong>Purple agents on AgentBeats.</strong> The same TS-Benchmark task surface is also exercised by <strong>purple agents</strong> scored through the A2A green agent: live listings are <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>. Planned additions are documented in <a href="#purple-harnesses" style="color:var(--accent2)">&sect;&nbsp;Purple agent harnesses</a>.</blockquote>

    <div class="mermaid-wrap">
      <pre class="mermaid">
flowchart LR
    subgraph TRN ["Trainer (TRL + GRPO, WebSocket client)"]
        GRPO["GRPOTrainer<br/>TRL 1.0"]
        RF["rollout_func"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["action_parser<br/>MCQ label guardrails"]
    end
    subgraph ENV ["TemporalBenchEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        SAMP["EpisodeSampler<br/>6+1+1+1 stratified"]
        GRADE["grade_answer<br/>normalized match"]
        REW["Reward<br/>per-step + bonus + coverage"]
    end

    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"answer string"| WS
    WS --> SAMP
    SAMP -->|"next question"| WS
    WS --> GRADE
    GRADE --> REW
    REW -->|"step + terminal reward"| WS
    WS -->|"observation"| RF
      </pre>
      <p class="mermaid-caption">Figure 1. System architecture. The trainer never imports env-side types &mdash; everything crosses the WebSocket, exactly like our Lottery / Reasoning sibling envs.</p>
    </div>

    <p>Training uses <strong>GRPO</strong> (Group Relative Policy Optimization), which is a natural fit for per-step verifiable rewards with an additive terminal bonus. The training scaffolding is directly inherited from <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a>: TRL 1.0&rsquo;s <code>rollout_func</code> contract, vLLM colocate/server, chat-template tokenization with <code>enable_thinking=False</code>, think-block stripping, null-safe MCQ-label parsing, and episode logging to <code>reward_logs.jsonl</code>.</p>
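    <p>Two of the inherited guardrails fit in a short sketch: think-block stripping and null-safe MCQ-label parsing. The helper below is illustrative, not the shipped parser; it snaps any salvageable answer to a known option and otherwise falls back rather than crashing a rank.</p>

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_mcq_answer(raw: str, options: list) -> str:
    """Strip think blocks, tolerate nulls/garbage, snap to a known option."""
    text = THINK_RE.sub("", raw).strip()
    try:
        answer = json.loads(text).get("answer")
    except (json.JSONDecodeError, AttributeError):
        answer = text                  # treat raw text as the answer
    if not isinstance(answer, str) or not answer.strip():
        return options[0]              # fallback action, never a crash
    norm = answer.strip().lower()
    for opt in options:
        if opt.strip().lower() == norm:
            return opt
    return options[0]

print(parse_mcq_answer('<think>rising tail</think>{"answer": "Upward"}',
                       ["upward", "downward", "constant"]))           # upward
print(parse_mcq_answer('{"answer": null}', ["upward", "downward"]))   # upward
```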
  </section>

  <!-- 10. EPISODE TRACE (IDEAL, ILLUSTRATIVE) -->
  <section id="traces">
    <h2>Episode trace (ideal, illustrative)</h2>
    <p>Here is what a high-reward episode <em>would</em> look like &mdash; five representative steps out of nine, spanning all four domains. Primary domain is PSML; the trace shows the three cross-domain picks plus two primary turns, culminating in the terminal step where the coverage multiplier decides the shape of the bonus. <strong>Turns 2, 4, 6, 8 are elided</strong> (all PSML T1U; two of the four are assumed correct, which yields the 7 / 9 total) so the walkthrough stays focused on the cross-domain structure.</p>

    <div class="notice">
      <strong>Illustrative, not captured.</strong> This is a hand-constructed walkthrough intended to explain the per-step reward, the 6 + 3 domain split, and the terminal coverage multiplier &mdash; <em>not</em> a real rollout from a trained (or even zero-shot) policy against the live environment. The prompts and agent answers below are author-written. The reward arithmetic (<code>r_t</code>, <code>B</code>, <code>R</code>) is computed exactly as <code>env/reward.py</code> would compute it for the stated correctness pattern. A captured trace from a real model will replace this once the training client is released and run at scale (see <a href="#architecture" style="color:var(--accent2)">Architecture &amp; training pipeline</a>).
    </div>

    <div class="episode-trace">
      <div class="trace-step">
        <div class="step-marker good"></div>
        <div class="step-label">Turn 1 &middot; dataset=PSML &middot; task=T1U:trend</div>
        <div class="step-content">
          <strong>Prompt:</strong> <em>&ldquo;Based on the array (length=336), report trend: upward / downward / constant.&rdquo;</em><br>
          Series rises monotonically over the tail window.<br>
          <strong>Agent answer:</strong> <code>"upward"</code> &nbsp;&rarr;&nbsp; <strong>correct</strong> (<code>r_1 = 1.0</code>)
        </div>
      </div>
      <div class="trace-step">
        <div class="step-marker good"></div>
        <div class="step-label">Turn 3 &middot; dataset=freshretailnet &middot; task=T1U:seasonality</div>
        <div class="step-content">
          <strong>Prompt:</strong> daily demand for a fresh-retail SKU, peaks repeat every 7 steps with stable amplitude.<br>
          <strong>Agent answer:</strong> <code>"fixed"</code> &nbsp;&rarr;&nbsp; <strong>correct</strong> (<code>r_3 = 1.0</code>)
        </div>
      </div>
      <div class="trace-step">
        <div class="step-marker good"></div>
        <div class="step-label">Turn 5 &middot; dataset=MIMIC &middot; task=T1U:outliers</div>
        <div class="step-content">
          <strong>Prompt:</strong> ICU vital trace with a single spike above <code>q95 + 3&middot;MAD</code>.<br>
          <strong>Agent answer:</strong> <code>"sudden_spike"</code> &nbsp;&rarr;&nbsp; <strong>correct</strong> (<code>r_5 = 1.0</code>)
        </div>
      </div>
      <div class="trace-step">
        <div class="step-marker good"></div>
        <div class="step-label">Turn 7 &middot; dataset=causal_chambers &middot; task=T3:S2</div>
        <div class="step-content">
          <strong>Prompt:</strong> wind-chamber actuator trace, contextual question on lagged response to a step input (capability <code>C4</code>).<br>
          <strong>Agent answer:</strong> <code>"delayed_response"</code> &nbsp;&rarr;&nbsp; <strong>correct</strong> (<code>r_7 = 1.0</code>)
        </div>
      </div>
      <div class="trace-step">
        <div class="step-marker good"></div>
        <div class="step-label">Turn 9 &middot; dataset=PSML &middot; task=T2_MCQ (terminal)</div>
        <div class="step-content">
          <strong>Prompt:</strong> &ldquo;Median demand level change (forecast horizon vs history)?&rdquo;<br>
          <strong>Agent answer:</strong> <code>"Higher"</code> &nbsp;&rarr;&nbsp; <strong>correct</strong> (<code>r_9 = 1.0</code>)<br>
          <strong>Episode totals:</strong> <code>C = 7 / 9</code> correct, all four domains covered &rarr; <code>m = 1.0</code>.<br>
          <strong>Terminal bonus:</strong> <code>B = 0.5 &middot; (7/9) &middot; 1.0 &asymp; 0.389</code>.<br>
          <strong>Total return:</strong> <code>R &asymp; 7.389</code>.
        </div>
      </div>
      <div class="trace-verdict good">
        Four-of-four coverage &rarr; full <strong>m = 1.0</strong> multiplier. Contrast: the same 7 / 9 accuracy with a missed cross-domain question (say, zero correct in MIMIC) would give <code>B &asymp; 0.311</code> and <code>R &asymp; 7.311</code>. The coverage term is the whole reason a PSML-only policy loses to a generalist.
      </div>
    </div>
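    <p>The verdict&rsquo;s arithmetic checks out in two lines of the same closed form (<code>&lambda;<sub>ep</sub> = 0.5</code>, <code>N = 9</code>):</p>

```python
def bonus(correct: int, n: int, m: float, lambda_ep: float = 0.5) -> float:
    # terminal bonus B = lambda_ep * (C / N) * m
    return lambda_ep * (correct / n) * m

full    = bonus(7, 9, m=1.0)   # all four domains covered
partial = bonus(7, 9, m=0.8)   # e.g. zero correct in MIMIC

print(round(full, 3), round(7 + full, 3))        # 0.389 7.389
print(round(partial, 3), round(7 + partial, 3))  # 0.311 7.311
```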
  </section>

  <!-- 11. EVALUATION PROTOCOL & PROJECTED TARGETS -->
  <section id="results">
    <h2>Evaluation protocol &amp; projected targets</h2>

    <div class="notice">
      <strong>No trained-policy numbers yet.</strong> Available compute was exhausted before GRPO could be run to convergence and validated at a scale we consider meaningful. The &ldquo;Trained HF policy&rdquo; column therefore stays blank, and every number in the &ldquo;Projected targets&rdquo; table below is an <em>analytical</em> projection from the environment&rsquo;s structure, not a measurement. GRPO runs against this Space will populate the empty column once the public training client is released and run at scale &mdash; see the canonical training-status notice under <a href="#architecture">Architecture &amp; training pipeline</a>. No fabricated telemetry is shown.
    </div>

    <blockquote><strong>Purple baselines (AgentBeats).</strong> For deployed purple policies on the same benchmark lineage, see <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</blockquote>

    <h3>Baselines</h3>
    <p>The environment ships two scripted baselines out of the box (random and majority-class), plus an eval harness for zero-shot and trained LLMs. All run against the live OpenEnv WebSocket, so their numbers are directly comparable with any eventual trained policy.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Baseline</th><th>Policy</th><th>What it isolates</th></tr></thead>
        <tbody>
          <tr><td><strong>Random MCQ</strong></td><td>Uniform sample over <code>observation.options</code></td><td>Lower bound; beats zero only if options are imbalanced</td></tr>
          <tr><td><strong>Majority-class</strong></td><td>Always pick the per-<code>task_type</code> modal label from the bank</td><td>Isolates how much accuracy is available from priors alone</td></tr>
          <tr><td><strong>Zero-shot API LLM</strong></td><td>GPT / Claude / Gemini via the eval harness</td><td>Strong &ldquo;off-the-shelf&rdquo; ceiling before any post-training</td></tr>
          <tr><td><strong>Zero-shot local LLM</strong></td><td>Qwen2.5-7B-Instruct served via vLLM</td><td>Planned backbone for GRPO fine-tuning against this environment once training runs land</td></tr>
          <tr><td><strong>Trained HF policy</strong></td><td>GRPO checkpoint from rollouts on this OpenEnv</td><td>Tests whether post-training on TS-Benchmark episodes beats zero-shot</td></tr>
        </tbody>
      </table>
    </div>
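    <p>The two scripted baselines are a few lines each. These are sketches with illustrative observation/bank shapes; the shipped implementations read the same fields off <code>TemporalBenchObservation</code> and the question banks.</p>

```python
import random
from collections import Counter

def random_mcq_policy(observation: dict, rng: random.Random) -> str:
    # lower bound: uniform over the presented options
    return rng.choice(observation["options"])

def majority_policy(observation: dict, bank: list) -> str:
    # priors-only: always the modal ground-truth label for this task_type
    labels = [q["answer"] for q in bank if q["task_type"] == observation["task_type"]]
    return Counter(labels).most_common(1)[0][0]

bank = [{"task_type": "T1U", "answer": "upward"},
        {"task_type": "T1U", "answer": "upward"},
        {"task_type": "T1U", "answer": "downward"}]
obs = {"task_type": "T1U", "options": ["upward", "downward", "constant"]}
print(majority_policy(obs, bank))                                   # upward
print(random_mcq_policy(obs, random.Random(0)) in obs["options"])   # True
```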

    <h3>Projected targets</h3>
    <p>The numbers below are <strong>analytical projections</strong> from the environment&rsquo;s structure, not empirical results. They exist to anchor what &ldquo;good&rdquo; looks like once training runs land.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Metric</th><th>Random MCQ</th><th>Zero-shot LLM (expected)</th><th>Target for trained policy</th></tr></thead>
        <tbody>
          <tr><td>Per-step accuracy (T1U &amp; T3, 3&ndash;4 options)</td><td class="num">&asymp; 0.25&ndash;0.33</td><td class="num">&gt;&gt; 0.33</td><td class="num">&ge; strong zero-shot</td></tr>
          <tr><td>Per-step accuracy (T2_MCQ, 4 options)</td><td class="num">&asymp; 0.25</td><td class="num">&gt; random</td><td class="num">&ge; strong zero-shot</td></tr>
          <tr><td>Coverage multiplier <code>m</code></td><td class="num">usually 0.8</td><td class="num">0.8&ndash;1.0</td><td class="num">1.0 consistently</td></tr>
          <tr><td>Episode bonus <code>B</code> (&lambda;<sub>ep</sub>=0.5, N=9)</td><td class="num">&asymp; 0.10</td><td class="num">0.20&ndash;0.35</td><td class="num">&ge; 0.40</td></tr>
        </tbody>
      </table>
    </div>
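    <p>The bonus row is easy to sanity-check analytically. Under the closed form from the scoring section (<code>&lambda;<sub>ep</sub> = 0.5</code>, <code>N = 9</code>), the <code>&ge; 0.40</code> target implies at least 8 / 9 correct with full coverage:</p>

```python
def bonus(correct: int, n: int = 9, m: float = 1.0, lambda_ep: float = 0.5) -> float:
    return lambda_ep * (correct / n) * m

# correct-counts that clear the trained-policy target B >= 0.40
print([c for c in range(10) if bonus(c, m=1.0) >= 0.40])  # [8, 9]
print([c for c in range(10) if bonus(c, m=0.8) >= 0.40])  # [9]  (a coverage miss costs ~a question)
# upper end of random play: 3/9 correct without full coverage
print(round(bonus(3, m=0.8), 3))  # 0.133
```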
    <h3>Current status</h3>
    <p>What is already in place:</p>
    <ul>
      <li><strong>Full environment</strong>: <code>reset</code> / <code>step</code> / <code>state</code>, nine-question stratified sampling across four domains, per-step + terminal reward with coverage multiplier.</li>
      <li><strong>Production banks</strong>: 2,775 <code>TSQuestion</code> records built from <code>TS-benchmark/task_merged_dev_with_labels_tiers.jsonl</code> via <code>TS-benchmark/scripts/build_temporal_bench_openenv_banks.py</code> and vendored under <code>openenv-ts/TemporalBenchEnv/data/banks/</code> (PSML 750 / freshretailnet 616 / MIMIC 709 / causal_chambers 700).</li>
      <li><strong>Deployment path</strong>: <code>openenv.yaml</code>, Docker image (<code>server/Dockerfile</code>, <code>TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks</code>), <code>openenv validate</code> passes, <code>openenv push</code> ready for HF Space.</li>
      <li><strong>Training scaffolding</strong>: the intended path mirrors <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)"><code>LotteryElicitationPT</code></a> (GRPO + TRL 1.0 <code>rollout_func</code> + vLLM). Training-client release status: see <a href="#architecture">Architecture</a>.</li>
      <li><strong>Purple agents (AgentBeats)</strong>: live eval targets include <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>. Planned additions: see <a href="#purple-harnesses">&sect; Purple agent harnesses</a>.</li>
    </ul>
  </section>

  <!-- 12. ENGINEERING LESSONS (INHERITED) -->
  <section id="engineering">
    <h2>Engineering lessons (inherited)</h2>
    <p>Because TemporalBenchEnv intentionally rides on the same deployment + training pattern as <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv</a> and <a href="https://huggingface.co/spaces/yashu2000/reasoning-economic-env" target="_blank" style="color:var(--accent2)">ReasoningEconomicsEnv</a>, most of the hard infra lessons are <em>inherited</em> rather than rediscovered. We document them briefly so the next submitter does not need to learn them from scratch.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Issue</th><th>Root cause</th><th>Fix (inherited)</th></tr></thead>
        <tbody>
          <tr>
            <td><strong>NCCL desync</strong> under variable-length episodes</td>
            <td>In <code>vllm_mode=server</code>, different DDP ranks make different numbers of <code>generate()</code> calls per episode &rarr; sequence-numbered NCCL collectives go out of sync.</td>
            <td>Fixed-count <code>generate()</code> padding per episode; dummy generates discarded via <code>_temporary_vllm_max_tokens(..., 1)</code>. Gated on <code>world_size &gt; 1</code>.</td>
          </tr>
          <tr>
            <td><strong><code>max_completion_length</code></strong> drift over multi-turn rollouts</td>
            <td>The rollout function appends per-turn generations + observation suffixes each step; a 9-turn MCQ episode can easily exceed the nominal completion budget.</td>
            <td>Hard-cap <code>completion_ids</code> to <code>max_completion_length</code>; <code>PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True</code>.</td>
          </tr>
          <tr>
            <td><strong>Null-safe MCQ parsing</strong></td>
            <td>LLMs sometimes emit <code>{"answer": null}</code> or malformed strings; one rank crashing kills all DDP ranks via gloo cascade.</td>
            <td>Inherited <code>_safe_float</code> / <code>_safe_int</code> / null-string fallbacks; fallback action rather than crash. The same regression pattern from <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)">LotteryElicitationPT</a> applies directly to MCQ labels.</td>
          </tr>
          <tr>
            <td><strong><code>openenv validate</code> hygiene</strong></td>
            <td>OpenEnv&rsquo;s CLI does a naive substring check for <code>main()</code> in <code>server/app.py</code>; an entrypoint like <code>main(port=args.port)</code> fails validation.</td>
            <td>Match the Lottery pattern: <code>if __name__ == "__main__": main()</code>, with flags parsed via <code>parse_known_args</code> inside <code>main()</code>.</td>
          </tr>
        </tbody>
      </table>
    </div>
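<p>The null-safe parsing row above can be illustrated with a minimal sketch. This is an assumption about the shape of the inherited helpers, not the exact implementation: the names <code>_safe_int</code> and <code>parse_mcq_answer</code>, the fallback-to-option-0 convention, and the clamping are all illustrative.</p>

```python
import json

def _safe_int(value, default=0):
    """Coerce a possibly-null / malformed value to int, never raising."""
    try:
        if value is None:
            return default
        return int(float(str(value).strip()))
    except (ValueError, TypeError):
        return default

def parse_mcq_answer(raw: str, n_options: int) -> int:
    """Parse a model completion into an option index. Malformed JSON or
    a null answer falls back to index 0 instead of crashing the rank
    (one crashed rank would take down all DDP ranks via the gloo cascade)."""
    try:
        payload = json.loads(raw)
        idx = _safe_int(payload.get("answer"), default=0)
    except (json.JSONDecodeError, AttributeError):
        idx = 0
    # Clamp into the valid option range rather than raising on out-of-range labels.
    return min(max(idx, 0), n_options - 1)
```

<p>The key property is that every code path returns a legal action, so a single badly formatted completion costs one question's reward rather than the whole training run.</p>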
  </section>

  <!-- 13. FOUNDATIONS & CITATIONS -->
  <section id="foundations">
    <h2>Foundations &amp; citations</h2>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead>
        <tbody>
          <tr><td><strong>TS-Benchmark (ours)</strong></td><td>Task taxonomy (T1/T2/T3/T4), per-dataset label construction, MCQ question shape</td><td><a href="https://openreview.net/forum?id=rp5qCsxXca&noteId=jYZcZ4GI0y" target="_blank" style="color:var(--accent2)">OpenReview</a> &middot; in-repo: <code>TS-benchmark/TS-Benchmark.md</code></td></tr>
          <tr><td><strong>Melady TS Green Agent</strong></td><td>A2A evaluator whose task set TemporalBenchEnv re-exposes as an OpenEnv environment</td><td><a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank" style="color:var(--accent2)">AgentBeats</a> &middot; <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank" style="color:var(--accent2)">GitHub repository</a></td></tr>
          <tr><td><strong>AgentScope</strong></td><td>Purple-agent harness &mdash; ReAct / MCP / A2A framework wrapping the backbone we post-train (<a href="#purple-harnesses">live on AgentBeats</a>)</td><td>Gao et al., <a href="https://arxiv.org/abs/2508.16279" target="_blank" style="color:var(--accent2)">arXiv:2508.16279</a> (2025) &middot; <a href="https://arxiv.org/abs/2402.14034" target="_blank" style="color:var(--accent2)">arXiv:2402.14034</a> (2024) &middot; <a href="https://github.com/agentscope-ai/agentscope" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
          <tr><td><strong>CAMEL</strong></td><td>Purple-agent harness &mdash; role-playing multi-agent society, CAMEL-backed baseline on AgentBeats</td><td>Li et al., <a href="https://arxiv.org/abs/2303.17760" target="_blank" style="color:var(--accent2)">NeurIPS 2023 (arXiv:2303.17760)</a> &middot; <a href="https://github.com/camel-ai/camel" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
          <tr><td><strong>MetaGPT</strong></td><td>Purple-agent harness (fast-follow-up) &mdash; SOP-driven multi-agent system, role-decomposition purple</td><td>Hong et al., <a href="https://arxiv.org/abs/2308.00352" target="_blank" style="color:var(--accent2)">ICLR 2024, oral (arXiv:2308.00352)</a> &middot; <a href="https://github.com/FoundationAgents/MetaGPT" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
          <tr><td><strong>TimeSeriesScientist (TSci)</strong></td><td>Purple-agent harness (fast-follow-up) &mdash; TS-specialized Curator/Planner/Forecaster/Reporter agent; &ldquo;ceiling&rdquo; for agentic TS reasoning</td><td>Zhao et al., <a href="https://arxiv.org/abs/2510.01538" target="_blank" style="color:var(--accent2)">arXiv:2510.01538</a> (2025) &middot; <a href="https://github.com/Y-Research-SBU/TimeSeriesScientist" target="_blank" style="color:var(--accent2)">GitHub</a></td></tr>
          <tr><td><strong>FreshRetailNet-50K</strong></td><td>Retail demand dataset; T1/T2/T3/T4 MCQ questions</td><td>Ding et al., <a href="https://arxiv.org/abs/2505.16319" target="_blank" style="color:var(--accent2)">arXiv:2505.16319</a> (2025)</td></tr>
          <tr><td><strong>PSML</strong></td><td>Power-system load dataset; primary domain for the default episode</td><td><a href="https://www.nature.com/articles/s41597-022-01455-7" target="_blank" style="color:var(--accent2)">Nature Sci. Data 2022</a></td></tr>
          <tr><td><strong>MIMIC-IV</strong></td><td>ICU/EHR time-series dataset; medical-domain MCQ</td><td><a href="https://www.nature.com/articles/s41597-022-01899-x" target="_blank" style="color:var(--accent2)">Nature Sci. Data 2022</a></td></tr>
          <tr><td><strong>Causal Chambers</strong></td><td>Physical-testbed TS dataset; contextual T3 and wind-chamber T1U</td><td><a href="https://www.nature.com/articles/s42256-024-00964-x" target="_blank" style="color:var(--accent2)">Nature MI 2024</a></td></tr>
          <tr><td><strong>TimeMaster / COUNTS / SenTSR-Bench</strong></td><td>Prior art on composite RL rewards for TS-LLMs; motivates T2/T4 forecasting future work</td><td><a href="https://arxiv.org/abs/2506.13705" target="_blank" style="color:var(--accent2)">TimeMaster (arXiv:2506.13705)</a> &middot; <a href="https://arxiv.org/abs/2510.01116" target="_blank" style="color:var(--accent2)">COUNTS (arXiv:2510.01116)</a> &middot; <a href="https://arxiv.org/abs/2602.19455" target="_blank" style="color:var(--accent2)">SenTSR-Bench (arXiv:2602.19455)</a></td></tr>
          <tr><td><strong>OpenEnv</strong></td><td>Gym-style reset/step, WebSocket transport, HF Space deployment</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr>
          <tr><td><strong>TRL + GRPO</strong></td><td>GRPOTrainer, custom <code>rollout_func</code>, remote env rollouts</td><td>Shao et al., <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a> (DeepSeekMath) &middot; <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL &times; OpenEnv</a></td></tr>
          <tr><td><strong>LotteryElicitationEnv / PT</strong></td><td>Sibling project &mdash; structural template for env / PT split, <code>rollout_func</code>, DDP padding, validation hygiene</td><td><a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">HF Space (Env)</a> &middot; <a href="https://github.com/sharma-yash01/LotteryElicitationPT" target="_blank" style="color:var(--accent2)">GitHub (PT)</a></td></tr>
        </tbody>
      </table>
    </div>
  </section>

  <!-- 14. QUICK START -->
  <section id="quickstart">
    <h2>Quick start</h2>
    <pre><code><span class="c"># 1. Install and run the env locally</span>
cd openenv-ts/TemporalBenchEnv
uv sync --extra dev
uv run pytest tests/ -q
uv run openenv validate

<span class="c"># 2. Run the server (uvicorn, port 8000)</span>
uv run server
<span class="c"># or</span>
uvicorn server.app:app --reload

<span class="c"># 3. Build &amp; run the Docker image</span>
docker build -t temporalbenchenv:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 \
    -e TEMPORALBENCH_QUESTION_BANK_DIR=/app/env/data/banks \
    temporalbenchenv:latest

<span class="c"># 4. Or pull / push a HF Space</span>
export ENV_BASE_URL="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv"
openenv push   <span class="c"># from the TemporalBenchEnv/ directory</span>

<span class="c"># 5. Minimal client usage</span>
python - &lt;&lt;'PY'
from client import TemporalBenchAction, TemporalBenchEnvClient
with TemporalBenchEnvClient(base_url=&quot;http://localhost:8000&quot;) as env:
    out = env.reset()
    while not out.done:
        q = out.observation
        out = env.step(TemporalBenchAction(answer=q.options[0]))
    print(&quot;total reward:&quot;, out.observation.reward)
PY

<span class="c"># 6. GRPO / TRL training against this Space (training client under test; not on GitHub yet)</span>
<span class="c"># Release is planned once additional compute allows reasonable model validation.</span></code></pre>
    <p><strong>Purple track (AgentBeats):</strong> <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> &middot; <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</p>
    <p>Banks are reproducible from <code>(env_seed, curriculum_stage, primary_domain)</code>. No external fixtures, no live API, no human labels &mdash; the same flat tiered JSONL that drives the Melady TS Green Agent drives this environment.</p>
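<p>The reproducibility guarantee above amounts to deriving one deterministic RNG seed from the episode-defining tuple. A minimal sketch of that idea follows; the helper name <code>bank_seed</code> and the SHA-256 scheme are illustrative assumptions, not the environment&rsquo;s actual code.</p>

```python
import hashlib
import random

def bank_seed(env_seed: int, curriculum_stage: int, primary_domain: str) -> int:
    """Derive a stable 64-bit seed from the (env_seed, stage, domain) tuple,
    so the same tuple always regenerates the same question bank."""
    key = f"{env_seed}:{curriculum_stage}:{primary_domain}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

# Same tuple -> same RNG stream -> same bank, on any machine.
rng = random.Random(bank_seed(42, 1, "psml"))
```

<p>Hashing the whole tuple, rather than summing its parts, keeps distinct tuples from colliding (e.g. swapping <code>env_seed</code> and <code>curriculum_stage</code> yields a different bank).</p>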
  </section>

  <div class="callout">
    <div class="q">Can an LLM post-trained on our Green Agent&rsquo;s own benchmark outperform zero-shot baselines at cross-domain TS reasoning?</div>
    <div class="sub">The environment is built and deployed. Training-client release and empirical GRPO numbers are contingent on compute availability &mdash; see <a href="#architecture" style="color:var(--accent2)">Architecture</a>.</div>
    <div class="sub" style="margin-top:.85rem;">On AgentBeats, compare purple baselines such as <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</div>
  </div>

  <!-- 15. FUTURE WORK -->
  <section id="future">
    <h2>Future work</h2>
    <ul>
      <li><strong>Run GRPO to convergence</strong> on Stage 1 &rarr; Stage 2 &rarr; Stage 3 curriculum against this environment and fill in the &ldquo;Trained HF policy&rdquo; column of the baselines table (blocked on releasing the training client; see Architecture). Relate outcomes to AgentBeats purple baselines (<a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">AgentScope</a> / <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">CAMEL</a>) and, once landed, the planned MetaGPT / TSci harnesses.</li>
      <li><strong>T2 / T4 forecasting reward</strong> &mdash; replace the <code>compute_forecasting_reward</code> stub with the composite sketched in our v1 plan (normalized MSE + DTW-shape + direction + quantile + ArcTan-smoothed MAE). Opens T2 numeric and T4 contextual forecasting tasks without touching the step loop.</li>
      <li><strong>Confidence-calibration reward</strong> &mdash; <code>TemporalBenchAction.confidence</code> already exists; shape rewards around Brier-style calibration so the policy is incentivized to <em>know when it knows</em>.</li>
      <li><strong>CoT / reasoning reward</strong> &mdash; <code>TemporalBenchAction.reasoning</code> is captured per step but currently unscored; a light format-plus-consistency shaping term would mirror the Lottery format-weight technique.</li>
      <li><strong>Token-budget curriculum</strong> (ReasoningEconomics-style) &mdash; constrain per-episode reasoning length to force terseness at higher stages.</li>
      <li><strong>Soft 60 / 40 task-type mix</strong> &mdash; enforce the v1-plan ratio of understanding (T1U+T3) to prediction (T2_MCQ) tasks explicitly in <code>EpisodeSampler</code>; today the ratio emerges implicitly from bank sizes.</li>
      <li><strong>SFT warm-up</strong> on valid MCQ JSON before GRPO &mdash; skips the cold-start formatting phase we hit in Lottery.</li>
      <li><strong>Human-subjects transfer</strong> &mdash; once the policy beats zero-shot on our bank, measure sim-to-real with held-out TS data outside the Green Agent&rsquo;s training split.</li>
    </ul>
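<p>The composite T2/T4 reward named in the bullets above can be prototyped along these lines. Everything here is an illustrative sketch of the v1-plan idea, not a shipped implementation: the function name, the weights, and the term shapes are assumptions, and the DTW-shape and quantile terms are omitted for brevity.</p>

```python
import math

def forecasting_reward(pred, target, weights=(0.4, 0.3, 0.3)):
    """Toy composite reward: normalized-MSE term + direction-agreement
    term + ArcTan-smoothed MAE term, each mapped into [0, 1]."""
    n = len(target)
    scale = max(abs(v) for v in target) or 1.0
    # 1) Normalized MSE, squashed so zero error -> reward 1.0.
    nmse = sum((p - t) ** 2 for p, t in zip(pred, target)) / (n * scale**2)
    r_mse = 1.0 / (1.0 + nmse)
    # 2) Direction agreement: fraction of steps where the sign of the
    #    step-to-step change matches the target's.
    agree = sum(
        (pred[i + 1] - pred[i]) * (target[i + 1] - target[i]) >= 0
        for i in range(n - 1)
    )
    r_dir = agree / max(n - 1, 1)
    # 3) ArcTan-smoothed MAE: stays bounded even for wild outliers.
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / n
    r_mae = 1.0 - (2.0 / math.pi) * math.atan(mae / scale)
    w1, w2, w3 = weights
    return w1 * r_mse + w2 * r_dir + w3 * r_mae
```

<p>Because each term is bounded in [0, 1] and the weights sum to 1, the composite plugs into the existing per-step reward range without touching the step loop, which is the property the T2/T4 bullet relies on.</p>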
  </section>

  <!-- 16. CONCLUSION -->
  <section>
    <h2>Conclusion</h2>
    <p><strong>TemporalBenchEnv</strong> is the training-side companion to our Melady TS Green Agent&rsquo;s eval-side role. The Green Agent ranks purple agents on TS-Benchmark through the A2A protocol; TemporalBenchEnv re-exposes the same four-dataset MCQ suite as an OpenEnv-native sequential MDP, so an LLM can be <em>post-trained</em> on the very benchmark it will later be scored against.</p>
    <p>Every design choice &mdash; 9-question episodes with 6 primary + 3 cross-domain, per-step correctness + terminal bonus with a domain-coverage multiplier, stagewise curriculum T1U &rarr; +T3 &rarr; +T2_MCQ, and a strict separation of env server and trainer over WebSocket &mdash; is aimed at preserving the verifiability that made the Green Agent&rsquo;s signal trustworthy, while turning it into gradient. The infrastructure contributions are mostly <em>inherited</em> from our Lottery / Reasoning siblings; the novelty is the domain binding and the Green-Agent &rarr; OpenEnv pipeline.</p>
    <p>The research question is open: <em>can a GRPO-trained LLM, post-trained on our own benchmark, outperform strong zero-shot baselines at cross-domain TS reasoning?</em> The environment, banks, and deployment path are shipped; full training results await the public training client and the compute budget described under Architecture. The purple track stays live on AgentBeats via <a href="https://agentbeats.dev/sharma-yash01/melady-ts-purple-agentscope" target="_blank" style="color:var(--accent2)">Melady TS Purple (AgentScope)</a> and <a href="https://agentbeats.dev/sharma-yash01/melady-ts-base-purple-agent" target="_blank" style="color:var(--accent2)">Melady TS Base Purple Agent</a>.</p>
  </section>

  <div class="footer">
    <p>TemporalBenchEnv &middot; Melady / AgentX OpenEnv Track &middot; USC</p>
    <p style="margin-top:.5rem;">
      <a href="https://github.com/sharma-yash01/TemporalBenchEnv" target="_blank">GitHub</a> &middot;
      <a href="https://huggingface.co/spaces/yashu2000/TemporalBenchEnv" target="_blank">HF Space</a> &middot;
      <a href="https://agentbeats.dev/sharma-yash01/melady-ts-green-agent" target="_blank">Melady TS Green Agent</a> &middot;
      <a href="https://github.com/sharma-yash01/Melady_Agent-TS-Green" target="_blank">Green Agent GitHub</a> &middot;
      <a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> &middot;
      <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL &times; OpenEnv</a>
    </p>
  </div>

</div>
</body>
</html>