# RULES.md - Project Constitution & AI Guardrails
# OpenEnv Email Triage Environment

EVERY AI agent, copilot, or assistant working on this project MUST read and obey this file before generating ANY code.

REVISION 2: Updated based on sample inference.py analysis.
Where submission rules conflict with the original brief, SUBMISSION RULES WIN.
Where the sample script reveals patterns, MATCH THE PATTERNS.

## 0. GOLDEN RULE

> Do NOT generate code that you cannot explain line by line.
> Do NOT add features not listed in this document.
> Do NOT deviate from the file map, naming conventions, or interfaces defined here.
> When in doubt, do LESS, not more.

---

## 1. SCOPE - What This Project Is

- An OpenEnv-compliant AI agent training environment
- Domain: Email Triage (classify, prioritise, route emails)
- Deployed as a Docker-based Hugging Face Space
- Evaluated by inference.py using OpenAI Client with configurable endpoint

### What this project is NOT

- A chatbot
- A web app with a UI
- A game or toy problem
- A fine-tuning pipeline
- A multi-agent system
- An LLM wrapper with extra features
- A BrowserGym environment (the sample uses BrowserGym - we do NOT)

---

## 2. SUBMISSION CHECKLIST - DISQUALIFICATION CRITERIA

These are automated checks. Failing ANY ONE means disqualification.

| # | Check | What the validator does |
|---|---|---|
| 1 | HF Space deploys | Pings Space URL - must return HTTP 200 and respond to reset() |
| 2 | OpenEnv spec compliance | Validates openenv.yaml, typed models, /step, /reset, /state |
| 3 | Dockerfile builds | Runs docker build on the submitted repo - must succeed |
| 4 | Inference reproduces | Runs inference.py - must complete without error and produce scores |
| 5 | 3+ tasks with graders | Enumerates tasks, runs each grader, verifies scores in [0.0, 1.0] |
| 6 | Pre-validation script | Runs `./validate-submission.sh <ping_url> .` and expects all 3 checks to pass |

### 2.1 Mandatory pre-submit validation

- Before claiming "submission ready", run `./validate-submission.sh <ping_url> .` from repo root.
- If `<ping_url>` is unavailable, request it and block readiness claims until provided.
- Any AI assistant working on this repo must treat validator failure as a hard stop.

### Infrastructure constraints

| Constraint | Limit |
|---|---|
| vCPU | 2 |
| Memory | 8 GB |
| Inference runtime | < 20 minutes |

---

## 3. ENVIRONMENT VARIABLES - Mandatory

```python
import os

API_BASE_URL = os.getenv("API_BASE_URL")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
MODEL_NAME = os.getenv("MODEL_NAME")
```

How to use in code (EXACT PATTERN - matches sample):

```python
from openai import OpenAI

client = OpenAI(
    base_url=API_BASE_URL,
    api_key=API_KEY,
)

completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[...],
    temperature=0.2,
    max_tokens=200,
    stream=False,
)

response_text = completion.choices[0].message.content or ""
```

Rules:

- NEVER hard-code any of these values
- NEVER use os.environ["VAR"] (use os.getenv() - matches sample)
- NEVER use any LLM client other than openai.OpenAI
- Support both HF_TOKEN and API_KEY with an `or` fallback (matches sample)

---

## 4. FILE MAP - Strict Build Order

| Order | File | Purpose | May import from |
|---|---|---|---|
| 1st | models.py | Pydantic models + StepResult wrapper | stdlib, pydantic only |
| 2nd | tasks.py | Task definitions + hard-coded email data | models.py only |
| 3rd | graders.py | Deterministic grader functions | models.py, tasks.py only |
| 4th | environment.py | Core env class: step, reset, state | models, tasks, graders |
| 5th | server.py | Flask HTTP wrapper: /reset, /step, /state | environment.py, models.py |
| 6th | inference.py | OpenAI Client inference script | models.py, environment.py |
| 7th | openenv.yaml | Spec metadata | N/A (data file) |
| 8th | Dockerfile | Container build | N/A (config file) |
| 8th | requirements.txt | Pinned dependencies | N/A (config file) |
| 9th | README.md | Full documentation | N/A (documentation) |
| 10th | validate-submission.sh | Pre-submission validator script | N/A (shell script) |

### Rules about files

- Do NOT create files not listed above. No utils.py, helpers.py, or config.py.
- Do NOT merge files. Each file has one responsibility.
- Do NOT create subdirectories. All files live in the project root.
- Do NOT add `__init__.py`. This is not a package.

---

## 5. DEPENDENCY RULES

### Allowed dependencies

```txt
pydantic>=2.0,<3.0
flask>=3.0,<4.0
openai>=1.0,<2.0
gunicorn>=21.0,<23.0
```

### Conditionally allowed (only if needed)

```txt
numpy
Pillow
```

### Forbidden

- No LangChain, LlamaIndex, or any agent framework
- No pandas or scipy
- No database libraries
- No async frameworks (FastAPI, aiohttp) - use Flask
- No frontend frameworks (Streamlit, Gradio)
- No ML libraries (torch, transformers, sklearn)

---

## 6. PYDANTIC MODEL RULES

### models.py constraints

- ALL models MUST inherit from pydantic.BaseModel
- ALL fields MUST have explicit type annotations
- ALL Literal types MUST use typing.Literal with exhaustive values
- NO methods on models (except StepResult and ResetResult wrappers)
- NO validators that call external services
- NO default_factory that uses randomness
- Field names MUST be snake_case
- NO nested models deeper than 2 levels

### Required models (exact names)

```python
class EmailObservation(BaseModel): ...
class TriageAction(BaseModel): ...
class RewardResult(BaseModel): ...
class EnvironmentState(BaseModel): ...
class StepResult(BaseModel): ...
class ResetResult(BaseModel): ...
```

### StepResult and ResetResult interface (mandatory)

```python
class StepResult(BaseModel):
    observation: EmailObservation
    reward: float
    done: bool
    info: dict[str, str | int | float | bool]

class ResetResult(BaseModel):
    observation: EmailObservation
    info: dict[str, str | int | float | bool]
```

### EmailObservation required fields

| Field | Type | Required |
|---|---|---|
| email_id | str | Yes |
| subject | str | Yes |
| body | str | Yes |
| sender | str | Yes |
| timestamp | str | Yes |
| thread_history | list[str] | Yes |
| task_id | str | Yes |
| step_number | int | Yes |
| total_emails | int | Yes |

### TriageAction required fields

| Field | Type | Required |
|---|---|---|
| label | Literal["urgent", "normal", "spam", "archive"] | Yes |
| summary | str | Yes |
| route_to | str | Yes |

### RewardResult required fields

| Field | Type | Required |
|---|---|---|
| score | float | Yes |
| breakdown | dict[str, float] | Yes |
| feedback | str | Yes |

### EnvironmentState required fields

| Field | Type | Required |
|---|---|---|
| task_id | str | Yes |
| current_step | int | Yes |
| total_steps | int | Yes |
| done | bool | Yes |
| action_history | list | Yes |
| reward_history | list | Yes |

---

## 7. ENVIRONMENT CLASS RULES

- Class name: EmailTriageEnv
- Constructor: `__init__(self, task_id: str)`
- MUST accept a task_id string
- MUST NOT call any external API
- MUST NOT use randomness

### reset() -> ResetResult

- MUST return a ResetResult object (not a bare observation)
- result.observation must contain the first email
- MUST reset all internal state
- MUST be callable multiple times without side effects
- HF Space validator will call /reset and expect HTTP 200 + valid JSON

### step(action: TriageAction) -> StepResult

- MUST return a StepResult object (not a tuple)
- result.observation: next email or terminal observation
- result.reward: float score for this step
- result.done: bool indicating episode end
- result.info: metadata dict
- MUST never raise an exception from bad agent input
- If action validation fails: return StepResult with reward=0.0 and continue
- MUST increment step counter
- MUST set done=True when all emails are processed or max_steps is reached

### state() -> EnvironmentState

- MUST return the full current internal state
- MUST be read-only

### Hard rules for environment.py

- NO randomness
- NO API calls
- NO file I/O during step/reset/state
- NO global mutable state
- NO threading or async
- NO print statements

---

## 8. TASK DATA RULES

Unchanged from previous version.

- All email data MUST be hard-coded
- NO loading from external files, URLs, or databases
- Task IDs: task_easy, task_medium, task_hard
- Each task defines: task_id, description, emails, ground_truth
- Ground truth MUST NOT be in observations (no answer leakage)
- Realistic professional email content
- NO offensive or NSFW content

---

## 9. GRADER RULES

Unchanged from previous version.

- Pure functions
- Deterministic
- Partial credit
- Scores in [0.0, 1.0]
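
A minimal sketch of a grader with these four properties. The function name `grade_action`, the plain-string arguments, and the 0.6/0.4 weights are illustrative assumptions, not values from the project brief; a real grader in graders.py would presumably receive a TriageAction and the task's ground truth.

```python
def grade_action(
    predicted_label: str,
    predicted_route: str,
    true_label: str,
    true_route: str,
) -> float:
    """Score one triage decision with partial credit.

    Pure and deterministic: the result depends only on the arguments.
    """
    # Hypothetical weights: label correctness counts more than routing.
    label_weight = 0.6
    route_weight = 0.4

    score = 0.0
    if predicted_label == true_label:
        score += label_weight
    # Compare routes case-insensitively so "Safety" and "safety" both match.
    if predicted_route.strip().lower() == true_route.strip().lower():
        score += route_weight

    # Clamp defensively so the score always stays inside [0.0, 1.0].
    return max(0.0, min(1.0, score))
```

A correct label with a wrong route earns the label weight only, which is the partial-credit behaviour the rules ask for.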

---

## 10. REWARD FUNCTION RULES

Unchanged from previous version.

```text
final_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
```

Final reward is clipped to [-1.0, 1.0].
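
The formula and the clipping step can be sketched as one function. The name `compute_final_reward` and the keyword defaults are assumptions for illustration; how `trajectory_bonus` and `penalties` are derived is not specified in this document.

```python
STEP_PENALTY = 0.01  # per-step cost, taken from the formula above


def compute_final_reward(
    base_score: float,
    step_count: int,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Apply the reward formula, then clip the result to [-1.0, 1.0]."""
    raw_reward = base_score - (step_count * STEP_PENALTY) + trajectory_bonus - penalties
    return max(-1.0, min(1.0, raw_reward))
```

For example, a base score of 0.9 after 5 steps with no bonus or penalties gives 0.9 - 0.05 = 0.85.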

---

## 11. SERVER RULES

### server.py constraints

- MUST use Flask
- Exactly THREE routes:
  - POST /reset: accepts {"task_id": str}, returns ResetResult JSON
  - POST /step: accepts TriageAction JSON, returns StepResult JSON
  - POST /state: returns EnvironmentState JSON
- MUST listen on port 7860
- MUST handle malformed JSON gracefully (return 400)
- All responses must include Content-Type: application/json
- Validator will ping and call /reset, which must return HTTP 200

### /step response format

```json
{
  "observation": {},
  "reward": 0.85,
  "done": false,
  "info": {"step": 1, "task_id": "task_easy"}
}
```

### /reset response format

```json
{
  "observation": {},
  "info": {"task_id": "task_easy"}
}
```

---

## 12. INFERENCE SCRIPT RULES

CRITICAL PATTERNS FROM SAMPLE - MUST FOLLOW

### Architecture (matches sample)

```text
1. Initialize OpenAI client with env vars
2. Create environment instance
3. Call reset(), get initial observation
4. Loop up to MAX_STEPS:
   a. Build prompt from observation + history
   b. Call LLM
   c. Parse response into action (with fallback)
   d. Call step(action)
   e. Record history
   f. Check done flag
5. Print results
```
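
The loop in steps 3 to 4 can be sketched without a network connection by injecting the model call and the parser as plain callables. Everything here is a hypothetical illustration: `run_episode`, `StubEnv`, and the `SimpleNamespace` results are stand-ins for the real inference code, `EmailTriageEnv`, and the Pydantic `ResetResult`/`StepResult` models.

```python
from types import SimpleNamespace
from typing import Any, Callable

MAX_STEPS = 10


def run_episode(
    env: Any,
    ask_model: Callable[[str], str],
    parse_action: Callable[[str], Any],
    max_steps: int = MAX_STEPS,
) -> tuple[list[float], list[str]]:
    """Run one episode and return (per-step rewards, history lines)."""
    history: list[str] = []
    rewards: list[float] = []
    observation = env.reset().observation

    for step in range(1, max_steps + 1):
        # Build the prompt from the current observation plus earlier steps.
        prompt = f"Email: {observation}\nHistory:\n" + "\n".join(history)
        try:
            response_text = ask_model(prompt)
        except Exception as exc:
            print(f"Model request failed ({exc}). Using fallback action.")
            response_text = ""

        # The parser is expected to return a fallback action for bad text.
        action = parse_action(response_text)
        result = env.step(action)
        rewards.append(result.reward)
        history.append(f"Step {step}: {action} -> reward {result.reward:+.2f}")
        observation = result.observation
        if result.done:
            break

    return rewards, history


class StubEnv:
    """Two-email stand-in for EmailTriageEnv, used only to exercise the loop."""

    def __init__(self) -> None:
        self._step = 0

    def reset(self) -> SimpleNamespace:
        self._step = 0
        return SimpleNamespace(observation="email-1", info={})

    def step(self, action: str) -> SimpleNamespace:
        self._step += 1
        reward = 1.0 if action == "urgent" else 0.0
        return SimpleNamespace(
            observation=f"email-{self._step + 1}",
            reward=reward,
            done=self._step >= 2,
            info={},
        )
```

Calling `run_episode(StubEnv(), lambda prompt: "urgent", lambda text: text or "normal")` runs two steps and stops when the stub reports `done`. In the real script, `ask_model` would wrap the `client.chat.completions.create(...)` call.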

### Mandatory constants

```python
MAX_STEPS = 10
TEMPERATURE = 0.2
MAX_TOKENS = 200
FALLBACK_ACTION = ...
```

### Response parsing rules

- Do NOT rely only on response_format={"type": "json_object"}
- Parse free-text responses with regex or string matching
- If parsing fails, use a fallback action
- Strip prefixes like `action:` or `next action:` before parsing
- Regex parsing with fallback is preferred
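
One way to apply these parsing rules, as a sketch. The regular expressions, the function name `parse_action_fields`, and the values in `FALLBACK_FIELDS` are assumptions (this document leaves FALLBACK_ACTION unspecified). The function returns plain field values that the caller would pass to `TriageAction`.

```python
import re

VALID_LABELS = ("urgent", "normal", "spam", "archive")

# Hypothetical fallback values; choose ones that suit the tasks.
FALLBACK_FIELDS = {"label": "normal", "summary": "", "route_to": "general"}

# Matches a leading "action:" or "next action:" prefix, in any letter case.
PREFIX_RE = re.compile(r"^\s*(?:next\s+)?action\s*:\s*", re.IGNORECASE)
# Each pattern accepts key=value, key: value, and JSON-style "key": "value".
LABEL_RE = re.compile(r'label"?\s*[=:]\s*"?(\w+)', re.IGNORECASE)
ROUTE_RE = re.compile(r'route(?:_to)?"?\s*[=:]\s*"?([\w-]+)', re.IGNORECASE)
SUMMARY_RE = re.compile(r'summary"?\s*[=:]\s*"?([^"\n,]+)', re.IGNORECASE)


def parse_action_fields(response_text: str) -> dict[str, str]:
    """Extract label, summary, and route_to from free text, or return the fallback."""
    text = PREFIX_RE.sub("", response_text)

    label_match = LABEL_RE.search(text)
    route_match = ROUTE_RE.search(text)
    if label_match is None or route_match is None:
        return dict(FALLBACK_FIELDS)

    label = label_match.group(1).lower()
    if label not in VALID_LABELS:
        # An unknown label would fail TriageAction validation, so fall back early.
        return dict(FALLBACK_FIELDS)

    summary_match = SUMMARY_RE.search(text)
    summary = summary_match.group(1).strip() if summary_match else ""
    return {"label": label, "summary": summary, "route_to": route_match.group(1).lower()}
```

Both `Action: label=urgent, route=safety` and a JSON object containing the same keys parse successfully; text with no recognisable label or route yields a copy of the fallback fields.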

### History tracking

```python
history: list[str] = []
history_line = f"Step {step}: {action} -> reward {reward:+.2f}"
history.append(history_line)
```

### Error handling

```python
try:
    completion = client.chat.completions.create(...)
    response_text = completion.choices[0].message.content or ""
except Exception as exc:
    print(f"Model request failed ({exc}). Using fallback action.")
    response_text = ""
```

### Output format

```text
Episode: task_easy
Step 1: label=urgent, route=safety -> reward +0.85
Final score: 0.85

=== SCORE TABLE ===
Task         Score    Steps
task_easy    0.85     1
task_medium  0.62     5
task_hard    0.45     2
Mean         0.64
```
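
A sketch of how the score table could be produced. The function name `format_score_table` and the column widths (13 and 9 characters, inferred from the example layout) are assumptions.

```python
def format_score_table(results: list[tuple[str, float, int]]) -> str:
    """Render (task_id, score, steps) rows as the score table shown above."""
    lines = ["=== SCORE TABLE ===", f"{'Task':<13}{'Score':<9}Steps"]
    for task_id, score, steps in results:
        lines.append(f"{task_id:<13}{score:<9.2f}{steps}")

    # Guard against an empty result list to avoid dividing by zero.
    mean_score = sum(score for _, score, _ in results) / len(results) if results else 0.0
    lines.append(f"{'Mean':<13}{mean_score:.2f}")
    return "\n".join(lines)
```

Usage: `print(format_score_table([("task_easy", 0.85, 1), ("task_medium", 0.62, 5), ("task_hard", 0.45, 2)]))`.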

### File naming and location

- File MUST be named inference.py
- MUST be in the project root directory
- MUST be runnable with python inference.py
- MUST complete in under 20 minutes

---

## 13. DOCKERFILE RULES

- Base image: python:3.11-slim
- WORKDIR: /app
- Copy requirements.txt first, pip install, then copy source
- EXPOSE 7860
- Create non-root user
- CMD starts the server
- Must build with --platform linux/amd64
- Must run within 2 vCPU / 8 GB memory
- No unnecessary system packages
- No CUDA/GPU dependencies

---

## 14. CODE STYLE RULES

- Python 3.11+
- Type hints on ALL function signatures
- Docstrings on ALL public functions (Google style)
- No single-letter variable names except i in loops
- Comments explain WHY, not WHAT
- Max line length: 100 characters
- f-strings only
- No wildcard imports
- Import order: stdlib -> third-party -> local

---

## 15. WHAT AI MUST NEVER DO

- Never add features not in this spec
- Never use an LLM inside a grader
- Never generate fake scores
- Never create a UI
- Never use randomness in the environment
- Never store API keys in code
- Never skip error handling in step()
- Never use bare dicts where Pydantic models are specified
- Never name the inference script baseline.py
- Never use OPENAI_API_KEY; use HF_TOKEN/API_KEY
- Never use response_format={"type": "json_object"} without text-parsing fallback
- Never return tuples from step/reset; use StepResult/ResetResult objects
- Never skip the fallback action pattern
- Never skip history tracking in inference

---

## 16. DEFINITION OF DONE - Per Phase Checklist

### Phase 1 complete when

- models.py exists with all 6 models (including StepResult, ResetResult)
- All fields match this document
- Models instantiate with sample data without errors
- StepResult has observation, reward, done, info attributes

### Phase 2 complete when

- tasks.py exists with 3 tasks
- All email data is realistic and hard-coded
- Ground truth exists for every email
- No answer leakage

### Phase 3 complete when

- graders.py has 3 pure grader functions
- Partial credit works
- All scores in [0.0, 1.0]

### Phase 4 complete when

- environment.py has EmailTriageEnv class
- reset() returns ResetResult
- step() returns StepResult
- step() handles invalid input without crashing
- Full episode runs to completion

### Phase 5 complete when

- server.py has /reset, /step, /state routes
- /reset returns {"observation": ..., "info": ...}
- /step returns {"observation": ..., "reward": ..., "done": ..., "info": ...}
- Malformed requests return 400
- Port 7860

### Phase 6 complete when

- inference.py follows sample architecture
- Uses os.getenv() for API_BASE_URL, HF_TOKEN/API_KEY, MODEL_NAME
- Has MAX_STEPS, TEMPERATURE, MAX_TOKENS, FALLBACK_ACTION constants
- Has history tracking
- Has response parsing with fallback
- Has try/except around API calls
- Prints score table
- Completes in under 20 minutes

### Phase 7-9

Unchanged from previous version.

---

## 17. WHEN IN DOUBT

- Re-read this file
- Re-read the project briefing
- Re-read the sample inference.py
- Match the sample patterns
- Choose the simpler option
- Ask the human, do not guess

This file is the law. Code that violates it gets deleted.