# RULES.md - Project Constitution & AI Guardrails
# OpenEnv Email Triage Environment
EVERY AI agent, copilot, or assistant working on this project MUST read and obey this file before generating ANY code.
REVISION 2: Updated based on sample inference.py analysis.
Where submission rules conflict with the original brief, SUBMISSION RULES WIN.
Where the sample script reveals patterns, MATCH THE PATTERNS.
## 0. GOLDEN RULE
> Do NOT generate code that you cannot explain line by line.
> Do NOT add features not listed in this document.
> Do NOT deviate from the file map, naming conventions, or interfaces defined here.
> When in doubt, do LESS, not more.
---
## 1. SCOPE - What This Project Is
- An OpenEnv-compliant AI agent training environment
- Domain: Email Triage (classify, prioritise, route emails)
- Deployed as a Docker-based Hugging Face Space
- Evaluated by inference.py, which uses the OpenAI client with a configurable endpoint
### What this project is NOT
- A chatbot
- A web app with a UI
- A game or toy problem
- A fine-tuning pipeline
- A multi-agent system
- An LLM wrapper with extra features
- A BrowserGym environment (the sample uses BrowserGym - we do NOT)
---
## 2. SUBMISSION CHECKLIST - DISQUALIFICATION CRITERIA
These are automated checks. Failing ANY ONE means disqualification.
| # | Check | What the validator does |
|---|---|---|
| 1 | HF Space deploys | Pings Space URL - must return HTTP 200 and respond to reset() |
| 2 | OpenEnv spec compliance | Validates openenv.yaml, typed models, /step, /reset, /state |
| 3 | Dockerfile builds | Runs docker build on the submitted repo - must succeed |
| 4 | Inference reproduces | Runs inference.py - must complete without error and produce scores |
| 5 | 3+ tasks with graders | Enumerates tasks, runs each grader, verifies scores in [0.0, 1.0] |
| 6 | Pre-validation script | Runs `./validate-submission.sh <ping_url> .` and expects all 3 checks to pass |
### 2.1 Mandatory pre-submit validation
- Before claiming "submission ready", run `./validate-submission.sh <ping_url> .` from repo root.
- If `<ping_url>` is unavailable, request it and block readiness claims until provided.
- Any AI assistant working on this repo must treat validator failure as a hard stop.
### Infrastructure constraints
| Constraint | Limit |
|---|---|
| vCPU | 2 |
| Memory | 8 GB |
| Inference runtime | < 20 minutes |
---
## 3. ENVIRONMENT VARIABLES - Mandatory
```python
import os
API_BASE_URL = os.getenv("API_BASE_URL")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
MODEL_NAME = os.getenv("MODEL_NAME")
```
How to use in code (EXACT PATTERN - matches sample):
```python
from openai import OpenAI
client = OpenAI(
    base_url=API_BASE_URL,
    api_key=API_KEY,
)
completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[...],
    temperature=0.2,
    max_tokens=200,
    stream=False,
)
response_text = completion.choices[0].message.content or ""
```
Rules:
- NEVER hard-code any of these values
- NEVER use os.environ["VAR"] (use os.getenv() - matches sample)
- NEVER use any LLM client other than openai.OpenAI
- Support both HF_TOKEN and API_KEY with an `or` fallback (matches sample)
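One subtlety of the `or` fallback is easier to see in isolation: an HF_TOKEN that is set but empty is falsy, so API_KEY is still picked up. A minimal sketch follows; `resolve_api_key` is a hypothetical name used only for illustration, and the project code should keep the exact three-line pattern shown at the top of this section rather than add a helper.

```python
import os
from typing import Optional


def resolve_api_key() -> Optional[str]:
    """Return HF_TOKEN if set and non-empty, otherwise API_KEY (illustrative helper)."""
    # Same expression as the mandatory pattern: `or` skips both None and "".
    return os.getenv("HF_TOKEN") or os.getenv("API_KEY")
```

With HF_TOKEN exported as an empty string and API_KEY exported as any non-empty value, the function returns the API_KEY value. If neither is set it returns None.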
---
## 4. FILE MAP - Strict Build Order
| Order | File | Purpose | May import from |
|---|---|---|---|
| 1st | models.py | Pydantic models + StepResult wrapper | stdlib, pydantic only |
| 2nd | tasks.py | Task definitions + hard-coded email data | models.py only |
| 3rd | graders.py | Deterministic grader functions | models.py, tasks.py only |
| 4th | environment.py | Core env class: step, reset, state | models, tasks, graders |
| 5th | server.py | Flask HTTP wrapper: /reset, /step, /state | environment.py, models.py |
| 6th | inference.py | OpenAI Client inference script | models.py, environment.py |
| 7th | openenv.yaml | Spec metadata | N/A (data file) |
| 8th | Dockerfile | Container build | N/A (config file) |
| 8th | requirements.txt | Pinned dependencies | N/A (config file) |
| 9th | README.md | Full documentation | N/A (documentation) |
| 10th | validate-submission.sh | Pre-submission validator script | N/A (shell script) |
### Rules about files
- Do NOT create files not listed above. No utils.py, helpers.py, or config.py.
- Do NOT merge files. Each file has one responsibility.
- Do NOT create subdirectories. All files live in the project root.
- Do NOT add `__init__.py`. This is not a package.
---
## 5. DEPENDENCY RULES
### Allowed dependencies
```txt
pydantic>=2.0,<3.0
flask>=3.0,<4.0
openai>=1.0,<2.0
gunicorn>=21.0,<23.0
```
### Conditionally allowed (only if needed)
```txt
numpy
Pillow
```
### Forbidden
- No LangChain, LlamaIndex, or any agent framework
- No pandas or scipy
- No database libraries
- No async frameworks (FastAPI, aiohttp) - use Flask
- No frontend frameworks (Streamlit, Gradio)
- No ML libraries (torch, transformers, sklearn)
---
## 6. PYDANTIC MODEL RULES
### models.py constraints
- ALL models MUST inherit from pydantic.BaseModel
- ALL fields MUST have explicit type annotations
- ALL Literal types MUST use typing.Literal with exhaustive values
- NO methods on models (except StepResult and ResetResult wrappers)
- NO validators that call external services
- NO default_factory that uses randomness
- Field names MUST be snake_case
- NO nested models deeper than 2 levels
### Required models (exact names)
```python
class EmailObservation(BaseModel): ...
class TriageAction(BaseModel): ...
class RewardResult(BaseModel): ...
class EnvironmentState(BaseModel): ...
class StepResult(BaseModel): ...
class ResetResult(BaseModel): ...
```
### StepResult and ResetResult interface (mandatory)
```python
class StepResult(BaseModel):
    observation: EmailObservation
    reward: float
    done: bool
    info: dict[str, str | int | float | bool]


class ResetResult(BaseModel):
    observation: EmailObservation
    info: dict[str, str | int | float | bool]
```
### EmailObservation required fields
| Field | Type | Required |
|---|---|---|
| email_id | str | Yes |
| subject | str | Yes |
| body | str | Yes |
| sender | str | Yes |
| timestamp | str | Yes |
| thread_history | list[str] | Yes |
| task_id | str | Yes |
| step_number | int | Yes |
| total_emails | int | Yes |
### TriageAction required fields
| Field | Type | Required |
|---|---|---|
| label | Literal["urgent", "normal", "spam", "archive"] | Yes |
| summary | str | Yes |
| route_to | str | Yes |
### RewardResult required fields
| Field | Type | Required |
|---|---|---|
| score | float | Yes |
| breakdown | dict[str, float] | Yes |
| feedback | str | Yes |
### EnvironmentState required fields
| Field | Type | Required |
|---|---|---|
| task_id | str | Yes |
| current_step | int | Yes |
| total_steps | int | Yes |
| done | bool | Yes |
| action_history | list | Yes |
| reward_history | list | Yes |
---
## 7. ENVIRONMENT CLASS RULES
- Class name: EmailTriageEnv
- Constructor: __init__(self, task_id: str)
- MUST accept a task_id string
- MUST NOT call any external API
- MUST NOT use randomness
### reset() -> ResetResult
- MUST return a ResetResult object (not a bare observation)
- result.observation must contain the first email
- MUST reset all internal state
- MUST be callable multiple times without side effects
- HF Space validator will call /reset and expect HTTP 200 + valid JSON
### step(action: TriageAction) -> StepResult
- MUST return a StepResult object (not a tuple)
- result.observation: next email or terminal observation
- result.reward: float score for this step
- result.done: bool indicating episode end
- result.info: metadata dict
- MUST never raise an exception from bad agent input
- If action validation fails: return StepResult with reward=0.0 and continue
- MUST increment step counter
- MUST set done=True when all emails processed or max_steps hit
### state() -> EnvironmentState
- MUST return the full current internal state
- MUST be read-only
### Hard rules for environment.py
- NO randomness
- NO API calls
- NO file I/O during step/reset/state
- NO global mutable state
- NO threading or async
- NO print statements
---
## 8. TASK DATA RULES
Unchanged from previous version.
- All email data MUST be hard-coded
- NO loading from external files, URLs, or databases
- Task IDs: task_easy, task_medium, task_hard
- Each task defines: task_id, description, emails, ground_truth
- Ground truth MUST NOT be in observations (no answer leakage)
- Realistic professional email content
- NO offensive or NSFW content
---
## 9. GRADER RULES
Unchanged from previous version.
- Pure functions
- Deterministic
- Partial credit
- Scores in [0.0, 1.0]
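
A minimal sketch of a grader that satisfies all four properties above. The function name, its signature, and the 0.6 / 0.4 weights are assumptions for illustration; this document only requires a pure, deterministic function with partial credit and a score in [0.0, 1.0].

```python
def grade_action(
    label: str,
    route_to: str,
    expected_label: str,
    expected_route: str,
) -> float:
    """Score one triage decision with partial credit (illustrative weights)."""
    # Assumed weights: the spec does not say how credit is split.
    label_weight = 0.6
    route_weight = 0.4
    score = 0.0
    if label == expected_label:
        score += label_weight
    # Routing is compared case-insensitively so "Safety" and "safety" match.
    if route_to.strip().lower() == expected_route.strip().lower():
        score += route_weight
    # Clamp defensively so the validator's range check always holds.
    return max(0.0, min(1.0, score))
```

The function reads no external state and uses no randomness, so the same inputs always produce the same score.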
---
## 10. REWARD FUNCTION RULES
Unchanged from previous version.
```text
final_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
```
Final reward is clipped to [-1.0, 1.0].
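
The formula and the clipping step can be sketched as a single pure function. The function name is hypothetical, and how trajectory_bonus and penalties are computed is not defined in this section, so both are taken as plain inputs.

```python
def compute_final_reward(
    base_score: float,
    step_count: int,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Apply the reward formula, then clip the result to [-1.0, 1.0]."""
    # Each step costs 0.01, so longer episodes earn slightly less.
    raw_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
    return max(-1.0, min(1.0, raw_reward))
```

For example, a base score of 0.85 after one step, with no bonus and no penalties, gives 0.85 - 0.01 = 0.84.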
---
## 11. SERVER RULES
### server.py constraints
- MUST use Flask
- Exactly THREE routes:
  - POST /reset: accepts {"task_id": str}, returns ResetResult JSON
  - POST /step: accepts TriageAction JSON, returns StepResult JSON
  - POST /state: returns EnvironmentState JSON
- MUST listen on port 7860
- MUST handle malformed JSON gracefully (return 400)
- All responses must include Content-Type: application/json
- Validator will ping and call /reset, which must return HTTP 200
### /step response format
```json
{
  "observation": {},
  "reward": 0.85,
  "done": false,
  "info": {"step": 1, "task_id": "task_easy"}
}
```
### /reset response format
```json
{
  "observation": {},
  "info": {"task_id": "task_easy"}
}
```
---
## 12. INFERENCE SCRIPT RULES
CRITICAL PATTERNS FROM SAMPLE - MUST FOLLOW
### Architecture (matches sample)
```text
1. Initialize OpenAI client with env vars
2. Create environment instance
3. Call reset(), get initial observation
4. Loop up to MAX_STEPS:
   a. Build prompt from observation + history
   b. Call LLM
   c. Parse response into action (with fallback)
   d. Call step(action)
   e. Record history
   f. Check done flag
5. Print results
```
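
Steps 3 to 5 above can be sketched as a loop. Everything here is a stand-in so the control flow can run without a model or the real environment: `StubEnv` is a hypothetical replacement for EmailTriageEnv that returns plain dicts (the real class must return ResetResult / StepResult), and `choose_action` stands in for prompt building, the LLM call, and response parsing.

```python
from typing import Callable

MAX_STEPS = 10


class StubEnv:
    """Hypothetical stand-in for EmailTriageEnv (plain dicts, illustration only)."""

    def __init__(self, total_emails: int) -> None:
        self.total_emails = total_emails
        self.current_step = 0

    def reset(self) -> dict:
        self.current_step = 0
        return {"observation": {"subject": "email 0"}, "info": {}}

    def step(self, action: str) -> dict:
        self.current_step += 1
        return {
            "observation": {"subject": f"email {self.current_step}"},
            "reward": 1.0 if action == "urgent" else 0.0,  # toy reward rule
            "done": self.current_step >= self.total_emails,
            "info": {},
        }


def run_episode(env: StubEnv, choose_action: Callable[[dict, list[str]], str]) -> list[str]:
    """Run one episode, stopping at the done flag or after MAX_STEPS."""
    history: list[str] = []
    result = env.reset()
    for step in range(1, MAX_STEPS + 1):
        # Steps a-c: prompt, LLM call, and parsing are collapsed into one callable.
        action = choose_action(result["observation"], history)
        result = env.step(action)  # step d
        history.append(f"Step {step}: {action} -> reward {result['reward']:+.2f}")  # step e
        if result["done"]:  # step f
            break
    return history
```

Passing the policy in as a callable keeps the loop testable without network access. The MAX_STEPS cap guarantees termination even if the environment never reports done.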
### Mandatory constants
```python
MAX_STEPS = 10
TEMPERATURE = 0.2
MAX_TOKENS = 200
FALLBACK_ACTION = ...
```
### Response parsing rules
- Do NOT rely only on response_format={"type": "json_object"}
- Parse free-text responses with regex or string matching
- If parsing fails, use a fallback action
- Strip prefixes like action: or next action: before parsing
- Regex parsing with fallback is preferred
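
A minimal sketch of regex parsing with a fallback, under stated assumptions: the fallback values ("normal", "general") and the regular expressions are hypothetical, since this document leaves FALLBACK_ACTION open. The function returns a plain dict of fields only to stay dependency-free; in inference.py those fields would be passed to TriageAction.

```python
import re

# Hypothetical fallback values: the spec does not fix FALLBACK_ACTION.
FALLBACK_ACTION = {"label": "normal", "summary": "", "route_to": "general"}

# Strips a leading "action:" or "next action:" prefix, case-insensitively.
PREFIX_RE = re.compile(r"^\s*(?:next\s+)?action\s*:\s*", re.IGNORECASE)
# The four labels allowed by TriageAction.
LABEL_RE = re.compile(r"\b(urgent|normal|spam|archive)\b", re.IGNORECASE)
# Accepts route=x, route_to: x, and the JSON form "route_to": "x".
ROUTE_RE = re.compile(r"route(?:_to)?\"?\s*[=:]\s*\"?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def parse_response(response_text: str) -> dict[str, str]:
    """Extract action fields from free text, or return a copy of the fallback."""
    text = PREFIX_RE.sub("", response_text).strip()
    label_match = LABEL_RE.search(text)
    if label_match is None:
        # No recognisable label: never raise, always fall back.
        return dict(FALLBACK_ACTION)
    route_match = ROUTE_RE.search(text)
    route_to = route_match.group(1).lower() if route_match else FALLBACK_ACTION["route_to"]
    return {
        "label": label_match.group(1).lower(),
        "summary": text[:200],  # assumption: echo the cleaned text as the summary
        "route_to": route_to,
    }
```

An empty string, which is what the error-handling pattern produces after a failed model request, falls straight through to the fallback action.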
### History tracking
```python
history: list[str] = []
history_line = f"Step {step}: {action} -> reward {reward:+.2f}"
history.append(history_line)
```
### Error handling
```python
try:
    completion = client.chat.completions.create(...)
    response_text = completion.choices[0].message.content or ""
except Exception as exc:
    print(f"Model request failed ({exc}). Using fallback action.")
    response_text = ""
```
### Output format
```text
Episode: task_easy
Step 1: label=urgent, route=safety -> reward +0.85
Final score: 0.85
=== SCORE TABLE ===
Task Score Steps
task_easy 0.85 1
task_medium 0.62 5
task_hard 0.45 2
Mean 0.64
```
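
The score table could be produced along these lines. The function name and the column widths are assumptions; only the row order, the two-decimal scores, and the Mean row are taken from the sample output above.

```python
def format_score_table(results: dict[str, tuple[float, int]]) -> str:
    """Render per-task (score, steps) pairs plus the mean score as text."""
    lines = ["=== SCORE TABLE ===", f"{'Task':<12} {'Score':>5} {'Steps':>5}"]
    for task_id, (score, steps) in results.items():
        lines.append(f"{task_id:<12} {score:>5.2f} {steps:>5d}")
    # Guard against an empty result set to avoid dividing by zero.
    mean_score = sum(score for score, _ in results.values()) / len(results) if results else 0.0
    lines.append(f"{'Mean':<12} {mean_score:>5.2f}")
    return "\n".join(lines)
```

With the three sample scores above, the mean is (0.85 + 0.62 + 0.45) / 3 = 0.64, matching the last row of the sample table.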
### File naming and location
- File MUST be named inference.py
- MUST be in the project root directory
- MUST be runnable with python inference.py
- MUST complete in under 20 minutes
---
## 13. DOCKERFILE RULES
- Base image: python:3.11-slim
- WORKDIR: /app
- Copy requirements.txt first, pip install, then copy source
- EXPOSE 7860
- Create non-root user
- CMD starts the server
- Must build with --platform linux/amd64
- Must run within 2 vCPU / 8 GB memory
- No unnecessary system packages
- No CUDA/GPU dependencies
---
## 14. CODE STYLE RULES
- Python 3.11+
- Type hints on ALL function signatures
- Docstrings on ALL public functions (Google style)
- No single-letter variable names except i in loops
- Comments explain WHY, not WHAT
- Max line length: 100 characters
- f-strings only
- No wildcard imports
- Import order: stdlib -> third-party -> local
---
## 15. WHAT AI MUST NEVER DO
- Never add features not in this spec
- Never use an LLM inside a grader
- Never generate fake scores
- Never create a UI
- Never use randomness in the environment
- Never store API keys in code
- Never skip error handling in step()
- Never use bare dicts where Pydantic models are specified
- Never name the inference script baseline.py
- Never use OPENAI_API_KEY; use HF_TOKEN/API_KEY
- Never use response_format={"type": "json_object"} without text-parsing fallback
- Never return tuples from step/reset; use StepResult/ResetResult objects
- Never skip the fallback action pattern
- Never skip history tracking in inference
---
## 16. DEFINITION OF DONE - Per Phase Checklist
### Phase 1 complete when
- models.py exists with all 6 models (including StepResult, ResetResult)
- All fields match this document
- Models instantiate with sample data without errors
- StepResult has observation, reward, done, info attributes
### Phase 2 complete when
- tasks.py exists with 3 tasks
- All email data is realistic and hard-coded
- Ground truth exists for every email
- No answer leakage
### Phase 3 complete when
- graders.py has 3 pure grader functions
- Partial credit works
- All scores in [0.0, 1.0]
### Phase 4 complete when
- environment.py has EmailTriageEnv class
- reset() returns ResetResult
- step() returns StepResult
- step() handles invalid input without crashing
- Full episode runs to completion
### Phase 5 complete when
- server.py has /reset, /step, /state routes
- /reset returns {"observation": ..., "info": ...}
- /step returns {"observation": ..., "reward": ..., "done": ..., "info": ...}
- Malformed requests return 400
- Port 7860
### Phase 6 complete when
- inference.py follows sample architecture
- Uses os.getenv() for API_BASE_URL, HF_TOKEN/API_KEY, MODEL_NAME
- Has MAX_STEPS, TEMPERATURE, MAX_TOKENS, FALLBACK_ACTION constants
- Has history tracking
- Has response parsing with fallback
- Has try/except around API calls
- Prints score table
- Completes in under 20 minutes
### Phase 7-9
Unchanged from previous version.
---
## 17. WHEN IN DOUBT
- Re-read this file
- Re-read the project briefing
- Re-read the sample inference.py
- Match the sample patterns
- Choose the simpler option
- Ask the human, do not guess
This file is the law. Code that violates it gets deleted.