# RULES.md - Project Constitution & AI Guardrails
# OpenEnv Email Triage Environment

EVERY AI agent, copilot, or assistant working on this project MUST read and obey this file before generating ANY code.

REVISION 2: Updated based on sample inference.py analysis.
Where submission rules conflict with the original brief, SUBMISSION RULES WIN.
Where the sample script reveals patterns, MATCH THE PATTERNS.

## 0. GOLDEN RULE

> Do NOT generate code that you cannot explain line by line.
> Do NOT add features not listed in this document.
> Do NOT deviate from the file map, naming conventions, or interfaces defined here.
> When in doubt, do LESS, not more.

---

## 1. SCOPE - What This Project Is

- An OpenEnv-compliant AI agent training environment
- Domain: Email Triage (classify, prioritise, route emails)
- Deployed as a Docker-based Hugging Face Space
- Evaluated by inference.py using OpenAI Client with configurable endpoint

### What this project is NOT

- A chatbot
- A web app with a UI
- A game or toy problem
- A fine-tuning pipeline
- A multi-agent system
- An LLM wrapper with extra features
- A BrowserGym environment (the sample uses BrowserGym - we do NOT)

---

## 2. SUBMISSION CHECKLIST - DISQUALIFICATION CRITERIA

These are automated checks. Failing ANY ONE means disqualification.

| # | Check | What the validator does |
|---|---|---|
| 1 | HF Space deploys | Pings Space URL - must return HTTP 200 and respond to reset() |
| 2 | OpenEnv spec compliance | Validates openenv.yaml, typed models, /step, /reset, /state |
| 3 | Dockerfile builds | Runs docker build on the submitted repo - must succeed |
| 4 | Inference reproduces | Runs inference.py - must complete without error and produce scores |
| 5 | 3+ tasks with graders | Enumerates tasks, runs each grader, verifies scores in [0.0, 1.0] |
| 6 | Pre-validation script | Runs `./validate-submission.sh <ping_url> .` and expects all 3 checks to pass |

### 2.1 Mandatory pre-submit validation

- Before claiming "submission ready", run `./validate-submission.sh <ping_url> .` from repo root.
- If `<ping_url>` is unavailable, request it and block readiness claims until provided.
- Any AI assistant working on this repo must treat validator failure as a hard stop.

### Infrastructure constraints

| Constraint | Limit |
|---|---|
| vCPU | 2 |
| Memory | 8 GB |
| Inference runtime | < 20 minutes |

---

## 3. ENVIRONMENT VARIABLES - Mandatory

```python
import os

API_BASE_URL = os.getenv("API_BASE_URL")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
MODEL_NAME = os.getenv("MODEL_NAME")
```

How to use in code (EXACT PATTERN - matches sample):

```python
from openai import OpenAI

client = OpenAI(
    base_url=API_BASE_URL,
    api_key=API_KEY,
)

completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[...],
    temperature=0.2,
    max_tokens=200,
    stream=False,
)

response_text = completion.choices[0].message.content or ""
```

Rules:

- NEVER hard-code any of these values
- NEVER use os.environ["VAR"] (use os.getenv() - matches sample)
- NEVER use any LLM client other than openai.OpenAI
- Support both HF_TOKEN and API_KEY with an `or` fallback (matches sample)

---

## 4. FILE MAP - Strict Build Order

| Order | File | Purpose | May import from |
|---|---|---|---|
| 1st | models.py | Pydantic models + StepResult wrapper | stdlib, pydantic only |
| 2nd | tasks.py | Task definitions + hard-coded email data | models.py only |
| 3rd | graders.py | Deterministic grader functions | models.py, tasks.py only |
| 4th | environment.py | Core env class: step, reset, state | models, tasks, graders |
| 5th | server.py | Flask HTTP wrapper: /reset, /step, /state | environment.py, models.py |
| 6th | inference.py | OpenAI Client inference script | models.py, environment.py |
| 7th | openenv.yaml | Spec metadata | N/A (data file) |
| 8th | Dockerfile | Container build | N/A (config file) |
| 9th | requirements.txt | Pinned dependencies | N/A (config file) |
| 10th | README.md | Full documentation | N/A (documentation) |
| 11th | validate-submission.sh | Pre-submission validator script | N/A (shell script) |

### Rules about files

- Do NOT create files not listed above. No utils.py, helpers.py, or config.py.
- Do NOT merge files. Each file has one responsibility.
- Do NOT create subdirectories. All files live in the project root.
- Do NOT add `__init__.py`. This is not a package.

---

## 5. DEPENDENCY RULES

### Allowed dependencies

```txt
pydantic>=2.0,<3.0
flask>=3.0,<4.0
openai>=1.0,<2.0
gunicorn>=21.0,<23.0
```

### Conditionally allowed (only if needed)

```txt
numpy
Pillow
```

### Forbidden

- No LangChain, LlamaIndex, or any agent framework
- No pandas or scipy
- No database libraries
- No async frameworks (FastAPI, aiohttp) - use Flask
- No frontend frameworks (Streamlit, Gradio)
- No ML libraries (torch, transformers, sklearn)

---

## 6. PYDANTIC MODEL RULES

### models.py constraints

- ALL models MUST inherit from pydantic.BaseModel
- ALL fields MUST have explicit type annotations
- ALL Literal types MUST use typing.Literal with exhaustive values
- NO methods on models (except StepResult and ResetResult wrappers)
- NO validators that call external services
- NO default_factory that uses randomness
- Field names MUST be snake_case
- NO nested models deeper than 2 levels

### Required models (exact names)

```python
class EmailObservation(BaseModel): ...
class TriageAction(BaseModel): ...
class RewardResult(BaseModel): ...
class EnvironmentState(BaseModel): ...
class StepResult(BaseModel): ...
class ResetResult(BaseModel): ...
```

### StepResult and ResetResult interface (mandatory)

```python
class StepResult(BaseModel):
    observation: EmailObservation
    reward: float
    done: bool
    info: dict[str, str | int | float | bool]

class ResetResult(BaseModel):
    observation: EmailObservation
    info: dict[str, str | int | float | bool]
```

### EmailObservation required fields

| Field | Type | Required |
|---|---|---|
| email_id | str | Yes |
| subject | str | Yes |
| body | str | Yes |
| sender | str | Yes |
| timestamp | str | Yes |
| thread_history | list[str] | Yes |
| task_id | str | Yes |
| step_number | int | Yes |
| total_emails | int | Yes |

### TriageAction required fields

| Field | Type | Required |
|---|---|---|
| label | Literal["urgent", "normal", "spam", "archive"] | Yes |
| summary | str | Yes |
| route_to | str | Yes |

### RewardResult required fields

| Field | Type | Required |
|---|---|---|
| score | float | Yes |
| breakdown | dict[str, float] | Yes |
| feedback | str | Yes |

### EnvironmentState required fields

| Field | Type | Required |
|---|---|---|
| task_id | str | Yes |
| current_step | int | Yes |
| total_steps | int | Yes |
| done | bool | Yes |
| action_history | list | Yes |
| reward_history | list | Yes |

---

## 7. ENVIRONMENT CLASS RULES

- Class name: EmailTriageEnv
- Constructor: __init__(self, task_id: str)
- MUST accept a task_id string
- MUST NOT call any external API
- MUST NOT use randomness

### reset() -> ResetResult

- MUST return a ResetResult object (not a bare observation)
- result.observation must contain the first email
- MUST reset all internal state
- MUST be callable multiple times without side effects
- HF Space validator will call /reset and expect HTTP 200 + valid JSON

### step(action: TriageAction) -> StepResult

- MUST return a StepResult object (not a tuple)
- result.observation: next email or terminal observation
- result.reward: float score for this step
- result.done: bool indicating episode end
- result.info: metadata dict
- MUST never raise an exception from bad agent input
- If action validation fails: return StepResult with reward=0.0 and continue
- MUST increment step counter
- MUST set done=True when all emails processed or max_steps hit

### state() -> EnvironmentState

- MUST return the full current internal state
- MUST be read-only

### Hard rules for environment.py

- NO randomness
- NO API calls
- NO file I/O during step/reset/state
- NO global mutable state
- NO threading or async
- NO print statements

---

## 8. TASK DATA RULES

Unchanged from previous version.

- All email data MUST be hard-coded
- NO loading from external files, URLs, or databases
- Task IDs: task_easy, task_medium, task_hard
- Each task defines: task_id, description, emails, ground_truth
- Ground truth MUST NOT be in observations (no answer leakage)
- Realistic professional email content
- NO offensive or NSFW content
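
A minimal sketch of how one task entry could keep ground truth out of the observation data, assuming plain dicts as stand-ins for the real structures in tasks.py. The email content, the `easy-001` ID, the `GROUND_TRUTH_FIELDS` set, and the `check_task` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical task entry: emails and ground truth live in separate mappings,
# joined only by email_id, so nothing the agent sees contains the answer.
TASK_EASY = {
    "task_id": "task_easy",
    "description": "Classify a single unambiguous email.",
    "emails": [
        {
            "email_id": "easy-001",
            "subject": "Water leak in server room",
            "body": "There is water coming through the ceiling above rack 4.",
            "sender": "ops@example.com",
            "timestamp": "2024-01-15T09:30:00Z",
            "thread_history": [],
        }
    ],
    "ground_truth": {
        "easy-001": {"label": "urgent", "route_to": "facilities"},
    },
}

# Field names that must never appear on an email shown to the agent (assumed set).
GROUND_TRUTH_FIELDS = {"label", "route_to", "expected_label", "expected_route"}


def check_task(task: dict) -> list[str]:
    """Return a list of problems: leaked answer fields or missing ground truth."""
    problems: list[str] = []
    for email in task["emails"]:
        leaked = GROUND_TRUTH_FIELDS & set(email)
        if leaked:
            problems.append(f"{email['email_id']}: leaked fields {sorted(leaked)}")
        if email["email_id"] not in task["ground_truth"]:
            problems.append(f"{email['email_id']}: missing ground truth")
    return problems
```

Running `check_task(TASK_EASY)` returns an empty list for the entry above. A check like this could back the "No answer leakage" item in the Phase 2 checklist.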

---

## 9. GRADER RULES

Unchanged from previous version.

- Pure functions
- Deterministic
- Partial credit
- Scores in [0.0, 1.0]
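
The four properties above can be sketched as one function. This is an illustration under stated assumptions, not the project's grader: the 0.5 / 0.3 / 0.2 weights, the keyword-based summary check, and the `grade_action` name are hypothetical. It takes plain arguments to stay self-contained, whereas graders.py would receive a TriageAction.

```python
def grade_action(
    label: str,
    route_to: str,
    summary: str,
    expected_label: str,
    expected_route: str,
    expected_keywords: tuple[str, ...],
) -> float:
    """Score one triage decision with partial credit.

    Pure and deterministic: the result depends only on the arguments.
    The component weights are illustrative assumptions, not spec values.
    """
    label_score = 1.0 if label == expected_label else 0.0
    route_score = 1.0 if route_to.strip().lower() == expected_route.lower() else 0.0

    # Summary earns credit in proportion to the expected keywords it mentions.
    summary_lower = summary.lower()
    if expected_keywords:
        hits = sum(1 for keyword in expected_keywords if keyword.lower() in summary_lower)
        summary_score = hits / len(expected_keywords)
    else:
        summary_score = 0.0

    score = 0.5 * label_score + 0.3 * route_score + 0.2 * summary_score
    # Clamp defensively so the validator's [0.0, 1.0] check always holds.
    return max(0.0, min(1.0, score))
```

With these assumed weights, a wrong label with the correct route and a complete summary still earns about half credit, which is the partial-credit behaviour the rules ask for.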

---

## 10. REWARD FUNCTION RULES

Unchanged from previous version.

```text
final_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
```

Final reward is clipped to [-1.0, 1.0].
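
The formula and the clipping step can be written directly in Python. This is a minimal sketch: the function name and the zero defaults for `trajectory_bonus` and `penalties` are assumptions, since this document does not define how those two terms are computed.

```python
def compute_final_reward(
    base_score: float,
    step_count: int,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Apply the reward formula, then clip the result to [-1.0, 1.0]."""
    raw_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
    return max(-1.0, min(1.0, raw_reward))
```

For example, a base score of 0.85 after 5 steps with no bonus or penalties gives 0.85 - 0.05 = 0.80.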

---

## 11. SERVER RULES

### server.py constraints

- MUST use Flask
- Exactly THREE routes:
  - POST /reset: accepts {"task_id": str}, returns ResetResult JSON
  - POST /step: accepts TriageAction JSON, returns StepResult JSON
  - POST /state: returns EnvironmentState JSON
- MUST listen on port 7860
- MUST handle malformed JSON gracefully (return 400)
- All responses must include Content-Type: application/json
- Validator will ping and call /reset, which must return HTTP 200

### /step response format

```json
{
  "observation": {},
  "reward": 0.85,
  "done": false,
  "info": {"step": 1, "task_id": "task_easy"}
}
```

### /reset response format

```json
{
  "observation": {},
  "info": {"task_id": "task_easy"}
}
```

---

## 12. INFERENCE SCRIPT RULES

CRITICAL PATTERNS FROM SAMPLE - MUST FOLLOW

### Architecture (matches sample)

```text
1. Initialize OpenAI client with env vars
2. Create environment instance
3. Call reset(), get initial observation
4. Loop up to MAX_STEPS:
   a. Build prompt from observation + history
   b. Call LLM
   c. Parse response into action (with fallback)
   d. Call step(action)
   e. Record history
   f. Check done flag
5. Print results
```
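
Steps 3 to 4f above can be sketched as a loop with the model, parser, and environment passed in as callables. This is an illustration only: `StubStepResult` is a dataclass stand-in for the Pydantic StepResult, actions are plain strings rather than TriageAction objects, and `run_episode` and its parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

MAX_STEPS = 10


@dataclass
class StubStepResult:
    """Stand-in for the spec's StepResult (a Pydantic model in models.py)."""

    observation: str
    reward: float
    done: bool


def run_episode(
    first_observation: str,
    ask_model: Callable[[str], str],
    parse_action: Callable[[str], str],
    step: Callable[[str], StubStepResult],
) -> tuple[list[float], list[str]]:
    """Run one episode and return per-step rewards plus the history lines."""
    observation = first_observation
    rewards: list[float] = []
    history: list[str] = []
    for step_number in range(1, MAX_STEPS + 1):
        # (a) Build the prompt from prior history plus the current observation.
        prompt = "\n".join([*history, observation])
        # (b) Call the model; a failed request becomes an empty response.
        try:
            response_text = ask_model(prompt)
        except Exception:
            response_text = ""
        # (c) parse_action is expected to return the fallback on bad input.
        action = parse_action(response_text)
        # (d) Advance the environment.
        result = step(action)
        # (e) Record history.
        rewards.append(result.reward)
        history.append(f"Step {step_number}: {action} -> reward {result.reward:+.2f}")
        # (f) Stop as soon as the environment reports the episode is over.
        if result.done:
            break
        observation = result.observation
    return rewards, history
```

Injecting the callables keeps the loop testable without a live endpoint; in inference.py the same roles would be filled by the OpenAI client call and EmailTriageEnv.step.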

### Mandatory constants

```python
MAX_STEPS = 10
TEMPERATURE = 0.2
MAX_TOKENS = 200
FALLBACK_ACTION = ...
```

### Response parsing rules

- Do NOT rely only on response_format={"type": "json_object"}
- Parse free-text responses with regex or string matching
- If parsing fails, use a fallback action
- Strip prefixes like action: or next action: before parsing
- Regex parsing with fallback is preferred
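
A minimal sketch of regex parsing with a fallback, assuming the model replies with text such as `label=urgent, route=safety, summary=...`. That reply format, the regex patterns, and the fallback values are assumptions; this document leaves FALLBACK_ACTION unspecified. The function returns a plain dict to stay self-contained, whereas inference.py would build a TriageAction from the parsed fields.

```python
import re

# Hypothetical fallback values; the spec does not define FALLBACK_ACTION.
FALLBACK_ACTION = {"label": "normal", "summary": "", "route_to": "general"}

# Strips leading "action:" or "next action:" prefixes, case-insensitively.
PREFIX_RE = re.compile(r"^\s*(?:next\s+action|action)\s*:\s*", re.IGNORECASE)
LABEL_RE = re.compile(r"\b(urgent|normal|spam|archive)\b", re.IGNORECASE)
ROUTE_RE = re.compile(r"\broute(?:_to)?\s*[=:]\s*([A-Za-z0-9_\-]+)", re.IGNORECASE)
SUMMARY_RE = re.compile(r"\bsummary\s*[=:]\s*(.+)", re.IGNORECASE)


def parse_action(response_text: str) -> dict[str, str]:
    """Extract label, route, and summary from free text, or return the fallback."""
    text = PREFIX_RE.sub("", response_text.strip())
    label_match = LABEL_RE.search(text)
    if label_match is None:
        # No recognisable label: return a copy so callers cannot mutate the constant.
        return dict(FALLBACK_ACTION)
    route_match = ROUTE_RE.search(text)
    summary_match = SUMMARY_RE.search(text)
    return {
        "label": label_match.group(1).lower(),
        "summary": summary_match.group(1).strip() if summary_match else "",
        "route_to": route_match.group(1) if route_match else FALLBACK_ACTION["route_to"],
    }
```

Because the parser never raises on unexpected text, an empty `response_text` from a failed model request flows straight into the fallback action.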

### History tracking

```python
history: list[str] = []
history_line = f"Step {step}: {action} -> reward {reward:+.2f}"
history.append(history_line)
```

### Error handling

```python
try:
    completion = client.chat.completions.create(...)
    response_text = completion.choices[0].message.content or ""
except Exception as exc:
    print(f"Model request failed ({exc}). Using fallback action.")
    response_text = ""
```

### Output format

```text
Episode: task_easy
Step 1: label=urgent, route=safety -> reward +0.85
Final score: 0.85

=== SCORE TABLE ===
Task         Score    Steps
task_easy    0.85     1
task_medium  0.62     5
task_hard    0.45     2
Mean         0.64
```
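
The score table above can be produced with f-string padding. This is a sketch: the column widths are inferred from the sample output, and `format_score_table` with its `(task_id, score, steps)` tuples is a hypothetical helper, not part of the spec.

```python
def format_score_table(results: list[tuple[str, float, int]]) -> str:
    """Render (task_id, score, steps) rows in the layout of the sample table."""
    lines = ["=== SCORE TABLE ===", f"{'Task':<12} {'Score':<8} Steps"]
    for task_id, score, steps in results:
        lines.append(f"{task_id:<12} {score:<8.2f} {steps}")
    # Mean over task scores; an empty result list reports 0.00 rather than failing.
    mean_score = sum(row[1] for row in results) / len(results) if results else 0.0
    lines.append(f"{'Mean':<12} {mean_score:.2f}")
    return "\n".join(lines)
```

Calling `print(format_score_table([("task_easy", 0.85, 1), ("task_medium", 0.62, 5), ("task_hard", 0.45, 2)]))` reproduces the table shown above, including the mean of 0.64.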

### File naming and location

- File MUST be named inference.py
- MUST be in the project root directory
- MUST be runnable with python inference.py
- MUST complete in under 20 minutes

---

## 13. DOCKERFILE RULES

- Base image: python:3.11-slim
- WORKDIR: /app
- Copy requirements.txt first, pip install, then copy source
- EXPOSE 7860
- Create non-root user
- CMD starts the server
- Must build with --platform linux/amd64
- Must run within 2 vCPU / 8 GB memory
- No unnecessary system packages
- No CUDA/GPU dependencies

---

## 14. CODE STYLE RULES

- Python 3.11+
- Type hints on ALL function signatures
- Docstrings on ALL public functions (Google style)
- No single-letter variable names except i in loops
- Comments explain WHY, not WHAT
- Max line length: 100 characters
- f-strings only
- No wildcard imports
- Import order: stdlib -> third-party -> local

---

## 15. WHAT AI MUST NEVER DO

- Never add features not in this spec
- Never use an LLM inside a grader
- Never generate fake scores
- Never create a UI
- Never use randomness in the environment
- Never store API keys in code
- Never skip error handling in step()
- Never use bare dicts where Pydantic models are specified
- Never name the inference script baseline.py
- Never use OPENAI_API_KEY; use HF_TOKEN/API_KEY
- Never use response_format={"type": "json_object"} without text-parsing fallback
- Never return tuples from step/reset; use StepResult/ResetResult objects
- Never skip the fallback action pattern
- Never skip history tracking in inference

---

## 16. DEFINITION OF DONE - Per Phase Checklist

### Phase 1 complete when

- models.py exists with all 6 models (including StepResult, ResetResult)
- All fields match this document
- Models instantiate with sample data without errors
- StepResult has observation, reward, done, info attributes

### Phase 2 complete when

- tasks.py exists with 3 tasks
- All email data is realistic and hard-coded
- Ground truth exists for every email
- No answer leakage

### Phase 3 complete when

- graders.py has 3 pure grader functions
- Partial credit works
- All scores in [0.0, 1.0]

### Phase 4 complete when

- environment.py has EmailTriageEnv class
- reset() returns ResetResult
- step() returns StepResult
- step() handles invalid input without crashing
- Full episode runs to completion

### Phase 5 complete when

- server.py has /reset, /step, /state routes
- /reset returns {"observation": ..., "info": ...}
- /step returns {"observation": ..., "reward": ..., "done": ..., "info": ...}
- Malformed requests return 400
- Port 7860

### Phase 6 complete when

- inference.py follows sample architecture
- Uses os.getenv() for API_BASE_URL, HF_TOKEN/API_KEY, MODEL_NAME
- Has MAX_STEPS, TEMPERATURE, MAX_TOKENS, FALLBACK_ACTION constants
- Has history tracking
- Has response parsing with fallback
- Has try/except around API calls
- Prints score table
- Completes in under 20 minutes

### Phase 7-9

Unchanged from previous version.

---

## 17. WHEN IN DOUBT

- Re-read this file
- Re-read the project briefing
- Re-read the sample inference.py
- Match the sample patterns
- Choose the simpler option
- Ask the human, do not guess

This file is the law. Code that violates it gets deleted.