	# RULES.md - Project Constitution & AI Guardrails
	# OpenEnv Email Triage Environment

	EVERY AI agent, copilot, or assistant working on this project MUST read and obey this file before generating ANY code.

	REVISION 2: Updated based on sample inference.py analysis.
	Where submission rules conflict with the original brief, SUBMISSION RULES WIN.
	Where the sample script reveals patterns, MATCH THE PATTERNS.

	## 0. GOLDEN RULE

	> Do NOT generate code that you cannot explain line by line.
	> Do NOT add features not listed in this document.
	> Do NOT deviate from the file map, naming conventions, or interfaces defined here.
	> When in doubt, do LESS, not more.

	---

	## 1. SCOPE - What This Project Is

	- An OpenEnv-compliant AI agent training environment
	- Domain: Email Triage (classify, prioritise, route emails)
	- Deployed as a Docker-based Hugging Face Space
	- Evaluated by inference.py using OpenAI Client with configurable endpoint

	### What this project is NOT

	- A chatbot
	- A web app with a UI
	- A game or toy problem
	- A fine-tuning pipeline
	- A multi-agent system
	- An LLM wrapper with extra features
	- A BrowserGym environment (the sample uses BrowserGym - we do NOT)

	---

	## 2. SUBMISSION CHECKLIST - DISQUALIFICATION CRITERIA

	These are automated checks. Failing ANY ONE means disqualification.

	| # | Check | What the validator does |
	|---|---|---|
	| 1 | HF Space deploys | Pings Space URL - must return HTTP 200 and respond to reset() |
	| 2 | OpenEnv spec compliance | Validates openenv.yaml, typed models, /step, /reset, /state |
	| 3 | Dockerfile builds | Runs docker build on the submitted repo - must succeed |
	| 4 | Inference reproduces | Runs inference.py - must complete without error and produce scores |
	| 5 | 3+ tasks with graders | Enumerates tasks, runs each grader, verifies scores in [0.0, 1.0] |
	| 6 | Pre-validation script | Runs `./validate-submission.sh <ping_url> .` and expects all 3 checks to pass |

	### 2.1 Mandatory pre-submit validation

	- Before claiming "submission ready", run `./validate-submission.sh <ping_url> .` from repo root.
	- If `<ping_url>` is unavailable, request it and block readiness claims until provided.
	- Any AI assistant working on this repo must treat validator failure as a hard stop.

	### Infrastructure constraints

	| Constraint | Limit |
	|---|---|
	| vCPU | 2 |
	| Memory | 8 GB |
	| Inference runtime | < 20 minutes |

	---

	## 3. ENVIRONMENT VARIABLES - Mandatory

	```python
	import os

	API_BASE_URL = os.getenv("API_BASE_URL")
	API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
	MODEL_NAME = os.getenv("MODEL_NAME")
	```

	How to use in code (EXACT PATTERN - matches sample):

	```python
	from openai import OpenAI

	client = OpenAI(
	    base_url=API_BASE_URL,
	    api_key=API_KEY,
	)

	completion = client.chat.completions.create(
	    model=MODEL_NAME,
	    messages=[...],
	    temperature=0.2,
	    max_tokens=200,
	    stream=False,
	)

	response_text = completion.choices[0].message.content or ""
	```

	Rules:

	- NEVER hard-code any of these values
	- NEVER use os.environ["VAR"] (use os.getenv() - matches sample)
	- NEVER use any LLM client other than openai.OpenAI
	- Support both HF_TOKEN and API_KEY with an `or` fallback (matches sample)

	---

	## 4. FILE MAP - Strict Build Order

	| Order | File | Purpose | May import from |
	|---|---|---|---|
	| 1st | models.py | Pydantic models + StepResult wrapper | stdlib, pydantic only |
	| 2nd | tasks.py | Task definitions + hard-coded email data | models.py only |
	| 3rd | graders.py | Deterministic grader functions | models.py, tasks.py only |
	| 4th | environment.py | Core env class: step, reset, state | models, tasks, graders |
	| 5th | server.py | Flask HTTP wrapper: /reset, /step, /state | environment.py, models.py |
	| 6th | inference.py | OpenAI Client inference script | models.py, environment.py |
	| 7th | openenv.yaml | Spec metadata | N/A (data file) |
	| 8th | Dockerfile | Container build | N/A (config file) |
	| 8th | requirements.txt | Pinned dependencies | N/A (config file) |
	| 9th | README.md | Full documentation | N/A (documentation) |
	| 10th | validate-submission.sh | Pre-submission validator script | N/A (shell script) |

	### Rules about files

	- Do NOT create files not listed above. No utils.py, helpers.py, or config.py.
	- Do NOT merge files. Each file has one responsibility.
	- Do NOT create subdirectories. All files live in the project root.
	- Do NOT add `__init__.py`. This is not a package.

	---

	## 5. DEPENDENCY RULES

	### Allowed dependencies

	```txt
	pydantic>=2.0,<3.0
	flask>=3.0,<4.0
	openai>=1.0,<2.0
	gunicorn>=21.0,<23.0
	```

	### Conditionally allowed (only if needed)

	```txt
	numpy
	Pillow
	```

	### Forbidden

	- No LangChain, LlamaIndex, or any agent framework
	- No pandas or scipy
	- No database libraries
	- No async frameworks (FastAPI, aiohttp) - use Flask
	- No frontend frameworks (Streamlit, Gradio)
	- No ML libraries (torch, transformers, sklearn)

	---

	## 6. PYDANTIC MODEL RULES

	### models.py constraints

	- ALL models MUST inherit from pydantic.BaseModel
	- ALL fields MUST have explicit type annotations
	- ALL Literal types MUST use typing.Literal with exhaustive values
	- NO methods on models (except StepResult and ResetResult wrappers)
	- NO validators that call external services
	- NO default_factory that uses randomness
	- Field names MUST be snake_case
	- NO nested models deeper than 2 levels

	### Required models (exact names)

	```python
	class EmailObservation(BaseModel): ...
	class TriageAction(BaseModel): ...
	class RewardResult(BaseModel): ...
	class EnvironmentState(BaseModel): ...
	class StepResult(BaseModel): ...
	class ResetResult(BaseModel): ...
	```

	### StepResult and ResetResult interface (mandatory)

	```python
	from pydantic import BaseModel

	# EmailObservation is defined earlier in models.py.


	class StepResult(BaseModel):
	    observation: EmailObservation
	    reward: float
	    done: bool
	    info: dict[str, str | int | float | bool]


	class ResetResult(BaseModel):
	    observation: EmailObservation
	    info: dict[str, str | int | float | bool]
	```

	### EmailObservation required fields

	| Field | Type | Required |
	|---|---|---|
	| email_id | str | Yes |
	| subject | str | Yes |
	| body | str | Yes |
	| sender | str | Yes |
	| timestamp | str | Yes |
	| thread_history | list[str] | Yes |
	| task_id | str | Yes |
	| step_number | int | Yes |
	| total_emails | int | Yes |

	### TriageAction required fields

	| Field | Type | Required |
	|---|---|---|
	| label | Literal["urgent", "normal", "spam", "archive"] | Yes |
	| summary | str | Yes |
	| route_to | str | Yes |

	### RewardResult required fields

	| Field | Type | Required |
	|---|---|---|
	| score | float | Yes |
	| breakdown | dict[str, float] | Yes |
	| feedback | str | Yes |

	### EnvironmentState required fields

	| Field | Type | Required |
	|---|---|---|
	| task_id | str | Yes |
	| current_step | int | Yes |
	| total_steps | int | Yes |
	| done | bool | Yes |
	| action_history | list | Yes |
	| reward_history | list | Yes |

	---

	## 7. ENVIRONMENT CLASS RULES

	- Class name: EmailTriageEnv
	- Constructor: __init__(self, task_id: str)
	- MUST accept a task_id string
	- MUST NOT call any external API
	- MUST NOT use randomness

	### reset() -> ResetResult

	- MUST return a ResetResult object (not a bare observation)
	- result.observation must contain the first email
	- MUST reset all internal state
	- MUST be callable multiple times without side effects
	- HF Space validator will call /reset and expect HTTP 200 + valid JSON

	### step(action: TriageAction) -> StepResult

	- MUST return a StepResult object (not a tuple)
	- result.observation: next email or terminal observation
	- result.reward: float score for this step
	- result.done: bool indicating episode end
	- result.info: metadata dict
	- MUST never raise an exception from bad agent input
	- If action validation fails: return StepResult with reward=0.0 and continue
	- MUST increment step counter
	- MUST set done=True when all emails processed or max_steps hit

	### state() -> EnvironmentState

	- MUST return the full current internal state
	- MUST be read-only

	### Hard rules for environment.py

	- NO randomness
	- NO API calls
	- NO file I/O during step/reset/state
	- NO global mutable state
	- NO threading or async
	- NO print statements

	---

	## 8. TASK DATA RULES

	Unchanged from previous version.

	- All email data MUST be hard-coded
	- NO loading from external files, URLs, or databases
	- Task IDs: task_easy, task_medium, task_hard
	- Each task defines: task_id, description, emails, ground_truth
	- Ground truth MUST NOT be in observations (no answer leakage)
	- Realistic professional email content
	- NO offensive or NSFW content

	---

	## 9. GRADER RULES

	Unchanged from previous version.

	- Pure functions
	- Deterministic
	- Partial credit
	- Scores in [0.0, 1.0]
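The four properties above can be sketched as a single function. The name `grade_action`, the plain string parameters, and the 0.6 / 0.4 weights are illustrative assumptions, not values taken from this spec.

```python
def grade_action(
    label: str,
    route_to: str,
    expected_label: str,
    expected_route: str,
) -> float:
    """Score one triage decision with partial credit (hypothetical grader).

    Args:
        label: Label chosen by the agent.
        route_to: Destination chosen by the agent.
        expected_label: Ground-truth label.
        expected_route: Ground-truth destination.

    Returns:
        A score in [0.0, 1.0].
    """
    label_weight = 0.6  # assumed weight, not from the spec
    route_weight = 0.4  # assumed weight, not from the spec
    score = 0.0
    if label == expected_label:
        score += label_weight
    if route_to.strip().lower() == expected_route.strip().lower():
        score += route_weight
    # Clamp defensively so the validator's range check always holds.
    return max(0.0, min(1.0, score))
```

The function reads no global state and calls nothing external, so the same inputs always give the same score.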

	---

	## 10. REWARD FUNCTION RULES

	Unchanged from previous version.

	```text
	final_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
	```

	Final reward is clipped to [-1.0, 1.0].
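
A minimal sketch of the formula and clipping rule above. The function name `compute_final_reward` and the zero defaults for the bonus and penalty terms are assumptions made for illustration.

```python
def compute_final_reward(
    base_score: float,
    step_count: int,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Apply the reward formula, then clip to [-1.0, 1.0].

    Args:
        base_score: Grader score for the step.
        step_count: Number of steps taken so far.
        trajectory_bonus: Bonus term from the formula.
        penalties: Penalty term from the formula.

    Returns:
        The clipped final reward.
    """
    raw_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
    return max(-1.0, min(1.0, raw_reward))
```

For example, a base score of 0.85 after one step with no bonus or penalty gives 0.85 - 0.01 = 0.84.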

	---

	## 11. SERVER RULES

	### server.py constraints

	- MUST use Flask
	- Exactly THREE routes:
	  - POST /reset: accepts {"task_id": str}, returns ResetResult JSON
	  - POST /step: accepts TriageAction JSON, returns StepResult JSON
	  - POST /state: returns EnvironmentState JSON
	- MUST listen on port 7860
	- MUST handle malformed JSON gracefully (return 400)
	- All responses must include Content-Type: application/json
	- Validator will ping and call /reset, which must return HTTP 200

	### /step response format

	```json
	{
	"observation": {},
	"reward": 0.85,
	"done": false,
	"info": {"step": 1, "task_id": "task_easy"}
	}
	```

	### /reset response format

	```json
	{
	"observation": {},
	"info": {"task_id": "task_easy"}
	}
	```

	---

	## 12. INFERENCE SCRIPT RULES

	CRITICAL PATTERNS FROM SAMPLE - MUST FOLLOW

	### Architecture (matches sample)

	```text
	1. Initialize OpenAI client with env vars
	2. Create environment instance
	3. Call reset(), get initial observation
	4. Loop up to MAX_STEPS:
	   a. Build prompt from observation + history
	   b. Call LLM
	   c. Parse response into action (with fallback)
	   d. Call step(action)
	   e. Record history
	   f. Check done flag
	5. Print results
	```
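
The loop in steps 3 and 4 can be sketched as follows. `run_episode`, the `choose_action` callback (standing in for prompt building, the LLM call, and parsing), and `DemoEnv` are all hypothetical names; `DemoEnv` is only a two-email stand-in so the loop can run without the real environment.

```python
from collections.abc import Callable
from types import SimpleNamespace
from typing import Any

MAX_STEPS = 10


def run_episode(
    env: Any,
    choose_action: Callable[[Any, list[str]], Any],
    max_steps: int = MAX_STEPS,
) -> tuple[float, list[str]]:
    """Run one episode and return (total reward, history lines)."""
    history: list[str] = []
    total_reward = 0.0
    observation = env.reset().observation
    for step in range(1, max_steps + 1):
        # Stands in for: build prompt, call LLM, parse with fallback.
        action = choose_action(observation, history)
        result = env.step(action)
        total_reward += result.reward
        history.append(f"Step {step}: {action} -> reward {result.reward:+.2f}")
        observation = result.observation
        if result.done:
            break
    return total_reward, history


class DemoEnv:
    """Two-email stand-in for EmailTriageEnv, used only to exercise the loop."""

    def __init__(self) -> None:
        self.step_count = 0

    def reset(self) -> SimpleNamespace:
        self.step_count = 0
        return SimpleNamespace(observation="email-1", info={})

    def step(self, action: Any) -> SimpleNamespace:
        self.step_count += 1
        return SimpleNamespace(
            observation=f"email-{self.step_count + 1}",
            reward=0.5,
            done=self.step_count >= 2,
            info={},
        )
```

The loop stops on whichever comes first: the `done` flag or the step cap, so a misbehaving environment cannot run past `max_steps`.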

	### Mandatory constants

	```python
	MAX_STEPS = 10
	TEMPERATURE = 0.2
	MAX_TOKENS = 200
	FALLBACK_ACTION = ...
	```

	### Response parsing rules

	- Do NOT rely only on response_format={"type": "json_object"}
	- Parse free-text responses with regex or string matching
	- If parsing fails, use a fallback action
	- Strip prefixes like `action:` or `next action:` before parsing
	- Regex parsing with fallback is preferred
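
A minimal sketch of regex parsing with a fallback, following the rules above. The `label=` / `route=` reply layout, the regular expressions, and the contents of `FALLBACK_ACTION` are assumptions; the spec leaves the fallback value unspecified. The result is a plain dict of fields that would still need to be turned into a `TriageAction`.

```python
import re

# Hypothetical fallback; the spec does not define its contents.
FALLBACK_ACTION = {"label": "normal", "summary": "", "route_to": "general"}

PREFIX_RE = re.compile(r"^\s*(?:next\s+action|action)\s*:\s*", re.IGNORECASE)
LABEL_RE = re.compile(
    r"label\"?\s*[=:]\s*\"?(urgent|normal|spam|archive)\b", re.IGNORECASE
)
ROUTE_RE = re.compile(r"route(?:_to)?\"?\s*[=:]\s*\"?([\w-]+)", re.IGNORECASE)


def parse_response(response_text: str) -> dict[str, str]:
    """Extract action fields from free text, or return the fallback.

    Args:
        response_text: Raw model output; may be empty.

    Returns:
        Dict with label, summary, and route_to keys.
    """
    text = PREFIX_RE.sub("", response_text or "").strip()
    label_match = LABEL_RE.search(text)
    if label_match is None:
        # No recognisable label: use the fallback rather than guessing.
        return dict(FALLBACK_ACTION)
    route_match = ROUTE_RE.search(text)
    route = route_match.group(1).lower() if route_match else FALLBACK_ACTION["route_to"]
    return {
        "label": label_match.group(1).lower(),
        "summary": text[:200],
        "route_to": route,
    }
```

Anchoring the label pattern to a `label=` or `label:` key avoids picking up a label word that merely appears in the model's reasoning, such as "this is not spam".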

	### History tracking

	```python
	history: list[str] = []
	history_line = f"Step {step}: {action} -> reward {reward:+.2f}"
	history.append(history_line)
	```

	### Error handling

	```python
	try:
	    completion = client.chat.completions.create(...)
	    response_text = completion.choices[0].message.content or ""
	except Exception as exc:
	    print(f"Model request failed ({exc}). Using fallback action.")
	    response_text = ""
	```

	### Output format

	```text
	Episode: task_easy
	Step 1: label=urgent, route=safety -> reward +0.85
	Final score: 0.85

	=== SCORE TABLE ===
	Task Score Steps
	task_easy 0.85 1
	task_medium 0.62 5
	task_hard 0.45 2
	Mean 0.64
	```
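
The score table above could be produced by a small formatter like the one below. The function name `format_score_table` and the column widths are assumptions; the sample output does not fix the exact spacing.

```python
def format_score_table(results: list[tuple[str, float, int]]) -> str:
    """Render the score table as text (hypothetical helper).

    Args:
        results: One (task_id, score, steps) tuple per task.

    Returns:
        The table as a newline-joined string.
    """
    lines = ["=== SCORE TABLE ===", f"{'Task':<12} {'Score':>5} {'Steps':>5}"]
    for task_id, score, steps in results:
        lines.append(f"{task_id:<12} {score:>5.2f} {steps:>5d}")
    # Guard against an empty result list to avoid dividing by zero.
    mean = sum(score for _, score, _ in results) / len(results) if results else 0.0
    lines.append(f"{'Mean':<12} {mean:>5.2f}")
    return "\n".join(lines)
```

Calling it with the three sample rows and printing the result reproduces the sample values, including the mean of 0.64.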

	### File naming and location

	- File MUST be named inference.py
	- MUST be in the project root directory
	- MUST be runnable with python inference.py
	- MUST complete in under 20 minutes

	---

	## 13. DOCKERFILE RULES

	- Base image: python:3.11-slim
	- WORKDIR: /app
	- Copy requirements.txt first, pip install, then copy source
	- EXPOSE 7860
	- Create non-root user
	- CMD starts the server
	- Must build with --platform linux/amd64
	- Must run within 2 vCPU / 8 GB memory
	- No unnecessary system packages
	- No CUDA/GPU dependencies

	---

	## 14. CODE STYLE RULES

	- Python 3.11+
	- Type hints on ALL function signatures
	- Docstrings on ALL public functions (Google style)
	- No single-letter variable names except i in loops
	- Comments explain WHY, not WHAT
	- Max line length: 100 characters
	- f-strings only
	- No wildcard imports
	- Import order: stdlib -> third-party -> local

	---

	## 15. WHAT AI MUST NEVER DO

	- Never add features not in this spec
	- Never use an LLM inside a grader
	- Never generate fake scores
	- Never create a UI
	- Never use randomness in the environment
	- Never store API keys in code
	- Never skip error handling in step()
	- Never use bare dicts where Pydantic models are specified
	- Never name the inference script baseline.py
	- Never use OPENAI_API_KEY; use HF_TOKEN/API_KEY
	- Never use response_format={"type": "json_object"} without text-parsing fallback
	- Never return tuples from step/reset; use StepResult/ResetResult objects
	- Never skip the fallback action pattern
	- Never skip history tracking in inference

	---

	## 16. DEFINITION OF DONE - Per Phase Checklist

	### Phase 1 complete when

	- models.py exists with all 6 models (including StepResult, ResetResult)
	- All fields match this document
	- Models instantiate with sample data without errors
	- StepResult has observation, reward, done, info attributes

	### Phase 2 complete when

	- tasks.py exists with 3 tasks
	- All email data is realistic and hard-coded
	- Ground truth exists for every email
	- No answer leakage

	### Phase 3 complete when

	- graders.py has 3 pure grader functions
	- Partial credit works
	- All scores in [0.0, 1.0]

	### Phase 4 complete when

	- environment.py has EmailTriageEnv class
	- reset() returns ResetResult
	- step() returns StepResult
	- step() handles invalid input without crashing
	- Full episode runs to completion

	### Phase 5 complete when

	- server.py has /reset, /step, /state routes
	- /reset returns {"observation": ..., "info": ...}
	- /step returns {"observation": ..., "reward": ..., "done": ..., "info": ...}
	- Malformed requests return 400
	- Port 7860

	### Phase 6 complete when

	- inference.py follows sample architecture
	- Uses os.getenv() for API_BASE_URL, HF_TOKEN/API_KEY, MODEL_NAME
	- Has MAX_STEPS, TEMPERATURE, MAX_TOKENS, FALLBACK constants
	- Has history tracking
	- Has response parsing with fallback
	- Has try/except around API calls
	- Prints score table
	- Completes in under 20 minutes

	### Phase 7-9

	Unchanged from previous version.

	---

	## 17. WHEN IN DOUBT

	- Re-read this file
	- Re-read the project briefing
	- Re-read the sample inference.py
	- Match the sample patterns
	- Choose the simpler option
	- Ask the human, do not guess

	This file is the law. Code that violates it gets deleted.