ConfigDebuggerEnv
ConfigDebuggerEnv is a real-world OpenEnv environment for iterative configuration debugging. It simulates tasks that platform engineers and ML engineers face in production: fixing Docker Compose, Kubernetes, and training configuration mistakes under step limits.
Why this environment
Configuration bugs are expensive and common in real systems. They are often syntactically valid YAML but semantically wrong (type mismatches, missing units, violated interdependent constraints). This environment provides dense per-step rewards so an agent can learn corrective behaviors instead of relying only on terminal success/failure.
OpenEnv API
The server exposes the standard lifecycle:
- POST /reset
- POST /step
- GET /state
Typed models
- Action model: ConfigAction
- Observation model: ConfigObservation
- Reward model: ConfigReward
- State model: EnvState
Models are defined in server/models.py and validated with Pydantic.
Action space
ConfigAction fields:
- operation: edit | add | delete
- path: dot path with optional list indexes (example: spec.template.spec.containers.0.image)
- value: JSON-serializable payload for edit/add
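A ConfigAction's dot path can be resolved with a small helper like the following (a hypothetical `apply_edit` sketch for illustration; the server's actual implementation may differ):

```python
def apply_edit(config, path, value):
    """Apply an 'edit'/'add' at a dot path, treating numeric segments as list indexes."""
    parts = path.split(".")
    node = config
    for part in parts[:-1]:
        node = node[int(part)] if part.isdigit() else node[part]
    last = parts[-1]
    if last.isdigit():
        node[int(last)] = value
    else:
        node[last] = value
    return config

# Example: point a container at a new image
cfg = {"spec": {"template": {"spec": {"containers": [{"image": "app:v1"}]}}}}
apply_edit(cfg, "spec.template.spec.containers.0.image", "app:v2")
```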
Observation space
ConfigObservation fields:
- task_id
- task_description
- current_config (YAML string)
- syntax_valid
- validation_errors
- schema_score (0.0 to 1.0)
- logic_score (0.0 to 1.0)
- overall_score (0.0 to 1.0)
- step_count
- max_steps
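For illustration only, a mid-episode observation might serialize like this (all values are invented, not taken from the actual tasks):

```json
{
  "task_id": "easy_docker",
  "task_description": "Fix the broken Docker Compose file so the web service starts.",
  "current_config": "services:\n  web:\n    image: nginx:latest\n    ports:\n      - \"8080:80\"\n",
  "syntax_valid": true,
  "validation_errors": [],
  "schema_score": 0.67,
  "logic_score": 0.5,
  "overall_score": 0.6,
  "step_count": 3,
  "max_steps": 15
}
```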
Tasks and graders
Three deterministic tasks are included:
- easy_docker (easy)
- medium_k8s (medium)
- hard_ml_config (hard)
Each task has:
- A broken starting configuration
- A target configuration
- Weighted required paths for schema grading
- Deterministic logic checks
Grading always returns normalized values in [0.0, 1.0].
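Weighted schema grading of this kind can be sketched as a match over required dot paths (hypothetical `schema_score` helper; the real weights and path sets are task-specific):

```python
def schema_score(config, target, weighted_paths):
    """Fraction of weighted required paths whose current value matches the target."""
    def lookup(node, path):
        for part in path.split("."):
            node = node[int(part)] if part.isdigit() else node[part]
        return node

    total = sum(weighted_paths.values())
    earned = 0.0
    for path, weight in weighted_paths.items():
        try:
            if lookup(config, path) == lookup(target, path):
                earned += weight
        except (KeyError, IndexError, TypeError, ValueError):
            pass  # a missing or malformed path earns zero
    return earned / total if total else 1.0

cfg = {"services": {"web": {"ports": ["8080:80"], "restart": "always"}}}
tgt = {"services": {"web": {"ports": ["80:80"], "restart": "always"}}}
score = schema_score(cfg, tgt, {"services.web.ports.0": 2.0, "services.web.restart": 1.0})
```

Because the result is earned weight over total weight, the score is always normalized to [0.0, 1.0].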
Reward design
The reward is dense, shaping progress across the episode with bonuses and penalties:
- Base reward is current overall score
- Positive delta bonus on improvement
- Regression penalty on negative delta
- Loop penalty for repeated states
- Penalty for invalid actions
- Penalty for destructive top-level deletes
- Small completion bonus when solved
This creates meaningful signals across the full episode, not only at termination.
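The shaping described above can be sketched as a single function (the weights here are illustrative only; the environment's actual constants may differ):

```python
def step_reward(prev_score, score, invalid=False, looped=False,
                destructive=False, solved=False):
    """Dense per-step reward: base score plus shaped bonuses and penalties."""
    delta = score - prev_score
    reward = score                  # base reward: current overall score
    if delta > 0:
        reward += 0.5 * delta       # positive delta bonus on improvement
    elif delta < 0:
        reward += delta             # regression penalty (delta is negative)
    if looped:
        reward -= 0.05              # loop penalty for repeated states
    if invalid:
        reward -= 0.1               # invalid-action penalty
    if destructive:
        reward -= 0.2               # destructive top-level delete penalty
    if solved:
        reward += 0.1               # small completion bonus
    return reward
```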
Project structure
- openenv.yaml
- Dockerfile
- requirements.txt
- inference.py
- server/
- data.py
- env.py
- main.py
- models.py
Local setup
- Install dependencies
pip install -r requirements.txt
- Run server
python -m uvicorn server.main:app --host 0.0.0.0 --port 8000 --reload
- Quick API check
curl -X POST "http://localhost:8000/reset" -H "Content-Type: application/json" -d "{\"task_id\":\"easy_docker\"}"
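The same lifecycle can be driven from Python using only the standard library. This is a minimal sketch; the exact step payload shape is an assumption here, so check server/models.py for the authoritative schema:

```python
import json
import urllib.request

API = "http://localhost:8000"

def call(endpoint, payload=None):
    """POST a JSON payload to the environment server and decode the JSON response."""
    data = json.dumps(payload or {}).encode()
    req = urllib.request.Request(f"{API}{endpoint}", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def make_action(operation, path, value=None):
    """Build a ConfigAction payload (value only applies to edit/add)."""
    action = {"operation": operation, "path": path}
    if value is not None:
        action["value"] = value
    return action

# With the server running:
# obs = call("/reset", {"task_id": "easy_docker"})
# obs = call("/step", {"action": make_action("edit", "services.web.ports.0", "80:80")})
```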
Baseline inference
Heuristic baseline (fully reproducible):
python inference.py --policy heuristic --api-base-url http://localhost:8000 --seed 42
OpenAI baseline (uses OpenAI Python client and OPENAI_API_KEY):
export OPENAI_API_KEY=your_key_here   (on Windows: set OPENAI_API_KEY=your_key_here)
python inference.py --policy openai --model gpt-4o-mini --api-base-url http://localhost:8000 --seed 42
The script evaluates all three tasks and prints per-task and average scores.
Docker
Build:
docker build -t configdebugger-env .
Run:
docker run -p 7860:7860 configdebugger-env
Hugging Face Spaces notes
- Use Docker SDK
- Ensure Space port maps to 7860
- Add tag: openenv
- Include environment variables for external evaluation if needed
Validation checklist
- Typed Observation/Action/Reward models: yes
- reset/step/state implemented: yes
- 3 tasks with deterministic graders: yes
- Reward in range [0.0, 1.0] with partial progress: yes
- Baseline inference script with OpenAI client: yes
- Dockerfile included: yes
- OpenEnv metadata file included: yes