---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# 🚀 Project Overview

**CloudOps Optimizer** is an OpenEnv simulation for autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.

### The Problem It Simulates

Companies using AWS/Azure/GCP waste millions every year through:

- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - failing to balance cost against performance

### The Agent's Job

1. Observe the current infrastructure (CPU usage, costs, latency)
2. Choose actions such as `change srv-1 to t3.small`
3. Receive rewards or penalties based on cost savings and performance
4. Learn to optimize the cost vs. performance tradeoff

---
# CloudOps Optimizer Environment

## Overview

**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a cloud Site Reliability Engineer (SRE), optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.

## Why This Matters

- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "cloud waste"; training agents to right-size instances is a multi-million-dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs. performance tradeoffs.
## Environment Description

### Observation Space

The agent receives structured data including (see the example below):

- **Inventory**: list of cloud resources (`id`, `type`, `cpu_usage`, `mem_usage`, `monthly_cost`)
- **Metrics**: real-time performance (`avg_latency_ms`, `error_rate`, `throughput_rps`)
- **SLA**: target constraints (`max_latency_ms`, `max_budget`, `min_uptime_pct`)
- **Task Info**: `task_id`, `task_name`, `difficulty`, current step
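
For concreteness, one observation could look roughly like the Python dictionary below. The field names mirror the lists above, but the authoritative schema lives in `models.py` and every value here is purely illustrative.

```python
# Illustrative observation only; see models.py for the real schema.
observation = {
    "inventory": [
        {"id": "srv-1", "type": "m5.large", "cpu_usage": 12.0,
         "mem_usage": 30.0, "monthly_cost": 70.00},
    ],
    "metrics": {"avg_latency_ms": 85.0, "error_rate": 0.01, "throughput_rps": 120.0},
    "sla": {"max_latency_ms": 200.0, "max_budget": 40.00, "min_uptime_pct": 99.9},
    "task_info": {"task_id": "right_sizing", "task_name": "Right-Sizing",
                  "difficulty": "easy", "step": 0},
}
```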
### Action Space

The agent sends text commands in the format: `change [resource_id] to [instance_type]`

Available instance types:

- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
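
To illustrate the command grammar and price table above, the sketch below validates an action string on the client side before sending it. This is not the parser in `env/core.py`; the regex, the helper name, and the error handling are assumptions made for this example.

```python
import re

# Prices copied from the instance-type list above.
INSTANCE_PRICES = {
    "t3.nano": 3.60, "t3.small": 11.50, "t3.medium": 23.00,
    "m5.large": 70.00, "m5.xlarge": 140.00,
}
ACTION_RE = re.compile(r"^change (\S+) to (\S+)$")

def parse_action(command: str) -> tuple[str, str]:
    """Split 'change srv-1 to t3.small' into (resource_id, instance_type)."""
    match = ACTION_RE.match(command.strip())
    if not match:
        raise ValueError(f"Malformed action: {command!r}")
    resource_id, instance_type = match.groups()
    if instance_type not in INSTANCE_PRICES:
        raise ValueError(f"Unknown instance type: {instance_type}")
    return resource_id, instance_type

print(parse_action("change srv-1 to t3.small"))  # ('srv-1', 't3.small')
```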
## Tasks & Grading

| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overpriced server without breaking the SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve a performance bottleneck while staying under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize a multi-server cluster with tight constraints | Score = reward value (0-1) |
### Reward Function

The reward provides **continuous signals** over the trajectory:

```
R = cost_reward + performance_reward
```

Where:

- **Cost reward (0-0.5)**: higher as cost approaches the budget
- **Performance reward (0-0.5)**: higher as latency stays under the SLA

**Partial progress**: the agent receives incremental rewards for each improvement.

**Penalties**: a system crash (CPU > 110%) results in 0 reward and ends the episode.
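
To make the split concrete, here is one possible shaping of the two terms. The crash rule and the 0-0.5 ranges come from the description above; the exact curves are assumptions, and the authoritative reward lives in `env/core.py`.

```python
def compute_reward(total_cost: float, budget: float,
                   latency_ms: float, max_latency_ms: float,
                   worst_cpu_pct: float) -> float:
    """Sketch of R = cost_reward + performance_reward; not the env/core.py code."""
    # Crash rule: CPU above 110% yields 0 reward and the episode ends.
    if worst_cpu_pct > 110.0:
        return 0.0
    # Cost term (0-0.5): grows as spend comes down toward the budget and is
    # capped once at or below it (shaping assumed for illustration).
    cost_reward = 0.5 * min(1.0, budget / total_cost) if total_cost > 0 else 0.5
    # Performance term (0-0.5): full credit while latency is within the SLA,
    # shrinking as the SLA is exceeded (shaping assumed for illustration).
    perf_reward = 0.5 * min(1.0, max_latency_ms / latency_ms) if latency_ms > 0 else 0.5
    return cost_reward + perf_reward
```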
## Setup & Usage

### Prerequisites

- Python 3.10+
- Hugging Face API token (`HF_TOKEN`)
### Local Installation

```bash
# Install dependencies
pip install -e .

# Run the baseline inference script
export HF_TOKEN=your_huggingface_token
python inference.py
```
### Docker Execution

```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```
### API Endpoints

- `POST /reset` - reset the environment with an optional `task_id`
- `POST /step` - execute an action
- `GET /state` - get the current state
- `GET /health` - health check
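
A minimal client session against these endpoints might look like the sketch below, assuming the server is reachable on port 8000 as in the Docker example above. The payload field names (`task_id`, `action`) and the task identifier are assumptions; check `server/app.py` for the actual request schemas.

```python
import requests

BASE_URL = "http://localhost:8000"  # matches the docker run example above

# Reset to a task; the task_id value and payload shape are illustrative.
obs = requests.post(f"{BASE_URL}/reset", json={"task_id": "right_sizing"}).json()
print(obs)

# Execute one action using the documented command format.
result = requests.post(f"{BASE_URL}/step",
                       json={"action": "change srv-1 to t3.small"}).json()
print(result)

# Inspect the current state and check server health.
print(requests.get(f"{BASE_URL}/state").json())
print(requests.get(f"{BASE_URL}/health").json())
```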
## Baseline Results

Model: Qwen/Qwen2.5-72B-Instruct

| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |

**Average: 0.042**

Note: the low baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy task) and undersizing instances (medium and hard tasks), the latter of which causes crashes.
## Files

- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward); a rough sketch of their shape follows this list
- `env/core.py` - environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - baseline inference script
- `Dockerfile` - container build
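
As a rough guide to the typed interface, the models declared in `models.py` presumably resemble the sketch below. The field names are inferred from the observation, action, and reward descriptions earlier in this README and may not match the real definitions.

```python
from pydantic import BaseModel


class Resource(BaseModel):
    """One cloud instance as it appears in the observation's inventory."""
    id: str
    type: str
    cpu_usage: float
    mem_usage: float
    monthly_cost: float


class Observation(BaseModel):
    """Hypothetical observation shape; see models.py for the real schema."""
    inventory: list[Resource]
    metrics: dict[str, float]   # avg_latency_ms, error_rate, throughput_rps
    sla: dict[str, float]       # max_latency_ms, max_budget, min_uptime_pct
    task_id: str
    task_name: str
    difficulty: str
    step: int


class Action(BaseModel):
    """Text command, e.g. 'change srv-1 to t3.small'."""
    command: str


class Reward(BaseModel):
    """Scalar reward in [0, 1]: cost term plus performance term."""
    value: float
```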
## Spec Compliance

- [x] Typed Pydantic models
- [x] `reset()` returns an Observation
- [x] `step(action)` returns (Observation, Reward, done, info)
- [x] `state()` returns the current state
- [x] `openenv.yaml` with metadata
- [x] `openenv validate` passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in `inference.py`