---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# 🚀 Project Overview

**CloudOps Optimizer** is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.

### The Problem It Simulates

Companies using AWS/Azure/GCP waste millions yearly on:

- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - balancing cost vs performance

### The Agent's Job

1. See the current infrastructure (CPU usage, costs, latency)
2. Choose actions like `change srv-1 to t3.small`
3. Get rewards/penalties based on cost savings + performance
4. Learn to optimize cost vs performance tradeoffs

---

# CloudOps Optimizer Environment

## Overview

**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.

## Why This Matters

- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "cloud waste". Training agents to right-size instances is a multi-million-dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs performance tradeoffs.
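The text-command interface above (`change srv-1 to t3.small`) can be validated with a small parser. The sketch below is illustrative only: the names `parse_action`, `ACTION_RE`, and `INSTANCE_TYPES` are hypothetical and not the environment's actual implementation, though the instance catalogue matches the Action Space list later in this README.

```python
import re

# Instance catalogue as listed in the Action Space section of this README
INSTANCE_TYPES = {"t3.nano", "t3.small", "t3.medium", "m5.large", "m5.xlarge"}

# Command format: change [resource_id] to [instance_type]
ACTION_RE = re.compile(r"^change\s+(\S+)\s+to\s+(\S+)$")

def parse_action(command: str):
    """Return (resource_id, instance_type), or None if the command is invalid."""
    match = ACTION_RE.match(command.strip())
    if not match:
        return None
    resource_id, instance_type = match.groups()
    if instance_type not in INSTANCE_TYPES:
        return None
    return resource_id, instance_type

print(parse_action("change srv-1 to t3.small"))  # ('srv-1', 't3.small')
print(parse_action("resize srv-1"))              # None
```

A strict parser like this lets the environment reject malformed commands deterministically instead of guessing at agent intent.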
## Environment Description

### Observation Space

The agent receives structured data including:

- **Inventory**: List of cloud resources (id, type, cpu_usage, mem_usage, monthly_cost)
- **Metrics**: Real-time performance (avg_latency_ms, error_rate, throughput_rps)
- **SLA**: Target constraints (max_latency_ms, max_budget, min_uptime_pct)
- **Task Info**: task_id, task_name, difficulty, current step

### Action Space

The agent sends text commands in the format: `change [resource_id] to [instance_type]`

Available instance types:

- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0

## Tasks & Grading

| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overprovisioned server without breaking the SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve a performance bottleneck while staying under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize a multi-server cluster with tight constraints | Score = reward value (0-1) |

### Reward Function

The reward provides **continuous signals** over the trajectory:

```
R = cost_reward + performance_reward
```

Where:

- **Cost Reward (0-0.5)**: Higher as cost approaches the budget
- **Performance Reward (0-0.5)**: Higher as latency stays under the SLA

**Partial Progress**: The agent receives incremental rewards for each improvement.

**Penalties**: A system crash (CPU > 110%) results in 0 reward and ends the episode.

## Setup & Usage

### Prerequisites

- Python 3.10+
- Hugging Face API token (`HF_TOKEN`)

### Local Installation

```bash
# Install dependencies
pip install -e .

# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```

### Docker Execution

```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```

### API Endpoints

- `POST /reset` - Reset the environment with an optional task_id
- `POST /step` - Execute an action
- `GET /state` - Get the current state
- `GET /health` - Health check

## Baseline Results

Model: Qwen/Qwen2.5-72B-Instruct

| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |

**Average: 0.042**

Note: The baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy task) and undersizing (medium/hard tasks, which causes crashes).

## Files

- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - Environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - Baseline inference script
- `Dockerfile` - Container build

## Spec Compliance

- [x] Typed Pydantic models
- [x] reset() returns Observation
- [x] step(action) returns (Observation, Reward, done, info)
- [x] state() returns current state
- [x] openenv.yaml with metadata
- [x] openenv validate passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in inference.py
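## Appendix: Reward Shaping Sketch

The `R = cost_reward + performance_reward` shaping from the Reward Function section can be illustrated with a toy implementation. This is a hypothetical sketch only: the `compute_reward` name, its parameters, and the linear falloff are assumptions for illustration; the actual logic lives in `env/core.py` and may differ.

```python
def compute_reward(monthly_cost: float, budget: float,
                   latency_ms: float, sla_latency_ms: float,
                   cpu_usage_pct: float) -> float:
    """Toy reward in [0, 1]: cost term (0-0.5) plus performance term (0-0.5)."""
    # Crash rule from the Penalties note: CPU > 110% yields 0 and ends the episode
    if cpu_usage_pct > 110.0:
        return 0.0
    # Cost reward (0-0.5): full credit at or under budget, linear falloff above it
    # (the linear shape is an assumption, not the environment's actual curve)
    cost_reward = 0.5 * max(0.0, min(1.0, budget / monthly_cost)) if monthly_cost > 0 else 0.5
    # Performance reward (0-0.5): full credit when latency meets the SLA target
    perf_reward = 0.5 * max(0.0, min(1.0, sla_latency_ms / latency_ms)) if latency_ms > 0 else 0.5
    return cost_reward + perf_reward

print(compute_reward(100.0, 100.0, 50.0, 100.0, 90.0))   # 1.0 (on budget, under SLA)
print(compute_reward(100.0, 100.0, 50.0, 100.0, 120.0))  # 0.0 (crash)
```

Because both terms degrade smoothly rather than dropping to zero, an agent still receives partial credit for each incremental improvement, matching the continuous-signal design described above.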