---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# 🚀 Project Overview
**CloudOps Optimizer** is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.
### The Problem It Simulates
Companies using AWS/Azure/GCP waste millions yearly on:
- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - balancing cost vs performance
### The Agent's Job
1. See current infrastructure (CPU usage, costs, latency)
2. Choose actions like `change srv-1 to t3.small`
3. Get rewards/penalties based on cost savings + performance
4. Learn to optimize cost vs performance tradeoffs
---
# CloudOps Optimizer Environment
## Overview
**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.
## Why This Matters
- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "Cloud Waste". Training agents to right-size instances is a multi-million dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs performance tradeoffs.
## Environment Description
### Observation Space
The agent receives structured data including:
- **Inventory**: List of cloud resources (id, type, cpu_usage, mem_usage, monthly_cost)
- **Metrics**: Real-time performance (avg_latency_ms, error_rate, throughput_rps)
- **SLA**: Target constraints (max_latency_ms, max_budget, min_uptime_pct)
- **Task Info**: task_id, task_name, difficulty, current step
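The observation fields above can be sketched as typed models. This is an illustrative, dependency-free sketch using dataclasses (the actual project uses Pydantic models in `models.py`; exact field names beyond those listed are assumptions):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Resource:
    id: str               # e.g. "srv-1"
    type: str             # e.g. "t3.medium"
    cpu_usage: float      # percent of capacity in use
    mem_usage: float      # percent of capacity in use
    monthly_cost: float   # USD per month

@dataclass
class Metrics:
    avg_latency_ms: float
    error_rate: float
    throughput_rps: float

@dataclass
class SLA:
    max_latency_ms: float
    max_budget: float
    min_uptime_pct: float

@dataclass
class Observation:
    inventory: List[Resource]
    metrics: Metrics
    sla: SLA
    task_id: str
    task_name: str
    difficulty: str
    step: int
```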
### Action Space
The agent sends text commands in the format: `change [resource_id] to [instance_type]`
Available instance types:
- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
## Tasks & Grading
| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Downsize an overprovisioned server without breaking the SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize multi-server cluster with tight constraints | Score = reward value (0-1) |
### Reward Function
The reward provides **continuous signals** over the trajectory:
```
R = cost_reward + performance_reward
```
Where:
- **Cost Reward (0-0.5)**: Higher as cost approaches budget
- **Performance Reward (0-0.5)**: Higher as latency stays under SLA
**Partial Progress**: Agent receives incremental rewards for each improvement.
**Penalties**: System crash (CPU > 110%) results in 0 reward and episode end.
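One plausible shaping consistent with the description above, sketched for illustration (the exact formulas live in `env/core.py`; this version rewards headroom under the budget and margin under the latency SLA, which is an assumption about how "approaches" is scored):

```python
def compute_reward(cost: float, budget: float,
                   latency_ms: float, max_latency_ms: float,
                   peak_cpu: float) -> tuple[float, bool]:
    """Return (reward, done). Illustrative shaping, not the shipped code."""
    if peak_cpu > 110.0:
        # System crash: zero reward and the episode ends.
        return 0.0, True
    # Cost reward (0-0.5): staying under budget scores higher.
    cost_reward = 0.5 * max(0.0, min(1.0, (budget - cost) / budget))
    # Performance reward (0-0.5): staying under the latency SLA scores higher.
    perf_reward = 0.5 * max(0.0, min(1.0, (max_latency_ms - latency_ms) / max_latency_ms))
    return cost_reward + perf_reward, False
```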
## Setup & Usage
### Prerequisites
- Python 3.10+
- Hugging Face token (`HF_TOKEN`)
### Local Installation
```bash
# Install dependencies
pip install -e .
# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```
### Docker Execution
```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```
### API Endpoints
- `POST /reset` - Reset environment with optional task_id
- `POST /step` - Execute action
- `GET /state` - Get current state
- `GET /health` - Health check
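The endpoints above can be driven with a plain HTTP client. A minimal stdlib-only sketch of one reset/step round trip (the JSON field names `task_id` and `action`, and the sample task id, are assumptions; check `server/app.py` for the actual schema):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the Docker port mapping above

def format_action(resource_id: str, instance_type: str) -> str:
    # Build the text command the environment parses.
    return f"change {resource_id} to {instance_type}"

def post(path: str, payload: dict) -> dict:
    # Minimal JSON POST helper using only the standard library.
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    obs = post("/reset", {"task_id": "task-1"})  # optional task selection
    result = post("/step", {"action": format_action("srv-1", "t3.small")})
    print(result)
```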
## Baseline Results
Model: Qwen/Qwen2.5-72B-Instruct
| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |
**Average: 0.042**
Note: The baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (Easy) and undersizing instances, which crashes the system (Medium/Hard).
## Files
- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - Environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - Baseline inference script
- `Dockerfile` - Container build
## Spec Compliance
- [x] Typed Pydantic models
- [x] reset() returns Observation
- [x] step(action) returns (Observation, Reward, done, info)
- [x] state() returns current state
- [x] openenv.yaml with metadata
- [x] openenv validate passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in inference.py