---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# 🚀 Project Overview
**CloudOps Optimizer** is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.
### The Problem It Simulates
Companies using AWS/Azure/GCP waste millions yearly on:
- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - failing to balance cost against performance
### The Agent's Job
1. See current infrastructure (CPU usage, costs, latency)
2. Choose actions like `change srv-1 to t3.small`
3. Get rewards/penalties based on cost savings + performance
4. Learn to optimize cost vs performance tradeoffs
---
# CloudOps Optimizer Environment
## Overview
**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.
## Why This Matters
- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "Cloud Waste". Training agents to right-size instances is a multi-million dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs performance tradeoffs.
## Environment Description
### Observation Space
The agent receives structured data including:
- **Inventory**: List of cloud resources (id, type, cpu_usage, mem_usage, monthly_cost)
- **Metrics**: Real-time performance (avg_latency_ms, error_rate, throughput_rps)
- **SLA**: Target constraints (max_latency_ms, max_budget, min_uptime_pct)
- **Task Info**: task_id, task_name, difficulty, current step
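The fields above can be sketched as typed models. The repo's `models.py` defines these with Pydantic; the sketch below uses stdlib dataclasses with the same field names, and the exact shapes are an assumption:

```python
from dataclasses import dataclass

@dataclass
class Resource:
    id: str              # e.g. "srv-1"
    type: str            # e.g. "m5.large"
    cpu_usage: float     # percent of capacity
    mem_usage: float     # percent of capacity
    monthly_cost: float  # USD per month

@dataclass
class Metrics:
    avg_latency_ms: float
    error_rate: float
    throughput_rps: float

@dataclass
class SLA:
    max_latency_ms: float
    max_budget: float
    min_uptime_pct: float

@dataclass
class Observation:
    inventory: list      # list[Resource]
    metrics: Metrics
    sla: SLA
    task_id: str
    task_name: str
    difficulty: str
    step: int
```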
### Action Space
The agent sends text commands in the format: `change [resource_id] to [instance_type]`
Available instance types:
- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
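A hypothetical parser for this command format, with the pricing table above encoded as a lookup. The environment's actual parser in `env/core.py` may differ:

```python
import re

# Instance types and prices from the table above
INSTANCE_TYPES = {
    "t3.nano":   {"monthly_cost": 3.60,   "capacity": 1.0},
    "t3.small":  {"monthly_cost": 11.50,  "capacity": 2.0},
    "t3.medium": {"monthly_cost": 23.00,  "capacity": 4.0},
    "m5.large":  {"monthly_cost": 70.00,  "capacity": 8.0},
    "m5.xlarge": {"monthly_cost": 140.00, "capacity": 16.0},
}

ACTION_RE = re.compile(r"^change\s+(\S+)\s+to\s+(\S+)$")

def parse_action(text: str) -> tuple[str, str]:
    """Parse 'change [resource_id] to [instance_type]'.

    Returns (resource_id, instance_type); raises ValueError on a
    malformed command or an unknown instance type.
    """
    m = ACTION_RE.match(text.strip())
    if not m:
        raise ValueError(f"malformed action: {text!r}")
    resource_id, instance_type = m.groups()
    if instance_type not in INSTANCE_TYPES:
        raise ValueError(f"unknown instance type: {instance_type}")
    return resource_id, instance_type
```

For example, `parse_action("change srv-1 to t3.small")` yields `("srv-1", "t3.small")`.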
## Tasks & Grading
| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overpriced server without breaking SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize multi-server cluster with tight constraints | Score = reward value (0-1) |
### Reward Function
The reward provides **continuous signals** over the trajectory:
```
R = cost_reward + performance_reward
```
Where:
- **Cost Reward (0-0.5)**: Higher as cost approaches budget
- **Performance Reward (0-0.5)**: Higher as latency stays under SLA
**Partial Progress**: Agent receives incremental rewards for each improvement.
**Penalties**: A system crash (CPU > 110%) results in zero reward and ends the episode.
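The pieces above can be combined into a reward function. This is a minimal sketch of one plausible shaping; the actual implementation lives in `env/core.py`, and the exact scaling (linear headroom below budget and below the latency target) is an assumption:

```python
def compute_reward(total_cost: float, max_budget: float,
                   avg_latency_ms: float, max_latency_ms: float,
                   max_cpu_usage: float) -> tuple[float, bool]:
    """Return (reward, done) for one step.

    A crash (any server's CPU above 110%) yields zero reward and
    ends the episode. Otherwise the reward is the sum of a cost
    term and a performance term, each in [0, 0.5].
    """
    if max_cpu_usage > 110.0:  # system crash: episode over
        return 0.0, True
    # Cost reward (0-0.5): credit only while under budget,
    # scaled by the remaining budget headroom (assumed linear).
    if total_cost <= max_budget:
        cost_reward = 0.5 * (1.0 - total_cost / max_budget)
    else:
        cost_reward = 0.0
    # Performance reward (0-0.5): credit only while latency is
    # under the SLA target, scaled by the latency headroom.
    if avg_latency_ms <= max_latency_ms:
        perf_reward = 0.5 * (1.0 - avg_latency_ms / max_latency_ms)
    else:
        perf_reward = 0.0
    return cost_reward + perf_reward, False
```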
## Setup & Usage
### Prerequisites
- Python 3.10+
- Hugging Face token (`HF_TOKEN`) for inference API access
### Local Installation
```bash
# Install dependencies
pip install -e .
# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```
### Docker Execution
```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```
### API Endpoints
- `POST /reset` - Reset environment with optional task_id
- `POST /step` - Execute action
- `GET /state` - Get current state
- `GET /health` - Health check
## Baseline Results
Model: Qwen/Qwen2.5-72B-Instruct
| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |
**Average: 0.042**
Note: Baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy) and undersizing (medium/hard), which causes crashes.
## Files
- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - Environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - Baseline inference script
- `Dockerfile` - Container build
## Spec Compliance
- [x] Typed Pydantic models
- [x] reset() returns Observation
- [x] step(action) returns (Observation, Reward, done, info)
- [x] state() returns current state
- [x] openenv.yaml with metadata
- [x] openenv validate passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in inference.py