---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# 🚀 Project Overview

**CloudOps Optimizer** is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.

### The Problem It Simulates
Companies using AWS/Azure/GCP waste millions yearly on:
- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - failing to balance cost against performance

### The Agent's Job
1. See current infrastructure (CPU usage, costs, latency)
2. Choose actions like `change srv-1 to t3.small`
3. Get rewards/penalties based on cost savings + performance
4. Learn to optimize cost vs performance tradeoffs

---

# CloudOps Optimizer Environment

## Overview

**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.

## Why This Matters

- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "Cloud Waste". Training agents to right-size instances is a multi-million dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs performance tradeoffs.

## Environment Description

### Observation Space

The agent receives structured data including:
- **Inventory**: List of cloud resources (id, type, cpu_usage, mem_usage, monthly_cost)
- **Metrics**: Real-time performance (avg_latency_ms, error_rate, throughput_rps)
- **SLA**: Target constraints (max_latency_ms, max_budget, min_uptime_pct)
- **Task Info**: task_id, task_name, difficulty, current step
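
To make the structure concrete, here is an illustrative observation payload built from the field names listed above. The nesting and the specific values are assumptions for the sketch; the authoritative schema is the Pydantic `Observation` model in `models.py`.

```python
# Hypothetical observation payload; field names follow the lists above,
# but the exact nesting and values are illustrative only.
observation = {
    "inventory": [
        {"id": "srv-1", "type": "m5.large", "cpu_usage": 12.0,
         "mem_usage": 30.0, "monthly_cost": 70.00},
    ],
    "metrics": {"avg_latency_ms": 85.0, "error_rate": 0.01,
                "throughput_rps": 120.0},
    "sla": {"max_latency_ms": 200.0, "max_budget": 50.0,
            "min_uptime_pct": 99.9},
    "task_info": {"task_id": "right_sizing", "task_name": "Right-Sizing",
                  "difficulty": "easy", "step": 0},
}
```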

### Action Space

The agent sends text commands in format: `change [resource_id] to [instance_type]`

Available instance types:
- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
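
A minimal sketch of how a client (or the environment) might validate the command format and look up the price table above. The regex and function names are assumptions for illustration; the environment's actual parsing lives in `env/core.py`.

```python
import re

# Pricing and capacity table, transcribed from the list above.
INSTANCE_TYPES = {
    "t3.nano":   {"cost": 3.60,   "capacity": 1.0},
    "t3.small":  {"cost": 11.50,  "capacity": 2.0},
    "t3.medium": {"cost": 23.00,  "capacity": 4.0},
    "m5.large":  {"cost": 70.00,  "capacity": 8.0},
    "m5.xlarge": {"cost": 140.00, "capacity": 16.0},
}

# Matches the documented command format: change [resource_id] to [instance_type]
ACTION_RE = re.compile(r"^change\s+(\S+)\s+to\s+(\S+)$")

def parse_action(command: str) -> tuple[str, str]:
    """Split a command into (resource_id, instance_type), rejecting bad input."""
    m = ACTION_RE.match(command.strip())
    if not m:
        raise ValueError(f"Malformed action: {command!r}")
    resource_id, instance_type = m.groups()
    if instance_type not in INSTANCE_TYPES:
        raise ValueError(f"Unknown instance type: {instance_type}")
    return resource_id, instance_type
```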

## Tasks & Grading

| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overpriced server without breaking SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize multi-server cluster with tight constraints | Score = reward value (0-1) |

### Reward Function

The reward provides **continuous signals** over the trajectory:

```
R = cost_reward + performance_reward
```

Where:
- **Cost Reward (0-0.5)**: Higher as cost approaches budget
- **Performance Reward (0-0.5)**: Higher as latency stays under SLA

**Partial Progress**: Agent receives incremental rewards for each improvement.
**Penalties**: System crash (CPU > 110%) results in 0 reward and episode end.
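
The two components above can be sketched as a simple shaping function. This is a hypothetical reconstruction under the stated 0-0.5 ranges; the real formula lives in `env/core.py` and may shape differently.

```python
def compute_reward(monthly_cost: float, max_budget: float,
                   latency_ms: float, max_latency_ms: float) -> float:
    """Illustrative sketch of the two reward components (0-0.5 each)."""
    # Cost component: full 0.5 at or under budget, decaying linearly
    # once spend exceeds the budget.
    if monthly_cost <= max_budget:
        cost_reward = 0.5
    else:
        overshoot = (monthly_cost - max_budget) / max_budget
        cost_reward = max(0.0, 0.5 * (1.0 - overshoot))
    # Performance component: full 0.5 well under the SLA latency,
    # dropping to 0 at or beyond the limit.
    perf_reward = 0.5 * max(0.0, min(1.0, 1.0 - latency_ms / max_latency_ms))
    return cost_reward + perf_reward
```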

## Setup & Usage

### Prerequisites
- Python 3.10+
- Hugging Face token (`HF_TOKEN`)

### Local Installation

```bash
# Install dependencies
pip install -e .

# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```

### Docker Execution

```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```

### API Endpoints

- `POST /reset` - Reset environment with optional task_id
- `POST /step` - Execute action
- `GET /state` - Get current state
- `GET /health` - Health check
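
Assuming the server is running on port 8000 (as in the Docker command above), the endpoints can be exercised with `curl`. The JSON field names in the request bodies are assumptions based on the descriptions above; check `server/app.py` for the exact request schemas.

```bash
# Reset the environment, optionally selecting a task.
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "right_sizing"}'

# Execute one action in the documented command format.
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": "change srv-1 to t3.small"}'

# Inspect current state and health.
curl http://localhost:8000/state
curl http://localhost:8000/health
```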

## Baseline Results

Model: Qwen/Qwen2.5-72B-Instruct

| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |

**Average: 0.042**

Note: The baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy task) and undersizing, which caused crashes (medium and hard tasks).

## Files

- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - Environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - Baseline inference script
- `Dockerfile` - Container build

## Spec Compliance

- [x] Typed Pydantic models
- [x] reset() returns Observation
- [x] step(action) returns (Observation, Reward, done, info)
- [x] state() returns current state
- [x] openenv.yaml with metadata
- [x] openenv validate passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in inference.py