---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# 🚀 Project Overview

**CloudOps Optimizer** is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.

### The Problem It Simulates
Companies using AWS/Azure/GCP waste millions yearly on:
- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - failing to balance cost against performance

### The Agent's Job
1. See current infrastructure (CPU usage, costs, latency)
2. Choose actions like `change srv-1 to t3.small`
3. Get rewards/penalties based on cost savings + performance
4. Learn to optimize cost vs performance tradeoffs

---

# CloudOps Optimizer Environment

## Overview

**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.

## Why This Matters

- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "Cloud Waste". Training agents to right-size instances is a multi-million dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs performance tradeoffs.

## Environment Description

### Observation Space

The agent receives structured data including:
- **Inventory**: List of cloud resources (id, type, cpu_usage, mem_usage, monthly_cost)
- **Metrics**: Real-time performance (avg_latency_ms, error_rate, throughput_rps)
- **SLA**: Target constraints (max_latency_ms, max_budget, min_uptime_pct)
- **Task Info**: task_id, task_name, difficulty, current step
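
To make the structure concrete, here is an illustrative observation payload built from the field names listed above. The nesting and the specific values are assumptions for the sketch; the authoritative schema is the Pydantic `Observation` model in `models.py`.

```python
# Hypothetical observation payload; field names follow the lists above,
# but the exact nesting and values are illustrative only.
observation = {
    "inventory": [
        {"id": "srv-1", "type": "m5.large", "cpu_usage": 12.0,
         "mem_usage": 30.0, "monthly_cost": 70.00},
    ],
    "metrics": {"avg_latency_ms": 85.0, "error_rate": 0.01,
                "throughput_rps": 120.0},
    "sla": {"max_latency_ms": 200.0, "max_budget": 50.0,
            "min_uptime_pct": 99.9},
    "task_info": {"task_id": "right_sizing", "task_name": "Right-Sizing",
                  "difficulty": "easy", "step": 0},
}
```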

### Action Space

The agent sends text commands in format: `change [resource_id] to [instance_type]`

Available instance types:
- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
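
A minimal sketch of how a client (or the environment) might validate the command format and look up the price table above. The regex and function names are assumptions for illustration; the environment's actual parsing lives in `env/core.py`.

```python
import re

# Pricing and capacity table, transcribed from the list above.
INSTANCE_TYPES = {
    "t3.nano":   {"cost": 3.60,   "capacity": 1.0},
    "t3.small":  {"cost": 11.50,  "capacity": 2.0},
    "t3.medium": {"cost": 23.00,  "capacity": 4.0},
    "m5.large":  {"cost": 70.00,  "capacity": 8.0},
    "m5.xlarge": {"cost": 140.00, "capacity": 16.0},
}

# Matches the documented command format: change [resource_id] to [instance_type]
ACTION_RE = re.compile(r"^change\s+(\S+)\s+to\s+(\S+)$")

def parse_action(command: str) -> tuple[str, str]:
    """Split a command into (resource_id, instance_type), rejecting bad input."""
    m = ACTION_RE.match(command.strip())
    if not m:
        raise ValueError(f"Malformed action: {command!r}")
    resource_id, instance_type = m.groups()
    if instance_type not in INSTANCE_TYPES:
        raise ValueError(f"Unknown instance type: {instance_type}")
    return resource_id, instance_type
```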

## Tasks & Grading

| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overpriced server without breaking SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize multi-server cluster with tight constraints | Score = reward value (0-1) |

### Reward Function

The reward provides **continuous signals** over the trajectory:

```
R = cost_reward + performance_reward
```

Where:
- **Cost Reward (0-0.5)**: Higher as cost approaches budget
- **Performance Reward (0-0.5)**: Higher as latency stays under SLA

**Partial Progress**: Agent receives incremental rewards for each improvement.
**Penalties**: System crash (CPU > 110%) results in 0 reward and episode end.
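
The two components above can be sketched as a simple shaping function. This is a hypothetical reconstruction under the stated 0-0.5 ranges; the real formula lives in `env/core.py` and may shape differently.

```python
def compute_reward(monthly_cost: float, max_budget: float,
                   latency_ms: float, max_latency_ms: float) -> float:
    """Illustrative sketch of the two reward components (0-0.5 each)."""
    # Cost component: full 0.5 at or under budget, decaying linearly
    # once spend exceeds the budget.
    if monthly_cost <= max_budget:
        cost_reward = 0.5
    else:
        overshoot = (monthly_cost - max_budget) / max_budget
        cost_reward = max(0.0, 0.5 * (1.0 - overshoot))
    # Performance component: full 0.5 well under the SLA latency,
    # dropping to 0 at or beyond the limit.
    perf_reward = 0.5 * max(0.0, min(1.0, 1.0 - latency_ms / max_latency_ms))
    return cost_reward + perf_reward
```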

## Setup & Usage

### Prerequisites
- Python 3.10+
- Hugging Face token (`HF_TOKEN`)

### Local Installation

```bash
# Install dependencies
pip install -e .

# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```

### Docker Execution

```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```

### API Endpoints

- `POST /reset` - Reset environment with optional task_id
- `POST /step` - Execute action
- `GET /state` - Get current state
- `GET /health` - Health check
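
Assuming the server is running on port 8000 (as in the Docker command above), the endpoints can be exercised with `curl`. The JSON field names in the request bodies are assumptions based on the descriptions above; check `server/app.py` for the exact request schemas.

```bash
# Reset the environment, optionally selecting a task.
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "right_sizing"}'

# Execute one action in the documented command format.
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": "change srv-1 to t3.small"}'

# Inspect current state and health.
curl http://localhost:8000/state
curl http://localhost:8000/health
```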

## Baseline Results

Model: Qwen/Qwen2.5-72B-Instruct

| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |

**Average: 0.042**

Note: The baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy task) and undersizing, which caused crashes (medium and hard tasks).

## Files

- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - Environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - Baseline inference script
- `Dockerfile` - Container build

## Spec Compliance

- [x] Typed Pydantic models
- [x] reset() returns Observation
- [x] step(action) returns (Observation, Reward, done, info)
- [x] state() returns current state
- [x] openenv.yaml with metadata
- [x] openenv validate passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in inference.py