---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# 🚀 Project Overview
**CloudOps Optimizer** is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.
### The Problem It Simulates
Companies using AWS/Azure/GCP waste millions yearly on:
- **Oversized servers** - paying for capacity they don't need
- **Undersized servers** - causing performance issues
- **Poor resource allocation** - balancing cost vs performance
### The Agent's Job
1. See current infrastructure (CPU usage, costs, latency)
2. Choose actions like `change srv-1 to t3.small`
3. Get rewards/penalties based on cost savings + performance
4. Learn to optimize cost vs performance tradeoffs
---
# CloudOps Optimizer Environment
## Overview
**CloudOps Optimizer** is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.
## Why This Matters
- **Real-world utility**: Every company using AWS/Azure/GCP struggles with "Cloud Waste". Training agents to right-size instances is a multi-million dollar problem.
- **Not a toy**: Unlike chatbots or simple games, this environment requires quantitative reasoning about cost vs performance tradeoffs.
## Environment Description
### Observation Space
The agent receives structured data including:
- **Inventory**: List of cloud resources (id, type, cpu_usage, mem_usage, monthly_cost)
- **Metrics**: Real-time performance (avg_latency_ms, error_rate, throughput_rps)
- **SLA**: Target constraints (max_latency_ms, max_budget, min_uptime_pct)
- **Task Info**: task_id, task_name, difficulty, current step
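The observation fields above can be sketched as typed models. This is an illustrative, dependency-free sketch using dataclasses (the actual project uses Pydantic models in `models.py`; exact field names beyond those listed are assumptions):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Resource:
    id: str               # e.g. "srv-1"
    type: str             # e.g. "t3.medium"
    cpu_usage: float      # percent of capacity in use
    mem_usage: float      # percent of capacity in use
    monthly_cost: float   # USD per month

@dataclass
class Metrics:
    avg_latency_ms: float
    error_rate: float
    throughput_rps: float

@dataclass
class SLA:
    max_latency_ms: float
    max_budget: float
    min_uptime_pct: float

@dataclass
class Observation:
    inventory: List[Resource]
    metrics: Metrics
    sla: SLA
    task_id: str
    task_name: str
    difficulty: str
    step: int
```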
### Action Space
The agent sends text commands in the format: `change [resource_id] to [instance_type]`
Available instance types:
- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
## Tasks & Grading
| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Downsize an overprovisioned server without breaking the SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize multi-server cluster with tight constraints | Score = reward value (0-1) |
### Reward Function
The reward provides **continuous signals** over the trajectory:
```
R = cost_reward + performance_reward
```
Where:
- **Cost Reward (0-0.5)**: Higher as cost approaches budget
- **Performance Reward (0-0.5)**: Higher as latency stays under SLA
**Partial Progress**: Agent receives incremental rewards for each improvement.
**Penalties**: System crash (CPU > 110%) results in 0 reward and episode end.
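One plausible shaping consistent with the description above, sketched for illustration (the exact formulas live in `env/core.py`; this version rewards headroom under the budget and margin under the latency SLA, which is an assumption about how "approaches" is scored):

```python
def compute_reward(cost: float, budget: float,
                   latency_ms: float, max_latency_ms: float,
                   peak_cpu: float) -> tuple[float, bool]:
    """Return (reward, done). Illustrative shaping, not the shipped code."""
    if peak_cpu > 110.0:
        # System crash: zero reward and the episode ends.
        return 0.0, True
    # Cost reward (0-0.5): staying under budget scores higher.
    cost_reward = 0.5 * max(0.0, min(1.0, (budget - cost) / budget))
    # Performance reward (0-0.5): staying under the latency SLA scores higher.
    perf_reward = 0.5 * max(0.0, min(1.0, (max_latency_ms - latency_ms) / max_latency_ms))
    return cost_reward + perf_reward, False
```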
## Setup & Usage
### Prerequisites
- Python 3.10+
- Hugging Face token (`HF_TOKEN`)
### Local Installation
```bash
# Install dependencies
pip install -e .
# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```
### Docker Execution
```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```
### API Endpoints
- `POST /reset` - Reset environment with optional task_id
- `POST /step` - Execute action
- `GET /state` - Get current state
- `GET /health` - Health check
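The endpoints above can be driven with a plain HTTP client. A minimal stdlib-only sketch of one reset/step round trip (the JSON field names `task_id` and `action`, and the sample task id, are assumptions; check `server/app.py` for the actual schema):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the Docker port mapping above

def format_action(resource_id: str, instance_type: str) -> str:
    # Build the text command the environment parses.
    return f"change {resource_id} to {instance_type}"

def post(path: str, payload: dict) -> dict:
    # Minimal JSON POST helper using only the standard library.
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    obs = post("/reset", {"task_id": "task-1"})  # optional task selection
    result = post("/step", {"action": format_action("srv-1", "t3.small")})
    print(result)
```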
## Baseline Results
Model: Qwen/Qwen2.5-72B-Instruct
| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |
**Average: 0.042**
Note: The baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (Easy) and undersizing instances, which crashes the system (Medium/Hard).
## Files
- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - Environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - Baseline inference script
- `Dockerfile` - Container build
## Spec Compliance
- [x] Typed Pydantic models
- [x] reset() returns Observation
- [x] step(action) returns (Observation, Reward, done, info)
- [x] state() returns current state
- [x] openenv.yaml with metadata
- [x] openenv validate passes
- [x] 3 tasks with deterministic graders (0.0-1.0)
- [x] Partial reward signals
- [x] Strict [START]/[STEP]/[END] log format in inference.py