Spaces:
Sleeping
Sleeping
ChaitanyaRasane commited on
Commit Β·
f582a68
0
Parent(s):
deploy: clean initial commit
Browse files- .dockerignore +8 -0
- .gitignore +8 -0
- Dockerfile +20 -0
- README.md +98 -0
- agents/__init__.py +9 -0
- agents/heuristic_agent.py +189 -0
- agents/random_agent.py +54 -0
- backend/main.py +225 -0
- baseline.py +197 -0
- benchmark.py +353 -0
- env.py +364 -0
- frontend/index.html +227 -0
- frontend/script.js +454 -0
- frontend/styles.css +128 -0
- heuristic_agent.py +463 -0
- leaderboard.json +20 -0
- openenv.yaml +31 -0
- prd_adaptive_ui_layout_optimization_environment_final_enhanced.md +305 -0
- requirements.txt +3 -0
.dockerignore
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
.git
|
| 2 |
+
__pycache__
|
| 3 |
+
venv
|
| 4 |
+
.env
|
| 5 |
+
*.pyc
|
| 6 |
+
.gemini
|
| 7 |
+
node_modules
|
| 8 |
+
.DS_Store
|
.gitignore
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
venv/
|
| 2 |
+
__pycache__/
|
| 3 |
+
.env
|
| 4 |
+
.gemini/
|
| 5 |
+
apikey.txt
|
| 6 |
+
*.pyc
|
| 7 |
+
models.json
|
| 8 |
+
models_list.json
|
Dockerfile
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Use an official Python 3.10 runtime as a parent image
|
| 2 |
+
FROM python:3.10-slim
|
| 3 |
+
|
| 4 |
+
# Set the working directory in the container
|
| 5 |
+
WORKDIR /app
|
| 6 |
+
|
| 7 |
+
# Copy the current directory contents into the container at /app
|
| 8 |
+
COPY . /app
|
| 9 |
+
|
| 10 |
+
# Install any needed packages specified in requirements.txt
|
| 11 |
+
RUN pip install --no-cache-dir -r requirements.txt
|
| 12 |
+
|
| 13 |
+
# Make port 80 available to the world outside this container
|
| 14 |
+
EXPOSE 80
|
| 15 |
+
|
| 16 |
+
# Environment variable for the HF token (can be overridden at runtime)
|
| 17 |
+
ENV HF_TOKEN=""
|
| 18 |
+
|
| 19 |
+
# Run baseline.py when the container launches
|
| 20 |
+
CMD ["python", "baseline.py"]
|
README.md
ADDED
|
@@ -0,0 +1,98 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# UI Layout Optimizer: Adaptive UI Optimization Environment (OpenEnv)
|
| 2 |
+
|
| 3 |
+
[](https://github.com/OpenEnv-Protocol)
|
| 4 |
+
[](https://opensource.org/licenses/MIT)
|
| 5 |
+
|
| 6 |
+
## π Motivation
|
| 7 |
+
In modern digital products, static A/B testing often fails to capture the nuance of diverse user behaviors. The **UI Layout Optimizer** is an OpenEnv-compliant environment designed to train agents that dynamically adapt layout configurationsβsuch as button sizes, form lengths, and wizard stepsβto maximize conversion rates and user satisfaction in real-time.
|
| 8 |
+
|
| 9 |
+
By simulating various user personas (impatient, careful, new users) and their psychological responses to UI friction, this environment provides a standardized benchmark for autonomous UI optimization agents.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## π οΈ Environment Specification
|
| 14 |
+
|
| 15 |
+
### Action Space
|
| 16 |
+
The agent can manipulate the UI layout through seven distinct actions:
|
| 17 |
+
|
| 18 |
+
| Action | Description |
|
| 19 |
+
| :--- | :--- |
|
| 20 |
+
| `increase_button` | Increments the button size multiplier. |
|
| 21 |
+
| `decrease_form` | Reduces the number of form fields to lower friction. |
|
| 22 |
+
| `increase_steps` | Adds a step to the checkout flow/wizard. |
|
| 23 |
+
| `decrease_steps` | Removes a step to streamline the completion flow. |
|
| 24 |
+
| `reorder_sections` | Optimizes the component arrangement. |
|
| 25 |
+
| `set_button_size` | Continuously tunes the button size (0.5 - 2.0). |
|
| 26 |
+
| `noop` | Maintains the current layout state. |
|
| 27 |
+
|
| 28 |
+
### Observation Space
|
| 29 |
+
At each step, the agent receives an `Observation` containing:
|
| 30 |
+
|
| 31 |
+
- **Device**: `mobile` or `desktop` (affects user tolerance thresholds).
|
| 32 |
+
- **Layout**: Current `button_size`, `form_length`, and number of `steps`.
|
| 33 |
+
- **Progress**: A scalar value (0.0 to 1.0) representing task completion.
|
| 34 |
+
- **Last Action**: Feedback on the previous operation.
|
| 35 |
+
|
| 36 |
+
### Task Descriptions
|
| 37 |
+
Evaluation is conducted across three difficulty tiers:
|
| 38 |
+
|
| 39 |
+
1. **Easy**: Discrete actions only, stable user types, and low noise levels.
|
| 40 |
+
2. **Medium**: Mixed user personas with stochastic drop-off rates.
|
| 41 |
+
3. **Hard**: Hidden user types, continuous action tuning, and highly noisy feedback.
|
| 42 |
+
|
| 43 |
+
---
|
| 44 |
+
|
| 45 |
+
## π» Usage
|
| 46 |
+
|
| 47 |
+
### Prerequisites
|
| 48 |
+
- Python 3.10+
|
| 49 |
+
- Hugging Face API Token (for LLM-based agents)
|
| 50 |
+
|
| 51 |
+
### Local Execution
|
| 52 |
+
1. Install dependencies:
|
| 53 |
+
```bash
|
| 54 |
+
pip install -r requirements.txt
|
| 55 |
+
```
|
| 56 |
+
2. Run the baseline evaluation:
|
| 57 |
+
```bash
|
| 58 |
+
export HF_TOKEN="your_token_here"
|
| 59 |
+
python baseline.py
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
### Running with Docker
|
| 63 |
+
1. Build the image:
|
| 64 |
+
```bash
|
| 65 |
+
docker build -t ui-optimizer .
|
| 66 |
+
```
|
| 67 |
+
2. Run the container:
|
| 68 |
+
```bash
|
| 69 |
+
docker run -e HF_TOKEN="your_token_here" ui-optimizer
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
## βοΈ Deployment to Hugging Face Spaces
|
| 75 |
+
|
| 76 |
+
This project is optimized for deployment as a **Docker Space**.
|
| 77 |
+
|
| 78 |
+
1. Create a new Space on [Hugging Face](https://huggingface.co/new-space).
|
| 79 |
+
2. Select **Docker** as the SDK.
|
| 80 |
+
3. In the Space **Settings**, add your `HF_TOKEN` as a Secret.
|
| 81 |
+
4. Push the project files (including `Dockerfile` and `requirements.txt`) to the Space repository.
|
| 82 |
+
5. Hugging Face will automatically build and deploy the container.
|
| 83 |
+
|
| 84 |
+
---
|
| 85 |
+
|
| 86 |
+
## π Baseline Results (Example)
|
| 87 |
+
Evaluation results using the provided `baseline.py` hybrid agent:
|
| 88 |
+
|
| 89 |
+
| Task | Avg Reward | Completion Rate | Final Score |
|
| 90 |
+
| :--- | :--- | :--- | :--- |
|
| 91 |
+
| Easy | 1.8450 | 92.0% | 0.8931 |
|
| 92 |
+
| Medium | 1.4210 | 78.0% | 0.7323 |
|
| 93 |
+
| Hard | 0.9820 | 54.0% | 0.5126 |
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
|
| 97 |
+
## π License
|
| 98 |
+
This project is licensed under the MIT License - see the LICENSE file for details.
|
agents/__init__.py
ADDED
|
@@ -0,0 +1,9 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# agents/__init__.py
|
| 2 |
+
"""
|
| 3 |
+
Agent package for the UI Layout Optimization environment.
|
| 4 |
+
|
| 5 |
+
All agents expose a common interface:
|
| 6 |
+
agent.reset() -- clear per-episode state
|
| 7 |
+
agent.act(obs) -- select an Action given an Observation
|
| 8 |
+
agent.update(info) -- ingest the env info dict after a step
|
| 9 |
+
"""
|
agents/heuristic_agent.py
ADDED
|
@@ -0,0 +1,189 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
heuristic_agent.py (agents package)
|
| 3 |
+
------------------------------------
|
| 4 |
+
Multi-stage heuristic agent for UIEnv.
|
| 5 |
+
|
| 6 |
+
Decision pipeline (priority order, first match wins):
|
| 7 |
+
Stage 1 -> Risk Mitigation (prevent imminent drop)
|
| 8 |
+
Stage 2 -> Feedback Adaptation (react to distrust / drop signals)
|
| 9 |
+
Stage 3 -> Layout Optimization (converge toward ideal layout)
|
| 10 |
+
Stage 4 -> Exploration (controlled randomness in safe states)
|
| 11 |
+
Stage 5 -> Fallback (safe default when layout is near-optimal)
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
import random
|
| 17 |
+
import sys
|
| 18 |
+
import os
|
| 19 |
+
from collections import deque
|
| 20 |
+
from typing import Optional
|
| 21 |
+
|
| 22 |
+
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
|
| 23 |
+
|
| 24 |
+
from env import Action, Observation
|
| 25 |
+
|
| 26 |
+
# ---------------------------------------------------------------------------
|
| 27 |
+
# Optimal layout targets (derived from reward shaping in env.py)
|
| 28 |
+
# ---------------------------------------------------------------------------
|
| 29 |
+
|
| 30 |
+
BUTTON_SWEET_LOW: float = 0.9
|
| 31 |
+
BUTTON_SWEET_HIGH: float = 1.3
|
| 32 |
+
BUTTON_SWEET_MID: float = 1.1
|
| 33 |
+
|
| 34 |
+
TARGET_STEPS: int = 2
|
| 35 |
+
TARGET_FORM_LENGTH: int = 4
|
| 36 |
+
SAFE_FORM_FLOOR: int = 3
|
| 37 |
+
|
| 38 |
+
DROP_STEPS_THRESHOLD: int = 3
|
| 39 |
+
DROP_FORM_THRESHOLD: int = 5
|
| 40 |
+
|
| 41 |
+
EXPLORE_PROBABILITY: float = 0.07
|
| 42 |
+
NOOP_SAFE_LIMIT: int = 1
|
| 43 |
+
|
| 44 |
+
_INVERSE_ACTIONS: dict[str, str] = {
|
| 45 |
+
"increase_button": "set_button_size",
|
| 46 |
+
"increase_steps": "decrease_steps",
|
| 47 |
+
"decrease_steps": "increase_steps",
|
| 48 |
+
}
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
class HeuristicAgent:
|
| 52 |
+
"""Structured, multi-stage heuristic agent for UIEnv."""
|
| 53 |
+
|
| 54 |
+
NAME = "HeuristicAgent"
|
| 55 |
+
|
| 56 |
+
def __init__(self, seed: int = 99) -> None:
|
| 57 |
+
self._rng = random.Random(seed)
|
| 58 |
+
self.last_outcome: Optional[str] = None
|
| 59 |
+
self.noop_streak: int = 0
|
| 60 |
+
self.action_history: deque[str] = deque(maxlen=5)
|
| 61 |
+
self.distrust_count: int = 0
|
| 62 |
+
self.drop_count: int = 0
|
| 63 |
+
self.step_number: int = 0
|
| 64 |
+
|
| 65 |
+
# ------------------------------------------------------------------ #
|
| 66 |
+
# Public API #
|
| 67 |
+
# ------------------------------------------------------------------ #
|
| 68 |
+
|
| 69 |
+
def reset(self) -> None:
|
| 70 |
+
self.last_outcome = None
|
| 71 |
+
self.noop_streak = 0
|
| 72 |
+
self.action_history.clear()
|
| 73 |
+
self.distrust_count = 0
|
| 74 |
+
self.drop_count = 0
|
| 75 |
+
self.step_number = 0
|
| 76 |
+
|
| 77 |
+
def act(self, obs: Observation) -> Action:
|
| 78 |
+
self.step_number += 1
|
| 79 |
+
action = (
|
| 80 |
+
self._risk_mitigation(obs)
|
| 81 |
+
or self._adaptation(obs)
|
| 82 |
+
or self._optimize_layout(obs)
|
| 83 |
+
or self._explore(obs)
|
| 84 |
+
or self._fallback(obs)
|
| 85 |
+
)
|
| 86 |
+
self.action_history.append(action.type)
|
| 87 |
+
if action.type == "noop":
|
| 88 |
+
self.noop_streak += 1
|
| 89 |
+
else:
|
| 90 |
+
self.noop_streak = 0
|
| 91 |
+
return action
|
| 92 |
+
|
| 93 |
+
def update(self, info: dict) -> None:
|
| 94 |
+
outcome = info.get("outcome", "continue")
|
| 95 |
+
self.last_outcome = outcome
|
| 96 |
+
if outcome == "distrust":
|
| 97 |
+
self.distrust_count += 1
|
| 98 |
+
elif outcome == "drop":
|
| 99 |
+
self.drop_count += 1
|
| 100 |
+
|
| 101 |
+
def __repr__(self) -> str:
|
| 102 |
+
return self.NAME
|
| 103 |
+
|
| 104 |
+
# ------------------------------------------------------------------ #
|
| 105 |
+
# Helpers #
|
| 106 |
+
# ------------------------------------------------------------------ #
|
| 107 |
+
|
| 108 |
+
def _would_oscillate(self, candidate: str) -> bool:
|
| 109 |
+
if not self.action_history:
|
| 110 |
+
return False
|
| 111 |
+
last = self.action_history[-1]
|
| 112 |
+
inv = _INVERSE_ACTIONS.get(candidate)
|
| 113 |
+
return last == inv or _INVERSE_ACTIONS.get(last) == candidate
|
| 114 |
+
|
| 115 |
+
@staticmethod
|
| 116 |
+
def _make(action_type: str, value: float | None = None) -> Action:
|
| 117 |
+
return Action(type=action_type, value=value)
|
| 118 |
+
|
| 119 |
+
# ---- Stage 1: Risk Mitigation ------------------------------------ #
|
| 120 |
+
|
| 121 |
+
def _risk_mitigation(self, obs: Observation) -> Optional[Action]:
|
| 122 |
+
layout = obs.layout
|
| 123 |
+
|
| 124 |
+
# Calculate mathematical drop risk from extreme values
|
| 125 |
+
step_risk = max(0, layout.steps - 3) * 0.20
|
| 126 |
+
form_risk = max(0, layout.form_length - 5) * 0.15
|
| 127 |
+
|
| 128 |
+
# 1. Eliminate the highest immediate source of dropout
|
| 129 |
+
if form_risk > step_risk and form_risk > 0:
|
| 130 |
+
return self._make("decrease_form")
|
| 131 |
+
if step_risk > 0:
|
| 132 |
+
return self._make("decrease_steps")
|
| 133 |
+
|
| 134 |
+
# 2. Distrust/Drop combo from terrible button sizes
|
| 135 |
+
if layout.button_size < 0.9 or layout.button_size > 1.3:
|
| 136 |
+
# Jump directly to 1.25 to hit the hidden `> 1.2` user preference sweet spot instantly
|
| 137 |
+
return self._make("set_button_size", 1.25)
|
| 138 |
+
|
| 139 |
+
return None
|
| 140 |
+
|
| 141 |
+
# ---- Stage 2: Feedback Adaptation -------------------------------- #
|
| 142 |
+
|
| 143 |
+
def _adaptation(self, obs: Observation) -> Optional[Action]:
|
| 144 |
+
if self.last_outcome == "distrust":
|
| 145 |
+
layout = obs.layout
|
| 146 |
+
if layout.steps < TARGET_STEPS and not self._would_oscillate("increase_steps"):
|
| 147 |
+
return self._make("increase_steps")
|
| 148 |
+
return None
|
| 149 |
+
|
| 150 |
+
if self.last_outcome == "drop":
|
| 151 |
+
layout = obs.layout
|
| 152 |
+
if layout.steps > TARGET_STEPS and not self._would_oscillate("decrease_steps"):
|
| 153 |
+
return self._make("decrease_steps")
|
| 154 |
+
if layout.form_length > SAFE_FORM_FLOOR:
|
| 155 |
+
return self._make("decrease_form")
|
| 156 |
+
return None
|
| 157 |
+
|
| 158 |
+
return None
|
| 159 |
+
|
| 160 |
+
# ---- Stage 3: Layout Optimization -------------------------------- #
|
| 161 |
+
|
| 162 |
+
def _optimize_layout(self, obs: Observation) -> Optional[Action]:
|
| 163 |
+
layout = obs.layout
|
| 164 |
+
|
| 165 |
+
# Fine-tune steps down to optimal 2
|
| 166 |
+
if layout.steps > TARGET_STEPS and not self._would_oscillate("decrease_steps"):
|
| 167 |
+
return self._make("decrease_steps")
|
| 168 |
+
|
| 169 |
+
# Fine-tune form length down to optimal 4 (avoids hidden penalty)
|
| 170 |
+
if layout.form_length > TARGET_FORM_LENGTH:
|
| 171 |
+
return self._make("decrease_form")
|
| 172 |
+
|
| 173 |
+
return None
|
| 174 |
+
|
| 175 |
+
# ---- Stage 4: Exploration ---------------------------------------- #
|
| 176 |
+
|
| 177 |
+
def _explore(self, obs: Observation) -> Optional[Action]:
|
| 178 |
+
if self.last_outcome in ("drop", "distrust"):
|
| 179 |
+
return None
|
| 180 |
+
# Light exploration around the golden ratio if comfortable
|
| 181 |
+
if self._rng.random() < EXPLORE_PROBABILITY:
|
| 182 |
+
target = self._rng.uniform(1.20, 1.29)
|
| 183 |
+
return self._make("set_button_size", round(target, 2))
|
| 184 |
+
return None
|
| 185 |
+
|
| 186 |
+
# ---- Stage 5: Fallback ------------------------------------------- #
|
| 187 |
+
|
| 188 |
+
def _fallback(self, obs: Observation) -> Action:
|
| 189 |
+
return self._make("noop")
|
agents/random_agent.py
ADDED
|
@@ -0,0 +1,54 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
random_agent.py
|
| 3 |
+
---------------
|
| 4 |
+
Uniformly random discrete-action agent for UIEnv.
|
| 5 |
+
|
| 6 |
+
Serves as the baseline in the benchmarking leaderboard.
|
| 7 |
+
Every call to act() picks an action uniformly at random from
|
| 8 |
+
the six discrete action types (no set_button_size, which
|
| 9 |
+
requires a continuous value).
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import random
|
| 15 |
+
import sys
|
| 16 |
+
import os
|
| 17 |
+
|
| 18 |
+
# Ensure project root is importable
|
| 19 |
+
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
|
| 20 |
+
|
| 21 |
+
from env import Action, Observation
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
class RandomAgent:
|
| 25 |
+
"""Uniformly random discrete-action agent."""
|
| 26 |
+
|
| 27 |
+
NAME = "RandomAgent"
|
| 28 |
+
|
| 29 |
+
_ACTIONS = [
|
| 30 |
+
"increase_button",
|
| 31 |
+
"decrease_form",
|
| 32 |
+
"increase_steps",
|
| 33 |
+
"decrease_steps",
|
| 34 |
+
"reorder_sections",
|
| 35 |
+
"noop",
|
| 36 |
+
]
|
| 37 |
+
|
| 38 |
+
def __init__(self, seed: int = 99) -> None:
|
| 39 |
+
self._rng = random.Random(seed)
|
| 40 |
+
|
| 41 |
+
def reset(self) -> None:
|
| 42 |
+
"""No state to clear."""
|
| 43 |
+
pass
|
| 44 |
+
|
| 45 |
+
def act(self, obs: Observation) -> Action:
|
| 46 |
+
"""Pick a uniformly random discrete action."""
|
| 47 |
+
return Action(type=self._rng.choice(self._ACTIONS), value=None)
|
| 48 |
+
|
| 49 |
+
def update(self, info: dict) -> None:
|
| 50 |
+
"""No learning or adaptation."""
|
| 51 |
+
pass
|
| 52 |
+
|
| 53 |
+
def __repr__(self) -> str:
|
| 54 |
+
return self.NAME
|
backend/main.py
ADDED
|
@@ -0,0 +1,225 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
backend/main.py
|
| 3 |
+
---------------
|
| 4 |
+
FastAPI server for the UIEnv interactive simulator.
|
| 5 |
+
|
| 6 |
+
Endpoints:
|
| 7 |
+
POST /reset -- Reset environment, return observation
|
| 8 |
+
POST /step -- Apply one action, return (obs, reward, done, info)
|
| 9 |
+
POST /run_episode -- Run a full episode with a chosen agent
|
| 10 |
+
GET /leaderboard -- Benchmark all agents and return ranked results
|
| 11 |
+
GET / -- Serve the frontend
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
import sys
|
| 17 |
+
import os
|
| 18 |
+
|
| 19 |
+
# Ensure project root is importable
|
| 20 |
+
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
|
| 21 |
+
sys.path.insert(0, PROJECT_ROOT)
|
| 22 |
+
|
| 23 |
+
from fastapi import FastAPI, HTTPException
|
| 24 |
+
from fastapi.staticfiles import StaticFiles
|
| 25 |
+
from fastapi.responses import FileResponse
|
| 26 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 27 |
+
from pydantic import BaseModel
|
| 28 |
+
from typing import Optional, Any
|
| 29 |
+
import time
|
| 30 |
+
|
| 31 |
+
from env import UIEnv, Action, Observation
|
| 32 |
+
from agents.random_agent import RandomAgent
|
| 33 |
+
from agents.heuristic_agent import HeuristicAgent
|
| 34 |
+
from benchmark import BenchmarkRunner
|
| 35 |
+
|
| 36 |
+
# ======================================================================
|
| 37 |
+
# App setup
|
| 38 |
+
# ======================================================================
|
| 39 |
+
|
| 40 |
+
app = FastAPI(title="UIEnv Interactive Simulator", version="1.0.0")
|
| 41 |
+
|
| 42 |
+
app.add_middleware(
|
| 43 |
+
CORSMiddleware,
|
| 44 |
+
allow_origins=["*"],
|
| 45 |
+
allow_methods=["*"],
|
| 46 |
+
allow_headers=["*"],
|
| 47 |
+
)
|
| 48 |
+
|
| 49 |
+
# Serve frontend static files
|
| 50 |
+
FRONTEND_DIR = os.path.join(PROJECT_ROOT, "frontend")
|
| 51 |
+
app.mount("/static", StaticFiles(directory=FRONTEND_DIR), name="static")
|
| 52 |
+
|
| 53 |
+
# ======================================================================
|
| 54 |
+
# Global state
|
| 55 |
+
# ======================================================================
|
| 56 |
+
|
| 57 |
+
env = UIEnv(seed=42)
|
| 58 |
+
current_obs: Optional[Observation] = None
|
| 59 |
+
episode_done: bool = True
|
| 60 |
+
|
| 61 |
+
# Agent registry
|
| 62 |
+
AGENTS = {
|
| 63 |
+
"random": lambda: RandomAgent(seed=99),
|
| 64 |
+
"heuristic": lambda: HeuristicAgent(seed=99),
|
| 65 |
+
}
|
| 66 |
+
|
| 67 |
+
# ======================================================================
|
| 68 |
+
# Request / Response schemas
|
| 69 |
+
# ======================================================================
|
| 70 |
+
|
| 71 |
+
class StepRequest(BaseModel):
|
| 72 |
+
action: str
|
| 73 |
+
value: Optional[float] = None
|
| 74 |
+
|
| 75 |
+
class EpisodeRequest(BaseModel):
|
| 76 |
+
agent: str = "heuristic"
|
| 77 |
+
|
| 78 |
+
# ======================================================================
|
| 79 |
+
# Helpers
|
| 80 |
+
# ======================================================================
|
| 81 |
+
|
| 82 |
+
def obs_to_dict(obs: Observation) -> dict[str, Any]:
|
| 83 |
+
"""Convert an Observation to a JSON-friendly dict."""
|
| 84 |
+
return {
|
| 85 |
+
"device": obs.device,
|
| 86 |
+
"button_size": obs.layout.button_size,
|
| 87 |
+
"form_length": obs.layout.form_length,
|
| 88 |
+
"steps": obs.layout.steps,
|
| 89 |
+
"progress": round(obs.progress, 4),
|
| 90 |
+
"last_action": obs.last_action,
|
| 91 |
+
}
|
| 92 |
+
|
| 93 |
+
# ======================================================================
|
| 94 |
+
# Endpoints
|
| 95 |
+
# ======================================================================
|
| 96 |
+
|
| 97 |
+
@app.get("/")
|
| 98 |
+
async def serve_frontend():
|
| 99 |
+
"""Serve the main HTML page."""
|
| 100 |
+
return FileResponse(os.path.join(FRONTEND_DIR, "index.html"))
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
@app.post("/reset")
|
| 104 |
+
async def reset_env():
|
| 105 |
+
"""Reset the environment and return the initial observation."""
|
| 106 |
+
global current_obs, episode_done
|
| 107 |
+
current_obs = env.reset()
|
| 108 |
+
episode_done = False
|
| 109 |
+
return {"observation": obs_to_dict(current_obs), "done": False}
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
@app.post("/step")
|
| 113 |
+
async def step_env(req: StepRequest):
|
| 114 |
+
"""Apply one action and return the transition."""
|
| 115 |
+
global current_obs, episode_done
|
| 116 |
+
|
| 117 |
+
if episode_done:
|
| 118 |
+
raise HTTPException(status_code=400, detail="Episode is done. Call /reset first.")
|
| 119 |
+
|
| 120 |
+
try:
|
| 121 |
+
action = Action(type=req.action, value=req.value)
|
| 122 |
+
except Exception as e:
|
| 123 |
+
raise HTTPException(status_code=422, detail=f"Invalid action: {e}")
|
| 124 |
+
|
| 125 |
+
obs, reward, done, info = env.step(action)
|
| 126 |
+
current_obs = obs
|
| 127 |
+
episode_done = done
|
| 128 |
+
|
| 129 |
+
return {
|
| 130 |
+
"observation": obs_to_dict(obs),
|
| 131 |
+
"reward": round(reward, 4),
|
| 132 |
+
"done": done,
|
| 133 |
+
"info": {
|
| 134 |
+
"outcome": info["outcome"],
|
| 135 |
+
"step_count": info["step_count"],
|
| 136 |
+
"progress": round(info["progress"], 4),
|
| 137 |
+
"user_type": info["user_type"],
|
| 138 |
+
},
|
| 139 |
+
}
|
| 140 |
+
|
| 141 |
+
|
| 142 |
+
@app.post("/run_episode")
|
| 143 |
+
async def run_episode(req: EpisodeRequest):
|
| 144 |
+
"""Run a full episode with the selected agent and return all steps."""
|
| 145 |
+
global current_obs, episode_done
|
| 146 |
+
|
| 147 |
+
agent_name = req.agent.lower()
|
| 148 |
+
if agent_name not in AGENTS:
|
| 149 |
+
raise HTTPException(
|
| 150 |
+
status_code=400,
|
| 151 |
+
detail=f"Unknown agent '{req.agent}'. Available: {list(AGENTS.keys())}",
|
| 152 |
+
)
|
| 153 |
+
|
| 154 |
+
agent = AGENTS[agent_name]()
|
| 155 |
+
run_env = UIEnv(seed=42)
|
| 156 |
+
obs = run_env.reset()
|
| 157 |
+
agent.reset()
|
| 158 |
+
|
| 159 |
+
steps = []
|
| 160 |
+
done = False
|
| 161 |
+
|
| 162 |
+
while not done:
|
| 163 |
+
action = agent.act(obs)
|
| 164 |
+
obs, reward, done, info = run_env.step(action)
|
| 165 |
+
agent.update(info)
|
| 166 |
+
|
| 167 |
+
steps.append({
|
| 168 |
+
"observation": obs_to_dict(obs),
|
| 169 |
+
"action": action.type,
|
| 170 |
+
"action_value": action.value,
|
| 171 |
+
"reward": round(reward, 4),
|
| 172 |
+
"done": done,
|
| 173 |
+
"info": {
|
| 174 |
+
"outcome": info["outcome"],
|
| 175 |
+
"step_count": info["step_count"],
|
| 176 |
+
"progress": round(info["progress"], 4),
|
| 177 |
+
"user_type": info["user_type"],
|
| 178 |
+
},
|
| 179 |
+
})
|
| 180 |
+
|
| 181 |
+
# Also update the global state to match final state
|
| 182 |
+
current_obs = obs
|
| 183 |
+
episode_done = done
|
| 184 |
+
|
| 185 |
+
return {
|
| 186 |
+
"agent": req.agent,
|
| 187 |
+
"total_steps": len(steps),
|
| 188 |
+
"final_outcome": info["outcome"],
|
| 189 |
+
"total_reward": round(sum(s["reward"] for s in steps), 4),
|
| 190 |
+
"steps": steps,
|
| 191 |
+
}
|
| 192 |
+
|
| 193 |
+
|
| 194 |
+
@app.get("/leaderboard")
|
| 195 |
+
async def get_leaderboard():
|
| 196 |
+
"""Run a benchmark and return the leaderboard."""
|
| 197 |
+
agents = [RandomAgent(seed=99), HeuristicAgent(seed=99)]
|
| 198 |
+
|
| 199 |
+
runner = BenchmarkRunner(
|
| 200 |
+
agents=agents,
|
| 201 |
+
episodes=50,
|
| 202 |
+
env_seed=42,
|
| 203 |
+
verbose=False,
|
| 204 |
+
)
|
| 205 |
+
results = runner.run()
|
| 206 |
+
|
| 207 |
+
leaderboard = []
|
| 208 |
+
for rank, m in enumerate(results, start=1):
|
| 209 |
+
leaderboard.append({
|
| 210 |
+
"rank": rank,
|
| 211 |
+
"agent": m.agent_name,
|
| 212 |
+
"score": round(m.score, 4),
|
| 213 |
+
"completion_rate": round(m.completion_rate, 4),
|
| 214 |
+
"drop_rate": round(m.drop_rate, 4),
|
| 215 |
+
"avg_reward": round(m.avg_reward, 4),
|
| 216 |
+
"avg_steps": round(m.avg_steps, 2),
|
| 217 |
+
})
|
| 218 |
+
|
| 219 |
+
return {"leaderboard": leaderboard}
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
@app.get("/agents")
|
| 223 |
+
async def list_agents():
|
| 224 |
+
"""Return available agent names."""
|
| 225 |
+
return {"agents": list(AGENTS.keys())}
|
baseline.py
ADDED
|
@@ -0,0 +1,197 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import random
|
| 3 |
+
import time
|
| 4 |
+
from typing import Tuple
|
| 5 |
+
from openai import OpenAI
|
| 6 |
+
from env import UIEnv, Action, Observation
|
| 7 |
+
|
| 8 |
+
VALID_ACTIONS = [
|
| 9 |
+
"increase_button", "decrease_form", "increase_steps",
|
| 10 |
+
"decrease_steps", "reorder_sections", "set_button_size", "noop",
|
| 11 |
+
]
|
| 12 |
+
|
| 13 |
+
MAX_STEPS = 20
|
| 14 |
+
DEBUG = True
|
| 15 |
+
|
| 16 |
+
random.seed(42)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def load_env(task: str = "easy") -> UIEnv:
|
| 20 |
+
return UIEnv(seed=42, task=task)
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def heuristic_policy(obs: Observation) -> Action:
|
| 24 |
+
layout = obs.layout
|
| 25 |
+
|
| 26 |
+
# Calculate which dimension creates the most drop risk
|
| 27 |
+
step_risk = max(0, layout.steps - 3) * 0.06
|
| 28 |
+
form_risk = max(0, layout.form_length - 5) * 0.04
|
| 29 |
+
|
| 30 |
+
# Fix highest risk first
|
| 31 |
+
if step_risk > 0 or form_risk > 0:
|
| 32 |
+
if form_risk >= step_risk and layout.form_length > 4:
|
| 33 |
+
return Action(type="decrease_form")
|
| 34 |
+
if layout.steps > 2:
|
| 35 |
+
return Action(type="decrease_steps")
|
| 36 |
+
if layout.form_length > 4:
|
| 37 |
+
return Action(type="decrease_form")
|
| 38 |
+
|
| 39 |
+
# Fix button size instantly (targets hidden preference bonus at > 1.2)
|
| 40 |
+
if layout.button_size < 0.9 or layout.button_size > 1.3:
|
| 41 |
+
return Action(type="set_button_size", value=1.25)
|
| 42 |
+
|
| 43 |
+
# Fine-tune: bring steps and form to optimal completion thresholds
|
| 44 |
+
if layout.steps > 2:
|
| 45 |
+
return Action(type="decrease_steps")
|
| 46 |
+
if layout.form_length > 4:
|
| 47 |
+
return Action(type="decrease_form")
|
| 48 |
+
|
| 49 |
+
return Action(type="noop")
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def llm_policy(client: OpenAI, obs: Observation) -> Action:
|
| 53 |
+
state_desc = (
|
| 54 |
+
f"Device: {obs.device}\n"
|
| 55 |
+
f"Button Size: {obs.layout.button_size:.2f}\n"
|
| 56 |
+
f"Form Length: {obs.layout.form_length}\n"
|
| 57 |
+
f"Steps: {obs.layout.steps}\n"
|
| 58 |
+
f"Progress: {obs.progress:.2f}\n"
|
| 59 |
+
f"Last Action: {obs.last_action or 'None'}"
|
| 60 |
+
)
|
| 61 |
+
|
| 62 |
+
prompt = (
|
| 63 |
+
"You are optimizing a UI checkout flow to maximize user completion.\n"
|
| 64 |
+
"Fewer steps and shorter forms reduce friction. Button size between 0.9-1.3 is ideal.\n\n"
|
| 65 |
+
f"State:\n{state_desc}\n\n"
|
| 66 |
+
"Respond with ONLY one word from this list:\n"
|
| 67 |
+
"increase_button, decrease_form, increase_steps, decrease_steps, reorder_sections, set_button_size, noop"
|
| 68 |
+
)
|
| 69 |
+
|
| 70 |
+
max_retries = 2
|
| 71 |
+
for attempt in range(max_retries + 1):
|
| 72 |
+
try:
|
| 73 |
+
response = client.chat.completions.create(
|
| 74 |
+
model="katanemo/Arch-Router-1.5B",
|
| 75 |
+
messages=[
|
| 76 |
+
{"role": "system", "content": "You are a UI optimization agent."},
|
| 77 |
+
{"role": "user", "content": prompt},
|
| 78 |
+
],
|
| 79 |
+
temperature=0.001,
|
| 80 |
+
max_tokens=20,
|
| 81 |
+
)
|
| 82 |
+
|
| 83 |
+
content = response.choices[0].message.content
|
| 84 |
+
print("RAW RESPONSE:", content)
|
| 85 |
+
|
| 86 |
+
action_str = content.strip().lower()
|
| 87 |
+
|
| 88 |
+
for action in VALID_ACTIONS:
|
| 89 |
+
if action in action_str:
|
| 90 |
+
action_str = action
|
| 91 |
+
break
|
| 92 |
+
|
| 93 |
+
if action_str not in VALID_ACTIONS:
|
| 94 |
+
return Action(type="noop")
|
| 95 |
+
|
| 96 |
+
if action_str == "set_button_size":
|
| 97 |
+
return Action(type=action_str, value=1.1)
|
| 98 |
+
|
| 99 |
+
return Action(type=action_str)
|
| 100 |
+
|
| 101 |
+
except Exception as e:
|
| 102 |
+
if "429" in str(e):
|
| 103 |
+
if DEBUG: print(" [Rate Limit] Waiting 30s...")
|
| 104 |
+
time.sleep(30)
|
| 105 |
+
else:
|
| 106 |
+
if DEBUG: print(f" [API Error] {e}")
|
| 107 |
+
|
| 108 |
+
if attempt == max_retries:
|
| 109 |
+
return Action(type="noop")
|
| 110 |
+
time.sleep(2 ** attempt)
|
| 111 |
+
|
| 112 |
+
return Action(type="noop")
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
def agent_policy(client: OpenAI, obs: Observation) -> Action:
|
| 116 |
+
heuristic_action = heuristic_policy(obs)
|
| 117 |
+
if heuristic_action.type != "noop":
|
| 118 |
+
return heuristic_action
|
| 119 |
+
else:
|
| 120 |
+
return llm_policy(client, obs)
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
def run_episode(env: UIEnv, client: OpenAI) -> Tuple[float, bool]:
|
| 124 |
+
obs = env.reset()
|
| 125 |
+
total_reward = 0.0
|
| 126 |
+
done = False
|
| 127 |
+
completed = False
|
| 128 |
+
steps = 0
|
| 129 |
+
|
| 130 |
+
while not done and steps < MAX_STEPS:
|
| 131 |
+
action = agent_policy(client, obs)
|
| 132 |
+
obs, reward, done, info = env.step(action)
|
| 133 |
+
total_reward += reward
|
| 134 |
+
steps += 1
|
| 135 |
+
|
| 136 |
+
if info.get("outcome") == "complete":
|
| 137 |
+
completed = True
|
| 138 |
+
|
| 139 |
+
time.sleep(5)
|
| 140 |
+
|
| 141 |
+
if DEBUG:
|
| 142 |
+
print(f" step={steps} action={action.type} reward={reward:+.3f} outcome={info.get('outcome')}")
|
| 143 |
+
|
| 144 |
+
return total_reward, completed
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
def evaluate_task(task: str, client: OpenAI, n_episodes: int = 1) -> Tuple[float, float, float]:
|
| 148 |
+
total_rewards = 0.0
|
| 149 |
+
completions = 0
|
| 150 |
+
|
| 151 |
+
for ep in range(n_episodes):
|
| 152 |
+
env = load_env(task)
|
| 153 |
+
|
| 154 |
+
reward, completed = run_episode(env, client)
|
| 155 |
+
total_rewards += reward
|
| 156 |
+
if completed:
|
| 157 |
+
completions += 1
|
| 158 |
+
|
| 159 |
+
if DEBUG:
|
| 160 |
+
print(f" [{task}] ep={ep+1}/{n_episodes} reward={reward:+.3f} completed={completed}")
|
| 161 |
+
|
| 162 |
+
avg_reward = total_rewards / n_episodes
|
| 163 |
+
completion_rate = completions / n_episodes
|
| 164 |
+
score = 0.7 * completion_rate + 0.3 * avg_reward
|
| 165 |
+
|
| 166 |
+
return avg_reward, completion_rate, score
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
def main():
|
| 170 |
+
hf_token = os.getenv("HF_TOKEN")
|
| 171 |
+
if not hf_token:
|
| 172 |
+
print("Error: HF_TOKEN environment variable not set.")
|
| 173 |
+
return
|
| 174 |
+
|
| 175 |
+
client = OpenAI(
|
| 176 |
+
base_url="https://router.huggingface.co/v1",
|
| 177 |
+
api_key=os.getenv("HF_TOKEN")
|
| 178 |
+
)
|
| 179 |
+
tasks = ["easy", "medium", "hard"]
|
| 180 |
+
|
| 181 |
+
print("=" * 50)
|
| 182 |
+
print(" UIEnv Baseline Evaluation (Hugging Face Router)")
|
| 183 |
+
print("=" * 50)
|
| 184 |
+
|
| 185 |
+
for task in tasks:
|
| 186 |
+
print(f"\n> Evaluating task: {task}...")
|
| 187 |
+
avg_reward, completion_rate, score = evaluate_task(task, client)
|
| 188 |
+
print(f"\nTask: {task}")
|
| 189 |
+
print(f" Avg Reward: {avg_reward:.4f}")
|
| 190 |
+
print(f" Completion Rate: {completion_rate:.4f}")
|
| 191 |
+
print(f" Score: {score:.4f}")
|
| 192 |
+
|
| 193 |
+
print("\n" + "=" * 50)
|
| 194 |
+
|
| 195 |
+
|
| 196 |
+
if __name__ == "__main__":
|
| 197 |
+
main()
|
benchmark.py
ADDED
|
@@ -0,0 +1,353 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
benchmark.py
|
| 3 |
+
-------------
|
| 4 |
+
Robust benchmarking and leaderboard system for UIEnv.
|
| 5 |
+
|
| 6 |
+
Evaluates multiple agents on identical environment conditions, computes
|
| 7 |
+
standardised metrics, and produces a ranked leaderboard.
|
| 8 |
+
|
| 9 |
+
Fairness guarantee
|
| 10 |
+
------------------
|
| 11 |
+
Each agent is evaluated on a *fresh* UIEnv instance created with the same
|
| 12 |
+
seed, so every agent faces the exact same sequence of user types, devices,
|
| 13 |
+
and random-drop rolls. Agent-internal RNG is independent.
|
| 14 |
+
|
| 15 |
+
Usage
|
| 16 |
+
-----
|
| 17 |
+
python benchmark.py # default: 50 episodes
|
| 18 |
+
python benchmark.py --episodes 200 # custom episode count
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import argparse
|
| 24 |
+
import json
|
| 25 |
+
import time
|
| 26 |
+
from dataclasses import dataclass, field, asdict
|
| 27 |
+
from typing import Protocol, runtime_checkable
|
| 28 |
+
|
| 29 |
+
from env import UIEnv, Action, Observation
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
# ======================================================================
|
| 33 |
+
# Agent Protocol -- any agent plugged into the benchmark must satisfy this
|
| 34 |
+
# ======================================================================
|
| 35 |
+
|
| 36 |
+
@runtime_checkable
|
| 37 |
+
class Agent(Protocol):
|
| 38 |
+
"""Minimal interface every agent must expose."""
|
| 39 |
+
|
| 40 |
+
NAME: str
|
| 41 |
+
|
| 42 |
+
def reset(self) -> None: ...
|
| 43 |
+
def act(self, obs: Observation) -> Action: ...
|
| 44 |
+
def update(self, info: dict) -> None: ...
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
# ======================================================================
|
| 48 |
+
# Per-episode result record
|
| 49 |
+
# ======================================================================
|
| 50 |
+
|
| 51 |
+
@dataclass
|
| 52 |
+
class EpisodeResult:
|
| 53 |
+
"""Immutable record of a single episode's outcome."""
|
| 54 |
+
episode: int
|
| 55 |
+
outcome: str # "complete" | "drop" | "distrust" | "continue"
|
| 56 |
+
total_reward: float
|
| 57 |
+
steps: int
|
| 58 |
+
final_progress: float
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
# ======================================================================
|
| 62 |
+
# Per-agent aggregate metrics
|
| 63 |
+
# ======================================================================
|
| 64 |
+
|
| 65 |
+
@dataclass
|
| 66 |
+
class AgentMetrics:
|
| 67 |
+
"""Aggregate metrics for one agent across all episodes."""
|
| 68 |
+
agent_name: str
|
| 69 |
+
score: float # 0.7 * completion_rate + 0.3 * avg_reward
|
| 70 |
+
completion_rate: float
|
| 71 |
+
drop_rate: float
|
| 72 |
+
avg_reward: float
|
| 73 |
+
avg_steps: float
|
| 74 |
+
total_episodes: int
|
| 75 |
+
episodes: list[EpisodeResult] = field(default_factory=list, repr=False)
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
# ======================================================================
|
| 79 |
+
# BenchmarkRunner
|
| 80 |
+
# ======================================================================
|
| 81 |
+
|
| 82 |
+
class BenchmarkRunner:
|
| 83 |
+
"""
|
| 84 |
+
Evaluates a list of agents on UIEnv and produces a ranked leaderboard.
|
| 85 |
+
|
| 86 |
+
Parameters
|
| 87 |
+
----------
|
| 88 |
+
agents : list
|
| 89 |
+
Agent instances satisfying the Agent protocol.
|
| 90 |
+
episodes : int
|
| 91 |
+
Number of episodes per agent (default 50).
|
| 92 |
+
env_seed : int
|
| 93 |
+
Seed for UIEnv -- same for every agent to ensure fairness.
|
| 94 |
+
verbose : bool
|
| 95 |
+
If True, print per-episode progress during evaluation.
|
| 96 |
+
"""
|
| 97 |
+
|
| 98 |
+
def __init__(
|
| 99 |
+
self,
|
| 100 |
+
agents: list,
|
| 101 |
+
episodes: int = 50,
|
| 102 |
+
env_seed: int = 42,
|
| 103 |
+
verbose: bool = False,
|
| 104 |
+
) -> None:
|
| 105 |
+
self._agents = agents
|
| 106 |
+
self._episodes = episodes
|
| 107 |
+
self._env_seed = env_seed
|
| 108 |
+
self._verbose = verbose
|
| 109 |
+
|
| 110 |
+
# Validate agent interface at init time
|
| 111 |
+
for agent in agents:
|
| 112 |
+
if not isinstance(agent, Agent):
|
| 113 |
+
raise TypeError(
|
| 114 |
+
f"{agent!r} does not satisfy the Agent protocol "
|
| 115 |
+
f"(needs NAME, reset, act, update)"
|
| 116 |
+
)
|
| 117 |
+
|
| 118 |
+
# ------------------------------------------------------------------ #
|
| 119 |
+
# Core evaluation loop #
|
| 120 |
+
# ------------------------------------------------------------------ #
|
| 121 |
+
|
| 122 |
+
def _evaluate_agent(self, agent) -> AgentMetrics:
|
| 123 |
+
"""
|
| 124 |
+
Run one agent for N episodes and collect metrics.
|
| 125 |
+
|
| 126 |
+
A fresh UIEnv is created with the canonical seed so every agent
|
| 127 |
+
faces the same stochastic sequence and an even mix of tasks.
|
| 128 |
+
"""
|
| 129 |
+
total_reward: float = 0.0
|
| 130 |
+
completions: int = 0
|
| 131 |
+
drops: int = 0
|
| 132 |
+
total_steps: int = 0
|
| 133 |
+
episode_results: list[EpisodeResult] = []
|
| 134 |
+
|
| 135 |
+
tasks = ["easy", "medium", "hard"]
|
| 136 |
+
|
| 137 |
+
for ep in range(self._episodes):
|
| 138 |
+
# Rotate through all task difficulties evenly
|
| 139 |
+
current_task = tasks[ep % len(tasks)]
|
| 140 |
+
env = UIEnv(seed=self._env_seed + ep, task=current_task)
|
| 141 |
+
|
| 142 |
+
obs = env.reset()
|
| 143 |
+
agent.reset()
|
| 144 |
+
|
| 145 |
+
ep_reward: float = 0.0
|
| 146 |
+
done = False
|
| 147 |
+
|
| 148 |
+
while not done:
|
| 149 |
+
action = agent.act(obs)
|
| 150 |
+
obs, reward, done, info = env.step(action)
|
| 151 |
+
agent.update(info)
|
| 152 |
+
ep_reward += reward
|
| 153 |
+
|
| 154 |
+
outcome = info["outcome"]
|
| 155 |
+
steps = info["step_count"]
|
| 156 |
+
progress = info["progress"]
|
| 157 |
+
|
| 158 |
+
total_reward += ep_reward
|
| 159 |
+
total_steps += steps
|
| 160 |
+
|
| 161 |
+
if outcome == "complete":
|
| 162 |
+
completions += 1
|
| 163 |
+
elif outcome == "drop":
|
| 164 |
+
drops += 1
|
| 165 |
+
|
| 166 |
+
episode_results.append(
|
| 167 |
+
EpisodeResult(
|
| 168 |
+
episode=ep,
|
| 169 |
+
outcome=outcome,
|
| 170 |
+
total_reward=ep_reward,
|
| 171 |
+
steps=steps,
|
| 172 |
+
final_progress=progress,
|
| 173 |
+
)
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
if self._verbose:
|
| 177 |
+
print(
|
| 178 |
+
f" [{agent.NAME}] ep={ep:03d} "
|
| 179 |
+
f"outcome={outcome:<10s} "
|
| 180 |
+
f"reward={ep_reward:+.3f} "
|
| 181 |
+
f"steps={steps}"
|
| 182 |
+
)
|
| 183 |
+
|
| 184 |
+
n = self._episodes
|
| 185 |
+
completion_rate = completions / n
|
| 186 |
+
drop_rate = drops / n
|
| 187 |
+
avg_reward = total_reward / n
|
| 188 |
+
avg_steps = total_steps / n
|
| 189 |
+
score = 0.7 * completion_rate + 0.3 * avg_reward
|
| 190 |
+
|
| 191 |
+
return AgentMetrics(
|
| 192 |
+
agent_name=agent.NAME,
|
| 193 |
+
score=score,
|
| 194 |
+
completion_rate=completion_rate,
|
| 195 |
+
drop_rate=drop_rate,
|
| 196 |
+
avg_reward=avg_reward,
|
| 197 |
+
avg_steps=avg_steps,
|
| 198 |
+
total_episodes=n,
|
| 199 |
+
episodes=episode_results,
|
| 200 |
+
)
|
| 201 |
+
|
| 202 |
+
# ------------------------------------------------------------------ #
|
| 203 |
+
# Public API #
|
| 204 |
+
# ------------------------------------------------------------------ #
|
| 205 |
+
|
| 206 |
+
def run(self) -> list[AgentMetrics]:
|
| 207 |
+
"""
|
| 208 |
+
Evaluate all agents and return a leaderboard sorted by score (desc).
|
| 209 |
+
|
| 210 |
+
Returns
|
| 211 |
+
-------
|
| 212 |
+
list[AgentMetrics]
|
| 213 |
+
One entry per agent, sorted best-first.
|
| 214 |
+
"""
|
| 215 |
+
results: list[AgentMetrics] = []
|
| 216 |
+
|
| 217 |
+
for agent in self._agents:
|
| 218 |
+
if self._verbose:
|
| 219 |
+
print(f"\n> Evaluating {agent.NAME} ({self._episodes} episodes) ...")
|
| 220 |
+
|
| 221 |
+
t0 = time.perf_counter()
|
| 222 |
+
metrics = self._evaluate_agent(agent)
|
| 223 |
+
elapsed = time.perf_counter() - t0
|
| 224 |
+
|
| 225 |
+
if self._verbose:
|
| 226 |
+
print(f" Done in {elapsed:.2f}s")
|
| 227 |
+
|
| 228 |
+
results.append(metrics)
|
| 229 |
+
|
| 230 |
+
# Sort descending by score
|
| 231 |
+
results.sort(key=lambda m: m.score, reverse=True)
|
| 232 |
+
return results
|
| 233 |
+
|
| 234 |
+
# ------------------------------------------------------------------ #
|
| 235 |
+
# Display #
|
| 236 |
+
# ------------------------------------------------------------------ #
|
| 237 |
+
|
| 238 |
+
@staticmethod
|
| 239 |
+
def print_leaderboard(leaderboard: list[AgentMetrics]) -> None:
|
| 240 |
+
"""Print a professional leaderboard table to stdout."""
|
| 241 |
+
|
| 242 |
+
hdr = (
|
| 243 |
+
f" {'Rank':<6s}"
|
| 244 |
+
f"{'Agent':<20s}"
|
| 245 |
+
f"{'Score':>8s}"
|
| 246 |
+
f"{'Completion':>12s}"
|
| 247 |
+
f"{'Drop':>8s}"
|
| 248 |
+
f"{'AvgReward':>11s}"
|
| 249 |
+
f"{'AvgSteps':>10s}"
|
| 250 |
+
)
|
| 251 |
+
sep = "-" * len(hdr)
|
| 252 |
+
|
| 253 |
+
print()
|
| 254 |
+
print("=" * len(hdr))
|
| 255 |
+
print(" LEADERBOARD".center(len(hdr)))
|
| 256 |
+
print("=" * len(hdr))
|
| 257 |
+
print(hdr)
|
| 258 |
+
print(sep)
|
| 259 |
+
|
| 260 |
+
for rank, m in enumerate(leaderboard, start=1):
|
| 261 |
+
medal = {1: "(1st)", 2: "(2nd)", 3: "(3rd)"}.get(rank, "")
|
| 262 |
+
print(
|
| 263 |
+
f" {f'#{rank} {medal}':<6s}"
|
| 264 |
+
f"{m.agent_name:<20s}"
|
| 265 |
+
f"{m.score:>8.4f}"
|
| 266 |
+
f"{m.completion_rate * 100:>11.1f}%"
|
| 267 |
+
f"{m.drop_rate * 100:>7.1f}%"
|
| 268 |
+
f"{m.avg_reward:>11.4f}"
|
| 269 |
+
f"{m.avg_steps:>10.1f}"
|
| 270 |
+
)
|
| 271 |
+
|
| 272 |
+
print(sep)
|
| 273 |
+
print()
|
| 274 |
+
|
| 275 |
+
@staticmethod
|
| 276 |
+
def print_comparison(leaderboard: list[AgentMetrics]) -> None:
|
| 277 |
+
"""Print head-to-head delta between rank #1 and all others."""
|
| 278 |
+
if len(leaderboard) < 2:
|
| 279 |
+
return
|
| 280 |
+
|
| 281 |
+
best = leaderboard[0]
|
| 282 |
+
print(" HEAD-TO-HEAD vs " + best.agent_name)
|
| 283 |
+
print(" " + "-" * 50)
|
| 284 |
+
|
| 285 |
+
for other in leaderboard[1:]:
|
| 286 |
+
d_score = best.score - other.score
|
| 287 |
+
d_comp = (best.completion_rate - other.completion_rate) * 100
|
| 288 |
+
d_drop = (best.drop_rate - other.drop_rate) * 100
|
| 289 |
+
d_rew = best.avg_reward - other.avg_reward
|
| 290 |
+
|
| 291 |
+
print(
|
| 292 |
+
f" vs {other.agent_name:<16s} "
|
| 293 |
+
f"score: +{d_score:.4f} "
|
| 294 |
+
f"completion: {d_comp:+.1f}pp "
|
| 295 |
+
f"drop: {d_drop:+.1f}pp "
|
| 296 |
+
f"reward: {d_rew:+.4f}"
|
| 297 |
+
)
|
| 298 |
+
|
| 299 |
+
print()
|
| 300 |
+
|
| 301 |
+
@staticmethod
|
| 302 |
+
def export_json(leaderboard: list[AgentMetrics], path: str = "leaderboard.json") -> None:
|
| 303 |
+
"""Export the leaderboard to a JSON file (without per-episode logs)."""
|
| 304 |
+
data = []
|
| 305 |
+
for m in leaderboard:
|
| 306 |
+
d = asdict(m)
|
| 307 |
+
del d["episodes"] # keep export compact
|
| 308 |
+
data.append(d)
|
| 309 |
+
|
| 310 |
+
with open(path, "w", encoding="utf-8") as f:
|
| 311 |
+
json.dump(data, f, indent=2)
|
| 312 |
+
|
| 313 |
+
print(f" Leaderboard exported to {path}")
|
| 314 |
+
|
| 315 |
+
|
| 316 |
+
# ======================================================================
|
| 317 |
+
# Main -- run benchmark with all available agents
|
| 318 |
+
# ======================================================================
|
| 319 |
+
|
| 320 |
+
if __name__ == "__main__":
|
| 321 |
+
|
| 322 |
+
parser = argparse.ArgumentParser(description="UIEnv Agent Benchmark")
|
| 323 |
+
parser.add_argument("--episodes", type=int, default=50, help="Episodes per agent")
|
| 324 |
+
parser.add_argument("--seed", type=int, default=42, help="Environment seed")
|
| 325 |
+
parser.add_argument("--verbose", action="store_true", help="Show per-episode logs")
|
| 326 |
+
parser.add_argument("--export", action="store_true", help="Export leaderboard JSON")
|
| 327 |
+
args = parser.parse_args()
|
| 328 |
+
|
| 329 |
+
# -- Import agents --
|
| 330 |
+
from agents.random_agent import RandomAgent
|
| 331 |
+
from agents.heuristic_agent import HeuristicAgent
|
| 332 |
+
|
| 333 |
+
agents = [
|
| 334 |
+
RandomAgent(seed=99),
|
| 335 |
+
HeuristicAgent(seed=99),
|
| 336 |
+
]
|
| 337 |
+
|
| 338 |
+
# -- Run benchmark --
|
| 339 |
+
runner = BenchmarkRunner(
|
| 340 |
+
agents=agents,
|
| 341 |
+
episodes=args.episodes,
|
| 342 |
+
env_seed=args.seed,
|
| 343 |
+
verbose=args.verbose,
|
| 344 |
+
)
|
| 345 |
+
|
| 346 |
+
leaderboard = runner.run()
|
| 347 |
+
|
| 348 |
+
# -- Display results --
|
| 349 |
+
runner.print_leaderboard(leaderboard)
|
| 350 |
+
runner.print_comparison(leaderboard)
|
| 351 |
+
|
| 352 |
+
if args.export:
|
| 353 |
+
runner.export_json(leaderboard)
|
env.py
ADDED
|
@@ -0,0 +1,364 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ui_env.py
|
| 3 |
+
---------
|
| 4 |
+
Environment Engine for an Adaptive UI Layout Optimization system.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
import random
|
| 10 |
+
from typing import Literal, Optional
|
| 11 |
+
|
| 12 |
+
from pydantic import BaseModel, Field, model_validator
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
# ---------------------------------------------------------------------------
|
| 16 |
+
# Constants
|
| 17 |
+
# ---------------------------------------------------------------------------
|
| 18 |
+
|
| 19 |
+
BUTTON_SIZE_MIN: float = 0.5
|
| 20 |
+
BUTTON_SIZE_MAX: float = 2.0
|
| 21 |
+
FORM_LENGTH_MIN: int = 1
|
| 22 |
+
FORM_LENGTH_MAX: int = 10
|
| 23 |
+
STEPS_MIN: int = 1
|
| 24 |
+
STEPS_MAX: int = 10
|
| 25 |
+
|
| 26 |
+
BUTTON_SIZE_DELTA: float = 0.1
|
| 27 |
+
FORM_LENGTH_DELTA: int = 1
|
| 28 |
+
STEPS_DELTA: int = 1
|
| 29 |
+
|
| 30 |
+
INVALID_ACTION_REWARD: float = -0.1
|
| 31 |
+
MAX_STEPS_PER_EPISODE: int = 20
|
| 32 |
+
|
| 33 |
+
BUTTON_SWEET_LOW: float = 0.9
|
| 34 |
+
BUTTON_SWEET_HIGH: float = 1.3
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
# ---------------------------------------------------------------------------
|
| 38 |
+
# Data Models
|
| 39 |
+
# ---------------------------------------------------------------------------
|
| 40 |
+
|
| 41 |
+
class Layout(BaseModel):
|
| 42 |
+
"""Represents the current UI layout configuration."""
|
| 43 |
+
|
| 44 |
+
button_size: float = Field(
|
| 45 |
+
default=1.0,
|
| 46 |
+
ge=BUTTON_SIZE_MIN,
|
| 47 |
+
le=BUTTON_SIZE_MAX,
|
| 48 |
+
description="Size multiplier for UI buttons (0.5 - 2.0).",
|
| 49 |
+
)
|
| 50 |
+
form_length: int = Field(
|
| 51 |
+
default=5,
|
| 52 |
+
ge=FORM_LENGTH_MIN,
|
| 53 |
+
le=FORM_LENGTH_MAX,
|
| 54 |
+
description="Number of fields in the form (1 - 10).",
|
| 55 |
+
)
|
| 56 |
+
steps: int = Field(
|
| 57 |
+
default=3,
|
| 58 |
+
ge=STEPS_MIN,
|
| 59 |
+
le=STEPS_MAX,
|
| 60 |
+
description="Number of wizard / checkout steps (1 - 10).",
|
| 61 |
+
)
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
class Observation(BaseModel):
|
| 65 |
+
"""Full observable state returned to the agent after every transition."""
|
| 66 |
+
|
| 67 |
+
device: Literal["mobile", "desktop"] = Field(
|
| 68 |
+
description="Device type the user is on.",
|
| 69 |
+
)
|
| 70 |
+
layout: Layout = Field(
|
| 71 |
+
description="Current layout configuration.",
|
| 72 |
+
)
|
| 73 |
+
progress: float = Field(
|
| 74 |
+
ge=0.0,
|
| 75 |
+
le=1.0,
|
| 76 |
+
description="User's task-completion progress in [0, 1].",
|
| 77 |
+
)
|
| 78 |
+
last_action: Optional[str] = Field(
|
| 79 |
+
default=None,
|
| 80 |
+
description="String name of the most recently applied action, or None.",
|
| 81 |
+
)
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
class Action(BaseModel):
|
| 85 |
+
"""An action the agent can submit to the environment."""
|
| 86 |
+
|
| 87 |
+
type: Literal[
|
| 88 |
+
"increase_button",
|
| 89 |
+
"decrease_form",
|
| 90 |
+
"increase_steps",
|
| 91 |
+
"decrease_steps",
|
| 92 |
+
"reorder_sections",
|
| 93 |
+
"set_button_size",
|
| 94 |
+
"noop",
|
| 95 |
+
] = Field(description="Discrete action type.")
|
| 96 |
+
value: Optional[float] = Field(
|
| 97 |
+
default=None,
|
| 98 |
+
description="Optional scalar payload (used by set_button_size).",
|
| 99 |
+
)
|
| 100 |
+
|
| 101 |
+
@model_validator(mode="after")
|
| 102 |
+
def _value_required_for_set_button_size(self) -> "Action":
|
| 103 |
+
"""Ensure `value` is provided when action type requires it."""
|
| 104 |
+
if self.type == "set_button_size" and self.value is None:
|
| 105 |
+
raise ValueError("'value' must be provided for action type 'set_button_size'.")
|
| 106 |
+
return self
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
# ---------------------------------------------------------------------------
|
| 110 |
+
# Environment Engine
|
| 111 |
+
# ---------------------------------------------------------------------------
|
| 112 |
+
|
| 113 |
+
class UIEnv:
|
| 114 |
+
"""Adaptive UI Layout Optimization - Environment Engine."""
|
| 115 |
+
|
| 116 |
+
def __init__(self, seed: int = 42, task: str = "easy") -> None:
|
| 117 |
+
self._seed: int = seed
|
| 118 |
+
self._task: str = task
|
| 119 |
+
self._rng: random.Random = random.Random(seed)
|
| 120 |
+
|
| 121 |
+
self._layout: Layout = Layout()
|
| 122 |
+
self._device: Literal["mobile", "desktop"] = "desktop"
|
| 123 |
+
self._progress: float = 0.0
|
| 124 |
+
self._last_action: Optional[str] = None
|
| 125 |
+
self._step_count: int = 0
|
| 126 |
+
|
| 127 |
+
self._prefers_short_forms: bool = False
|
| 128 |
+
self._prefers_large_buttons: bool = False
|
| 129 |
+
self._user_type: str = "new"
|
| 130 |
+
|
| 131 |
+
self._ready: bool = False
|
| 132 |
+
|
| 133 |
+
def reset(self) -> Observation:
|
| 134 |
+
if self._task == "easy":
|
| 135 |
+
steps = self._rng.randint(2, 3)
|
| 136 |
+
form_length = self._rng.randint(2, 4)
|
| 137 |
+
button_size = self._rng.uniform(0.9, 1.2)
|
| 138 |
+
elif self._task == "medium":
|
| 139 |
+
steps = self._rng.randint(3, 5)
|
| 140 |
+
form_length = self._rng.randint(4, 6)
|
| 141 |
+
button_size = self._rng.uniform(0.7, 1.5)
|
| 142 |
+
elif self._task == "hard":
|
| 143 |
+
steps = self._rng.randint(5, 8)
|
| 144 |
+
form_length = self._rng.randint(6, 10)
|
| 145 |
+
button_size = self._rng.uniform(0.5, 2.0)
|
| 146 |
+
else:
|
| 147 |
+
steps = self._rng.randint(3, 5)
|
| 148 |
+
form_length = self._rng.randint(4, 6)
|
| 149 |
+
button_size = 1.0
|
| 150 |
+
|
| 151 |
+
self._layout = Layout(
|
| 152 |
+
button_size=button_size,
|
| 153 |
+
form_length=form_length,
|
| 154 |
+
steps=steps,
|
| 155 |
+
)
|
| 156 |
+
self._clamp_layout()
|
| 157 |
+
|
| 158 |
+
self._device = self._rng.choice(("mobile", "desktop"))
|
| 159 |
+
self._progress = 0.0
|
| 160 |
+
self._last_action = None
|
| 161 |
+
self._step_count = 0
|
| 162 |
+
|
| 163 |
+
self._prefers_short_forms = self._rng.choice([True, False])
|
| 164 |
+
self._prefers_large_buttons = self._rng.choice([True, False])
|
| 165 |
+
self._user_type = self._rng.choice(["impatient", "careful", "new"])
|
| 166 |
+
|
| 167 |
+
self._ready = True
|
| 168 |
+
return self._get_observation()
|
| 169 |
+
|
| 170 |
+
def step(self, action: Action) -> tuple[Observation, float, bool, dict]:
|
| 171 |
+
if not self._ready:
|
| 172 |
+
raise RuntimeError("Call reset() before step().")
|
| 173 |
+
|
| 174 |
+
action_reward_offset: float = self._apply_action(action)
|
| 175 |
+
self._step_count += 1
|
| 176 |
+
|
| 177 |
+
outcome, user_reward = self._simulate_user()
|
| 178 |
+
done = False
|
| 179 |
+
|
| 180 |
+
if outcome == "drop":
|
| 181 |
+
done = True
|
| 182 |
+
elif outcome == "distrust":
|
| 183 |
+
# progress is stalled, episode continues
|
| 184 |
+
pass
|
| 185 |
+
else:
|
| 186 |
+
# user successfully proceeds through 1 of the required layout steps
|
| 187 |
+
self._progress += 1.0 / max(1, self._layout.steps)
|
| 188 |
+
if self._progress >= 0.999:
|
| 189 |
+
self._progress = 1.0
|
| 190 |
+
outcome = "complete"
|
| 191 |
+
done = True
|
| 192 |
+
|
| 193 |
+
# Base reward
|
| 194 |
+
reward = user_reward + action_reward_offset
|
| 195 |
+
if outcome == "complete":
|
| 196 |
+
reward += 2.0
|
| 197 |
+
elif outcome == "continue":
|
| 198 |
+
reward += 0.1 # small reward for steady progress
|
| 199 |
+
|
| 200 |
+
# Time penalty
|
| 201 |
+
reward -= 0.05
|
| 202 |
+
|
| 203 |
+
if self._task == "hard":
|
| 204 |
+
reward += self._rng.uniform(-0.2, 0.2)
|
| 205 |
+
|
| 206 |
+
if self._step_count >= MAX_STEPS_PER_EPISODE:
|
| 207 |
+
done = True
|
| 208 |
+
|
| 209 |
+
info: dict = {
|
| 210 |
+
"completed": (outcome == "complete"),
|
| 211 |
+
"outcome": outcome,
|
| 212 |
+
"progress": self._progress,
|
| 213 |
+
"step_count": self._step_count,
|
| 214 |
+
"user_type": self._user_type,
|
| 215 |
+
}
|
| 216 |
+
|
| 217 |
+
return self._get_observation(), reward, done, info
|
| 218 |
+
|
| 219 |
+
def state(self) -> Observation:
|
| 220 |
+
if not self._ready:
|
| 221 |
+
raise RuntimeError("Call reset() before state().")
|
| 222 |
+
return self._get_observation()
|
| 223 |
+
|
| 224 |
+
def _simulate_user(self) -> tuple[str, float]:
|
| 225 |
+
"""Simulates user behavior (drop, distrust, or continue) based on layout.
|
| 226 |
+
|
| 227 |
+
Calibrated so that:
|
| 228 |
+
- easy tasks β ~80-95 % survival per step
|
| 229 |
+
- medium tasks β ~70-85 % survival per step
|
| 230 |
+
- hard tasks β ~55-75 % survival per step (achievable but tough)
|
| 231 |
+
|
| 232 |
+
The user has a brief grace period (first 2 steps) where they won't
|
| 233 |
+
drop β simulating the patience of a user who just landed on the page.
|
| 234 |
+
"""
|
| 235 |
+
# Grace period: user won't drop during the first 3 steps
|
| 236 |
+
if self._step_count <= 3:
|
| 237 |
+
return "continue", 0.0
|
| 238 |
+
|
| 239 |
+
layout = self._layout
|
| 240 |
+
drop_chance = 0.0
|
| 241 |
+
distrust_chance = 0.0
|
| 242 |
+
|
| 243 |
+
# --- Friction from too many checkout steps ---
|
| 244 |
+
if layout.steps > 3:
|
| 245 |
+
drop_chance += 0.05 * (layout.steps - 3)
|
| 246 |
+
|
| 247 |
+
# --- Friction from long forms ---
|
| 248 |
+
if layout.form_length > 5:
|
| 249 |
+
drop_chance += 0.04 * (layout.form_length - 5)
|
| 250 |
+
|
| 251 |
+
# --- Hidden user preference: short-form lovers ---
|
| 252 |
+
if self._prefers_short_forms and layout.form_length > 4:
|
| 253 |
+
drop_chance += 0.05
|
| 254 |
+
|
| 255 |
+
# --- Too few steps feels sketchy β distrust ---
|
| 256 |
+
if layout.steps < 2:
|
| 257 |
+
distrust_chance += 0.20
|
| 258 |
+
|
| 259 |
+
# --- Button size outside sweet spot ---
|
| 260 |
+
if layout.button_size < 0.9 or layout.button_size > 1.3:
|
| 261 |
+
distrust_chance += 0.10
|
| 262 |
+
drop_chance += 0.02
|
| 263 |
+
|
| 264 |
+
# --- User persona modifiers ---
|
| 265 |
+
if self._user_type == "impatient":
|
| 266 |
+
drop_chance += 0.06
|
| 267 |
+
elif self._user_type == "careful":
|
| 268 |
+
distrust_chance += 0.08
|
| 269 |
+
|
| 270 |
+
# --- Task difficulty scaling ---
|
| 271 |
+
if self._task == "hard":
|
| 272 |
+
drop_chance += 0.04
|
| 273 |
+
elif self._task == "easy":
|
| 274 |
+
drop_chance -= 0.05
|
| 275 |
+
distrust_chance -= 0.05
|
| 276 |
+
|
| 277 |
+
drop_chance = max(0.0, min(1.0, drop_chance))
|
| 278 |
+
distrust_chance = max(0.0, min(1.0 - drop_chance, distrust_chance))
|
| 279 |
+
|
| 280 |
+
roll = self._rng.random()
|
| 281 |
+
|
| 282 |
+
if roll < drop_chance:
|
| 283 |
+
return "drop", -1.0
|
| 284 |
+
elif roll < drop_chance + distrust_chance:
|
| 285 |
+
return "distrust", -0.2
|
| 286 |
+
else:
|
| 287 |
+
return "continue", 0.0
|
| 288 |
+
|
| 289 |
+
def _apply_action(self, action: Action) -> float:
|
| 290 |
+
reward: float = 0.0
|
| 291 |
+
|
| 292 |
+
match action.type:
|
| 293 |
+
case "increase_button":
|
| 294 |
+
self._layout.button_size += BUTTON_SIZE_DELTA
|
| 295 |
+
case "decrease_form":
|
| 296 |
+
self._layout.form_length -= FORM_LENGTH_DELTA
|
| 297 |
+
case "increase_steps":
|
| 298 |
+
self._layout.steps += STEPS_DELTA
|
| 299 |
+
case "decrease_steps":
|
| 300 |
+
self._layout.steps -= STEPS_DELTA
|
| 301 |
+
case "set_button_size":
|
| 302 |
+
proposed: float = action.value
|
| 303 |
+
if not (BUTTON_SIZE_MIN <= proposed <= BUTTON_SIZE_MAX):
|
| 304 |
+
reward = INVALID_ACTION_REWARD
|
| 305 |
+
self._layout.button_size = proposed
|
| 306 |
+
case "reorder_sections":
|
| 307 |
+
pass
|
| 308 |
+
case "noop":
|
| 309 |
+
pass
|
| 310 |
+
|
| 311 |
+
self._clamp_layout()
|
| 312 |
+
self._last_action = action.type
|
| 313 |
+
return reward
|
| 314 |
+
|
| 315 |
+
def _clamp_layout(self) -> None:
|
| 316 |
+
self._layout.button_size = max(
|
| 317 |
+
BUTTON_SIZE_MIN, min(BUTTON_SIZE_MAX, self._layout.button_size)
|
| 318 |
+
)
|
| 319 |
+
self._layout.form_length = max(
|
| 320 |
+
FORM_LENGTH_MIN, min(FORM_LENGTH_MAX, self._layout.form_length)
|
| 321 |
+
)
|
| 322 |
+
self._layout.steps = max(
|
| 323 |
+
STEPS_MIN, min(STEPS_MAX, self._layout.steps)
|
| 324 |
+
)
|
| 325 |
+
|
| 326 |
+
def _get_observation(self) -> Observation:
|
| 327 |
+
return Observation(
|
| 328 |
+
device=self._device,
|
| 329 |
+
layout=self._layout.model_copy(),
|
| 330 |
+
progress=self._progress,
|
| 331 |
+
last_action=self._last_action,
|
| 332 |
+
)
|
| 333 |
+
|
| 334 |
+
def _compute_reward(self) -> float:
|
| 335 |
+
layout = self._layout
|
| 336 |
+
reward = 0.0
|
| 337 |
+
|
| 338 |
+
reward -= 0.1 * layout.steps
|
| 339 |
+
reward -= 0.05 * layout.form_length
|
| 340 |
+
|
| 341 |
+
if BUTTON_SWEET_LOW <= layout.button_size <= BUTTON_SWEET_HIGH:
|
| 342 |
+
reward += 0.2
|
| 343 |
+
|
| 344 |
+
if self._prefers_short_forms and layout.form_length <= 4:
|
| 345 |
+
reward += 0.1
|
| 346 |
+
if self._prefers_large_buttons and layout.button_size > 1.2:
|
| 347 |
+
reward += 0.1
|
| 348 |
+
|
| 349 |
+
return reward
|
| 350 |
+
|
| 351 |
+
if __name__ == "__main__":
|
| 352 |
+
import json
|
| 353 |
+
ALL_ACTION_TYPES = [
|
| 354 |
+
"increase_button", "decrease_form", "increase_steps",
|
| 355 |
+
"decrease_steps", "reorder_sections", "noop",
|
| 356 |
+
]
|
| 357 |
+
rng = random.Random(0)
|
| 358 |
+
env = UIEnv(seed=42, task="hard")
|
| 359 |
+
obs = env.reset()
|
| 360 |
+
done = False
|
| 361 |
+
while not done:
|
| 362 |
+
action_type = rng.choice(ALL_ACTION_TYPES)
|
| 363 |
+
action = Action(type=action_type, value=None)
|
| 364 |
+
obs, reward, done, info = env.step(action)
|
frontend/index.html
ADDED
|
@@ -0,0 +1,227 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
+
<title>UIEnv Interactive Simulator</title>
|
| 7 |
+
<meta name="description" content="Interactive browser-based simulator for the Adaptive UI Layout Optimization Environment">
|
| 8 |
+
<script src="https://cdn.tailwindcss.com"></script>
|
| 9 |
+
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 10 |
+
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
|
| 11 |
+
<link rel="stylesheet" href="/static/styles.css">
|
| 12 |
+
<script>
|
| 13 |
+
tailwind.config = {
|
| 14 |
+
theme: {
|
| 15 |
+
extend: {
|
| 16 |
+
fontFamily: { sans: ['Inter', 'system-ui', 'sans-serif'] },
|
| 17 |
+
colors: {
|
| 18 |
+
dark: { 50: '#f0f0f5', 100: '#e0e1ea', 200: '#c2c3d5', 300: '#9d9fb8', 400: '#73759a', 500: '#515380', 600: '#3d3f68', 700: '#2d2f52', 800: '#1e2040', 900: '#141630', 950: '#0c0e1f' },
|
| 19 |
+
accent: { 400: '#818cf8', 500: '#6366f1', 600: '#4f46e5' },
|
| 20 |
+
success: '#34d399',
|
| 21 |
+
danger: '#f87171',
|
| 22 |
+
warn: '#fbbf24',
|
| 23 |
+
}
|
| 24 |
+
}
|
| 25 |
+
}
|
| 26 |
+
}
|
| 27 |
+
</script>
|
| 28 |
+
</head>
|
| 29 |
+
<body class="bg-dark-950 text-dark-100 font-sans min-h-screen">
|
| 30 |
+
|
| 31 |
+
<!-- Header -->
|
| 32 |
+
<header class="border-b border-dark-800/60 bg-dark-950/80 backdrop-blur-xl sticky top-0 z-50">
|
| 33 |
+
<div class="max-w-[1400px] mx-auto px-6 py-4 flex items-center justify-between">
|
| 34 |
+
<div class="flex items-center gap-3">
|
| 35 |
+
<div class="w-9 h-9 rounded-lg bg-gradient-to-br from-accent-500 to-purple-600 flex items-center justify-center text-white font-bold text-sm">UI</div>
|
| 36 |
+
<div>
|
| 37 |
+
<h1 class="text-lg font-bold text-white tracking-tight">UIEnv Simulator</h1>
|
| 38 |
+
<p class="text-xs text-dark-400">Adaptive UI Layout Optimization</p>
|
| 39 |
+
</div>
|
| 40 |
+
</div>
|
| 41 |
+
<div id="connection-status" class="flex items-center gap-2 text-xs text-dark-400">
|
| 42 |
+
<span class="w-2 h-2 rounded-full bg-dark-600 animate-pulse" id="status-dot"></span>
|
| 43 |
+
<span id="status-text">Connecting...</span>
|
| 44 |
+
</div>
|
| 45 |
+
</div>
|
| 46 |
+
</header>
|
| 47 |
+
|
| 48 |
+
<main class="max-w-[1400px] mx-auto px-6 py-6">
|
| 49 |
+
|
| 50 |
+
<!-- Top Row: Controls -->
|
| 51 |
+
<section class="grid grid-cols-1 md:grid-cols-3 gap-4 mb-6">
|
| 52 |
+
<!-- Agent Selector -->
|
| 53 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4">
|
| 54 |
+
<label class="text-xs font-semibold text-dark-400 uppercase tracking-wider mb-2 block">Agent</label>
|
| 55 |
+
<select id="agent-select" class="w-full bg-dark-800 border border-dark-700 rounded-lg px-3 py-2.5 text-sm text-white focus:ring-2 focus:ring-accent-500 focus:border-transparent outline-none">
|
| 56 |
+
<option value="heuristic">Heuristic Agent</option>
|
| 57 |
+
<option value="random">Random Agent</option>
|
| 58 |
+
</select>
|
| 59 |
+
</div>
|
| 60 |
+
|
| 61 |
+
<!-- Action Buttons -->
|
| 62 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4 flex flex-col gap-2">
|
| 63 |
+
<label class="text-xs font-semibold text-dark-400 uppercase tracking-wider mb-1">Controls</label>
|
| 64 |
+
<div class="flex gap-2">
|
| 65 |
+
<button id="btn-reset" onclick="resetEnv()" class="flex-1 bg-dark-700 hover:bg-dark-600 text-white text-sm font-medium rounded-lg px-3 py-2 transition-all duration-200 active:scale-95">Reset</button>
|
| 66 |
+
<button id="btn-step" onclick="stepAgent()" disabled class="flex-1 bg-accent-600 hover:bg-accent-500 disabled:bg-dark-700 disabled:text-dark-500 text-white text-sm font-medium rounded-lg px-3 py-2 transition-all duration-200 active:scale-95">Step</button>
|
| 67 |
+
<button id="btn-run" onclick="runEpisode()" class="flex-1 bg-gradient-to-r from-accent-500 to-purple-600 hover:from-accent-400 hover:to-purple-500 text-white text-sm font-medium rounded-lg px-3 py-2 transition-all duration-200 active:scale-95 shadow-lg shadow-accent-500/20">Run Episode</button>
|
| 68 |
+
</div>
|
| 69 |
+
</div>
|
| 70 |
+
|
| 71 |
+
<!-- Episode Status -->
|
| 72 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4">
|
| 73 |
+
<label class="text-xs font-semibold text-dark-400 uppercase tracking-wider mb-2 block">Episode Status</label>
|
| 74 |
+
<div class="flex items-center gap-3">
|
| 75 |
+
<span id="episode-badge" class="px-3 py-1 rounded-full text-xs font-semibold bg-dark-700 text-dark-400">IDLE</span>
|
| 76 |
+
<span id="episode-outcome" class="text-sm text-dark-400">--</span>
|
| 77 |
+
</div>
|
| 78 |
+
</div>
|
| 79 |
+
</section>
|
| 80 |
+
|
| 81 |
+
<!-- Main Grid: Visualization + Metrics -->
|
| 82 |
+
<section class="grid grid-cols-1 lg:grid-cols-3 gap-6 mb-6">
|
| 83 |
+
|
| 84 |
+
<!-- LEFT: Layout Visualization (2 cols) -->
|
| 85 |
+
<div class="lg:col-span-2 bg-dark-900/50 border border-dark-800/40 rounded-xl p-6">
|
| 86 |
+
<div class="flex items-center justify-between mb-5">
|
| 87 |
+
<h2 class="text-sm font-bold text-white uppercase tracking-wider">Layout Preview</h2>
|
| 88 |
+
<span id="device-badge" class="px-2.5 py-1 rounded-md text-xs font-medium bg-dark-800 text-dark-300">Desktop</span>
|
| 89 |
+
</div>
|
| 90 |
+
|
| 91 |
+
<!-- Simulated UI -->
|
| 92 |
+
<div id="layout-preview" class="bg-dark-950 border border-dark-800 rounded-xl p-6 min-h-[320px] flex flex-col gap-5 transition-all duration-500">
|
| 93 |
+
|
| 94 |
+
<!-- Steps Indicator -->
|
| 95 |
+
<div>
|
| 96 |
+
<p class="text-xs text-dark-500 mb-2 font-medium">CHECKOUT STEPS</p>
|
| 97 |
+
<div id="steps-container" class="flex gap-2 items-center">
|
| 98 |
+
<!-- Rendered by JS -->
|
| 99 |
+
</div>
|
| 100 |
+
</div>
|
| 101 |
+
|
| 102 |
+
<!-- Form Fields -->
|
| 103 |
+
<div>
|
| 104 |
+
<p class="text-xs text-dark-500 mb-2 font-medium">FORM FIELDS</p>
|
| 105 |
+
<div id="form-container" class="grid grid-cols-2 gap-2">
|
| 106 |
+
<!-- Rendered by JS -->
|
| 107 |
+
</div>
|
| 108 |
+
</div>
|
| 109 |
+
|
| 110 |
+
<!-- CTA Button -->
|
| 111 |
+
<div class="mt-auto">
|
| 112 |
+
<p class="text-xs text-dark-500 mb-2 font-medium">CTA BUTTON</p>
|
| 113 |
+
<button id="cta-button" class="bg-gradient-to-r from-accent-500 to-purple-600 text-white font-semibold rounded-lg transition-all duration-500 shadow-lg shadow-accent-500/25">
|
| 114 |
+
Submit
|
| 115 |
+
</button>
|
| 116 |
+
</div>
|
| 117 |
+
</div>
|
| 118 |
+
</div>
|
| 119 |
+
|
| 120 |
+
<!-- RIGHT: Live Metrics -->
|
| 121 |
+
<div class="space-y-4">
|
| 122 |
+
<!-- Progress -->
|
| 123 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4">
|
| 124 |
+
<label class="text-xs font-semibold text-dark-400 uppercase tracking-wider mb-3 block">Progress</label>
|
| 125 |
+
<div class="relative h-3 bg-dark-800 rounded-full overflow-hidden mb-2">
|
| 126 |
+
<div id="progress-bar" class="absolute left-0 top-0 h-full bg-gradient-to-r from-accent-500 to-success rounded-full transition-all duration-700 ease-out" style="width: 0%"></div>
|
| 127 |
+
</div>
|
| 128 |
+
<p class="text-right text-sm font-mono text-dark-300"><span id="progress-value">0.0</span>%</p>
|
| 129 |
+
</div>
|
| 130 |
+
|
| 131 |
+
<!-- Metrics Grid -->
|
| 132 |
+
<div class="grid grid-cols-2 gap-3">
|
| 133 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4 text-center">
|
| 134 |
+
<p class="text-xs text-dark-500 mb-1">Reward</p>
|
| 135 |
+
<p id="metric-reward" class="text-xl font-bold font-mono text-white">--</p>
|
| 136 |
+
</div>
|
| 137 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4 text-center">
|
| 138 |
+
<p class="text-xs text-dark-500 mb-1">Step</p>
|
| 139 |
+
<p id="metric-step" class="text-xl font-bold font-mono text-white">0</p>
|
| 140 |
+
</div>
|
| 141 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4 text-center">
|
| 142 |
+
<p class="text-xs text-dark-500 mb-1">Total Reward</p>
|
| 143 |
+
<p id="metric-total-reward" class="text-xl font-bold font-mono text-accent-400">0.00</p>
|
| 144 |
+
</div>
|
| 145 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4 text-center">
|
| 146 |
+
<p class="text-xs text-dark-500 mb-1">Outcome</p>
|
| 147 |
+
<p id="metric-outcome" class="text-lg font-bold text-dark-400">--</p>
|
| 148 |
+
</div>
|
| 149 |
+
</div>
|
| 150 |
+
|
| 151 |
+
<!-- Layout Values -->
|
| 152 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-4">
|
| 153 |
+
<label class="text-xs font-semibold text-dark-400 uppercase tracking-wider mb-3 block">Layout State</label>
|
| 154 |
+
<div class="space-y-2">
|
| 155 |
+
<div class="flex justify-between text-sm">
|
| 156 |
+
<span class="text-dark-500">Button Size</span>
|
| 157 |
+
<span id="val-button" class="font-mono text-white">1.0</span>
|
| 158 |
+
</div>
|
| 159 |
+
<div class="flex justify-between text-sm">
|
| 160 |
+
<span class="text-dark-500">Form Length</span>
|
| 161 |
+
<span id="val-form" class="font-mono text-white">5</span>
|
| 162 |
+
</div>
|
| 163 |
+
<div class="flex justify-between text-sm">
|
| 164 |
+
<span class="text-dark-500">Steps</span>
|
| 165 |
+
<span id="val-steps" class="font-mono text-white">3</span>
|
| 166 |
+
</div>
|
| 167 |
+
<div class="flex justify-between text-sm">
|
| 168 |
+
<span class="text-dark-500">Last Action</span>
|
| 169 |
+
<span id="val-action" class="font-mono text-accent-400 text-xs">--</span>
|
| 170 |
+
</div>
|
| 171 |
+
</div>
|
| 172 |
+
</div>
|
| 173 |
+
</div>
|
| 174 |
+
</section>
|
| 175 |
+
|
| 176 |
+
<!-- Action Log + Leaderboard -->
|
| 177 |
+
<section class="grid grid-cols-1 lg:grid-cols-2 gap-6">
|
| 178 |
+
|
| 179 |
+
<!-- Action Log -->
|
| 180 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-5">
|
| 181 |
+
<div class="flex items-center justify-between mb-4">
|
| 182 |
+
<h2 class="text-sm font-bold text-white uppercase tracking-wider">Action Log</h2>
|
| 183 |
+
<button onclick="clearLog()" class="text-xs text-dark-500 hover:text-dark-300 transition-colors">Clear</button>
|
| 184 |
+
</div>
|
| 185 |
+
<div id="action-log" class="h-[250px] overflow-y-auto space-y-1 font-mono text-xs scroll-smooth">
|
| 186 |
+
<p class="text-dark-600 italic">No actions yet. Press Reset to start.</p>
|
| 187 |
+
</div>
|
| 188 |
+
</div>
|
| 189 |
+
|
| 190 |
+
<!-- Leaderboard -->
|
| 191 |
+
<div class="bg-dark-900/50 border border-dark-800/40 rounded-xl p-5">
|
| 192 |
+
<div class="flex items-center justify-between mb-4">
|
| 193 |
+
<h2 class="text-sm font-bold text-white uppercase tracking-wider">Leaderboard</h2>
|
| 194 |
+
<button id="btn-leaderboard" onclick="fetchLeaderboard()" class="text-xs bg-dark-700 hover:bg-dark-600 text-dark-300 px-3 py-1.5 rounded-lg transition-colors">
|
| 195 |
+
Run Benchmark
|
| 196 |
+
</button>
|
| 197 |
+
</div>
|
| 198 |
+
<div id="leaderboard-container">
|
| 199 |
+
<table class="w-full text-sm">
|
| 200 |
+
<thead>
|
| 201 |
+
<tr class="text-dark-500 text-xs uppercase">
|
| 202 |
+
<th class="text-left py-2 pr-2">#</th>
|
| 203 |
+
<th class="text-left py-2">Agent</th>
|
| 204 |
+
<th class="text-right py-2">Score</th>
|
| 205 |
+
<th class="text-right py-2">Comp %</th>
|
| 206 |
+
<th class="text-right py-2">Drop %</th>
|
| 207 |
+
<th class="text-right py-2">Avg Rwd</th>
|
| 208 |
+
</tr>
|
| 209 |
+
</thead>
|
| 210 |
+
<tbody id="leaderboard-body">
|
| 211 |
+
<tr><td colspan="6" class="py-8 text-center text-dark-600 italic">Click "Run Benchmark" to evaluate agents</td></tr>
|
| 212 |
+
</tbody>
|
| 213 |
+
</table>
|
| 214 |
+
</div>
|
| 215 |
+
</div>
|
| 216 |
+
</section>
|
| 217 |
+
|
| 218 |
+
</main>
|
| 219 |
+
|
| 220 |
+
<!-- Footer -->
|
| 221 |
+
<footer class="border-t border-dark-800/40 mt-12 py-4">
|
| 222 |
+
<p class="text-center text-xs text-dark-600">UIEnv Adaptive Layout Optimization -- Interactive Simulator v1.0</p>
|
| 223 |
+
</footer>
|
| 224 |
+
|
| 225 |
+
<script src="/static/script.js"></script>
|
| 226 |
+
</body>
|
| 227 |
+
</html>
|
frontend/script.js
ADDED
|
@@ -0,0 +1,454 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
/**
|
| 2 |
+
* script.js
|
| 3 |
+
* ---------
|
| 4 |
+
* Frontend logic for the UIEnv Interactive Simulator.
|
| 5 |
+
*
|
| 6 |
+
* Handles:
|
| 7 |
+
* - API calls (reset, step, run_episode, leaderboard)
|
| 8 |
+
* - Layout visualization updates
|
| 9 |
+
* - Live metric rendering
|
| 10 |
+
* - Action log
|
| 11 |
+
* - Animated episode playback
|
| 12 |
+
*/
|
| 13 |
+
|
| 14 |
+
const API_BASE = ""; // Same origin
|
| 15 |
+
|
| 16 |
+
// ======================================================================
|
| 17 |
+
// State
|
| 18 |
+
// ======================================================================
|
| 19 |
+
|
| 20 |
+
let state = {
|
| 21 |
+
observation: null,
|
| 22 |
+
done: true,
|
| 23 |
+
totalReward: 0,
|
| 24 |
+
stepCount: 0,
|
| 25 |
+
isRunning: false,
|
| 26 |
+
};
|
| 27 |
+
|
| 28 |
+
// ======================================================================
|
| 29 |
+
// DOM Elements
|
| 30 |
+
// ======================================================================
|
| 31 |
+
|
| 32 |
+
const $ = (id) => document.getElementById(id);
|
| 33 |
+
|
| 34 |
+
const dom = {
|
| 35 |
+
agentSelect: $("agent-select"),
|
| 36 |
+
btnReset: $("btn-reset"),
|
| 37 |
+
btnStep: $("btn-step"),
|
| 38 |
+
btnRun: $("btn-run"),
|
| 39 |
+
episodeBadge: $("episode-badge"),
|
| 40 |
+
episodeOutcome: $("episode-outcome"),
|
| 41 |
+
deviceBadge: $("device-badge"),
|
| 42 |
+
stepsContainer: $("steps-container"),
|
| 43 |
+
formContainer: $("form-container"),
|
| 44 |
+
ctaButton: $("cta-button"),
|
| 45 |
+
progressBar: $("progress-bar"),
|
| 46 |
+
progressValue: $("progress-value"),
|
| 47 |
+
metricReward: $("metric-reward"),
|
| 48 |
+
metricStep: $("metric-step"),
|
| 49 |
+
metricTotal: $("metric-total-reward"),
|
| 50 |
+
metricOutcome: $("metric-outcome"),
|
| 51 |
+
valButton: $("val-button"),
|
| 52 |
+
valForm: $("val-form"),
|
| 53 |
+
valSteps: $("val-steps"),
|
| 54 |
+
valAction: $("val-action"),
|
| 55 |
+
actionLog: $("action-log"),
|
| 56 |
+
leaderboardBody:$("leaderboard-body"),
|
| 57 |
+
statusDot: $("status-dot"),
|
| 58 |
+
statusText: $("status-text"),
|
| 59 |
+
};
|
| 60 |
+
|
| 61 |
+
// ======================================================================
|
| 62 |
+
// API Helpers
|
| 63 |
+
// ======================================================================
|
| 64 |
+
|
| 65 |
+
async function api(endpoint, method = "GET", body = null) {
|
| 66 |
+
const opts = {
|
| 67 |
+
method,
|
| 68 |
+
headers: { "Content-Type": "application/json" },
|
| 69 |
+
};
|
| 70 |
+
if (body) opts.body = JSON.stringify(body);
|
| 71 |
+
|
| 72 |
+
const res = await fetch(API_BASE + endpoint, opts);
|
| 73 |
+
if (!res.ok) {
|
| 74 |
+
const err = await res.json().catch(() => ({ detail: res.statusText }));
|
| 75 |
+
throw new Error(err.detail || "API error");
|
| 76 |
+
}
|
| 77 |
+
return res.json();
|
| 78 |
+
}
|
| 79 |
+
|
| 80 |
+
// ======================================================================
|
| 81 |
+
// Layout Visualization
|
| 82 |
+
// ======================================================================
|
| 83 |
+
|
| 84 |
+
function renderSteps(count, progress) {
|
| 85 |
+
const container = dom.stepsContainer;
|
| 86 |
+
container.innerHTML = "";
|
| 87 |
+
|
| 88 |
+
for (let i = 1; i <= count; i++) {
|
| 89 |
+
// Step circle
|
| 90 |
+
const circle = document.createElement("div");
|
| 91 |
+
circle.className = "step-circle" + (i === 1 ? " active" : "");
|
| 92 |
+
circle.textContent = i;
|
| 93 |
+
|
| 94 |
+
// Activate based on progress
|
| 95 |
+
if (progress > 0 && i <= Math.ceil(progress * count)) {
|
| 96 |
+
circle.classList.add("active");
|
| 97 |
+
}
|
| 98 |
+
|
| 99 |
+
container.appendChild(circle);
|
| 100 |
+
|
| 101 |
+
// Connector (except after last)
|
| 102 |
+
if (i < count) {
|
| 103 |
+
const conn = document.createElement("div");
|
| 104 |
+
conn.className = "step-connector";
|
| 105 |
+
if (progress > 0 && i < Math.ceil(progress * count)) {
|
| 106 |
+
conn.classList.add("active");
|
| 107 |
+
}
|
| 108 |
+
container.appendChild(conn);
|
| 109 |
+
}
|
| 110 |
+
}
|
| 111 |
+
}
|
| 112 |
+
|
| 113 |
+
function renderFormFields(count) {
|
| 114 |
+
const container = dom.formContainer;
|
| 115 |
+
container.innerHTML = "";
|
| 116 |
+
|
| 117 |
+
const labels = [
|
| 118 |
+
"Full Name", "Email", "Phone", "Address", "City",
|
| 119 |
+
"Country", "Zip Code", "Company", "Card Number", "CVV",
|
| 120 |
+
];
|
| 121 |
+
|
| 122 |
+
for (let i = 0; i < count; i++) {
|
| 123 |
+
const el = document.createElement("div");
|
| 124 |
+
el.className = "sim-input log-entry-new";
|
| 125 |
+
el.textContent = labels[i] || `Field ${i + 1}`;
|
| 126 |
+
container.appendChild(el);
|
| 127 |
+
}
|
| 128 |
+
}
|
| 129 |
+
|
| 130 |
+
function renderButton(size) {
|
| 131 |
+
const btn = dom.ctaButton;
|
| 132 |
+
// Scale: size 1.0 = 100%, mapped proportionally
|
| 133 |
+
const pxWidth = Math.round(120 + (size - 0.5) * 80);
|
| 134 |
+
const pxHeight = Math.round(32 + (size - 0.5) * 16);
|
| 135 |
+
const fontSize = Math.round(12 + (size - 0.5) * 4);
|
| 136 |
+
|
| 137 |
+
btn.style.width = pxWidth + "px";
|
| 138 |
+
btn.style.height = pxHeight + "px";
|
| 139 |
+
btn.style.fontSize = fontSize + "px";
|
| 140 |
+
|
| 141 |
+
// Pulse animation
|
| 142 |
+
btn.classList.remove("cta-pulse");
|
| 143 |
+
void btn.offsetWidth; // force reflow
|
| 144 |
+
btn.classList.add("cta-pulse");
|
| 145 |
+
|
| 146 |
+
// Color hint: green if in sweet spot, orange if not
|
| 147 |
+
if (size >= 0.9 && size <= 1.3) {
|
| 148 |
+
btn.classList.remove("from-orange-500", "to-red-500");
|
| 149 |
+
btn.classList.add("from-accent-500", "to-purple-600");
|
| 150 |
+
} else {
|
| 151 |
+
btn.classList.remove("from-accent-500", "to-purple-600");
|
| 152 |
+
btn.classList.add("from-orange-500", "to-red-500");
|
| 153 |
+
}
|
| 154 |
+
}
|
| 155 |
+
|
| 156 |
+
// ======================================================================
|
| 157 |
+
// UI Update
|
| 158 |
+
// ======================================================================
|
| 159 |
+
|
| 160 |
+
function updateUI(obs, reward = null, info = null) {
|
| 161 |
+
if (!obs) return;
|
| 162 |
+
|
| 163 |
+
state.observation = obs;
|
| 164 |
+
|
| 165 |
+
// Device badge
|
| 166 |
+
dom.deviceBadge.textContent = obs.device === "mobile" ? "Mobile" : "Desktop";
|
| 167 |
+
|
| 168 |
+
// Layout values
|
| 169 |
+
dom.valButton.textContent = obs.button_size.toFixed(1);
|
| 170 |
+
dom.valForm.textContent = obs.form_length;
|
| 171 |
+
dom.valSteps.textContent = obs.steps;
|
| 172 |
+
dom.valAction.textContent = obs.last_action || "--";
|
| 173 |
+
|
| 174 |
+
// Progress
|
| 175 |
+
const pct = (obs.progress * 100).toFixed(1);
|
| 176 |
+
dom.progressBar.style.width = pct + "%";
|
| 177 |
+
dom.progressValue.textContent = pct;
|
| 178 |
+
|
| 179 |
+
// Render layout
|
| 180 |
+
renderSteps(obs.steps, obs.progress);
|
| 181 |
+
renderFormFields(obs.form_length);
|
| 182 |
+
renderButton(obs.button_size);
|
| 183 |
+
|
| 184 |
+
// Reward
|
| 185 |
+
if (reward !== null) {
|
| 186 |
+
dom.metricReward.textContent = (reward >= 0 ? "+" : "") + reward.toFixed(4);
|
| 187 |
+
dom.metricReward.className = "text-xl font-bold font-mono " +
|
| 188 |
+
(reward >= 0 ? "text-success" : "text-danger");
|
| 189 |
+
|
| 190 |
+
// Flash
|
| 191 |
+
dom.metricReward.parentElement.classList.remove("flash-green", "flash-red");
|
| 192 |
+
void dom.metricReward.parentElement.offsetWidth;
|
| 193 |
+
dom.metricReward.parentElement.classList.add(reward >= 0 ? "flash-green" : "flash-red");
|
| 194 |
+
}
|
| 195 |
+
|
| 196 |
+
// Step count
|
| 197 |
+
if (info) {
|
| 198 |
+
dom.metricStep.textContent = info.step_count || state.stepCount;
|
| 199 |
+
}
|
| 200 |
+
|
| 201 |
+
// Total reward
|
| 202 |
+
dom.metricTotal.textContent = state.totalReward.toFixed(2);
|
| 203 |
+
|
| 204 |
+
// Outcome
|
| 205 |
+
if (info && info.outcome) {
|
| 206 |
+
const oc = info.outcome;
|
| 207 |
+
dom.metricOutcome.textContent = oc.charAt(0).toUpperCase() + oc.slice(1);
|
| 208 |
+
dom.metricOutcome.className = "text-lg font-bold outcome-" + oc;
|
| 209 |
+
}
|
| 210 |
+
}
|
| 211 |
+
|
| 212 |
+
function setEpisodeStatus(status, outcome = "") {
|
| 213 |
+
const badge = dom.episodeBadge;
|
| 214 |
+
badge.textContent = status;
|
| 215 |
+
|
| 216 |
+
const colors = {
|
| 217 |
+
"IDLE": "bg-dark-700 text-dark-400",
|
| 218 |
+
"RUNNING": "bg-accent-600/20 text-accent-400",
|
| 219 |
+
"DONE": "bg-success/20 text-success",
|
| 220 |
+
"DROPPED": "bg-danger/20 text-danger",
|
| 221 |
+
};
|
| 222 |
+
badge.className = "px-3 py-1 rounded-full text-xs font-semibold " + (colors[status] || colors["IDLE"]);
|
| 223 |
+
dom.episodeOutcome.textContent = outcome;
|
| 224 |
+
}
|
| 225 |
+
|
| 226 |
+
function setControlsEnabled(enabled) {
|
| 227 |
+
dom.btnStep.disabled = !enabled;
|
| 228 |
+
}
|
| 229 |
+
|
| 230 |
+
// ======================================================================
|
| 231 |
+
// Action Log
|
| 232 |
+
// ======================================================================
|
| 233 |
+
|
| 234 |
+
let logInitialized = false;
|
| 235 |
+
|
| 236 |
+
function addLog(message, type = "system") {
|
| 237 |
+
if (!logInitialized) {
|
| 238 |
+
dom.actionLog.innerHTML = "";
|
| 239 |
+
logInitialized = true;
|
| 240 |
+
}
|
| 241 |
+
|
| 242 |
+
const entry = document.createElement("div");
|
| 243 |
+
entry.className = `log-entry log-entry-new log-${type}`;
|
| 244 |
+
entry.textContent = message;
|
| 245 |
+
dom.actionLog.appendChild(entry);
|
| 246 |
+
dom.actionLog.scrollTop = dom.actionLog.scrollHeight;
|
| 247 |
+
}
|
| 248 |
+
|
| 249 |
+
function clearLog() {
|
| 250 |
+
dom.actionLog.innerHTML = '<p class="text-dark-600 italic">Log cleared.</p>';
|
| 251 |
+
logInitialized = false;
|
| 252 |
+
}
|
| 253 |
+
|
| 254 |
+
// ======================================================================
|
| 255 |
+
// API Actions
|
| 256 |
+
// ======================================================================
|
| 257 |
+
|
| 258 |
+
async function resetEnv() {
|
| 259 |
+
try {
|
| 260 |
+
const data = await api("/reset", "POST");
|
| 261 |
+
state.done = false;
|
| 262 |
+
state.totalReward = 0;
|
| 263 |
+
state.stepCount = 0;
|
| 264 |
+
|
| 265 |
+
updateUI(data.observation);
|
| 266 |
+
setEpisodeStatus("RUNNING", "Episode started");
|
| 267 |
+
setControlsEnabled(true);
|
| 268 |
+
|
| 269 |
+
dom.metricReward.textContent = "--";
|
| 270 |
+
dom.metricReward.className = "text-xl font-bold font-mono text-white";
|
| 271 |
+
dom.metricTotal.textContent = "0.00";
|
| 272 |
+
dom.metricStep.textContent = "0";
|
| 273 |
+
dom.metricOutcome.textContent = "--";
|
| 274 |
+
dom.metricOutcome.className = "text-lg font-bold text-dark-400";
|
| 275 |
+
|
| 276 |
+
addLog("Environment reset. Episode started.", "system");
|
| 277 |
+
} catch (err) {
|
| 278 |
+
addLog("Error: " + err.message, "negative");
|
| 279 |
+
}
|
| 280 |
+
}
|
| 281 |
+
|
| 282 |
+
async function stepAgent() {
|
| 283 |
+
if (state.done || state.isRunning) return;
|
| 284 |
+
|
| 285 |
+
const agent = dom.agentSelect.value;
|
| 286 |
+
|
| 287 |
+
try {
|
| 288 |
+
// Run one step on the server via run_episode is not ideal for single steps.
|
| 289 |
+
// Instead, we use a dedicated approach: call run_episode and take one step.
|
| 290 |
+
// But we actually have the /step endpoint for manual actions.
|
| 291 |
+
// For agent-driven steps, we'll call /run_episode and animate.
|
| 292 |
+
// Actually, for a single step with the agent, let's run a mini approach:
|
| 293 |
+
// We'll call /step with the action chosen by the UI. But we want the agent to choose.
|
| 294 |
+
// The simplest: run_episode returns all steps, and we can animate one at a time.
|
| 295 |
+
// Let's do a single-step run via run_episode with a post-hoc approach.
|
| 296 |
+
|
| 297 |
+
// For now, use a simple heuristic: run the full episode and take the next step.
|
| 298 |
+
// Better: let's just re-run and animate step by step. We'll fake it.
|
| 299 |
+
|
| 300 |
+
// Actually, the cleanest is: we run the entire episode, cache it, and step through it.
|
| 301 |
+
if (!state._cachedSteps || state._cacheAgent !== agent) {
|
| 302 |
+
const data = await api("/run_episode", "POST", { agent });
|
| 303 |
+
state._cachedSteps = data.steps;
|
| 304 |
+
state._cacheAgent = agent;
|
| 305 |
+
state._cacheIdx = 0;
|
| 306 |
+
}
|
| 307 |
+
|
| 308 |
+
if (state._cacheIdx < state._cachedSteps.length) {
|
| 309 |
+
const s = state._cachedSteps[state._cacheIdx];
|
| 310 |
+
state.stepCount = s.info.step_count;
|
| 311 |
+
state.totalReward += s.reward;
|
| 312 |
+
state.done = s.done;
|
| 313 |
+
|
| 314 |
+
updateUI(s.observation, s.reward, s.info);
|
| 315 |
+
addLog(
|
| 316 |
+
`Step ${s.info.step_count}: ${s.action} -> reward=${s.reward >= 0 ? "+" : ""}${s.reward.toFixed(3)} outcome=${s.info.outcome}`,
|
| 317 |
+
s.reward >= 0 ? "reward" : "negative"
|
| 318 |
+
);
|
| 319 |
+
|
| 320 |
+
state._cacheIdx++;
|
| 321 |
+
|
| 322 |
+
if (s.done) {
|
| 323 |
+
const outcome = s.info.outcome;
|
| 324 |
+
setEpisodeStatus(outcome === "complete" ? "DONE" : "DROPPED", outcome);
|
| 325 |
+
setControlsEnabled(false);
|
| 326 |
+
addLog(`Episode ended: ${outcome}. Total reward: ${state.totalReward.toFixed(3)}`, "outcome");
|
| 327 |
+
state._cachedSteps = null;
|
| 328 |
+
}
|
| 329 |
+
}
|
| 330 |
+
} catch (err) {
|
| 331 |
+
addLog("Error: " + err.message, "negative");
|
| 332 |
+
}
|
| 333 |
+
}
|
| 334 |
+
|
| 335 |
+
async function runEpisode() {
|
| 336 |
+
if (state.isRunning) return;
|
| 337 |
+
|
| 338 |
+
const agent = dom.agentSelect.value;
|
| 339 |
+
state.isRunning = true;
|
| 340 |
+
state.totalReward = 0;
|
| 341 |
+
state.stepCount = 0;
|
| 342 |
+
state._cachedSteps = null;
|
| 343 |
+
|
| 344 |
+
dom.btnRun.classList.add("btn-running");
|
| 345 |
+
dom.btnRun.textContent = "Running...";
|
| 346 |
+
setControlsEnabled(false);
|
| 347 |
+
|
| 348 |
+
addLog(`--- Running full episode with ${agent} agent ---`, "system");
|
| 349 |
+
|
| 350 |
+
try {
|
| 351 |
+
const data = await api("/run_episode", "POST", { agent });
|
| 352 |
+
setEpisodeStatus("RUNNING", `${agent} agent`);
|
| 353 |
+
|
| 354 |
+
// Animate step by step
|
| 355 |
+
for (let i = 0; i < data.steps.length; i++) {
|
| 356 |
+
const s = data.steps[i];
|
| 357 |
+
state.stepCount = s.info.step_count;
|
| 358 |
+
state.totalReward += s.reward;
|
| 359 |
+
state.done = s.done;
|
| 360 |
+
|
| 361 |
+
updateUI(s.observation, s.reward, s.info);
|
| 362 |
+
|
| 363 |
+
const actionLabel = s.action + (s.action_value !== null ? `(${s.action_value})` : "");
|
| 364 |
+
addLog(
|
| 365 |
+
`Step ${s.info.step_count}: ${actionLabel} -> R=${s.reward >= 0 ? "+" : ""}${s.reward.toFixed(3)} [${s.info.outcome}]`,
|
| 366 |
+
s.reward >= 0 ? "reward" : "negative"
|
| 367 |
+
);
|
| 368 |
+
|
| 369 |
+
// Delay for animation
|
| 370 |
+
await sleep(350);
|
| 371 |
+
}
|
| 372 |
+
|
| 373 |
+
const outcome = data.final_outcome;
|
| 374 |
+
setEpisodeStatus(outcome === "complete" ? "DONE" : "DROPPED", outcome);
|
| 375 |
+
addLog(
|
| 376 |
+
`Episode complete: ${outcome} | Total reward: ${state.totalReward.toFixed(3)} | Steps: ${data.total_steps}`,
|
| 377 |
+
"outcome"
|
| 378 |
+
);
|
| 379 |
+
|
| 380 |
+
} catch (err) {
|
| 381 |
+
addLog("Error: " + err.message, "negative");
|
| 382 |
+
} finally {
|
| 383 |
+
state.isRunning = false;
|
| 384 |
+
dom.btnRun.classList.remove("btn-running");
|
| 385 |
+
dom.btnRun.textContent = "Run Episode";
|
| 386 |
+
setControlsEnabled(false);
|
| 387 |
+
}
|
| 388 |
+
}
|
| 389 |
+
|
| 390 |
+
async function fetchLeaderboard() {
|
| 391 |
+
const btn = $("btn-leaderboard");
|
| 392 |
+
btn.textContent = "Running...";
|
| 393 |
+
btn.classList.add("btn-running");
|
| 394 |
+
|
| 395 |
+
try {
|
| 396 |
+
const data = await api("/leaderboard");
|
| 397 |
+
const tbody = dom.leaderboardBody;
|
| 398 |
+
tbody.innerHTML = "";
|
| 399 |
+
|
| 400 |
+
for (const entry of data.leaderboard) {
|
| 401 |
+
const tr = document.createElement("tr");
|
| 402 |
+
tr.className = entry.rank === 1 ? "lb-row-1" : "";
|
| 403 |
+
tr.innerHTML = `
|
| 404 |
+
<td class="py-2 pr-2 font-mono text-dark-400">#${entry.rank}</td>
|
| 405 |
+
<td class="py-2 font-medium text-white">${entry.agent}</td>
|
| 406 |
+
<td class="py-2 text-right font-mono ${entry.rank === 1 ? 'text-accent-400' : 'text-dark-300'}">${entry.score.toFixed(4)}</td>
|
| 407 |
+
<td class="py-2 text-right font-mono text-success">${(entry.completion_rate * 100).toFixed(1)}%</td>
|
| 408 |
+
<td class="py-2 text-right font-mono text-danger">${(entry.drop_rate * 100).toFixed(1)}%</td>
|
| 409 |
+
<td class="py-2 text-right font-mono text-dark-300">${entry.avg_reward.toFixed(3)}</td>
|
| 410 |
+
`;
|
| 411 |
+
tbody.appendChild(tr);
|
| 412 |
+
}
|
| 413 |
+
|
| 414 |
+
addLog("Leaderboard updated (50 episodes/agent).", "system");
|
| 415 |
+
} catch (err) {
|
| 416 |
+
addLog("Leaderboard error: " + err.message, "negative");
|
| 417 |
+
} finally {
|
| 418 |
+
btn.textContent = "Run Benchmark";
|
| 419 |
+
btn.classList.remove("btn-running");
|
| 420 |
+
}
|
| 421 |
+
}
|
| 422 |
+
|
| 423 |
+
// ======================================================================
|
| 424 |
+
// Utilities
|
| 425 |
+
// ======================================================================
|
| 426 |
+
|
| 427 |
+
function sleep(ms) {
|
| 428 |
+
return new Promise((resolve) => setTimeout(resolve, ms));
|
| 429 |
+
}
|
| 430 |
+
|
| 431 |
+
// ======================================================================
|
| 432 |
+
// Initialization
|
| 433 |
+
// ======================================================================
|
| 434 |
+
|
| 435 |
+
async function init() {
|
| 436 |
+
try {
|
| 437 |
+
// Quick health check
|
| 438 |
+
await api("/agents");
|
| 439 |
+
dom.statusDot.className = "w-2 h-2 rounded-full bg-success";
|
| 440 |
+
dom.statusText.textContent = "Connected";
|
| 441 |
+
dom.statusDot.classList.remove("animate-pulse");
|
| 442 |
+
} catch {
|
| 443 |
+
dom.statusDot.className = "w-2 h-2 rounded-full bg-danger";
|
| 444 |
+
dom.statusText.textContent = "Disconnected";
|
| 445 |
+
}
|
| 446 |
+
|
| 447 |
+
// Set initial layout preview to defaults
|
| 448 |
+
renderSteps(3, 0);
|
| 449 |
+
renderFormFields(5);
|
| 450 |
+
renderButton(1.0);
|
| 451 |
+
}
|
| 452 |
+
|
| 453 |
+
// Run on load
|
| 454 |
+
document.addEventListener("DOMContentLoaded", init);
|
frontend/styles.css
ADDED
|
@@ -0,0 +1,128 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
/* styles.css -- Custom styles for UIEnv Simulator */
|
| 2 |
+
|
| 3 |
+
/* Scrollbar styling */
|
| 4 |
+
::-webkit-scrollbar {
|
| 5 |
+
width: 6px;
|
| 6 |
+
}
|
| 7 |
+
::-webkit-scrollbar-track {
|
| 8 |
+
background: transparent;
|
| 9 |
+
}
|
| 10 |
+
::-webkit-scrollbar-thumb {
|
| 11 |
+
background: #2d2f52;
|
| 12 |
+
border-radius: 3px;
|
| 13 |
+
}
|
| 14 |
+
::-webkit-scrollbar-thumb:hover {
|
| 15 |
+
background: #3d3f68;
|
| 16 |
+
}
|
| 17 |
+
|
| 18 |
+
/* Action log entries */
|
| 19 |
+
.log-entry {
|
| 20 |
+
padding: 4px 8px;
|
| 21 |
+
border-radius: 6px;
|
| 22 |
+
transition: background-color 0.2s;
|
| 23 |
+
}
|
| 24 |
+
.log-entry:hover {
|
| 25 |
+
background-color: rgba(99, 102, 241, 0.05);
|
| 26 |
+
}
|
| 27 |
+
.log-entry.log-action { color: #818cf8; }
|
| 28 |
+
.log-entry.log-reward { color: #34d399; }
|
| 29 |
+
.log-entry.log-negative { color: #f87171; }
|
| 30 |
+
.log-entry.log-system { color: #9d9fb8; }
|
| 31 |
+
.log-entry.log-outcome { color: #fbbf24; }
|
| 32 |
+
|
| 33 |
+
/* Fade-in animation for new log entries */
|
| 34 |
+
@keyframes fadeSlideIn {
|
| 35 |
+
from { opacity: 0; transform: translateY(-4px); }
|
| 36 |
+
to { opacity: 1; transform: translateY(0); }
|
| 37 |
+
}
|
| 38 |
+
.log-entry-new {
|
| 39 |
+
animation: fadeSlideIn 0.25s ease-out;
|
| 40 |
+
}
|
| 41 |
+
|
| 42 |
+
/* Step circles */
|
| 43 |
+
.step-circle {
|
| 44 |
+
width: 36px;
|
| 45 |
+
height: 36px;
|
| 46 |
+
border-radius: 50%;
|
| 47 |
+
display: flex;
|
| 48 |
+
align-items: center;
|
| 49 |
+
justify-content: center;
|
| 50 |
+
font-size: 12px;
|
| 51 |
+
font-weight: 600;
|
| 52 |
+
transition: all 0.4s ease;
|
| 53 |
+
border: 2px solid #2d2f52;
|
| 54 |
+
color: #73759a;
|
| 55 |
+
background: #1e2040;
|
| 56 |
+
}
|
| 57 |
+
.step-circle.active {
|
| 58 |
+
border-color: #6366f1;
|
| 59 |
+
color: #ffffff;
|
| 60 |
+
background: linear-gradient(135deg, #6366f1, #7c3aed);
|
| 61 |
+
box-shadow: 0 0 12px rgba(99, 102, 241, 0.4);
|
| 62 |
+
}
|
| 63 |
+
.step-connector {
|
| 64 |
+
flex: 1;
|
| 65 |
+
height: 2px;
|
| 66 |
+
background: #2d2f52;
|
| 67 |
+
max-width: 40px;
|
| 68 |
+
transition: background 0.4s;
|
| 69 |
+
}
|
| 70 |
+
.step-connector.active {
|
| 71 |
+
background: #6366f1;
|
| 72 |
+
}
|
| 73 |
+
|
| 74 |
+
/* Form field placeholder */
|
| 75 |
+
.sim-input {
|
| 76 |
+
background: #1e2040;
|
| 77 |
+
border: 1px solid #2d2f52;
|
| 78 |
+
border-radius: 8px;
|
| 79 |
+
padding: 8px 12px;
|
| 80 |
+
font-size: 12px;
|
| 81 |
+
color: #73759a;
|
| 82 |
+
transition: all 0.3s ease;
|
| 83 |
+
}
|
| 84 |
+
.sim-input.highlight {
|
| 85 |
+
border-color: #6366f1;
|
| 86 |
+
box-shadow: 0 0 0 2px rgba(99, 102, 241, 0.15);
|
| 87 |
+
}
|
| 88 |
+
|
| 89 |
+
/* CTA button pulse on change */
|
| 90 |
+
@keyframes ctaPulse {
|
| 91 |
+
0%, 100% { box-shadow: 0 4px 20px rgba(99, 102, 241, 0.25); }
|
| 92 |
+
50% { box-shadow: 0 4px 30px rgba(99, 102, 241, 0.5); }
|
| 93 |
+
}
|
| 94 |
+
.cta-pulse {
|
| 95 |
+
animation: ctaPulse 0.6s ease;
|
| 96 |
+
}
|
| 97 |
+
|
| 98 |
+
/* Outcome badge colors */
|
| 99 |
+
.outcome-complete { color: #34d399; }
|
| 100 |
+
.outcome-drop { color: #f87171; }
|
| 101 |
+
.outcome-distrust { color: #fbbf24; }
|
| 102 |
+
.outcome-continue { color: #818cf8; }
|
| 103 |
+
|
| 104 |
+
/* Leaderboard row highlight */
|
| 105 |
+
.lb-row-1 { background: rgba(99, 102, 241, 0.08); }
|
| 106 |
+
.lb-row-1 td:first-child { color: #818cf8; font-weight: 700; }
|
| 107 |
+
|
| 108 |
+
/* Running animation on buttons */
|
| 109 |
+
@keyframes btnPulse {
|
| 110 |
+
0%, 100% { opacity: 1; }
|
| 111 |
+
50% { opacity: 0.6; }
|
| 112 |
+
}
|
| 113 |
+
.btn-running {
|
| 114 |
+
animation: btnPulse 0.8s ease infinite;
|
| 115 |
+
pointer-events: none;
|
| 116 |
+
}
|
| 117 |
+
|
| 118 |
+
/* Flash effect for metric updates */
|
| 119 |
+
@keyframes flashGreen {
|
| 120 |
+
from { background-color: rgba(52, 211, 153, 0.15); }
|
| 121 |
+
to { background-color: transparent; }
|
| 122 |
+
}
|
| 123 |
+
@keyframes flashRed {
|
| 124 |
+
from { background-color: rgba(248, 113, 113, 0.15); }
|
| 125 |
+
to { background-color: transparent; }
|
| 126 |
+
}
|
| 127 |
+
.flash-green { animation: flashGreen 0.5s ease; }
|
| 128 |
+
.flash-red { animation: flashRed 0.5s ease; }
|
heuristic_agent.py
ADDED
|
@@ -0,0 +1,463 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
heuristic_agent.py
|
| 3 |
+
------------------
|
| 4 |
+
A high-performance heuristic agent for the UIEnv environment.
|
| 5 |
+
|
| 6 |
+
Architecture
|
| 7 |
+
============
|
| 8 |
+
The agent uses a **multi-stage decision pipeline** that evaluates conditions
|
| 9 |
+
in priority order. The first stage to produce an action wins.
|
| 10 |
+
|
| 11 |
+
Stage 1 β Risk Mitigation (prevent imminent drop)
|
| 12 |
+
Stage 2 β Feedback Adaptation (react to distrust / drop signals)
|
| 13 |
+
Stage 3 β Layout Optimization (converge toward ideal layout)
|
| 14 |
+
Stage 4 β Exploration (controlled randomness in safe states)
|
| 15 |
+
Stage 5 β Fallback (safe default when layout is near-optimal)
|
| 16 |
+
|
| 17 |
+
Internal state (outcome history, action history, noop streak) is used to
|
| 18 |
+
make context-aware decisions and avoid oscillation.
|
| 19 |
+
|
| 20 |
+
Includes a full evaluation harness that benchmarks the heuristic agent
|
| 21 |
+
against a random baseline.
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
from __future__ import annotations
|
| 25 |
+
|
| 26 |
+
import random
|
| 27 |
+
from collections import deque
|
| 28 |
+
from typing import Optional
|
| 29 |
+
|
| 30 |
+
from env import UIEnv, Action, Observation
|
| 31 |
+
|
| 32 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 33 |
+
# Optimal layout targets (derived from reward shaping in env.py)
|
| 34 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 35 |
+
|
| 36 |
+
BUTTON_SWEET_LOW: float = 0.9
|
| 37 |
+
BUTTON_SWEET_HIGH: float = 1.3
|
| 38 |
+
BUTTON_SWEET_MID: float = 1.1 # centre of the sweet spot for jumps
|
| 39 |
+
|
| 40 |
+
TARGET_STEPS: int = 2 # at or below β shaping bonus
|
| 41 |
+
TARGET_FORM_LENGTH: int = 4 # at or below β progress bonus
|
| 42 |
+
SAFE_FORM_FLOOR: int = 3 # do NOT reduce below this (careful-user trap)
|
| 43 |
+
|
| 44 |
+
DROP_STEPS_THRESHOLD: int = 3 # steps above this β impatient drop
|
| 45 |
+
DROP_FORM_THRESHOLD: int = 5 # form_length above this β impatient drop
|
| 46 |
+
|
| 47 |
+
EXPLORE_PROBABILITY: float = 0.07 # 7 % exploration rate
|
| 48 |
+
NOOP_SAFE_LIMIT: int = 1 # max consecutive noops before forcing action
|
| 49 |
+
|
| 50 |
+
# Inverse action pairs β used for oscillation detection
|
| 51 |
+
_INVERSE_ACTIONS: dict[str, str] = {
|
| 52 |
+
"increase_button": "set_button_size", # conceptual inverse
|
| 53 |
+
"increase_steps": "decrease_steps",
|
| 54 |
+
"decrease_steps": "increase_steps",
|
| 55 |
+
}
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 59 |
+
# Heuristic Agent
|
| 60 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 61 |
+
|
| 62 |
+
class HeuristicAgent:
|
| 63 |
+
"""
|
| 64 |
+
Structured, multi-stage heuristic agent for UIEnv.
|
| 65 |
+
|
| 66 |
+
The agent maintains internal state that is updated every step via
|
| 67 |
+
`update(info)`, and selects actions via `act(obs)` using a
|
| 68 |
+
priority-ordered decision pipeline.
|
| 69 |
+
"""
|
| 70 |
+
|
| 71 |
+
def __init__(self, seed: int = 99) -> None:
|
| 72 |
+
self._rng = random.Random(seed)
|
| 73 |
+
|
| 74 |
+
# ββ internal tracking ββ
|
| 75 |
+
self.last_outcome: Optional[str] = None
|
| 76 |
+
self.noop_streak: int = 0
|
| 77 |
+
self.action_history: deque[str] = deque(maxlen=5)
|
| 78 |
+
self.distrust_count: int = 0
|
| 79 |
+
self.drop_count: int = 0
|
| 80 |
+
self.step_number: int = 0
|
| 81 |
+
|
| 82 |
+
# ββββββββββββββββββββββββ public API ββββββββββββββββββββββββββ
|
| 83 |
+
|
| 84 |
+
def reset(self) -> None:
|
| 85 |
+
"""Clear per-episode state at the start of a new episode."""
|
| 86 |
+
self.last_outcome = None
|
| 87 |
+
self.noop_streak = 0
|
| 88 |
+
self.action_history.clear()
|
| 89 |
+
self.distrust_count = 0
|
| 90 |
+
self.drop_count = 0
|
| 91 |
+
self.step_number = 0
|
| 92 |
+
|
| 93 |
+
def act(self, obs: Observation) -> Action:
|
| 94 |
+
"""
|
| 95 |
+
Select the next action by running the decision pipeline.
|
| 96 |
+
|
| 97 |
+
Stages are evaluated in priority order; the first stage to return
|
| 98 |
+
a non-None action wins. This guarantees that safety-critical
|
| 99 |
+
adjustments always take precedence over optimisation moves.
|
| 100 |
+
"""
|
| 101 |
+
self.step_number += 1
|
| 102 |
+
|
| 103 |
+
action = (
|
| 104 |
+
self._risk_mitigation(obs)
|
| 105 |
+
or self._adaptation(obs)
|
| 106 |
+
or self._optimize_layout(obs)
|
| 107 |
+
or self._explore(obs)
|
| 108 |
+
or self._fallback(obs)
|
| 109 |
+
)
|
| 110 |
+
|
| 111 |
+
# Record for oscillation detection
|
| 112 |
+
self.action_history.append(action.type)
|
| 113 |
+
|
| 114 |
+
# Track noop streak
|
| 115 |
+
if action.type == "noop":
|
| 116 |
+
self.noop_streak += 1
|
| 117 |
+
else:
|
| 118 |
+
self.noop_streak = 0
|
| 119 |
+
|
| 120 |
+
return action
|
| 121 |
+
|
| 122 |
+
def update(self, info: dict) -> None:
|
| 123 |
+
"""Ingest environment info dict to update internal beliefs."""
|
| 124 |
+
outcome = info.get("outcome", "continue")
|
| 125 |
+
self.last_outcome = outcome
|
| 126 |
+
if outcome == "distrust":
|
| 127 |
+
self.distrust_count += 1
|
| 128 |
+
elif outcome == "drop":
|
| 129 |
+
self.drop_count += 1
|
| 130 |
+
|
| 131 |
+
# ββββββββββββββββββββββββ helpers βββββββββββββββββββββββββββββ
|
| 132 |
+
|
| 133 |
+
def _would_oscillate(self, candidate: str) -> bool:
|
| 134 |
+
"""
|
| 135 |
+
Return True if `candidate` would undo the most recent action,
|
| 136 |
+
creating a pointless back-and-forth oscillation.
|
| 137 |
+
"""
|
| 138 |
+
if not self.action_history:
|
| 139 |
+
return False
|
| 140 |
+
last = self.action_history[-1]
|
| 141 |
+
inv = _INVERSE_ACTIONS.get(candidate)
|
| 142 |
+
return last == inv or _INVERSE_ACTIONS.get(last) == candidate
|
| 143 |
+
|
| 144 |
+
@staticmethod
|
| 145 |
+
def _make(action_type: str, value: float | None = None) -> Action:
|
| 146 |
+
"""Shorthand to construct an Action."""
|
| 147 |
+
return Action(type=action_type, value=value)
|
| 148 |
+
|
| 149 |
+
# ββββββββββββ Stage 1: Risk Mitigation ββββββββββββββββββββββββ
|
| 150 |
+
|
| 151 |
+
def _risk_mitigation(self, obs: Observation) -> Optional[Action]:
|
| 152 |
+
"""
|
| 153 |
+
Immediately neutralise conditions that lead to user drop.
|
| 154 |
+
|
| 155 |
+
Priority:
|
| 156 |
+
1. steps > 3 β decrease_steps (impatient-drop rule)
|
| 157 |
+
2. form_length > 5 β decrease_form (impatient-drop rule)
|
| 158 |
+
|
| 159 |
+
Steps are prioritised because the impatient drop threshold for
|
| 160 |
+
steps (> 3) is stricter and more common than form (> 5).
|
| 161 |
+
"""
|
| 162 |
+
layout = obs.layout
|
| 163 |
+
|
| 164 |
+
if layout.steps > DROP_STEPS_THRESHOLD:
|
| 165 |
+
return self._make("decrease_steps")
|
| 166 |
+
|
| 167 |
+
if layout.form_length > DROP_FORM_THRESHOLD:
|
| 168 |
+
return self._make("decrease_form")
|
| 169 |
+
|
| 170 |
+
return None
|
| 171 |
+
|
| 172 |
+
# ββββββββββββ Stage 2: Feedback Adaptation ββββββββββββββββββββ
|
| 173 |
+
|
| 174 |
+
def _adaptation(self, obs: Observation) -> Optional[Action]:
|
| 175 |
+
"""
|
| 176 |
+
React to the most recent user outcome signal.
|
| 177 |
+
|
| 178 |
+
- 'distrust' means the layout is *too minimal* for this user type:
|
| 179 |
+
β’ new users distrust when steps < 2 β increase_steps
|
| 180 |
+
β’ careful users distrust when form_length < 3 β stop reducing
|
| 181 |
+
(since there is no increase_form action, we can only prevent
|
| 182 |
+
future reduction β but if steps are low, raising them is safe)
|
| 183 |
+
- 'drop' means the layout was *too heavy* β aggressively reduce
|
| 184 |
+
"""
|
| 185 |
+
if self.last_outcome == "distrust":
|
| 186 |
+
layout = obs.layout
|
| 187 |
+
|
| 188 |
+
# New-user distrust: steps too low
|
| 189 |
+
if layout.steps < 2 and not self._would_oscillate("increase_steps"):
|
| 190 |
+
return self._make("increase_steps")
|
| 191 |
+
|
| 192 |
+
# Careful-user distrust is likely about form being too short.
|
| 193 |
+
# We can't increase form, but we can ensure steps stay reasonable
|
| 194 |
+
# (having decent steps helps overall progress which offsets the
|
| 195 |
+
# distrust effect on the next simulation round).
|
| 196 |
+
if layout.steps < 2:
|
| 197 |
+
return self._make("increase_steps")
|
| 198 |
+
|
| 199 |
+
# If distrust persists but layout looks safe, do nothing drastic
|
| 200 |
+
# β let the optimiser handle it.
|
| 201 |
+
return None
|
| 202 |
+
|
| 203 |
+
if self.last_outcome == "drop":
|
| 204 |
+
layout = obs.layout
|
| 205 |
+
|
| 206 |
+
# Emergency: cut the most expensive dimension first
|
| 207 |
+
if layout.steps > 2 and not self._would_oscillate("decrease_steps"):
|
| 208 |
+
return self._make("decrease_steps")
|
| 209 |
+
|
| 210 |
+
if layout.form_length > SAFE_FORM_FLOOR:
|
| 211 |
+
return self._make("decrease_form")
|
| 212 |
+
|
| 213 |
+
return None
|
| 214 |
+
|
| 215 |
+
return None
|
| 216 |
+
|
| 217 |
+
# ββββββββββββ Stage 3: Layout Optimization ββββββββββββββββββββ
|
| 218 |
+
|
| 219 |
+
def _optimize_layout(self, obs: Observation) -> Optional[Action]:
|
| 220 |
+
"""
|
| 221 |
+
Gradually move the layout toward the ideal configuration:
|
| 222 |
+
button_size β [0.9, 1.3]
|
| 223 |
+
steps β€ 2
|
| 224 |
+
form_length β€ 4 (but β₯ 3 for safety)
|
| 225 |
+
|
| 226 |
+
Optimisation order (by reward impact):
|
| 227 |
+
1. steps β biggest reward shaping bonus (+0.1) AND progress bonus
|
| 228 |
+
2. form β progress bonus when β€ 4
|
| 229 |
+
3. button β shaping bonus (+0.1) when in sweet spot
|
| 230 |
+
|
| 231 |
+
Each call makes at most ONE change to avoid compounding effects
|
| 232 |
+
in a single step.
|
| 233 |
+
"""
|
| 234 |
+
layout = obs.layout
|
| 235 |
+
|
| 236 |
+
# ββ Steps: aim for TARGET_STEPS (2) ββ
|
| 237 |
+
if layout.steps > TARGET_STEPS and not self._would_oscillate("decrease_steps"):
|
| 238 |
+
# Don't reduce below 2 if we've seen distrust (new-user guard)
|
| 239 |
+
if not (self.distrust_count > 0 and layout.steps <= 2):
|
| 240 |
+
return self._make("decrease_steps")
|
| 241 |
+
|
| 242 |
+
# ββ Form: aim for TARGET_FORM_LENGTH (4) but never below SAFE_FORM_FLOOR (3) ββ
|
| 243 |
+
if layout.form_length > TARGET_FORM_LENGTH and layout.form_length > SAFE_FORM_FLOOR:
|
| 244 |
+
return self._make("decrease_form")
|
| 245 |
+
|
| 246 |
+
# ββ Button size: steer into sweet spot ββ
|
| 247 |
+
bs = layout.button_size
|
| 248 |
+
if bs < BUTTON_SWEET_LOW:
|
| 249 |
+
if not self._would_oscillate("increase_button"):
|
| 250 |
+
return self._make("increase_button")
|
| 251 |
+
|
| 252 |
+
if bs > BUTTON_SWEET_HIGH:
|
| 253 |
+
# Use set_button_size to jump directly into the sweet zone
|
| 254 |
+
# rather than slowly decrementing (no decrease_button action exists)
|
| 255 |
+
return self._make("set_button_size", BUTTON_SWEET_MID)
|
| 256 |
+
|
| 257 |
+
return None
|
| 258 |
+
|
| 259 |
+
# ββββββββββββ Stage 4: Exploration ββββββββββββββββββββββββββββ
|
| 260 |
+
|
| 261 |
+
def _explore(self, obs: Observation) -> Optional[Action]:
|
| 262 |
+
"""
|
| 263 |
+
Small controlled randomness to discover micro-improvements.
|
| 264 |
+
|
| 265 |
+
Only fires when:
|
| 266 |
+
- RNG says so (7 % chance)
|
| 267 |
+
- Last outcome was NOT negative (don't explore under stress)
|
| 268 |
+
- Layout is already reasonably safe
|
| 269 |
+
|
| 270 |
+
Exploration action: try a random button_size within the sweet spot.
|
| 271 |
+
This is the safest dimension to explore because it has no drop or
|
| 272 |
+
distrust rules tied to it.
|
| 273 |
+
"""
|
| 274 |
+
if self.last_outcome in ("drop", "distrust"):
|
| 275 |
+
return None
|
| 276 |
+
|
| 277 |
+
if self._rng.random() < EXPLORE_PROBABILITY:
|
| 278 |
+
target = self._rng.uniform(BUTTON_SWEET_LOW, BUTTON_SWEET_HIGH)
|
| 279 |
+
target = round(target, 2)
|
| 280 |
+
return self._make("set_button_size", target)
|
| 281 |
+
|
| 282 |
+
return None
|
| 283 |
+
|
| 284 |
+
# ββββββββββββ Stage 5: Fallback βββββββββββββββββββββββββββββββ
|
| 285 |
+
|
| 286 |
+
def _fallback(self, obs: Observation) -> Action:
|
| 287 |
+
"""
|
| 288 |
+
Default action when the layout is already near-optimal.
|
| 289 |
+
|
| 290 |
+
- If noop streak is still safe β noop (preserves a good layout)
|
| 291 |
+
- Otherwise β a tiny, safe micro-adjustment to break the streak
|
| 292 |
+
while keeping the layout in the sweet spot.
|
| 293 |
+
"""
|
| 294 |
+
if self.noop_streak < NOOP_SAFE_LIMIT:
|
| 295 |
+
return self._make("noop")
|
| 296 |
+
|
| 297 |
+
# Break the noop streak with a harmless move
|
| 298 |
+
bs = obs.layout.button_size
|
| 299 |
+
if bs <= BUTTON_SWEET_MID:
|
| 300 |
+
target = min(BUTTON_SWEET_HIGH, bs + 0.05)
|
| 301 |
+
else:
|
| 302 |
+
target = max(BUTTON_SWEET_LOW, bs - 0.05)
|
| 303 |
+
|
| 304 |
+
return self._make("set_button_size", round(target, 2))
|
| 305 |
+
|
| 306 |
+
|
| 307 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 308 |
+
# Random Agent (Baseline)
|
| 309 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 310 |
+
|
| 311 |
+
class RandomAgent:
|
| 312 |
+
"""Uniformly random discrete-action agent for baseline comparison."""
|
| 313 |
+
|
| 314 |
+
_ACTIONS = [
|
| 315 |
+
"increase_button",
|
| 316 |
+
"decrease_form",
|
| 317 |
+
"increase_steps",
|
| 318 |
+
"decrease_steps",
|
| 319 |
+
"reorder_sections",
|
| 320 |
+
"noop",
|
| 321 |
+
]
|
| 322 |
+
|
| 323 |
+
def __init__(self, seed: int = 99) -> None:
|
| 324 |
+
self._rng = random.Random(seed)
|
| 325 |
+
|
| 326 |
+
def reset(self) -> None:
|
| 327 |
+
pass
|
| 328 |
+
|
| 329 |
+
def act(self, obs: Observation) -> Action:
|
| 330 |
+
return Action(type=self._rng.choice(self._ACTIONS), value=None)
|
| 331 |
+
|
| 332 |
+
def update(self, info: dict) -> None:
|
| 333 |
+
pass
|
| 334 |
+
|
| 335 |
+
|
| 336 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 337 |
+
# Evaluation Harness
|
| 338 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 339 |
+
|
| 340 |
+
def run_evaluation(
|
| 341 |
+
agent,
|
| 342 |
+
n_episodes: int = 200,
|
| 343 |
+
env_seed: int = 42,
|
| 344 |
+
verbose: bool = False,
|
| 345 |
+
) -> dict:
|
| 346 |
+
"""
|
| 347 |
+
Run *n_episodes* in UIEnv with the given agent and collect metrics.
|
| 348 |
+
|
| 349 |
+
Returns
|
| 350 |
+
-------
|
| 351 |
+
dict with keys:
|
| 352 |
+
avg_reward, completion_rate, drop_rate, avg_steps
|
| 353 |
+
"""
|
| 354 |
+
env = UIEnv(seed=env_seed)
|
| 355 |
+
|
| 356 |
+
total_reward: float = 0.0
|
| 357 |
+
completions: int = 0
|
| 358 |
+
drops: int = 0
|
| 359 |
+
total_steps: int = 0
|
| 360 |
+
|
| 361 |
+
for ep in range(n_episodes):
|
| 362 |
+
obs = env.reset()
|
| 363 |
+
agent.reset()
|
| 364 |
+
ep_reward: float = 0.0
|
| 365 |
+
done = False
|
| 366 |
+
|
| 367 |
+
while not done:
|
| 368 |
+
action = agent.act(obs)
|
| 369 |
+
obs, reward, done, info = env.step(action)
|
| 370 |
+
agent.update(info)
|
| 371 |
+
ep_reward += reward
|
| 372 |
+
|
| 373 |
+
total_reward += ep_reward
|
| 374 |
+
total_steps += info["step_count"]
|
| 375 |
+
|
| 376 |
+
if info["outcome"] == "complete":
|
| 377 |
+
completions += 1
|
| 378 |
+
elif info["outcome"] == "drop":
|
| 379 |
+
drops += 1
|
| 380 |
+
|
| 381 |
+
if verbose and ep < 10:
|
| 382 |
+
print(
|
| 383 |
+
f" ep={ep:03d} outcome={info['outcome']:<10s} "
|
| 384 |
+
f"reward={ep_reward:+.3f} steps={info['step_count']}"
|
| 385 |
+
)
|
| 386 |
+
|
| 387 |
+
return {
|
| 388 |
+
"avg_reward": total_reward / n_episodes,
|
| 389 |
+
"completion_rate": completions / n_episodes,
|
| 390 |
+
"drop_rate": drops / n_episodes,
|
| 391 |
+
"avg_steps": total_steps / n_episodes,
|
| 392 |
+
}
|
| 393 |
+
|
| 394 |
+
|
| 395 |
+
def _fmt_pct(v: float) -> str:
|
| 396 |
+
return f"{v * 100:.1f}%"
|
| 397 |
+
|
| 398 |
+
|
| 399 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 400 |
+
# Main β run benchmark
|
| 401 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 402 |
+
|
| 403 |
+
if __name__ == "__main__":
|
| 404 |
+
|
| 405 |
+
N_EPISODES = 200
|
| 406 |
+
|
| 407 |
+
print("=" * 64)
|
| 408 |
+
print(" UIEnv Heuristic Agent -- Benchmark Suite")
|
| 409 |
+
print("=" * 64)
|
| 410 |
+
|
| 411 |
+
# -- Heuristic Agent --
|
| 412 |
+
print("\n> Running Heuristic Agent ...")
|
| 413 |
+
h_agent = HeuristicAgent(seed=99)
|
| 414 |
+
h_metrics = run_evaluation(h_agent, n_episodes=N_EPISODES, verbose=True)
|
| 415 |
+
|
| 416 |
+
# -- Random Baseline --
|
| 417 |
+
print("\n> Running Random Agent ...")
|
| 418 |
+
r_agent = RandomAgent(seed=99)
|
| 419 |
+
r_metrics = run_evaluation(r_agent, n_episodes=N_EPISODES, verbose=True)
|
| 420 |
+
|
| 421 |
+
# -- Comparison Table --
|
| 422 |
+
print("\n" + "-" * 64)
|
| 423 |
+
print(f" {'Metric':<22s} {'Heuristic':>12s} {'Random':>12s} {'Delta':>12s}")
|
| 424 |
+
print("-" * 64)
|
| 425 |
+
|
| 426 |
+
for key, label in [
|
| 427 |
+
("avg_reward", "Avg Reward"),
|
| 428 |
+
("completion_rate", "Completion Rate"),
|
| 429 |
+
("drop_rate", "Drop Rate"),
|
| 430 |
+
("avg_steps", "Avg Steps"),
|
| 431 |
+
]:
|
| 432 |
+
h_val = h_metrics[key]
|
| 433 |
+
r_val = r_metrics[key]
|
| 434 |
+
delta = h_val - r_val
|
| 435 |
+
|
| 436 |
+
if "rate" in key:
|
| 437 |
+
h_str = _fmt_pct(h_val)
|
| 438 |
+
r_str = _fmt_pct(r_val)
|
| 439 |
+
d_str = f"{delta * 100:+.1f}pp"
|
| 440 |
+
elif "step" in key:
|
| 441 |
+
h_str = f"{h_val:.1f}"
|
| 442 |
+
r_str = f"{r_val:.1f}"
|
| 443 |
+
d_str = f"{delta:+.1f}"
|
| 444 |
+
else:
|
| 445 |
+
h_str = f"{h_val:+.4f}"
|
| 446 |
+
r_str = f"{r_val:+.4f}"
|
| 447 |
+
d_str = f"{delta:+.4f}"
|
| 448 |
+
|
| 449 |
+
print(f" {label:<22s} {h_str:>12s} {r_str:>12s} {d_str:>12s}")
|
| 450 |
+
|
| 451 |
+
print("-" * 64)
|
| 452 |
+
|
| 453 |
+
# -- Verdict --
|
| 454 |
+
lift = h_metrics["avg_reward"] - r_metrics["avg_reward"]
|
| 455 |
+
if lift > 0.2:
|
| 456 |
+
verdict = "[PASS] STRONG improvement over random baseline"
|
| 457 |
+
elif lift > 0.05:
|
| 458 |
+
verdict = "[WARN] Moderate improvement -- consider tuning"
|
| 459 |
+
else:
|
| 460 |
+
verdict = "[FAIL] Marginal -- agent needs rework"
|
| 461 |
+
|
| 462 |
+
print(f"\n Verdict: {verdict}")
|
| 463 |
+
print(f" Reward lift: {lift:+.4f}\n")
|
leaderboard.json
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"agent_name": "RandomAgent",
|
| 4 |
+
"score": 1.3095999999999997,
|
| 5 |
+
"completion_rate": 1.0,
|
| 6 |
+
"drop_rate": 0.0,
|
| 7 |
+
"avg_reward": 2.031999999999999,
|
| 8 |
+
"avg_steps": 2.64,
|
| 9 |
+
"total_episodes": 50
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"agent_name": "HeuristicAgent",
|
| 13 |
+
"score": 1.2999999999999998,
|
| 14 |
+
"completion_rate": 1.0,
|
| 15 |
+
"drop_rate": 0.0,
|
| 16 |
+
"avg_reward": 2.0,
|
| 17 |
+
"avg_steps": 2.0,
|
| 18 |
+
"total_episodes": 50
|
| 19 |
+
}
|
| 20 |
+
]
|
openenv.yaml
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: ui_layout_optimizer
|
| 2 |
+
version: 1.0.0
|
| 3 |
+
description: "Adaptive UI Layout Optimization Environment for training agents to maximize user completion and satisfaction in digital checkout flows."
|
| 4 |
+
|
| 5 |
+
action_space:
|
| 6 |
+
increase_button: "Increases the UI button size by 0.1 increments."
|
| 7 |
+
decrease_form: "Reduces the number of form fields to decrease user friction."
|
| 8 |
+
increase_steps: "Adds a step to the wizard flow to separate complex tasks."
|
| 9 |
+
decrease_steps: "Removes a step from the flow to reduce user fatigue."
|
| 10 |
+
reorder_sections: "Optimizes the logical order of UI components."
|
| 11 |
+
set_button_size: "Directly sets the button size multiplier (Continuous: 0.5 - 2.0)."
|
| 12 |
+
noop: "No operation. Keeps the current layout state."
|
| 13 |
+
|
| 14 |
+
observation_space:
|
| 15 |
+
device: "User device type: mobile or desktop."
|
| 16 |
+
layout:
|
| 17 |
+
button_size: "Current button size multiplier (0.5 to 2.0)."
|
| 18 |
+
form_length: "Number of fields in the current form (1 to 10)."
|
| 19 |
+
steps: "Number of steps in the current checkout flow (1 to 10)."
|
| 20 |
+
progress: "Current completion progress percentage (0.0 to 1.0)."
|
| 21 |
+
|
| 22 |
+
tasks:
|
| 23 |
+
easy:
|
| 24 |
+
description: "Discrete actions only. Known user type with high patience levels."
|
| 25 |
+
difficulty: 0.2
|
| 26 |
+
medium:
|
| 27 |
+
description: "Mixed user personas. Stochastic transitions and moderate friction thresholds."
|
| 28 |
+
difficulty: 0.5
|
| 29 |
+
hard:
|
| 30 |
+
description: "Hidden user types. Continuous actions allowed. High noise and conflicting objectives."
|
| 31 |
+
difficulty: 0.9
|
prd_adaptive_ui_layout_optimization_environment_final_enhanced.md
ADDED
|
@@ -0,0 +1,305 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Product Requirements Document (PRD)
|
| 2 |
+
|
| 3 |
+
## Product Name
|
| 4 |
+
Adaptive UI Layout Optimization Environment (OpenEnv)
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## 1. Problem Statement
|
| 9 |
+
Static A/B testing cannot adapt UI layouts per user in real time, leading to suboptimal conversions and user experience. We need a standardized, reproducible environment where AI agents learn to adapt UI layouts dynamically based on user behavior.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## 2. Objective
|
| 14 |
+
Build an OpenEnv-compliant environment that simulates user interaction with UI layouts and enables agents to optimize for:
|
| 15 |
+
- Completion rate
|
| 16 |
+
- User satisfaction
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
## 3. Success Metrics
|
| 21 |
+
- Deterministic grader score (0.0β1.0)
|
| 22 |
+
- Reproducible baseline results (Β±1% variance)
|
| 23 |
+
- Increasing reward trend across steps
|
| 24 |
+
- OpenEnv validation passes
|
| 25 |
+
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
## 4. Tech Stack (Required)
|
| 29 |
+
|
| 30 |
+
### Core Language
|
| 31 |
+
- Python 3.10+
|
| 32 |
+
|
| 33 |
+
### Backend & Environment
|
| 34 |
+
- Pydantic (typed models)
|
| 35 |
+
- FastAPI (optional)
|
| 36 |
+
|
| 37 |
+
### AI / Agent
|
| 38 |
+
- OpenAI API (baseline agent)
|
| 39 |
+
|
| 40 |
+
### Simulation & Utilities
|
| 41 |
+
- NumPy
|
| 42 |
+
- random (seeded)
|
| 43 |
+
|
| 44 |
+
### Visualization
|
| 45 |
+
- Streamlit / simple HTML renderer (for layout visualization)
|
| 46 |
+
|
| 47 |
+
### Deployment
|
| 48 |
+
- Docker
|
| 49 |
+
- Hugging Face Spaces
|
| 50 |
+
|
| 51 |
+
### Config
|
| 52 |
+
- YAML (openenv.yaml)
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## 5. System Design
|
| 57 |
+
|
| 58 |
+
### 5.1 Observation Schema
|
| 59 |
+
```python
|
| 60 |
+
class Layout(BaseModel):
|
| 61 |
+
button_size: float # 0.5β2.0 (continuous in hard task)
|
| 62 |
+
form_length: int # 1β10
|
| 63 |
+
steps: int # 1β5
|
| 64 |
+
|
| 65 |
+
class Observation(BaseModel):
|
| 66 |
+
device: Literal['mobile','desktop']
|
| 67 |
+
layout: Layout
|
| 68 |
+
progress: float
|
| 69 |
+
last_action: str | None
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
### 5.2 Action Schema
|
| 75 |
+
```python
|
| 76 |
+
class Action(BaseModel):
|
| 77 |
+
type: Literal[
|
| 78 |
+
'increase_button',
|
| 79 |
+
'decrease_form',
|
| 80 |
+
'increase_steps',
|
| 81 |
+
'decrease_steps',
|
| 82 |
+
'reorder_sections',
|
| 83 |
+
'set_button_size', # continuous action (hard task)
|
| 84 |
+
'noop'
|
| 85 |
+
]
|
| 86 |
+
value: float | None
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
---
|
| 90 |
+
|
| 91 |
+
### 5.3 Hidden State
|
| 92 |
+
- user_type β {impatient, careful, new}
|
| 93 |
+
- tolerance threshold
|
| 94 |
+
- trust threshold
|
| 95 |
+
|
| 96 |
+
---
|
| 97 |
+
|
| 98 |
+
## 6. User Simulation
|
| 99 |
+
|
| 100 |
+
### Deterministic Rules
|
| 101 |
+
| User Type | Condition | Outcome |
|
| 102 |
+
|----------|----------|--------|
|
| 103 |
+
| impatient | steps > 3 | drop |
|
| 104 |
+
| impatient | form_length > 5 | drop |
|
| 105 |
+
| careful | form_length < 3 | distrust |
|
| 106 |
+
| new_user | steps < 2 | distrust |
|
| 107 |
+
|
| 108 |
+
### Probabilistic Layer
|
| 109 |
+
```python
|
| 110 |
+
if outcome == "continue":
|
| 111 |
+
if random(seed).random() < 0.1:
|
| 112 |
+
return "drop"
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
---
|
| 116 |
+
|
| 117 |
+
## 7. Reward Function
|
| 118 |
+
|
| 119 |
+
Let:
|
| 120 |
+
- C = completion
|
| 121 |
+
- P = progress
|
| 122 |
+
- D = drop
|
| 123 |
+
|
| 124 |
+
```
|
| 125 |
+
R = 0.5*C + 0.3*P - 0.4*D
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
Shaping:
|
| 129 |
+
- optimal button_size range (0.9β1.3) β +0.1
|
| 130 |
+
- steps β€ 2 β +0.1
|
| 131 |
+
- form_length > 6 β -0.2
|
| 132 |
+
- repeated noop β -0.3
|
| 133 |
+
|
| 134 |
+
---
|
| 135 |
+
|
| 136 |
+
## 8. Episode Lifecycle
|
| 137 |
+
- max_steps = 10 (default)
|
| 138 |
+
- extended mode: 20+ steps (scalability test)
|
| 139 |
+
|
| 140 |
+
Termination:
|
| 141 |
+
- complete
|
| 142 |
+
- drop
|
| 143 |
+
- max steps reached
|
| 144 |
+
|
| 145 |
+
---
|
| 146 |
+
|
| 147 |
+
## 9. Tasks
|
| 148 |
+
|
| 149 |
+
### Easy
|
| 150 |
+
- discrete actions only
|
| 151 |
+
- known user type
|
| 152 |
+
|
| 153 |
+
### Medium
|
| 154 |
+
- mixed users
|
| 155 |
+
- stochastic transitions
|
| 156 |
+
|
| 157 |
+
### Hard
|
| 158 |
+
- hidden user type
|
| 159 |
+
- continuous action (button_size tuning)
|
| 160 |
+
- conflicting objectives
|
| 161 |
+
- noisy feedback
|
| 162 |
+
|
| 163 |
+
---
|
| 164 |
+
|
| 165 |
+
## 10. Grader
|
| 166 |
+
|
| 167 |
+
Run N=50 episodes
|
| 168 |
+
|
| 169 |
+
Metrics:
|
| 170 |
+
- completion_rate
|
| 171 |
+
- avg_reward
|
| 172 |
+
|
| 173 |
+
```
|
| 174 |
+
Score = 0.7 * completion_rate + 0.3 * avg_reward
|
| 175 |
+
```
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## 11. Benchmarking & Leaderboard
|
| 180 |
+
|
| 181 |
+
Include:
|
| 182 |
+
- Random policy baseline
|
| 183 |
+
- Heuristic rule-based baseline
|
| 184 |
+
- LLM-based baseline
|
| 185 |
+
|
| 186 |
+
Metrics:
|
| 187 |
+
- score
|
| 188 |
+
- avg_reward
|
| 189 |
+
- episodes-to-convergence
|
| 190 |
+
|
| 191 |
+
Leaderboard displayed in README / UI
|
| 192 |
+
|
| 193 |
+
---
|
| 194 |
+
|
| 195 |
+
## 12. Visualization (WOW Factor)
|
| 196 |
+
|
| 197 |
+
- Render layout using Streamlit or HTML
|
| 198 |
+
- Show:
|
| 199 |
+
- button size visually
|
| 200 |
+
- number of form fields
|
| 201 |
+
- step flow
|
| 202 |
+
|
| 203 |
+
- Integrate into HF Space UI
|
| 204 |
+
|
| 205 |
+
---
|
| 206 |
+
|
| 207 |
+
## 13. Environment API
|
| 208 |
+
|
| 209 |
+
```python
|
| 210 |
+
def reset() -> Observation
|
| 211 |
+
|
| 212 |
+
def step(action: Action) -> tuple[Observation, float, bool, dict]
|
| 213 |
+
|
| 214 |
+
def state() -> Observation
|
| 215 |
+
```
|
| 216 |
+
|
| 217 |
+
---
|
| 218 |
+
|
| 219 |
+
## 14. openenv.yaml
|
| 220 |
+
|
| 221 |
+
```yaml
|
| 222 |
+
name: ui_optimizer_env
|
| 223 |
+
version: 1.0
|
| 224 |
+
|
| 225 |
+
actions:
|
| 226 |
+
- increase_button
|
| 227 |
+
- decrease_form
|
| 228 |
+
- increase_steps
|
| 229 |
+
- decrease_steps
|
| 230 |
+
- reorder_sections
|
| 231 |
+
- set_button_size
|
| 232 |
+
- noop
|
| 233 |
+
|
| 234 |
+
observations:
|
| 235 |
+
device: string
|
| 236 |
+
layout: object
|
| 237 |
+
progress: float
|
| 238 |
+
|
| 239 |
+
tasks:
|
| 240 |
+
- easy
|
| 241 |
+
- medium
|
| 242 |
+
- hard
|
| 243 |
+
```
|
| 244 |
+
|
| 245 |
+
---
|
| 246 |
+
|
| 247 |
+
## 15. Baseline Agent
|
| 248 |
+
|
| 249 |
+
- deterministic
|
| 250 |
+
- temperature = 0
|
| 251 |
+
- fixed seeds
|
| 252 |
+
|
| 253 |
+
---
|
| 254 |
+
|
| 255 |
+
## 16. Scalability Tests
|
| 256 |
+
|
| 257 |
+
- extended episode length (20+ steps)
|
| 258 |
+
- batch simulation (multiple users)
|
| 259 |
+
- stress test reward stability
|
| 260 |
+
|
| 261 |
+
---
|
| 262 |
+
|
| 263 |
+
## 17. Non-Functional Requirements
|
| 264 |
+
- Dockerized
|
| 265 |
+
- HF Space deployable
|
| 266 |
+
- openenv validate passes
|
| 267 |
+
- reproducible outputs
|
| 268 |
+
|
| 269 |
+
---
|
| 270 |
+
|
| 271 |
+
## 18. Edge Cases
|
| 272 |
+
- infinite loops β penalty
|
| 273 |
+
- invalid actions β ignore + penalty
|
| 274 |
+
- conflicting actions β last action wins
|
| 275 |
+
|
| 276 |
+
---
|
| 277 |
+
|
| 278 |
+
## 19. Risks & Mitigation
|
| 279 |
+
|
| 280 |
+
| Risk | Mitigation |
|
| 281 |
+
|-----|-----------|
|
| 282 |
+
| weak simulation | hybrid rules + randomness |
|
| 283 |
+
| instability | fixed seeds |
|
| 284 |
+
| trivial agent success | stronger hard task |
|
| 285 |
+
|
| 286 |
+
---
|
| 287 |
+
|
| 288 |
+
## 20. Deliverables
|
| 289 |
+
- environment code
|
| 290 |
+
- tasks + grader
|
| 291 |
+
- baselines
|
| 292 |
+
- leaderboard
|
| 293 |
+
- visualization UI
|
| 294 |
+
- Dockerfile
|
| 295 |
+
- HF deployment
|
| 296 |
+
- README
|
| 297 |
+
|
| 298 |
+
---
|
| 299 |
+
|
| 300 |
+
## FINAL STATUS
|
| 301 |
+
|
| 302 |
+
β Fully optimized for hackathon scoring
|
| 303 |
+
β High novelty + strong evaluation
|
| 304 |
+
β Ready for implementation
|
| 305 |
+
|
requirements.txt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
openai
|
| 2 |
+
pydantic
|
| 3 |
+
numpy
|