---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
- reinforcement-learning
- evaluation
- credit-cards
- grpo
- rl-environment
---
# LexEnvs — Harbor RL Environment
A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and are scored against a multi-dimensional rubric.
## Quick Start
```bash
# Install dependencies
uv sync
# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload
# Run tests
uv run pytest -v
# Lint & format
uv run ruff check .
uv run ruff format .
```
### Docker
```bash
docker compose up -d # Start server on port 8000
docker compose down # Stop
```
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
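The endpoints can be driven from any HTTP client. Here is a minimal stdlib Python sketch; note the POST request body shape (`{"answer": ...}`) is an assumption, so check the `schemas/` package for the authoritative request models:

```python
# Sketch of a harbor API client (stdlib only).
# The POST body shape is an assumption -- see schemas/ for the real models.
import json
import urllib.request


def build_evaluate_request(base_url: str, task_id: str, answer: dict) -> tuple[str, bytes]:
    """Construct the URL and JSON body for POST /api/tasks/{task_id}/evaluate."""
    url = f"{base_url}/api/tasks/{task_id}/evaluate"
    return url, json.dumps({"answer": answer}).encode()


def evaluate(base_url: str, task_id: str, answer: dict) -> dict:
    """Send the agent's answer and return the evaluation response (reward in [0, 1])."""
    url, body = build_evaluate_request(base_url, task_id, answer)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```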
## Modal Deployment
The full RL environment runs on [Modal](https://modal.com): the harbor server is deployed as a persistent web endpoint, and GRPO training runs on A100 GPUs.
### 1. Setup
```bash
# Install modal (included in project deps)
uv sync
# Authenticate with Modal
uv run modal setup
```
### 2. Deploy the Harbor Server
```bash
uv run modal deploy scripts/modal/harbor.py
```
This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:
```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```
Verify it's working:
```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```
### 3. Run GRPO Training
```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
--model Qwen/Qwen2.5-14B-Instruct \
--num-samples 8 \
--iterations 5
# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
--tasks easy_01,easy_02,medium_01
```
The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.
Results are saved locally to `results/grpo-modal/`.
### Training Parameters
| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |
## Local Training (Ollama)
For quick iteration without cloud GPUs:
```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &
# Pull a model
ollama pull qwen2.5:7b
# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b
# Customize
uv run python scripts/local/grpo_train.py \
--model qwen2.5:7b \
-N 4 \
--iterations 3 \
--tasks easy_01 easy_02
```
## Baseline Evaluation
Run Claude models against all tasks to establish baselines:
```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
--model claude-haiku-4-5-20251001
```
### Baseline Results
| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |
## How It Works
### GRPO (Group Relative Policy Optimization)
For each task, the training loop:
1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations
Temperature decays linearly across iterations (exploration to exploitation).
Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
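The advantage formula and the temperature schedule above can be sketched as follows (helper names are illustrative, not the repo's actual functions):

```python
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: (reward - group_mean) / group_std.
    Degenerate groups (all rewards equal) get zero advantage."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def temperature(iteration: int, total: int, start: float = 0.8, end: float = 0.4) -> float:
    """Linear decay from --temp-start to --temp-end across iterations."""
    if total <= 1:
        return start
    frac = iteration / (total - 1)
    return start + (end - start) * frac
```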
### Scoring Rubric
Each task is scored on four weighted dimensions:
- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints
When no reference solution exists, scoring falls back to JSON structure validation.
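In sketch form, the weighted aggregation looks like this (illustrative names; the real logic lives in `rubric_evaluator_service.py` and may differ in detail):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two card sets; empty vs. empty counts as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Rubric weights from the four dimensions above.
WEIGHTS = {
    "ev_accuracy": 0.40,
    "constraint_compliance": 0.30,
    "reasoning_quality": 0.20,
    "constraint_prioritization": 0.10,
}


def aggregate(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)
```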
## Project Structure
```
src/lexenvs/
├── harbor_app.py # FastAPI entry point
├── config/ # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py # API endpoints
├── schemas/ # Request/response models
├── services/
│ ├── task_loader_service.py # Loads tasks from data/tasks/*.json
│ ├── rubric_evaluator_service.py # Multi-dimensional scoring
│ └── kb_filter_service.py # Per-task KB filtering
└── utils.py # JSON extraction utilities
data/
├── tasks/ # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md # Shared KB (~56K chars: issuers, cards, rules)
scripts/
├── local/
│ ├── smoke_test.py # Server verification + LLM evaluation
│ ├── run_baseline_eval.py # Baseline model evaluation (Anthropic API)
│ ├── grpo_train.py # GRPO training (local/Ollama)
│ └── grpo_common.py # Shared GRPO types and helpers
└── modal/
├── harbor.py # Deploy harbor server
└── grpo_train.py # GRPO training on A100
results/ # Evaluation results (gitignored)
```
## Development
```bash
uv run pytest -v # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v # Integration tests
uv run ruff check . # Lint
uv run ruff format . # Format
uv run mypy src/lexenvs # Type check
```
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.
## Acknowledgments
Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.
## License
This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.