---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---
# LexEnvs — Harbor RL Environment
A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.
## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```
## Docker

```bash
docker compose up -d   # Start server on port 8000
docker compose down    # Stop
```
## API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/health/live` | Liveness probe |
| GET | `/api/tasks` | List all tasks (id, domain, difficulty) |
| GET | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| POST | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
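Calling the evaluate endpoint from Python might look like the sketch below. The request-body shape (`{"answer": ...}`) and the answer field names are assumptions for illustration only; check the request models in `src/lexenvs/schemas/` for the actual contract.

```python
import json
from urllib import request


def evaluate(base_url: str, task_id: str, answer: dict) -> dict:
    """POST a structured answer to the evaluate endpoint; return the parsed score payload."""
    req = request.Request(
        f"{base_url}/api/tasks/{task_id}/evaluate",
        data=json.dumps({"answer": answer}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Illustrative answer payload -- these field names are assumptions,
# not the server's actual schema.
answer = {
    "recommended_cards": ["card_a", "card_b"],
    "expected_value": 1234.5,
    "reasoning": "Card A maximizes dining multipliers; Card B covers travel.",
}
```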
## Modal Deployment

The full RL environment runs on Modal — the harbor server as a web endpoint and GRPO training on A100 GPUs.
### 1. Setup

```bash
# Install Modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```
### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```
### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
  --model Qwen/Qwen2.5-14B-Instruct \
  --num-samples 8 \
  --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
  --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60 GB for the 32B model) to a persistent Modal volume; subsequent runs start generating immediately.

Results are saved locally to `results/grpo-modal/`.
### Training Parameters

| Flag | Default | Description |
|---|---|---|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | 4 | Completions per task per iteration |
| `--iterations` | 3 | Number of GRPO iterations |
| `--temp-start` | 0.8 | Starting temperature (exploration) |
| `--temp-end` | 0.4 | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | false | Disable few-shot example injection |
| `--max-model-len` | 32768 | vLLM context window size |
## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
  --model qwen2.5:7b \
  -N 4 \
  --iterations 3 \
  --tasks easy_01 easy_02
```
## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
  --model claude-haiku-4-5-20251001
```
### Baseline Results

| Model | Avg Reward | Method |
|---|---|---|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| Qwen 32B | 0.246 | GRPO best-of-4 |
## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

- **Generates N completions** — vLLM produces N candidate answers in a single batched call
- **Scores each completion** — the harbor server evaluates against the rubric
- **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
- **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations

Temperature decays linearly across iterations (exploration to exploitation).

Currently this is best-of-N selection without gradient updates; the architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
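The advantage computation, temperature schedule, and selection step can be sketched as below. This is a hypothetical distillation for illustration, not the project's actual code:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """advantage = (reward - group_mean) / group_std, with an epsilon for degenerate groups."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


def temperature(iteration: int, iterations: int,
                temp_start: float = 0.8, temp_end: float = 0.4) -> float:
    """Linear decay from temp_start to temp_end across iterations."""
    if iterations <= 1:
        return temp_start
    return temp_start + (temp_end - temp_start) * iteration / (iterations - 1)


def best_of_n(completions: list[str], rewards: list[float]) -> str:
    """Best-of-N selection: keep the completion with the highest advantage."""
    advantages = group_relative_advantages(rewards)
    return max(zip(completions, advantages), key=lambda pair: pair[1])[0]
```

Because the advantage is a monotonic transform of the raw reward within a group, best-of-N selection picks the same completion either way; the normalization matters once advantages feed an actual policy-gradient update.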
### Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
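As a sketch of how the weighted dimensions combine into a single reward (the dimension keys and function names here are illustrative, not the evaluator's real API):

```python
# Illustrative dimension keys and weights, mirroring the rubric above.
WEIGHTS = {
    "ev_accuracy": 0.40,                # automated
    "constraint_compliance": 0.30,      # automated
    "reasoning_quality": 0.20,          # human
    "constraint_prioritization": 0.10,  # human
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity, as used for the recommended-card portion of compliance."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rubric_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores in [0, 1] -> overall reward in [0, 1]."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)
```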
## Project Structure

```
src/lexenvs/
├── harbor_app.py                    # FastAPI entry point
├── config/                          # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py                 # API endpoints
├── schemas/                         # Request/response models
├── services/
│   ├── task_loader_service.py       # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py  # Multi-dimensional scoring
│   └── kb_filter_service.py         # Per-task KB filtering
└── utils.py                         # JSON extraction utilities

data/
├── tasks/                           # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md                # Shared KB (~56K chars: issuers, cards, rules)

scripts/
├── local/
│   ├── smoke_test.py                # Server verification + LLM evaluation
│   ├── run_baseline_eval.py         # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py                # GRPO training (local/Ollama)
│   └── grpo_common.py               # Shared GRPO types and helpers
└── modal/
    ├── harbor.py                    # Deploy harbor server
    └── grpo_train.py                # GRPO training on A100

results/                             # Evaluation results (gitignored)
```
## Development

```bash
uv run pytest -v                                                  # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v   # Integration tests
uv run ruff check .                                               # Lint
uv run ruff format .                                              # Format
uv run mypy src/lexenvs                                           # Type check
```
## Contributing

See CONTRIBUTING.md for development setup, coding standards, and how to add new tasks.
## Acknowledgments

Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from The Points Guy and issuer websites.
## License

This project is licensed under the Apache License 2.0 — see LICENSE for details.