---
title: LexEnvs Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
- reinforcement-learning
- evaluation
- credit-cards
- grpo
- rl-environment
---
# LexEnvs — Harbor RL Environment
A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.
## Quick Start
```bash
# Install dependencies
uv sync
# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload
# Run tests
uv run pytest -v
# Lint & format
uv run ruff check .
uv run ruff format .
```
### Docker
```bash
docker compose up -d # Start server on port 8000
docker compose down # Stop
```
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
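For reference, a minimal Python client exercising these endpoints might look like the sketch below. The paths and the `tasks` response shape match the table and the `curl` examples further down; the fields in the answer payload (`recommended_cards`, `reasoning`) are illustrative placeholders, not the documented schema.
```python
# Hypothetical client sketch -- endpoint paths follow the table above,
# but the answer payload fields are placeholder assumptions.
import httpx

BASE_URL = "http://localhost:8000"

with httpx.Client(base_url=BASE_URL) as client:
    # List available tasks
    tasks = client.get("/api/tasks").json()["tasks"]
    task_id = tasks[0]["task_id"]

    # Fetch the task prompt (system instructions, KB context, user constraints)
    prompt = client.get(f"/api/tasks/{task_id}").json()

    # Submit an answer for scoring; the exact request schema is defined by
    # the task, so these field names are placeholders.
    answer = {
        "recommended_cards": ["example_card_a", "example_card_b"],
        "reasoning": "Placeholder analysis of expected value and constraints.",
    }
    result = client.post(f"/api/tasks/{task_id}/evaluate", json=answer).json()
    print(result)  # reward in [0, 1]
```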
## Modal Deployment
The full RL environment runs on [Modal](https://modal.com) — the harbor server as a web endpoint and GRPO training on A100 GPUs.
### 1. Setup
```bash
# Install modal (included in project deps)
uv sync
# Authenticate with Modal
uv run modal setup
```
### 2. Deploy the Harbor Server
```bash
uv run modal deploy scripts/modal/harbor.py
```
This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:
```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```
Verify it's working:
```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```
### 3. Run GRPO Training
```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
--model Qwen/Qwen2.5-14B-Instruct \
--num-samples 8 \
--iterations 5
# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
--tasks easy_01,easy_02,medium_01
```
The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.
Results are saved locally to `results/grpo-modal/`.
### Training Parameters
| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |
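The `--temp-start`/`--temp-end` pair defines the linear decay described under "How It Works". A rough sketch of one plausible schedule (the actual implementation lives in `scripts/modal/grpo_train.py`):
```python
def temperature(iteration: int, iterations: int,
                start: float = 0.8, end: float = 0.4) -> float:
    """Linearly interpolate from start (exploration) to end (exploitation)."""
    if iterations <= 1:
        return start
    return start + (end - start) * iteration / (iterations - 1)
```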
## Local Training (Ollama)
For quick iteration without cloud GPUs:
```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &
# Pull a model
ollama pull qwen2.5:7b
# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b
# Customize
uv run python scripts/local/grpo_train.py \
--model qwen2.5:7b \
-N 4 \
--iterations 3 \
--tasks easy_01 easy_02
```
## Baseline Evaluation
Run Claude models against all tasks to establish baselines:
```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
--model claude-haiku-4-5-20251001
```
### Baseline Results
| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |
## How It Works
### GRPO (Group Relative Policy Optimization)
For each task, the training loop:
1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations
Temperature decays linearly across iterations (exploration to exploitation).
Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
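A stripped-down sketch of one iteration, following the four steps and the advantage formula above. Here `generate` and `score` are stand-ins for the vLLM batched call and the harbor `/evaluate` request, not the project's real function names:
```python
import statistics

def grpo_iteration(prompt: str, generate, score, n: int = 4):
    """One best-of-N step: sample, score, rank by group-relative advantage.

    `generate(prompt, n)` and `score(completion)` are assumed callables
    standing in for vLLM generation and the harbor evaluation endpoint.
    """
    completions = generate(prompt, n)                 # N candidate answers
    rewards = [score(c) for c in completions]         # rubric scores in [0, 1]

    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0           # guard against zero std
    advantages = [(r - mean) / std for r in rewards]  # group-relative advantage

    # Best-of-N selection: keep the top completion, optionally reused as a
    # few-shot example in the next iteration (no gradient update here).
    best = max(zip(advantages, completions), key=lambda pair: pair[0])
    return best[1], advantages
```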
### Scoring Rubric
Each task is scored on four weighted dimensions:
- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints
When no reference solution exists, scoring falls back to JSON structure validation.
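As an illustration of how the weights combine, a minimal sketch follows. The dimension names and the Jaccard step come from the list above; treating the final reward as a plain weighted sum is an assumption about the aggregation:
```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap, as used for the recommended-card comparison."""
    return len(a & b) / len(a | b) if a | b else 1.0

def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum over the four rubric dimensions (assumed aggregation)."""
    weights = {
        "ev_accuracy": 0.40,                 # automated
        "constraint_compliance": 0.30,       # automated (Jaccard + housing match)
        "reasoning_quality": 0.20,           # human
        "constraint_prioritization": 0.10,   # human
    }
    return sum(weights[k] * scores[k] for k in weights)
```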
## Project Structure
```
src/lexenvs/
├── harbor_app.py # FastAPI entry point
├── config/ # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py # API endpoints
├── schemas/ # Request/response models
├── services/
│ ├── task_loader_service.py # Loads tasks from data/tasks/*.json
│ ├── rubric_evaluator_service.py # Multi-dimensional scoring
│ └── kb_filter_service.py # Per-task KB filtering
└── utils.py # JSON extraction utilities
data/
├── tasks/ # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md # Shared KB (~56K chars: issuers, cards, rules)
scripts/
├── local/
│ ├── smoke_test.py # Server verification + LLM evaluation
│ ├── run_baseline_eval.py # Baseline model evaluation (Anthropic API)
│ ├── grpo_train.py # GRPO training (local/Ollama)
│ └── grpo_common.py # Shared GRPO types and helpers
└── modal/
├── harbor.py # Deploy harbor server
└── grpo_train.py # GRPO training on A100
results/ # Evaluation results (gitignored)
```
## Development
```bash
uv run pytest -v # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v # Integration tests
uv run ruff check . # Lint
uv run ruff format . # Format
uv run mypy src/lexenvs # Type check
```
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.
## Acknowledgments
Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.
## License
This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.