---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---

# LexEnvs — Harbor RL Environment

A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.

## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```

### Docker

```bash
docker compose up -d    # Start server on port 8000
docker compose down     # Stop
```

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |

## Modal Deployment

The full RL environment runs on [Modal](https://modal.com) — the harbor server as a web endpoint and GRPO training on A100 GPUs.

### 1. Setup

```bash
# Install modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```

### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```
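The same check can be done programmatically. Below is a minimal Python sketch of a client for the evaluate endpoint; note that the `{"answer": ...}` request envelope is an assumption here — the actual request model lives in `src/lexenvs/schemas/`.

```python
import json
from urllib import request


def build_eval_request(base_url: str, task_id: str, answer: dict) -> tuple[str, bytes]:
    """Build the URL and JSON body for POST /api/tasks/{task_id}/evaluate.

    The {"answer": ...} envelope is an assumption; check src/lexenvs/schemas/
    for the real request model.
    """
    url = f"{base_url}/api/tasks/{task_id}/evaluate"
    return url, json.dumps({"answer": answer}).encode("utf-8")


def evaluate(base_url: str, task_id: str, answer: dict) -> dict:
    """POST an answer to a running harbor server and return the parsed reply."""
    url, body = build_eval_request(base_url, task_id, answer)
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Pointed at the deployed endpoint (or `http://localhost:8000` for a local server), `evaluate(...)` should return a body containing the reward in [0, 1].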
### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
  --model Qwen/Qwen2.5-14B-Instruct \
  --num-samples 8 \
  --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
  --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately. Results are saved locally to `results/grpo-modal/`.

### Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |

## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
  --model qwen2.5:7b \
  -N 4 \
  --iterations 3 \
  --tasks easy_01 easy_02
```

## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... \
uv run python scripts/local/run_baseline_eval.py \
  --model claude-haiku-4-5-20251001
```
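A single-pass baseline like this reduces to one completion per task, scored and averaged. The sketch below shows that aggregation; `ask` and `score` are hypothetical stand-ins for the Anthropic API call and the harbor `/evaluate` request.

```python
def mean_reward(task_ids, ask, score) -> float:
    """Single-pass baseline: one completion per task, mean reward over all tasks.

    `ask(task_id)` and `score(task_id, answer)` are hypothetical stand-ins for
    the model call and the harbor POST /api/tasks/{task_id}/evaluate request.
    """
    rewards = [score(t, ask(t)) for t in task_ids]
    return sum(rewards) / len(rewards)
```

This average over all 15 tasks is what a baseline run reports per model.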
### Baseline Results

| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |

## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations

Temperature decays linearly across iterations (exploration to exploitation).

Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.

### Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
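The loop and rubric described in "How It Works" can be sketched end to end. This is a simplified sketch, not the project's implementation: `generate` and `score` are hypothetical stand-ins for the vLLM batched sampling call and the harbor evaluation request, and the use of population standard deviation is an assumption.

```python
import statistics


def temperature(i: int, iterations: int, start: float = 0.8, end: float = 0.4) -> float:
    """Linear decay from --temp-start (exploration) to --temp-end (exploitation)."""
    if iterations <= 1:
        return start
    return start + (end - start) * i / (iterations - 1)


def group_advantages(rewards: list[float]) -> list[float]:
    """advantage = (reward - group_mean) / group_std, per completion.

    Population std is an assumption; all-equal rewards map to zero advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [0.0] * len(rewards) if std == 0 else [(r - mean) / std for r in rewards]


def rubric_reward(ev: float, constraints: float, reasoning: float, prioritization: float) -> float:
    """Weighted rubric combination; each dimension assumed scored in [0, 1]."""
    return 0.40 * ev + 0.30 * constraints + 0.20 * reasoning + 0.10 * prioritization


def best_of_n(generate, score, n: int, iterations: int) -> tuple[str, float]:
    """Best-of-N selection without gradient updates.

    `generate(temp)` and `score(completion)` are hypothetical stand-ins for the
    vLLM sampling call and the harbor evaluation request.
    """
    best, best_reward = "", float("-inf")
    for i in range(iterations):
        temp = temperature(i, iterations)
        completions = [generate(temp) for _ in range(n)]
        rewards = [score(c) for c in completions]
        advantages = group_advantages(rewards)
        top = max(range(n), key=advantages.__getitem__)  # argmax advantage == argmax reward
        if rewards[top] > best_reward:
            best, best_reward = completions[top], rewards[top]
        # The selected completion could seed few-shot examples for the next iteration.
    return best, best_reward
```

Because the group-relative normalization is monotone in reward, argmax over advantages equals argmax over raw rewards; the advantages matter once gradient updates (e.g. TRL's `GRPOTrainer`) weight each completion's loss.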
## Project Structure

```
src/lexenvs/
├── harbor_app.py                      # FastAPI entry point
├── config/                            # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py                   # API endpoints
├── schemas/                           # Request/response models
├── services/
│   ├── task_loader_service.py         # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py    # Multi-dimensional scoring
│   └── kb_filter_service.py           # Per-task KB filtering
└── utils.py                           # JSON extraction utilities

data/
├── tasks/                             # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md                  # Shared KB (~56K chars: issuers, cards, rules)

scripts/
├── local/
│   ├── smoke_test.py                  # Server verification + LLM evaluation
│   ├── run_baseline_eval.py           # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py                  # GRPO training (local/Ollama)
│   └── grpo_common.py                 # Shared GRPO types and helpers
└── modal/
    ├── harbor.py                      # Deploy harbor server
    └── grpo_train.py                  # GRPO training on A100

results/                               # Evaluation results (gitignored)
```

## Development

```bash
uv run pytest -v                                                 # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v  # Integration tests
uv run ruff check .                                              # Lint
uv run ruff format .                                             # Format
uv run mypy src/lexenvs                                          # Type check
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.

## Acknowledgments

Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.

## License

This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.