---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---
# LexEnvs — Harbor RL Environment

A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and are scored against a multi-dimensional rubric.
## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```
### Docker

```bash
docker compose up -d   # Start server on port 8000
docker compose down    # Stop
```
## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
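As a rough sketch of calling the evaluate endpoint from Python (the `{"answer": ...}` body shape is an assumption here, not confirmed by this README; check the `schemas/` package for the real request model):

```python
import json
from urllib.request import Request, urlopen


def evaluate_url(base_url: str, task_id: str) -> str:
    """Build the evaluate endpoint URL for a given task."""
    return f"{base_url}/api/tasks/{task_id}/evaluate"


def evaluate(base_url: str, task_id: str, answer: dict) -> dict:
    """POST a structured answer and return the server's score payload.

    NOTE: the {"answer": ...} request body is a hypothetical example.
    """
    req = Request(
        evaluate_url(base_url, task_id),
        data=json.dumps({"answer": answer}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)
```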
## Modal Deployment

The full RL environment runs on [Modal](https://modal.com) — the harbor server as a web endpoint and GRPO training on A100 GPUs.
### 1. Setup

```bash
# Install modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```
### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```
### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
  --model Qwen/Qwen2.5-14B-Instruct \
  --num-samples 8 \
  --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
  --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
  --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.

Results are saved locally to `results/grpo-modal/`.
### Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |
## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
  --model qwen2.5:7b \
  -N 4 \
  --iterations 3 \
  --tasks easy_01 easy_02
```
## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
  --model claude-haiku-4-5-20251001
```
### Baseline Results

| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |
## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations

Temperature decays linearly across iterations (exploration to exploitation).

Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
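The advantage formula and temperature schedule above can be sketched in plain Python. These function names are illustrative, not the actual helpers in `grpo_common.py`:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # degenerate group (all rewards equal): avoid /0
    return [(r - mean) / std for r in rewards]


def temperature(iteration: int, total: int, start: float = 0.8, end: float = 0.4) -> float:
    """Linear decay from exploration (start) to exploitation (end)."""
    if total <= 1:
        return start
    return start + (end - start) * iteration / (total - 1)
```

A completion with reward above the group mean gets a positive advantage; all-equal groups get zero advantage everywhere, so no completion is preferred.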
### Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
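A minimal sketch of how the weighted dimensions might combine into a single reward. The dimension keys are assumptions for illustration; see `rubric_evaluator_service.py` for the real implementation:

```python
# Assumed dimension keys; weights taken from the rubric above.
WEIGHTS = {
    "ev_accuracy": 0.40,
    "constraint_compliance": 0.30,
    "reasoning_quality": 0.20,
    "constraint_prioritization": 0.10,
}


def jaccard(recommended: set[str], reference: set[str]) -> float:
    """Set overlap, as used for the constraint-compliance dimension."""
    union = recommended | reference
    return len(recommended & reference) / len(union) if union else 1.0


def combine(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)
```

Since the weights sum to 1 and each dimension score lies in [0, 1], the combined reward stays in [0, 1], matching the evaluate endpoint's contract.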
## Project Structure

```
src/lexenvs/
├── harbor_app.py                    # FastAPI entry point
├── config/                          # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py                 # API endpoints
├── schemas/                         # Request/response models
├── services/
│   ├── task_loader_service.py       # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py  # Multi-dimensional scoring
│   └── kb_filter_service.py         # Per-task KB filtering
└── utils.py                         # JSON extraction utilities
data/
├── tasks/                           # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md                # Shared KB (~56K chars: issuers, cards, rules)
scripts/
├── local/
│   ├── smoke_test.py                # Server verification + LLM evaluation
│   ├── run_baseline_eval.py         # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py                # GRPO training (local/Ollama)
│   └── grpo_common.py               # Shared GRPO types and helpers
└── modal/
    ├── harbor.py                    # Deploy harbor server
    └── grpo_train.py                # GRPO training on A100
results/                             # Evaluation results (gitignored)
```
## Development

```bash
uv run pytest -v                                                  # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v   # Integration tests
uv run ruff check .                                               # Lint
uv run ruff format .                                              # Format
uv run mypy src/lexenvs                                           # Type check
```
## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.

## Acknowledgments

Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.

## License

This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.