---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---

# LexEnvs — Harbor RL Environment

A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.

## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```

## Docker

```bash
docker compose up -d    # Start server on port 8000
docker compose down     # Stop
```

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health/live` | Liveness probe |
| GET | `/api/tasks` | List all tasks (id, domain, difficulty) |
| GET | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| POST | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
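The evaluate exchange can be sketched with plain dataclasses. The field names below are illustrative assumptions; the authoritative request/response models live in `src/lexenvs/schemas/`:

```python
from dataclasses import dataclass, field

@dataclass
class EvaluateRequest:
    # The agent's structured JSON recommendation, serialized as a string
    answer: str

@dataclass
class EvaluateResponse:
    # Overall score in [0, 1]
    reward: float
    # Hypothetical per-dimension breakdown (names are assumptions)
    breakdown: dict = field(default_factory=dict)

resp = EvaluateResponse(reward=0.42, breakdown={"ev_accuracy": 0.5})
```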

## Modal Deployment

The full RL environment runs on Modal — the harbor server as a web endpoint and GRPO training on A100 GPUs.

### 1. Setup

```bash
# Install modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```

### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```

### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --model Qwen/Qwen2.5-14B-Instruct \
    --num-samples 8 \
    --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.

Results are saved locally to `results/grpo-modal/`.

### Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | 4 | Completions per task per iteration |
| `--iterations` | 3 | Number of GRPO iterations |
| `--temp-start` | 0.8 | Starting temperature (exploration) |
| `--temp-end` | 0.4 | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | false | Disable few-shot example injection |
| `--max-model-len` | 32768 | vLLM context window size |

## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
    --model qwen2.5:7b \
    -N 4 \
    --iterations 3 \
    --tasks easy_01 easy_02
```

## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
    --model claude-haiku-4-5-20251001
```

### Baseline Results

| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| Qwen 32B | 0.246 | GRPO best-of-4 |

## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations
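Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration, not the project's implementation (which lives under `scripts/`):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: (reward - group_mean) / group_std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # zero-variance group carries no signal
    return [(r - mean) / std for r in rewards]

def best_of_n(completions: list[str], rewards: list[float]) -> str:
    """Best-of-N selection: keep the highest-reward completion."""
    return max(zip(completions, rewards), key=lambda pair: pair[1])[0]

advs = group_advantages([0.1, 0.3, 0.2, 0.4])  # mean 0.25; advantages sum to 0
best = best_of_n(["a", "b", "c", "d"], [0.1, 0.3, 0.2, 0.4])  # "d"
```

Advantages are centered within each group, so a completion is rewarded only for beating its siblings on the same task, not for the task being easy.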

Temperature decays linearly across iterations (exploration to exploitation).
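The linear schedule amounts to the following sketch (parameter names mirror the `--temp-start`/`--temp-end` flags; the exact interpolation endpoints are an assumption):

```python
def temperature(iteration: int, iterations: int,
                temp_start: float = 0.8, temp_end: float = 0.4) -> float:
    """Linearly interpolate from temp_start (iteration 0) to temp_end (last)."""
    if iterations <= 1:
        return temp_end
    frac = iteration / (iterations - 1)
    return temp_start + (temp_end - temp_start) * frac

schedule = [round(temperature(i, 3), 2) for i in range(3)]  # [0.8, 0.6, 0.4]
```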

Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's GRPOTrainer for actual weight updates with LoRA.

## Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
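A minimal sketch of the combination step, assuming each dimension is normalized to [0, 1] before weighting (dimension keys are illustrative; the real scorer is `rubric_evaluator_service.py`):

```python
def jaccard(recommended: set[str], reference: set[str]) -> float:
    """Jaccard similarity between recommended and reference card sets."""
    union = recommended | reference
    return len(recommended & reference) / len(union) if union else 1.0

# Weights from the rubric above
WEIGHTS = {
    "ev_accuracy": 0.40,
    "constraint_compliance": 0.30,
    "reasoning_quality": 0.20,
    "constraint_prioritization": 0.10,
}

def combine(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())

reward = combine({
    "ev_accuracy": 0.9,
    # hypothetical card IDs; overlap of 1 card out of 2 -> 0.5
    "constraint_compliance": jaccard({"amex_gold", "csp"}, {"amex_gold"}),
    "reasoning_quality": 0.5,
    "constraint_prioritization": 1.0,
})  # 0.4*0.9 + 0.3*0.5 + 0.2*0.5 + 0.1*1.0 = 0.71
```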

## Project Structure

```text
src/lexenvs/
├── harbor_app.py          # FastAPI entry point
├── config/                # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py       # API endpoints
├── schemas/               # Request/response models
├── services/
│   ├── task_loader_service.py      # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py # Multi-dimensional scoring
│   └── kb_filter_service.py        # Per-task KB filtering
└── utils.py               # JSON extraction utilities

data/
├── tasks/                 # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md      # Shared KB (~56K chars: issuers, cards, rules)

scripts/
├── local/
│   ├── smoke_test.py          # Server verification + LLM evaluation
│   ├── run_baseline_eval.py   # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py          # GRPO training (local/Ollama)
│   └── grpo_common.py         # Shared GRPO types and helpers
└── modal/
    ├── harbor.py              # Deploy harbor server
    └── grpo_train.py          # GRPO training on A100

results/                   # Evaluation results (gitignored)
```

## Development

```bash
uv run pytest -v                                  # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v  # Integration tests
uv run ruff check .                               # Lint
uv run ruff format .                              # Format
uv run mypy src/lexenvs                           # Type check
```

## Contributing

See `CONTRIBUTING.md` for development setup, coding standards, and how to add new tasks.

## Acknowledgments

Credit card point valuations and transfer partner data in data/knowledge_base.md reference publicly available information from The Points Guy and issuer websites.

## License

This project is licensed under the Apache License 2.0 — see `LICENSE` for details.