---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---

# LexEnvs — Harbor RL Environment

A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.

## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```

## Docker

```bash
docker compose up -d    # Start server on port 8000
docker compose down     # Stop
```

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health/live` | Liveness probe |
| GET | `/api/tasks` | List all tasks (id, domain, difficulty) |
| GET | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| POST | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
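The evaluate exchange can be sketched with plain dataclasses. The field names below are illustrative assumptions; the authoritative request/response models live in `src/lexenvs/schemas/`:

```python
from dataclasses import dataclass, field

@dataclass
class EvaluateRequest:
    # The agent's structured JSON recommendation, serialized as a string
    answer: str

@dataclass
class EvaluateResponse:
    # Overall score in [0, 1]
    reward: float
    # Hypothetical per-dimension breakdown (names are assumptions)
    breakdown: dict = field(default_factory=dict)

resp = EvaluateResponse(reward=0.42, breakdown={"ev_accuracy": 0.5})
```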

## Modal Deployment

The full RL environment runs on Modal — the harbor server as a web endpoint and GRPO training on A100 GPUs.

### 1. Setup

```bash
# Install modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```

### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```

### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --model Qwen/Qwen2.5-14B-Instruct \
    --num-samples 8 \
    --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.

Results are saved locally to `results/grpo-modal/`.

### Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | 4 | Completions per task per iteration |
| `--iterations` | 3 | Number of GRPO iterations |
| `--temp-start` | 0.8 | Starting temperature (exploration) |
| `--temp-end` | 0.4 | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | false | Disable few-shot example injection |
| `--max-model-len` | 32768 | vLLM context window size |

## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
    --model qwen2.5:7b \
    -N 4 \
    --iterations 3 \
    --tasks easy_01 easy_02
```

## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
    --model claude-haiku-4-5-20251001
```

### Baseline Results

| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| Qwen 32B | 0.246 | GRPO best-of-4 |

## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations
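Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration, not the project's implementation (which lives under `scripts/`):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: (reward - group_mean) / group_std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # zero-variance group carries no signal
    return [(r - mean) / std for r in rewards]

def best_of_n(completions: list[str], rewards: list[float]) -> str:
    """Best-of-N selection: keep the highest-reward completion."""
    return max(zip(completions, rewards), key=lambda pair: pair[1])[0]

advs = group_advantages([0.1, 0.3, 0.2, 0.4])  # mean 0.25; advantages sum to 0
best = best_of_n(["a", "b", "c", "d"], [0.1, 0.3, 0.2, 0.4])  # "d"
```

Advantages are centered within each group, so a completion is rewarded only for beating its siblings on the same task, not for the task being easy.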

Temperature decays linearly across iterations (exploration to exploitation).
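The linear schedule amounts to the following sketch (parameter names mirror the `--temp-start`/`--temp-end` flags; the exact interpolation endpoints are an assumption):

```python
def temperature(iteration: int, iterations: int,
                temp_start: float = 0.8, temp_end: float = 0.4) -> float:
    """Linearly interpolate from temp_start (iteration 0) to temp_end (last)."""
    if iterations <= 1:
        return temp_end
    frac = iteration / (iterations - 1)
    return temp_start + (temp_end - temp_start) * frac

schedule = [round(temperature(i, 3), 2) for i in range(3)]  # [0.8, 0.6, 0.4]
```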

Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's GRPOTrainer for actual weight updates with LoRA.

## Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
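A minimal sketch of the combination step, assuming each dimension is normalized to [0, 1] before weighting (dimension keys are illustrative; the real scorer is `rubric_evaluator_service.py`):

```python
def jaccard(recommended: set[str], reference: set[str]) -> float:
    """Jaccard similarity between recommended and reference card sets."""
    union = recommended | reference
    return len(recommended & reference) / len(union) if union else 1.0

# Weights from the rubric above
WEIGHTS = {
    "ev_accuracy": 0.40,
    "constraint_compliance": 0.30,
    "reasoning_quality": 0.20,
    "constraint_prioritization": 0.10,
}

def combine(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())

reward = combine({
    "ev_accuracy": 0.9,
    # hypothetical card IDs; overlap of 1 card out of 2 -> 0.5
    "constraint_compliance": jaccard({"amex_gold", "csp"}, {"amex_gold"}),
    "reasoning_quality": 0.5,
    "constraint_prioritization": 1.0,
})  # 0.4*0.9 + 0.3*0.5 + 0.2*0.5 + 0.1*1.0 = 0.71
```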

## Project Structure

```text
src/lexenvs/
├── harbor_app.py          # FastAPI entry point
├── config/                # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py       # API endpoints
├── schemas/               # Request/response models
├── services/
│   ├── task_loader_service.py      # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py # Multi-dimensional scoring
│   └── kb_filter_service.py        # Per-task KB filtering
└── utils.py               # JSON extraction utilities

data/
├── tasks/                 # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md      # Shared KB (~56K chars: issuers, cards, rules)

scripts/
├── local/
│   ├── smoke_test.py          # Server verification + LLM evaluation
│   ├── run_baseline_eval.py   # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py          # GRPO training (local/Ollama)
│   └── grpo_common.py         # Shared GRPO types and helpers
└── modal/
    ├── harbor.py              # Deploy harbor server
    └── grpo_train.py          # GRPO training on A100

results/                   # Evaluation results (gitignored)
```

## Development

```bash
uv run pytest -v                                  # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v  # Integration tests
uv run ruff check .                               # Lint
uv run ruff format .                              # Format
uv run mypy src/lexenvs                           # Type check
```

## Contributing

See `CONTRIBUTING.md` for development setup, coding standards, and how to add new tasks.

## Acknowledgments

Credit card point valuations and transfer partner data in data/knowledge_base.md reference publicly available information from The Points Guy and issuer websites.

## License

This project is licensed under the Apache License 2.0 — see `LICENSE` for details.