---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---

# LexEnvs — Harbor RL Environment

A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.

## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```

### Docker

```bash
docker compose up -d    # Start server on port 8000
docker compose down     # Stop
```

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
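As a rough sketch of the evaluation round-trip (the request body field `answer` and the response field `reward` are assumptions here; check the actual request/response models in `src/lexenvs/schemas/`):

```python
import json
import urllib.request


def evaluate_answer(base_url: str, task_id: str, answer: dict) -> dict:
    """POST an agent's answer to the harbor server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/tasks/{task_id}/evaluate",
        data=json.dumps({"answer": answer}).encode("utf-8"),  # "answer" key is assumed
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extract_reward(response: dict) -> float:
    """Pull the scalar reward out of an evaluation response (assumed shape)."""
    reward = float(response["reward"])
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward outside [0, 1]: {reward}")
    return reward
```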

## Modal Deployment

The full RL environment runs on [Modal](https://modal.com) — the harbor server as a web endpoint and GRPO training on A100 GPUs.

### 1. Setup

```bash
# Install modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```

### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```

### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --model Qwen/Qwen2.5-14B-Instruct \
    --num-samples 8 \
    --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.

Results are saved locally to `results/grpo-modal/`.

### Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |

## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
    --model qwen2.5:7b \
    -N 4 \
    --iterations 3 \
    --tasks easy_01 easy_02
```

## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
    --model claude-haiku-4-5-20251001
```

### Baseline Results

| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |

## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations

Temperature decays linearly across iterations (exploration to exploitation).
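The advantage formula and the temperature schedule can be sketched as follows (a minimal illustration, not the project's actual implementation; it uses population standard deviation, which may differ from the training script):

```python
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the real code may use sample std
    if std == 0.0:  # all completions scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def temperature(iteration: int, total_iters: int, start: float = 0.8, end: float = 0.4) -> float:
    """Linear decay from start (exploration) to end (exploitation)."""
    if total_iters <= 1:
        return end
    frac = iteration / (total_iters - 1)
    return start + (end - start) * frac
```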

Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
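Without gradient updates, the loop reduces to picking the highest-reward completions per group; a sketch (function name is my own):

```python
def best_of_n(completions: list[str], rewards: list[float], keep: int = 1) -> list[str]:
    """Return the top-`keep` completions by reward, e.g. to reuse as few-shot examples."""
    ranked = sorted(range(len(completions)), key=lambda i: rewards[i], reverse=True)
    return [completions[i] for i in ranked[:keep]]
```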

### Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
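The weighted combination, and the Jaccard term used for constraint compliance, can be illustrated like this (weights come from the list above; the function and key names are my own, not the evaluator's API):

```python
def jaccard(recommended: set[str], reference: set[str]) -> float:
    """Jaccard similarity between recommended and reference card sets."""
    if not recommended and not reference:
        return 1.0
    return len(recommended & reference) / len(recommended | reference)


# Weights as documented in the rubric: 40/30/20/10.
RUBRIC_WEIGHTS = {
    "ev_accuracy": 0.40,
    "constraint_compliance": 0.30,
    "reasoning_quality": 0.20,
    "constraint_prioritization": 0.10,
}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(RUBRIC_WEIGHTS[dim] * scores.get(dim, 0.0) for dim in RUBRIC_WEIGHTS)
```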

## Project Structure

```
src/lexenvs/
├── harbor_app.py          # FastAPI entry point
├── config/                # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py       # API endpoints
├── schemas/               # Request/response models
├── services/
│   ├── task_loader_service.py      # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py # Multi-dimensional scoring
│   └── kb_filter_service.py        # Per-task KB filtering
└── utils.py               # JSON extraction utilities

data/
├── tasks/                 # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md      # Shared KB (~56K chars: issuers, cards, rules)

scripts/
├── local/
│   ├── smoke_test.py          # Server verification + LLM evaluation
│   ├── run_baseline_eval.py   # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py          # GRPO training (local/Ollama)
│   └── grpo_common.py         # Shared GRPO types and helpers
└── modal/
    ├── harbor.py              # Deploy harbor server
    └── grpo_train.py          # GRPO training on A100

results/                   # Evaluation results (gitignored)
```

## Development

```bash
uv run pytest -v                                  # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v  # Integration tests
uv run ruff check .                               # Lint
uv run ruff format .                              # Format
uv run mypy src/lexenvs                           # Type check
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.

## Acknowledgments

Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.

## License

This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.