---
title: LexEnvs — Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - reinforcement-learning
  - evaluation
  - credit-cards
  - grpo
  - rl-environment
---

# LexEnvs — Harbor RL Environment

A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.

## Quick Start

```bash
# Install dependencies
uv sync

# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload

# Run tests
uv run pytest -v

# Lint & format
uv run ruff check .
uv run ruff format .
```

### Docker

```bash
docker compose up -d    # Start server on port 8000
docker compose down     # Stop
```

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
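As a rough sketch of the evaluation round-trip (the request body field `answer` and the response field `reward` are assumptions here; check the actual request/response models in `src/lexenvs/schemas/`):

```python
import json
import urllib.request


def evaluate_answer(base_url: str, task_id: str, answer: dict) -> dict:
    """POST an agent's answer to the harbor server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/tasks/{task_id}/evaluate",
        data=json.dumps({"answer": answer}).encode("utf-8"),  # "answer" key is assumed
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extract_reward(response: dict) -> float:
    """Pull the scalar reward out of an evaluation response (assumed shape)."""
    reward = float(response["reward"])
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward outside [0, 1]: {reward}")
    return reward
```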

## Modal Deployment

The full RL environment runs on [Modal](https://modal.com) — the harbor server as a web endpoint and GRPO training on A100 GPUs.

### 1. Setup

```bash
# Install modal (included in project deps)
uv sync

# Authenticate with Modal
uv run modal setup
```

### 2. Deploy the Harbor Server

```bash
uv run modal deploy scripts/modal/harbor.py
```

This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:

```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```

Verify it's working:

```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}

curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```

### 3. Run GRPO Training

```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run

# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --model Qwen/Qwen2.5-14B-Instruct \
    --num-samples 8 \
    --iterations 5

# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
    --harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
    --tasks easy_01,easy_02,medium_01
```

The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.

Results are saved locally to `results/grpo-modal/`.

### Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |

## Local Training (Ollama)

For quick iteration without cloud GPUs:

```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &

# Pull a model
ollama pull qwen2.5:7b

# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b

# Customize
uv run python scripts/local/grpo_train.py \
    --model qwen2.5:7b \
    -N 4 \
    --iterations 3 \
    --tasks easy_01 easy_02
```

## Baseline Evaluation

Run Claude models against all tasks to establish baselines:

```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
    --model claude-haiku-4-5-20251001
```

### Baseline Results

| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |

## How It Works

### GRPO (Group Relative Policy Optimization)

For each task, the training loop:

1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations

Temperature decays linearly across iterations (exploration to exploitation).
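The advantage formula and the temperature schedule can be sketched as follows (a minimal illustration, not the project's actual implementation; it uses population standard deviation, which may differ from the training script):

```python
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the real code may use sample std
    if std == 0.0:  # all completions scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def temperature(iteration: int, total_iters: int, start: float = 0.8, end: float = 0.4) -> float:
    """Linear decay from start (exploration) to end (exploitation)."""
    if total_iters <= 1:
        return end
    frac = iteration / (total_iters - 1)
    return start + (end - start) * frac
```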

Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
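Without gradient updates, the loop reduces to picking the highest-reward completions per group; a sketch (function name is my own):

```python
def best_of_n(completions: list[str], rewards: list[float], keep: int = 1) -> list[str]:
    """Return the top-`keep` completions by reward, e.g. to reuse as few-shot examples."""
    ranked = sorted(range(len(completions)), key=lambda i: rewards[i], reverse=True)
    return [completions[i] for i in ranked[:keep]]
```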

### Scoring Rubric

Each task is scored on four weighted dimensions:

- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints

When no reference solution exists, scoring falls back to JSON structure validation.
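The weighted combination, and the Jaccard term used for constraint compliance, can be illustrated like this (weights come from the list above; the function and key names are my own, not the evaluator's API):

```python
def jaccard(recommended: set[str], reference: set[str]) -> float:
    """Jaccard similarity between recommended and reference card sets."""
    if not recommended and not reference:
        return 1.0
    return len(recommended & reference) / len(recommended | reference)


# Weights as documented in the rubric: 40/30/20/10.
RUBRIC_WEIGHTS = {
    "ev_accuracy": 0.40,
    "constraint_compliance": 0.30,
    "reasoning_quality": 0.20,
    "constraint_prioritization": 0.10,
}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(RUBRIC_WEIGHTS[dim] * scores.get(dim, 0.0) for dim in RUBRIC_WEIGHTS)
```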

## Project Structure

```
src/lexenvs/
├── harbor_app.py          # FastAPI entry point
├── config/                # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py       # API endpoints
├── schemas/               # Request/response models
├── services/
│   ├── task_loader_service.py      # Loads tasks from data/tasks/*.json
│   ├── rubric_evaluator_service.py # Multi-dimensional scoring
│   └── kb_filter_service.py        # Per-task KB filtering
└── utils.py               # JSON extraction utilities

data/
├── tasks/                 # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md      # Shared KB (~56K chars: issuers, cards, rules)

scripts/
├── local/
│   ├── smoke_test.py          # Server verification + LLM evaluation
│   ├── run_baseline_eval.py   # Baseline model evaluation (Anthropic API)
│   ├── grpo_train.py          # GRPO training (local/Ollama)
│   └── grpo_common.py         # Shared GRPO types and helpers
└── modal/
    ├── harbor.py              # Deploy harbor server
    └── grpo_train.py          # GRPO training on A100

results/                   # Evaluation results (gitignored)
```

## Development

```bash
uv run pytest -v                                  # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v  # Integration tests
uv run ruff check .                               # Lint
uv run ruff format .                              # Format
uv run mypy src/lexenvs                           # Type check
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.

## Acknowledgments

Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.

## License

This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.