---
title: LexEnvs Harbor RL Environment
emoji: 🃏
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
- reinforcement-learning
- evaluation
- credit-cards
- grpo
- rl-environment
---
# LexEnvs — Harbor RL Environment
A stateless FastAPI evaluation server for credit card optimization tasks. Agents receive task prompts (system instructions, a knowledge base, and user constraints), produce structured JSON recommendations, and get scored on a multi-dimensional rubric.
## Quick Start
```bash
# Install dependencies
uv sync
# Run the server
uv run uvicorn lexenvs.harbor_app:app --reload
# Run tests
uv run pytest -v
# Lint & format
uv run ruff check .
uv run ruff format .
```
### Docker
```bash
docker compose up -d # Start server on port 8000
docker compose down # Stop
```
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health/live` | Liveness probe |
| `GET` | `/api/tasks` | List all tasks (id, domain, difficulty) |
| `GET` | `/api/tasks/{task_id}` | Get task prompt (system, context, user) |
| `POST` | `/api/tasks/{task_id}/evaluate` | Score an agent's answer; returns reward in [0, 1] |
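For reference, a minimal Python client exercising these endpoints might look like the sketch below. The paths and the `tasks` response shape match the table and the `curl` examples further down; the fields in the answer payload (`recommended_cards`, `reasoning`) are illustrative placeholders, not the documented schema.
```python
# Hypothetical client sketch -- endpoint paths follow the table above,
# but the answer payload fields are placeholder assumptions.
import httpx

BASE_URL = "http://localhost:8000"

with httpx.Client(base_url=BASE_URL) as client:
    # List available tasks
    tasks = client.get("/api/tasks").json()["tasks"]
    task_id = tasks[0]["task_id"]

    # Fetch the task prompt (system instructions, KB context, user constraints)
    prompt = client.get(f"/api/tasks/{task_id}").json()

    # Submit an answer for scoring; the exact request schema is defined by
    # the task, so these field names are placeholders.
    answer = {
        "recommended_cards": ["example_card_a", "example_card_b"],
        "reasoning": "Placeholder analysis of expected value and constraints.",
    }
    result = client.post(f"/api/tasks/{task_id}/evaluate", json=answer).json()
    print(result)  # reward in [0, 1]
```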
## Modal Deployment
The full RL environment runs on [Modal](https://modal.com) — the harbor server as a web endpoint and GRPO training on A100 GPUs.
### 1. Setup
```bash
# Install modal (included in project deps)
uv sync
# Authenticate with Modal
uv run modal setup
```
### 2. Deploy the Harbor Server
```bash
uv run modal deploy scripts/modal/harbor.py
```
This deploys the FastAPI evaluation server as a persistent web endpoint. The deploy output prints the URL:
```
https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
```
Verify it's working:
```bash
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/health/live
# {"status":"ok"}
curl https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run/api/tasks
# {"tasks": [{"task_id": "easy_01", ...}, ...]}
```
### 3. Run GRPO Training
```bash
# Default: Qwen2.5-32B-Instruct on A100-80GB, 4 samples/task, 3 iterations
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run
# Customize model and parameters
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
--model Qwen/Qwen2.5-14B-Instruct \
--num-samples 8 \
--iterations 5
# Run only specific tasks
uv run modal run scripts/modal/grpo_train.py \
--harbor-url https://YOUR_USER--lexenvs-harbor-harbor-endpoint.modal.run \
--tasks easy_01,easy_02,medium_01
```
The first run downloads model weights (~60GB for 32B) to a persistent Modal volume. Subsequent runs start generating immediately.
Results are saved locally to `results/grpo-modal/`.
### Training Parameters
| Flag | Default | Description |
|------|---------|-------------|
| `--harbor-url` | (required) | Harbor server URL |
| `--model` | `Qwen/Qwen2.5-32B-Instruct` | HuggingFace model ID |
| `--num-samples` | `4` | Completions per task per iteration |
| `--iterations` | `3` | Number of GRPO iterations |
| `--temp-start` | `0.8` | Starting temperature (exploration) |
| `--temp-end` | `0.4` | Ending temperature (exploitation) |
| `--tasks` | all | Comma-separated task IDs to run |
| `--no-few-shot` | `false` | Disable few-shot example injection |
| `--max-model-len` | `32768` | vLLM context window size |
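The `--temp-start`/`--temp-end` pair defines the linear decay described under "How It Works". A rough sketch of one plausible schedule (the actual implementation lives in `scripts/modal/grpo_train.py`):
```python
def temperature(iteration: int, iterations: int,
                start: float = 0.8, end: float = 0.4) -> float:
    """Linearly interpolate from start (exploration) to end (exploitation)."""
    if iterations <= 1:
        return start
    return start + (end - start) * iteration / (iterations - 1)
```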
## Local Training (Ollama)
For quick iteration without cloud GPUs:
```bash
# Start the harbor server locally
uv run uvicorn lexenvs.harbor_app:app --port 8000 &
# Pull a model
ollama pull qwen2.5:7b
# Run GRPO training
uv run python scripts/local/grpo_train.py --model qwen2.5:7b
# Customize
uv run python scripts/local/grpo_train.py \
--model qwen2.5:7b \
-N 4 \
--iterations 3 \
--tasks easy_01 easy_02
```
## Baseline Evaluation
Run Claude models against all tasks to establish baselines:
```bash
ANTHROPIC_API_KEY=sk-... uv run python scripts/local/run_baseline_eval.py \
--model claude-haiku-4-5-20251001
```
### Baseline Results
| Model | Avg Reward | Method |
|-------|-----------|--------|
| Qwen 7B | 0.010 | Single pass |
| Haiku 4.5 | 0.091 | Single pass |
| Sonnet 4 | 0.176 | Single pass |
| Opus 4 | 0.182 | Single pass |
| Qwen 32B | 0.090 | Single pass (mean) |
| **Qwen 32B** | **0.246** | **GRPO best-of-4** |
## How It Works
### GRPO (Group Relative Policy Optimization)
For each task, the training loop:
1. **Generates N completions** — vLLM produces N candidate answers in a single batched call
2. **Scores each completion** — the harbor server evaluates against the rubric
3. **Computes group-relative advantages** — `advantage = (reward - group_mean) / group_std`
4. **Selects the best** — top completions are retained and optionally used as few-shot examples for subsequent iterations
Temperature decays linearly across iterations (exploration to exploitation).
Currently this is best-of-N selection without gradient updates. The architecture supports upgrading to TRL's `GRPOTrainer` for actual weight updates with LoRA.
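A stripped-down sketch of one iteration, following the four steps and the advantage formula above. Here `generate` and `score` are stand-ins for the vLLM batched call and the harbor `/evaluate` request, not the project's real function names:
```python
import statistics

def grpo_iteration(prompt: str, generate, score, n: int = 4):
    """One best-of-N step: sample, score, rank by group-relative advantage.

    `generate(prompt, n)` and `score(completion)` are assumed callables
    standing in for vLLM generation and the harbor evaluation endpoint.
    """
    completions = generate(prompt, n)                 # N candidate answers
    rewards = [score(c) for c in completions]         # rubric scores in [0, 1]

    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0           # guard against zero std
    advantages = [(r - mean) / std for r in rewards]  # group-relative advantage

    # Best-of-N selection: keep the top completion, optionally reused as a
    # few-shot example in the next iteration (no gradient update here).
    best = max(zip(advantages, completions), key=lambda pair: pair[0])
    return best[1], advantages
```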
### Scoring Rubric
Each task is scored on four weighted dimensions:
- **EV Accuracy** (40%, automated) — how close the agent's expected value calculation is to the reference
- **Constraint Compliance** (30%, automated) — Jaccard similarity of recommended cards + housing option match
- **Reasoning Quality** (20%, human) — quality of analysis and tradeoff discussion
- **Constraint Prioritization** (10%, human) — handling of ambiguous or conflicting constraints
When no reference solution exists, scoring falls back to JSON structure validation.
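As an illustration of how the weights combine, a minimal sketch follows. The dimension names and the Jaccard step come from the list above; treating the final reward as a plain weighted sum is an assumption about the aggregation:
```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap, as used for the recommended-card comparison."""
    return len(a & b) / len(a | b) if a | b else 1.0

def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum over the four rubric dimensions (assumed aggregation)."""
    weights = {
        "ev_accuracy": 0.40,                 # automated
        "constraint_compliance": 0.30,       # automated (Jaccard + housing match)
        "reasoning_quality": 0.20,           # human
        "constraint_prioritization": 0.10,   # human
    }
    return sum(weights[k] * scores[k] for k in weights)
```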
## Project Structure
```
src/lexenvs/
├── harbor_app.py # FastAPI entry point
├── config/ # Pydantic settings (LEX_ENVS_ env prefix)
├── routers/tasks.py # API endpoints
├── schemas/ # Request/response models
├── services/
│ ├── task_loader_service.py # Loads tasks from data/tasks/*.json
│ ├── rubric_evaluator_service.py # Multi-dimensional scoring
│ └── kb_filter_service.py # Per-task KB filtering
└── utils.py # JSON extraction utilities
data/
├── tasks/ # 15 task definitions (easy/medium/hard x 5)
└── knowledge_base.md # Shared KB (~56K chars: issuers, cards, rules)
scripts/
├── local/
│ ├── smoke_test.py # Server verification + LLM evaluation
│ ├── run_baseline_eval.py # Baseline model evaluation (Anthropic API)
│ ├── grpo_train.py # GRPO training (local/Ollama)
│ └── grpo_common.py # Shared GRPO types and helpers
└── modal/
├── harbor.py # Deploy harbor server
└── grpo_train.py # GRPO training on A100
results/ # Evaluation results (gitignored)
```
## Development
```bash
uv run pytest -v # Unit tests
RUN_API_INTEGRATION_TESTS=1 uv run pytest tests/integration/ -v # Integration tests
uv run ruff check . # Lint
uv run ruff format . # Format
uv run mypy src/lexenvs # Type check
```
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and how to add new tasks.
## Acknowledgments
Credit card point valuations and transfer partner data in `data/knowledge_base.md` reference publicly available information from [The Points Guy](https://thepointsguy.com/) and issuer websites.
## License
This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.