# Hugging Face RL Jobs Notes
This file tracks the remote RL training attempts for the MolForge OpenEnv GRPO run.
## Jobs Tried
| Job | Hardware | Result | Notes |
| --- | --- | --- | --- |
| `69ed7260d70108f37acdf4b8` | `a100-large` | Canceled | Stayed in `SCHEDULING`, so we canceled it before it used GPU time. |
| `69ed73d3d70108f37acdf4e1` | `l40sx1` | Failed | Started but exited during Python import before model load or training. |
| `69ed74f6d70108f37acdf504` | `l40sx1` | **Failed** | `--with mergekit` caused an unsolvable pydantic conflict with `openenv-core`. |
| `69ed7be5d2c8bd8662bcef00` | `l40sx1` | Canceled | Incorrect CLI usage (missing image name). |
| `69ed9440d70108f37acdf83b` | `l40sx1` | Failed | `uv run` couldn't find the script path `issue/script.py`. |
| `69ed94add2c8bd8662bcf215` | `l40sx1` | Submitted | Fixed the script path to the bare filename and used an explicit `python` call. |
## Failure History
### Job 2 (`69ed73d3`) — `ModuleNotFoundError: No module named 'mergekit'`
TRL internally imports `mergekit` for GRPO model-merging callbacks even though we don't use merging. The first attempted fix was to add `--with mergekit`, which led to the Job 3 failure.
### Job 3 (`69ed74f6`) — **pydantic version conflict** (CURRENT)
Adding `--with mergekit` broke the resolver:
- `mergekit` (all versions) requires `pydantic < 2.11`
- `openenv-core==0.2.3` → `fastmcp>=3.0.0` → `pydantic >= 2.11.7`
**No version of pydantic satisfies both.** uv correctly refuses to resolve.
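The empty intersection can be checked by hand. The sketch below is only an illustration of the two bounds quoted above: the candidate list is a hand-picked sample rather than the full pydantic release history, and the tuple comparison is a simplification of what uv's resolver actually does.

```python
# Illustrative check that the two pins share no pydantic version.
# `candidates` is a hand-picked sample, not the full release history.

def parse(version: str) -> tuple:
    """Turn '2.11.7' into (2, 11, 7) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def allowed_by_mergekit(version: str) -> bool:
    # mergekit: pydantic < 2.11
    return parse(version) < parse("2.11")

def allowed_by_fastmcp(version: str) -> bool:
    # openenv-core -> fastmcp: pydantic >= 2.11.7
    return parse(version) >= parse("2.11.7")

candidates = ["2.10.6", "2.11.0", "2.11.7", "2.12.0"]
compatible = [
    v for v in candidates
    if allowed_by_mergekit(v) and allowed_by_fastmcp(v)
]
print(compatible)  # empty list: no sampled version passes both bounds
```

Because the upper bound (`< 2.11`) sits below the lower bound (`>= 2.11.7`), extending the candidate list cannot change the result.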
## Fix
**Do NOT pass `--with mergekit`** in the HF Jobs command. Instead, the script now installs mergekit at runtime with `--no-deps` before importing TRL:
```python
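import subprocess
import sys

try:
    import mergekit  # noqa: F401 (only needs to be importable for TRL)
except ImportError:
    # --no-deps skips mergekit's dependency pins, including pydantic < 2.11.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mergekit", "--no-deps", "-q"])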
```
This makes `mergekit` importable (satisfying TRL) without pulling in its conflicting pydantic constraint.
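Since `--no-deps` also skips every other dependency of `mergekit`, it may be worth logging what is actually present before importing TRL. The following is a minimal preflight sketch, not part of the training script; the helper names `is_importable`, `installed_version`, and `report` are hypothetical, and the check only confirms that a module can be located, not that importing it succeeds.

```python
import importlib.util
from importlib import metadata

def is_importable(module_name: str) -> bool:
    """True if a top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None

def installed_version(dist_name: str):
    """Installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def report(names):
    """Map each name to whether it is locatable and which version is installed."""
    return {
        name: {
            "importable": is_importable(name),
            "version": installed_version(name),
        }
        for name in names
    }

# Printing this early in the job log makes dependency problems easy to spot.
print(report(["mergekit", "pydantic", "trl"]))
```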
## Checkpoint and Artifact Persistence
The OpenEnv GRPO script saves the final trained adapter and tokenizer to:
```text
<run_dir>/adapters/
```
It also writes logs, metrics, plots, before/after evaluator JSON, and a zip archive under the run directory. When `HF_OUTPUT_REPO=Adhitya122/molforge-rl-runs` is set, the full run folder is uploaded to:
```text
hf://datasets/Adhitya122/molforge-rl-runs/<run_name>
```
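The script's own upload code is not reproduced in these notes. As a rough sketch of an equivalent upload, assuming the `huggingface_hub` package and a write token are available in the job, it could look like the following; the function names `destination_uri` and `upload_run` are hypothetical, and the run name is assumed to be the last component of the run directory.

```python
import os
from pathlib import Path
from typing import Optional

def destination_uri(output_repo: str, run_name: str) -> str:
    """Build the dataset location the run folder ends up at."""
    return f"hf://datasets/{output_repo}/{run_name}"

def upload_run(run_dir: str) -> Optional[str]:
    """Upload run_dir to the dataset repo named by HF_OUTPUT_REPO, if set."""
    output_repo = os.environ.get("HF_OUTPUT_REPO")
    if not output_repo:
        return None  # uploading is opt-in

    # Imported lazily so the module still loads without huggingface_hub.
    from huggingface_hub import HfApi

    run_name = Path(run_dir).name  # assumption: run name == folder name
    api = HfApi()
    api.create_repo(output_repo, repo_type="dataset", exist_ok=True)
    api.upload_folder(
        folder_path=run_dir,
        repo_id=output_repo,
        repo_type="dataset",
        path_in_repo=run_name,
    )
    return destination_uri(output_repo, run_name)
```

Returning the destination URI makes it easy to print the final location as the last line of the job log.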
## Safer Next Runs
Recommended environment flags for the next HF Jobs command (do NOT add `--with mergekit`):
```bash
--env RL_MAX_STEPS=20
--env RL_DATASET_SIZE=30
--env MAX_COMPLETION_LENGTH=1024
```
Use this as a smoke run first. Once it emits at least one trainer log line and uploads artifacts, scale back up to:
```bash
--env RL_MAX_STEPS=80
--env RL_DATASET_SIZE=120
--env MAX_COMPLETION_LENGTH=2048
```
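How the training script consumes these variables is not shown in these notes. One plausible pattern is sketched below, with the full-run values used as defaults purely as an assumption; `RunConfig`, `read_int`, and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    max_steps: int
    dataset_size: int
    max_completion_length: int

def read_int(env, name: str, default: int) -> int:
    """Read a positive integer from env, falling back to default if unset."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value

def load_config(env=os.environ) -> RunConfig:
    # Defaults mirror the full-run values above (an assumption, not the script's).
    return RunConfig(
        max_steps=read_int(env, "RL_MAX_STEPS", 80),
        dataset_size=read_int(env, "RL_DATASET_SIZE", 120),
        max_completion_length=read_int(env, "MAX_COMPLETION_LENGTH", 2048),
    )

print(load_config())
```

Failing fast on a malformed value is cheaper than discovering it after the job has been scheduled on a GPU.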
Good hardware choices:
| Hardware | Use |
| --- | --- |
| `l40sx1` | Best next smoke test: 48 GB VRAM, cheaper than A100. |
| `a100-large` | Good full run if scheduling is available. |
| `h200` | Highest headroom, more expensive, useful if A100 scheduling stalls. |
| `a10g-large` | Cheap fallback, but may need shorter completion length and fewer steps. |
## Monitoring Commands
```bash
hf jobs inspect <job_id>
hf jobs logs <job_id> --tail 200
```
Use `hf jobs logs` rather than `hf jobs inspect` when searching for the real traceback, because `inspect` prints the full base64-encoded submitted script and makes the useful error harder to see.
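When the log is long, a small filter can isolate the final Python traceback. This is a minimal sketch that assumes the log text has already been captured (for example by redirecting the output of `hf jobs logs <job_id>` to a file); the function name `last_traceback` is hypothetical.

```python
MARKER = "Traceback (most recent call last):"

def last_traceback(log_text: str) -> str:
    """Return the log from the last traceback header to the end, or ''."""
    start = log_text.rfind(MARKER)
    if start == -1:
        return ""
    return log_text[start:].rstrip()
```

Taking the last occurrence matters because earlier tracebacks in the same log may come from handled or retried errors rather than the one that ended the job.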