Spaces:

Adhitya122
/

molforge

Running

App Files Files Community

molforge / HF_RL_JOBS_NOTES.md

Adhitya122

Prepare MolForge OpenEnv Docker Space submission

bf9e424 verified 12 days ago

preview code

raw

history blame contribute delete

3.31 kB

	# Hugging Face RL Jobs Notes

	This file tracks the remote RL training attempts for the MolForge OpenEnv GRPO run.

	## Jobs Tried

	\| Job \| Hardware \| Result \| Notes \|
	\| --- \| --- \| --- \| --- \|
	\| `69ed7260d70108f37acdf4b8` \| `a100-large` \| Canceled \| Stayed in `SCHEDULING`, so we canceled it before it used GPU time. \|
	\| `69ed73d3d70108f37acdf4e1` \| `l40sx1` \| Failed \| Started but exited during Python import before model load or training. \|
	\| `69ed74f6d70108f37acdf504` \| `l40sx1` \| Failed \| `--with mergekit` caused unsolvable pydantic conflict with `openenv-core`. \|
	\| `69ed7be5d2c8bd8662bcef00` \| `l40sx1` \| Canceled \| Incorrect CLI usage (missing image name). \|
	\| `69ed9440d70108f37acdf83b` \| `l40sx1` \| Failed \| `uv run` couldn't find the script path `issue/script.py`. \|
	\| `69ed94add2c8bd8662bcf215` \| `l40sx1` \| Submitted \| Fixed script path to just filename and used explicit `python` call. \|

	## Failure History

	### Job 2 (`69ed73d3`) — `ModuleNotFoundError: No module named 'mergekit'`

	TRL internally imports `mergekit` for GRPO model-merging callbacks even though we don't use merging. The fix was to add `--with mergekit`.

	### Job 3 (`69ed74f6`) — pydantic version conflict (CURRENT)

	Adding `--with mergekit` broke the resolver:

	- `mergekit` (all versions) requires `pydantic < 2.11`
	- `openenv-core==0.2.3` → `fastmcp>=3.0.0` → `pydantic >= 2.11.7`

	No version of pydantic satisfies both. uv correctly refuses to resolve.

	## Fix

	Do NOT pass `--with mergekit` in the HF Jobs command. Instead, the script now installs mergekit at runtime with `--no-deps` before importing TRL:

	```python
	try:
	import mergekit
	except ImportError:
	subprocess.check_call([sys.executable, "-m", "pip", "install", "mergekit", "--no-deps", "-q"])
	```

	This makes `mergekit` importable (satisfying TRL) without pulling in its conflicting pydantic constraint.

	## Checkpoint and Artifact Persistence

	The OpenEnv GRPO script saves the final trained adapter and tokenizer to:

	```text
	<run_dir>/adapters/
	```

	It also writes logs, metrics, plots, before/after evaluator JSON, and a zip archive under the run directory. When `HF_OUTPUT_REPO=Adhitya122/molforge-rl-runs` is set, the full run folder is uploaded to:

	```text
	hf://datasets/Adhitya122/molforge-rl-runs/<run_name>
	```

	## Safer Next Runs

	Recommended next HF Jobs command (NO `--with mergekit`):

	```bash
	--env RL_MAX_STEPS=20
	--env RL_DATASET_SIZE=30
	--env MAX_COMPLETION_LENGTH=1024
	```

	Use this as a smoke run first. Once it reaches at least one trainer log line and uploads artifacts, scale back to:

	```bash
	--env RL_MAX_STEPS=80
	--env RL_DATASET_SIZE=120
	--env MAX_COMPLETION_LENGTH=2048
	```

	Good hardware choices:

	\| Hardware \| Use \|
	\| --- \| --- \|
	\| `l40sx1` \| Best next smoke test: 48 GB VRAM, cheaper than A100. \|
	\| `a100-large` \| Good full run if scheduling is available. \|
	\| `h200` \| Highest headroom, more expensive, useful if A100 scheduling stalls. \|
	\| `a10g-large` \| Cheap fallback, but may need shorter completion length and fewer steps. \|

	## Monitoring Commands

	```bash
	hf jobs inspect <job_id>
	hf jobs logs <job_id> --tail 200
	```

	Use logs without `inspect` when searching for the real traceback, because `inspect` prints the full base64-encoded submitted script and makes the useful error harder to see.