molforge / HF_RL_JOBS_NOTES.md

Adhitya122

Prepare MolForge OpenEnv Docker Space submission

bf9e424 verified 12 days ago

Hugging Face RL Jobs Notes

This file tracks the remote RL training attempts for the MolForge OpenEnv GRPO run.

Jobs Tried

| Job | Hardware | Result | Notes |
| --- | --- | --- | --- |
| `69ed7260d70108f37acdf4b8` | `a100-large` | Canceled | Stayed in `SCHEDULING`, so we canceled it before it used GPU time. |
| `69ed73d3d70108f37acdf4e1` | `l40sx1` | Failed | Started but exited during Python import, before model load or training. |
| `69ed74f6d70108f37acdf504` | `l40sx1` | Failed | `--with mergekit` caused an unsolvable pydantic conflict with `openenv-core`. |
| `69ed7be5d2c8bd8662bcef00` | `l40sx1` | Canceled | Incorrect CLI usage (missing image name). |
| `69ed9440d70108f37acdf83b` | `l40sx1` | Failed | `uv run` couldn't find the script path `issue/script.py`. |
| `69ed94add2c8bd8662bcf215` | `l40sx1` | Submitted | Fixed the script path to just the filename and used an explicit `python` call. |

Failure History

Job 2 (`69ed73d3`) — `ModuleNotFoundError: No module named 'mergekit'`

TRL imports mergekit internally for its GRPO model-merging callbacks, even though this run does not use merging. The first attempted fix was to add --with mergekit, which led to the Job 3 failure described next.

Job 3 (`69ed74f6`) — pydantic version conflict (CURRENT)

Adding --with mergekit broke the resolver:

mergekit (all versions) requires pydantic < 2.11
openenv-core==0.2.3 → fastmcp>=3.0.0 → pydantic >= 2.11.7

No version of pydantic satisfies both. uv correctly refuses to resolve.
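The empty intersection can be checked by hand. The sketch below is only an illustration: the candidate list is a small hand-picked sample of pydantic versions, and the helper names are hypothetical. It compares dotted versions as integer tuples, which is enough for plain release numbers but is not a full version-specifier implementation.

```python
def parse(version):
    """Turn '2.11.7' into (2, 11, 7) for simple tuple comparison."""
    return tuple(int(part) for part in version.split("."))


def ok_for_mergekit(version):
    # Constraint from the notes: pydantic < 2.11
    return parse(version) < (2, 11)


def ok_for_fastmcp(version):
    # Constraint from the notes: pydantic >= 2.11.7
    return parse(version) >= (2, 11, 7)


# Illustrative sample only, not the full list of pydantic releases.
candidates = ["2.9.2", "2.10.6", "2.11.0", "2.11.7", "2.12.0"]

compatible = [v for v in candidates if ok_for_mergekit(v) and ok_for_fastmcp(v)]
print(compatible)  # an empty list: no candidate satisfies both bounds
```

Because the upper bound (2.11) sits below the lower bound (2.11.7), the result is empty for any candidate list, which is why the resolver has nothing to pick.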

Fix

Do NOT pass --with mergekit in the HF Jobs command. Instead, the script now installs mergekit at runtime with --no-deps before importing TRL:

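import subprocess
import sys

# Install mergekit without its dependencies so its pydantic pin is never resolved.
try:
    import mergekit  # noqa: F401  (only needs to be importable for TRL)
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mergekit", "--no-deps", "-q"])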

This makes mergekit importable (satisfying TRL) without pulling in its conflicting pydantic constraint.
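A slightly more general form of the same workaround is sketched below. The `ensure_module` helper is hypothetical and not part of the MolForge script. It checks for the module with `importlib.util.find_spec` instead of importing it, and it assumes `pip` is available in the job's interpreter.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import Optional


def ensure_module(name: str, pip_name: Optional[str] = None) -> bool:
    """Return True if `name` was already importable, False if it had to be installed."""
    if importlib.util.find_spec(name) is not None:
        return True
    # --no-deps skips the package's own requirements (here: the pydantic pin).
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or name, "--no-deps", "-q"]
    )
    # Let the import system see the freshly installed package.
    importlib.invalidate_caches()
    return False


# Must run before `import trl` so the import inside TRL succeeds:
# ensure_module("mergekit")
```

Since `--no-deps` skips all of mergekit's requirements, this only guarantees that the top-level import resolves; actually using mergekit's merging features would need its dependencies installed.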

Checkpoint and Artifact Persistence

The OpenEnv GRPO script saves the final trained adapter and tokenizer to:

<run_dir>/adapters/

It also writes logs, metrics, plots, before/after evaluator JSON, and a zip archive under the run directory. When HF_OUTPUT_REPO=Adhitya122/molforge-rl-runs is set, the full run folder is uploaded to:

hf://datasets/Adhitya122/molforge-rl-runs/<run_name>
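A minimal sketch of how such an upload step could look with `huggingface_hub`. The function names are hypothetical, and it assumes the run name is the basename of the run directory, which the notes do not confirm. Authentication is assumed to come from a token already present in the job environment.

```python
import os
from pathlib import Path
from typing import Optional, Tuple


def upload_target(run_dir: str) -> Optional[Tuple[str, str]]:
    """Return (repo_id, path_in_repo), or None when HF_OUTPUT_REPO is unset."""
    repo_id = os.environ.get("HF_OUTPUT_REPO")
    if not repo_id:
        return None
    # Assumption: <run_name> is the last component of the run directory.
    return repo_id, Path(run_dir).name


def upload_run(run_dir: str) -> None:
    target = upload_target(run_dir)
    if target is None:
        return  # persistence is opt-in via the environment variable
    from huggingface_hub import HfApi  # imported lazily; only needed when uploading

    repo_id, run_name = target
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(
        folder_path=run_dir,
        repo_id=repo_id,
        repo_type="dataset",
        path_in_repo=run_name,
    )
```

Uploading at the end of the script matters on HF Jobs because anything left only on the job's local disk is not available once the job exits.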

Safer Next Runs

Recommended environment flags for the next HF Jobs command (NO --with mergekit):

--env RL_MAX_STEPS=20
--env RL_DATASET_SIZE=30
--env MAX_COMPLETION_LENGTH=1024

Use this as a smoke run first. Once it reaches at least one trainer log line and uploads artifacts, scale back up to:

--env RL_MAX_STEPS=80
--env RL_DATASET_SIZE=120
--env MAX_COMPLETION_LENGTH=2048
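For reference, a training script might read these variables roughly as follows. This is a sketch, not the MolForge script itself: `RunConfig` and `load_run_config` are hypothetical names, and using the full-run values as defaults is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    max_steps: int
    dataset_size: int
    max_completion_length: int


def load_run_config(env=None) -> RunConfig:
    """Read run sizes from the environment; defaults (assumed) match the full run."""
    env = os.environ if env is None else env
    return RunConfig(
        max_steps=int(env.get("RL_MAX_STEPS", 80)),
        dataset_size=int(env.get("RL_DATASET_SIZE", 120)),
        max_completion_length=int(env.get("MAX_COMPLETION_LENGTH", 2048)),
    )


# The smoke-run values, passed as a plain dict instead of real env vars.
smoke = load_run_config(
    {"RL_MAX_STEPS": "20", "RL_DATASET_SIZE": "30", "MAX_COMPLETION_LENGTH": "1024"}
)
print(smoke)
```

Keeping all three sizes in environment variables means the smoke run and the full run use the same script and differ only in the `--env` flags.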

Good hardware choices:

| Hardware | Use |
| --- | --- |
| `l40sx1` | Best next smoke test: 48 GB VRAM, cheaper than A100. |
| `a100-large` | Good full run if scheduling is available. |
| `h200` | Highest headroom, more expensive, useful if A100 scheduling stalls. |
| `a10g-large` | Cheap fallback, but may need shorter completion length and fewer steps. |

Monitoring Commands

hf jobs inspect <job_id>
hf jobs logs <job_id> --tail 200

Use logs without inspect when searching for the real traceback, because inspect prints the full base64-encoded submitted script and makes the useful error harder to see.
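When the log output is long, the last traceback can be pulled out of saved log text with a small helper. The sketch below is illustrative: `last_traceback` is a hypothetical function, and the sample log is made up apart from the `ModuleNotFoundError` message recorded for Job 2.

```python
def last_traceback(log_text: str) -> str:
    """Return the final Python traceback block in `log_text`, or '' if none."""
    lines = log_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.startswith("Traceback (most recent call last):"):
            start = i  # keep overwriting so we end with the last one
    if start is None:
        return ""
    block = [lines[start]]
    for line in lines[start + 1:]:
        block.append(line)
        # Frame lines are indented; the first unindented line is the exception.
        if line and not line[0].isspace():
            break
    return "\n".join(block)


# Made-up sample log for illustration.
sample = "\n".join([
    "Installing dependencies",
    "Traceback (most recent call last):",
    '  File "script.py", line 3, in <module>',
    "    import trl",
    "ModuleNotFoundError: No module named 'mergekit'",
    "job exited",
])
print(last_traceback(sample))
```

The log text could come from redirecting the output of the logs command to a file and reading that file in Python.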