Hugging Face RL Jobs Notes

This file tracks the remote RL training attempts for the MolForge OpenEnv GRPO run.

Jobs Tried

| Job | Hardware | Result | Notes |
| --- | --- | --- | --- |
| 69ed7260d70108f37acdf4b8 | a100-large | Canceled | Stayed in SCHEDULING, so we canceled it before it used GPU time. |
| 69ed73d3d70108f37acdf4e1 | l40sx1 | Failed | Started but exited during Python import before model load or training. |
| 69ed74f6d70108f37acdf504 | l40sx1 | Failed | --with mergekit caused unsolvable pydantic conflict with openenv-core. |
| 69ed7be5d2c8bd8662bcef00 | l40sx1 | Canceled | Incorrect CLI usage (missing image name). |
| 69ed9440d70108f37acdf83b | l40sx1 | Failed | uv run couldn't find the script path issue/script.py. |
| 69ed94add2c8bd8662bcf215 | l40sx1 | Submitted | Fixed script path to just filename and used explicit python call. |

Failure History

Job 2 (69ed73d3) — ModuleNotFoundError: No module named 'mergekit'

TRL internally imports mergekit for GRPO model-merging callbacks even though we don't use merging. The first fix attempted was to add --with mergekit, which led to the resolver failure in Job 3.

Job 3 (69ed74f6) — pydantic version conflict (CURRENT)

Adding --with mergekit broke the resolver:

  • mergekit (all versions) requires pydantic < 2.11
  • openenv-core==0.2.3 requires fastmcp>=3.0.0, which requires pydantic >= 2.11.7

No version of pydantic satisfies both. uv correctly refuses to resolve.
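The conflict can be checked by hand. The following is a minimal sketch, not uv's resolver: it encodes only the two bounds quoted above and tests a handful of pydantic version strings. The candidate list and the helper names are assumptions chosen for illustration.

```python
def parse(version: str) -> tuple:
    """Turn '2.11.7' into (2, 11, 7) for simple numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def allowed_by_mergekit(version: str) -> bool:
    # Bound from the notes: mergekit requires pydantic < 2.11
    return parse(version) < parse("2.11")


def allowed_by_openenv(version: str) -> bool:
    # Bound from the notes: openenv-core 0.2.3 (via fastmcp) requires pydantic >= 2.11.7
    return parse(version) >= parse("2.11.7")


# Illustrative candidates only; not an exhaustive list of pydantic releases.
candidates = ["2.9.2", "2.10.6", "2.11.0", "2.11.7", "2.12.0"]
compatible = [
    v for v in candidates if allowed_by_mergekit(v) and allowed_by_openenv(v)
]
print(compatible)
```

Because the upper bound (2.11) lies below the lower bound (2.11.7), the two ranges cannot overlap, whichever candidates are tried.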

Fix

Do NOT pass --with mergekit in the HF Jobs command. Instead, the script now installs mergekit at runtime with --no-deps before importing TRL:

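import subprocess
import sys

# Install mergekit without its dependencies so its pydantic pin is never applied.
try:
    import mergekit  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mergekit", "--no-deps", "-q"])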

This makes mergekit importable (satisfying TRL) without pulling in its conflicting pydantic constraint.
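If more packages ever need the same treatment, the guard above could be wrapped in a small helper. This is a sketch under assumptions: the function names are hypothetical and are not part of the MolForge script, and it relies only on the standard library plus the same pip invocation shown in the notes.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import List, Optional


def needs_install(module_name: str) -> bool:
    """Return True when a top-level module cannot be found on the import path."""
    return importlib.util.find_spec(module_name) is None


def no_deps_install_command(package: str) -> List[str]:
    """Build the pip invocation from the notes for the running interpreter."""
    return [sys.executable, "-m", "pip", "install", package, "--no-deps", "-q"]


def ensure_no_deps(module_name: str, package: Optional[str] = None) -> bool:
    """Install `package` without dependencies if `module_name` is missing.

    Returns True if an install was run, False if the module was already present.
    """
    if not needs_install(module_name):
        return False
    subprocess.check_call(no_deps_install_command(package or module_name))
    importlib.invalidate_caches()  # let the import system see the new files
    return True
```

Calling `ensure_no_deps("mergekit")` before the TRL import would then behave like the try/except guard. Note that `--no-deps` also skips any dependencies mergekit genuinely needs at runtime, so this only works as long as TRL merely imports the package and the merging code paths are never exercised.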

Checkpoint and Artifact Persistence

The OpenEnv GRPO script saves the final trained adapter and tokenizer to:

<run_dir>/adapters/

It also writes logs, metrics, plots, before/after evaluator JSON, and a zip archive under the run directory. When HF_OUTPUT_REPO=Adhitya122/molforge-rl-runs is set, the full run folder is uploaded to:

hf://datasets/Adhitya122/molforge-rl-runs/<run_name>
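The notes do not show how the script performs the upload, so the following is only a sketch of one way it could be done. The function names are hypothetical. `HfApi.upload_folder` is the `huggingface_hub` call for pushing a directory to a repo; it is imported lazily here, and the sketch assumes a valid Hugging Face token is available in the job environment.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_upload_target(
    env: Mapping[str, str], run_name: str
) -> Optional[Tuple[str, str]]:
    """Return (repo_id, path_in_repo), or None when HF_OUTPUT_REPO is unset or empty."""
    repo_id = env.get("HF_OUTPUT_REPO", "").strip()
    if not repo_id:
        return None
    return repo_id, run_name


def upload_run(run_dir: str, run_name: str) -> Optional[str]:
    """Upload the whole run folder to a dataset repo; return its hf:// URI or None."""
    target = resolve_upload_target(os.environ, run_name)
    if target is None:
        return None  # nothing configured, artifacts stay local
    repo_id, path_in_repo = target

    # Lazy import so the helper above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_folder(
        folder_path=run_dir,
        repo_id=repo_id,
        repo_type="dataset",
        path_in_repo=path_in_repo,
    )
    return f"hf://datasets/{repo_id}/{path_in_repo}"
```

Keeping the target resolution separate from the network call makes it easy to confirm, before a job is submitted, that an unset `HF_OUTPUT_REPO` means the run folder is never uploaded.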

Safer Next Runs

Recommended environment flags for the next HF Jobs command (NO --with mergekit):

--env RL_MAX_STEPS=20
--env RL_DATASET_SIZE=30
--env MAX_COMPLETION_LENGTH=1024

Use this as a smoke run first. Once it reaches at least one trainer log line and uploads artifacts, scale back up to:

--env RL_MAX_STEPS=80
--env RL_DATASET_SIZE=120
--env MAX_COMPLETION_LENGTH=2048
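On the script side, these three variables could be read into a single config object. This is a sketch only: the notes do not show how the script parses them, `RunConfig` is a hypothetical name, and using the full-run numbers as fallbacks is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RunConfig:
    max_steps: int
    dataset_size: int
    max_completion_length: int

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "RunConfig":
        # Fallbacks are the full-run numbers from the notes; treating them as
        # defaults is an assumption, not something the notes state.
        return cls(
            max_steps=int(env.get("RL_MAX_STEPS", "80")),
            dataset_size=int(env.get("RL_DATASET_SIZE", "120")),
            max_completion_length=int(env.get("MAX_COMPLETION_LENGTH", "2048")),
        )


# Values passed via --env for the smoke run.
SMOKE_ENV = {
    "RL_MAX_STEPS": "20",
    "RL_DATASET_SIZE": "30",
    "MAX_COMPLETION_LENGTH": "1024",
}

smoke = RunConfig.from_env(SMOKE_ENV)
full = RunConfig.from_env({})
print(smoke)
print(full)
```

Inside a job, `RunConfig.from_env(os.environ)` would pick up whatever was passed with `--env`. Logging the resolved config at startup makes it obvious from the first log lines whether a run is the smoke run or the full run.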

Good hardware choices:

| Hardware | Use |
| --- | --- |
| l40sx1 | Best next smoke test: 48 GB VRAM, cheaper than A100. |
| a100-large | Good full run if scheduling is available. |
| h200 | Highest headroom, more expensive, useful if A100 scheduling stalls. |
| a10g-large | Cheap fallback, but may need shorter completion length and fewer steps. |

Monitoring Commands

hf jobs inspect <job_id>
hf jobs logs <job_id> --tail 200

When searching for the real traceback, use hf jobs logs rather than hf jobs inspect: inspect prints the full base64-encoded submitted script, which makes the useful error harder to see.
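For long logs, the last traceback can be pulled out programmatically. This is a minimal sketch: `extract_last_traceback` and `fetch_logs` are hypothetical helpers, and `fetch_logs` assumes the `hf` CLI is installed, logged in, and that `hf jobs logs <job_id>` returns once the job has finished.

```python
import subprocess


def extract_last_traceback(log_text: str) -> str:
    """Return the log from the last Python traceback onward, or '' if none is found."""
    marker = "Traceback (most recent call last):"
    start = log_text.rfind(marker)
    if start == -1:
        return ""
    return log_text[start:].rstrip()


def fetch_logs(job_id: str) -> str:
    """Run `hf jobs logs <job_id>` and return its stdout (assumes the hf CLI is set up)."""
    result = subprocess.run(
        ["hf", "jobs", "logs", job_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A typical use would be `print(extract_last_traceback(fetch_logs("<job_id>")))`. Searching from the end of the log matters because a failing import often produces several chained tracebacks, and the final one names the actual error.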