Neuralese / hackable GRPO baseline

This repository is a hackable GRPO-style training baseline for math reasoning with a fixed chain-of-thought format: reasoning inside <redacted_thinking>...</redacted_thinking> and a final \boxed{...} answer. The design keeps objectives, rewards, and data providers in small pluggable modules so you can experiment without rewriting a monolithic trainer.

Repository layout

| Path | Role |
| --- | --- |
| src/hackable/ | Core library: config types, registries, objectives, rewards, data providers, model loading. |
| src/train_grpo.py | Training entrypoint expected by scripts/*.sh (may be absent in a partial checkout; scripts default to --config configs/grpo_llama32_3b_bf16.yaml). |
| docs/repository-map.md | Maintainer-focused map of entrypoints, modules, and storage conventions. |
| AGENTS.md | Agent-facing workflow and folder contract for future repo changes. |
| configs/ | YAML experiment configs (referenced by scripts; may need to be added locally). |
| scripts/ | Bash launchers for single/multi-GPU training, sweeps, and evaluation. |
| cache/ | Runtime storage root: datasets/, models/, hf/, artifacts/{runs,sweeps,eval}/, and logs/wandb/. |
| requirements.txt | Python dependencies (Torch, TRL, Transformers, Accelerate, etc.). |
| note.txt | Unrelated environment/pip noise from a past install; not project documentation. |

Storage conventions

  • storage.cache_dir is the canonical runtime root. Relative runtime paths like artifacts/runs/grpo-llama32-3b resolve under that directory.
  • Use cache/datasets for dataset cache, cache/models for model/tokenizer cache, and cache/logs/wandb for W&B logs.
  • Training runs live in cache/artifacts/runs, sweep outputs in cache/artifacts/sweeps, eval outputs in cache/artifacts/eval.
  • Permanent checkpoints now live under each run at checkpoints/permanent/.

src/hackable/ modules

  • config.py β€” Loads YAML into typed dataclasses (ExperimentConfig, ModelConfig, TrainerConfig, GenerationConfig, ObjectiveConfig, RewardsConfig, …). Normalizes optimizer/scheduler aliases and numeric fields.
  • registry.py β€” @register_data_provider, @register_reward, @register_objective plus build_* factories. Reward kwargs from YAML are partially applied via build_reward so each reward can receive static options (e.g. tokenizer name for length penalty).
  • interfaces.py β€” TrainingSample and protocol shapes for providers/rewards/objectives.
  • objectives.py β€” TokenGRPOObjective (main token-level GRPO recipe) and LatentNeuraleseObjective (stub: format reward only, extra_reward no-op for future latent scoring).
  • reward_plugins.py β€” Registered rewards: strict format, GSM8K/MATH-style correctness, length penalty, optional token-utilisation shaping. See Length penalty rewards below.
  • data_plugins.py β€” GSM8K, Hendrycks MATH by level, interleaved curricula (gsm8k_math_curriculum, etc.). Shared prompt prefix matches the strict completion format expected by rewards.
  • backends.py β€” Loads causal LMs with Liger Llama patches when applicable, FlashAttention2 when importable, else SDPA. Exposes generation_kwargs from config.
  • utils.py β€” import_from_path for objective.class_path style "module:Class" loading.

Importing hackable registers default plugins (hackable/__init__.py imports data_plugins, objectives, reward_plugins).
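
The "module:Class" convention can be sketched as a minimal re-implementation (for illustration; the repo's own import_from_path may differ in details):

```python
import importlib

def import_from_path(path: str):
    # Split "module:Class" into a module path and an attribute name,
    # import the module, and return the attribute.
    module_name, _, attr = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

With this, "collections:OrderedDict" resolves to the OrderedDict class.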

src/ evaluation utilities

  • eval_sweep_models.py β€” Distributed evaluation of every run_* directory under a sweep root: loads checkpoints, runs GSM8K test generations, scores correctness, records CoT word length stats, writes CSV/JSON summaries (used after lambda or reward-variant sweeps).
  • eval_permanent_checkpoints.py β€” Walks checkpoints/permanent folders, evaluates each checkpoint, can emit simple SVG learning curves.
  • eval_math_level1_thinking_zeroshot.py β€” Zero-shot / thinking-format eval on MATH-style data with JSONL output (for downstream rewards or analysis).

scripts/ (high level)

| Script | Purpose |
| --- | --- |
| run_grpo.sh, run_grpo_2gpu.sh, run_grpo_4gpu.sh, run_grpo_8gpu.sh | Launch training with Accelerate. |
| resume_grpo_8gpu.sh | Resume from the latest or an explicit checkpoint. |
| sweep_length_penalty_lambda.sh | Trains multiple runs with different length_penalty_lambda values (weighted length-penalty mode). |
| run_reward_variants_and_eval.sh | Trains three interaction/gating variants, then runs eval_sweep_models.py. |
| run_twostage_correctness1.sh, run_twostage_correctness5.sh | Two-stage schedules; YAML is expected to set correctness_weight and optionally stage-2 length-penalty fields (see below). |
| run_lambda_0p1_existing_gate_token_util.sh | Example run with low λ and token_utilisation_reward enabled. |
| eval_sweep_models_offline.sh | Offline eval driver for a length-penalty λ sweep directory. |
| eval_length_penalty_ablation_offline.sh | Launches src/eval_length_penalty_ablation.py (the script must exist alongside the training code). |
| eval_twostage_permanent_checkpoints.sh | Eval for two-stage permanent checkpoint trees. |
| eval_gsm8k_*.sh, eval_math_level*_*.sh | Dataset-specific eval launchers. |
| hf_upload_repo.py, hf_download_repo.py | Push/pull Hugging Face dataset repo snapshots. |

Length penalty rewards

Training prefers shorter thinking traces within each GRPO group (same prompt, multiple sampled completions). The signal is implemented as a reward (length_penalty_reward) and combined with correctness and format either additively or via a weighted multiplicative term controlled by TokenGRPOObjective.

What gets measured

  1. Strict format only – _think_length_tokens in reward_plugins.py parses the completion with the same regex as format_tag_reward: a single <redacted_thinking>...</redacted_thinking> block followed by \boxed{...}. If the completion does not match, thinking length is treated as 0, so the length reward does not credit malformed outputs on length grounds.
  2. Token count – If tokenizer_name is passed (via rewards.kwargs.length_penalty_reward in YAML and build_reward), the thinking substring is encoded with that tokenizer (add_special_tokens=False) and the length is the number of token IDs. If no tokenizer is configured, length falls back to whitespace-split words inside the thinking block.
  3. Per-group normalization – For each group \(G\) of completions (see grouping below), let \(L_i\) be the thinking length of completion \(i\), and \(\bar{L} = \frac{1}{|G|}\sum_j L_j\). If \(\bar{L} \le 0\), every score in the group is 0. Otherwise:

\[ R^{\mathrm{length}}_i = \max\left(0,\ 1 - \frac{L_i}{\bar{L}}\right). \]

So shorter-than-average thinking in a group scores closer to 1, longer-than-average thinking scores closer to 0, and a completion exactly at the average gets 0. This is a relative length preference, not an absolute token budget.
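
The measurement and normalization steps can be sketched as follows (whitespace-word fallback only; the regex and helper names are illustrative, not the repo's exact code):

```python
import re

# Strict format: one thinking block, then a boxed answer (illustrative regex).
THINK_RE = re.compile(
    r"^\s*<redacted_thinking>(.*?)</redacted_thinking>\s*\\boxed\{.+?\}\s*$",
    re.DOTALL,
)

def think_length_words(completion: str) -> int:
    # Whitespace-word length of the thinking block; 0 if malformed.
    m = THINK_RE.match(completion)
    return len(m.group(1).split()) if m else 0

def length_penalty_scores(lengths):
    # Per-group reward: max(0, 1 - L_i / mean(L)); all zeros if mean <= 0.
    mean = sum(lengths) / len(lengths)
    if mean <= 0:
        return [0.0 for _ in lengths]
    return [max(0.0, 1.0 - length / mean) for length in lengths]
```

For example, a group with thinking lengths [2, 4, 6] (mean 4) scores [0.5, 0.0, 0.0]: only the shorter-than-average completion is rewarded.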

How groups are formed

length_penalty_reward assigns the same group normalization to completions that belong to the same GRPO comparison group:

  • If group_size is set (again, typically under rewards.kwargs.length_penalty_reward), the flat batch is chunked in order: [0:group_size), [group_size:2*group_size), …
  • Else groups are inferred by contiguous runs of identical prompt text in the parallel lists passed to the reward.

How TokenGRPOObjective combines length with other rewards

Configured under objective.name: token_grpo with objective.kwargs (see src/hackable/objectives.py).

Registered reward names (when enabled):

  • format_tag_reward β€” 1.0 if strict thinking + non-empty boxed answer, else 0.0.
  • gsm8k_correctness_reward β€” Parses \boxed{...} vs reference (GSM8K #### answers or MATH boxed solutions), numeric normalization and tolerant float compare.
  • length_penalty_reward β€” if enable_length_penalty: true.
  • token_utilisation_reward β€” optional; shapes training vs a frozen zero-shot correctness JSONL (see docstring in reward_plugins.py).

reward_mode

  1. additive – The total score is the sum of all enabled reward outputs for that sample. If strict_format_gate: true, any sample with format_tag_reward ≤ 0.5 is assigned non_strict_penalty (default -1.0) instead of the sum.

  2. weighted_length_penalty – Correctness and length interact multiplicatively; format is added (and optionally multiplied into the interaction):

    • Let \(r_c, r_f, r_\ell\) be the correctness, format, and length scores in \([0, 1]\) (length may be 0 if disabled or malformed).
    • Base interaction: \(r_c \times r_\ell\).
    • If length_penalty_interaction is correctness_length_format, the interaction is \(r_c \times r_\ell \times r_f\).
    • Total (before the optional token-utilisation term):

    \[ \texttt{correctness\_weight} \cdot r_c + \texttt{length\_penalty\_lambda} \cdot \mathrm{interaction} + r_f + r_{\mathrm{util}}. \]

    If strict_format_gate is true and \(r_f \le 0.5\), the total is non_strict_penalty and the formula above is skipped.
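
The weighted total can be sketched as a single function (hypothetical name and signature; the objective's actual combine_rewards may differ):

```python
def combine_weighted(r_c, r_f, r_len, *,
                     correctness_weight=1.0,
                     length_penalty_lambda=0.5,
                     interaction="correctness_length",
                     strict_format_gate=True,
                     non_strict_penalty=-1.0,
                     r_util=0.0):
    # Hard format gate: malformed samples get the penalty outright.
    if strict_format_gate and r_f <= 0.5:
        return non_strict_penalty
    inter = r_c * r_len
    if interaction == "correctness_length_format":
        inter *= r_f
    return correctness_weight * r_c + length_penalty_lambda * inter + r_f + r_util
```

For instance, with the defaults, a correct, well-formatted sample with length score 0.5 totals 1.0 + 0.5 × 0.5 + 1.0 = 2.25, while a malformed one totals -1.0.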

Important knobs

| Kwarg | Meaning |
| --- | --- |
| enable_length_penalty | When false, the length reward is not registered; in weighted mode \(r_\ell\) is treated as 0. |
| length_penalty_lambda | Scales the \(r_c \cdot r_\ell\) (or \(r_c \cdot r_\ell \cdot r_f\)) term in weighted mode. Sweeps often try e.g. 0.25, 0.5, 0.75, 1.0. |
| correctness_weight | Scales the standalone correctness term in weighted mode. |
| length_penalty_interaction | correctness_length vs correctness_length_format (whether format enters the product). |
| strict_format_gate / non_strict_penalty | Hard gate on format before crediting other terms. |
| stage2_length_penalty_lambda, stage2_start_epoch | Stored on TokenGRPOObjective for two-stage schedules; changing λ at epoch boundaries must be implemented in the trainer (combine_rewards itself only reads the current length_penalty_lambda on the instance). |
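
The trainer-side stage switch for stage2_length_penalty_lambda could look like this (a sketch with assumed attribute names; the baseline trainer must implement it itself):

```python
def maybe_enter_stage2(objective, epoch: int) -> None:
    # combine_rewards only reads objective.length_penalty_lambda, so a
    # two-stage schedule must overwrite it at the stage-2 epoch boundary.
    lam2 = getattr(objective, "stage2_length_penalty_lambda", None)
    start = getattr(objective, "stage2_start_epoch", None)
    if lam2 is not None and start is not None and epoch >= start:
        objective.length_penalty_lambda = lam2
```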

YAML wiring example (illustrative; paths depend on your repo):

```yaml
objective:
  name: token_grpo
  kwargs:
    reward_mode: weighted_length_penalty
    enable_length_penalty: true
    correctness_weight: 1.0
    length_penalty_lambda: 0.5
    length_penalty_interaction: correctness_length
    strict_format_gate: true
    non_strict_penalty: -1.0

rewards:
  kwargs:
    length_penalty_reward:
      tokenizer_name: meta-llama/Llama-3.2-3B-Instruct   # example
      group_size: 4          # often matches generation.num_generations
      cache_dir: cache/models
```
Hackable extension path

  • Implement a new objective class with reward_names() and combine_rewards-equivalent behavior (your trainer must call into it the same way as the baseline), or add new @register_reward functions.
  • Point config at a custom class: objective.class_path: "my_package.my_module:MyObjective".
  • LatentNeuraleseObjective is a stub: keep token rewards while you add latent signals via extra_reward or a future trainer hook.

Quickstart

  1. pip install -r requirements.txt (you may need a CUDA-matched Torch and optional FlashAttention build).
  2. Set HF_TOKEN / WANDB_API_KEY if needed (names configurable under auth in YAML).
  3. Single GPU: bash scripts/run_grpo.sh (requires src/train_grpo.py and configs/grpo_llama32_3b_bf16.yaml present).
  4. Many scripts default to offline Hub/datasets/W&B; override WANDB_MODE, HF_HUB_OFFLINE, etc. if you need network access.
  5. Runtime outputs default to cache/artifacts/... through storage.cache_dir; agent docs live in AGENTS.md and docs/repository-map.md.

Attention backend

backends.py prefers FlashAttention2 when flash_attn imports cleanly; otherwise SDPA. Llama models optionally use Liger kernels when installed.
