# Neuralese / hackable GRPO baseline

This repository is a hackable GRPO-style training baseline for math reasoning with a fixed chain-of-thought format: reasoning inside `<redacted_thinking>...</redacted_thinking>` and a final `\boxed{...}` answer. The design keeps objectives, rewards, and data providers in small pluggable modules so you can experiment without rewriting a monolithic trainer.
## Repository layout

| Path | Role |
|---|---|
| `src/hackable/` | Core library: config types, registries, objectives, rewards, data providers, model loading. |
| `src/train_grpo.py` | Training entrypoint expected by `scripts/*.sh` (may be absent in a partial checkout; scripts default to `--config configs/grpo_llama32_3b_bf16.yaml`). |
| `docs/repository-map.md` | Maintainer-focused map of entrypoints, modules, and storage conventions. |
| `AGENTS.md` | Agent-facing workflow and folder contract for future repo changes. |
| `configs/` | YAML experiment configs (referenced by scripts; may need to be added locally). |
| `scripts/` | Bash launchers for single/multi-GPU training, sweeps, and evaluation. |
| `cache/` | Runtime storage root: `datasets/`, `models/`, `hf/`, `artifacts/{runs,sweeps,eval}/`, and `logs/wandb/`. |
| `requirements.txt` | Python dependencies (Torch, TRL, Transformers, Accelerate, etc.). |
| `note.txt` | Unrelated environment/pip noise from a past install; not project documentation. |
## Storage conventions

- `storage.cache_dir` is the canonical runtime root. Relative runtime paths like `artifacts/runs/grpo-llama32-3b` resolve under that directory.
- Use `cache/datasets` for the dataset cache, `cache/models` for the model/tokenizer cache, and `cache/logs/wandb` for W&B logs.
- Training runs live in `cache/artifacts/runs`, sweep outputs in `cache/artifacts/sweeps`, and eval outputs in `cache/artifacts/eval`.
- Permanent checkpoints now live under each run at `checkpoints/permanent/`.
## `src/hackable/` modules

- `config.py` – Loads YAML into typed dataclasses (`ExperimentConfig`, `ModelConfig`, `TrainerConfig`, `GenerationConfig`, `ObjectiveConfig`, `RewardsConfig`, …). Normalizes optimizer/scheduler aliases and numeric fields.
- `registry.py` – `@register_data_provider`, `@register_reward`, and `@register_objective` decorators plus `build_*` factories. Reward kwargs from YAML are partially applied via `build_reward`, so each reward can receive static options (e.g. a tokenizer name for the length penalty).
- `interfaces.py` – `TrainingSample` and protocol shapes for providers/rewards/objectives.
- `objectives.py` – `TokenGRPOObjective` (the main token-level GRPO recipe) and `LatentNeuraleseObjective` (a stub: format reward only, `extra_reward` no-op for future latent scoring).
- `reward_plugins.py` – Registered rewards: strict format, GSM8K/MATH-style correctness, length penalty, and optional token-utilisation shaping. See "Length penalty rewards" below.
- `data_plugins.py` – GSM8K, Hendrycks MATH by level, and interleaved curricula (`gsm8k_math_curriculum`, etc.). A shared prompt prefix matches the strict completion format expected by the rewards.
- `backends.py` – Loads causal LMs with Liger Llama patches when applicable, FlashAttention2 when importable, else SDPA. Exposes `generation_kwargs` from config.
- `utils.py` – `import_from_path` for `objective.class_path`-style `"module:Class"` loading.

Importing `hackable` registers the default plugins (`hackable/__init__.py` imports `data_plugins`, `objectives`, and `reward_plugins`).
## `src/` evaluation utilities

- `eval_sweep_models.py` – Distributed evaluation of every `run_*` directory under a sweep root: loads checkpoints, runs GSM8K test generations, scores correctness, records CoT word-length stats, and writes CSV/JSON summaries (used after lambda or reward-variant sweeps).
- `eval_permanent_checkpoints.py` – Walks `checkpoints/permanent` folders, evaluates each checkpoint, and can emit simple SVG learning curves.
- `eval_math_level1_thinking_zeroshot.py` – Zero-shot / thinking-format eval on MATH-style data with JSONL output (for downstream rewards or analysis).
## `scripts/` (high level)

| Script | Purpose |
|---|---|
| `run_grpo.sh`, `run_grpo_2gpu.sh`, `run_grpo_4gpu.sh`, `run_grpo_8gpu.sh` | Launch training with Accelerate. |
| `resume_grpo_8gpu.sh` | Resume from the latest or an explicit checkpoint. |
| `sweep_length_penalty_lambda.sh` | Trains multiple runs with different `length_penalty_lambda` values (weighted length-penalty mode). |
| `run_reward_variants_and_eval.sh` | Trains three interaction/gating variants, then runs `eval_sweep_models.py`. |
| `run_twostage_correctness1.sh`, `run_twostage_correctness5.sh` | Two-stage schedules; YAML is expected to set `correctness_weight` and optionally stage-2 length-penalty fields (see below). |
| `run_lambda_0p1_existing_gate_token_util.sh` | Example run with a low λ and `token_utilisation_reward` enabled. |
| `eval_sweep_models_offline.sh` | Offline eval driver for a length-penalty λ sweep directory. |
| `eval_length_penalty_ablation_offline.sh` | Launches `src/eval_length_penalty_ablation.py` (the script must exist alongside the training code). |
| `eval_twostage_permanent_checkpoints.sh` | Eval for two-stage permanent checkpoint trees. |
| `eval_gsm8k_*.sh`, `eval_math_level*_*.sh` | Dataset-specific eval launchers. |
| `hf_upload_repo.py`, `hf_download_repo.py` | Push/pull Hugging Face dataset repo snapshots. |
## Length penalty rewards

Training prefers shorter thinking traces within each GRPO group (same prompt, multiple sampled completions). The signal is implemented as a reward (`length_penalty_reward`) and combined with correctness and format either additively or via a weighted multiplicative term controlled by `TokenGRPOObjective`.
### What gets measured

- **Strict format only** – `_think_length_tokens` in `reward_plugins.py` parses the completion with the same regex as `format_tag_reward`: a single `<redacted_thinking>...</redacted_thinking>` block followed by `\boxed{...}`. If the completion does not match, the thinking length is treated as 0 (so the length reward does not reward malformed outputs on length grounds).
- **Token count** – If `tokenizer_name` is passed (via `rewards.kwargs.length_penalty_reward` in YAML and `build_reward`), the thinking substring is encoded with that tokenizer (`add_special_tokens=False`) and the length is the number of token IDs. If no tokenizer is configured, the length falls back to whitespace-split words inside the thinking block.
- **Per-group normalization** – For each group (G) of completions (see grouping below), let (L_i) be the thinking length of completion (i), and (\bar{L} = \frac{1}{|G|}\sum_j L_j). If (\bar{L} \le 0), every score in the group is 0. Otherwise:

[ R^{\mathrm{length}}_i = \max\left(0,\ 1 - \frac{L_i}{\bar{L}}\right). ]
So shorter-than-average thinking in the group scores closer to 1, while any completion at or above the group average scores exactly 0 (the reward is clamped at zero). This is a relative length preference, not an absolute token budget.
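The per-group normalization can be sketched in a few lines. This is an illustration, not the repo's code; the helper name `length_penalty_scores` is hypothetical (see `reward_plugins.py` for the real implementation):

```python
def length_penalty_scores(lengths):
    """Score one GRPO group: max(0, 1 - L_i / mean(L)).

    `lengths` are the thinking-trace lengths (tokens or words)
    of the completions in a single comparison group.
    """
    if not lengths:
        return []
    mean_len = sum(lengths) / len(lengths)
    if mean_len <= 0:  # e.g. every completion malformed -> length 0
        return [0.0 for _ in lengths]
    return [max(0.0, 1.0 - l / mean_len) for l in lengths]
```

With lengths `[10, 20, 30]` the group mean is 20, so the scores are `[0.5, 0.0, 0.0]`: only shorter-than-average traces earn credit.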
### How groups are formed

`length_penalty_reward` assigns the same group normalization to completions that belong to the same GRPO comparison group:

- If `group_size` is set (again, typically under `rewards.kwargs.length_penalty_reward`), the flat batch is chunked in order: `[0:group_size)`, `[group_size:2*group_size)`, …
- Otherwise, groups are inferred from contiguous runs of identical `prompt` text in the parallel lists passed to the reward.
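Both grouping strategies are small enough to sketch standalone. The function name `infer_groups` is illustrative, not from the repo:

```python
def infer_groups(prompts, group_size=None):
    """Return index lists forming GRPO comparison groups (sketch).

    If group_size is given, chunk the flat batch in order; otherwise
    group contiguous runs of identical prompt text.
    """
    n = len(prompts)
    if n == 0:
        return []
    if group_size:
        return [list(range(i, min(i + group_size, n)))
                for i in range(0, n, group_size)]
    groups, current = [], [0]
    for i in range(1, n):
        if prompts[i] == prompts[i - 1]:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups
```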
### How `TokenGRPOObjective` combines length with other rewards

Configured under `objective.name: token_grpo` with `objective.kwargs` (see `src/hackable/objectives.py`).

Registered reward names (when enabled):

- `format_tag_reward` – 1.0 for strict thinking tags plus a non-empty boxed answer, else 0.0.
- `gsm8k_correctness_reward` – Parses `\boxed{...}` against the reference (GSM8K `####` answers or MATH boxed solutions), with numeric normalization and a tolerant float compare.
- `length_penalty_reward` – registered if `enable_length_penalty: true`.
- `token_utilisation_reward` – optional; shapes training against a frozen zero-shot correctness JSONL (see the docstring in `reward_plugins.py`).
### `reward_mode`

- `additive` – The total score is the sum of all enabled reward outputs for that sample. If `strict_format_gate: true`, any sample with `format_tag_reward` ≤ 0.5 has its sum replaced by `non_strict_penalty` (default -1.0).
- `weighted_length_penalty` – Correctness and length interact multiplicatively; format is added (and optionally multiplied into the interaction):
  - Let (r_c, r_f, r_\ell) be the correctness, format, and length scores in ([0,1]) (length may be 0 if disabled or malformed).
  - Base interaction: (r_c \times r_\ell).
  - If `length_penalty_interaction` is `correctness_length_format`, the interaction is (r_c \times r_\ell \times r_f).
  - Total (before the optional token-util term):

    [ \texttt{correctness_weight} \cdot r_c + \texttt{length_penalty_lambda} \cdot \text{interaction} + r_f + r_{\mathrm{util}}. ]

  - If `strict_format_gate` is true and (r_f \le 0.5), the total is `non_strict_penalty` and the formula above is skipped.
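The weighted combination for a single sample can be sketched as follows. Parameter names mirror the YAML kwargs, but this is a simplified illustration, not the repo's `combine_rewards`:

```python
def combine_weighted(r_c, r_f, r_len, r_util=0.0,
                     correctness_weight=1.0,
                     length_penalty_lambda=0.5,
                     interaction="correctness_length",
                     strict_format_gate=True,
                     non_strict_penalty=-1.0):
    """Sketch of the weighted_length_penalty total for one sample."""
    # Hard format gate: malformed samples get the penalty, nothing else.
    if strict_format_gate and r_f <= 0.5:
        return non_strict_penalty
    inter = r_c * r_len
    if interaction == "correctness_length_format":
        inter *= r_f  # format enters the multiplicative term too
    return correctness_weight * r_c + length_penalty_lambda * inter + r_f + r_util
```

For example, with perfect correctness and format and a length score of 0.5 at λ = 0.5, the total is 1.0 + 0.25 + 1.0 = 2.25; a malformed sample returns -1.0 regardless of the other scores.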
### Important knobs

| Kwarg | Meaning |
|---|---|
| `enable_length_penalty` | When false, the length reward is not registered; in weighted mode (r_\ell) is treated as 0. |
| `length_penalty_lambda` | Scales the (r_c \cdot r_\ell) (or (r_c \cdot r_\ell \cdot r_f)) term in weighted mode. Sweeps often try e.g. 0.25, 0.5, 0.75, 1.0. |
| `correctness_weight` | Scales the standalone correctness term in weighted mode. |
| `length_penalty_interaction` | `correctness_length` vs `correctness_length_format` (whether format enters the product). |
| `strict_format_gate` / `non_strict_penalty` | Hard gate on format before crediting other terms. |
| `stage2_length_penalty_lambda`, `stage2_start_epoch` | Stored on `TokenGRPOObjective` for two-stage schedules; changing λ at epoch boundaries must be implemented in the trainer (`combine_rewards` itself only reads the current `length_penalty_lambda` on the instance). |
YAML wiring example (illustrative; paths depend on your repo):
```yaml
objective:
  name: token_grpo
  kwargs:
    reward_mode: weighted_length_penalty
    enable_length_penalty: true
    correctness_weight: 1.0
    length_penalty_lambda: 0.5
    length_penalty_interaction: correctness_length
    strict_format_gate: true
    non_strict_penalty: -1.0
rewards:
  kwargs:
    length_penalty_reward:
      tokenizer_name: meta-llama/Llama-3.2-3B-Instruct  # example
      group_size: 4  # often matched to generation.num_generations
      cache_dir: cache/models
```
## Hackable extension path

- Implement a new objective class with `reward_names()` and `combine_rewards`-equivalent behavior (your trainer must call into it the same way as the baseline), or add new `@register_reward` functions.
- Point the config at a custom class: `objective.class_path: "my_package.my_module:MyObjective"`.
- `LatentNeuraleseObjective` is a stub: keep the token rewards while you add latent signals via `extra_reward` or a future trainer hook.
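The decorator/registry pattern is simple enough to sketch standalone. This mirrors the shape of `registry.py` but is not the repo's code, and `constant_bonus_reward` is a made-up reward used only for illustration:

```python
import functools

_REWARDS = {}

def register_reward(name):
    """Decorator that stores a reward function under a string name."""
    def deco(fn):
        _REWARDS[name] = fn
        return fn
    return deco

def build_reward(name, **static_kwargs):
    """Partially apply YAML kwargs so the reward gets static options."""
    return functools.partial(_REWARDS[name], **static_kwargs)

@register_reward("constant_bonus_reward")  # hypothetical example reward
def constant_bonus_reward(prompts, completions, bonus=0.1):
    return [bonus for _ in completions]
```

A trainer built this way can look rewards up by the names listed in YAML, with `build_reward` baking in static options like `tokenizer_name` or `group_size`.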
## Quickstart

- `pip install -r requirements.txt` (you may need a CUDA-matched Torch build and an optional FlashAttention build).
- Set `HF_TOKEN` / `WANDB_API_KEY` if needed (names are configurable under `auth` in YAML).
- Single GPU: `bash scripts/run_grpo.sh` (requires `src/train_grpo.py` and `configs/grpo_llama32_3b_bf16.yaml` to be present).
- Many scripts default to offline Hub/datasets/W&B; override `WANDB_MODE`, `HF_HUB_OFFLINE`, etc. if you need network access.
- Runtime outputs default to `cache/artifacts/...` through `storage.cache_dir`; agent docs live in `AGENTS.md` and `docs/repository-map.md`.
## Attention backend

`backends.py` prefers FlashAttention2 when `flash_attn` imports cleanly; otherwise it falls back to SDPA. Llama models optionally use Liger kernels when installed.
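The fallback order can be sketched as below. This is a simplification under the assumption that backend choice reduces to an import probe; the real `backends.py` also handles Liger patches and config overrides:

```python
def pick_attn_implementation():
    """Return the attn_implementation string to pass to the model loader."""
    try:
        import flash_attn  # noqa: F401  # probe: is FlashAttention2 installed?
        return "flash_attention_2"
    except ImportError:
        return "sdpa"  # PyTorch scaled-dot-product attention fallback
```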