# Neuralese / hackable GRPO baseline

This repository is a hackable GRPO-style training baseline for math reasoning with a fixed chain-of-thought format: reasoning inside `<redacted_thinking>...</redacted_thinking>` and a final `\boxed{...}` answer. The design keeps objectives, rewards, and data providers in small pluggable modules so you can experiment without rewriting a monolithic trainer.
## Repository layout

| Path | Role |
|---|---|
| `src/hackable/` | Core library: config types, registries, objectives, rewards, data providers, model loading. |
| `src/train_grpo.py` | Training entrypoint expected by `scripts/*.sh` (may be absent in a partial checkout; scripts default to `--config configs/grpo_llama32_3b_bf16.yaml`). |
| `docs/repository-map.md` | Maintainer-focused map of entrypoints, modules, and storage conventions. |
| `AGENTS.md` | Agent-facing workflow and folder contract for future repo changes. |
| `configs/` | YAML experiment configs (referenced by scripts; may need to be added locally). |
| `scripts/` | Bash launchers for single/multi-GPU training, sweeps, and evaluation. |
| `cache/` | Runtime storage root: `datasets/`, `models/`, `hf/`, `artifacts/{runs,sweeps,eval}/`, and `logs/wandb/`. |
| `requirements.txt` | Python dependencies (Torch, TRL, Transformers, Accelerate, etc.). |
| `note.txt` | Unrelated environment/pip noise from a past install; not project documentation. |
## Storage conventions

- `storage.cache_dir` is the canonical runtime root. Relative runtime paths like `artifacts/runs/grpo-llama32-3b` resolve under that directory.
- Use `cache/datasets` for the dataset cache, `cache/models` for the model/tokenizer cache, and `cache/logs/wandb` for W&B logs.
- Training runs live in `cache/artifacts/runs`, sweep outputs in `cache/artifacts/sweeps`, and eval outputs in `cache/artifacts/eval`.
- Permanent checkpoints now live under each run at `checkpoints/permanent/`.
## `src/hackable/` modules

- `config.py` – Loads YAML into typed dataclasses (`ExperimentConfig`, `ModelConfig`, `TrainerConfig`, `GenerationConfig`, `ObjectiveConfig`, `RewardsConfig`, …). Normalizes optimizer/scheduler aliases and numeric fields.
- `registry.py` – `@register_data_provider`, `@register_reward`, and `@register_objective` decorators plus `build_*` factories. Reward kwargs from YAML are partially applied via `build_reward`, so each reward can receive static options (e.g. a tokenizer name for the length penalty).
- `interfaces.py` – `TrainingSample` and protocol shapes for providers/rewards/objectives.
- `objectives.py` – `TokenGRPOObjective` (the main token-level GRPO recipe) and `LatentNeuraleseObjective` (a stub: format reward only, `extra_reward` no-op for future latent scoring).
- `reward_plugins.py` – Registered rewards: strict format, GSM8K/MATH-style correctness, length penalty, and optional token-utilisation shaping. See "Length penalty rewards" below.
- `data_plugins.py` – GSM8K, Hendrycks MATH by level, and interleaved curricula (`gsm8k_math_curriculum`, etc.). A shared prompt prefix matches the strict completion format expected by the rewards.
- `backends.py` – Loads causal LMs with Liger Llama patches when applicable, FlashAttention2 when importable, else SDPA. Exposes `generation_kwargs` from config.
- `utils.py` – `import_from_path` for `objective.class_path`-style `"module:Class"` loading.

Importing `hackable` registers the default plugins (`hackable/__init__.py` imports `data_plugins`, `objectives`, and `reward_plugins`).
## `src/` evaluation utilities

- `eval_sweep_models.py` – Distributed evaluation of every `run_*` directory under a sweep root: loads checkpoints, runs GSM8K test generations, scores correctness, records CoT word-length stats, and writes CSV/JSON summaries (used after lambda or reward-variant sweeps).
- `eval_permanent_checkpoints.py` – Walks `checkpoints/permanent` folders, evaluates each checkpoint, and can emit simple SVG learning curves.
- `eval_math_level1_thinking_zeroshot.py` – Zero-shot / thinking-format eval on MATH-style data with JSONL output (for downstream rewards or analysis).
## `scripts/` (high level)

| Script | Purpose |
|---|---|
| `run_grpo.sh`, `run_grpo_2gpu.sh`, `run_grpo_4gpu.sh`, `run_grpo_8gpu.sh` | Launch training with Accelerate. |
| `resume_grpo_8gpu.sh` | Resume from the latest or an explicit checkpoint. |
| `sweep_length_penalty_lambda.sh` | Trains multiple runs with different `length_penalty_lambda` values (weighted length-penalty mode). |
| `run_reward_variants_and_eval.sh` | Trains three interaction/gating variants, then runs `eval_sweep_models.py`. |
| `run_twostage_correctness1.sh`, `run_twostage_correctness5.sh` | Two-stage schedules; YAML is expected to set `correctness_weight` and optionally stage-2 length-penalty fields (see below). |
| `run_lambda_0p1_existing_gate_token_util.sh` | Example run with a low λ and `token_utilisation_reward` enabled. |
| `eval_sweep_models_offline.sh` | Offline eval driver for a length-penalty λ sweep directory. |
| `eval_length_penalty_ablation_offline.sh` | Launches `src/eval_length_penalty_ablation.py` (the script must exist alongside the training code). |
| `eval_twostage_permanent_checkpoints.sh` | Eval for two-stage permanent checkpoint trees. |
| `eval_gsm8k_*.sh`, `eval_math_level*_*.sh` | Dataset-specific eval launchers. |
| `hf_upload_repo.py`, `hf_download_repo.py` | Push/pull Hugging Face dataset repo snapshots. |
## Length penalty rewards

Training prefers shorter thinking traces within each GRPO group (same prompt, multiple sampled completions). The signal is implemented as a reward (`length_penalty_reward`) and combined with correctness and format either additively or via a weighted multiplicative term controlled by `TokenGRPOObjective`.
### What gets measured

- **Strict format only** – `_think_length_tokens` in `reward_plugins.py` parses the completion with the same regex as `format_tag_reward`: a single `<redacted_thinking>...</redacted_thinking>` block followed by `\boxed{...}`. If the completion does not match, the thinking length is treated as 0 (so the length reward does not reward malformed outputs on length grounds).
- **Token count** – If `tokenizer_name` is passed (via `rewards.kwargs.length_penalty_reward` in YAML and `build_reward`), the thinking substring is encoded with that tokenizer (`add_special_tokens=False`) and the length is the number of token IDs. If no tokenizer is configured, the length falls back to whitespace-split words inside the thinking block.
- **Per-group normalization** – For each group (G) of completions (see grouping below), let (L_i) be the thinking length of completion (i), and (\bar{L} = \frac{1}{|G|}\sum_j L_j). If (\bar{L} \le 0), every score in the group is 0. Otherwise:

[ R^{\mathrm{length}}_i = \max\left(0,\ 1 - \frac{L_i}{\bar{L}}\right). ]
So shorter-than-average thinking in the group scores closer to 1, while any completion at or above the group average scores exactly 0 (the reward is clamped at zero). This is a relative length preference, not an absolute token budget.
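The per-group normalization can be sketched in a few lines. This is an illustration, not the repo's code; the helper name `length_penalty_scores` is hypothetical (see `reward_plugins.py` for the real implementation):

```python
def length_penalty_scores(lengths):
    """Score one GRPO group: max(0, 1 - L_i / mean(L)).

    `lengths` are the thinking-trace lengths (tokens or words)
    of the completions in a single comparison group.
    """
    if not lengths:
        return []
    mean_len = sum(lengths) / len(lengths)
    if mean_len <= 0:  # e.g. every completion malformed -> length 0
        return [0.0 for _ in lengths]
    return [max(0.0, 1.0 - l / mean_len) for l in lengths]
```

With lengths `[10, 20, 30]` the group mean is 20, so the scores are `[0.5, 0.0, 0.0]`: only shorter-than-average traces earn credit.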
### How groups are formed

`length_penalty_reward` assigns the same group normalization to completions that belong to the same GRPO comparison group:

- If `group_size` is set (again, typically under `rewards.kwargs.length_penalty_reward`), the flat batch is chunked in order: `[0:group_size)`, `[group_size:2*group_size)`, …
- Otherwise, groups are inferred from contiguous runs of identical `prompt` text in the parallel lists passed to the reward.
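Both grouping strategies are small enough to sketch standalone. The function name `infer_groups` is illustrative, not from the repo:

```python
def infer_groups(prompts, group_size=None):
    """Return index lists forming GRPO comparison groups (sketch).

    If group_size is given, chunk the flat batch in order; otherwise
    group contiguous runs of identical prompt text.
    """
    n = len(prompts)
    if n == 0:
        return []
    if group_size:
        return [list(range(i, min(i + group_size, n)))
                for i in range(0, n, group_size)]
    groups, current = [], [0]
    for i in range(1, n):
        if prompts[i] == prompts[i - 1]:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups
```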
### How `TokenGRPOObjective` combines length with other rewards

Configured under `objective.name: token_grpo` with `objective.kwargs` (see `src/hackable/objectives.py`).

Registered reward names (when enabled):

- `format_tag_reward` – 1.0 for strict thinking tags plus a non-empty boxed answer, else 0.0.
- `gsm8k_correctness_reward` – Parses `\boxed{...}` against the reference (GSM8K `####` answers or MATH boxed solutions), with numeric normalization and a tolerant float compare.
- `length_penalty_reward` – registered if `enable_length_penalty: true`.
- `token_utilisation_reward` – optional; shapes training against a frozen zero-shot correctness JSONL (see the docstring in `reward_plugins.py`).
### `reward_mode`

- `additive` – The total score is the sum of all enabled reward outputs for that sample. If `strict_format_gate: true`, any sample with `format_tag_reward` ≤ 0.5 has its sum replaced by `non_strict_penalty` (default -1.0).
- `weighted_length_penalty` – Correctness and length interact multiplicatively; format is added (and optionally multiplied into the interaction):
  - Let (r_c, r_f, r_\ell) be the correctness, format, and length scores in ([0,1]) (length may be 0 if disabled or malformed).
  - Base interaction: (r_c \times r_\ell).
  - If `length_penalty_interaction` is `correctness_length_format`, the interaction is (r_c \times r_\ell \times r_f).
  - Total (before the optional token-util term):

    [ \texttt{correctness_weight} \cdot r_c + \texttt{length_penalty_lambda} \cdot \text{interaction} + r_f + r_{\mathrm{util}}. ]

  - If `strict_format_gate` is true and (r_f \le 0.5), the total is `non_strict_penalty` and the formula above is skipped.
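The weighted combination for a single sample can be sketched as follows. Parameter names mirror the YAML kwargs, but this is a simplified illustration, not the repo's `combine_rewards`:

```python
def combine_weighted(r_c, r_f, r_len, r_util=0.0,
                     correctness_weight=1.0,
                     length_penalty_lambda=0.5,
                     interaction="correctness_length",
                     strict_format_gate=True,
                     non_strict_penalty=-1.0):
    """Sketch of the weighted_length_penalty total for one sample."""
    # Hard format gate: malformed samples get the penalty, nothing else.
    if strict_format_gate and r_f <= 0.5:
        return non_strict_penalty
    inter = r_c * r_len
    if interaction == "correctness_length_format":
        inter *= r_f  # format enters the multiplicative term too
    return correctness_weight * r_c + length_penalty_lambda * inter + r_f + r_util
```

For example, with perfect correctness and format and a length score of 0.5 at λ = 0.5, the total is 1.0 + 0.25 + 1.0 = 2.25; a malformed sample returns -1.0 regardless of the other scores.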
### Important knobs

| Kwarg | Meaning |
|---|---|
| `enable_length_penalty` | When false, the length reward is not registered; in weighted mode (r_\ell) is treated as 0. |
| `length_penalty_lambda` | Scales the (r_c \cdot r_\ell) (or (r_c \cdot r_\ell \cdot r_f)) term in weighted mode. Sweeps often try e.g. 0.25, 0.5, 0.75, 1.0. |
| `correctness_weight` | Scales the standalone correctness term in weighted mode. |
| `length_penalty_interaction` | `correctness_length` vs `correctness_length_format` (whether format enters the product). |
| `strict_format_gate` / `non_strict_penalty` | Hard gate on format before crediting other terms. |
| `stage2_length_penalty_lambda`, `stage2_start_epoch` | Stored on `TokenGRPOObjective` for two-stage schedules; changing λ at epoch boundaries must be implemented in the trainer (`combine_rewards` itself only reads the current `length_penalty_lambda` on the instance). |
YAML wiring example (illustrative; paths depend on your repo):
```yaml
objective:
  name: token_grpo
  kwargs:
    reward_mode: weighted_length_penalty
    enable_length_penalty: true
    correctness_weight: 1.0
    length_penalty_lambda: 0.5
    length_penalty_interaction: correctness_length
    strict_format_gate: true
    non_strict_penalty: -1.0
rewards:
  kwargs:
    length_penalty_reward:
      tokenizer_name: meta-llama/Llama-3.2-3B-Instruct  # example
      group_size: 4  # often matched to generation.num_generations
      cache_dir: cache/models
```
## Hackable extension path

- Implement a new objective class with `reward_names()` and `combine_rewards`-equivalent behavior (your trainer must call into it the same way as the baseline), or add new `@register_reward` functions.
- Point the config at a custom class: `objective.class_path: "my_package.my_module:MyObjective"`.
- `LatentNeuraleseObjective` is a stub: keep the token rewards while you add latent signals via `extra_reward` or a future trainer hook.
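The decorator/registry pattern is simple enough to sketch standalone. This mirrors the shape of `registry.py` but is not the repo's code, and `constant_bonus_reward` is a made-up reward used only for illustration:

```python
import functools

_REWARDS = {}

def register_reward(name):
    """Decorator that stores a reward function under a string name."""
    def deco(fn):
        _REWARDS[name] = fn
        return fn
    return deco

def build_reward(name, **static_kwargs):
    """Partially apply YAML kwargs so the reward gets static options."""
    return functools.partial(_REWARDS[name], **static_kwargs)

@register_reward("constant_bonus_reward")  # hypothetical example reward
def constant_bonus_reward(prompts, completions, bonus=0.1):
    return [bonus for _ in completions]
```

A trainer built this way can look rewards up by the names listed in YAML, with `build_reward` baking in static options like `tokenizer_name` or `group_size`.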
## Quickstart

- `pip install -r requirements.txt` (you may need a CUDA-matched Torch build and an optional FlashAttention build).
- Set `HF_TOKEN` / `WANDB_API_KEY` if needed (names are configurable under `auth` in YAML).
- Single GPU: `bash scripts/run_grpo.sh` (requires `src/train_grpo.py` and `configs/grpo_llama32_3b_bf16.yaml` to be present).
- Many scripts default to offline Hub/datasets/W&B; override `WANDB_MODE`, `HF_HUB_OFFLINE`, etc. if you need network access.
- Runtime outputs default to `cache/artifacts/...` through `storage.cache_dir`; agent docs live in `AGENTS.md` and `docs/repository-map.md`.
## Attention backend

`backends.py` prefers FlashAttention2 when `flash_attn` imports cleanly; otherwise it falls back to SDPA. Llama models optionally use Liger kernels when installed.
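The fallback order can be sketched as below. This is a simplification under the assumption that backend choice reduces to an import probe; the real `backends.py` also handles Liger patches and config overrides:

```python
def pick_attn_implementation():
    """Return the attn_implementation string to pass to the model loader."""
    try:
        import flash_attn  # noqa: F401  # probe: is FlashAttention2 installed?
        return "flash_attention_2"
    except ImportError:
        return "sdpa"  # PyTorch scaled-dot-product attention fallback
```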