# Adversarial Self-Play Training (Kimi-Style + TRL)
This repository now includes a code scaffold for alternating adversarial self-play with Hugging Face TRL.
## Goal
Train two policies in alternating rounds:
- Generator policy: proposes hard OSINT tasks (question + answer + supporting edges).
- Answerer policy: solves tasks proposed by the generator.
The loop is intended to move from static evaluation toward on-policy co-evolution.
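Schematically, one round of the loop looks like the sketch below; every function in it is a placeholder stub, not the actual API in `src/osint_env/training/self_play.py`.

```python
# Schematic outer loop only: all functions below are placeholder stubs,
# not the real runner in src/osint_env/training/self_play.py.
from typing import Any, List


def snapshot(policy: Any) -> Any:
    """Placeholder: return a frozen copy of the current answerer policy."""
    return policy


def train_generator_round(generator: Any, frozen_answerer: Any) -> List[dict]:
    """Placeholder: one GRPO phase for the generator; returns proposed tasks."""
    return [{"question": "…", "answer": "…", "support_edges": []}]


def train_answerer_round(answerer: Any, task_pool: List[dict]) -> None:
    """Placeholder: one GRPO phase for the answerer on the adversarial pool."""


def run_alternating_rounds(generator: Any, answerer: Any, num_rounds: int = 3) -> None:
    task_pool: List[dict] = []
    for _ in range(num_rounds):
        frozen = snapshot(answerer)                    # attack a fixed target
        task_pool += train_generator_round(generator, frozen)
        train_answerer_round(answerer, task_pool)      # retrain the solver
```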
## Kimi-style Objective Mapping
The implementation maps the Kimi-style ingredients onto TRL GRPO as follows:
- Grouped rollouts: `num_generations` in each GRPO phase.
- Relative reward baseline: GRPO group-relative advantages.
- Clipped policy updates: `epsilon` clipping in the GRPO objective.
- KL/reference regularization: `beta` in `GRPOConfig`.
- Token-level online RL behavior: GRPO online generation with reward functions.
- Toggle schedule: explicit alternating generator and answerer rounds.
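As a rough illustration of this mapping, a single GRPO phase might be configured as follows; argument names follow recent TRL releases and can differ between versions, and all values here are placeholders rather than the repo's settings.

```python
# Illustrative GRPOConfig for one answerer phase; values are examples only.
# num_generations, beta, and epsilon map to the grouped-rollout,
# KL-regularization, and clipping ingredients listed above.
from trl import GRPOConfig

answerer_grpo_args = GRPOConfig(
    output_dir="artifacts/self_play/answerer",
    num_generations=8,          # grouped rollouts per prompt (relative baseline)
    beta=0.04,                  # KL penalty against the reference model
    epsilon=0.2,                # clip range for the policy-ratio update
    max_completion_length=512,
    per_device_train_batch_size=8,
    learning_rate=1e-6,
)
```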
## Topology and Scheduling Options
- `model_topology: "dual"`: train separate generator and answerer models.
- `model_topology: "shared"`: train one shared model for both roles.
  - Use `shared_model_name_or_path` to set the common base checkpoint.
- `phase_schedule: "generator_answerer"`: default two-phase loop per round.
- `phase_schedule: "answerer_generator_answerer"`: solver-first curriculum:
  - Train the answerer on the current adversarial pool.
  - Freeze that answerer snapshot while training the generator against it.
  - Train the answerer again on newly generated adversarial tasks.
This directly supports the "train solver, freeze, attack, retrain solver" sequence.
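For illustration, the corresponding fragment of a training config could look like the sketch below; the keys mirror the options above, the base model name is only an example, and `config/self_play_training_example.json` shows the real layout.

```python
# Illustrative config fragment only; see config/self_play_training_example.json
# for the actual schema. The model name is a placeholder, not a repo default.
topology_and_schedule = {
    "model_topology": "shared",                       # or "dual"
    "shared_model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct",  # example base
    "phase_schedule": "answerer_generator_answerer",  # solver-first curriculum
}
```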
## Canonical Graph Mode
- `canonical_graph_mode: "generate"` (default): the generator can propose canonical graph updates in `swarm_v2`.
- `canonical_graph_mode: "fixed"`: canonical graph candidates are held fixed per prompt, so training focuses on question/answer behavior over a stable graph structure.
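For example (illustrative fragment only):

```python
# Illustrative fragment; the key and values come from the section above.
canonical_graph = {
    # "generate" (default) lets the generator propose canonical graph updates
    # in swarm_v2; "fixed" keeps graph candidates constant per prompt.
    "canonical_graph_mode": "fixed",
}
```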
## Tuning Modes
- `tuning_mode: "full"`: full-model GRPO fine-tuning.
- `tuning_mode: "lora"`: PEFT LoRA adapters for GRPO updates.
  - Configure via the `lora` block: `r`, `alpha`, `dropout`, `target_modules`, `bias`, `task_type`.
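A sketch of the tuning section, assuming the `lora` block carries standard PEFT `LoraConfig`-style fields; all values are illustrative, not repo defaults.

```python
# Illustrative values only; field names mirror the lora block described above
# and common PEFT LoraConfig arguments.
tuning = {
    "tuning_mode": "lora",            # or "full" for full-model GRPO fine-tuning
    "lora": {
        "r": 16,
        "alpha": 32,
        "dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "bias": "none",
        "task_type": "CAUSAL_LM",
    },
}
```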
## Reward Design
### Generator (adversarial swarm)
`GeneratorRewardFunction` combines weighted components:
- Validity: checks parsable task fields and bounded support-edge size.
- Hardness: rewards questions the frozen answerer currently gets wrong.
- Diversity: penalizes near-duplicate questions via token-overlap similarity.
- Consistency: rewards edge/answer/question grounding against canonical graph context.
Weights are configurable in `generator_reward_weights`.
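A minimal sketch of a weighted combination with a token-overlap diversity term is shown below; it is a simplified stand-in, not the repo's `GeneratorRewardFunction`, and the example weights are illustrative.

```python
# Simplified stand-in for a weighted generator reward; the real
# GeneratorRewardFunction in src/osint_env/training/rewards.py differs.
from typing import Dict, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased token sets (basis for the diversity penalty)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def generator_reward(
    components: Dict[str, float],       # validity/hardness/consistency scores in [0, 1]
    question: str,
    previous_questions: List[str],
    weights: Dict[str, float],
) -> float:
    # Diversity: 1 minus the closest token-overlap match against earlier questions.
    max_sim = max((token_overlap(question, q) for q in previous_questions), default=0.0)
    scores = {**components, "diversity": 1.0 - max_sim}
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


# Example weights mirroring generator_reward_weights (values illustrative):
weights = {"validity": 0.3, "hardness": 0.4, "diversity": 0.2, "consistency": 0.1}
print(generator_reward({"validity": 1.0, "hardness": 0.5, "consistency": 0.8},
                       "who owns domain X?", ["who registered domain X?"], weights))
```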
### Answerer (existing reward integration)
`AnswererRewardFunction` wraps the existing environment reward logic:
- Reuses `compute_answer_reward` from `src/osint_env/env/reward.py`.
- Builds transient `TaskInstance` objects from training rows.
- Preserves difficulty-aware reward behavior (`easy`/`medium`/`hard`).
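Schematically, the wrapping pattern looks like the sketch below; the `TaskInstance` fields and the `compute_answer_reward` signature are assumptions for illustration and will not match `src/osint_env/env/reward.py` exactly.

```python
# Schematic wrapper only: the TaskInstance fields and compute_answer_reward
# signature below are assumed for illustration, not the repo's interfaces.
from dataclasses import dataclass


@dataclass
class TaskInstance:            # assumed minimal shape of a task row
    question: str
    answer: str
    difficulty: str            # "easy" | "medium" | "hard"


def compute_answer_reward(task: TaskInstance, completion: str) -> float:
    """Stand-in scorer: exact-match reward scaled by difficulty."""
    scale = {"easy": 0.5, "medium": 0.75, "hard": 1.0}[task.difficulty]
    return scale if completion.strip().lower() == task.answer.strip().lower() else 0.0


def answerer_reward(rows: List[dict], completions: List[str]) -> List[float]:
    """Build transient TaskInstance objects from training rows and score completions."""
    tasks = [TaskInstance(r["question"], r["answer"], r.get("difficulty", "medium"))
             for r in rows]
    return [compute_answer_reward(t, c) for t, c in zip(tasks, completions)]


from typing import List  # noqa: E402  (kept at the end only for readability of the sketch)
```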
## Entry Points
- CLI command: `osint-env train-self-play`
- Main runner: `src/osint_env/training/self_play.py`
- Config loader: `src/osint_env/training/config.py`
- Reward functions: `src/osint_env/training/rewards.py`
- Example config: `config/self_play_training_example.json`
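For orientation, a hypothetical programmatic equivalent of the CLI command is sketched below; both imported function names are guesses, so check the modules above for the real entry points.

```python
# Hypothetical only: the function names are guesses, not the verified public API.
from osint_env.training.config import load_self_play_config   # assumed name
from osint_env.training.self_play import run_self_play        # assumed name

config = load_self_play_config("config/self_play_training_example.json")
run_self_play(config)   # roughly what `osint-env train-self-play` does
```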
## Dry Run Mode
The example config sets `dry_run: true` by default.
In dry run mode, the pipeline still:
- Materializes generator/answerer datasets per round.
- Materializes the optional `answerer_pre_dataset` when using the solver-first schedule.
- Produces generated-task artifacts (fallback generator path).
- Writes a full run summary.
It skips only the expensive GRPO updates.
## Compute Mode
When compute is available:
- Install the training dependencies: `python -m pip install -e ".[train]"`
- Disable dry run (turn `--dry-run` off and/or set `"dry_run": false` in the config).
- Run `osint-env train-self-play`.
Outputs are written under `artifacts/self_play` unless overridden.