OSINT1

Paused

OSINT1 / docs /adversarial_self_play.md

siddeshwar-kagatikar

Deploy clean snapshot to Hugging Face Space.

db4fa53 13 days ago

3.89 kB

Adversarial Self-Play Training (Kimi-Style + TRL)

This repository now includes a code scaffold for alternating adversarial self-play with Hugging Face TRL.

Train two policies in alternating rounds:

Generator policy: proposes hard OSINT tasks (question + answer + supporting edges).
Answerer policy: solves tasks proposed by the generator.

The loop is intended to move from static evaluation toward on-policy co-evolution.

The implementation maps the requested Kimi-style ingredients onto TRL GRPO as follows:

model_topology: "dual": train separate generator and answerer models.
model_topology: "shared": train one shared model for both roles.
- Use shared_model_name_or_path to set the common base checkpoint.
phase_schedule: "generator_answerer": default two-phase loop per round.
phase_schedule: "answerer_generator_answerer": solver-first curriculum:
1. Train answerer on current adversarial pool.
2. Freeze that answerer snapshot while training generator against it.
3. Train answerer again on newly generated adversarial tasks.

This directly supports the "train solver, freeze, attack, retrain solver" sequence.

canonical_graph_mode: "generate" (default): generator can propose canonical graph updates in swarm_v2.
canonical_graph_mode: "fixed": canonical graph candidates are held fixed per prompt, so training focuses on question/answer behavior over stable graph structure.

tuning_mode: "full": full-model GRPO fine-tuning.
tuning_mode: "lora": PEFT LoRA adapters for GRPO updates.
- Configure via lora block: r, alpha, dropout, target_modules, bias, task_type.

GeneratorRewardFunction combines weighted components:

Validity: checks parsable task fields and bounded support-edge size.
Hardness: rewards questions the frozen answerer currently gets wrong.
Diversity: penalizes near-duplicate questions via token-overlap similarity.
Consistency: rewards edge/answer/question grounding against canonical graph context.

Weights are configurable in generator_reward_weights.

AnswererRewardFunction wraps existing environment reward logic:

The example config sets dry_run: true by default.

In dry run mode, the pipeline still:

But it skips expensive GRPO updates.

When compute is available:

Outputs are written under artifacts/self_play unless overridden.