# Adversarial Self-Play Training (Kimi-Style + TRL)

This repository now includes a code scaffold for alternating adversarial self-play with Hugging Face TRL.

## Goal

Train two policies in alternating rounds:

- Generator policy: proposes hard OSINT tasks (question + answer + supporting edges).
- Answerer policy: solves tasks proposed by the generator.

The loop is intended to move from static evaluation toward on-policy co-evolution.

## Kimi-style Objective Mapping

The implementation maps the requested Kimi-style ingredients onto TRL GRPO as follows:

- Grouped rollouts: `num_generations` in each GRPO phase.
- Relative reward baseline: GRPO group-relative advantages.
- Clipped policy updates: `epsilon` clipping in GRPO objective.
- KL/reference regularization: `beta` in GRPOConfig.
- Token-level online RL behavior: GRPO online generation with reward functions.
- Toggle schedule: explicit alternating generator and answerer rounds.
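
As a rough sketch of how these knobs surface on the TRL side (assuming a recent TRL release where `GRPOConfig` exposes `num_generations`, `epsilon`, and `beta`; the values below are illustrative, not this repository's defaults):

```python
from trl import GRPOConfig

# Illustrative values only; the real settings come from the self-play config.
grpo_config = GRPOConfig(
    output_dir="artifacts/self_play/round_000/answerer",
    num_generations=8,    # grouped rollouts per prompt -> group-relative baseline
    epsilon=0.2,          # clip range for the policy-ratio update
    beta=0.04,            # KL coefficient against the frozen reference policy
    max_completion_length=512,
    per_device_train_batch_size=8,
)
```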

## Topology and Scheduling Options

- `model_topology: "dual"`: train separate generator and answerer models.
- `model_topology: "shared"`: train one shared model for both roles.
	- Use `shared_model_name_or_path` to set the common base checkpoint.
- `phase_schedule: "generator_answerer"`: default two-phase loop per round.
- `phase_schedule: "answerer_generator_answerer"`: solver-first curriculum:
	1. Train answerer on current adversarial pool.
	2. Freeze that answerer snapshot while training generator against it.
	3. Train answerer again on newly generated adversarial tasks.

This directly supports the "train solver, freeze, attack, retrain solver" sequence.
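
A minimal config excerpt for the solver-first variant, written as a Python dict that mirrors the JSON keys documented above (the base checkpoint id is a hypothetical placeholder):

```python
# Keys follow the options above; values are illustrative, not defaults.
self_play_config = {
    "model_topology": "shared",                       # one model plays both roles
    "shared_model_name_or_path": "org/base-model",    # hypothetical checkpoint id
    "phase_schedule": "answerer_generator_answerer",  # train -> freeze/attack -> retrain
}
```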

## Canonical Graph Mode

- `canonical_graph_mode: "generate"` (default): generator can propose canonical graph updates in `swarm_v2`.
- `canonical_graph_mode: "fixed"`: canonical graph candidates are held fixed per prompt, so training focuses on question/answer behavior over stable graph structure.

## Tuning Modes

- `tuning_mode: "full"`: full-model GRPO fine-tuning.
- `tuning_mode: "lora"`: PEFT LoRA adapters for GRPO updates.
	- Configure via `lora` block: `r`, `alpha`, `dropout`, `target_modules`, `bias`, `task_type`.
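
The `lora` block fields map onto PEFT's `LoraConfig` roughly as follows (a sketch; the target modules shown are hypothetical and depend on the base model architecture):

```python
from peft import LoraConfig

# Sketch of the mapping from the `lora` config block onto PEFT; values illustrative.
peft_config = LoraConfig(
    r=16,                                 # `r`
    lora_alpha=32,                        # `alpha`
    lora_dropout=0.05,                    # `dropout`
    target_modules=["q_proj", "v_proj"],  # `target_modules` (model-dependent)
    bias="none",                          # `bias`
    task_type="CAUSAL_LM",                # `task_type`
)
```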

## Reward Design

### Generator (adversarial swarm)

`GeneratorRewardFunction` combines weighted components:

- Validity: checks parsable task fields and bounded support-edge size.
- Hardness: rewards questions the frozen answerer currently gets wrong.
- Diversity: penalizes near-duplicate questions via token-overlap similarity.
- Consistency: rewards edge/answer/question grounding against canonical graph context.

Weights are configurable in `generator_reward_weights`.

For `swarm_v2`, the reward now prioritizes:

- Valid, replayable task structure first.
- Hardness against the frozen answerer second.
- Diversity and compact multi-agent/shared-context usage after validity.

This avoids the degenerate regime where almost every sample is invalid and the whole batch stays negative.
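
A schematic of that prioritization, with validity acting as a gate (these scorers are simplified stand-ins, not the actual `GeneratorRewardFunction` internals, and the weight keys are only assumed to mirror `generator_reward_weights`):

```python
# Schematic stand-in for the generator reward; not the real implementation.
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of question tokens, used to flag near-duplicates."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def generator_reward(task, previous_questions, frozen_answerer_correct, weights):
    valid = bool(task.get("question")) and bool(task.get("answer"))
    if not valid:
        return -1.0  # invalid samples bottom out instead of dominating the batch
    hardness = 0.0 if frozen_answerer_correct else 1.0
    diversity = 1.0 - max(
        (token_overlap(task["question"], q) for q in previous_questions), default=0.0
    )
    return (weights["validity"]
            + weights["hardness"] * hardness
            + weights["diversity"] * diversity)
```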

### Answerer (existing reward integration)

`AnswererRewardFunction` wraps existing environment reward logic:

- Reuses `compute_answer_reward` from `src/osint_env/env/reward.py`.
- Builds transient `TaskInstance` objects from training rows.
- Preserves difficulty-aware reward behavior (`easy` / `medium` / `hard`).
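
For orientation, reward functions passed to TRL's `GRPOTrainer` are plain callables over batched completions that return one float per completion, with extra dataset columns forwarded as keyword arguments. A toy exact-match version is shown below (assuming a plain-text dataset and a hypothetical `answer` column); the real `AnswererRewardFunction` instead delegates each row to `compute_answer_reward`:

```python
# Toy illustration of the TRL reward-function interface only; the repository's
# AnswererRewardFunction delegates to compute_answer_reward instead.
def exact_match_reward(prompts, completions, answer, **kwargs):
    # `answer` is a dataset column the trainer forwards as a keyword argument.
    return [1.0 if completion.strip() == gold.strip() else 0.0
            for completion, gold in zip(completions, answer)]
```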

## Entry Points

- CLI command: `osint-env train-self-play`
- Main runner: `src/osint_env/training/self_play.py`
- Config loader: `src/osint_env/training/config.py`
- Reward functions: `src/osint_env/training/rewards.py`
- Example config: `config/self_play_training_example.json`

## Dry Run Mode

The example config sets `"dry_run": true` by default.

In dry run mode, the pipeline still:

- Materializes generator/answerer datasets per round.
- Materializes optional `answerer_pre_dataset` when using solver-first schedule.
- Produces generated-task artifacts (fallback generator path).
- Writes a full run summary.

Only the expensive GRPO updates themselves are skipped.

## Post-Training Evaluation

After a non-dry-run training job completes, the runner now writes a post-training evaluation artifact that:

- Uses the finetuned generator to create fresh evaluation questions.
- Evaluates both the finetuned answerer and the original/base answerer on those generated questions.
- Reports a `delta_vs_original` summary so you can see whether fine-tuning actually improved task success, reward, and graph F1.
- Saves the summary and episode rows under `post_training_evaluation.json`.

You can control this flow with:

- `generated_task_max_new_tokens`: decoding budget for generator-side task sampling/eval.
- `post_training_eval_questions`: how many fresh tasks to evaluate after training.
- `post_training_eval_answer_max_new_tokens`: answerer decoding budget for the final eval pass.
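
For example (a Python dict mirroring the JSON config; the budgets are illustrative, not recommended values):

```python
post_training_eval_settings = {
    "generated_task_max_new_tokens": 512,             # generator-side decoding budget
    "post_training_eval_questions": 32,               # fresh tasks to evaluate
    "post_training_eval_answer_max_new_tokens": 256,  # answerer budget for the final pass
}
```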

## Checkpoints and Final Models

Self-play outputs are written under `output_dir` (default `artifacts/self_play`) unless overridden.

Per round and phase you will now find:

- `round_XXX/<phase>/checkpoint-*`: intermediate trainer checkpoints saved every `save_steps`.
- `round_XXX/<phase>/final_model`: final saved model for that phase, with tokenizer files.
- `self_play_summary.json`: top-level run summary.
- `post_training_evaluation.json`: generated-question evaluation written after training.

If `HF_TOKEN` is available, the trainer can also mirror phase folders and summary artifacts to a Hugging Face repo. By default it derives a repo on the same account as the Space using `SPACE_ID`/`HF_SPACE_ID` and a `-checkpoints` suffix. You can override that with `OSINT_HF_CHECKPOINT_REPO_ID`.

## Compute Mode

When compute is available:

1. Install train dependencies: `python -m pip install -e ".[train]"`
2. Disable dry run (`--dry-run` off and/or `"dry_run": false` in config).
3. Run `osint-env train-self-play`, or launch a dedicated Hugging Face Job with `osint-env-launch-hf-job` if you want the Space to stay on CPU while training runs on separate GPU compute.

Outputs are written under `artifacts/self_play` unless overridden.

## Standalone Server Script

For an SSH server or other standalone machine, you can use `scripts/train_self_play_standalone.sh`.

Example:

```bash
VENV_PATH="$HOME/arl" \
INSTALL_TRAIN_DEPS=1 \
TRAIN_ENV_CONFIG_PATH="config/shared_config.json" \
TRAIN_SELF_PLAY_CONFIG_PATH="config/self_play_training_hf_l40s_full.json" \
TRAIN_SELF_PLAY_OUTPUT_DIR="artifacts/self_play_server" \
bash scripts/train_self_play_standalone.sh
```

Useful environment variables:

- `BOOTSTRAP_VENV=1`: create the virtualenv automatically if it does not exist yet.
- `TRAIN_SELF_PLAY_ROUNDS=2`: override the round count without editing JSON.
- `RUN_SELF_PLAY_DRY_RUN=1`: skip GRPO updates and only materialize artifacts.
- `TRAIN_SETUP_COMMAND='python -m pip install flash-attn --no-build-isolation'`: run any host-specific setup before training.