# Trace Net: Training Agents for Noisy, Adversarial OSINT

🤖 **Checkpoint:** [Siddeshwar1625/osint-checkpoints-final](https://huggingface.co/Siddeshwar1625/osint-checkpoints-final)

📊 **Training run (W&B):** [osint-self-play-train](https://wandb.ai/siddeshwar2004-international-institute-of-information-te/osint-self-play-train)

Most agent benchmarks are still too clean.

They assume the world is cooperative, the evidence is tidy, and the shortest path to the answer is also the most obvious one. Real OSINT is the opposite. People hide. Identities splinter across aliases. Threads derail. Posts mislead on purpose. Useful evidence is mixed with decoys, soft contradictions, and deliberate attempts to waste an investigator's time.

That is the motivation behind **Trace Net**.

Trace Net is a synthetic OSINT benchmark environment for tool-using agents that need to search, cross-link, and reason over noisy multi-platform evidence before producing an answer. Instead of rewarding pure prompt cleverness, the environment pushes agents to behave like investigators: retrieve signals, build a working graph, resolve entities, and justify the final node they select.

This repository is not just a dataset and not just a demo app. It is a full benchmark stack:

- a synthetic OSINT environment
- a tunable noise generator
- single-agent and swarm-style execution
- graph-aware reward shaping
- adversarial self-play training
- evaluation, leaderboard, and dashboard export
- a FastAPI/OpenEnv-compatible serving layer for Docker and Hugging Face Spaces

## Why we did not use MetaQA

Earlier iterations explored a MetaQA-style backend, but we deliberately moved away from it for the benchmark we wanted to build.

MetaQA is useful for classic multi-hop reasoning, but it is too clean and too structurally easy for a serious OSINT setting. Once a task becomes mostly "follow a relation chain in a cooperative knowledge base," it stops stress-testing the failure modes we actually care about:

- identity ambiguity
- noisy retrieval
- alias collision
- distractor evidence
- partial observations
- agents being baited into the wrong trail

Trace Net focuses on those harder conditions instead. The goal is not just to see whether a model can traverse hops. The goal is to test whether an agent can survive in an adversarial evidence landscape.

## A noisy world by design

The synthetic dataset is intentionally hostile.

Noise here does not mean random corruption alone. It models **users actively working to mislead the agent**. Some records act like red herrings. Some identities branch into aliases. Some traces create plausible but misleading routes through the graph. The environment can be tuned so that retrieval becomes harder, evidence becomes less direct, and the agent is forced to separate signal from manipulation.

This tunability is exposed through environment parameters such as:

- `alias_density`
- `noise_level`
- `red_herring_rate`

Those controls matter because they let the benchmark scale from manageable to punishing without changing the fundamental task structure. You are not switching domains to make the task harder. You are increasing adversarial pressure inside the same OSINT world.
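
These knobs compose naturally. As a minimal sketch (the `NoiseConfig` bundle and its field semantics are illustrative assumptions, not the project's actual API), scaling adversarial pressure without changing the task might look like:

```python
from dataclasses import dataclass

# Hypothetical parameter bundle -- field names mirror the documented knobs,
# but the real environment constructor may differ.
@dataclass
class NoiseConfig:
    alias_density: float = 0.2     # fraction of users carrying extra aliases
    noise_level: float = 0.3       # corruption applied to retrieved records
    red_herring_rate: float = 0.1  # share of planted decoy evidence

    def scaled(self, pressure: float) -> "NoiseConfig":
        """Scale all adversarial knobs together, clamped to [0, 1]."""
        return NoiseConfig(
            alias_density=min(1.0, self.alias_density * pressure),
            noise_level=min(1.0, self.noise_level * pressure),
            red_herring_rate=min(1.0, self.red_herring_rate * pressure),
        )

easy = NoiseConfig()
hard = easy.scaled(2.5)  # same world, more adversarial pressure
```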

## How the environment works

At the core is a hidden canonical graph of:

- users
- aliases
- organizations
- locations
- posts
- threads
- events

Agents never see this graph directly. They interact through a compact action space:

- `CALL_TOOL`
- `ADD_EDGE`
- `ANSWER`

Every step returns structured observations containing recent tool outputs, the current working-memory graph snapshot, recent action history, and the active task payload. That means agents are evaluated on how they investigate, not just on what final token they emit.

The tool layer exposes search and lookup primitives over synthetic microblog posts, forum threads, profiles, and memory:

- `search_posts`
- `get_post`
- `get_user_posts`
- `get_mentions`
- `search_threads`
- `get_thread`
- `get_user_activity`
- `get_profile`
- `search_people`
- `get_connections`
- `search_memory`
- `search_shared_context`

This turns every episode into a miniature OSINT workflow: gather clues, connect evidence, resist decoys, and only then commit to an answer.
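
Put together, a single episode reduces to a small loop over that action space. The sketch below uses toy mock classes for the environment and agent (the real interfaces may differ; only the three action types come from the benchmark):

```python
# Minimal sketch of a Trace Net-style episode loop. The environment and
# agent below are mocks for illustration, not the project's real API.

class MockEnv:
    """Toy environment: answering node "u42" ends the episode with reward 1."""
    def reset(self):
        return {"tool_outputs": [], "graph": {"nodes": [], "edges": []}}

    def step(self, action):
        if action["type"] == "CALL_TOOL":     # e.g. search_posts, get_profile
            return {"tool_outputs": [{"tool": action["tool"], "hits": ["u42"]}],
                    "graph": {"nodes": ["u42"], "edges": []}}
        if action["type"] == "ADD_EDGE":      # grow the working-memory graph
            return {"tool_outputs": [],
                    "graph": {"nodes": ["u42"], "edges": [action["edge"]]}}
        # ANSWER commits to a final node and terminates the episode
        return {"done": True, "reward": 1.0 if action["node"] == "u42" else 0.0}

class MockAgent:
    """Scripted policy: search, link evidence, then answer the found node."""
    def __init__(self):
        self.plan = iter([
            {"type": "CALL_TOOL", "tool": "search_posts", "query": "alias"},
            {"type": "ADD_EDGE", "edge": ("u42", "alias_of", "u7")},
            {"type": "ANSWER", "node": "u42"},
        ])
    def act(self, obs):
        return next(self.plan)

def run_episode(env, agent, max_steps=10):
    obs = env.reset()
    for _ in range(max_steps):
        result = env.step(agent.act(obs))
        if result.get("done"):
            return result
        obs = result
    return {"done": True, "reward": 0.0}

outcome = run_episode(MockEnv(), MockAgent())
```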

## Multi-agent interaction is a core feature, not a gimmick

One of the strongest ideas in Trace Net is that it treats multi-agent reasoning as an explicit systems problem.

The repository includes both a single-agent runner and a swarm-style orchestrator. In swarm mode, lightweight specialist roles such as **explorer**, **linker**, and **reasoner** coordinate over the same episode. Each role contributes a different kind of progress:

- explorers widen the search frontier
- linkers turn evidence into candidate relations
- reasoners consolidate partial findings into answerable structure

This matters because OSINT naturally decomposes into parallel subproblems. One path follows a person. Another resolves an alias. Another checks whether an event trace is real or planted. A single monolithic agent can do all of that serially, but the benchmark becomes much more interesting when we ask whether a system can split the work, use breadth efficiently, and still converge on the right graph.

Trace Net bakes that into the reward story as well. The swarm runner records spawn count, finished subtasks, critical steps, breadth, and depth. In other words, coordination itself becomes measurable.
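
A coordination record of that kind can be sketched as a simple structure (the field names mirror the quantities listed above, but the actual schema is an assumption):

```python
from dataclasses import dataclass

# Hypothetical swarm-coordination metrics record for one episode.
@dataclass
class SwarmStats:
    spawned: int         # subagents spawned
    finished: int        # subtasks completed
    critical_steps: int  # steps on the eventual answer path
    breadth: int         # parallel frontiers explored
    depth: int           # longest investigation chain

    @property
    def finish_rate(self) -> float:
        """Fraction of spawned subtasks that actually completed."""
        return self.finished / max(self.spawned, 1)

stats = SwarmStats(spawned=6, finished=5, critical_steps=9, breadth=3, depth=4)
```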

## Adversarial self-play is where the benchmark gets dangerous

Trace Net does not stop at evaluation. It includes a scaffold for **adversarial self-play training** built around Hugging Face TRL and the **GRPO** algorithm.

The loop alternates between two roles:

1. a **generator** policy that proposes difficult OSINT tasks
2. an **answerer** policy that tries to solve them

That setup is powerful because it creates pressure from both sides. The generator is rewarded for producing tasks that are valid, grounded, diverse, and hard for the current answerer. The answerer is rewarded using the same environment-native graph-and-answer objectives used during benchmark evaluation.

This is not just hype. The training loop has concrete mathematical logic behind it:

- grouped rollouts for relative comparison
- mean-centered reward baselines through GRPO
- KL-controlled policy updates
- explicit hardness terms against a frozen answerer
- replay validation for generated tasks
- shared-context pressure and swarm diversity terms in `swarm_v2`
- solver-side PARL-style orchestration shaping inspired by **Kimi K2.5**

That means the benchmark can evolve from a static evaluation set into a co-evolving curriculum of adversarial traces. The generator learns how to expose weaknesses. The answerer learns how to survive them.

For OSINT-style agents, that is exactly the kind of training pressure we want.

## Reward design with mathematical grounding

The most important reward story in this repository is the one used during **adversarial self-play training**.

Training uses the **GRPO algorithm** through Hugging Face TRL. That means optimization is driven by grouped rollouts, relative reward comparison inside each group, clipped updates, and KL-regularized policy improvement rather than plain supervised fine-tuning.

In the self-play setting, the generator and solver have different reward functions.

For the **generator**, the training reward is a weighted objective over four core terms:

$$
R_{\text{gen}} =
w_v\, R_{\text{validity}} +
w_h\, R_{\text{hardness}} +
w_d\, R_{\text{diversity}} +
w_c\, R_{\text{consistency}}
$$

where:

- $R_{\text{validity}}$ checks that the proposed task is well-formed and bounded
- $R_{\text{hardness}}$ is higher when the frozen solver fails the generated task
- $R_{\text{diversity}}$ penalizes near-duplicate generations
- $R_{\text{consistency}}$ rewards graph-grounded, replayable tasks
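
Numerically, the objective is just a weighted sum. In the sketch below, the weight values are illustrative assumptions, not the project's configuration:

```python
# Sketch of the generator reward R_gen as a weighted sum of the four terms.
# The default weights here are illustrative assumptions.
def generator_reward(validity, hardness, diversity, consistency,
                     w_v=0.25, w_h=0.40, w_d=0.20, w_c=0.15):
    """R_gen = w_v*R_validity + w_h*R_hardness + w_d*R_diversity + w_c*R_consistency"""
    return (w_v * validity + w_h * hardness
            + w_d * diversity + w_c * consistency)

# A valid, hard, moderately novel, replayable task scores well:
r = generator_reward(validity=1.0, hardness=0.8, diversity=0.5, consistency=1.0)
```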

In `swarm_v2`, this goes one step further: invalid or non-replayable generations are hard-gated by validation, and the reward then emphasizes replayability, hardness, swarm diversity, and shared-context pressure. This is what keeps the generator from gaming training by emitting flashy but unusable tasks.

For the **solver**, the training reward reuses the environment-native answer reward, but in the self-play pipeline it is explicitly framed as a solver-side objective for adversarial traces. The solver reward is also influenced by the **Kimi K2.5** paper through the project’s PARL-style shaping for multi-agent orchestration. In practice, that means solver training is not only about getting the final answer right, but also about coordinating useful work across the swarm.

The PARL-style orchestration term follows the project’s Kimi-inspired formulation:

$$
r_{\text{PARL}} = r_{\text{perf}} + \lambda_1\, r_{\text{parallel}} + \lambda_2\, r_{\text{finish}} + r_{\text{latency}}
$$
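
As a one-function sketch with illustrative $\lambda$ values (assumptions, not the project's settings):

```python
# Sketch of the PARL-style orchestration term. lam1/lam2 are illustrative;
# r_latency is typically a (negative) penalty for slow serial execution.
def parl_reward(r_perf, r_parallel, r_finish, r_latency, lam1=0.1, lam2=0.1):
    """r_PARL = r_perf + lam1*r_parallel + lam2*r_finish + r_latency"""
    return r_perf + lam1 * r_parallel + lam2 * r_finish + r_latency
```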

The solver's final reward therefore combines the following components:

- output format validity and exact correctness
- knowledge-carrier and knowledge-indexing utility
- connectivity and supporting-edge F1 against task support edges
- efficiency and compactness penalties
- relation/entity informativeness and repetition control (difficulty-dependent)

This gives the solver-side swarm reward a strong systems flavor: the policy is encouraged to solve the task, but also to do so with effective parallel decomposition instead of brittle serial wandering.

Because training runs under **GRPO**, these rewards are used inside a relative-advantage setting:

- grouped rollouts provide comparison sets
- rewards are contrasted within the group
- KL terms stabilize policy updates
- generator hardness is measured against a frozen solver
- solver improvement is evaluated under the same adversarial pressure that the generator creates

That is the key design choice: the reward is not just scoring answers after the fact. It is shaping a co-evolutionary game between task proposer and task solver.
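
The group-relative baseline at the heart of GRPO fits in a few lines. This sketch uses the common formulation (mean-center each group's rewards and normalize by the group standard deviation):

```python
import statistics

# GRPO-style group-relative advantages: each rollout's reward is compared
# against the mean of its own group, so updates reflect relative quality.
def group_advantages(rewards, eps=1e-8):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: the best one gets a positive advantage,
# the worst a symmetric negative one, average rollouts ~0.
adv = group_advantages([0.2, 0.8, 0.5, 0.5])
```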

## Serving, evaluation, and reproducibility

The engineering story is just as compelling as the benchmark design.

The repository provides:

- a `src/` package layout
- CLI commands for `demo`, `eval`, `benchmark`, `leaderboard`, `benchmark-sweep`, `viz`, and `train-self-play`
- artifact outputs for evaluation and baselines
- dashboard export
- a FastAPI server with OpenEnv-style HTTP endpoints
- Docker and Hugging Face Space readiness

This makes Trace Net easy to use in several modes:

- local development
- repeatable benchmarking
- hosted interactive demos
- self-play training runs
- remote evaluation via HTTP

It also means the project is already structured for iteration instead of being locked into a one-off benchmark release.

## Results

The repository already includes reward visualizations and tracking artifacts that make the training story much more concrete.

**Answer reward shaping**

![Answer reward design](https://github.com/RitishShrirao/OSINT_env/blob/main/assets/answer_reward.png?raw=1)

This view highlights that final scoring is not a single accuracy scalar. It combines correctness with graph utility, evidence quality, and efficiency so agents are rewarded for building useful investigative structure.

**Generator reward shaping**

![Generator reward design](https://github.com/RitishShrirao/OSINT_env/blob/main/assets/generator_reward.png?raw=1)

The generator side is where adversarial pressure becomes explicit: validity, hardness, diversity, and consistency work together so the task proposer cannot win by generating nonsense, only by generating hard but replayable traces.

**KL tracking during self-play**

![KL tracking](https://github.com/RitishShrirao/OSINT_env/blob/main/assets/kl.png?raw=1)

KL tracking matters because adversarial training is only useful when updates remain stable. Monitoring KL helps ensure the policies are learning under pressure rather than collapsing into degenerate behavior.

**Checkpoint comparison**

These comparisons were run on queries produced by the trained generator model:
- Finetuned checkpoint: `task_success_rate=0.875`, `avg_reward=0.8996`
- Base model `Qwen/Qwen2.5-0.5B-Instruct`: `task_success_rate=0.0`, `avg_reward=0.5196`
- Delta: `+0.875 success`, `+0.3800 avg reward`

These numbers make the improvement legible at a glance. The finetuned agent moves from zero task success to a strong success rate under the benchmark’s adversarial setting, while also increasing average reward substantially.

## Why Trace Net is exciting

Trace Net is exciting because it pushes agent evaluation closer to the real difficulty profile of OSINT:

- evidence is incomplete
- some actors are deceptive by design
- retrieval can be baited
- graph construction matters
- parallel investigation is valuable
- harder tasks should emerge adversarially, not just be hand-written

A lot of benchmarks ask whether a model can answer. Trace Net asks whether a system can **investigate under pressure**.

That shift is the whole point.

## Quick start

Install locally:

```bash
python -m pip install -e .
```

Run a demo episode:

```bash
osint-env demo --agent-mode swarm --llm-provider mock
```

Run a short benchmark:

```bash
osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
```

Run the release validation gate:

```bash
python scripts/validate_release.py
```

## Final thoughts

Trace Net combines synthetic world-building, adversarial noise, multi-agent coordination, mathematically shaped rewards, and self-play training into one benchmark stack. The result is a far more realistic stress test for OSINT-style agents than clean multi-hop QA can provide.

If the future of agent evaluation is not just "can it answer?" but "can it coordinate, investigate, resist deception, and improve under adversarial pressure?", then Trace Net is pointed in exactly the right direction.