# Trace Net: Training Agents for Noisy, Adversarial OSINT

🤖 **Checkpoint:** [Siddeshwar1625/osint-checkpoints-final](https://huggingface.co/Siddeshwar1625/osint-checkpoints-final)

Most agent benchmarks are still too clean.

They assume the world is cooperative, the evidence is tidy, and the shortest path to the answer is also the most obvious one. Real OSINT is the opposite. People hide. Identities splinter across aliases. Threads derail. Posts mislead on purpose. Useful evidence is mixed with decoys, soft contradictions, and deliberate attempts to waste an investigator's time.

That is the motivation behind **Trace Net**.

Trace Net is a synthetic OSINT benchmark environment for tool-using agents that need to search, cross-link, and reason over noisy multi-platform evidence before producing an answer. Instead of rewarding pure prompt cleverness, the environment pushes agents to behave like investigators: retrieve signals, build a working graph, resolve entities, and justify the final node they select.

This repository is not just a dataset and not just a demo app. It is a full benchmark stack:

- a synthetic OSINT environment
- a tunable noise generator
- single-agent and swarm-style execution
- graph-aware reward shaping
- adversarial self-play training
- evaluation, leaderboard, and dashboard export
- a FastAPI/OpenEnv-compatible serving layer for Docker and Hugging Face Spaces

## Why we did not use MetaQA

Earlier iterations explored a MetaQA-style backend, but we deliberately moved away from it for the benchmark we wanted to build.

MetaQA is useful for classic multi-hop reasoning, but it is too clean and too structurally easy for a serious OSINT setting. Once a task becomes mostly "follow a relation chain in a cooperative knowledge base," it stops stress-testing the failure modes we actually care about:

- identity ambiguity
- noisy retrieval
- alias collision
- distractor evidence
- partial observations
- agents being baited into the wrong trail

Trace Net focuses on those harder conditions instead. The goal is not just to see whether a model can traverse hops. The goal is to test whether an agent can survive in an adversarial evidence landscape.

## A noisy world by design

The synthetic dataset is intentionally hostile.

Noise here does not just mean random corruption. It means **simulated users actively working to derail the agent**. Some records act like red herrings. Some identities branch into aliases. Some traces create plausible but misleading routes through the graph. The environment can be tuned so retrieval becomes harder, evidence becomes less direct, and the agent is forced to separate signal from manipulation.

This tunability is exposed through environment parameters such as:

- `alias_density`
- `noise_level`
- `red_herring_rate`

Those controls matter because they let the benchmark scale from manageable to punishing without changing the fundamental task structure. You are not switching domains to make the task harder. You are increasing adversarial pressure inside the same OSINT world.
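
To make that concrete, here is a minimal sketch of how such a configuration might be dialed up between runs. The `TraceNetEnv` name, its import path, and the exact preset values are illustrative assumptions; only the three parameter names above come from the environment itself:

```python
# Hypothetical difficulty presets. `TraceNetEnv` and its import path are
# assumptions for illustration; only alias_density, noise_level, and
# red_herring_rate are parameter names taken from the post.
DIFFICULTY_PRESETS = {
    "manageable": {"alias_density": 0.1, "noise_level": 0.2, "red_herring_rate": 0.05},
    "hard":       {"alias_density": 0.3, "noise_level": 0.5, "red_herring_rate": 0.20},
    "punishing":  {"alias_density": 0.6, "noise_level": 0.8, "red_herring_rate": 0.40},
}

def make_env(preset: str, seed: int = 0):
    """Build one environment at a chosen level of adversarial pressure."""
    from osint_env import TraceNetEnv  # assumed import; check the repository's src/ layout
    return TraceNetEnv(seed=seed, **DIFFICULTY_PRESETS[preset])
```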

## How the environment works

At the core is a hidden canonical graph of:

- users
- aliases
- organizations
- locations
- posts
- threads
- events

Agents never see this graph directly. They interact through a compact action space:

- `CALL_TOOL`
- `ADD_EDGE`
- `ANSWER`

Every step returns structured observations containing recent tool outputs, the current working-memory graph snapshot, recent action history, and the active task payload. That means agents are evaluated on how they investigate, not just on what final token they emit.
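
As a rough mental model, an action and a per-step observation could look like the shapes below. The field names are assumptions based on the description above, not the environment's actual schema:

```python
# Illustrative shapes only: field names are assumptions, not the real schema.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Action:
    kind: str                          # "CALL_TOOL", "ADD_EDGE", or "ANSWER"
    payload: dict[str, Any] = field(default_factory=dict)

@dataclass
class Observation:
    tool_outputs: list[dict]           # outputs of recent tool calls
    graph_snapshot: dict               # working-memory graph: nodes and edges so far
    action_history: list[Action]       # what the agent has already tried
    task: dict                         # the active task payload
```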

The tool layer exposes search and lookup primitives over synthetic microblog posts, forum threads, profiles, and memory:

- `search_posts`
- `get_post`
- `get_user_posts`
- `get_mentions`
- `search_threads`
- `get_thread`
- `get_user_activity`
- `get_profile`
- `search_people`
- `get_connections`
- `search_memory`
- `search_shared_context`

This turns every episode into a miniature OSINT workflow: gather clues, connect evidence, resist decoys, and only then commit to an answer.
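
Put together, a single episode might unfold like the sketch below. Only the action kinds and tool names are taken from the environment; the `env.step` signature and dict-based payloads are assumed stand-ins for illustration:

```python
# Hypothetical episode driver. Only CALL_TOOL / ADD_EDGE / ANSWER and the tool
# names (search_posts, get_post, get_connections, ...) come from the post; the
# step signature and payload schema are assumptions.
def investigate(env, query: str):
    env.reset()

    # 1. Widen the frontier: find posts that mention the query.
    obs = env.step({"kind": "CALL_TOOL",
                    "payload": {"tool": "search_posts", "args": {"query": query}}})
    top_hit = obs["tool_outputs"][-1][0]

    # 2. Pull the post and its author's network to cross-link evidence.
    obs = env.step({"kind": "CALL_TOOL",
                    "payload": {"tool": "get_post", "args": {"post_id": top_hit["post_id"]}}})
    author = obs["tool_outputs"][-1]["author"]
    env.step({"kind": "CALL_TOOL",
              "payload": {"tool": "get_connections", "args": {"user_id": author}}})

    # 3. Commit the relation to the working graph, then answer with a node id.
    env.step({"kind": "ADD_EDGE",
              "payload": {"src": author, "dst": top_hit["post_id"], "relation": "authored"}})
    return env.step({"kind": "ANSWER", "payload": {"node": author}})
```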

## Multi-agent interaction is a core feature, not a gimmick

One of the strongest ideas in Trace Net is that it treats multi-agent reasoning as an explicit systems problem.

The repository includes both a single-agent runner and a swarm-style orchestrator. In swarm mode, lightweight specialist roles such as **explorer**, **linker**, and **reasoner** coordinate over the same episode. Each role contributes a different kind of progress:

- explorers widen the search frontier
- linkers turn evidence into candidate relations
- reasoners consolidate partial findings into answerable structure

This matters because OSINT naturally decomposes into parallel subproblems. One path follows a person. Another resolves an alias. Another checks whether an event trace is real or planted. A single monolithic agent can do all of that serially, but the benchmark becomes much more interesting when we ask whether a system can split the work, use breadth efficiently, and still converge on the right graph.

Trace Net bakes that into the reward story as well. The swarm runner records spawn count, finished subtasks, critical steps, breadth, and depth. In other words, coordination itself becomes measurable.
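
As a sketch of what "coordination becomes measurable" can mean in practice, metrics like spawn count, breadth, and depth can be derived from a simple log of role activity. The event schema here is invented purely for illustration:

```python
# Toy coordination metrics over a hypothetical swarm event log.
# Each event: (role, subtask_id, parent_subtask_id or None, finished: bool)
from collections import defaultdict

def swarm_metrics(events):
    children = defaultdict(list)
    finished = 0
    for role, task_id, parent, done in events:
        if parent is not None:
            children[parent].append(task_id)
        finished += int(done)

    def depth(task_id):
        kids = children.get(task_id, [])
        return 1 + max((depth(k) for k in kids), default=0)

    roots = [t for _, t, parent, _ in events if parent is None]
    return {
        "spawn_count": len({t for _, t, _, _ in events}),   # distinct subtasks spawned
        "finished_subtasks": finished,
        "breadth": max((len(v) for v in children.values()), default=0),
        "depth": max((depth(r) for r in roots), default=0),
    }

events = [
    ("explorer", "t0", None, True),
    ("linker",   "t1", "t0", True),
    ("linker",   "t2", "t0", False),
    ("reasoner", "t3", "t1", True),
]
print(swarm_metrics(events))  # {'spawn_count': 4, 'finished_subtasks': 3, 'breadth': 2, 'depth': 3}
```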

## Adversarial self-play is where the benchmark gets dangerous

Trace Net does not stop at evaluation. It includes a scaffold for **adversarial self-play training** built around Hugging Face TRL and the **GRPO** algorithm.

The loop alternates between two roles:

1. a **generator** policy that proposes difficult OSINT tasks
2. an **answerer** policy that tries to solve them

That setup is powerful because it creates pressure from both sides. The generator is rewarded for producing tasks that are valid, grounded, diverse, and hard for the current answerer. The answerer is rewarded using the same environment-native graph-and-answer objectives used during benchmark evaluation.
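
Conceptually, one round of that loop looks roughly like the sketch below. Everything here is a schematic stand-in with placeholder names (the actual trainer wires these updates through TRL's GRPO implementation), and the reward functions are passed in rather than pinned to the repository's real code:

```python
# Schematic self-play round. All names are placeholders for illustration.
def self_play_round(generator, answerer, frozen_answerer, env,
                    generator_reward, answer_reward, group_size=8):
    # The generator proposes a group of candidate tasks.
    tasks = [generator.propose(env) for _ in range(group_size)]

    # Generator reward: hardness is measured against a *frozen* answerer, so the
    # generator cannot win just because the live answerer is mid-update.
    gen_rewards = [generator_reward(task, frozen_answerer, env) for task in tasks]

    # Answerer reward: the live policy solves the same tasks and is scored with
    # the environment-native graph-and-answer objective.
    ans_rewards = [answer_reward(answerer.solve(task, env), task) for task in tasks]

    # Both sides then receive grouped, KL-regularized (GRPO-style) updates.
    generator.update(tasks, gen_rewards)
    answerer.update(tasks, ans_rewards)
    return gen_rewards, ans_rewards
```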

This is not just hype. The training loop has concrete mathematical logic behind it:

- grouped rollouts for relative comparison
- mean-centered reward baselines through GRPO
- KL-controlled policy updates
- explicit hardness terms against a frozen answerer
- replay validation for generated tasks
- shared-context pressure and swarm diversity terms in `swarm_v2`
- solver-side PARL-style orchestration shaping inspired by **Kimi K2.5**

That means the benchmark can evolve from a static evaluation set into a co-evolving curriculum of adversarial traces. The generator learns how to expose weaknesses. The answerer learns how to survive them.

For OSINT-style agents, that is exactly the kind of training pressure we want.

## Reward design with mathematical grounding

The most important reward story in this repository is the one used during **adversarial self-play training**.

Training uses the **GRPO algorithm** through Hugging Face TRL. That means optimization is driven by grouped rollouts, relative reward comparison inside each group, clipped updates, and KL-regularized policy improvement rather than plain supervised fine-tuning.

In the self-play setting, the generator and solver have different reward functions.

For the **generator**, the training reward is a weighted objective over four core terms:

\[
R_{\text{gen}} =
w_v R_{\text{validity}} +
w_h R_{\text{hardness}} +
w_d R_{\text{diversity}} +
w_c R_{\text{consistency}}
\]

where:

- \(R_{\text{validity}}\) checks that the proposed task is well-formed and bounded
- \(R_{\text{hardness}}\) is higher when the frozen solver fails the generated task
- \(R_{\text{diversity}}\) penalizes near-duplicate generations
- \(R_{\text{consistency}}\) rewards graph-grounded, replayable tasks

In `swarm_v2`, this goes one step further: invalid or non-replayable generations are hard-gated by validation, and the reward then emphasizes replayability, hardness, swarm diversity, and shared-context pressure. This is what keeps the generator from gaming training by emitting flashy but unusable tasks.
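
As a numeric sketch of that objective, the gated weighted sum looks like this. The weights are arbitrary example values, not the project's configuration:

```python
# Illustrative generator reward: weighted sum of four terms, with the swarm_v2-style
# hard gate that zeroes the reward for tasks that fail validation/replay.
# Weights are arbitrary example values, not the project's configuration.
WEIGHTS = {"validity": 0.2, "hardness": 0.4, "diversity": 0.2, "consistency": 0.2}

def generator_reward(terms: dict[str, float], replayable: bool) -> float:
    """`terms` maps each reward component name to a score in [0, 1]."""
    if not replayable:          # hard gate: non-replayable tasks earn nothing
        return 0.0
    return sum(WEIGHTS[name] * terms[name] for name in WEIGHTS)

# Example: a valid, replayable task that the frozen solver mostly fails on.
print(generator_reward(
    {"validity": 1.0, "hardness": 0.8, "diversity": 0.6, "consistency": 0.9},
    replayable=True,
))  # 0.2*1.0 + 0.4*0.8 + 0.2*0.6 + 0.2*0.9 ≈ 0.82
```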

For the **solver**, the training reward reuses the environment-native answer reward, but in the self-play pipeline it is explicitly framed as a solver-side objective for adversarial traces. The solver reward is also influenced by the **Kimi K2.5** paper through the project’s PARL-style shaping for multi-agent orchestration. In practice, that means solver training is not only about getting the final answer right, but also about coordinating useful work across the swarm.

The PARL-style orchestration term follows the project’s Kimi-inspired formulation:

\[
r_{\text{PARL}} = r_{\text{perf}} + \lambda_1 r_{\text{parallel}} + \lambda_2 r_{\text{finish}} + r_{\text{latency}}
\]
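
In code, that shaping term is just a small weighted sum. The λ values below are placeholders rather than the project's actual coefficients:

```python
# PARL-style orchestration shaping, following the formula above.
# lambda_parallel and lambda_finish are example values, not the project's settings.
def parl_reward(r_perf: float, r_parallel: float, r_finish: float, r_latency: float,
                lambda_parallel: float = 0.5, lambda_finish: float = 0.3) -> float:
    return r_perf + lambda_parallel * r_parallel + lambda_finish * r_finish + r_latency

# A run that answers correctly, parallelizes well, finishes its subtasks,
# and pays a small latency penalty.
print(parl_reward(r_perf=1.0, r_parallel=0.6, r_finish=0.8, r_latency=-0.1))  # ≈ 1.44
```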

On top of that term, the final solver reward combines the following components:

- output format validity and exact correctness
- knowledge-carrier and knowledge-indexing utility
- connectivity and supporting-edge F1 against task support edges
- efficiency and compactness penalties
- relation/entity informativeness and repetition control (difficulty-dependent)

This gives the solver-side swarm reward a strong systems flavor: the policy is encouraged to solve the task, but also to do so with effective parallel decomposition instead of brittle serial wandering.

Because training runs under **GRPO**, these rewards are used inside a relative-advantage setting:

- grouped rollouts provide comparison sets
- rewards are contrasted within the group
- KL terms stabilize policy updates
- generator hardness is measured against a frozen solver
- solver improvement is evaluated under the same adversarial pressure that the generator creates

That is the key design choice: the reward is not just scoring answers after the fact. It is shaping a co-evolutionary game between task proposer and task solver.
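
The group-relative part is easy to make concrete. For a group of rollouts of the same prompt, GRPO scores each rollout against its own group mean (typically normalized by the group's standard deviation), which is what "rewards are contrasted within the group" means in practice:

```python
# Group-relative advantages as used by GRPO: each rollout is scored relative to
# its own group rather than against a learned value baseline.
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one generated task: only the relative ordering matters.
print(group_advantages([0.9, 0.5, 0.5, 0.1]))  # roughly [1.41, 0.0, 0.0, -1.41]
```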

## Serving, evaluation, and reproducibility

The engineering story is just as compelling as the benchmark design.

The repository provides:

- a `src/` package layout
- CLI commands for `demo`, `eval`, `benchmark`, `leaderboard`, `benchmark-sweep`, `viz`, and `train-self-play`
- artifact outputs for evaluation and baselines
- dashboard export
- a FastAPI server with OpenEnv-style HTTP endpoints
- Docker and Hugging Face Space readiness

This makes Trace Net easy to use in several modes:

- local development
- repeatable benchmarking
- hosted interactive demos
- self-play training runs
- remote evaluation via HTTP

It also means the project is already structured for iteration instead of being locked into a one-off benchmark release.
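
For remote evaluation over HTTP, a client can drive the served environment directly. The endpoint paths and payload shapes below are guesses at an OpenEnv-style reset/step interface, not the server's documented routes, so check the repository's FastAPI app before relying on them:

```python
# Hypothetical HTTP client for the served environment. Endpoint names and payload
# shapes are assumptions; only the existence of a FastAPI/OpenEnv-style server
# comes from the post.
import requests

BASE_URL = "http://localhost:8000"  # local Docker container or a Hugging Face Space URL

def remote_step(query: str):
    requests.post(f"{BASE_URL}/reset", json={}).raise_for_status()
    resp = requests.post(f"{BASE_URL}/step", json={
        "kind": "CALL_TOOL",
        "payload": {"tool": "search_posts", "args": {"query": query}},
    })
    resp.raise_for_status()
    return resp.json()
```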

## Results

The repository already includes reward visualizations and tracking artifacts that make the training story much more concrete.

**Answer reward shaping**

![Answer reward design](https://github.com/RitishShrirao/OSINT_env/blob/main/assets/answer_reward.png?raw=1)

This view highlights that final scoring is not a single accuracy scalar. It combines correctness with graph utility, evidence quality, and efficiency so agents are rewarded for building useful investigative structure.

**Generator reward shaping**

![Generator reward design](https://github.com/RitishShrirao/OSINT_env/blob/main/assets/generator_reward.png?raw=1)

The generator side is where adversarial pressure becomes explicit: validity, hardness, diversity, and consistency work together so the task proposer cannot win by generating nonsense, only by generating hard but replayable traces.

**KL tracking during self-play**

![KL tracking](https://github.com/RitishShrirao/OSINT_env/blob/main/assets/kl.png?raw=1)

KL tracking matters because adversarial training is only useful when updates remain stable. Monitoring KL helps ensure the policies are learning under pressure rather than collapsing into degenerate behavior.

**Checkpoint comparison**

These comparisons were run on queries produced by the trained generator model:

- Finetuned checkpoint: `task_success_rate=0.875`, `avg_reward=0.8996`
- Base model `Qwen/Qwen2.5-0.5B-Instruct`: `task_success_rate=0.0`, `avg_reward=0.5196`
- Delta: `+0.875 success`, `+0.3800 avg reward`

These numbers make the improvement legible at a glance. The finetuned agent moves from zero task success to a strong success rate under the benchmark’s adversarial setting, while also increasing average reward substantially.
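
If you want to poke at the released checkpoint yourself, it can be loaded like any Hub-hosted causal LM. This assumes the checkpoint is a standard `transformers`-compatible export of the finetuned policy (the base model is `Qwen/Qwen2.5-0.5B-Instruct`); see the model card for the intended prompt format:

```python
# Load the released checkpoint from the Hugging Face Hub.
# Assumes a standard causal-LM export; consult the model card for the prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Siddeshwar1625/osint-checkpoints-final"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Propose an OSINT task grounded in the trace graph."  # illustrative prompt only
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```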

## Why Trace Net is exciting

Trace Net is exciting because it pushes agent evaluation closer to the real difficulty profile of OSINT:

- evidence is incomplete
- some actors are deceptive by design
- retrieval can be baited
- graph construction matters
- parallel investigation is valuable
- harder tasks should emerge adversarially, not just be hand-written

A lot of benchmarks ask whether a model can answer. Trace Net asks whether a system can **investigate under pressure**.

That shift is the whole point.

## Quick start

Install locally:

```bash
python -m pip install -e .
```

Run a demo episode:

```bash
osint-env demo --agent-mode swarm --llm-provider mock
```

Run a short benchmark:

```bash
osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
```

Run the release validation gate:

```bash
python scripts/validate_release.py
```

## Final thoughts

Trace Net combines synthetic world-building, adversarial noise, multi-agent coordination, mathematically shaped rewards, and self-play training into one benchmark stack. The result is a far more realistic stress test for OSINT-style agents than clean multi-hop QA can provide.

If the future of agent evaluation is not just "can it answer?" but "can it coordinate, investigate, resist deception, and improve under adversarial pressure?", then Trace Net is pointed in exactly the right direction.