# Trace Net: Training Agents for Noisy, Adversarial OSINT
🤖 **Checkpoint:** [Siddeshwar1625/osint-checkpoints-final](https://huggingface.co/Siddeshwar1625/osint-checkpoints-final)
📊 **Training run (W&B):** [osint-self-play-train](https://wandb.ai/siddeshwar2004-international-institute-of-information-te/osint-self-play-train)
Most agent benchmarks are still too clean.
They assume the world is cooperative, the evidence is tidy, and the shortest path to the answer is also the most obvious one. Real OSINT is the opposite. People hide. Identities splinter across aliases. Threads derail. Posts mislead on purpose. Useful evidence is mixed with decoys, soft contradictions, and deliberate attempts to waste an investigator's time.
That is the motivation behind **Trace Net**.
Trace Net is a synthetic OSINT benchmark environment for tool-using agents that need to search, cross-link, and reason over noisy multi-platform evidence before producing an answer. Instead of rewarding pure prompt cleverness, the environment pushes agents to behave like investigators: retrieve signals, build a working graph, resolve entities, and justify the final node they select.
This repository is not just a dataset and not just a demo app. It is a full benchmark stack:
- a synthetic OSINT environment
- a tunable noise generator
- single-agent and swarm-style execution
- graph-aware reward shaping
- adversarial self-play training
- evaluation, leaderboard, and dashboard export
- a FastAPI/OpenEnv-compatible serving layer for Docker and Hugging Face Spaces
## Why we did not use MetaQA
Earlier iterations explored a MetaQA-style backend, but we deliberately moved away from it for the benchmark we wanted to build.
MetaQA is useful for classic multi-hop reasoning, but it is too clean and too structurally easy for a serious OSINT setting. Once a task becomes mostly "follow a relation chain in a cooperative knowledge base," it stops stress-testing the failure modes we actually care about:
- identity ambiguity
- noisy retrieval
- alias collision
- distractor evidence
- partial observations
- agents being baited into the wrong trail
Trace Net focuses on those harder conditions instead. The goal is not just to see whether a model can traverse hops. The goal is to test whether an agent can survive in an adversarial evidence landscape.
## A noisy world by design
The synthetic dataset is intentionally hostile.
Noise in this environment does not just mean random corruption: it models **users actively working to deflect the agent**. Some records act like red herrings. Some identities branch into aliases. Some traces create plausible but misleading routes through the graph. The environment can be tuned so retrieval becomes harder, evidence becomes less direct, and the agent is forced to discriminate signal from manipulation.
This tunability is exposed through environment parameters such as:
- `alias_density`
- `noise_level`
- `red_herring_rate`
Those controls matter because they let the benchmark scale from manageable to punishing without changing the fundamental task structure. You are not switching domains to make the task harder. You are increasing adversarial pressure inside the same OSINT world.
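As a rough illustration of how these knobs could be grouped into presets, here is a minimal sketch. The parameter names come from the environment; the dataclass, defaults, and preset values are assumptions, not the project's actual API.

```python
from dataclasses import dataclass

# Hypothetical grouping of the difficulty knobs; the real environment
# constructor and default values may differ.
@dataclass
class NoiseConfig:
    alias_density: float = 0.2     # fraction of users carrying extra aliases
    noise_level: float = 0.3       # corruption applied to retrieved records
    red_herring_rate: float = 0.1  # share of deliberately misleading traces

manageable = NoiseConfig()
punishing = NoiseConfig(alias_density=0.6, noise_level=0.7, red_herring_rate=0.4)
print(punishing.red_herring_rate)
```

The task structure stays fixed across both presets; only the adversarial pressure changes.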
## How the environment works
At the core is a hidden canonical graph of:
- users
- aliases
- organizations
- locations
- posts
- threads
- events
Agents never see this graph directly. They interact through a compact action space:
- `CALL_TOOL`
- `ADD_EDGE`
- `ANSWER`
Every step returns structured observations containing recent tool outputs, the current working-memory graph snapshot, recent action history, and the active task payload. That means agents are evaluated on how they investigate, not just on what final token they emit.
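The step loop can be sketched as follows, with toy stand-ins for the environment and agent. The three action types come from the text above; the observation schema, the `(obs, reward, done)` step signature, and the toy classes are assumptions for illustration.

```python
# Minimal sketch of an episode loop; ToyEnv/ToyAgent are illustrative
# stand-ins, not the project's real interfaces.
def run_episode(env, agent, max_steps=30):
    obs = env.reset()
    reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)  # {"type": "CALL_TOOL" | "ADD_EDGE" | "ANSWER", ...}
        obs, reward, done = env.step(action)
        if done:
            break
    return reward

class ToyEnv:
    def reset(self):
        # observation mirrors the structure described above
        return {"tool_outputs": [], "graph": {}, "history": [], "task": "who wrote p1?"}
    def step(self, action):
        done = action["type"] == "ANSWER"
        reward = 1.0 if done and action.get("value") == "u42" else 0.0
        return {}, reward, done

class ToyAgent:
    def act(self, obs):
        return {"type": "ANSWER", "value": "u42"}

result = run_episode(ToyEnv(), ToyAgent())
print(result)  # a correct ANSWER ends the episode with reward 1.0
```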
The tool layer exposes search and lookup primitives over synthetic microblog posts, forum threads, profiles, and memory:
- `search_posts`
- `get_post`
- `get_user_posts`
- `get_mentions`
- `search_threads`
- `get_thread`
- `get_user_activity`
- `get_profile`
- `search_people`
- `get_connections`
- `search_memory`
- `search_shared_context`
This turns every episode into a miniature OSINT workflow: gather clues, connect evidence, resist decoys, and only then commit to an answer.
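A `CALL_TOOL` action plausibly resolves through a registry of named handlers over the synthetic corpus. The tool names above are real; the dispatch mechanism, handler signatures, and toy corpus below are assumptions.

```python
# Hypothetical tool-dispatch sketch over an in-memory synthetic corpus.
# Handler signatures are illustrative, not the project's actual API.
POSTS = {"p1": {"author": "u42", "text": "meet at the docks", "mentions": ["u7"]}}

def search_posts(query):
    return [pid for pid, post in POSTS.items() if query in post["text"]]

def get_post(post_id):
    return POSTS.get(post_id)

TOOLS = {"search_posts": search_posts, "get_post": get_post}

def call_tool(name, **kwargs):
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

hits = call_tool("search_posts", query="docks")
author = call_tool("get_post", post_id=hits[0])["author"]
print(author)
```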
## Multi-agent interaction is a core feature, not a gimmick
One of the strongest ideas in Trace Net is that it treats multi-agent reasoning as an explicit systems problem.
The repository includes both a single-agent runner and a swarm-style orchestrator. In swarm mode, lightweight specialist roles such as **explorer**, **linker**, and **reasoner** coordinate over the same episode. Each role contributes a different kind of progress:
- explorers widen the search frontier
- linkers turn evidence into candidate relations
- reasoners consolidate partial findings into answerable structure
This matters because OSINT naturally decomposes into parallel subproblems. One path follows a person. Another resolves an alias. Another checks whether an event trace is real or planted. A single monolithic agent can do all of that serially, but the benchmark becomes much more interesting when we ask whether a system can split the work, use breadth efficiently, and still converge on the right graph.
Trace Net bakes that into the reward story as well. The swarm runner records spawn count, finished subtasks, critical steps, breadth, and depth. In other words, coordination itself becomes measurable.
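The division of labor among the three roles can be sketched in miniature. The role names come from the text; the functions, corpus, and consolidation heuristic below are illustrative assumptions.

```python
from collections import Counter

# Toy sketch of the three swarm roles cooperating on one episode;
# the real role interfaces and coordination logic are assumptions.
def explorer(frontier, corpus):
    # widen the search frontier with neighbors of known nodes
    return frontier | {n for node in frontier for n in corpus.get(node, [])}

def linker(frontier):
    # turn co-observed nodes into candidate relations (edges)
    nodes = sorted(frontier)
    return list(zip(nodes, nodes[1:]))

def reasoner(edges):
    # consolidate partial findings: answer with the best-connected node
    counts = Counter(n for edge in edges for n in edge)
    return counts.most_common(1)[0][0] if edges else None

corpus = {"u1": ["u2"], "u2": ["u1", "u3"]}
frontier = explorer(explorer({"u1"}, corpus), corpus)
answer = reasoner(linker(frontier))
print(answer)  # "u2" is the best-connected candidate
```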
## Adversarial self-play is where the benchmark gets dangerous
Trace Net does not stop at evaluation. It includes a scaffold for **adversarial self-play training** built around Hugging Face TRL and the **GRPO** algorithm.
The loop alternates between two roles:
1. a **generator** policy that proposes difficult OSINT tasks
2. an **answerer** policy that tries to solve them
That setup is powerful because it creates pressure from both sides. The generator is rewarded for producing tasks that are valid, grounded, diverse, and hard for the current answerer. The answerer is rewarded using the same environment-native graph-and-answer objectives used during benchmark evaluation.
This is not just hype. The training loop has concrete mathematical logic behind it:
- grouped rollouts for relative comparison
- mean-centered reward baselines through GRPO
- KL-controlled policy updates
- explicit hardness terms against a frozen answerer
- replay validation for generated tasks
- shared-context pressure and swarm diversity terms in `swarm_v2`
- solver-side PARL-style orchestration shaping inspired by **Kimi K2.5**
That means the benchmark can evolve from a static evaluation set into a co-evolving curriculum of adversarial traces. The generator learns how to expose weaknesses. The answerer learns how to survive them.
For OSINT-style agents, that is exactly the kind of training pressure we want.
## Reward design with mathematical grounding
The most important reward story in this repository is the one used during **adversarial self-play training**.
Training uses the **GRPO algorithm** through Hugging Face TRL. That means optimization is driven by grouped rollouts, relative reward comparison inside each group, clipped updates, and KL-regularized policy improvement rather than plain supervised fine-tuning.
In the self-play setting, the generator and solver have different reward functions.
For the **generator**, the training reward is a weighted objective over four core terms:
$$
R_{\text{gen}} =
w_v\, R_{\text{validity}} +
w_h\, R_{\text{hardness}} +
w_d\, R_{\text{diversity}} +
w_c\, R_{\text{consistency}}
$$
where:
- $R_{\text{validity}}$ checks that the proposed task is well-formed and bounded
- $R_{\text{hardness}}$ is higher when the frozen solver fails the generated task
- $R_{\text{diversity}}$ penalizes near-duplicate generations
- $R_{\text{consistency}}$ rewards graph-grounded, replayable tasks
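The weighted objective above reduces to a simple linear combination. The weights and component scores in this sketch are illustrative placeholders, not the project's actual values.

```python
# Sketch of the generator objective R_gen from the formula above;
# weights w_* and inputs here are illustrative, not the real config.
def generator_reward(validity, hardness, diversity, consistency,
                     w_v=0.25, w_h=0.35, w_d=0.2, w_c=0.2):
    return (w_v * validity + w_h * hardness
            + w_d * diversity + w_c * consistency)

# a valid, hard, reasonably novel and replayable task scores well
r = generator_reward(validity=1.0, hardness=0.8, diversity=0.6, consistency=1.0)
print(round(r, 3))  # 0.25 + 0.28 + 0.12 + 0.20 = 0.85
```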
In `swarm_v2`, this goes one step further: invalid or non-replayable generations are hard-gated by validation, and the reward then emphasizes replayability, hardness, swarm diversity, and shared-context pressure. This is what keeps the generator from gaming training by emitting flashy but unusable tasks.
For the **solver**, the training reward reuses the environment-native answer reward, but in the self-play pipeline it is explicitly framed as a solver-side objective for adversarial traces. The solver reward is also influenced by the **Kimi K2.5** paper through the project’s PARL-style shaping for multi-agent orchestration. In practice, that means solver training is not only about getting the final answer right, but also about coordinating useful work across the swarm.
The PARL-style orchestration term follows the project’s Kimi-inspired formulation:
$$
r_{\text{PARL}} = r_{\text{perf}} + \lambda_1\, r_{\text{parallel}} + \lambda_2\, r_{\text{finish}} + r_{\text{latency}}
$$
The final solver reward therefore combines the following components:
- output format validity and exact correctness
- knowledge-carrier and knowledge-indexing utility
- connectivity and supporting-edge F1 against task support edges
- efficiency and compactness penalties
- relation/entity informativeness and repetition control (difficulty-dependent)
This gives the solver-side swarm reward a strong systems flavor: the policy is encouraged to solve the task, but also to do so with effective parallel decomposition instead of brittle serial wandering.
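The PARL-style combination above can be sketched directly from the formula. The lambda weights and term values here are assumptions for illustration; how each term is actually measured is defined by the project.

```python
# Illustrative combination of the PARL-style terms r_perf, r_parallel,
# r_finish, r_latency; lambda weights are placeholder assumptions.
def parl_reward(r_perf, r_parallel, r_finish, r_latency,
                lam1=0.1, lam2=0.1):
    return r_perf + lam1 * r_parallel + lam2 * r_finish + r_latency

# solving the task with good parallel decomposition but a small latency cost
r = parl_reward(r_perf=1.0, r_parallel=0.5, r_finish=1.0, r_latency=-0.05)
print(round(r, 4))
```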
Because training runs under **GRPO**, these rewards are used inside a relative-advantage setting:
- grouped rollouts provide comparison sets
- rewards are contrasted within the group
- KL terms stabilize policy updates
- generator hardness is measured against a frozen solver
- solver improvement is evaluated under the same adversarial pressure that the generator creates
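The group-relative mechanic at the heart of GRPO can be sketched in a few lines: rewards within each group of rollouts are mean-centered (and typically std-normalized) so each completion is compared against its siblings rather than an absolute scale. This is a minimal sketch of that idea, not TRL's implementation.

```python
# Group-relative advantages as used in GRPO-style training: center each
# group's rewards on their mean and normalize by their std deviation.
def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# four rollouts of the same prompt compared within their group
adv = group_advantages([0.2, 0.9, 0.4, 0.9])
print([round(a, 2) for a in adv])  # above-mean rollouts get positive advantage
```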
That is the key design choice: the reward is not just scoring answers after the fact. It is shaping a co-evolutionary game between task proposer and task solver.
## Serving, evaluation, and reproducibility
The engineering story is just as compelling as the benchmark design.
The repository provides:
- a `src/` package layout
- CLI commands for `demo`, `eval`, `benchmark`, `leaderboard`, `benchmark-sweep`, `viz`, and `train-self-play`
- artifact outputs for evaluation and baselines
- dashboard export
- a FastAPI server with OpenEnv-style HTTP endpoints
- Docker and Hugging Face Space readiness
This makes Trace Net easy to use in several modes:
- local development
- repeatable benchmarking
- hosted interactive demos
- self-play training runs
- remote evaluation via HTTP
It also means the project is already structured for iteration instead of being locked into a one-off benchmark release.
## Results
The repository already includes reward visualizations and tracking artifacts that make the training story much more concrete.
**Answer reward shaping**

This view highlights that final scoring is not a single accuracy scalar. It combines correctness with graph utility, evidence quality, and efficiency so agents are rewarded for building useful investigative structure.
**Generator reward shaping**

The generator side is where adversarial pressure becomes explicit: validity, hardness, diversity, and consistency work together so the task proposer cannot win by generating nonsense, only by generating hard but replayable traces.
**KL tracking during self-play**

KL tracking matters because adversarial training is only useful when updates remain stable. Monitoring KL helps ensure the policies are learning under pressure rather than collapsing into degenerate behavior.
**Checkpoint comparison**
These comparisons were run on queries produced by the trained generator model:
- Finetuned checkpoint: `task_success_rate=0.875`, `avg_reward=0.8996`
- Base model `Qwen/Qwen2.5-0.5B-Instruct`: `task_success_rate=0.0`, `avg_reward=0.5196`
- Delta: `+0.875 success`, `+0.3800 avg reward`
These numbers make the improvement legible at a glance. The finetuned agent moves from zero task success to a strong success rate under the benchmark’s adversarial setting, while also increasing average reward substantially.
## Why Trace Net is exciting
Trace Net is exciting because it pushes agent evaluation closer to the real difficulty profile of OSINT:
- evidence is incomplete
- some actors are deceptive by design
- retrieval can be baited
- graph construction matters
- parallel investigation is valuable
- harder tasks should emerge adversarially, not just be hand-written
A lot of benchmarks ask whether a model can answer. Trace Net asks whether a system can **investigate under pressure**.
That shift is the whole point.
## Quick start
Install locally:
```bash
python -m pip install -e .
```
Run a demo episode:
```bash
osint-env demo --agent-mode swarm --llm-provider mock
```
Run a short benchmark:
```bash
osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
```
Run the release validation gate:
```bash
python scripts/validate_release.py
```
## Final thoughts
Trace Net combines synthetic world-building, adversarial noise, multi-agent coordination, mathematically shaped rewards, and self-play training into one benchmark stack. The result is a far more realistic stress test for OSINT-style agents than clean multi-hop QA can provide.
If the future of agent evaluation is not just "can it answer?" but "can it coordinate, investigate, resist deception, and improve under adversarial pressure?", then Trace Net is pointed in exactly the right direction.