Spaces:
Paused
Paused
File size: 21,127 Bytes
d814291 0a16175 d814291 fe1f842 e44cdee fe1f842 d814291 e44cdee d814291 e44cdee d814291 e44cdee d814291 fe1f842 d814291 e44cdee d814291 fe1f842 e44cdee fe1f842 e44cdee fe1f842 e44cdee fe1f842 d814291 fe1f842 e44cdee fe1f842 e44cdee d814291 fe1f842 e44cdee fe1f842 d814291 fe1f842 d814291 274f638 d814291 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 | ---
title: OSINT OpenEnv
emoji: π΅οΈ
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
tags:
- openenv
- osint
- benchmark
- docker
- fastapi
short_description: Docker OSINT benchmark with fixed OpenEnv tasks.
---
# OSINT OpenEnv
π **Slide deck:** [TraceNet.pdf (Google Drive)](https://drive.google.com/file/d/14GyDrtIei54C4RHokOI-mLf12B_OOzPX/view?usp=sharing)
OSINT OpenEnv is a synthetic benchmark environment for tool-using agents that must recover identities, trace events, and link entities across noisy multi-platform records. The project is designed to feel like a compact OSINT workflow rather than a raw QA dataset: agents query mock profiles, posts, forum threads, and semantic memory, build a working graph, and then submit an answer.
The motivation is to provide a reproducible OpenEnv-compatible environment for evaluating graph-building and tool-using reasoning without depending on live web data, unstable APIs, or private corpora. That makes it useful for local development, regression testing, and hosted demos such as a Docker-based Hugging Face Space.
## Environment Summary
The environment generates or loads a hidden canonical graph of users, aliases, organizations, locations, posts, threads, and events. It then exposes partial platform views and a task list drawn from that graph.
The default hosted Space uses the fixed-level benchmark in `datasets/fixed_levels/seed_fixed_levels.json`, which now contains 30 stable tasks over a larger shared seeded graph.
The repository now supports two dataset backends:
- `canonical` (existing fixed-level synthetic graph pipeline)
- `metaqa` (MetaQA KB + QA files for `1-hop`, `2-hop`, and `3-hop`)
Use `config/shared_config.json` or CLI flags (`--dataset-mode`, `--metaqa-root`, `--metaqa-hops`, `--metaqa-splits`) to choose which backend to run.
## Action Space
The environment exposes three actions:
- `CALL_TOOL`: query platform views or semantic memory such as `search_posts`, `get_profile`, `search_threads`, `get_connections`, or `search_memory`.
- `ADD_EDGE`: add a candidate relation to the working memory graph.
- `ANSWER`: submit the final answer as an exact node id string.
## Observation Space
Each step returns a JSON observation with four parts:
- `tool_outputs`: the most recent tool results.
- `graph_snapshot`: the current working-memory graph edges.
- `action_history`: recent actions and rewards.
- `task`: the active task id, task type, and question.
## Task Types And Difficulty
The benchmark mixes direct lookups with multi-hop traces:
- Easy: single-hop identity resolution, organization lookup, event lookup, or location lookup.
- Mid: two-hop alias-to-user-to-organization or thread-to-event-to-user traces.
- High: cross-platform multi-hop traces combining aliases, authored content, event references, organization links, and direct connections.
In MetaQA mode, hop buckets are mapped into the same reward difficulty tiers:
- `1-hop` -> `easy`
- `2-hop` -> `medium`
- `3-hop` -> `hard`
Common task families include:
- `identity_resolution`
- `network_discovery`
- `event_tracing`
- `cross_platform_linking`
- `deanonymization`
- `convoluted_trace`
Expected difficulty increases with the number of relations the agent must chain together and whether the evidence is split across posts, threads, aliases, and profile edges.
## Repository Layout
```text
src/osint_env/
agents/ single-agent and swarm runners
baselines/ reusable OpenAI baseline runner
config/ shared config and seed loading
data/ graph/view/task generation
domain/ dataclasses and environment models
env/ environment, reward logic, OpenEnv compatibility shim
eval/ evaluation metrics and leaderboard helpers
llm/ mock, Ollama, and OpenAI client wrappers
memory/ working graph and semantic memory
platforms/ tool APIs over synthetic platform views
viz/ dashboard export
scripts/
build_fixed_levels_dataset.py
run_openai_baseline.py
datasets/fixed_levels/
seed_fixed_levels.json
shared_config_fixed_levels.json
qwen_swarm_benchmark_fixed_levels.json
server.py FastAPI app for local use and Docker/HF Spaces
Dockerfile Container entrypoint for Hugging Face Docker Spaces
```
## Setup
Python 3.10+ is required.
Local install:
```bash
python -m pip install -e .
```
Install optional adversarial self-play training stack (TRL + Transformers):
```bash
python -m pip install -e ".[train]"
```
Run tests:
```bash
python -m pytest -q
```
Run the automated release gate:
```bash
python scripts/validate_release.py
```
## Usage
Run one demo episode:
```bash
osint-env demo --agent-mode swarm --llm-provider mock
```
Run against MetaQA using the provided sample config:
```bash
osint-env demo --config config/shared_config_metaqa.json --dataset-mode metaqa --llm-provider mock
```
Run MetaQA with only selected hop buckets:
```bash
osint-env eval --config config/shared_config_metaqa.json --dataset-mode metaqa --metaqa-hops 1-hop,2-hop --episodes 5 --llm-provider mock
```
Run a quick evaluation:
```bash
osint-env eval --episodes 5 --agent-mode swarm --llm-provider mock
```
Export a dashboard:
```bash
osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
```
Run Kimi-style adversarial self-play scaffold (dry-run by default in the example config):
```bash
osint-env train-self-play --config config/shared_config.json --train-config config/self_play_training_example.json --dry-run
```
When you have compute and the train dependencies installed, remove `--dry-run` (or set `"dry_run": false` in the training config) to execute TRL GRPO updates for alternating generator and answerer phases.
For a standalone Linux server or SSH box, there is also a wrapper script that activates a venv, optionally installs train deps, and runs the same command:
```bash
VENV_PATH="$HOME/arl" \
INSTALL_TRAIN_DEPS=1 \
TRAIN_ENV_CONFIG_PATH="config/shared_config.json" \
TRAIN_SELF_PLAY_CONFIG_PATH="config/self_play_training_hf_l40s_full.json" \
TRAIN_SELF_PLAY_OUTPUT_DIR="artifacts/self_play_server" \
bash scripts/train_self_play_standalone.sh
```
Useful overrides for the standalone script:
- `BOOTSTRAP_VENV=1` to create the virtualenv if it does not exist
- `TRAIN_SELF_PLAY_ROUNDS=2` to override the number of rounds
- `RUN_SELF_PLAY_DRY_RUN=1` to materialize artifacts without GRPO updates
- `TRAIN_SETUP_COMMAND='python -m pip install flash-attn --no-build-isolation'` for host-specific extras
The training config also supports `"model_topology": "dual"|"shared"`, `"phase_schedule": "generator_answerer"|"answerer_generator_answerer"`, `"tuning_mode": "full"|"lora"`, and `"canonical_graph_mode": "generate"|"fixed"` so you can switch between two-model vs single-model self-play, full fine-tuning vs LoRA adapters, and whether canonical graph structure is generated each round or kept fixed while training question/answer behavior.
### Hugging Face Job L40S Run (Separate From The Space)
For a budgeted full fine-tuning run on `l40s` hardware, use:
```bash
osint-env train-self-play --config config/shared_config.json --train-config config/self_play_training_hf_l40s_full.json
```
This config:
- uses `Qwen/Qwen2.5-0.5B-Instruct`
- enables W&B reporting (`wandb_enabled: true`)
- uses `pipeline_mode: "swarm_v2"` with `canonical_graph_mode: "fixed"` to keep canonical graph candidates stable while training question/answer behavior
- keeps the VRAM-heavy settings aligned with the smoke config while extending runtime (`rounds=6`, `max_steps=120` per phase)
- uses full fine-tuning plus fused AdamW, bf16/tf32, larger generation batches, and extra dataloader workers to make better use of an L40S
To enable canonical graph generation during swarm_v2 training, switch `"canonical_graph_mode"` to `"generate"` in the training config.
If you want the Space to stay on CPU and train separately on paid GPU compute, launch a dedicated Hugging Face Job instead of training inside the Space:
```bash
osint-env-launch-hf-job \
--hf-token "$HF_TOKEN" \
--job-image "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel" \
--repo-url "https://github.com/your-org/meta-knowledge-graph.git" \
--repo-ref "main" \
--flavor "l40s" \
--env-config "config/shared_config.json" \
--train-config "config/self_play_training_hf_l40s_full.json" \
--output-bucket "your-hf-bucket" \
--wait
```
The launcher talks to the Hugging Face Jobs API through `huggingface_hub`, so the Space can remain on CPU while the training job runs on separate L40S compute.
Optional Space startup wiring still exists if you want it:
1. Keep the Space on CPU if it is serving inference/UI only.
2. The startup script now defaults to running self-play on boot with the full L40S config.
3. Optional overrides:
- `RUN_SELF_PLAY_TRAINING=0` to disable startup-time training
- `TRAIN_SELF_PLAY_CONFIG_PATH` (default: `config/self_play_training_hf_l40s_full.json`)
- `TRAIN_ENV_CONFIG_PATH` (default: `config/shared_config.json`)
- `TRAIN_SELF_PLAY_OUTPUT_DIR` to override where artifacts land
- `RUN_SELF_PLAY_DRY_RUN=1` to test startup wiring without GRPO updates
- `RUN_SELF_PLAY_BACKGROUND=1` to keep the API up while startup-time training runs
- `OSINT_HF_CHECKPOINT_REPO_ID` to force uploads into a specific HF model repo
- `OSINT_HF_CHECKPOINT_REPO_TYPE` to switch repo type (`model` by default)
- `OSINT_HF_CHECKPOINT_REPO_PRIVATE=0` to create/update a public checkpoint repo
- `OSINT_TRAIN_STRICT_ASSERTS=1` to fail fast when reward variance, KL, loss, grad norms, or parameter updates stay zero
W&B run naming is controlled by `wandb_run_name_prefix` and will emit phase-specific runs like `...-r001-generator` and `...-r001-answerer`.
### Reward Functions In Self-Play (Generator + Answerer)
Self-play trains two policies with role-specific reward functions defined in `src/osint_env/training/rewards.py`.
Generator reward (`GeneratorRewardFunction`) and answerer reward (`AnswererRewardFunction`) are both returned to GRPO as scalar scores per completion, and both are clipped to a stable range before optimization.
#### Generator Reward (Task-Proposing Agent)
The generator is rewarded for producing valid, grounded, diverse, and hard tasks.
In `legacy` pipeline mode, the reward is a weighted sum:
- `validity`: checks non-empty `question`, non-empty `answer`, and bounded `supporting_edges`.
- `hardness`: uses a frozen answerer judge; reward is higher when the judge gets the generated question wrong.
- `diversity`: penalizes near-duplicate questions via token-level Jaccard similarity against prior generated questions.
- `consistency`: checks that support edges exist in the canonical graph and that the answer/question are graph-grounded.
Default weights (configurable through `generator_reward_weights` in training config):
- `validity`: `0.45`
- `hardness`: `0.20`
- `diversity`: `0.15`
- `consistency`: `0.20`
In `swarm_v2` pipeline mode, generation uses strict replay/validation first, then a structured reward:
- Hard-gated validation via `SwarmV2ReplayValidator` (invalid samples get a fixed negative reward path).
- Reward components include validity, derivability/replayability, hardness, swarm diversity, shared-context pressure targeting, and PARL-inspired orchestration bonuses (`parallel` + `finish`).
- Invalid or non-replayable candidates are penalized before the weighted positive terms are applied.
#### Answerer Reward (Question-Solving Agent)
The answerer reward wraps environment-native grading so train-time behavior matches benchmark-time incentives.
For each completion, `AnswererRewardFunction`:
- extracts the predicted answer from completion text/JSON,
- reconstructs a transient `TaskInstance` from row fields (`question`, `answer`, `supporting_edges_json`, `difficulty`),
- optionally extracts predicted supporting edges from JSON or text,
- calls `compute_answer_reward(...)` from `src/osint_env/env/reward.py`.
`compute_answer_reward` combines exact answer quality with graph-utility shaping:
- output format validity and exact correctness,
- knowledge-carrier and knowledge-indexing utility,
- connectivity and supporting-edge F1 against task support edges,
- efficiency and compactness penalties,
- relation/entity informativeness and repetition control (difficulty-dependent).
Difficulty controls (`easy`, `medium`, `hard`) are preserved during training exactly as in the environment scorer, so the answerer sees the same tiered reward profile used in evaluation.
In `swarm_v2`, the answerer reward also adds PARL-style orchestration credit (spawn/finish behavior) on top of base answer reward when orchestrator telemetry is present in the completion payload.
Detailed design notes are in `docs/adversarial_self_play.md`.
## OpenAI Baseline
The reproducible OpenAI baseline is implemented in `scripts/run_openai_baseline.py`. It runs on the fixed-level benchmark, uses a stable seeded graph/task set, writes a JSON artifact, appends a leaderboard record, and exports a dashboard.
Default behavior:
- dataset: fixed-level benchmark
- episodes: 30
- max steps per episode: 8
- temperature: 0.0
- output artifact: `artifacts/baselines/openai_fixed_levels_latest.json`
Run it with an API key:
```bash
export OPENAI_API_KEY="your_key_here"
python scripts/run_openai_baseline.py --model gpt-5-nano
```
The script is designed to stay bounded enough for a normal benchmark pass to finish comfortably under 20 minutes on a lightweight chat model, while still using the full fixed task set. For repeatability it fixes the benchmark graph/tasks and uses deterministic decoding settings. Because remote model backends can still change over time, the output artifact also records model metadata and system fingerprints when available.
## Inference Script
The submission-ready inference entrypoint is the root `inference.py` file. It talks to the deployed Hugging Face Space over HTTP, uses the OpenAI client for all model calls, and emits structured stdout logs in the `[START]`, `[STEP]`, and `[END]` format.
The script accepts `HF_TOKEN` as the primary auth variable and also supports `OPENAI_API_KEY` or `API_KEY` as local fallbacks.
After a successful run, `inference.py` also posts the evaluation summary back to the Space so the latest `/dashboard` view reflects that run.
Required environment variables:
- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
Optional environment variables:
- `SPACE_URL` default: `https://siddeshwar1625-osint.hf.space`
- `TASK_INDICES` default: `0,10,20`
- `MAX_STEPS` default: `8`
Example local test command against a running local server:
```bash
API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-5.4-mini OPENAI_API_KEY=your_key SPACE_URL=http://127.0.0.1:7860 python inference.py
```
Example test command against the deployed Space:
```bash
API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-5.4-mini OPENAI_API_KEY=your_key SPACE_URL=https://siddeshwar1625-osint.hf.space python inference.py
```
## Hugging Face Space β Variables And Secrets
All settings below are read by `server.py`, `scripts/space_start.sh`, and `src/osint_env/training/self_play.py` at Space boot. Configure them under **Settings β Variables and secrets** on the Space. Secrets are stored encrypted; plain variables are visible in the UI.
### Secrets (sensitive β store as **Secrets**)
- `HF_TOKEN` β Hugging Face auth token used for downloading gated models and uploading checkpoints to your model repo (aliases also accepted: `HUGGINGFACE_HUB_TOKEN`, `HUGGING_FACE_HUB_TOKEN`).
- `WANDB_API_KEY` β Weights & Biases API key; enables remote training dashboards (loss, KL, reward curves) during self-play.
- `OPENAI_API_KEY` β OpenAI key used by `inference.py` and the OpenAI baseline; only required when the LLM provider is `openai`.
### Server / runtime variables
- `PORT` β Port uvicorn binds to (default `7860`; HF sets this automatically).
- `OSINT_ENV_CONFIG` β Path to the shared environment config JSON (default `datasets/fixed_levels/shared_config_fixed_levels.json`).
- `OSINT_ENV_SEED_FILE` β Path to the fixed-task seed file used by the live Space (default `datasets/fixed_levels/seed_fixed_levels.json`).
- `OSINT_SPACE_LLM_PROVIDER` β LLM provider for the live demo and `/api/environment` (default `mock`; set `openai` to use a real model).
- `OSINT_SPACE_LLM_MODEL` β Model name when provider is `openai` (default `gpt-4o-mini`).
### Self-play training (boot-time) variables
- `RUN_SELF_PLAY_TRAINING` β `1` to run self-play training on Space boot, `0` to skip (default `1`).
- `RUN_SELF_PLAY_DRY_RUN` β `1` materializes artifacts without GRPO updates; useful for verifying wiring (default `0`).
- `RUN_SPACE_API_SERVER` β `1` to start uvicorn alongside training, `0` for training-only mode (default `1`).
- `TRAIN_SELF_PLAY_CONFIG_PATH` β Training config JSON (default `config/self_play_training_hf_l40s_full.json`).
- `TRAIN_ENV_CONFIG_PATH` β Environment config consumed by the training run (default `config/shared_config.json`).
- `TRAIN_SELF_PLAY_OUTPUT_DIR` β Override for where training artifacts and checkpoints are written.
- `TRAIN_LOG_PATH` β File the supervised training process writes stdout/stderr to (default `/tmp/self_play_training.log`).
- `OSINT_TRAIN_STRICT_ASSERTS` β `1` makes training fail fast when reward variance, KL, loss, grad norms, or parameter updates stay zero (default off).
### Checkpoint upload (HF Hub) variables
- `OSINT_HF_CHECKPOINT_REPO_ID` β Force checkpoint uploads into a specific HF model repo (e.g. `Siddeshwar1625/osint-checkpoints-final`). When unset, a name is derived from `SPACE_ID`.
- `OSINT_HF_CHECKPOINT_REPO_TYPE` β Repo type for uploads (`model` by default; can be `dataset`).
- `OSINT_HF_CHECKPOINT_REPO_PRIVATE` β `0` to create/update a public checkpoint repo (default `1`/private).
- `OSINT_HF_CHECKPOINT_REPO_SUFFIX` β Suffix appended when the repo name is derived from `SPACE_ID` (default `-checkpoints`).
- `OSINT_HF_CHECKPOINT_PATH_PREFIX` β Subdirectory inside the repo where checkpoints are written.
- `OSINT_HF_UPLOAD_ON_SAVE` β `0` disables automatic upload of every new `checkpoint-*` directory (default `1`).
### Auto-populated by the Space (read-only)
- `SPACE_ID` / `HF_SPACE_ID` β Set by Hugging Face; consumed when deriving a default checkpoint repo name.
## Docker And Hugging Face Space
The repository is ready for a Docker-based Hugging Face Space:
- `README.md` includes `sdk: docker`
- `README.md` includes the `openenv` Space tag
- `Dockerfile` serves `server.py` on port `7860`
Local Docker smoke test:
```bash
docker build -t osint-openenv .
docker run --rm -p 7860:7860 osint-openenv
```
Then open `http://localhost:7860`.
The FastAPI app serves:
- `/`: overview page
- `/dashboard`: generated benchmark dashboard
- `/api/environment`: environment metadata
- `/health`: health check (validator-friendly alias)
- `/healthz`: health check (legacy alias)
- `/openenv.yaml`: OpenEnv HTTP spec stub
- `/openenv/tasks`: task enumeration
- `/reset` and `/openenv/reset`: episode reset endpoints
- `/step` and `/openenv/step`: episode step endpoints
- `/state` and `/openenv/state/{session_id}`: session state endpoints (`/state` returns the latest session)
## Automated Validation
The repository includes a pass/fail validation gate for the core delivery requirements:
- Hugging Face Space readiness
- OpenEnv spec compliance
- reproducible baseline behavior
- at least 3 fixed tasks with working graders
- Docker image build in CI
Local gate:
```bash
python scripts/validate_release.py
```
CI gate:
- `.github/workflows/validation.yml`
- runs `pytest`
- runs the validation script
- runs `docker build`
## Baseline Scores
The fixed-level benchmark was expanded from the earlier 15-question set to a 30-question set with a larger seeded graph, so older benchmark artifacts should be treated as legacy and regenerated on the new dataset before using them as reference scores.
After you supply an OpenAI API key, the current baseline scores for the expanded benchmark will be written to:
- `artifacts/baselines/openai_fixed_levels_latest.json`
- `artifacts/baselines/openai_fixed_levels_dashboard.html`
## Notes On `pyproject.toml`
The packaging file is structurally correct for a `src/` layout and editable installs. The main gaps were deployment/runtime related rather than build-breaking:
- `openenv` is now version-bounded explicitly.
- `fastapi` and `uvicorn` are included because the repo now ships a real web server.
- pytest is pointed at the `tests/` directory, and the test suite also adds `src/` to `sys.path` so source-layout imports work reliably during local runs.
## Development Notes
The project keeps a lightweight local compatibility shim for `openenv` so the source tree remains importable even before dependencies are installed. In a normal install or Docker build, the real `openenv` package from PyPI is still used.
|