---
title: OSINT OpenEnv
emoji: 🕵️
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
tags:
  - openenv
  - osint
  - benchmark
  - docker
  - fastapi
short_description: Docker OSINT benchmark with fixed OpenEnv tasks.
---

# OSINT OpenEnv

OSINT OpenEnv is a synthetic benchmark environment for tool-using agents that must recover identities, trace events, and link entities across noisy multi-platform records. The project is designed to feel like a compact OSINT workflow rather than a raw QA dataset: agents query mock profiles, posts, forum threads, and semantic memory, build a working graph, and then submit an answer.

The motivation is to provide a reproducible OpenEnv-compatible environment for evaluating graph-building and tool-using reasoning without depending on live web data, unstable APIs, or private corpora. That makes it useful for local development, regression testing, and hosted demos such as a Docker-based Hugging Face Space.

## Environment Summary

The environment generates or loads a hidden canonical graph of users, aliases, organizations, locations, posts, threads, and events. It then exposes partial platform views and a task list drawn from that graph.

The default hosted Space uses the fixed-level benchmark in `datasets/fixed_levels/seed_fixed_levels.json`, which now contains 30 stable tasks over a larger shared seeded graph.

## Action Space

The environment exposes three actions:

- `CALL_TOOL`: query platform views or semantic memory such as `search_posts`, `get_profile`, `search_threads`, `get_connections`, or `search_memory`.
- `ADD_EDGE`: add a candidate relation to the working memory graph.
- `ANSWER`: submit the final answer as an exact node id string.
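
As a rough illustration, the three actions can be modeled as small JSON-serializable payloads. Only the action names and the `search_posts` tool name come from the list above. The field names (`action_type`, `payload`, `tool_name`, `args`, `src`, `rel`, `dst`, `answer`), the `alias_of` relation, and the node ids are assumptions made for this sketch and may not match the schema under `src/osint_env/domain/`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Action:
    # Hypothetical wire format: field names are illustrative, not the project's schema.
    action_type: str  # "CALL_TOOL", "ADD_EDGE", or "ANSWER"
    payload: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


def call_tool(tool_name: str, **args: Any) -> Action:
    """Query a platform view or semantic memory."""
    return Action("CALL_TOOL", {"tool_name": tool_name, "args": args})


def add_edge(src: str, rel: str, dst: str) -> Action:
    """Record a candidate relation in the working graph."""
    return Action("ADD_EDGE", {"src": src, "rel": rel, "dst": dst})


def answer(node_id: str) -> Action:
    """Submit the final answer as an exact node id string."""
    return Action("ANSWER", {"answer": node_id})


# A made-up three-step episode: look something up, note a relation, answer.
episode = [
    call_tool("search_posts", query="harbor meetup"),
    add_edge("alias_17", "alias_of", "user_3"),
    answer("user_3"),
]
for step in episode:
    print(step.to_json())
```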

## Observation Space

Each step returns a JSON observation with four parts:

- `tool_outputs`: the most recent tool results.
- `graph_snapshot`: the current working-memory graph edges.
- `action_history`: recent actions and rewards.
- `task`: the active task id, task type, and question.
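
A minimal sketch of how a client might type and validate such an observation. The four top-level keys are taken from the list above. The nested key names inside `task` (`task_id`, `task_type`, `question`) and the sample values are assumptions for illustration only.

```python
import json
from typing import Any, TypedDict


class TaskInfo(TypedDict):
    # Assumed key names; the README only says the task carries an id, a type, and a question.
    task_id: str
    task_type: str
    question: str


class Observation(TypedDict):
    tool_outputs: list[Any]
    graph_snapshot: list[Any]
    action_history: list[Any]
    task: TaskInfo


REQUIRED_KEYS = ("tool_outputs", "graph_snapshot", "action_history", "task")


def parse_observation(raw: str) -> Observation:
    """Decode a JSON observation and check that the four documented parts are present."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"observation is missing keys: {missing}")
    return data


# Made-up sample payload, not real environment output.
sample = json.dumps(
    {
        "tool_outputs": [],
        "graph_snapshot": [],
        "action_history": [],
        "task": {
            "task_id": "task_0",
            "task_type": "identity_resolution",
            "question": "Which user is behind this alias?",
        },
    }
)
obs = parse_observation(sample)
print(obs["task"]["task_type"])
```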

## Task Types And Difficulty

The benchmark mixes direct lookups with multi-hop traces:

- Easy: single-hop identity resolution, organization lookup, event lookup, or location lookup.
- Mid: two-hop alias-to-user-to-organization or thread-to-event-to-user traces.
- High: cross-platform multi-hop traces combining aliases, authored content, event references, organization links, and direct connections.

Common task families include:

- `identity_resolution`
- `network_discovery`
- `event_tracing`
- `cross_platform_linking`
- `deanonymization`
- `convoluted_trace`

Expected difficulty increases with the number of relations the agent must chain together and whether the evidence is split across posts, threads, aliases, and profile edges.
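
To make the hop count concrete, the sketch below chains relations over a toy edge list. The relation names (`alias_of`, `member_of`, `located_in`) and node ids are hypothetical and are not taken from the seeded graph.

```python
from collections import defaultdict

Edge = tuple[str, str, str]  # (source node, relation, destination node)


def build_index(edges: list[Edge]) -> dict[tuple[str, str], list[str]]:
    """Index edges by (source, relation) so each hop is a dictionary lookup."""
    index: dict[tuple[str, str], list[str]] = defaultdict(list)
    for src, rel, dst in edges:
        index[(src, rel)].append(dst)
    return index


def trace(index: dict[tuple[str, str], list[str]], start: str, relations: list[str]) -> list[str]:
    """Follow the given relations in order, one hop per relation."""
    frontier = [start]
    for rel in relations:
        frontier = [dst for node in frontier for dst in index.get((node, rel), [])]
    return frontier


edges = [
    ("alias_nightowl", "alias_of", "user_7"),
    ("user_7", "member_of", "org_2"),
    ("user_7", "located_in", "loc_4"),
]
index = build_index(edges)

# One hop, comparable to an Easy identity-resolution task.
print(trace(index, "alias_nightowl", ["alias_of"]))
# Two hops, comparable to a Mid alias-to-user-to-organization trace.
print(trace(index, "alias_nightowl", ["alias_of", "member_of"]))
```

In the environment itself the agent does not see the edge list up front; it has to discover each edge through tool calls, which is why longer chains are harder.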

## Repository Layout

```text
src/osint_env/
  agents/        single-agent and swarm runners
  baselines/     reusable OpenAI baseline runner
  config/        shared config and seed loading
  data/          graph/view/task generation
  domain/        dataclasses and environment models
  env/           environment, reward logic, OpenEnv compatibility shim
  eval/          evaluation metrics and leaderboard helpers
  llm/           mock, Ollama, and OpenAI client wrappers
  memory/        working graph and semantic memory
  platforms/     tool APIs over synthetic platform views
  viz/           dashboard export

scripts/
  build_fixed_levels_dataset.py
  run_openai_baseline.py

datasets/fixed_levels/
  seed_fixed_levels.json
  shared_config_fixed_levels.json
  qwen_swarm_benchmark_fixed_levels.json

server.py        FastAPI app for local use and Docker/HF Spaces
Dockerfile       Container entrypoint for Hugging Face Docker Spaces
```

## Setup

Python 3.10+ is required.

Local install:

```bash
python -m pip install -e .
```

Run tests:

```bash
python -m pytest -q
```

Run the automated release gate:

```bash
python scripts/validate_release.py
```

## Usage

Run one demo episode:

```bash
osint-env demo --agent-mode swarm --llm-provider mock
```

Run a quick evaluation:

```bash
osint-env eval --episodes 5 --agent-mode swarm --llm-provider mock
```

Export a dashboard:

```bash
osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
```

## OpenAI Baseline

The reproducible OpenAI baseline is implemented in `scripts/run_openai_baseline.py`. It runs on the fixed-level benchmark, uses a stable seeded graph/task set, writes a JSON artifact, appends a leaderboard record, and exports a dashboard.

Default behavior:

- dataset: fixed-level benchmark
- episodes: 30
- max steps per episode: 8
- temperature: 0.0
- output artifact: `artifacts/baselines/openai_fixed_levels_latest.json`

Run it with an API key:

```bash
export OPENAI_API_KEY="your_key_here"
python scripts/run_openai_baseline.py --model gpt-5-nano
```

The script bounds each run (30 episodes, at most 8 steps per episode) so that a normal benchmark pass finishes in under 20 minutes on a lightweight chat model while still covering the full fixed task set. For repeatability it fixes the benchmark graph and tasks and uses deterministic decoding settings. Because remote model backends can still change over time, the output artifact also records model metadata and system fingerprints when available.

## Inference Script

The submission-ready inference entrypoint is the root `inference.py` file. It talks to the deployed Hugging Face Space over HTTP, uses the OpenAI client for all model calls, and emits structured stdout logs in the `[START]`, `[STEP]`, and `[END]` format.

The script accepts `HF_TOKEN` as the primary auth variable and also supports `OPENAI_API_KEY` or `API_KEY` as local fallbacks.
After a successful run, `inference.py` also posts the evaluation summary back to the Space so the latest `/dashboard` view reflects that run.
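
The exact field layout of the log lines is not described here. As one possible shape, the sketch below emits tagged `key=value` lines; the helper name `log_line` and every field name in the usage (`task`, `step`, `action`, `reward`, `success`) are assumptions, not the format `inference.py` actually uses.

```python
import sys
from typing import Any, Optional, TextIO


def log_line(tag: str, stream: Optional[TextIO] = None, **fields: Any) -> str:
    """Print one tagged line such as '[STEP] step=1 reward=0.0' and return it."""
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    line = f"[{tag}] {body}".rstrip()
    # Resolve stdout at call time and flush so the lines appear in order in captured logs.
    print(line, file=stream if stream is not None else sys.stdout, flush=True)
    return line


log_line("START", task="task_0")
log_line("STEP", step=1, action="CALL_TOOL", reward=0.0)
log_line("END", success=True, steps=1)
```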

Required environment variables:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN` (or `OPENAI_API_KEY` / `API_KEY` as a local fallback)

Optional environment variables:

- `SPACE_URL` default: `https://siddeshwar1625-osint.hf.space`
- `TASK_INDICES` default: `0,10,20`
- `MAX_STEPS` default: `8`
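
A minimal sketch of how these variables and defaults could be resolved. This is not the code in `inference.py`: the `InferenceConfig` and `load_config` names are hypothetical, and the order in which `OPENAI_API_KEY` and `API_KEY` are tried after `HF_TOKEN` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    token: str
    space_url: str
    task_indices: list[int]
    max_steps: int


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Resolve settings from the environment, applying the documented defaults."""
    # HF_TOKEN is primary; the order of the two fallbacks is an assumption.
    token = env.get("HF_TOKEN") or env.get("OPENAI_API_KEY") or env.get("API_KEY")
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if not token:
        missing.append("HF_TOKEN (or OPENAI_API_KEY / API_KEY)")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    raw_indices = env.get("TASK_INDICES", "0,10,20")
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        token=token,
        space_url=env.get("SPACE_URL", "https://siddeshwar1625-osint.hf.space"),
        task_indices=[int(part) for part in raw_indices.split(",") if part.strip()],
        max_steps=int(env.get("MAX_STEPS", "8")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests, for example `load_config({"API_BASE_URL": "...", "MODEL_NAME": "...", "HF_TOKEN": "..."})`.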

Example local test command against a running local server:

```bash
API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-5.4-mini OPENAI_API_KEY=your_key SPACE_URL=http://127.0.0.1:7860 python inference.py
```

Example test command against the deployed Space:

```bash
API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-5.4-mini OPENAI_API_KEY=your_key SPACE_URL=https://siddeshwar1625-osint.hf.space python inference.py
```

## Docker And Hugging Face Space

The repository is ready for a Docker-based Hugging Face Space:

- `README.md` includes `sdk: docker`
- `README.md` includes the `openenv` Space tag
- `Dockerfile` serves `server.py` on port `7860`

Local Docker smoke test:

```bash
docker build -t osint-openenv .
docker run --rm -p 7860:7860 osint-openenv
```

Then open `http://localhost:7860`.

The FastAPI app serves:

- `/`: overview page
- `/dashboard`: generated benchmark dashboard
- `/api/environment`: environment metadata
- `/health`: health check (validator-friendly alias)
- `/healthz`: health check (legacy alias)
- `/openenv.yaml`: OpenEnv HTTP spec stub
- `/openenv/tasks`: task enumeration
- `/reset` and `/openenv/reset`: episode reset endpoints
- `/step` and `/openenv/step`: episode step endpoints
- `/state` and `/openenv/state/{session_id}`: session state endpoints (`/state` returns the latest session)
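
The sketch below shows one way a client might address the reset and step endpoints using only the standard library. The request bodies are not documented here, so the use of `POST`, the `task_index` and `action` fields, and the helper names `build_post` and `send` are all assumptions; check `server.py` or `/openenv.yaml` for the real schema.

```python
import json
import urllib.request
from typing import Any


def build_post(base_url: str, path: str, payload: dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payloads; nothing is sent until send() is called.
reset_request = build_post("http://127.0.0.1:7860", "/reset", {"task_index": 0})
step_request = build_post(
    "http://127.0.0.1:7860",
    "/step",
    {"action": {"action_type": "ANSWER", "answer": "user_3"}},
)
print(reset_request.get_method(), reset_request.full_url)
```

With the local Docker container running, `send(reset_request)` would perform the call and return the decoded response.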

## Automated Validation

The repository includes a pass/fail validation gate for the core delivery requirements:

- Hugging Face Space readiness
- OpenEnv spec compliance
- reproducible baseline behavior
- at least 3 fixed tasks with working graders
- Docker image build in CI

Local gate:

```bash
python scripts/validate_release.py
```

CI gate:

- `.github/workflows/validation.yml`
- runs `pytest`
- runs the validation script
- runs `docker build`

## Baseline Scores

The fixed-level benchmark was expanded from the earlier 15-question set to a 30-question set with a larger seeded graph. Treat older benchmark artifacts as legacy and regenerate them on the new dataset before using them as reference scores.

After you run the OpenAI baseline with an API key, the scores for the expanded benchmark are written to:

- `artifacts/baselines/openai_fixed_levels_latest.json`
- `artifacts/baselines/openai_fixed_levels_dashboard.html`

## Notes On `pyproject.toml`

The packaging file is structurally correct for a `src/` layout and editable installs. The main gaps were deployment/runtime related rather than build-breaking:

- `openenv` is now version-bounded explicitly.
- `fastapi` and `uvicorn` are included because the repo now ships a real web server.
- pytest is pointed at the `tests/` directory, and the test suite also adds `src/` to `sys.path` so source-layout imports work reliably during local runs.

## Development Notes

The project keeps a lightweight local compatibility shim for `openenv` so the source tree remains importable even before dependencies are installed. In a normal install or Docker build, the real `openenv` package from PyPI is still used.