title: OSINT OpenEnv
emoji: 🕵️
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
tags:
- openenv
- osint
- benchmark
- docker
- fastapi
short_description: Docker OSINT benchmark with fixed OpenEnv tasks.
OSINT OpenEnv
OSINT OpenEnv is a synthetic benchmark environment for tool-using agents that must recover identities, trace events, and link entities across noisy multi-platform records. The project is designed to feel like a compact OSINT workflow rather than a raw QA dataset: agents query mock profiles, posts, forum threads, and semantic memory, build a working graph, and then submit an answer.
The motivation is to provide a reproducible OpenEnv-compatible environment for evaluating graph-building and tool-using reasoning without depending on live web data, unstable APIs, or private corpora. That makes it useful for local development, regression testing, and hosted demos such as a Docker-based Hugging Face Space.
Environment Summary
The environment generates or loads a hidden canonical graph of users, aliases, organizations, locations, posts, threads, and events. It then exposes partial platform views and a task list drawn from that graph.
The default hosted Space uses the fixed-level benchmark in datasets/fixed_levels/seed_fixed_levels.json, which now contains 30 stable tasks over a larger shared seeded graph.
Action Space
The environment exposes three actions:
- CALL_TOOL: query platform views or semantic memory with tools such as search_posts, get_profile, search_threads, get_connections, or search_memory.
- ADD_EDGE: add a candidate relation to the working memory graph.
- ANSWER: submit the final answer as an exact node id string.
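As a rough illustration of the three action types, the following sketch builds them as small Python objects. Only the action names and tool names come from this README; the Action class, its field names, the payload keys, and the node ids are hypothetical and may not match the real classes in src/osint_env/domain/.

```python
from dataclasses import dataclass, field
from typing import Any

# The three action names come from the README; everything else here
# (class name, field names, payload keys) is a hypothetical illustration.
ACTION_TYPES = ("CALL_TOOL", "ADD_EDGE", "ANSWER")


@dataclass
class Action:
    action_type: str
    payload: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject anything outside the three documented action types.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action type: {self.action_type!r}")


def call_tool(tool_name: str, **tool_args: Any) -> Action:
    """Query a platform view or semantic memory, e.g. search_posts."""
    return Action("CALL_TOOL", {"tool_name": tool_name, "args": tool_args})


def add_edge(src: str, rel: str, dst: str) -> Action:
    """Record a candidate relation in the working memory graph."""
    return Action("ADD_EDGE", {"src": src, "rel": rel, "dst": dst})


def answer(node_id: str) -> Action:
    """Submit the final answer as an exact node id string."""
    return Action("ANSWER", {"answer": node_id})


# A made-up three-step episode: look something up, note a link, answer.
episode = [
    call_tool("search_posts", query="harbor meetup"),
    add_edge("alias_17", "alias_of", "user_3"),
    answer("user_3"),
]
```

An episode typically ends with ANSWER, so a runner built on this sketch would stop as soon as it emits that action or reaches its step limit.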
Observation Space
Each step returns a JSON observation with four parts:
- tool_outputs: the most recent tool results.
- graph_snapshot: the current working-memory graph edges.
- action_history: recent actions and rewards.
- task: the active task id, task type, and question.
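A client can sanity-check an observation before acting on it. The sketch below validates the four top-level keys named above and prints a one-line summary. The inner shapes in the sample (lists of dicts, the task field names, the ids) are assumptions for illustration and are not taken from the actual server responses.

```python
from typing import Any

# The four top-level keys are named in the README; the inner shapes used
# in the sample below are assumptions made for illustration only.
OBSERVATION_KEYS = ("tool_outputs", "graph_snapshot", "action_history", "task")


def missing_keys(observation: dict[str, Any]) -> list[str]:
    """Return the documented keys that are absent from an observation."""
    return [key for key in OBSERVATION_KEYS if key not in observation]


def describe(observation: dict[str, Any]) -> str:
    """Summarize an observation, failing loudly if a key is missing."""
    missing = missing_keys(observation)
    if missing:
        raise KeyError(f"observation is missing: {', '.join(missing)}")
    return (
        f"{len(observation['tool_outputs'])} tool output(s), "
        f"{len(observation['graph_snapshot'])} edge(s), "
        f"{len(observation['action_history'])} past action(s)"
    )


# Hypothetical observation with made-up ids and field names.
sample = {
    "tool_outputs": [{"tool": "get_profile", "result": {"user": "user_3"}}],
    "graph_snapshot": [{"src": "alias_17", "rel": "alias_of", "dst": "user_3"}],
    "action_history": [
        {"action": "CALL_TOOL", "reward": 0.0},
        {"action": "ADD_EDGE", "reward": 0.1},
    ],
    "task": {"task_id": "t0", "task_type": "identity_resolution", "question": "?"},
}

print(describe(sample))
```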
Task Types And Difficulty
The benchmark mixes direct lookups with multi-hop traces:
- Easy: single-hop identity resolution, organization lookup, event lookup, or location lookup.
- Mid: two-hop alias-to-user-to-organization or thread-to-event-to-user traces.
- High: cross-platform multi-hop traces combining aliases, authored content, event references, organization links, and direct connections.
Common task families include:
- identity_resolution
- network_discovery
- event_tracing
- cross_platform_linking
- deanonymization
- convoluted_trace
Expected difficulty increases with the number of relations the agent must chain together and whether the evidence is split across posts, threads, aliases, and profile edges.
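To make the hop count concrete, here is a toy sketch of an Easy single-hop lookup and a Mid two-hop alias-to-user-to-organization trace over a handful of edges. The node ids, the relation names (alias_of, works_at), and the follow helper are all hypothetical; the benchmark's real graph schema may differ.

```python
from typing import Optional

# Hypothetical toy edges as (source, relation, target) triples.
EDGES = [
    ("alias_nightowl", "alias_of", "user_7"),
    ("user_7", "works_at", "org_2"),
    ("user_9", "works_at", "org_5"),
]


def follow(edges, start: str, relations: list[str]) -> Optional[list[str]]:
    """Follow one edge per relation name and return the visited node ids.

    Returns None when the chain breaks, i.e. a required edge is missing.
    """
    path = [start]
    for relation in relations:
        targets = [
            dst for src, rel, dst in edges
            if src == path[-1] and rel == relation
        ]
        if not targets:
            return None
        path.append(targets[0])
    return path


# Easy: a single hop from a user to an organization.
easy = follow(EDGES, "user_9", ["works_at"])

# Mid: two hops, alias -> user -> organization.
mid = follow(EDGES, "alias_nightowl", ["alias_of", "works_at"])
```

In the benchmark the agent does not see the edges up front; it has to discover each one through tool calls, which is why longer chains split across platforms are harder.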
Repository Layout
src/osint_env/
  agents/      single-agent and swarm runners
  baselines/   reusable OpenAI baseline runner
  config/      shared config and seed loading
  data/        graph/view/task generation
  domain/      dataclasses and environment models
  env/         environment, reward logic, OpenEnv compatibility shim
  eval/        evaluation metrics and leaderboard helpers
  llm/         mock, Ollama, and OpenAI client wrappers
  memory/      working graph and semantic memory
  platforms/   tool APIs over synthetic platform views
  viz/         dashboard export
scripts/
  build_fixed_levels_dataset.py
  run_openai_baseline.py
datasets/fixed_levels/
  seed_fixed_levels.json
  shared_config_fixed_levels.json
  qwen_swarm_benchmark_fixed_levels.json
server.py      FastAPI app for local use and Docker/HF Spaces
Dockerfile     Container entrypoint for Hugging Face Docker Spaces
Setup
Python 3.10+ is required.
Local install:
python -m pip install -e .
Run tests:
python -m pytest -q
Run the automated release gate:
python scripts/validate_release.py
Usage
Run one demo episode:
osint-env demo --agent-mode swarm --llm-provider mock
Run a quick evaluation:
osint-env eval --episodes 5 --agent-mode swarm --llm-provider mock
Export a dashboard:
osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
OpenAI Baseline
The reproducible OpenAI baseline is implemented in scripts/run_openai_baseline.py. It runs on the fixed-level benchmark, uses a stable seeded graph/task set, writes a JSON artifact, appends a leaderboard record, and exports a dashboard.
Default behavior:
- dataset: fixed-level benchmark
- episodes: 30
- max steps per episode: 8
- temperature: 0.0
- output artifact: artifacts/baselines/openai_fixed_levels_latest.json
Run it with an API key:
export OPENAI_API_KEY="your_key_here"
python scripts/run_openai_baseline.py --model gpt-5-nano
The script is bounded (30 episodes, at most 8 steps each) so that a normal benchmark pass finishes comfortably under 20 minutes on a lightweight chat model while still covering the full fixed task set. For repeatability it fixes the benchmark graph and tasks and uses deterministic decoding settings (temperature 0.0). Because remote model backends can still change over time, the output artifact also records model metadata and system fingerprints when available.
Inference Script
The submission-ready inference entrypoint is the root inference.py file. It talks to the deployed Hugging Face Space over HTTP, uses the OpenAI client for all model calls, and emits structured stdout logs in the [START], [STEP], and [END] format.
The script accepts HF_TOKEN as the primary auth variable and also supports OPENAI_API_KEY or API_KEY as local fallbacks.
After a successful run, inference.py also posts the evaluation summary back to the Space so the latest /dashboard view reflects that run.
Required environment variables:
- API_BASE_URL
- MODEL_NAME
- HF_TOKEN
Optional environment variables:
- SPACE_URL (default: https://siddeshwar1625-osint.hf.space)
- TASK_INDICES (default: 0,10,20)
- MAX_STEPS (default: 8)
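The variable handling described above can be sketched as follows. This is not the code in inference.py: the load_settings function and the returned key names are hypothetical, and the fallback order among OPENAI_API_KEY and API_KEY is an assumption. Only the variable names, the defaults, and HF_TOKEN being primary come from this README.

```python
import os
from typing import Mapping

DEFAULT_SPACE_URL = "https://siddeshwar1625-osint.hf.space"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve required and optional variables (hypothetical helper)."""
    # HF_TOKEN is primary; OPENAI_API_KEY and API_KEY act as local fallbacks.
    # The order between the two fallbacks is an assumption.
    token = env.get("HF_TOKEN") or env.get("OPENAI_API_KEY") or env.get("API_KEY")

    required = {
        "API_BASE_URL": env.get("API_BASE_URL"),
        "MODEL_NAME": env.get("MODEL_NAME"),
        "HF_TOKEN (or OPENAI_API_KEY / API_KEY)": token,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    return {
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "token": token,
        "space_url": env.get("SPACE_URL", DEFAULT_SPACE_URL).rstrip("/"),
        # "0,10,20" becomes [0, 10, 20]; empty segments are skipped.
        "task_indices": [
            int(part)
            for part in env.get("TASK_INDICES", "0,10,20").split(",")
            if part.strip()
        ],
        "max_steps": int(env.get("MAX_STEPS", "8")),
    }
```

Passing a plain dict instead of os.environ makes the resolution logic easy to exercise in tests without touching the real environment.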
Example local test command against a running local server:
API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-5.4-mini OPENAI_API_KEY=your_key SPACE_URL=http://127.0.0.1:7860 python inference.py
Example test command against the deployed Space:
API_BASE_URL=https://api.openai.com/v1 MODEL_NAME=gpt-5.4-mini OPENAI_API_KEY=your_key SPACE_URL=https://siddeshwar1625-osint.hf.space python inference.py
Docker And Hugging Face Space
The repository is ready for a Docker-based Hugging Face Space:
- README.md includes sdk: docker
- README.md includes the openenv Space tag
- Dockerfile serves server.py on port 7860
Local Docker smoke test:
docker build -t osint-openenv .
docker run --rm -p 7860:7860 osint-openenv
Then open http://localhost:7860.
The FastAPI app serves:
- /: overview page
- /dashboard: generated benchmark dashboard
- /api/environment: environment metadata
- /health: health check (validator-friendly alias)
- /healthz: health check (legacy alias)
- /openenv.yaml: OpenEnv HTTP spec stub
- /openenv/tasks: task enumeration
- /reset and /openenv/reset: episode reset endpoints
- /step and /openenv/step: episode step endpoints
- /state and /openenv/state/{session_id}: session state endpoints (/state returns the latest session)
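For a quick check of the read-only routes from Python, a minimal standard-library sketch might look like this. It assumes /health and /openenv/tasks answer plain GET requests with JSON bodies, which this README does not state. The helper names are hypothetical, and the reset and step endpoints are left out because their request format is not documented here.

```python
import json
from urllib.request import urlopen


def endpoint(base_url: str, path: str) -> str:
    """Join a base URL and a route without doubling or dropping slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(base_url: str, path: str, timeout: float = 10.0):
    """GET a route and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint(base_url, path), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def smoke_test(base_url: str = "http://localhost:7860") -> None:
    """Print the health and task-list responses of a running server."""
    print(get_json(base_url, "/health"))
    print(get_json(base_url, "/openenv/tasks"))
```

With the container from the Docker smoke test running, calling smoke_test() should print both responses; a connection error usually means the server is not listening on port 7860 yet.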
Automated Validation
The repository includes a pass/fail validation gate for the core delivery requirements:
- Hugging Face Space readiness
- OpenEnv spec compliance
- reproducible baseline behavior
- at least 3 fixed tasks with working graders
- Docker image build in CI
Local gate:
python scripts/validate_release.py
CI gate:
.github/workflows/validation.yml
- runs pytest
- runs the validation script
- runs docker build
Baseline Scores
The fixed-level benchmark was expanded from the earlier 15-question set to a 30-question set with a larger seeded graph, so older benchmark artifacts should be treated as legacy and regenerated on the new dataset before using them as reference scores.
After you supply an OpenAI API key, the current baseline scores for the expanded benchmark will be written to:
- artifacts/baselines/openai_fixed_levels_latest.json
- artifacts/baselines/openai_fixed_levels_dashboard.html
Notes On pyproject.toml
The packaging file is structurally correct for a src/ layout and editable installs. The main gaps were deployment/runtime related rather than build-breaking:
- openenv is now version-bounded explicitly.
- fastapi and uvicorn are included because the repo now ships a real web server.
- pytest is pointed at the tests/ directory, and the test suite also adds src/ to sys.path so source-layout imports work reliably during local runs.
Development Notes
The project keeps a lightweight local compatibility shim for openenv so the source tree remains importable even before dependencies are installed. In a normal install or Docker build, the real openenv package from PyPI is still used.