siddeshwar-kagatikar committed on
Commit
dbcbf00
·
1 Parent(s): 9e6be29

tested using openai

.dockerignore ADDED
@@ -0,0 +1,9 @@
+ .git
+ .pytest_cache
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ artifacts
+ tests
Dockerfile ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.12-slim
+
+ RUN useradd -m -u 1000 user
+
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH \
+     PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PORT=7860
+
+ WORKDIR $HOME/app
+
+ COPY --chown=user pyproject.toml README.md $HOME/app/
+ COPY --chown=user src $HOME/app/src
+ COPY --chown=user config $HOME/app/config
+ COPY --chown=user datasets $HOME/app/datasets
+ COPY --chown=user docs $HOME/app/docs
+ COPY --chown=user scripts $HOME/app/scripts
+ COPY --chown=user server.py $HOME/app/server.py
+
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir -e .
+
+ EXPOSE 7860
+
+ CMD ["sh", "-c", "uvicorn server:app --host 0.0.0.0 --port ${PORT:-7860}"]
+
README.md CHANGED
@@ -1,331 +1,200 @@
- # OSINT RL Environment
-
- This repository implements a simulated OSINT-style reinforcement learning environment where agents build and query a knowledge graph over fragmented multi-platform synthetic data.
-
- The codebase now supports both single-agent and low-width multi-agent swarm execution, seeded task and graph bootstrapping, benchmark scoring, and interactive visualization.
-
- ## 1. What The Project Does
-
- The environment models a realistic workflow for information discovery and linking:
-
- 1. Generate a hidden canonical graph with users, aliases, organizations, locations, and links.
- 2. Project noisy partial views into mock platforms (microblog, forum, profile).
- 3. Ask identity-resolution, network-discovery, and event-tracing questions.
- 4. Let agents call tools, add graph edges, and submit answers.
- 5. Score episodes using a composite reward that combines correctness, retrieval utility, graph quality, and efficiency.
-
- The tool layer also supports semantic-memory retrieval over prior observations:
-
- - search_memory(query, k): vector-style retrieval over accumulated tool outputs.
-
- ## 2. Current Capabilities
-
- - Single-agent baseline runner.
- - Multi-agent swarm runner with constrained breadth and width (configurable, low by default).
- - Seeded graph nodes and edges from user-provided JSON.
- - Seeded questions from user-provided JSON.
- - LLM-assisted generation hooks for remaining graph/task expansion with deterministic fallback.
- - Persistent benchmark leaderboard with composite utility score.
- - Interactive dashboard showing:
-   - canonical graph,
-   - episode graph diff (predicted vs truth),
-   - source database explorer,
-   - benchmark charts and leaderboard.
-
- ## 3. Installation
-
- Environment setup from the project root:
-
- 1. Activate your Python environment.
- 2. Install package dependencies.
-
- Example:
-
-     source ~/arl/bin/activate
-     uv pip install -e .
-
- The project requires Python 3.10+.
-
- ## 3.1 LLM Backends
-
- The environment supports three LLM providers:
-
- - mock: deterministic fallback for reproducible local tests.
- - ollama: local model inference (recommended for offline development).
- - openai: remote API provider using an API key.
-
- The provider is configured through config/shared_config.json (llm block) and can be overridden from CLI.
-
- ### Local Ollama Setup (Qwen 3 2B)
-
- 1. Install Ollama.
- 2. Start Ollama service.
- 3. Pull the model:
-
-     ollama pull qwen3:2b
-
- If your local Ollama registry does not expose `qwen3:2b`, use:
-
-     ollama pull qwen3:1.7b
-     ollama cp qwen3:1.7b qwen3:2b
-
- 4. Run demo in swarm mode with local model:
-
-     osint-env demo --agent-mode swarm --llm-provider ollama --llm-model qwen3:2b
-
- ### OpenAI Setup
-
- 1. Export API key:
-
-     export OPENAI_API_KEY="your_key_here"
-
- 2. Run with OpenAI backend:
-
-     osint-env eval --episodes 10 --llm-provider openai --llm-model gpt-4o-mini
-
- You can also provide the key via config/shared_config.json using llm.openai_api_key,
- or specify a custom environment variable name via llm.openai_api_key_env.
-
- ## 4. Repository Layout
-
-     src/osint_env/
-       agents/      single-agent and swarm runners
-       config/      shared config loader
-       data/        canonical graph, views, and task generation
-       domain/      data models and configuration dataclasses
-       env/         OpenEnv environment and reward logic
-       eval/        metrics, runner, leaderboard
-       llm/         LLM client interface and local mock
-       memory/      in-memory KG and semantic memory
-       platforms/   platform tool APIs
-       viz/         dashboard export
-       cli.py       command-line entrypoint
-
-     config/
-       shared_config.json   shared runtime/environment/swarm/reward config
-       seed_example.json    example seeded graph and question file
-
- ## 5. Shared Configuration
-
- All core knobs are centralized in config/shared_config.json.
-
- This file includes:
-
- - environment generation controls,
- - swarm limits,
- - spawn reward shaping hyperparameters,
- - seeding defaults,
- - llm backend defaults,
- - runtime output paths.
-
- Default swarm settings are intentionally conservative:
-
- - max_agents: 3
- - max_breadth: 2
- - max_width: 2
- - max_depth: 2
-
- These defaults keep orchestration cost and branching low while enabling swarm behavior.
-
- ## 6. Seeding Questions And Partial Graphs
-
- You can manually seed:
-
- - graph nodes,
- - graph edges,
- - task questions (optionally with answers and supporting edges).
-
- Use a seed file with the same structure as config/seed_example.json and pass it using --seed-file.
-
- Workflow:
-
- 1. Add your manual graph fragments and questions to a JSON file.
- 2. Keep llm_generate_remaining_graph and llm_generate_remaining_tasks enabled to fill the rest automatically.
- 3. Run demo/eval/benchmark with --seed-file.
-
- ## 7. CLI Usage
-
- All commands accept:
-
- - --config for shared config path (default: config/shared_config.json)
- - --seed-file for seeded graph/task input JSON
- - --agent-mode with values: config, single, swarm
- - --llm-provider with values: config, mock, ollama, openai
- - --llm-model to override configured model
- - --ollama-base-url to override local Ollama endpoint
- - --openai-api-key or --openai-api-key-env for OpenAI authentication
-
- Main commands:
-
- 1. Run one episode:
-
-     osint-env demo --agent-mode swarm
-
- 2. Evaluate episodes:
-
-     osint-env eval --episodes 20 --agent-mode single
-
- 3. Benchmark and export dashboard:
-
-     osint-env benchmark --episodes 20 --name baseline_swarm
-
- 4. Multi-seed benchmark sweep:
-
-     osint-env benchmark-sweep --seeds 7,11,17,23,31 --name-prefix sweep_swarm
-
- 5. Print leaderboard:
-
-     osint-env leaderboard --sort-by leaderboard_score --top 15
-
- 6. Export explorer without full benchmark:
-
-     osint-env viz --with-demo --output artifacts/osint_explorer.html
-
- 7. Benchmark with local Qwen model:
-
-     osint-env benchmark --episodes 20 --agent-mode swarm --llm-provider ollama --llm-model qwen3:2b --name qwen3_swarm
-
- 8. Fast local smoke benchmark:
-
-     osint-env benchmark --episodes 1 --agent-mode swarm --llm-provider ollama --llm-model qwen3:2b --seed-file config/seed_ollama_smoke.json --name ollama_qwen_smoke
-
- ## 8. Multi-Agent Swarm Design
-
- Swarm orchestration is implemented in src/osint_env/agents/swarm_agent.py.
-
- Design choices:
-
- - Shared environment state (single episode state machine).
- - Planner rounds bounded by max_depth and planner_rounds.
- - Parallel workers bounded by min(max_agents, max_breadth, max_width).
- - Each worker performs limited tool calls, then attempts edge addition.
- - Final answer is submitted once planning rounds complete or episode ends.
-
- Reward compatibility:
-
- - Existing edge and answer reward components are unchanged.
- - Spawn utility is added as an auxiliary term using the PARL-style helper in src/osint_env/env/spawn_reward_hooks.py.
- - Spawn telemetry (count, critical steps, completion) is tracked in episode info and evaluation summaries.
-
- ## 9. Reward Design (Integrated Notes)
-
- The reward function is a composite of graph-construction and answer-time utility terms. It combines ideas from DeepPath, EMNLP 2018 reward shaping, UniRel, and AutoGraph-R1.
-
- ### 9.1 Edge Reward During Graph Construction
-
- For each ADD_EDGE action, the environment combines:
-
- 1. Global accuracy signal (DeepPath-style positive/negative credit).
- 2. Soft shaping term inspired by EMNLP 2018 reward shaping:
-
-     R = Rb + (1 - Rb) f(s, r, o)
-
-    where f is approximated in code with relation and type priors plus small domain priors.
-
- 3. Efficiency bonus inversely proportional to step count.
- 4. Diversity bonus using signature novelty against previous edges.
- 5. Relation informativeness using normalized relation IDF.
- 6. Entity informativeness using inverse hubness penalty.
- 7. Connectivity gain bonus for bridge-style edges.
-
- ### 9.2 Final Answer Reward
-
- For ANSWER, reward includes:
-
- 1. format validity,
- 2. correctness,
- 3. knowledge-carrying utility (AutoGraph-style deducibility),
- 4. knowledge-indexing utility (AutoGraph-style evidence coverage proxy over tool outputs),
- 5. UniRel-style connectivity score over seed entities,
- 6. graph F1 against supporting edges,
- 7. compactness and repetition controls,
- 8. efficiency and informativeness terms.
-
- ### 9.3 Swarm Auxiliary Reward
-
- The swarm runner adds a PARL-style auxiliary term based on:
-
- - spawn parallelism,
- - finished subtask ratio,
- - critical-step latency proxy,
- - optional breadth and depth shaping.
-
- This auxiliary term is configurable in shared_config.json via spawn_reward.
-
- ### 9.4 Benchmark Metrics
-
- Evaluation tracks:
-
- - task success,
- - graph F1,
- - deanonymization accuracy,
- - tool efficiency,
- - retrieval and structural utility signals,
- - spawn signals (for swarm runs),
- - composite leaderboard score.
-
- ## 10. Interactive Dashboard
-
- Dashboard export includes:
-
- - canonical graph explorer,
- - episode graph comparison,
- - node and edge inspectors,
- - source database table with record detail pane,
- - reward and graph traces,
- - sortable leaderboard snapshot.
-
- Primary outputs:
-
- - artifacts/osint_dashboard.html
- - artifacts/osint_explorer.html
- - artifacts/sweep_dashboards/*.html
-
- ## 11. Notes On LLM Generation
-
- Dataset generation supports an LLM-assisted expansion path for remaining tasks and graph edges.
-
- If no model is connected or structured output is unavailable, deterministic template fallback is used. This preserves reproducibility while keeping the interface compatible with stronger local or remote LLMs.
-
- ## 12. Citation And Source Papers
-
- Reward components and swarm hooks are informed by the following papers:
-
- 1. AutoGraph-R1: Enhancing Agentic RAG with Graph-R1 for Complex QA.
-    arXiv: https://arxiv.org/abs/2510.15339
-
- 2. UniRel: Graph-based Relational Retrieval for LLM Reasoning.
-    arXiv: https://arxiv.org/abs/2512.17043
-
- 3. DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning.
-    EMNLP 2017: https://aclanthology.org/D17-1060/
-
- 4. Multi-Hop Knowledge Graph Reasoning with Reward Shaping.
-    EMNLP 2018: https://aclanthology.org/D18-1362/
-
- 5. Kimi K2.5 (PARL-style multi-agent shaping motivation).
-    arXiv: https://arxiv.org/abs/2602.02276
-
- Additional context:
-
- 6. MINERVA: Reinforcement Learning for Query Answering over Knowledge Graphs.
-    arXiv: https://arxiv.org/abs/1711.05851
-
- ## 13. Development And Testing
-
- Run tests from project root:
-
-     pytest -q
-
- Recommended validation after config changes:
-
- 1. osint-env demo --agent-mode swarm
- 2. osint-env eval --episodes 5
- 3. osint-env benchmark --episodes 5 --name quick_check
- 4. osint-env leaderboard --top 5
-
- ## 14. Scope Boundaries
-
- - This repository supports a low-width swarm baseline and reward-compatible orchestration.
- - It does not include a full distributed training stack or asynchronous external worker runtime.
- - The architecture keeps those extensions possible without breaking current interfaces.
+ ---
+ title: OSINT OpenEnv
+ emoji: 🕵️
+ colorFrom: teal
+ colorTo: amber
+ sdk: docker
+ app_port: 7860
+ tags:
+ - openenv
+ - osint
+ - benchmark
+ - docker
+ - fastapi
+ short_description: Containerized OpenEnv-compatible OSINT benchmark with fixed-level tasks, dashboard export, and an OpenAI baseline runner.
+ ---
+
+ # OSINT OpenEnv
+
+ OSINT OpenEnv is a synthetic benchmark environment for tool-using agents that must recover identities, trace events, and link entities across noisy multi-platform records. The project is designed to feel like a compact OSINT workflow rather than a raw QA dataset: agents query mock profiles, posts, forum threads, and semantic memory, build a working graph, and then submit an answer.
+
+ The motivation is to provide a reproducible OpenEnv-compatible environment for evaluating graph-building and tool-using reasoning without depending on live web data, unstable APIs, or private corpora. That makes it useful for local development, regression testing, and hosted demos such as a Docker-based Hugging Face Space.
+
+ ## Environment Summary
+
+ The environment generates or loads a hidden canonical graph of users, aliases, organizations, locations, posts, threads, and events. It then exposes partial platform views and a task list drawn from that graph.
+
+ The default hosted Space uses the fixed-level benchmark in [`datasets/fixed_levels/seed_fixed_levels.json`](datasets/fixed_levels/seed_fixed_levels.json), which contains 15 stable tasks over one shared seeded graph.
+ ## Action Space
+
+ The environment exposes three actions:
+
+ - `CALL_TOOL`: query platform views or semantic memory such as `search_posts`, `get_profile`, `search_threads`, `get_connections`, or `search_memory`.
+ - `ADD_EDGE`: add a candidate relation to the working memory graph.
+ - `ANSWER`: submit the final answer as an exact node id string.
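+
+ A minimal programmatic sketch of this loop, assuming the in-repo APIs that the baseline runner uses (`OSINTEnvironment`, `Action`, `ActionType`); the node ids below are invented placeholders:
+
+ ```python
+ from osint_env.config import clone_environment_config, load_shared_config
+ from osint_env.domain.models import Action, ActionType
+ from osint_env.env.environment import OSINTEnvironment
+
+ # Build an environment from the fixed-level shared config bundled with the repo.
+ shared = load_shared_config("datasets/fixed_levels/shared_config_fixed_levels.json")
+ env = OSINTEnvironment(clone_environment_config(shared.environment))
+ obs = env.reset()
+
+ # CALL_TOOL: query a platform view.
+ obs, reward, done, info = env.step(
+     Action(ActionType.CALL_TOOL, {"tool_name": "search_posts", "args": {"query": "conference"}})
+ )
+ # ADD_EDGE: record a hypothesized relation ("alias_3" / "user_7" are invented ids).
+ obs, reward, done, info = env.step(
+     Action(ActionType.ADD_EDGE, {"src": "alias_3", "rel": "alias_of", "dst": "user_7", "confidence": 0.9})
+ )
+ # ANSWER: end the episode with an exact node id.
+ obs, reward, done, info = env.step(Action(ActionType.ANSWER, {"answer": "user_7"}))
+ ```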
 
+ ## Observation Space
+
+ Each step returns a JSON observation with four parts:
+
+ - `tool_outputs`: the most recent tool results.
+ - `graph_snapshot`: the current working-memory graph edges.
+ - `action_history`: recent actions and rewards.
+ - `task`: the active task id, task type, and question.
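+
+ As a purely illustrative sketch (all ids and values here are invented), one observation has roughly this shape:
+
+ ```python
+ observation = {
+     "task": {
+         "task_id": "fixed_03",
+         "task_type": "identity_resolution",
+         "question": "Which canonical user posts under the alias 'nightowl'?",
+     },
+     "tool_outputs": [{"tool_name": "search_posts", "output": {"matches": ["post_12"]}}],
+     "graph_snapshot": {"edges": [{"src": "alias_3", "rel": "alias_of", "dst": "user_7"}]},
+     "action_history": [{"action": "CALL_TOOL", "reward": 0.05}],
+ }
+ ```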
 
+ ## Task Types And Difficulty
+
+ The benchmark mixes direct lookups with multi-hop traces:
+
+ - Easy: single-hop identity resolution, organization lookup, event lookup, or location lookup.
+ - Mid: two-hop alias-to-user-to-organization or thread-to-event-to-user traces.
+ - High: cross-platform multi-hop traces combining aliases, authored content, event references, organization links, and direct connections.
+
+ Common task families include:
+
+ - `identity_resolution`
+ - `network_discovery`
+ - `event_tracing`
+ - `cross_platform_linking`
+ - `deanonymization`
+ - `convoluted_trace`
+
+ Expected difficulty increases with the number of relations the agent must chain together and whether the evidence is split across posts, threads, aliases, and profile edges.
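+
+ For orientation, a hypothetical fixed-level task record, inferred only from the fields the evaluation code reads (`task_id`, `task_type`, `question`, `answer`, `supporting_edges`, `metadata`), might look like:
+
+ ```python
+ # Hypothetical shape; the real seed file schema lives in datasets/fixed_levels/.
+ task = {
+     "task_id": "fixed_07",
+     "task_type": "deanonymization",
+     "question": "Which canonical user operates the alias 'nightowl'?",
+     "answer": "user_7",  # answers are exact node id strings
+     "supporting_edges": [{"src": "alias_3", "rel": "alias_of", "dst": "user_7"}],
+     "metadata": {"difficulty": "mid"},
+ }
+ ```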
 
+ ## Repository Layout
+
+ ```text
+ src/osint_env/
+   agents/      single-agent and swarm runners
+   baselines/   reusable OpenAI baseline runner
+   config/      shared config and seed loading
+   data/        graph/view/task generation
+   domain/      dataclasses and environment models
+   env/         environment, reward logic, OpenEnv compatibility shim
+   eval/        evaluation metrics and leaderboard helpers
+   llm/         mock, Ollama, and OpenAI client wrappers
+   memory/      working graph and semantic memory
+   platforms/   tool APIs over synthetic platform views
+   viz/         dashboard export
+
+ scripts/
+   build_fixed_levels_dataset.py
+   run_openai_baseline.py
+
+ datasets/fixed_levels/
+   seed_fixed_levels.json
+   shared_config_fixed_levels.json
+   qwen_swarm_benchmark_fixed_levels.json
+
+ server.py    FastAPI app for local use and Docker/HF Spaces
+ Dockerfile   Container entrypoint for Hugging Face Docker Spaces
+ ```
+
+ ## Setup
+
+ Python 3.10+ is required.
+
+ Local install:
+
+ ```bash
+ python -m pip install -e .
+ ```
+
+ Run tests:
+
+ ```bash
+ python -m pytest -q
+ ```
+
+ ## Usage
+
+ Run one demo episode:
+
+ ```bash
+ osint-env demo --agent-mode swarm --llm-provider mock
+ ```
+
+ Run a quick evaluation:
+
+ ```bash
+ osint-env eval --episodes 5 --agent-mode swarm --llm-provider mock
+ ```
+
+ Run a benchmark and export a dashboard:
+
+ ```bash
+ osint-env benchmark --episodes 5 --agent-mode swarm --llm-provider mock --name quick_check
+ ```
+
+ ## OpenAI Baseline
+
+ The reproducible OpenAI baseline is implemented in [`scripts/run_openai_baseline.py`](scripts/run_openai_baseline.py). It runs on the fixed-level benchmark, uses a stable seeded graph/task set, writes a JSON artifact, appends a leaderboard record, and exports a dashboard.
+
+ Default behavior:
+
+ - dataset: fixed-level benchmark
+ - episodes: 15
+ - max steps per episode: 8
+ - temperature: 0.0
+ - output artifact: `artifacts/baselines/openai_fixed_levels_latest.json`
+
+ Run it with an API key:
+
+ ```bash
+ export OPENAI_API_KEY="your_key_here"
+ python scripts/run_openai_baseline.py --model gpt-4o-mini
+ ```
+
+ The script is designed to stay bounded enough for a normal benchmark pass to finish comfortably under 20 minutes on a lightweight chat model, while still using the full fixed task set. For repeatability it fixes the benchmark graph/tasks and uses deterministic decoding settings. Because remote model backends can still change over time, the output artifact also records model metadata and system fingerprints when available.
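+
+ A small sketch for inspecting the resulting artifact; the keys follow the payload that the runner in this commit writes (`run`, `summary`, `usage`, `episodes`):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ payload = json.loads(Path("artifacts/baselines/openai_fixed_levels_latest.json").read_text())
+ print(payload["run"]["model"], payload["run"]["duration_seconds"])
+ print(payload["summary"])  # aggregate metrics
+ print(payload["usage"])    # token totals across all episodes
+ for row in payload["episodes"]:
+     print(row["task_id"], row["success"], row["openai_system_fingerprints"])
+ ```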
 
+ ## Docker And Hugging Face Space
+
+ The repository is ready for a Docker-based Hugging Face Space:
+
+ - `README.md` includes `sdk: docker`
+ - `README.md` includes the `openenv` Space tag
+ - `Dockerfile` serves [`server.py`](server.py) on port `7860`
+
+ Local Docker smoke test:
+
+ ```bash
+ docker build -t osint-openenv .
+ docker run --rm -p 7860:7860 osint-openenv
+ ```
+
+ Then open `http://localhost:7860`.
+
+ The FastAPI app serves:
+
+ - `/`: overview page
+ - `/dashboard`: generated benchmark dashboard
+ - `/api/environment`: environment metadata
+ - `/healthz`: health check
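+
+ With the container running, a quick end-to-end check from Python (using the `requests` dependency already declared in `pyproject.toml`):
+
+ ```python
+ import requests
+
+ assert requests.get("http://localhost:7860/healthz").json()["status"] == "ok"
+ meta = requests.get("http://localhost:7860/api/environment").json()
+ print(meta["task_count"], meta["action_space"])
+ ```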
 
 
 
+ ## Baseline Scores
+
+ Bundled fixed-level baseline artifact:
+
+ | baseline | provider | model | episodes | task success | avg graph f1 | leaderboard score |
+ |---|---|---|---:|---:|---:|---:|
+ | `fixed_levels_qwen_swarm` | Ollama | `qwen3:2b` | 15 | 1.000 | 0.849 | 0.854 |
+
+ Source: [`datasets/fixed_levels/qwen_swarm_benchmark_fixed_levels.json`](datasets/fixed_levels/qwen_swarm_benchmark_fixed_levels.json)
+
+ After you supply an OpenAI API key, the matching baseline scores will be written to:
+
+ - [`artifacts/baselines/openai_fixed_levels_latest.json`](artifacts/baselines/openai_fixed_levels_latest.json)
+ - [`artifacts/baselines/openai_fixed_levels_dashboard.html`](artifacts/baselines/openai_fixed_levels_dashboard.html)
+
+ ## Notes On `pyproject.toml`
+
+ The packaging file is structurally correct for a `src/` layout and editable installs. The main gaps were deployment/runtime related rather than build-breaking:
+
+ - `openenv` is now version-bounded explicitly.
+ - `fastapi` and `uvicorn` are included because the repo now ships a real web server.
+ - `pytest` is pointed at the `tests/` directory, and the test suite also adds `src/` to `sys.path` so source-layout imports work reliably during local runs.
+
+ ## Development Notes
+
+ The project keeps a lightweight local compatibility shim for `openenv` so the source tree remains importable even before dependencies are installed. In a normal install or Docker build, the real `openenv` package from PyPI is still used.
pyproject.toml CHANGED
@@ -5,9 +5,16 @@ description = "OSINT-style multi-platform information ecosystem environment for
  readme = "README.md"
  requires-python = ">=3.10"
  dependencies = [
-     "openenv",
+     "openenv>=0.1.13",
      "openai>=1.40.0",
-     "requests>=2.31.0",
+     "fastapi>=0.115.0",
+     "requests>=2.32.3",
+     "uvicorn>=0.30.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
  ]

  [project.scripts]
@@ -22,3 +29,6 @@ package-dir = {"" = "src"}

  [tool.setuptools.packages.find]
  where = ["src"]
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
scripts/run_openai_baseline.py ADDED
@@ -0,0 +1,60 @@
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+
+ from osint_env.baselines import OpenAIBaselineConfig, OpenAIBaselineRunner
+
+
+ def build_parser() -> argparse.ArgumentParser:
+     parser = argparse.ArgumentParser(description="Run the reproducible OpenAI baseline on the fixed-level OSINT benchmark.")
+     parser.add_argument("--config", default="datasets/fixed_levels/shared_config_fixed_levels.json", help="Shared config JSON.")
+     parser.add_argument("--seed-file", default="datasets/fixed_levels/seed_fixed_levels.json", help="Fixed seed file JSON.")
+     parser.add_argument("--output", default="artifacts/baselines/openai_fixed_levels_latest.json", help="Baseline result JSON output path.")
+     parser.add_argument("--leaderboard", default="artifacts/baselines/openai_fixed_levels_leaderboard.json", help="Leaderboard JSON path.")
+     parser.add_argument("--dashboard", default="artifacts/baselines/openai_fixed_levels_dashboard.html", help="Dashboard HTML path.")
+     parser.add_argument("--run-name", default="openai_fixed_levels_baseline", help="Leaderboard run name.")
+     parser.add_argument("--model", default="gpt-4o-mini", help="OpenAI chat model name.")
+     parser.add_argument("--openai-base-url", default="https://api.openai.com/v1", help="OpenAI-compatible base URL.")
+     parser.add_argument("--openai-api-key", default="", help="OpenAI API key override.")
+     parser.add_argument("--openai-api-key-env", default="OPENAI_API_KEY", help="Environment variable name for the API key.")
+     parser.add_argument("--episodes", type=int, default=15, help="Number of episodes to evaluate.")
+     parser.add_argument("--max-steps", type=int, default=8, help="Episode step budget to keep runs bounded.")
+     parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature.")
+     parser.add_argument("--max-tokens", type=int, default=256, help="Maximum completion tokens per step.")
+     parser.add_argument("--timeout-seconds", type=int, default=60, help="Per-request timeout.")
+     parser.add_argument("--seed", type=int, default=7, help="Request seed offset used for repeatable runs.")
+     parser.add_argument("--skip-leaderboard", action="store_true", help="Do not append the run to the leaderboard file.")
+     return parser
+
+
+ def main() -> None:
+     args = build_parser().parse_args()
+     api_key = args.openai_api_key or os.getenv(args.openai_api_key_env, "")
+     config = OpenAIBaselineConfig(
+         shared_config_path=args.config,
+         seed_file=args.seed_file,
+         output_path=args.output,
+         leaderboard_path=args.leaderboard,
+         dashboard_path=args.dashboard,
+         run_name=args.run_name,
+         model=args.model,
+         base_url=args.openai_base_url,
+         api_key=api_key,
+         api_key_env=args.openai_api_key_env,
+         temperature=args.temperature,
+         max_tokens=args.max_tokens,
+         timeout_seconds=args.timeout_seconds,
+         episodes=args.episodes,
+         max_steps=args.max_steps,
+         seed=args.seed,
+         append_leaderboard=not args.skip_leaderboard,
+     )
+     result = OpenAIBaselineRunner(config).run()
+     print(json.dumps({"summary": result["summary"], "output": args.output, "dashboard": args.dashboard}, indent=2, sort_keys=True))
+
+
+ if __name__ == "__main__":
+     main()
+
server.py ADDED
@@ -0,0 +1,221 @@
+ from __future__ import annotations
+
+ import json
+ import os
+ from collections import Counter
+ from functools import lru_cache
+ from pathlib import Path
+ from typing import Any
+
+ from fastapi import FastAPI
+ from fastapi.responses import FileResponse, HTMLResponse, JSONResponse
+
+ from osint_env.config import clone_environment_config, load_seeding_config, load_shared_config
+ from osint_env.env.environment import OSINTEnvironment
+ from osint_env.eval.runner import run_evaluation
+ from osint_env.llm import build_llm_client
+ from osint_env.viz import export_dashboard
+
+
+ SPACE_CONFIG_PATH = Path(os.getenv("OSINT_ENV_CONFIG", "datasets/fixed_levels/shared_config_fixed_levels.json"))
+ SPACE_SEED_PATH = Path(os.getenv("OSINT_ENV_SEED_FILE", "datasets/fixed_levels/seed_fixed_levels.json"))
+ SPACE_PROVIDER = os.getenv("OSINT_SPACE_LLM_PROVIDER", "mock")
+ SPACE_MODEL = os.getenv("OSINT_SPACE_LLM_MODEL", "gpt-4o-mini")
+ SPACE_PORT = int(os.getenv("PORT", "7860"))
+ SPACE_DASHBOARD = Path("artifacts/space_dashboard.html")
+
+
+ def _build_environment() -> OSINTEnvironment:
+     shared = load_shared_config(SPACE_CONFIG_PATH)
+     env_cfg = clone_environment_config(shared.environment)
+     if SPACE_SEED_PATH.exists():
+         env_cfg.seeding = load_seeding_config(SPACE_SEED_PATH)
+     env_cfg.llm.provider = SPACE_PROVIDER
+     env_cfg.llm.model = SPACE_MODEL
+     # Fall back to the deterministic mock provider if the configured backend is unavailable.
+     try:
+         llm = build_llm_client(env_cfg.llm)
+     except Exception:
+         env_cfg.llm.provider = "mock"
+         llm = build_llm_client(env_cfg.llm)
+     return OSINTEnvironment(env_cfg, llm=llm)
+
+
+ # Cached: the Space computes one snapshot per process and serves it for every request.
+ @lru_cache(maxsize=1)
+ def _space_snapshot() -> dict[str, Any]:
+     env = _build_environment()
+     evaluation = run_evaluation(env, episodes=3, return_details=True, llm=build_llm_client(env.config.llm))
+     dashboard_path = export_dashboard(
+         env=env,
+         evaluation=evaluation,
+         leaderboard_records=[],
+         output_path=str(SPACE_DASHBOARD),
+     )
+     difficulty_counts = Counter(str(task.metadata.get("difficulty", "unknown")) for task in env.tasks)
+     return {
+         "dashboard_path": dashboard_path,
+         "summary": evaluation["summary"],
+         "task_count": len(env.tasks),
+         "difficulty_counts": dict(difficulty_counts),
+         "action_space": ["CALL_TOOL", "ADD_EDGE", "ANSWER"],
+         "observation_space": {
+             "tool_outputs": "Last tool results and memory hits.",
+             "graph_snapshot": "Current working graph edge snapshot.",
+             "action_history": "Recent action/reward trace.",
+             "task": "Task id, task type, and question.",
+         },
+         "task_types": sorted({task.task_type for task in env.tasks}),
+         "config": {
+             "seed": env.config.seed,
+             "max_steps": env.config.max_steps,
+             "swarm_enabled": env.config.swarm.enabled,
+             "llm_provider": env.config.llm.provider,
+             "llm_model": env.config.llm.model,
+         },
+     }
+
+
+ app = FastAPI(title="OSINT OpenEnv Space", version="0.1.0")
+
+
+ @app.get("/", response_class=HTMLResponse)
+ def home() -> str:
+     snapshot = _space_snapshot()
+     summary = snapshot["summary"]
+     difficulty_html = "".join(
+         f"<li><strong>{level}</strong>: {count}</li>"
+         for level, count in sorted(snapshot["difficulty_counts"].items())
+     )
+     task_type_html = "".join(f"<li>{task_type}</li>" for task_type in snapshot["task_types"])
+     return f"""<!doctype html>
+ <html lang="en">
+ <head>
+   <meta charset="utf-8" />
+   <meta name="viewport" content="width=device-width, initial-scale=1" />
+   <title>OSINT OpenEnv Space</title>
+   <style>
+     :root {{
+       --ink: #13212d;
+       --muted: #4d5b69;
+       --line: #d8e2eb;
+       --card: #ffffff;
+       --bg: #f6fafc;
+       --brand: #0f766e;
+       --accent: #b45309;
+     }}
+     * {{ box-sizing: border-box; }}
+     body {{
+       margin: 0;
+       font-family: "Segoe UI", sans-serif;
+       color: var(--ink);
+       background:
+         radial-gradient(circle at top left, rgba(15,118,110,0.12), transparent 30%),
+         radial-gradient(circle at top right, rgba(180,83,9,0.10), transparent 28%),
+         var(--bg);
+     }}
+     .wrap {{ max-width: 1120px; margin: 0 auto; padding: 24px; }}
+     .hero, .grid {{ display: grid; gap: 16px; }}
+     .hero {{ grid-template-columns: 1.5fr 1fr; }}
+     .grid {{ grid-template-columns: repeat(3, minmax(0, 1fr)); margin-top: 16px; }}
+     .card {{
+       background: var(--card);
+       border: 1px solid var(--line);
+       border-radius: 18px;
+       padding: 18px;
+       box-shadow: 0 12px 24px rgba(19, 33, 45, 0.06);
+     }}
+     h1, h2 {{ margin-top: 0; }}
+     .muted {{ color: var(--muted); }}
+     .stats {{ display: grid; grid-template-columns: repeat(2, minmax(0, 1fr)); gap: 10px; }}
+     .stat {{ border: 1px dashed var(--line); border-radius: 12px; padding: 10px; }}
+     .stat .k {{ font-size: 12px; color: var(--muted); text-transform: uppercase; }}
+     .stat .v {{ font-size: 22px; font-weight: 700; }}
+     a.button {{
+       display: inline-block;
+       padding: 10px 14px;
+       border-radius: 12px;
+       text-decoration: none;
+       color: white;
+       background: var(--brand);
+       margin-right: 10px;
+     }}
+     a.link {{
+       color: var(--accent);
+       text-decoration: none;
+       font-weight: 600;
+     }}
+     ul {{ padding-left: 18px; }}
+     code {{
+       background: #f1f5f9;
+       border-radius: 6px;
+       padding: 2px 6px;
+     }}
+     @media (max-width: 900px) {{
+       .hero, .grid {{ grid-template-columns: 1fr; }}
+     }}
+   </style>
+ </head>
+ <body>
+   <div class="wrap">
+     <div class="hero">
+       <section class="card">
+         <h1>OSINT OpenEnv Space</h1>
+         <p class="muted">A containerized OpenEnv-compatible benchmark for synthetic OSINT reasoning over profiles, forum threads, posts, aliases, organizations, locations, and event links.</p>
+         <p>The Space boots with the fixed-level benchmark so visitors get a stable environment snapshot instead of a different graph every restart.</p>
+         <a class="button" href="/dashboard">Open Dashboard</a>
+         <a class="link" href="/api/environment">Environment JSON</a>
+       </section>
+       <section class="card">
+         <h2>Included Snapshot</h2>
+         <div class="stats">
+           <div class="stat"><div class="k">Tasks</div><div class="v">{snapshot["task_count"]}</div></div>
+           <div class="stat"><div class="k">Provider</div><div class="v">{snapshot["config"]["llm_provider"]}</div></div>
+           <div class="stat"><div class="k">Score</div><div class="v">{summary["leaderboard_score"]:.3f}</div></div>
+           <div class="stat"><div class="k">Success</div><div class="v">{summary["task_success_rate"]:.3f}</div></div>
+         </div>
+       </section>
+     </div>
+
+     <div class="grid">
+       <section class="card">
+         <h2>Action Space</h2>
+         <ul>
+           <li><code>CALL_TOOL</code>: query platform views or semantic memory.</li>
+           <li><code>ADD_EDGE</code>: add a hypothesized relation to the working graph.</li>
+           <li><code>ANSWER</code>: submit the final node id answer.</li>
+         </ul>
+       </section>
+       <section class="card">
+         <h2>Difficulty Mix</h2>
+         <ul>{difficulty_html}</ul>
+       </section>
+       <section class="card">
+         <h2>Task Families</h2>
+         <ul>{task_type_html}</ul>
+       </section>
+     </div>
+   </div>
+ </body>
+ </html>"""
+
+
+ @app.get("/healthz")
+ def healthz() -> JSONResponse:
+     return JSONResponse({"status": "ok"})
+
+
+ @app.get("/api/environment")
+ def environment_metadata() -> JSONResponse:
+     return JSONResponse(_space_snapshot())
+
+
+ @app.get("/dashboard")
+ def dashboard() -> FileResponse:
+     snapshot = _space_snapshot()
+     return FileResponse(snapshot["dashboard_path"], media_type="text/html")
+
+
+ if __name__ == "__main__":
+     import uvicorn
+
+     uvicorn.run("server:app", host="0.0.0.0", port=SPACE_PORT)
+
src/osint_env/baselines/__init__.py ADDED
@@ -0,0 +1,4 @@
+ from osint_env.baselines.openai_runner import OpenAIBaselineConfig, OpenAIBaselineRunner
+
+ __all__ = ["OpenAIBaselineConfig", "OpenAIBaselineRunner"]
+
src/osint_env/baselines/openai_runner.py ADDED
@@ -0,0 +1,480 @@
+ from __future__ import annotations
+
+ import json
+ from dataclasses import asdict, dataclass
+ from pathlib import Path
+ from time import perf_counter
+ from typing import Any
+
+ from osint_env.config import clone_environment_config, load_seeding_config, load_shared_config
+ from osint_env.domain.models import Action, ActionType, Edge
+ from osint_env.env.environment import OSINTEnvironment
+ from osint_env.env.reward import compute_graph_f1
+ from osint_env.eval.leaderboard import append_leaderboard_record, load_leaderboard
+ from osint_env.eval.metrics import EvalMetrics
+ from osint_env.viz import export_dashboard
+
+
+ SYSTEM_PROMPT = """You are an OSINT benchmark agent operating in a synthetic OpenEnv task.
+
+ Available actions are provided as function tools. On every turn, call exactly one tool.
+
+ Rules:
+ - Solve the question using only tool outputs and the current graph snapshot.
+ - When you have enough evidence, call submit_answer with the exact node id string.
+ - Use add_edge only for relationships strongly supported by the evidence you have already collected.
+ - Prefer concise, high-signal tool queries.
+ - Never guess free-form prose when a node id answer is required.
+ """
+
+
+ @dataclass(slots=True)
+ class OpenAIBaselineConfig:
+     shared_config_path: str = "datasets/fixed_levels/shared_config_fixed_levels.json"
+     seed_file: str = "datasets/fixed_levels/seed_fixed_levels.json"
+     output_path: str = "artifacts/baselines/openai_fixed_levels_latest.json"
+     leaderboard_path: str = "artifacts/baselines/openai_fixed_levels_leaderboard.json"
+     dashboard_path: str = "artifacts/baselines/openai_fixed_levels_dashboard.html"
+     run_name: str = "openai_fixed_levels_baseline"
+     model: str = "gpt-4o-mini"
+     base_url: str = "https://api.openai.com/v1"
+     api_key: str = ""
+     api_key_env: str = "OPENAI_API_KEY"
+     temperature: float = 0.0
+     max_tokens: int = 256
+     timeout_seconds: int = 60
+     episodes: int = 15
+     max_steps: int = 8
+     seed: int | None = 7
+     append_leaderboard: bool = True
+
+
+ def _tool_schema(
+     name: str,
+     description: str,
+     properties: dict[str, Any],
+     required: list[str],
+ ) -> dict[str, Any]:
+     return {
+         "type": "function",
+         "function": {
+             "name": name,
+             "description": description,
+             "parameters": {
+                 "type": "object",
+                 "properties": properties,
+                 "required": required,
+                 "additionalProperties": False,
+             },
+         },
+     }
+
+
+ def build_action_tools() -> list[dict[str, Any]]:
+     return [
+         _tool_schema(
+             "search_posts",
+             "Search microblog posts by substring query.",
+             {"query": {"type": "string", "description": "Substring to search for in post text."}},
+             ["query"],
+         ),
+         _tool_schema(
+             "get_user_posts",
+             "Fetch posts authored by a user or alias id.",
+             {"user_id": {"type": "string", "description": "User or alias node id."}},
+             ["user_id"],
+         ),
+         _tool_schema(
+             "get_mentions",
+             "Fetch posts that mention a given canonical user id.",
+             {"user_id": {"type": "string", "description": "Canonical user node id."}},
+             ["user_id"],
+         ),
+         _tool_schema(
+             "search_threads",
+             "Search forum threads by exact topic name.",
+             {"topic": {"type": "string", "description": "Thread topic such as security or ai."}},
+             ["topic"],
+         ),
+         _tool_schema(
+             "get_thread",
+             "Fetch a specific forum thread by id.",
+             {"thread_id": {"type": "string", "description": "Thread node id."}},
+             ["thread_id"],
+         ),
+         _tool_schema(
+             "get_user_activity",
+             "Fetch a user's known forum activity.",
+             {"user_id": {"type": "string", "description": "Canonical user node id."}},
+             ["user_id"],
+         ),
+         _tool_schema(
+             "get_profile",
+             "Fetch a profile record by canonical user id.",
+             {"user_id": {"type": "string", "description": "Canonical user node id."}},
+             ["user_id"],
+         ),
+         _tool_schema(
+             "search_people",
+             "Search profiles by name and or organization.",
+             {
+                 "name": {"type": "string", "description": "Optional name substring.", "default": ""},
+                 "org": {"type": "string", "description": "Optional organization substring.", "default": ""},
+             },
+             [],
+         ),
+         _tool_schema(
+             "get_connections",
+             "Fetch explicit profile connections for a user.",
+             {"user_id": {"type": "string", "description": "Canonical user node id."}},
+             ["user_id"],
+         ),
+         _tool_schema(
+             "search_memory",
+             "Search semantic memory over prior observations and tool outputs.",
+             {
+                 "query": {"type": "string", "description": "Memory retrieval query."},
+                 "k": {"type": "integer", "description": "Top-k matches.", "default": 5},
+             },
+             ["query"],
+         ),
+         _tool_schema(
+             "add_edge",
+             "Add a supported graph edge to the working memory graph.",
+             {
+                 "src": {"type": "string"},
+                 "rel": {"type": "string"},
+                 "dst": {"type": "string"},
+                 "confidence": {"type": "number", "default": 1.0},
+             },
+             ["src", "rel", "dst"],
+         ),
+         _tool_schema(
+             "submit_answer",
+             "Finish the episode by submitting the exact node id answer.",
+             {"answer": {"type": "string", "description": "Exact node id answer for the task."}},
+             ["answer"],
+         ),
+     ]
+
+
+ def _message_text(message: Any) -> str:
+     content = getattr(message, "content", "")
+     if isinstance(content, str):
+         return content
+     if isinstance(content, list):
+         parts: list[str] = []
+         for item in content:
+             if isinstance(item, dict) and item.get("type") == "text":
+                 parts.append(str(item.get("text", "")))
+             else:
+                 text = getattr(item, "text", None)
+                 if text:
+                     parts.append(str(text))
+         return "\n".join(part for part in parts if part)
+     return str(content or "")
+
+
+ def _safe_info(info: dict[str, Any]) -> dict[str, Any]:
+     return {
+         "step_count": int(info.get("step_count", 0)),
+         "total_reward": float(info.get("total_reward", 0.0)),
+         "tool_calls": int(info.get("tool_calls", 0)),
+         "redundant_tool_calls": int(info.get("redundant_tool_calls", 0)),
+         "reward_components": dict(info.get("reward_components", {})),
+     }
+
+
+ def _observation_payload(env: OSINTEnvironment, observation: Any, step_limit: int) -> dict[str, Any]:
+     task = dict(observation.task)
+     return {
+         "task": {
+             "task_id": task.get("task_id", ""),
+             "task_type": task.get("task_type", ""),
+             "question": task.get("question", ""),
+         },
+         "remaining_steps": max(0, step_limit - int(env.state.step_count if env.state else 0)),
+         "recent_tool_outputs": list(observation.tool_outputs),
+         "graph_snapshot": dict(observation.graph_snapshot),
+         "recent_action_history": list(observation.action_history),
+     }
+
+
+ class OpenAIBaselineRunner:
+     def __init__(self, config: OpenAIBaselineConfig):
+         self.config = config
+
+         from openai import OpenAI
+
+         if not config.api_key:
+             raise ValueError(
+                 "OpenAI baseline requires an API key. "
+                 f"Set {config.api_key_env} or pass --openai-api-key."
+             )
+
+         self.client = OpenAI(
+             api_key=config.api_key,
+             base_url=config.base_url,
+             timeout=config.timeout_seconds,
+         )
+         self.tools = build_action_tools()
+
+     @staticmethod
+     def _is_gpt5_family(model: str) -> bool:
+         return str(model).strip().lower().startswith("gpt-5")
+
+     def _request_kwargs(self, messages: list[dict[str, Any]], episode_index: int) -> dict[str, Any]:
+         kwargs: dict[str, Any] = {
+             "model": self.config.model,
+             "messages": messages,
+             "tools": self.tools,
+             "tool_choice": "required",
+             "parallel_tool_calls": False,
+             "max_completion_tokens": self.config.max_tokens,
+         }
+         if self.config.seed is not None:
+             kwargs["seed"] = int(self.config.seed) + episode_index
+
+         if self._is_gpt5_family(self.config.model):
+             # GPT-5 family chat-completions compatibility:
+             # use max_completion_tokens and avoid temperature for older GPT-5 models.
+             kwargs["reasoning_effort"] = "none"
+         else:
+             kwargs["temperature"] = self.config.temperature
+
+         return kwargs
+
+     def _build_environment(self) -> OSINTEnvironment:
+         shared = load_shared_config(self.config.shared_config_path)
+         env_cfg = clone_environment_config(shared.environment)
+         env_cfg.seeding = load_seeding_config(self.config.seed_file)
+         env_cfg.llm.provider = "mock"
+         env_cfg.llm.model = self.config.model
+         env_cfg.llm.temperature = self.config.temperature
+         env_cfg.llm.max_tokens = self.config.max_tokens
+         env_cfg.max_steps = min(int(env_cfg.max_steps), int(self.config.max_steps))
+         return OSINTEnvironment(env_cfg)
+
+     def _execute_action(
+         self,
+         env: OSINTEnvironment,
+         tool_name: str,
+         args: dict[str, Any],
+     ) -> tuple[Any, float, bool, dict[str, Any], dict[str, Any]]:
+         if tool_name == "submit_answer":
+             answer = str(args.get("answer", "")).strip()
+             obs, reward, done, info = env.step(Action(ActionType.ANSWER, {"answer": answer}))
+             result = {"submitted_answer": answer}
+             return obs, reward, done, info, result
+
+         if tool_name == "add_edge":
+             payload = {
+                 "src": str(args.get("src", "")).strip(),
+                 "rel": str(args.get("rel", "")).strip(),
+                 "dst": str(args.get("dst", "")).strip(),
+                 "confidence": float(args.get("confidence", 1.0)),
+             }
+             obs, reward, done, info = env.step(Action(ActionType.ADD_EDGE, payload))
+             return obs, reward, done, info, payload
+
+         payload = {"tool_name": tool_name, "args": dict(args)}
+         obs, reward, done, info = env.step(Action(ActionType.CALL_TOOL, payload))
+         result = obs.tool_outputs[-1]["output"] if obs.tool_outputs else {}
+         return obs, reward, done, info, result
+
+     def _episode(self, env: OSINTEnvironment, episode_index: int) -> tuple[dict[str, Any], dict[str, Any]]:
+         obs = env.reset()
+         messages: list[dict[str, Any]] = [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {
+                 "role": "user",
+                 "content": json.dumps(_observation_payload(env, obs, env.config.max_steps), indent=2, sort_keys=True),
+             },
+         ]
+
+         turn_trace: list[dict[str, Any]] = []
+         raw_fingerprints: list[str] = []
+         info: dict[str, Any] = {}
+         done = False
+         usage_totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
+
+         while not done and env.state is not None and env.state.step_count < env.config.max_steps:
+             completion = self.client.chat.completions.create(**self._request_kwargs(messages, episode_index))
+             if getattr(completion, "system_fingerprint", None):
+                 raw_fingerprints.append(str(completion.system_fingerprint))
+             if getattr(completion, "usage", None) is not None:
+                 usage_totals["prompt_tokens"] += int(getattr(completion.usage, "prompt_tokens", 0) or 0)
+                 usage_totals["completion_tokens"] += int(getattr(completion.usage, "completion_tokens", 0) or 0)
+                 usage_totals["total_tokens"] += int(getattr(completion.usage, "total_tokens", 0) or 0)
+
+             message = completion.choices[0].message
+             content = _message_text(message)
+             tool_calls = list(message.tool_calls or [])
+             if not tool_calls:
+                 # The model returned no tool call; treat any assistant text as the final answer.
+                 fallback_answer = content.strip() or "unknown"
+                 obs, reward, done, info = env.step(Action(ActionType.ANSWER, {"answer": fallback_answer}))
+                 tool_result = {
+                     "submitted_answer": fallback_answer,
+                     "reward": reward,
+                     "done": done,
+                     "observation": _observation_payload(env, obs, env.config.max_steps),
+                     "info": _safe_info(info),
+                 }
+                 messages.append({"role": "assistant", "content": content})
+                 messages.append({"role": "tool", "tool_call_id": "fallback_submit", "content": json.dumps(tool_result)})
+                 turn_trace.append({"assistant_content": content, "tool_name": "submit_answer", "args": {"answer": fallback_answer}})
+                 break
+
+             tool_call = tool_calls[0]
+             tool_name = str(tool_call.function.name)
+             try:
+                 args = json.loads(tool_call.function.arguments or "{}")
+             except json.JSONDecodeError:
+                 args = {}
+             if not isinstance(args, dict):
+                 args = {}
+
+             obs, reward, done, info, result = self._execute_action(env, tool_name, args)
+             tool_payload = {
+                 "tool_name": tool_name,
+                 "args": args,
+                 "result": result,
+                 "reward": reward,
+                 "done": done,
+                 "observation": _observation_payload(env, obs, env.config.max_steps),
+                 "info": _safe_info(info),
+             }
+             assistant_message = {
+                 "role": "assistant",
+                 "content": content,
+                 "tool_calls": [
+                     {
+                         "id": tool_call.id,
+                         "type": "function",
+                         "function": {
+                             "name": tool_name,
+                             "arguments": json.dumps(args, sort_keys=True),
+                         },
+                     }
+                 ],
+             }
+             messages.append(assistant_message)
+             messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_payload, sort_keys=True)})
+             turn_trace.append({"assistant_content": content, "tool_name": tool_name, "args": args, "reward": reward, "done": done})
+
+         # Step budget exhausted without an ANSWER action; close the episode explicitly.
+         if not done:
+             obs, _, done, info = env.step(Action(ActionType.ANSWER, {"answer": "unknown"}))
+             turn_trace.append({"assistant_content": "", "tool_name": "submit_answer", "args": {"answer": "unknown"}, "reward": 0.0, "done": done})
+
+         info = dict(info)
+         info["openai_system_fingerprints"] = raw_fingerprints
+         info["usage"] = usage_totals
+         return info, {"turns": turn_trace}
+
+     def run(self) -> dict[str, Any]:
+         env = self._build_environment()
+         metrics = EvalMetrics()
+         episode_rows: list[dict[str, Any]] = []
+
+         started = perf_counter()
+         run_usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
+         for episode_index in range(int(self.config.episodes)):
+             info, trace = self._episode(env, episode_index)
+             episode_usage = dict(info.get("usage", {}))
+             for key in run_usage:
+                 run_usage[key] += int(episode_usage.get(key, 0) or 0)
+             task_type = env.state.task.task_type if env.state else "unknown"
+             task_id = env.state.task.task_id if env.state else f"episode_{episode_index}"
+             truth = env.state.task.supporting_edges if env.state else []
+             pred = env.memory_graph.edges if env.state else []
+             graph_f1 = compute_graph_f1(pred, truth)
+             metrics.add(info, task_type=task_type, graph_f1=graph_f1)
+             episode_rows.append(
+                 {
+                     "task_id": task_id,
+                     "task_type": task_type,
+                     "question": env.state.task.question if env.state else "",
+                     "task_answer": str(info.get("task_answer", "")),
+                     "agent_answer": str(info.get("agent_answer", "")) if info.get("agent_answer") is not None else "",
+                     "graph_f1": graph_f1,
+                     "reward": float(info.get("total_reward", 0.0)),
+                     "steps": int(info.get("step_count", 0)),
+                     "tool_calls": int(info.get("tool_calls", 0)),
+                     "success": int(info.get("agent_answer") == info.get("task_answer")),
+                     "reward_components": dict(info.get("reward_components", {})),
+                     "pred_edges": [
+                         {
+                             "src": edge.src,
+                             "rel": edge.rel,
+                             "dst": edge.dst,
+                             "confidence": float(edge.confidence),
+                         }
+                         for edge in pred
+                     ],
+                     "truth_edges": [
+                         {
+                             "src": edge.src,
+                             "rel": edge.rel,
+                             "dst": edge.dst,
+                             "confidence": float(edge.confidence),
+                         }
+                         for edge in truth
+                     ],
+                     "trace": trace,
+                     "openai_system_fingerprints": list(info.get("openai_system_fingerprints", [])),
+                     "usage": episode_usage,
+                 }
+             )
+
+         summary = metrics.summary()
+         duration_seconds = perf_counter() - started
+         dashboard_path = export_dashboard(
+             env=env,
+             evaluation={"summary": summary, "episodes": episode_rows},
+             leaderboard_records=load_leaderboard(self.config.leaderboard_path),
+             output_path=self.config.dashboard_path,
+         )
+
+         payload: dict[str, Any] = {
+             "run": {
+                 "name": self.config.run_name,
+                 "model": self.config.model,
+                 "episodes": int(self.config.episodes),
+                 "temperature": float(self.config.temperature),
+                 "max_tokens": int(self.config.max_tokens),
+                 "timeout_seconds": int(self.config.timeout_seconds),
+                 "max_steps": int(self.config.max_steps),
+                 "seed": self.config.seed,
+                 "shared_config_path": self.config.shared_config_path,
+                 "seed_file": self.config.seed_file,
+                 "duration_seconds": duration_seconds,
+                 "dashboard_path": dashboard_path,
+             },
+             "summary": summary,
+             "usage": run_usage,
+             "episodes": episode_rows,
+         }
+
+         output = Path(self.config.output_path)
+         output.parent.mkdir(parents=True, exist_ok=True)
+         output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
+
+         if self.config.append_leaderboard:
+             record = append_leaderboard_record(
+                 path=self.config.leaderboard_path,
+                 summary=summary,
+                 episodes=int(self.config.episodes),
+                 run_name=self.config.run_name,
+                 config={
+                     "provider": "openai",
+                     "model": self.config.model,
+                     "seed": self.config.seed,
+                     "max_steps": self.config.max_steps,
+                     "shared_config_path": self.config.shared_config_path,
+                     "seed_file": self.config.seed_file,
+                 },
+             )
+             payload["record"] = record
+             output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
+
+         return payload
src/osint_env/cli.py CHANGED
@@ -282,7 +282,7 @@ def main() -> None:
      if args.with_demo:
          _runner_for(env).run_episode()
      info = {
-         "agent_answer": env.state.agent_answer if env.state else "",
+         "agent_answer": env.state.answer if env.state else "",
          "task_answer": env.state.task.answer if env.state else "",
          "total_reward": env.state.total_reward if env.state else 0.0,
          "step_count": env.state.step_count if env.state else 0,
src/osint_env/env/environment.py CHANGED
@@ -3,10 +3,9 @@ from __future__ import annotations
  from dataclasses import dataclass, field
  from typing import TYPE_CHECKING, Any

- from openenv.env import Env
-
  from osint_env.data.generator import DatasetGenerator
  from osint_env.domain.models import Action, ActionType, Edge, EnvironmentConfig, Observation, TaskInstance
+ from osint_env.env.openenv_compat import Env
  from osint_env.env.reward import (
      build_reward_model,
      compute_answer_reward,
src/osint_env/env/openenv_compat.py ADDED
@@ -0,0 +1,20 @@
+ from __future__ import annotations
+
+ try:
+     from openenv.env import Env
+ except ImportError:
+     class Env:
+         """Minimal fallback used when openenv is not installed locally."""
+
+         def __init__(
+             self,
+             name: str,
+             state_space: str,
+             action_space: list[str],
+             episode_max_length: int,
+         ) -> None:
+             self.name = name
+             self.state_space = state_space
+             self.action_space = action_space
+             self.episode_max_length = episode_max_length
+
tests/conftest.py ADDED
@@ -0,0 +1,12 @@
+ from __future__ import annotations
+
+ import sys
+ from pathlib import Path
+
+
+ ROOT = Path(__file__).resolve().parents[1]
+ SRC = ROOT / "src"
+
+ if str(SRC) not in sys.path:
+     sys.path.insert(0, str(SRC))
+
tests/test_openai_baseline.py ADDED
@@ -0,0 +1,19 @@
+ from osint_env.baselines.openai_runner import OpenAIBaselineConfig, OpenAIBaselineRunner, build_action_tools
+
+
+ def test_openai_baseline_toolset_contains_answer_and_graph_actions():
+     tools = build_action_tools()
+     names = {tool["function"]["name"] for tool in tools}
+     assert "submit_answer" in names
+     assert "add_edge" in names
+     assert "search_memory" in names
+
+
+ def test_gpt5_request_kwargs_avoid_temperature_and_use_max_completion_tokens():
+     runner = OpenAIBaselineRunner.__new__(OpenAIBaselineRunner)
+     runner.config = OpenAIBaselineConfig(model="gpt-5-nano", max_tokens=321, temperature=0.0, seed=7)
+     runner.tools = build_action_tools()
+     kwargs = runner._request_kwargs(messages=[{"role": "user", "content": "hi"}], episode_index=0)
+     assert kwargs["max_completion_tokens"] == 321
+     assert kwargs["reasoning_effort"] == "none"
+     assert "temperature" not in kwargs
tests/test_server.py ADDED
@@ -0,0 +1,22 @@
+ from fastapi.testclient import TestClient
+
+ from server import app
+
+
+ client = TestClient(app)
+
+
+ def test_server_health():
+     response = client.get("/healthz")
+     assert response.status_code == 200
+     assert response.json()["status"] == "ok"
+
+
+ def test_server_environment_metadata():
+     response = client.get("/api/environment")
+     assert response.status_code == 200
+     body = response.json()
+     assert "action_space" in body
+     assert "observation_space" in body
+     assert "summary" in body
+