siddeshwar-kagatikar committed
Commit 2292d06 · 1 Parent(s): dbcbf00

fixed server and made tough queries
README.md CHANGED
@@ -24,7 +24,7 @@ The motivation is to provide a reproducible OpenEnv-compatible environment for e
 
 The environment generates or loads a hidden canonical graph of users, aliases, organizations, locations, posts, threads, and events. It then exposes partial platform views and a task list drawn from that graph.
 
-The default hosted Space uses the fixed-level benchmark in [`datasets/fixed_levels/seed_fixed_levels.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/datasets/fixed_levels/seed_fixed_levels.json), which contains 15 stable tasks over one shared seeded graph.
+The default hosted Space uses the fixed-level benchmark in [`datasets/fixed_levels/seed_fixed_levels.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/datasets/fixed_levels/seed_fixed_levels.json), which now contains 30 stable tasks over a larger shared seeded graph.
 
 ## Action Space
 
@@ -134,7 +134,7 @@ The reproducible OpenAI baseline is implemented in [`scripts/run_openai_baseline
 Default behavior:
 
 - dataset: fixed-level benchmark
-- episodes: 15
+- episodes: 30
 - max steps per episode: 8
 - temperature: 0.0
 - output artifact: `artifacts/baselines/openai_fixed_levels_latest.json`
@@ -143,7 +143,7 @@ Run it with an API key:
 
 ```bash
 export OPENAI_API_KEY="your_key_here"
-python scripts/run_openai_baseline.py --model gpt-4o-mini
+python scripts/run_openai_baseline.py --model gpt-5-nano
 ```
 
 The script is designed to stay bounded enough for a normal benchmark pass to finish comfortably under 20 minutes on a lightweight chat model, while still using the full fixed task set. For repeatability it fixes the benchmark graph/tasks and uses deterministic decoding settings. Because remote model backends can still change over time, the output artifact also records model metadata and system fingerprints when available.
@@ -174,15 +174,9 @@ The FastAPI app serves:
 
 ## Baseline Scores
 
-Bundled fixed-level baseline artifact:
-
-| baseline | provider | model | episodes | task success | avg graph f1 | leaderboard score |
-|---|---|---|---:|---:|---:|---:|
-| `fixed_levels_qwen_swarm` | Ollama | `qwen3:2b` | 15 | 1.000 | 0.849 | 0.854 |
-
-Source: [`datasets/fixed_levels/qwen_swarm_benchmark_fixed_levels.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/datasets/fixed_levels/qwen_swarm_benchmark_fixed_levels.json)
-
-After you supply an OpenAI API key, the matching baseline scores will be written to:
+The fixed-level benchmark was expanded from the earlier 15-question set to a 30-question set with a larger seeded graph, so older benchmark artifacts should be treated as legacy and regenerated on the new dataset before using them as reference scores.
+
+After you supply an OpenAI API key, the current baseline scores for the expanded benchmark will be written to:
 
 - [`artifacts/baselines/openai_fixed_levels_latest.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/artifacts/baselines/openai_fixed_levels_latest.json)
 - [`artifacts/baselines/openai_fixed_levels_dashboard.html`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/artifacts/baselines/openai_fixed_levels_dashboard.html)
datasets/fixed_levels/README.md CHANGED
@@ -4,22 +4,22 @@ This folder contains a fixed three-level OSINT benchmark set built on one shared
 
 ## Files
 
-- `seed_fixed_levels.json`: master fixed seed with canonical nodes, edges, and 15 fixed questions.
+- `seed_fixed_levels.json`: master fixed seed with an expanded canonical graph and 30 fixed questions.
 - `fixed_graph_questions.json`: extracted fixed dataset snapshot for submission packaging.
 - `shared_config_fixed_levels.json`: run config used for generation and evaluation.
 - `complete_dataset_qwen_generated.json`: full dataset after Qwen (`qwen3:2b` via Ollama) expands the graph.
-- `qwen_swarm_eval_fixed_levels.json`: Qwen swarm evaluation summary on this set.
-- `qwen_swarm_benchmark_fixed_levels.json`: benchmark output with record and summary.
+- `qwen_swarm_eval_fixed_levels.json`: legacy Qwen swarm evaluation summary from the older smaller version of the set.
+- `qwen_swarm_benchmark_fixed_levels.json`: legacy benchmark output from the older smaller version of the set.
 - `leaderboard_fixed_levels.json`: leaderboard file for this dataset.
 - `dashboard_fixed_levels.html`: interactive dashboard generated from the benchmark run.
 
 ## Difficulty Design
 
-- Easy: 5 questions, mostly direct alias, org, location, and event lookup.
-- Mid: 5 questions, 2-hop linking across alias plus org or event relations.
-- High: 5 questions, multi-hop cross-platform traces with implicit collaboration context.
+- Easy: 10 questions. These now use the older hard-style multi-hop traces as the new floor.
+- Mid: 10 questions. Each question spans roughly 15-20 supporting nodes.
+- High: 10 questions. Each question spans roughly 50 supporting nodes.
 
-All 15 questions are fixed and share the same seeded graph.
+All 30 questions are fixed and share the same larger seeded graph.
 
 ## Regenerate Artifacts
 
datasets/fixed_levels/complete_dataset_qwen_generated.json CHANGED
The diff for this file is too large to render. See raw diff
 
datasets/fixed_levels/fixed_graph_questions.json CHANGED
The diff for this file is too large to render. See raw diff
 
datasets/fixed_levels/seed_fixed_levels.json CHANGED
The diff for this file is too large to render. See raw diff
 
datasets/fixed_levels/shared_config_fixed_levels.json CHANGED
@@ -1,10 +1,10 @@
 {
   "environment": {
-    "n_users": 4,
-    "alias_density": 0.0,
-    "noise_level": 0.08,
-    "red_herring_rate": 0.04,
-    "max_steps": 20,
+    "n_users": 24,
+    "alias_density": 0.2,
+    "noise_level": 0.12,
+    "red_herring_rate": 0.08,
+    "max_steps": 24,
     "seed": 2026
   },
   "swarm": {
@@ -28,7 +28,7 @@
     "seeded_questions": [],
     "llm_generate_remaining_graph": true,
     "llm_generate_remaining_tasks": false,
-    "llm_generated_edge_budget": 28,
+    "llm_generated_edge_budget": 64,
     "llm_generated_task_budget": 0,
     "llm_generation_parallel": true,
     "llm_generation_workers": 4,
@@ -47,7 +47,7 @@
     "openai_api_key": ""
   },
   "runtime": {
-    "default_episodes": 15,
+    "default_episodes": 30,
     "leaderboard_path": "datasets/fixed_levels/leaderboard_fixed_levels.json",
     "dashboard_path": "datasets/fixed_levels/dashboard_fixed_levels.html",
     "sweep_dashboard_dir": "datasets/fixed_levels/sweep_dashboards"
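The updated `environment` block scales the hidden graph up (24 users, denser aliases and noise) while `runtime.default_episodes` tracks the 30-task set. As a quick sanity-check sketch (the JSON fragment is copied from the diff above; the check itself is hypothetical and not part of the repo):

```python
import json

# Fragment of the updated shared_config_fixed_levels.json, values taken
# from the diff above; reading the real file would work the same way.
cfg = json.loads("""
{
  "environment": {
    "n_users": 24,
    "alias_density": 0.2,
    "noise_level": 0.12,
    "red_herring_rate": 0.08,
    "max_steps": 24,
    "seed": 2026
  },
  "runtime": {"default_episodes": 30}
}
""")

# One episode per fixed task: the episode count should match the 30-question set.
assert cfg["runtime"]["default_episodes"] == 30
print(cfg["environment"]["n_users"])  # 24
```

The fixed `seed` of 2026 is what keeps the expanded graph reproducible across regenerations.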
scripts/generate_fixed_levels_seed.py ADDED
@@ -0,0 +1,109 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from collections import Counter, OrderedDict
2
+ from pathlib import Path
3
+ import json
4
+
5
+ U=[('aria','Aria Sen','Helios Labs','Sector 9'),('bharat','Bharat Kulkarni','Northbridge Logistics','Dockyard 17'),('cyrus','Cyrus Mehta','Apex Dynamics','Old Town'),('diya','Diya Roy','Blueharbor Media','Old Town'),('elin','Elin Das','Helios Labs','Sector 9'),('faris','Faris Noor','Tidewatch Ops','Rivergate'),('gita','Gita Pradhan','Apex Dynamics','Old Town'),('hiro','Hiro Tan','Northbridge Logistics','Dockyard 17'),('ivy','Ivy Kapoor','Kestrel Works','Rivergate'),('jules','Jules Banerjee','Blueharbor Media','Old Town'),('kian','Kian Bose','Atlas Freight','East Quay'),('leena','Leena Das','Sunmesh Analytics','Sector 9'),('mika','Mika Solanki','Orion Customs','North Basin'),('nora','Nora Iqbal','Emberline Security','Foundry Row'),('omar','Omar Sheikh','Atlas Freight','East Quay'),('priya','Priya Menon','Sunmesh Analytics','Sector 9'),('quinn','Quinn Rao','Orion Customs','North Basin'),('rhea','Rhea Kapoor','Emberline Security','Foundry Row'),('soren','Soren Malik','Harborlight Transit','Uplink Yard'),('tara','Tara Dey','Harborlight Transit','Uplink Yard')]
6
+ A=[('orchidfox','@orchidfox','ivy'),('steelquill','@steelquill','bharat'),('monsoonbyte','@monsoonbyte','diya'),('nightrelay','@nightrelay','faris'),('mapleghost','@mapleghost','elin'),('docksparrow','@docksparrow','hiro'),('quartzlotus','@quartzlotus','cyrus'),('emberglass','@emberglass','nora'),('basinraven','@basinraven','mika'),('tideshard','@tideshard','soren'),('hollowsignal','@hollowsignal','priya'),('ironwhisper','@ironwhisper','omar'),('cinderveil','@cinderveil','rhea'),('sablekeel','@sablekeel','tara'),('lanternmoth','@lanternmoth','kian'),('frostledger','@frostledger','leena')]
7
+ L=[('dockyard17','Dockyard 17'),('sector9','Sector 9'),('old_town','Old Town'),('rivergate','Rivergate'),('east_quay','East Quay'),('foundry_row','Foundry Row'),('north_basin','North Basin'),('uplink_yard','Uplink Yard')]
8
+ O=[('helios_labs','Helios Labs','sector9'),('northbridge_logistics','Northbridge Logistics','dockyard17'),('apex_dynamics','Apex Dynamics','old_town'),('blueharbor_media','Blueharbor Media','old_town'),('tidewatch_ops','Tidewatch Ops','rivergate'),('kestrel_works','Kestrel Works','rivergate'),('atlas_freight','Atlas Freight','east_quay'),('sunmesh_analytics','Sunmesh Analytics','sector9'),('orion_customs','Orion Customs','north_basin'),('emberline_security','Emberline Security','foundry_row'),('harborlight_transit','Harborlight Transit','uplink_yard')]
9
+ E=[('project_lantern','Project Lantern'),('black_kite','Black Kite'),('silent_current','Silent Current'),('amber_veil','Amber Veil'),('glass_harbor','Glass Harbor'),('ember_tide','Ember Tide'),('iron_wharf','Iron Wharf'),('ghost_signal','Ghost Signal')]
10
+ T=[('supply_leak','supply_chain'),('port_audit','port_audit'),('customs_breach','customs_breach'),('relay_map','relay_map'),('foundry_watch','foundry_watch'),('basin_shift','basin_shift'),('quiet_manifest','quiet_manifest'),('uplink_route','uplink_route'),('ember_tide_watch','ember_tide'),('ghost_signal_net','ghost_signal')]
11
+ P=['shift_roster','midnight_manifest','sat_phone_ping','drone_parts','relay_schedule','quay_ledgers','customs_tag','hull_signal','basin_photo','foundry_map','lantern_route','uplink_note']
12
+
13
+ def uid(x): return f'user_{x}'
14
+ def aid(x): return f'alias_{x}'
15
+ def oid(x): return f'org_{x}'
16
+ def lid(x): return f'loc_{x}'
17
+ def eid(x): return f'event_{x}'
18
+ def tid(x): return f'thr_{x}'
19
+ def pid(x): return f'post_{x}'
20
+
21
+ def addn(nodes,nid,nt,attrs): nodes.append({'node_id':nid,'node_type':nt,'attrs':attrs})
22
+
23
+ def build():
24
+ nodes=[]; edges=OrderedDict();
25
+ for s,name,org,loc in U: addn(nodes,uid(s),'user',{'name':name,'org':org,'location':loc})
26
+ for s,handle,user in A: addn(nodes,aid(s),'alias',{'handle':handle})
27
+ for s,name,_ in O: addn(nodes,oid(s),'org',{'name':name})
28
+ for s,name in L: addn(nodes,lid(s),'location',{'name':name})
29
+ for s,name in E: addn(nodes,eid(s),'event',{'name':name})
30
+ for s,topic in T: addn(nodes,tid(s),'thread',{'topic':topic})
31
+ for s in P: addn(nodes,pid(s),'post',{'channel':'microblog'})
32
+ def ae(k,src,rel,dst,c=1.0): edges[k]={'src':src,'rel':rel,'dst':dst,'confidence':c}
33
+ for s,_,user in A: ae(f'a_{s}',aid(s),'alias_of',uid(user))
34
+ org_map={name:oid(s) for s,name,_ in O}; loc_map={name:lid(s) for s,name in L}
35
+ for s,_,org,loc in U: ae(f'w_{s}',uid(s),'works_at',org_map[org]); ae(f'l_{s}',uid(s),'located_in',loc_map[loc])
36
+ for s,_,loc in O: ae(f'op_{s}',oid(s),'operates_in',lid(loc))
37
+ CP=[('ivy','bharat',.95),('bharat','hiro',.95),('hiro','faris',.92),('faris','diya',.90),('diya','elin',.89),('elin','aria',.87),('aria','cyrus',.84),('cyrus','gita',.83),('gita','jules',.82),('jules','bharat',.81),('diya','ivy',.90),('ivy','elin',.86),('kian','omar',.93),('omar','mika',.90),('mika','quinn',.89),('quinn','nora',.88),('nora','rhea',.87),('rhea','soren',.86),('soren','tara',.86),('tara','kian',.84),('priya','leena',.91),('leena','aria',.83),('priya','nora',.82),('kian','bharat',.80),('soren','faris',.79),('quinn','hiro',.78)]
38
+ for i,(a,b,c) in enumerate(CP,1): ae(f'c{i:02d}',uid(a),'connected_to',uid(b),c)
39
+ PA={'midnight_manifest':'orchidfox','shift_roster':'docksparrow','sat_phone_ping':'nightrelay','drone_parts':'monsoonbyte','relay_schedule':'steelquill','quay_ledgers':'lanternmoth','customs_tag':'basinraven','hull_signal':'tideshard','basin_photo':'emberglass','foundry_map':'cinderveil','lantern_route':'frostledger','uplink_note':'sablekeel'}
40
+ for post,author in PA.items(): ae(f'ap_{post}',aid(author),'authored_post',pid(post))
41
+ PR={'midnight_manifest':['dockyard17','project_lantern'],'shift_roster':['dockyard17','northbridge_logistics'],'sat_phone_ping':['rivergate','project_lantern'],'drone_parts':['black_kite','kestrel_works'],'relay_schedule':['project_lantern','sector9'],'quay_ledgers':['east_quay','glass_harbor'],'customs_tag':['north_basin','iron_wharf'],'hull_signal':['uplink_yard','ghost_signal'],'basin_photo':['foundry_row','amber_veil'],'foundry_map':['foundry_row','ember_tide'],'lantern_route':['project_lantern','sunmesh_analytics'],'uplink_note':['uplink_yard','harborlight_transit']}
42
+ for post,refs in PR.items():
43
+ for i,x in enumerate(refs,1): ae(f'r_{post}_{i}',pid(post),'references', lid(x) if x in {y for y,_ in L} else (oid(x) if x in {y for y,_,_ in O} else eid(x)))
44
+ TA={'supply_leak':'diya','port_audit':'jules','customs_breach':'mika','relay_map':'leena','foundry_watch':'nora','basin_shift':'quinn','quiet_manifest':'kian','uplink_route':'soren','ember_tide_watch':'rhea','ghost_signal_net':'tara'}
45
+ TL={'supply_leak':[('discusses','project_lantern'),('references','northbridge_logistics')],'port_audit':[('discusses','black_kite'),('references','kestrel_works')],'customs_breach':[('discusses','iron_wharf'),('references','orion_customs')],'relay_map':[('discusses','project_lantern'),('references','sunmesh_analytics')],'foundry_watch':[('discusses','ember_tide'),('references','emberline_security')],'basin_shift':[('discusses','amber_veil'),('references','north_basin')],'quiet_manifest':[('discusses','glass_harbor'),('references','atlas_freight')],'uplink_route':[('discusses','ghost_signal'),('references','harborlight_transit')],'ember_tide_watch':[('discusses','ember_tide'),('references','foundry_row')],'ghost_signal_net':[('discusses','ghost_signal'),('references','uplink_yard')]}
46
+ for t,u in TA.items(): ae(f'at_{t}',uid(u),'authored_thread',tid(t))
47
+ for t,rels in TL.items():
48
+ for i,(rel,x) in enumerate(rels,1): ae(f'tl_{t}_{i}',tid(t),rel, lid(x) if x in {y for y,_ in L} else (oid(x) if x in {y for y,_,_ in O} else eid(x)))
49
+ ER=[('bharat','collaborates_on','project_lantern'),('hiro','collaborates_on','project_lantern'),('faris','collaborates_on','project_lantern'),('diya','investigates','project_lantern'),('leena','monitors','project_lantern'),('ivy','collaborates_on','black_kite'),('cyrus','collaborates_on','black_kite'),('elin','investigates','black_kite'),('jules','reports_on','black_kite'),('kian','collaborates_on','glass_harbor'),('omar','collaborates_on','glass_harbor'),('priya','monitors','glass_harbor'),('mika','collaborates_on','iron_wharf'),('quinn','collaborates_on','iron_wharf'),('nora','investigates','amber_veil'),('rhea','collaborates_on','ember_tide'),('soren','collaborates_on','ghost_signal'),('tara','reports_on','ghost_signal'),('gita','monitors','silent_current'),('jules','reports_on','silent_current')]
50
+ for i,(u,rel,e) in enumerate(ER,1): ae(f'er{i:02d}',uid(u),rel,eid(e),.9)
51
+ X=[(eid('project_lantern'),'connected_to',eid('glass_harbor')),(eid('black_kite'),'connected_to',eid('amber_veil')),(eid('ember_tide'),'connected_to',eid('ghost_signal')),(oid('atlas_freight'),'connected_to',oid('northbridge_logistics')),(oid('orion_customs'),'connected_to',oid('emberline_security')),(oid('harborlight_transit'),'connected_to',oid('tidewatch_ops'))]
52
+ for i,(a,rel,b) in enumerate(X,1): ae(f'x{i:02d}',a,rel,b,.77)
53
+ return nodes,edges
54
+
55
+ def mk_questions(edges):
56
+ def ids(*items):
57
+ out=[]
58
+ for it in items:
59
+ if isinstance(it,list): out.extend(it)
60
+ else: out.append(it)
61
+ return out
62
+ def rng(prefix,a,b): return [f'{prefix}{i:02d}' for i in range(a,b+1)]
63
+ def sup(edge_ids): return [edges[e] for e in edge_ids]
64
+ def nodes(edge_ids):
65
+ s=set()
66
+ for e in edge_ids: s|={edges[e]['src'],edges[e]['dst']}
67
+ return len(s)
68
+ qs=[]
69
+ easy=[('easy_01','alias_orchidfox -> post_midnight_manifest -> loc_dockyard17 -> connected collaborator on event_project_lantern. Who is it?','user_bharat',ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','c01','er01')),('easy_02','thr_supply_leak references org_northbridge_logistics. Which alias_docksparrow user works there and collaborates on event_project_lantern?','user_hiro',ids('tl_supply_leak_2','a_docksparrow','w_hiro','er02')),('easy_03','alias_monsoonbyte authored post_drone_parts about event_black_kite. Which user behind that alias is directly connected to the Kestrel collaborator?','user_diya',ids('a_monsoonbyte','ap_drone_parts','r_drone_parts_1','w_ivy','er06','c12')),('easy_04','alias_nightrelay references loc_rivergate. Which user behind it works at an org operating there and collaborates on event_project_lantern?','user_faris',ids('a_nightrelay','ap_sat_phone_ping','r_sat_phone_ping_1','w_faris','op_tidewatch_ops','er03')),('easy_05','thr_port_audit discusses Black Kite and references Kestrel Works. 
Which alias_orchidfox user authored post_midnight_manifest and collaborates on Black Kite?','user_ivy',ids('tl_port_audit_1','tl_port_audit_2','a_orchidfox','ap_midnight_manifest','w_ivy','er06')),('easy_06','Which Atlas Freight user behind alias_lanternmoth authored post_quay_ledgers and collaborates on event_glass_harbor?','user_kian',ids('a_lanternmoth','ap_quay_ledgers','w_kian','er10')),('easy_07','Which Orion Customs user behind alias_basinraven authored post_customs_tag and collaborates on event_iron_wharf?','user_mika',ids('a_basinraven','ap_customs_tag','w_mika','er13')),('easy_08','Which user behind alias_emberglass posted basin_photo from Foundry Row and investigates Amber Veil?','user_nora',ids('a_emberglass','ap_basin_photo','r_basin_photo_1','er15')),('easy_09','Which user behind alias_tideshard authored post_hull_signal and collaborates on Ghost Signal?','user_soren',ids('a_tideshard','ap_hull_signal','er17')),('easy_10','Which Harborlight Transit user behind alias_sablekeel authored post_uplink_note and reports on Ghost Signal?','user_tara',ids('a_sablekeel','ap_uplink_note','w_tara','er18'))]
70
+ mid=[('mid_01','Follow alias_docksparrow through post_shift_roster, Dockyard 17, and the Lantern chain. Return the org node id.','org_northbridge_logistics',ids('a_docksparrow','ap_shift_roster','r_shift_roster_1','r_shift_roster_2','tl_supply_leak_2','w_hiro','l_hiro','er02','er01','c02','c03')),('mid_02','Across the Glass Harbor cluster, which user behind alias_lanternmoth links to the Atlas Freight network from thr_quiet_manifest?','user_kian',ids('a_lanternmoth','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','at_quiet_manifest','tl_quiet_manifest_1','tl_quiet_manifest_2','w_kian','w_omar','er10','er11','er12','c13','c14')),('mid_03','Trace alias_basinraven through post_customs_tag, thr_customs_breach, and the Orion Customs collaboration chain. Who is it?','user_mika',ids('a_basinraven','ap_customs_tag','r_customs_tag_1','r_customs_tag_2','at_customs_breach','tl_customs_breach_1','tl_customs_breach_2','w_mika','w_quinn','er13','er14','c15','c16','x05')),('mid_04','In the Ember Tide and Amber Veil overlap, which Foundry Row user behind alias_cinderveil collaborates on Ember Tide?','user_rhea',ids('a_cinderveil','ap_foundry_map','r_foundry_map_1','r_foundry_map_2','at_foundry_watch','tl_foundry_watch_1','tl_foundry_watch_2','at_ember_tide_watch','tl_ember_tide_watch_1','tl_ember_tide_watch_2','w_rhea','w_nora','er15','er16','c17','x03')),('mid_05','Follow alias_tideshard from post_hull_signal into thr_uplink_route and the Harborlight relay. 
Return the org node id.','org_harborlight_transit',ids('a_tideshard','ap_hull_signal','r_hull_signal_1','r_hull_signal_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2','w_soren','w_tara','er17','er18','c18','c19','op_harborlight_transit','x06')),('mid_06','Which Sunmesh user behind alias_frostledger connects post_lantern_route to thr_relay_map and the Sector 9 monitoring chain?','user_leena',ids('a_frostledger','ap_lantern_route','r_lantern_route_1','r_lantern_route_2','at_relay_map','tl_relay_map_1','tl_relay_map_2','w_leena','w_priya','l_leena','op_sunmesh_analytics','er05','c21','c22')),('mid_07','Which user behind alias_emberglass is tied to Amber Veil after combining post_basin_photo, thr_basin_shift, and the Foundry Row investigation chain?','user_nora',ids('a_emberglass','ap_basin_photo','r_basin_photo_1','r_basin_photo_2','at_basin_shift','tl_basin_shift_1','tl_basin_shift_2','w_nora','w_quinn','l_nora','er15','c16','c17','x05')),('mid_08','Combine alias_orchidfox, post_midnight_manifest, thr_supply_leak, and the Lantern to Glass Harbor bridge. 
Which user starts that chain?','user_ivy',ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','r_midnight_manifest_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2','w_ivy','er06','c01','c12','x01','er10','er12')),('mid_09','Which user behind alias_monsoonbyte sits at the overlap of Blueharbor Media, Project Lantern, Black Kite, and the Ivy connection chain?','user_diya',ids('a_monsoonbyte','ap_drone_parts','r_drone_parts_1','at_supply_leak','tl_supply_leak_1','at_port_audit','tl_port_audit_1','w_diya','w_ivy','w_jules','er04','er06','er09','c04','c12')),('mid_10','Who is the Northbridge user behind alias_steelquill when combining post_relay_schedule, thr_supply_leak, Dockyard 17, and Lantern collaborator edges?','user_bharat',ids('a_steelquill','ap_relay_schedule','r_relay_schedule_1','r_relay_schedule_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2','w_bharat','w_hiro','l_bharat','l_hiro','er01','er02','c01','c02'))]
71
+ big=list(edges.keys())[:58]
72
+ hard=[('high_01','Lantern to Glass Harbor handoff: identify the user behind alias_orchidfox after combining Lantern logistics, Dockyard links, and Atlas Freight bridge evidence.','user_ivy',ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','r_midnight_manifest_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2',['w_ivy','w_bharat','w_hiro','w_kian','w_omar'],['l_ivy','l_bharat','l_hiro','l_kian','l_omar'],['op_northbridge_logistics','op_kestrel_works','op_atlas_freight'],rng('c',1,3),['c12','c13','c14'],['er01','er02','er03','er06','er10','er11','er12'],'at_quiet_manifest','tl_quiet_manifest_1','tl_quiet_manifest_2','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','x01','x04','a_lanternmoth','a_steelquill','a_docksparrow')),('high_02','North Basin to Foundry Row escalation: which user behind alias_basinraven anchors the Iron Wharf side before the Emberline handoff?','user_mika',ids('a_basinraven','ap_customs_tag','r_customs_tag_1','r_customs_tag_2','at_customs_breach','tl_customs_breach_1','tl_customs_breach_2','at_basin_shift','tl_basin_shift_1','tl_basin_shift_2','at_foundry_watch','tl_foundry_watch_1','tl_foundry_watch_2',['w_mika','w_quinn','w_nora','w_rhea'],['l_mika','l_quinn','l_nora','l_rhea'],['op_orion_customs','op_emberline_security'],['c15','c16','c17'],['er13','er14','er15','er16'],'ap_basin_photo','r_basin_photo_1','r_basin_photo_2','ap_foundry_map','r_foundry_map_1','r_foundry_map_2','x02','x03','x05','a_emberglass','a_cinderveil','c23','c24')),('high_03','Harborlight ghost-signal relay: identify the user behind alias_tideshard at the Harborlight / Tidewatch 
junction.','user_soren',ids('a_tideshard','ap_hull_signal','r_hull_signal_1','r_hull_signal_2','a_sablekeel','ap_uplink_note','r_uplink_note_1','r_uplink_note_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2','at_ghost_signal_net','tl_ghost_signal_net_1','tl_ghost_signal_net_2',['w_soren','w_tara','w_faris'],['l_soren','l_tara','l_faris'],['op_harborlight_transit','op_tidewatch_ops'],['c18','c19','c20','c25'],['er03','er17','er18'],'ap_sat_phone_ping','r_sat_phone_ping_1','r_sat_phone_ping_2','at_supply_leak','tl_supply_leak_1','er01','er02','x03','x06','a_nightrelay')),('high_04','Blueharbor to Black Kite to Lantern overlap: which user is the Blueharbor origin behind alias_monsoonbyte?','user_diya',ids('a_monsoonbyte','ap_drone_parts','r_drone_parts_1','r_drone_parts_2','at_port_audit','tl_port_audit_1','tl_port_audit_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2',['w_diya','w_jules','w_ivy','w_cyrus'],['l_diya','l_jules','l_ivy','l_cyrus'],['op_blueharbor_media','op_kestrel_works','op_apex_dynamics'],['c04','c08','c09','c12'],['er04','er06','er07','er08','er09'],'a_orchidfox','ap_midnight_manifest','r_midnight_manifest_2','x01','x02','at_relay_map','tl_relay_map_1','w_leena','er05')),('high_05','Sector 9 to Dockyard 17 full relay: which user behind alias_steelquill links the Northbridge chain and the Sunmesh monitoring bridge?','user_bharat',ids('a_steelquill','ap_relay_schedule','r_relay_schedule_1','r_relay_schedule_2','a_frostledger','ap_lantern_route','r_lantern_route_1','r_lantern_route_2','at_relay_map','tl_relay_map_1','tl_relay_map_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2',['w_bharat','w_hiro','w_leena','w_priya','w_aria'],['l_bharat','l_hiro','l_leena','l_priya','l_aria'],['op_northbridge_logistics','op_sunmesh_analytics','op_helios_labs'],['c01','c02','c05','c06','c07','c21','c22'],['er01','er02','er05'],'x01','x04','a_docksparrow','a_mapleghost','a_hollowsignal')),('high_06','Foundry Row, North Basin, and Uplink 
Yard spread: identify the user behind alias_emberglass before the Harborlight relay takes over.','user_nora',ids('a_emberglass','ap_basin_photo','r_basin_photo_1','r_basin_photo_2','a_cinderveil','ap_foundry_map','r_foundry_map_1','r_foundry_map_2','a_sablekeel','ap_uplink_note','r_uplink_note_1','r_uplink_note_2','at_foundry_watch','tl_foundry_watch_1','tl_foundry_watch_2','at_ember_tide_watch','tl_ember_tide_watch_1','tl_ember_tide_watch_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2',['w_nora','w_rhea','w_soren','w_tara'],['l_nora','l_rhea','l_soren','l_tara'],['op_emberline_security','op_harborlight_transit'],['c17','c18','c19'],['er15','er16','er17','er18'],'x03','x06')),('high_07','Freight and customs bridge: which Atlas Freight user behind alias_lanternmoth connects Glass Harbor with the Northbridge chain?','user_kian',ids('a_lanternmoth','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','at_quiet_manifest','tl_quiet_manifest_1','tl_quiet_manifest_2',['w_kian','w_omar','w_bharat','w_hiro'],['l_kian','l_omar','l_bharat','l_hiro'],['op_atlas_freight','op_northbridge_logistics'],['c13','c14','c24','c02'],['er10','er11','er12','er01','er02'],'ap_shift_roster','r_shift_roster_1','r_shift_roster_2','ap_midnight_manifest','r_midnight_manifest_1','at_supply_leak','tl_supply_leak_2','x04','a_ironwhisper','a_steelquill','a_docksparrow')),('high_08','Black Kite, Amber Veil, and Iron Wharf overlap: which user behind alias_quartzlotus is the Apex-side collaborator?','user_cyrus',ids('a_quartzlotus','w_cyrus','l_cyrus','op_apex_dynamics','er07','at_port_audit','tl_port_audit_1','ap_drone_parts','r_drone_parts_1','er15','at_basin_shift','tl_basin_shift_1','er13','at_customs_breach','tl_customs_breach_1',['w_ivy','w_nora','w_mika','w_quinn'],['l_ivy','l_nora','l_mika','l_quinn'],['op_kestrel_works','op_emberline_security','op_orion_customs'],['c08','c12','c15','c16','c17'],'x02','x05','a_orchidfox','a_basinraven','a_emberglass')),('high_09','Ghost Signal 
and Ember Tide relay: which user behind alias_sablekeel is the Harborlight reporting endpoint?','user_tara',ids('a_sablekeel','ap_uplink_note','r_uplink_note_1','r_uplink_note_2','a_tideshard','ap_hull_signal','r_hull_signal_1','r_hull_signal_2','at_ghost_signal_net','tl_ghost_signal_net_1','tl_ghost_signal_net_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2','at_ember_tide_watch','tl_ember_tide_watch_1','tl_ember_tide_watch_2',['w_tara','w_soren','w_rhea','w_nora'],['l_tara','l_soren','l_rhea','l_nora'],['op_harborlight_transit','op_emberline_security'],['c18','c19','c17'],['er16','er17','er18'],'x03','x06','a_cinderveil','a_emberglass')),('high_10','End-to-end benchmark sweep: across Lantern, Black Kite, Glass Harbor, Iron Wharf, Ember Tide, and Ghost Signal, which user behind alias_hollowsignal anchors the Sunmesh monitoring side?','user_priya',big)]
+    for diff,level,specs in [('easy',1,easy),('mid',2,mid),('high',3,hard)]:
+        for qid,q,a,eids in specs:
+            qs.append({'task_type':'fixed_trace','question':q,'answer':a,'supporting_edges':sup(eids),'metadata':{'difficulty':diff,'difficulty_level':level,'question_id':qid,'support_nodes':nodes(eids)}})
+    def edge_key(e): return (e['src'], e['rel'], e['dst'])
+    mid_pool = sup(ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','r_midnight_manifest_2','a_lanternmoth','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','a_basinraven','ap_customs_tag','r_customs_tag_1','r_customs_tag_2','a_tideshard','ap_hull_signal','r_hull_signal_1','at_supply_leak','tl_supply_leak_1','at_quiet_manifest','tl_quiet_manifest_1','er01','er02','er06','er10','c01','c02','c13'))
+    hard_pool = sup(list(edges.keys())[:120])
+    for q in qs:
+        current = {edge_key(e) for e in q['supporting_edges']}
+        diff = q['metadata']['difficulty']
+        if diff == 'mid':
+            pool = mid_pool
+            target = 17
+        elif diff == 'high':
+            pool = hard_pool
+            target = 50
+        else:
+            continue
+        for e in pool:
+            if q['metadata']['support_nodes'] >= target:
+                break
+            k = edge_key(e)
+            if k not in current:
+                q['supporting_edges'].append(dict(e))
+                current.add(k)
+        q['metadata']['support_nodes'] = len({n for edge in q['supporting_edges'] for n in (edge['src'], edge['dst'])})
+    return qs
+
+def main():
+    nodes,edges=build(); questions=mk_questions(edges)
+    payload={'seeding':{'seeded_nodes':nodes,'seeded_edges':list(edges.values()),'seeded_questions':questions,'llm_generate_remaining_graph':True,'llm_generate_remaining_tasks':False,'llm_generated_edge_budget':48,'llm_generated_task_budget':0,'llm_generation_parallel':True,'llm_generation_workers':4,'llm_generation_retries':3,'allow_template_fallback_on_llm_failure':False}}
+    out=Path('datasets/fixed_levels/seed_fixed_levels.json'); out.write_text(json.dumps(payload,indent=2),encoding='utf-8')
+    counts=Counter(q['metadata']['difficulty'] for q in questions)
+    stats={k:sorted(q['metadata']['support_nodes'] for q in questions if q['metadata']['difficulty']==k) for k in ['easy','mid','high']}
+    print(json.dumps({'nodes':len(nodes),'edges':len(edges),'questions':len(questions),'difficulty_counts':dict(counts),'support_nodes':stats},indent=2))
+
+if __name__=='__main__':
+    main()
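The top-up loop in `mk_questions` above pads each mid/high question's supporting edges from a shared pool toward a target node span, deduplicating by the `(src, rel, dst)` edge key and recounting the distinct endpoint nodes at the end. A standalone sketch of that pattern — names here are illustrative, and this version recomputes the node count on each iteration rather than checking a precomputed `support_nodes` value as the repo code does:

```python
# Sketch of the edge top-up pattern: extend a question's supporting
# edges from a pool, skipping duplicates, until the set of distinct
# endpoint nodes reaches a target size.

def edge_key(e):
    return (e["src"], e["rel"], e["dst"])

def top_up(supporting_edges, pool, target_nodes):
    seen = {edge_key(e) for e in supporting_edges}
    for e in pool:
        nodes = {n for edge in supporting_edges for n in (edge["src"], edge["dst"])}
        if len(nodes) >= target_nodes:
            break
        k = edge_key(e)
        if k not in seen:
            supporting_edges.append(dict(e))
            seen.add(k)
    # Return the final distinct-node count, as stored in metadata.
    return len({n for edge in supporting_edges for n in (edge["src"], edge["dst"])})

edges = [{"src": "a", "rel": "knows", "dst": "b"}]
pool = [
    {"src": "a", "rel": "knows", "dst": "b"},      # duplicate key: skipped
    {"src": "b", "rel": "works_at", "dst": "c"},
    {"src": "c", "rel": "located_in", "dst": "d"},
]
count = top_up(edges, pool, target_nodes=4)
print(count)       # 4 distinct nodes: a, b, c, d
print(len(edges))  # 3 edges; the duplicate was not re-added
```

The fixed-level test below relies on this padding landing mid questions in the 15–20 node band and high questions in the 48–55 band.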
scripts/run_openai_baseline.py CHANGED
@@ -15,11 +15,11 @@ def build_parser() -> argparse.ArgumentParser:
     parser.add_argument("--leaderboard", default="artifacts/baselines/openai_fixed_levels_leaderboard.json", help="Leaderboard JSON path.")
     parser.add_argument("--dashboard", default="artifacts/baselines/openai_fixed_levels_dashboard.html", help="Dashboard HTML path.")
     parser.add_argument("--run-name", default="openai_fixed_levels_baseline", help="Leaderboard run name.")
-    parser.add_argument("--model", default="gpt-4o-mini", help="OpenAI chat model name.")
+    parser.add_argument("--model", default="gpt-5-nano", help="OpenAI chat model name.")
     parser.add_argument("--openai-base-url", default="https://api.openai.com/v1", help="OpenAI-compatible base URL.")
     parser.add_argument("--openai-api-key", default="", help="OpenAI API key override.")
     parser.add_argument("--openai-api-key-env", default="OPENAI_API_KEY", help="Environment variable name for the API key.")
-    parser.add_argument("--episodes", type=int, default=15, help="Number of episodes to evaluate.")
+    parser.add_argument("--episodes", type=int, default=30, help="Number of episodes to evaluate.")
     parser.add_argument("--max-steps", type=int, default=8, help="Episode step budget to keep runs bounded.")
     parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature.")
     parser.add_argument("--max-tokens", type=int, default=256, help="Maximum completion tokens per step.")
@@ -57,4 +57,3 @@ def main() -> None:
 
 if __name__ == "__main__":
     main()
-
server.py CHANGED
@@ -23,6 +23,18 @@ SPACE_PROVIDER = os.getenv("OSINT_SPACE_LLM_PROVIDER", "mock")
 SPACE_MODEL = os.getenv("OSINT_SPACE_LLM_MODEL", "gpt-4o-mini")
 SPACE_PORT = int(os.getenv("PORT", "7860"))
 SPACE_DASHBOARD = Path("artifacts/space_dashboard.html")
+LATEST_BASELINE_OUTPUT = Path("artifacts/baselines/openai_fixed_levels_latest.json")
+LATEST_EVALUATION_OUTPUT = Path("artifacts/latest_evaluation.json")
+
+
+def _load_json(path: Path) -> dict[str, Any] | None:
+    if not path.exists():
+        return None
+    try:
+        payload = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        return None
+    return payload if isinstance(payload, dict) else None
 
 
 def _build_environment() -> OSINTEnvironment:
@@ -41,19 +53,10 @@ def _build_environment() -> OSINTEnvironment:
 
 
 @lru_cache(maxsize=1)
-def _space_snapshot() -> dict[str, Any]:
+def _base_environment_snapshot() -> dict[str, Any]:
     env = _build_environment()
-    evaluation = run_evaluation(env, episodes=3, return_details=True, llm=build_llm_client(env.config.llm))
-    dashboard_path = export_dashboard(
-        env=env,
-        evaluation=evaluation,
-        leaderboard_records=[],
-        output_path=str(SPACE_DASHBOARD),
-    )
     difficulty_counts = Counter(str(task.metadata.get("difficulty", "unknown")) for task in env.tasks)
     return {
-        "dashboard_path": dashboard_path,
-        "summary": evaluation["summary"],
         "task_count": len(env.tasks),
         "difficulty_counts": dict(difficulty_counts),
         "action_space": ["CALL_TOOL", "ADD_EDGE", "ANSWER"],
@@ -74,6 +77,58 @@ def _space_snapshot() -> dict[str, Any]:
     }
 
 
+@lru_cache(maxsize=1)
+def _preview_snapshot() -> dict[str, Any]:
+    env = _build_environment()
+    evaluation = run_evaluation(env, episodes=3, return_details=True, llm=build_llm_client(env.config.llm))
+    dashboard_path = export_dashboard(
+        env=env,
+        evaluation=evaluation,
+        leaderboard_records=[],
+        output_path=str(SPACE_DASHBOARD),
+    )
+    snapshot = dict(_base_environment_snapshot())
+    snapshot["summary"] = evaluation["summary"]
+    snapshot["dashboard_path"] = dashboard_path
+    return snapshot
+
+
+def _space_snapshot() -> dict[str, Any]:
+    snapshot = dict(_base_environment_snapshot())
+
+    baseline_payload = _load_json(LATEST_BASELINE_OUTPUT)
+    if baseline_payload is not None and isinstance(baseline_payload.get("summary"), dict):
+        dashboard_path = Path(
+            str(
+                ((baseline_payload.get("run") or {}).get("dashboard_path"))
+                or "artifacts/baselines/openai_fixed_levels_dashboard.html"
+            )
+        )
+        if dashboard_path.exists():
+            snapshot["dashboard_path"] = str(dashboard_path)
+        snapshot["summary"] = dict(baseline_payload["summary"])
+        snapshot["source"] = "baseline_output"
+        return snapshot
+
+    evaluation_payload = _load_json(LATEST_EVALUATION_OUTPUT)
+    if evaluation_payload is not None and isinstance(evaluation_payload.get("summary"), dict):
+        env = _build_environment()
+        dashboard_path = export_dashboard(
+            env=env,
+            evaluation=evaluation_payload,
+            leaderboard_records=[],
+            output_path=str(SPACE_DASHBOARD),
+        )
+        snapshot["summary"] = dict(evaluation_payload["summary"])
+        snapshot["dashboard_path"] = dashboard_path
+        snapshot["source"] = "latest_evaluation"
+        return snapshot
+
+    preview = _preview_snapshot()
+    preview["source"] = "preview"
+    return preview
+
+
 app = FastAPI(title="OSINT OpenEnv Space", version="0.1.0")
 
 
@@ -218,4 +273,3 @@ if __name__ == "__main__":
     import uvicorn
 
     uvicorn.run("server:app", host="0.0.0.0", port=SPACE_PORT)
-
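The `_load_json` helper added to `server.py` above returns `None` for a missing, unreadable, malformed, or non-dict artifact instead of raising, which is what lets `_space_snapshot` fall through its baseline → evaluation → preview chain safely. A minimal standalone check of that behavior (the temp-file setup is illustrative, not from the repo):

```python
import json
import tempfile
from pathlib import Path

def load_json(path):
    # Mirrors server.py's _load_json: never raise, return a dict or None.
    if not path.exists():
        return None
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    return payload if isinstance(payload, dict) else None

tmp = Path(tempfile.mkdtemp())
(tmp / "good.json").write_text('{"summary": {"task_success": 1.0}}', encoding="utf-8")
(tmp / "bad.json").write_text("not json", encoding="utf-8")
(tmp / "list.json").write_text("[1, 2, 3]", encoding="utf-8")

print(load_json(tmp / "good.json"))     # {'summary': {'task_success': 1.0}}
print(load_json(tmp / "bad.json"))      # None: malformed JSON
print(load_json(tmp / "list.json"))     # None: valid JSON but not a dict
print(load_json(tmp / "missing.json"))  # None: file does not exist
```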
src/osint_env/baselines/openai_runner.py CHANGED
@@ -36,14 +36,14 @@ class OpenAIBaselineConfig:
     leaderboard_path: str = "artifacts/baselines/openai_fixed_levels_leaderboard.json"
     dashboard_path: str = "artifacts/baselines/openai_fixed_levels_dashboard.html"
     run_name: str = "openai_fixed_levels_baseline"
-    model: str = "gpt-4o-mini"
+    model: str = "gpt-5-nano"
     base_url: str = "https://api.openai.com/v1"
     api_key: str = ""
     api_key_env: str = "OPENAI_API_KEY"
     temperature: float = 0.0
     max_tokens: int = 256
     timeout_seconds: int = 60
-    episodes: int = 15
+    episodes: int = 30
     max_steps: int = 8
     seed: int | None = 7
     append_leaderboard: bool = True
@@ -428,6 +428,23 @@ class OpenAIBaselineRunner:
 
         summary = metrics.summary()
         duration_seconds = perf_counter() - started
+        if self.config.append_leaderboard:
+            record = append_leaderboard_record(
+                path=self.config.leaderboard_path,
+                summary=summary,
+                episodes=int(self.config.episodes),
+                run_name=self.config.run_name,
+                config={
+                    "provider": "openai",
+                    "model": self.config.model,
+                    "seed": self.config.seed,
+                    "max_steps": self.config.max_steps,
+                    "shared_config_path": self.config.shared_config_path,
+                    "seed_file": self.config.seed_file,
+                },
+            )
+        else:
+            record = None
         dashboard_path = export_dashboard(
             env=env,
             evaluation={"summary": summary, "episodes": episode_rows},
@@ -459,21 +476,7 @@ class OpenAIBaselineRunner:
         output.parent.mkdir(parents=True, exist_ok=True)
         output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
 
-        if self.config.append_leaderboard:
-            record = append_leaderboard_record(
-                path=self.config.leaderboard_path,
-                summary=summary,
-                episodes=int(self.config.episodes),
-                run_name=self.config.run_name,
-                config={
-                    "provider": "openai",
-                    "model": self.config.model,
-                    "seed": self.config.seed,
-                    "max_steps": self.config.max_steps,
-                    "shared_config_path": self.config.shared_config_path,
-                    "seed_file": self.config.seed_file,
-                },
-            )
+        if record is not None:
             payload["record"] = record
             output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
 
tests/test_fixed_levels_dataset.py ADDED
@@ -0,0 +1,18 @@
+import json
+from collections import Counter
+from pathlib import Path
+
+
+def test_fixed_levels_seed_has_30_questions_and_target_node_spans():
+    path = Path("datasets/fixed_levels/seed_fixed_levels.json")
+    payload = json.loads(path.read_text(encoding="utf-8"))
+    questions = payload["seeding"]["seeded_questions"]
+
+    counts = Counter(q["metadata"]["difficulty"] for q in questions)
+    assert counts == {"easy": 10, "mid": 10, "high": 10}
+
+    mid_support_nodes = [int(q["metadata"]["support_nodes"]) for q in questions if q["metadata"]["difficulty"] == "mid"]
+    high_support_nodes = [int(q["metadata"]["support_nodes"]) for q in questions if q["metadata"]["difficulty"] == "high"]
+
+    assert all(15 <= value <= 20 for value in mid_support_nodes)
+    assert all(48 <= value <= 55 for value in high_support_nodes)