siddeshwar-kagatikar committed
Commit 2292d06 · 1 Parent(s): dbcbf00

fixed server and made tough queries
README.md CHANGED
@@ -24,7 +24,7 @@ The motivation is to provide a reproducible OpenEnv-compatible environment for e
 
 The environment generates or loads a hidden canonical graph of users, aliases, organizations, locations, posts, threads, and events. It then exposes partial platform views and a task list drawn from that graph.
 
-The default hosted Space uses the fixed-level benchmark in [`datasets/fixed_levels/seed_fixed_levels.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/datasets/fixed_levels/seed_fixed_levels.json), which contains 15 stable tasks over one shared seeded graph.
+The default hosted Space uses the fixed-level benchmark in [`datasets/fixed_levels/seed_fixed_levels.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/datasets/fixed_levels/seed_fixed_levels.json), which now contains 30 stable tasks over a larger shared seeded graph.
 
 ## Action Space
 
@@ -134,7 +134,7 @@ The reproducible OpenAI baseline is implemented in [`scripts/run_openai_baseline
 Default behavior:
 
 - dataset: fixed-level benchmark
-- episodes: 15
+- episodes: 30
 - max steps per episode: 8
 - temperature: 0.0
 - output artifact: `artifacts/baselines/openai_fixed_levels_latest.json`
@@ -143,7 +143,7 @@ Run it with an API key:
 
 ```bash
 export OPENAI_API_KEY="your_key_here"
-python scripts/run_openai_baseline.py --model gpt-4o-mini
+python scripts/run_openai_baseline.py --model gpt-5-nano
 ```
 
 The script is designed to stay bounded enough for a normal benchmark pass to finish comfortably under 20 minutes on a lightweight chat model, while still using the full fixed task set. For repeatability it fixes the benchmark graph/tasks and uses deterministic decoding settings. Because remote model backends can still change over time, the output artifact also records model metadata and system fingerprints when available.
@@ -174,15 +174,9 @@ The FastAPI app serves:
 
 ## Baseline Scores
 
-Bundled fixed-level baseline artifact:
-
-| baseline | provider | model | episodes | task success | avg graph f1 | leaderboard score |
-|---|---|---|---:|---:|---:|---:|
-| `fixed_levels_qwen_swarm` | Ollama | `qwen3:2b` | 15 | 1.000 | 0.849 | 0.854 |
-
-Source: [`datasets/fixed_levels/qwen_swarm_benchmark_fixed_levels.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/datasets/fixed_levels/qwen_swarm_benchmark_fixed_levels.json)
-
-After you supply an OpenAI API key, the matching baseline scores will be written to:
+The fixed-level benchmark was expanded from the earlier 15-question set to a 30-question set with a larger seeded graph, so older benchmark artifacts should be treated as legacy and regenerated on the new dataset before using them as reference scores.
+
+After you supply an OpenAI API key, the current baseline scores for the expanded benchmark will be written to:
 
 - [`artifacts/baselines/openai_fixed_levels_latest.json`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/artifacts/baselines/openai_fixed_levels_latest.json)
 - [`artifacts/baselines/openai_fixed_levels_dashboard.html`](/c:/Users/SIDDESHWAR/Desktop/meta/OSINT_env/artifacts/baselines/openai_fixed_levels_dashboard.html)
datasets/fixed_levels/README.md CHANGED
@@ -4,22 +4,22 @@ This folder contains a fixed three-level OSINT benchmark set built on one shared
 
 ## Files
 
-- `seed_fixed_levels.json`: master fixed seed with canonical nodes, edges, and 15 fixed questions.
+- `seed_fixed_levels.json`: master fixed seed with an expanded canonical graph and 30 fixed questions.
 - `fixed_graph_questions.json`: extracted fixed dataset snapshot for submission packaging.
 - `shared_config_fixed_levels.json`: run config used for generation and evaluation.
 - `complete_dataset_qwen_generated.json`: full dataset after Qwen (`qwen3:2b` via Ollama) expands the graph.
-- `qwen_swarm_eval_fixed_levels.json`: Qwen swarm evaluation summary on this set.
-- `qwen_swarm_benchmark_fixed_levels.json`: benchmark output with record and summary.
+- `qwen_swarm_eval_fixed_levels.json`: legacy Qwen swarm evaluation summary from the older smaller version of the set.
+- `qwen_swarm_benchmark_fixed_levels.json`: legacy benchmark output from the older smaller version of the set.
 - `leaderboard_fixed_levels.json`: leaderboard file for this dataset.
 - `dashboard_fixed_levels.html`: interactive dashboard generated from the benchmark run.
 
 ## Difficulty Design
 
-- Easy: 5 questions, mostly direct alias, org, location, and event lookup.
-- Mid: 5 questions, 2-hop linking across alias plus org or event relations.
-- High: 5 questions, multi-hop cross-platform traces with implicit collaboration context.
+- Easy: 10 questions. These now use the older hard-style multi-hop traces as the new floor.
+- Mid: 10 questions. Each question spans roughly 15-20 supporting nodes.
+- High: 10 questions. Each question spans roughly 50 supporting nodes.
 
-All 15 questions are fixed and share the same seeded graph.
+All 30 questions are fixed and share the same larger seeded graph.
 
 ## Regenerate Artifacts
 
datasets/fixed_levels/complete_dataset_qwen_generated.json CHANGED
The diff for this file is too large to render. See raw diff
 
datasets/fixed_levels/fixed_graph_questions.json CHANGED
The diff for this file is too large to render. See raw diff
 
datasets/fixed_levels/seed_fixed_levels.json CHANGED
The diff for this file is too large to render. See raw diff
 
datasets/fixed_levels/shared_config_fixed_levels.json CHANGED
@@ -1,10 +1,10 @@
 {
   "environment": {
-    "n_users": 4,
-    "alias_density": 0.0,
-    "noise_level": 0.08,
-    "red_herring_rate": 0.04,
-    "max_steps": 20,
+    "n_users": 24,
+    "alias_density": 0.2,
+    "noise_level": 0.12,
+    "red_herring_rate": 0.08,
+    "max_steps": 24,
     "seed": 2026
   },
   "swarm": {
@@ -28,7 +28,7 @@
     "seeded_questions": [],
     "llm_generate_remaining_graph": true,
     "llm_generate_remaining_tasks": false,
-    "llm_generated_edge_budget": 28,
+    "llm_generated_edge_budget": 64,
     "llm_generated_task_budget": 0,
     "llm_generation_parallel": true,
     "llm_generation_workers": 4,
@@ -47,7 +47,7 @@
     "openai_api_key": ""
   },
   "runtime": {
-    "default_episodes": 15,
+    "default_episodes": 30,
     "leaderboard_path": "datasets/fixed_levels/leaderboard_fixed_levels.json",
     "dashboard_path": "datasets/fixed_levels/dashboard_fixed_levels.html",
     "sweep_dashboard_dir": "datasets/fixed_levels/sweep_dashboards"
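The updated `environment` block scales the hidden graph up (24 users, denser aliases and noise) while `runtime.default_episodes` tracks the 30-task set. As a quick sanity-check sketch (the JSON fragment is copied from the diff above; the check itself is hypothetical and not part of the repo):

```python
import json

# Fragment of the updated shared_config_fixed_levels.json, values taken
# from the diff above; reading the real file would work the same way.
cfg = json.loads("""
{
  "environment": {
    "n_users": 24,
    "alias_density": 0.2,
    "noise_level": 0.12,
    "red_herring_rate": 0.08,
    "max_steps": 24,
    "seed": 2026
  },
  "runtime": {"default_episodes": 30}
}
""")

# One episode per fixed task: the episode count should match the 30-question set.
assert cfg["runtime"]["default_episodes"] == 30
print(cfg["environment"]["n_users"])  # 24
```

The fixed `seed` of 2026 is what keeps the expanded graph reproducible across regenerations.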
scripts/generate_fixed_levels_seed.py ADDED
@@ -0,0 +1,109 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from collections import Counter, OrderedDict
2
+ from pathlib import Path
3
+ import json
4
+
5
+ U=[('aria','Aria Sen','Helios Labs','Sector 9'),('bharat','Bharat Kulkarni','Northbridge Logistics','Dockyard 17'),('cyrus','Cyrus Mehta','Apex Dynamics','Old Town'),('diya','Diya Roy','Blueharbor Media','Old Town'),('elin','Elin Das','Helios Labs','Sector 9'),('faris','Faris Noor','Tidewatch Ops','Rivergate'),('gita','Gita Pradhan','Apex Dynamics','Old Town'),('hiro','Hiro Tan','Northbridge Logistics','Dockyard 17'),('ivy','Ivy Kapoor','Kestrel Works','Rivergate'),('jules','Jules Banerjee','Blueharbor Media','Old Town'),('kian','Kian Bose','Atlas Freight','East Quay'),('leena','Leena Das','Sunmesh Analytics','Sector 9'),('mika','Mika Solanki','Orion Customs','North Basin'),('nora','Nora Iqbal','Emberline Security','Foundry Row'),('omar','Omar Sheikh','Atlas Freight','East Quay'),('priya','Priya Menon','Sunmesh Analytics','Sector 9'),('quinn','Quinn Rao','Orion Customs','North Basin'),('rhea','Rhea Kapoor','Emberline Security','Foundry Row'),('soren','Soren Malik','Harborlight Transit','Uplink Yard'),('tara','Tara Dey','Harborlight Transit','Uplink Yard')]
6
+ A=[('orchidfox','@orchidfox','ivy'),('steelquill','@steelquill','bharat'),('monsoonbyte','@monsoonbyte','diya'),('nightrelay','@nightrelay','faris'),('mapleghost','@mapleghost','elin'),('docksparrow','@docksparrow','hiro'),('quartzlotus','@quartzlotus','cyrus'),('emberglass','@emberglass','nora'),('basinraven','@basinraven','mika'),('tideshard','@tideshard','soren'),('hollowsignal','@hollowsignal','priya'),('ironwhisper','@ironwhisper','omar'),('cinderveil','@cinderveil','rhea'),('sablekeel','@sablekeel','tara'),('lanternmoth','@lanternmoth','kian'),('frostledger','@frostledger','leena')]
7
+ L=[('dockyard17','Dockyard 17'),('sector9','Sector 9'),('old_town','Old Town'),('rivergate','Rivergate'),('east_quay','East Quay'),('foundry_row','Foundry Row'),('north_basin','North Basin'),('uplink_yard','Uplink Yard')]
8
+ O=[('helios_labs','Helios Labs','sector9'),('northbridge_logistics','Northbridge Logistics','dockyard17'),('apex_dynamics','Apex Dynamics','old_town'),('blueharbor_media','Blueharbor Media','old_town'),('tidewatch_ops','Tidewatch Ops','rivergate'),('kestrel_works','Kestrel Works','rivergate'),('atlas_freight','Atlas Freight','east_quay'),('sunmesh_analytics','Sunmesh Analytics','sector9'),('orion_customs','Orion Customs','north_basin'),('emberline_security','Emberline Security','foundry_row'),('harborlight_transit','Harborlight Transit','uplink_yard')]
9
+ E=[('project_lantern','Project Lantern'),('black_kite','Black Kite'),('silent_current','Silent Current'),('amber_veil','Amber Veil'),('glass_harbor','Glass Harbor'),('ember_tide','Ember Tide'),('iron_wharf','Iron Wharf'),('ghost_signal','Ghost Signal')]
10
+ T=[('supply_leak','supply_chain'),('port_audit','port_audit'),('customs_breach','customs_breach'),('relay_map','relay_map'),('foundry_watch','foundry_watch'),('basin_shift','basin_shift'),('quiet_manifest','quiet_manifest'),('uplink_route','uplink_route'),('ember_tide_watch','ember_tide'),('ghost_signal_net','ghost_signal')]
11
+ P=['shift_roster','midnight_manifest','sat_phone_ping','drone_parts','relay_schedule','quay_ledgers','customs_tag','hull_signal','basin_photo','foundry_map','lantern_route','uplink_note']
12
+
13
+ def uid(x): return f'user_{x}'
14
+ def aid(x): return f'alias_{x}'
15
+ def oid(x): return f'org_{x}'
16
+ def lid(x): return f'loc_{x}'
17
+ def eid(x): return f'event_{x}'
18
+ def tid(x): return f'thr_{x}'
19
+ def pid(x): return f'post_{x}'
20
+
21
+ def addn(nodes,nid,nt,attrs): nodes.append({'node_id':nid,'node_type':nt,'attrs':attrs})
22
+
23
+ def build():
24
+ nodes=[]; edges=OrderedDict();
25
+ for s,name,org,loc in U: addn(nodes,uid(s),'user',{'name':name,'org':org,'location':loc})
26
+ for s,handle,user in A: addn(nodes,aid(s),'alias',{'handle':handle})
27
+ for s,name,_ in O: addn(nodes,oid(s),'org',{'name':name})
28
+ for s,name in L: addn(nodes,lid(s),'location',{'name':name})
29
+ for s,name in E: addn(nodes,eid(s),'event',{'name':name})
30
+ for s,topic in T: addn(nodes,tid(s),'thread',{'topic':topic})
31
+ for s in P: addn(nodes,pid(s),'post',{'channel':'microblog'})
32
+ def ae(k,src,rel,dst,c=1.0): edges[k]={'src':src,'rel':rel,'dst':dst,'confidence':c}
33
+ for s,_,user in A: ae(f'a_{s}',aid(s),'alias_of',uid(user))
34
+ org_map={name:oid(s) for s,name,_ in O}; loc_map={name:lid(s) for s,name in L}
35
+ for s,_,org,loc in U: ae(f'w_{s}',uid(s),'works_at',org_map[org]); ae(f'l_{s}',uid(s),'located_in',loc_map[loc])
36
+ for s,_,loc in O: ae(f'op_{s}',oid(s),'operates_in',lid(loc))
37
+ CP=[('ivy','bharat',.95),('bharat','hiro',.95),('hiro','faris',.92),('faris','diya',.90),('diya','elin',.89),('elin','aria',.87),('aria','cyrus',.84),('cyrus','gita',.83),('gita','jules',.82),('jules','bharat',.81),('diya','ivy',.90),('ivy','elin',.86),('kian','omar',.93),('omar','mika',.90),('mika','quinn',.89),('quinn','nora',.88),('nora','rhea',.87),('rhea','soren',.86),('soren','tara',.86),('tara','kian',.84),('priya','leena',.91),('leena','aria',.83),('priya','nora',.82),('kian','bharat',.80),('soren','faris',.79),('quinn','hiro',.78)]
38
+ for i,(a,b,c) in enumerate(CP,1): ae(f'c{i:02d}',uid(a),'connected_to',uid(b),c)
39
+ PA={'midnight_manifest':'orchidfox','shift_roster':'docksparrow','sat_phone_ping':'nightrelay','drone_parts':'monsoonbyte','relay_schedule':'steelquill','quay_ledgers':'lanternmoth','customs_tag':'basinraven','hull_signal':'tideshard','basin_photo':'emberglass','foundry_map':'cinderveil','lantern_route':'frostledger','uplink_note':'sablekeel'}
40
+ for post,author in PA.items(): ae(f'ap_{post}',aid(author),'authored_post',pid(post))
41
+ PR={'midnight_manifest':['dockyard17','project_lantern'],'shift_roster':['dockyard17','northbridge_logistics'],'sat_phone_ping':['rivergate','project_lantern'],'drone_parts':['black_kite','kestrel_works'],'relay_schedule':['project_lantern','sector9'],'quay_ledgers':['east_quay','glass_harbor'],'customs_tag':['north_basin','iron_wharf'],'hull_signal':['uplink_yard','ghost_signal'],'basin_photo':['foundry_row','amber_veil'],'foundry_map':['foundry_row','ember_tide'],'lantern_route':['project_lantern','sunmesh_analytics'],'uplink_note':['uplink_yard','harborlight_transit']}
42
+ for post,refs in PR.items():
43
+ for i,x in enumerate(refs,1): ae(f'r_{post}_{i}',pid(post),'references', lid(x) if x in {y for y,_ in L} else (oid(x) if x in {y for y,_,_ in O} else eid(x)))
44
+ TA={'supply_leak':'diya','port_audit':'jules','customs_breach':'mika','relay_map':'leena','foundry_watch':'nora','basin_shift':'quinn','quiet_manifest':'kian','uplink_route':'soren','ember_tide_watch':'rhea','ghost_signal_net':'tara'}
45
+ TL={'supply_leak':[('discusses','project_lantern'),('references','northbridge_logistics')],'port_audit':[('discusses','black_kite'),('references','kestrel_works')],'customs_breach':[('discusses','iron_wharf'),('references','orion_customs')],'relay_map':[('discusses','project_lantern'),('references','sunmesh_analytics')],'foundry_watch':[('discusses','ember_tide'),('references','emberline_security')],'basin_shift':[('discusses','amber_veil'),('references','north_basin')],'quiet_manifest':[('discusses','glass_harbor'),('references','atlas_freight')],'uplink_route':[('discusses','ghost_signal'),('references','harborlight_transit')],'ember_tide_watch':[('discusses','ember_tide'),('references','foundry_row')],'ghost_signal_net':[('discusses','ghost_signal'),('references','uplink_yard')]}
46
+ for t,u in TA.items(): ae(f'at_{t}',uid(u),'authored_thread',tid(t))
47
+ for t,rels in TL.items():
48
+ for i,(rel,x) in enumerate(rels,1): ae(f'tl_{t}_{i}',tid(t),rel, lid(x) if x in {y for y,_ in L} else (oid(x) if x in {y for y,_,_ in O} else eid(x)))
49
+ ER=[('bharat','collaborates_on','project_lantern'),('hiro','collaborates_on','project_lantern'),('faris','collaborates_on','project_lantern'),('diya','investigates','project_lantern'),('leena','monitors','project_lantern'),('ivy','collaborates_on','black_kite'),('cyrus','collaborates_on','black_kite'),('elin','investigates','black_kite'),('jules','reports_on','black_kite'),('kian','collaborates_on','glass_harbor'),('omar','collaborates_on','glass_harbor'),('priya','monitors','glass_harbor'),('mika','collaborates_on','iron_wharf'),('quinn','collaborates_on','iron_wharf'),('nora','investigates','amber_veil'),('rhea','collaborates_on','ember_tide'),('soren','collaborates_on','ghost_signal'),('tara','reports_on','ghost_signal'),('gita','monitors','silent_current'),('jules','reports_on','silent_current')]
50
+ for i,(u,rel,e) in enumerate(ER,1): ae(f'er{i:02d}',uid(u),rel,eid(e),.9)
51
+ X=[(eid('project_lantern'),'connected_to',eid('glass_harbor')),(eid('black_kite'),'connected_to',eid('amber_veil')),(eid('ember_tide'),'connected_to',eid('ghost_signal')),(oid('atlas_freight'),'connected_to',oid('northbridge_logistics')),(oid('orion_customs'),'connected_to',oid('emberline_security')),(oid('harborlight_transit'),'connected_to',oid('tidewatch_ops'))]
52
+ for i,(a,rel,b) in enumerate(X,1): ae(f'x{i:02d}',a,rel,b,.77)
53
+ return nodes,edges
54
+
55
+ def mk_questions(edges):
56
+ def ids(*items):
57
+ out=[]
58
+ for it in items:
59
+ if isinstance(it,list): out.extend(it)
60
+ else: out.append(it)
61
+ return out
62
+ def rng(prefix,a,b): return [f'{prefix}{i:02d}' for i in range(a,b+1)]
63
+ def sup(edge_ids): return [edges[e] for e in edge_ids]
64
+ def nodes(edge_ids):
65
+ s=set()
66
+ for e in edge_ids: s|={edges[e]['src'],edges[e]['dst']}
67
+ return len(s)
68
+ qs=[]
69
+ easy=[('easy_01','alias_orchidfox -> post_midnight_manifest -> loc_dockyard17 -> connected collaborator on event_project_lantern. Who is it?','user_bharat',ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','c01','er01')),('easy_02','thr_supply_leak references org_northbridge_logistics. Which alias_docksparrow user works there and collaborates on event_project_lantern?','user_hiro',ids('tl_supply_leak_2','a_docksparrow','w_hiro','er02')),('easy_03','alias_monsoonbyte authored post_drone_parts about event_black_kite. Which user behind that alias is directly connected to the Kestrel collaborator?','user_diya',ids('a_monsoonbyte','ap_drone_parts','r_drone_parts_1','w_ivy','er06','c12')),('easy_04','alias_nightrelay references loc_rivergate. Which user behind it works at an org operating there and collaborates on event_project_lantern?','user_faris',ids('a_nightrelay','ap_sat_phone_ping','r_sat_phone_ping_1','w_faris','op_tidewatch_ops','er03')),('easy_05','thr_port_audit discusses Black Kite and references Kestrel Works. 
Which alias_orchidfox user authored post_midnight_manifest and collaborates on Black Kite?','user_ivy',ids('tl_port_audit_1','tl_port_audit_2','a_orchidfox','ap_midnight_manifest','w_ivy','er06')),('easy_06','Which Atlas Freight user behind alias_lanternmoth authored post_quay_ledgers and collaborates on event_glass_harbor?','user_kian',ids('a_lanternmoth','ap_quay_ledgers','w_kian','er10')),('easy_07','Which Orion Customs user behind alias_basinraven authored post_customs_tag and collaborates on event_iron_wharf?','user_mika',ids('a_basinraven','ap_customs_tag','w_mika','er13')),('easy_08','Which user behind alias_emberglass posted basin_photo from Foundry Row and investigates Amber Veil?','user_nora',ids('a_emberglass','ap_basin_photo','r_basin_photo_1','er15')),('easy_09','Which user behind alias_tideshard authored post_hull_signal and collaborates on Ghost Signal?','user_soren',ids('a_tideshard','ap_hull_signal','er17')),('easy_10','Which Harborlight Transit user behind alias_sablekeel authored post_uplink_note and reports on Ghost Signal?','user_tara',ids('a_sablekeel','ap_uplink_note','w_tara','er18'))]
70
+ mid=[('mid_01','Follow alias_docksparrow through post_shift_roster, Dockyard 17, and the Lantern chain. Return the org node id.','org_northbridge_logistics',ids('a_docksparrow','ap_shift_roster','r_shift_roster_1','r_shift_roster_2','tl_supply_leak_2','w_hiro','l_hiro','er02','er01','c02','c03')),('mid_02','Across the Glass Harbor cluster, which user behind alias_lanternmoth links to the Atlas Freight network from thr_quiet_manifest?','user_kian',ids('a_lanternmoth','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','at_quiet_manifest','tl_quiet_manifest_1','tl_quiet_manifest_2','w_kian','w_omar','er10','er11','er12','c13','c14')),('mid_03','Trace alias_basinraven through post_customs_tag, thr_customs_breach, and the Orion Customs collaboration chain. Who is it?','user_mika',ids('a_basinraven','ap_customs_tag','r_customs_tag_1','r_customs_tag_2','at_customs_breach','tl_customs_breach_1','tl_customs_breach_2','w_mika','w_quinn','er13','er14','c15','c16','x05')),('mid_04','In the Ember Tide and Amber Veil overlap, which Foundry Row user behind alias_cinderveil collaborates on Ember Tide?','user_rhea',ids('a_cinderveil','ap_foundry_map','r_foundry_map_1','r_foundry_map_2','at_foundry_watch','tl_foundry_watch_1','tl_foundry_watch_2','at_ember_tide_watch','tl_ember_tide_watch_1','tl_ember_tide_watch_2','w_rhea','w_nora','er15','er16','c17','x03')),('mid_05','Follow alias_tideshard from post_hull_signal into thr_uplink_route and the Harborlight relay. 
Return the org node id.','org_harborlight_transit',ids('a_tideshard','ap_hull_signal','r_hull_signal_1','r_hull_signal_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2','w_soren','w_tara','er17','er18','c18','c19','op_harborlight_transit','x06')),('mid_06','Which Sunmesh user behind alias_frostledger connects post_lantern_route to thr_relay_map and the Sector 9 monitoring chain?','user_leena',ids('a_frostledger','ap_lantern_route','r_lantern_route_1','r_lantern_route_2','at_relay_map','tl_relay_map_1','tl_relay_map_2','w_leena','w_priya','l_leena','op_sunmesh_analytics','er05','c21','c22')),('mid_07','Which user behind alias_emberglass is tied to Amber Veil after combining post_basin_photo, thr_basin_shift, and the Foundry Row investigation chain?','user_nora',ids('a_emberglass','ap_basin_photo','r_basin_photo_1','r_basin_photo_2','at_basin_shift','tl_basin_shift_1','tl_basin_shift_2','w_nora','w_quinn','l_nora','er15','c16','c17','x05')),('mid_08','Combine alias_orchidfox, post_midnight_manifest, thr_supply_leak, and the Lantern to Glass Harbor bridge. 
Which user starts that chain?','user_ivy',ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','r_midnight_manifest_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2','w_ivy','er06','c01','c12','x01','er10','er12')),('mid_09','Which user behind alias_monsoonbyte sits at the overlap of Blueharbor Media, Project Lantern, Black Kite, and the Ivy connection chain?','user_diya',ids('a_monsoonbyte','ap_drone_parts','r_drone_parts_1','at_supply_leak','tl_supply_leak_1','at_port_audit','tl_port_audit_1','w_diya','w_ivy','w_jules','er04','er06','er09','c04','c12')),('mid_10','Who is the Northbridge user behind alias_steelquill when combining post_relay_schedule, thr_supply_leak, Dockyard 17, and Lantern collaborator edges?','user_bharat',ids('a_steelquill','ap_relay_schedule','r_relay_schedule_1','r_relay_schedule_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2','w_bharat','w_hiro','l_bharat','l_hiro','er01','er02','c01','c02'))]
71
+ big=list(edges.keys())[:58]
72
+ hard=[('high_01','Lantern to Glass Harbor handoff: identify the user behind alias_orchidfox after combining Lantern logistics, Dockyard links, and Atlas Freight bridge evidence.','user_ivy',ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','r_midnight_manifest_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2',['w_ivy','w_bharat','w_hiro','w_kian','w_omar'],['l_ivy','l_bharat','l_hiro','l_kian','l_omar'],['op_northbridge_logistics','op_kestrel_works','op_atlas_freight'],rng('c',1,3),['c12','c13','c14'],['er01','er02','er03','er06','er10','er11','er12'],'at_quiet_manifest','tl_quiet_manifest_1','tl_quiet_manifest_2','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','x01','x04','a_lanternmoth','a_steelquill','a_docksparrow')),('high_02','North Basin to Foundry Row escalation: which user behind alias_basinraven anchors the Iron Wharf side before the Emberline handoff?','user_mika',ids('a_basinraven','ap_customs_tag','r_customs_tag_1','r_customs_tag_2','at_customs_breach','tl_customs_breach_1','tl_customs_breach_2','at_basin_shift','tl_basin_shift_1','tl_basin_shift_2','at_foundry_watch','tl_foundry_watch_1','tl_foundry_watch_2',['w_mika','w_quinn','w_nora','w_rhea'],['l_mika','l_quinn','l_nora','l_rhea'],['op_orion_customs','op_emberline_security'],['c15','c16','c17'],['er13','er14','er15','er16'],'ap_basin_photo','r_basin_photo_1','r_basin_photo_2','ap_foundry_map','r_foundry_map_1','r_foundry_map_2','x02','x03','x05','a_emberglass','a_cinderveil','c23','c24')),('high_03','Harborlight ghost-signal relay: identify the user behind alias_tideshard at the Harborlight / Tidewatch 
junction.','user_soren',ids('a_tideshard','ap_hull_signal','r_hull_signal_1','r_hull_signal_2','a_sablekeel','ap_uplink_note','r_uplink_note_1','r_uplink_note_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2','at_ghost_signal_net','tl_ghost_signal_net_1','tl_ghost_signal_net_2',['w_soren','w_tara','w_faris'],['l_soren','l_tara','l_faris'],['op_harborlight_transit','op_tidewatch_ops'],['c18','c19','c20','c25'],['er03','er17','er18'],'ap_sat_phone_ping','r_sat_phone_ping_1','r_sat_phone_ping_2','at_supply_leak','tl_supply_leak_1','er01','er02','x03','x06','a_nightrelay')),('high_04','Blueharbor to Black Kite to Lantern overlap: which user is the Blueharbor origin behind alias_monsoonbyte?','user_diya',ids('a_monsoonbyte','ap_drone_parts','r_drone_parts_1','r_drone_parts_2','at_port_audit','tl_port_audit_1','tl_port_audit_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2',['w_diya','w_jules','w_ivy','w_cyrus'],['l_diya','l_jules','l_ivy','l_cyrus'],['op_blueharbor_media','op_kestrel_works','op_apex_dynamics'],['c04','c08','c09','c12'],['er04','er06','er07','er08','er09'],'a_orchidfox','ap_midnight_manifest','r_midnight_manifest_2','x01','x02','at_relay_map','tl_relay_map_1','w_leena','er05')),('high_05','Sector 9 to Dockyard 17 full relay: which user behind alias_steelquill links the Northbridge chain and the Sunmesh monitoring bridge?','user_bharat',ids('a_steelquill','ap_relay_schedule','r_relay_schedule_1','r_relay_schedule_2','a_frostledger','ap_lantern_route','r_lantern_route_1','r_lantern_route_2','at_relay_map','tl_relay_map_1','tl_relay_map_2','at_supply_leak','tl_supply_leak_1','tl_supply_leak_2',['w_bharat','w_hiro','w_leena','w_priya','w_aria'],['l_bharat','l_hiro','l_leena','l_priya','l_aria'],['op_northbridge_logistics','op_sunmesh_analytics','op_helios_labs'],['c01','c02','c05','c06','c07','c21','c22'],['er01','er02','er05'],'x01','x04','a_docksparrow','a_mapleghost','a_hollowsignal')),('high_06','Foundry Row, North Basin, and Uplink 
Yard spread: identify the user behind alias_emberglass before the Harborlight relay takes over.','user_nora',ids('a_emberglass','ap_basin_photo','r_basin_photo_1','r_basin_photo_2','a_cinderveil','ap_foundry_map','r_foundry_map_1','r_foundry_map_2','a_sablekeel','ap_uplink_note','r_uplink_note_1','r_uplink_note_2','at_foundry_watch','tl_foundry_watch_1','tl_foundry_watch_2','at_ember_tide_watch','tl_ember_tide_watch_1','tl_ember_tide_watch_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2',['w_nora','w_rhea','w_soren','w_tara'],['l_nora','l_rhea','l_soren','l_tara'],['op_emberline_security','op_harborlight_transit'],['c17','c18','c19'],['er15','er16','er17','er18'],'x03','x06')),('high_07','Freight and customs bridge: which Atlas Freight user behind alias_lanternmoth connects Glass Harbor with the Northbridge chain?','user_kian',ids('a_lanternmoth','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','at_quiet_manifest','tl_quiet_manifest_1','tl_quiet_manifest_2',['w_kian','w_omar','w_bharat','w_hiro'],['l_kian','l_omar','l_bharat','l_hiro'],['op_atlas_freight','op_northbridge_logistics'],['c13','c14','c24','c02'],['er10','er11','er12','er01','er02'],'ap_shift_roster','r_shift_roster_1','r_shift_roster_2','ap_midnight_manifest','r_midnight_manifest_1','at_supply_leak','tl_supply_leak_2','x04','a_ironwhisper','a_steelquill','a_docksparrow')),('high_08','Black Kite, Amber Veil, and Iron Wharf overlap: which user behind alias_quartzlotus is the Apex-side collaborator?','user_cyrus',ids('a_quartzlotus','w_cyrus','l_cyrus','op_apex_dynamics','er07','at_port_audit','tl_port_audit_1','ap_drone_parts','r_drone_parts_1','er15','at_basin_shift','tl_basin_shift_1','er13','at_customs_breach','tl_customs_breach_1',['w_ivy','w_nora','w_mika','w_quinn'],['l_ivy','l_nora','l_mika','l_quinn'],['op_kestrel_works','op_emberline_security','op_orion_customs'],['c08','c12','c15','c16','c17'],'x02','x05','a_orchidfox','a_basinraven','a_emberglass')),('high_09','Ghost Signal 
and Ember Tide relay: which user behind alias_sablekeel is the Harborlight reporting endpoint?','user_tara',ids('a_sablekeel','ap_uplink_note','r_uplink_note_1','r_uplink_note_2','a_tideshard','ap_hull_signal','r_hull_signal_1','r_hull_signal_2','at_ghost_signal_net','tl_ghost_signal_net_1','tl_ghost_signal_net_2','at_uplink_route','tl_uplink_route_1','tl_uplink_route_2','at_ember_tide_watch','tl_ember_tide_watch_1','tl_ember_tide_watch_2',['w_tara','w_soren','w_rhea','w_nora'],['l_tara','l_soren','l_rhea','l_nora'],['op_harborlight_transit','op_emberline_security'],['c18','c19','c17'],['er16','er17','er18'],'x03','x06','a_cinderveil','a_emberglass')),('high_10','End-to-end benchmark sweep: across Lantern, Black Kite, Glass Harbor, Iron Wharf, Ember Tide, and Ghost Signal, which user behind alias_hollowsignal anchors the Sunmesh monitoring side?','user_priya',big)]
+    for diff,level,specs in [('easy',1,easy),('mid',2,mid),('high',3,hard)]:
+        for qid,q,a,eids in specs:
+            qs.append({'task_type':'fixed_trace','question':q,'answer':a,'supporting_edges':sup(eids),'metadata':{'difficulty':diff,'difficulty_level':level,'question_id':qid,'support_nodes':nodes(eids)}})
+    def edge_key(e): return (e['src'], e['rel'], e['dst'])
+    mid_pool = sup(ids('a_orchidfox','ap_midnight_manifest','r_midnight_manifest_1','r_midnight_manifest_2','a_lanternmoth','ap_quay_ledgers','r_quay_ledgers_1','r_quay_ledgers_2','a_basinraven','ap_customs_tag','r_customs_tag_1','r_customs_tag_2','a_tideshard','ap_hull_signal','r_hull_signal_1','at_supply_leak','tl_supply_leak_1','at_quiet_manifest','tl_quiet_manifest_1','er01','er02','er06','er10','c01','c02','c13'))
+    hard_pool = sup(list(edges.keys())[:120])
+    for q in qs:
+        current = {edge_key(e) for e in q['supporting_edges']}
+        diff = q['metadata']['difficulty']
+        if diff == 'mid':
+            pool = mid_pool
+            target = 17
+        elif diff == 'high':
+            pool = hard_pool
+            target = 50
+        else:
+            continue
+        for e in pool:
+            if q['metadata']['support_nodes'] >= target:
+                break
+            k = edge_key(e)
+            if k not in current:
+                q['supporting_edges'].append(dict(e))
+                current.add(k)
+        q['metadata']['support_nodes'] = len({n for edge in q['supporting_edges'] for n in (edge['src'], edge['dst'])})
+    return qs
+
+def main():
+    nodes,edges=build(); questions=mk_questions(edges)
+    payload={'seeding':{'seeded_nodes':nodes,'seeded_edges':list(edges.values()),'seeded_questions':questions,'llm_generate_remaining_graph':True,'llm_generate_remaining_tasks':False,'llm_generated_edge_budget':48,'llm_generated_task_budget':0,'llm_generation_parallel':True,'llm_generation_workers':4,'llm_generation_retries':3,'allow_template_fallback_on_llm_failure':False}}
+    out=Path('datasets/fixed_levels/seed_fixed_levels.json'); out.write_text(json.dumps(payload,indent=2),encoding='utf-8')
+    counts=Counter(q['metadata']['difficulty'] for q in questions)
+    stats={k:sorted(q['metadata']['support_nodes'] for q in questions if q['metadata']['difficulty']==k) for k in ['easy','mid','high']}
+    print(json.dumps({'nodes':len(nodes),'edges':len(edges),'questions':len(questions),'difficulty_counts':dict(counts),'support_nodes':stats},indent=2))
+
+if __name__=='__main__':
+    main()
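The top-up loop in `mk_questions` above pads each mid/high question's supporting edges from a shared pool toward a target node span, deduplicating by the `(src, rel, dst)` edge key and recounting the distinct endpoint nodes at the end. A standalone sketch of that pattern — names here are illustrative, and this version recomputes the node count on each iteration rather than checking a precomputed `support_nodes` value as the repo code does:

```python
# Sketch of the edge top-up pattern: extend a question's supporting
# edges from a pool, skipping duplicates, until the set of distinct
# endpoint nodes reaches a target size.

def edge_key(e):
    return (e["src"], e["rel"], e["dst"])

def top_up(supporting_edges, pool, target_nodes):
    seen = {edge_key(e) for e in supporting_edges}
    for e in pool:
        nodes = {n for edge in supporting_edges for n in (edge["src"], edge["dst"])}
        if len(nodes) >= target_nodes:
            break
        k = edge_key(e)
        if k not in seen:
            supporting_edges.append(dict(e))
            seen.add(k)
    # Return the final distinct-node count, as stored in metadata.
    return len({n for edge in supporting_edges for n in (edge["src"], edge["dst"])})

edges = [{"src": "a", "rel": "knows", "dst": "b"}]
pool = [
    {"src": "a", "rel": "knows", "dst": "b"},      # duplicate key: skipped
    {"src": "b", "rel": "works_at", "dst": "c"},
    {"src": "c", "rel": "located_in", "dst": "d"},
]
count = top_up(edges, pool, target_nodes=4)
print(count)       # 4 distinct nodes: a, b, c, d
print(len(edges))  # 3 edges; the duplicate was not re-added
```

The fixed-level test below relies on this padding landing mid questions in the 15–20 node band and high questions in the 48–55 band.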
scripts/run_openai_baseline.py CHANGED
@@ -15,11 +15,11 @@ def build_parser() -> argparse.ArgumentParser:
     parser.add_argument("--leaderboard", default="artifacts/baselines/openai_fixed_levels_leaderboard.json", help="Leaderboard JSON path.")
     parser.add_argument("--dashboard", default="artifacts/baselines/openai_fixed_levels_dashboard.html", help="Dashboard HTML path.")
     parser.add_argument("--run-name", default="openai_fixed_levels_baseline", help="Leaderboard run name.")
-    parser.add_argument("--model", default="gpt-4o-mini", help="OpenAI chat model name.")
+    parser.add_argument("--model", default="gpt-5-nano", help="OpenAI chat model name.")
     parser.add_argument("--openai-base-url", default="https://api.openai.com/v1", help="OpenAI-compatible base URL.")
     parser.add_argument("--openai-api-key", default="", help="OpenAI API key override.")
     parser.add_argument("--openai-api-key-env", default="OPENAI_API_KEY", help="Environment variable name for the API key.")
-    parser.add_argument("--episodes", type=int, default=15, help="Number of episodes to evaluate.")
+    parser.add_argument("--episodes", type=int, default=30, help="Number of episodes to evaluate.")
     parser.add_argument("--max-steps", type=int, default=8, help="Episode step budget to keep runs bounded.")
     parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature.")
     parser.add_argument("--max-tokens", type=int, default=256, help="Maximum completion tokens per step.")
@@ -57,4 +57,3 @@ def main() -> None:
 
 if __name__ == "__main__":
     main()
-
server.py CHANGED
@@ -23,6 +23,18 @@ SPACE_PROVIDER = os.getenv("OSINT_SPACE_LLM_PROVIDER", "mock")
 SPACE_MODEL = os.getenv("OSINT_SPACE_LLM_MODEL", "gpt-4o-mini")
 SPACE_PORT = int(os.getenv("PORT", "7860"))
 SPACE_DASHBOARD = Path("artifacts/space_dashboard.html")
+LATEST_BASELINE_OUTPUT = Path("artifacts/baselines/openai_fixed_levels_latest.json")
+LATEST_EVALUATION_OUTPUT = Path("artifacts/latest_evaluation.json")
+
+
+def _load_json(path: Path) -> dict[str, Any] | None:
+    if not path.exists():
+        return None
+    try:
+        payload = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        return None
+    return payload if isinstance(payload, dict) else None
 
 
 def _build_environment() -> OSINTEnvironment:
@@ -41,19 +53,10 @@ def _build_environment() -> OSINTEnvironment:
 
 
 @lru_cache(maxsize=1)
-def _space_snapshot() -> dict[str, Any]:
+def _base_environment_snapshot() -> dict[str, Any]:
     env = _build_environment()
-    evaluation = run_evaluation(env, episodes=3, return_details=True, llm=build_llm_client(env.config.llm))
-    dashboard_path = export_dashboard(
-        env=env,
-        evaluation=evaluation,
-        leaderboard_records=[],
-        output_path=str(SPACE_DASHBOARD),
-    )
     difficulty_counts = Counter(str(task.metadata.get("difficulty", "unknown")) for task in env.tasks)
     return {
-        "dashboard_path": dashboard_path,
-        "summary": evaluation["summary"],
         "task_count": len(env.tasks),
         "difficulty_counts": dict(difficulty_counts),
         "action_space": ["CALL_TOOL", "ADD_EDGE", "ANSWER"],
@@ -74,6 +77,58 @@ def _space_snapshot() -> dict[str, Any]:
     }
 
 
+@lru_cache(maxsize=1)
+def _preview_snapshot() -> dict[str, Any]:
+    env = _build_environment()
+    evaluation = run_evaluation(env, episodes=3, return_details=True, llm=build_llm_client(env.config.llm))
+    dashboard_path = export_dashboard(
+        env=env,
+        evaluation=evaluation,
+        leaderboard_records=[],
+        output_path=str(SPACE_DASHBOARD),
+    )
+    snapshot = dict(_base_environment_snapshot())
+    snapshot["summary"] = evaluation["summary"]
+    snapshot["dashboard_path"] = dashboard_path
+    return snapshot
+
+
+def _space_snapshot() -> dict[str, Any]:
+    snapshot = dict(_base_environment_snapshot())
+
+    baseline_payload = _load_json(LATEST_BASELINE_OUTPUT)
+    if baseline_payload is not None and isinstance(baseline_payload.get("summary"), dict):
+        dashboard_path = Path(
+            str(
+                ((baseline_payload.get("run") or {}).get("dashboard_path"))
+                or "artifacts/baselines/openai_fixed_levels_dashboard.html"
+            )
+        )
+        if dashboard_path.exists():
+            snapshot["dashboard_path"] = str(dashboard_path)
+        snapshot["summary"] = dict(baseline_payload["summary"])
+        snapshot["source"] = "baseline_output"
+        return snapshot
+
+    evaluation_payload = _load_json(LATEST_EVALUATION_OUTPUT)
+    if evaluation_payload is not None and isinstance(evaluation_payload.get("summary"), dict):
+        env = _build_environment()
+        dashboard_path = export_dashboard(
+            env=env,
+            evaluation=evaluation_payload,
+            leaderboard_records=[],
+            output_path=str(SPACE_DASHBOARD),
+        )
+        snapshot["summary"] = dict(evaluation_payload["summary"])
+        snapshot["dashboard_path"] = dashboard_path
+        snapshot["source"] = "latest_evaluation"
+        return snapshot
+
+    preview = _preview_snapshot()
+    preview["source"] = "preview"
+    return preview
+
+
 app = FastAPI(title="OSINT OpenEnv Space", version="0.1.0")
 
 
@@ -218,4 +273,3 @@ if __name__ == "__main__":
     import uvicorn
 
     uvicorn.run("server:app", host="0.0.0.0", port=SPACE_PORT)
-
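The `_load_json` helper added to `server.py` above returns `None` for a missing, unreadable, malformed, or non-dict artifact instead of raising, which is what lets `_space_snapshot` fall through its baseline → evaluation → preview chain safely. A minimal standalone check of that behavior (the temp-file setup is illustrative, not from the repo):

```python
import json
import tempfile
from pathlib import Path

def load_json(path):
    # Mirrors server.py's _load_json: never raise, return a dict or None.
    if not path.exists():
        return None
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    return payload if isinstance(payload, dict) else None

tmp = Path(tempfile.mkdtemp())
(tmp / "good.json").write_text('{"summary": {"task_success": 1.0}}', encoding="utf-8")
(tmp / "bad.json").write_text("not json", encoding="utf-8")
(tmp / "list.json").write_text("[1, 2, 3]", encoding="utf-8")

print(load_json(tmp / "good.json"))     # {'summary': {'task_success': 1.0}}
print(load_json(tmp / "bad.json"))      # None: malformed JSON
print(load_json(tmp / "list.json"))     # None: valid JSON but not a dict
print(load_json(tmp / "missing.json"))  # None: file does not exist
```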
src/osint_env/baselines/openai_runner.py CHANGED
@@ -36,14 +36,14 @@ class OpenAIBaselineConfig:
     leaderboard_path: str = "artifacts/baselines/openai_fixed_levels_leaderboard.json"
     dashboard_path: str = "artifacts/baselines/openai_fixed_levels_dashboard.html"
     run_name: str = "openai_fixed_levels_baseline"
-    model: str = "gpt-4o-mini"
+    model: str = "gpt-5-nano"
     base_url: str = "https://api.openai.com/v1"
     api_key: str = ""
     api_key_env: str = "OPENAI_API_KEY"
     temperature: float = 0.0
     max_tokens: int = 256
     timeout_seconds: int = 60
-    episodes: int = 15
+    episodes: int = 30
     max_steps: int = 8
     seed: int | None = 7
     append_leaderboard: bool = True
@@ -428,6 +428,23 @@ class OpenAIBaselineRunner:
 
         summary = metrics.summary()
         duration_seconds = perf_counter() - started
+        if self.config.append_leaderboard:
+            record = append_leaderboard_record(
+                path=self.config.leaderboard_path,
+                summary=summary,
+                episodes=int(self.config.episodes),
+                run_name=self.config.run_name,
+                config={
+                    "provider": "openai",
+                    "model": self.config.model,
+                    "seed": self.config.seed,
+                    "max_steps": self.config.max_steps,
+                    "shared_config_path": self.config.shared_config_path,
+                    "seed_file": self.config.seed_file,
+                },
+            )
+        else:
+            record = None
         dashboard_path = export_dashboard(
             env=env,
             evaluation={"summary": summary, "episodes": episode_rows},
@@ -459,21 +476,7 @@ class OpenAIBaselineRunner:
         output.parent.mkdir(parents=True, exist_ok=True)
         output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
 
-        if self.config.append_leaderboard:
-            record = append_leaderboard_record(
-                path=self.config.leaderboard_path,
-                summary=summary,
-                episodes=int(self.config.episodes),
-                run_name=self.config.run_name,
-                config={
-                    "provider": "openai",
-                    "model": self.config.model,
-                    "seed": self.config.seed,
-                    "max_steps": self.config.max_steps,
-                    "shared_config_path": self.config.shared_config_path,
-                    "seed_file": self.config.seed_file,
-                },
-            )
+        if record is not None:
             payload["record"] = record
             output.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
 
tests/test_fixed_levels_dataset.py ADDED
@@ -0,0 +1,18 @@
+import json
+from collections import Counter
+from pathlib import Path
+
+
+def test_fixed_levels_seed_has_30_questions_and_target_node_spans():
+    path = Path("datasets/fixed_levels/seed_fixed_levels.json")
+    payload = json.loads(path.read_text(encoding="utf-8"))
+    questions = payload["seeding"]["seeded_questions"]
+
+    counts = Counter(q["metadata"]["difficulty"] for q in questions)
+    assert counts == {"easy": 10, "mid": 10, "high": 10}
+
+    mid_support_nodes = [int(q["metadata"]["support_nodes"]) for q in questions if q["metadata"]["difficulty"] == "mid"]
+    high_support_nodes = [int(q["metadata"]["support_nodes"]) for q in questions if q["metadata"]["difficulty"] == "high"]
+
+    assert all(15 <= value <= 20 for value in mid_support_nodes)
+    assert all(48 <= value <= 55 for value in high_support_nodes)