# Reward Design Notes
This environment uses a composite reward that adapts ideas from:
- AutoGraph-R1 (arXiv:2510.15339)
- UniRel (arXiv:2512.17043)
- DeepPath (EMNLP 2017, D17-1060)
- Multi-Hop KG Reasoning with Reward Shaping (EMNLP 2018, D18-1362)
- Kimi K2.5 (arXiv:2602.02276) for PARL-style swarm auxiliary shaping
Additional related context consulted:
- MINERVA (arXiv:1711.05851) for query-conditioned walk-style reasoning over KG paths.
## Components in this Branch
The implementation follows a staged reward design (a minimal dispatch sketch follows the list):
1. edge-level rewards during graph construction (`ADD_EDGE`)
2. answer-level rewards for retrieval usefulness and final task utility (`ANSWER`)
3. evaluation-level composite leaderboard score for benchmark ranking
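The staging can be pictured as a dispatch on the action type. Below is a minimal sketch, assuming a hypothetical `Action` container and stub stage functions; the environment's real action and state types may differ:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    kind: str                                    # "ADD_EDGE" or "ANSWER"
    payload: dict = field(default_factory=dict)


def edge_reward(payload: dict, state: Any) -> float:
    return 0.0  # stage-1 terms, sketched under "Edge addition reward"


def answer_reward(payload: dict, state: Any) -> float:
    return 0.0  # stage-2 terms, sketched under "Final answer reward"


def step_reward(action: Action, state: Any) -> float:
    # Stage 3 (the evaluation-level composite leaderboard score) is computed
    # from logged components at evaluation time, not inside the step loop.
    if action.kind == "ADD_EDGE":
        return edge_reward(action.payload, state)    # stage 1
    if action.kind == "ANSWER":
        return answer_reward(action.payload, state)  # stage 2
    return 0.0
```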
### 1) Edge addition reward
For each `ADD_EDGE`, the reward combines (a minimal mixing sketch follows the list):
- Global accuracy term (DeepPath):
- $r_{global} = +1$ if a candidate edge is correct, else $-1$ (scaled in code for stability).
- Soft shaping term (D18 reward shaping):
- $R = R_b + (1 - R_b) f(s, r, o)$, where $f$ is a soft fact plausibility score.
- In code, $f$ is approximated by relation/type priors plus small domain priors.
- Efficiency term (DeepPath):
- $r_{efficiency} \propto 1 / \text{step\_count}$.
- Diversity term (DeepPath):
- novelty from cosine dissimilarity of edge signatures; repeated patterns are down-weighted.
- Relation/entity informativeness (UniRel):
- relation rarity via normalized IDF of relation labels,
- entity informativeness via inverse hub-penalty.
- Connectivity gain term:
- rewards bridge edges that connect previously disconnected graph regions.
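The following sketch shows how these terms might be mixed into a single scalar. The argument names, the `[0, 1]` scalar inputs (`plausibility`, `novelty`, `rel_idf`, `hub_penalty`), and the weights are assumptions for illustration, not the module's actual interface:

```python
def edge_reward(
    correct: bool,            # edge-correctness label for the global term
    plausibility: float,      # f(s, r, o) in [0, 1] from relation/type priors
    step_count: int,          # construction steps taken so far
    novelty: float,           # 1 - max cosine similarity to prior edge signatures
    rel_idf: float,           # normalized IDF of the relation label, in [0, 1]
    hub_penalty: float,       # grows with endpoint degree, in [0, 1]
    bridges_components: bool, # edge joins previously disconnected regions
    weights: dict[str, float] | None = None,
) -> float:
    w = weights or {"global": 0.4, "shaping": 0.2, "efficiency": 0.1,
                    "diversity": 0.1, "informativeness": 0.1, "connectivity": 0.1}
    # DeepPath-style global accuracy term (scaled by its weight for stability).
    r_global = 1.0 if correct else -1.0
    # D18-1362 soft shaping: R = R_b + (1 - R_b) * f(s, r, o).
    r_b = 1.0 if correct else 0.0
    r_shaped = r_b + (1.0 - r_b) * plausibility
    # DeepPath efficiency term: r_efficiency ~ 1 / step_count.
    r_eff = 1.0 / max(step_count, 1)
    # Diversity: repeated edge-signature patterns are down-weighted.
    r_div = novelty
    # UniRel-style informativeness: rare relations, non-hub entities.
    r_info = 0.5 * rel_idf + 0.5 * (1.0 - hub_penalty)
    # Connectivity gain: reward bridge edges.
    r_conn = 1.0 if bridges_components else 0.0
    return (w["global"] * r_global + w["shaping"] * r_shaped
            + w["efficiency"] * r_eff + w["diversity"] * r_div
            + w["informativeness"] * r_info + w["connectivity"] * r_conn)
```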
### 2) Final answer reward
For `ANSWER`, the reward combines (two of these terms are sketched after the list):
- format validity,
- answer correctness,
- knowledge-carrying utility (AutoGraph-R1 style):
- $R_C(q, y, G) = \mathbb{I}[\text{deducible}(q, y \mid G)]$.
- knowledge-indexing utility (AutoGraph-R1 style):
- $R_I(q, D_{gold}, G) = |\text{Top-}k(G, q) \cap D_{gold}| / |D_{gold}|$,
- approximated in this environment with evidence recall over tool outputs.
- connectivity (UniRel style):
- discrete connectivity reward over extracted seed entities, normalized for stable mixing.
- graph F1 against supporting edges,
- compactness penalty for unnecessary extra edges,
- efficiency bonus,
- relation/entity informativeness for the constructed subgraph,
- repetition penalty to discourage redundant relation generation patterns.
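Two of these terms reduce to simple set arithmetic. A sketch, assuming evidence items and edges are hashable ids/triples (the names `evidence_recall` and `graph_f1` are illustrative):

```python
def evidence_recall(retrieved: set[str], gold: set[str]) -> float:
    """R_I approximation: recall of gold evidence among tool outputs,
    |retrieved ∩ gold| / |gold|."""
    if not gold:
        return 0.0
    return len(retrieved & gold) / len(gold)


def graph_f1(pred: set[tuple[str, str, str]],
             gold: set[tuple[str, str, str]]) -> float:
    """F1 of constructed (s, r, o) edges against the supporting-edge set."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```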
The UniRel-style aggregate view, as represented in this branch:
$$
R(a) \approx R_{fmt} + R_{con} + w_1 R_{ent} + w_2 R_{rel} + \text{task utility terms}
$$
with task utility terms coming from AutoGraph-inspired $R_C$ and $R_I$ components.
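In code, that aggregate might look like the following; the component keys and default weights are illustrative, and the task-utility terms are added unweighted as in the formula above:

```python
def aggregate_answer_reward(c: dict[str, float],
                            w1: float = 0.1, w2: float = 0.1) -> float:
    """Sketch of R(a) ~ R_fmt + R_con + w1*R_ent + w2*R_rel
    + task-utility terms (AutoGraph-inspired R_C and R_I)."""
    return (c["fmt"] + c["con"]
            + w1 * c["ent"] + w2 * c["rel"]
            + c["r_c"] + c["r_i"])
```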
## Telemetry
Per-step component rewards are aggregated into `info["reward_components"]`, enabling:
- richer benchmark summaries,
- leaderboard ranking by composite utility,
- visual diagnostics in dashboard exports.
Evaluation also computes derived retrieval and structural utility signals used in leaderboard ranking.
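A minimal sketch of the per-step aggregation, assuming each step logs a flat name-to-value dict (the key names are illustrative):

```python
from collections import defaultdict


def aggregate_components(step_logs: list[dict[str, float]]) -> dict[str, float]:
    """Sum per-step component rewards into the dict exposed as
    info["reward_components"]."""
    totals: dict[str, float] = defaultdict(float)
    for log in step_logs:
        for name, value in log.items():
            totals[name] += value
    return dict(totals)


# Example: two steps' component logs rolled into one summary.
info = {"reward_components": aggregate_components([
    {"global": 0.4, "efficiency": 0.1},
    {"global": -0.4, "connectivity": 0.1},
])}
# -> {"global": 0.0, "efficiency": 0.1, "connectivity": 0.1}
```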
## Future Multi-Agent Notes
This branch now includes a low-width swarm baseline orchestrator that adds PARL-style auxiliary shaping on top of the core edge and answer rewards.
The helper implementation is in:
- `src/osint_env/env/spawn_reward_hooks.py`
It follows the Kimi K2.5 style decomposition (a minimal sketch follows the list):
- $r_{PARL}(x,y) = r_{perf}(x,y) + \lambda_1 r_{parallel} + \lambda_2 r_{finish}$,
- optional critical-steps shaping for latency-sensitive training,
- optional annealing of $\lambda_1, \lambda_2$ toward zero,
- optional breadth/depth shaping hooks for future branch integration.
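A minimal sketch of that decomposition with the optional linear annealing of $\lambda_1, \lambda_2$ toward zero; the function name, annealing schedule, and default constants are assumptions, not necessarily what `spawn_reward_hooks.py` does:

```python
def parl_reward(r_perf: float, r_parallel: float, r_finish: float,
                step: int, total_steps: int,
                lam1: float = 0.1, lam2: float = 0.1) -> float:
    """r_PARL = r_perf + lam1 * r_parallel + lam2 * r_finish, with the
    auxiliary weights annealed linearly toward zero over training."""
    anneal = max(0.0, 1.0 - step / max(total_steps, 1))
    return r_perf + anneal * (lam1 * r_parallel + lam2 * r_finish)
```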
The expanded project-level walkthrough is in `README.md` under "Reward Design (Integrated Notes)".