
Reward Design Notes

This environment uses a composite reward that adapts ideas from:

  • AutoGraph-R1 (arXiv:2510.15339)
  • UniRel (arXiv:2512.17043)
  • DeepPath (EMNLP 2017, D17-1060)
  • Multi-Hop KG Reasoning with Reward Shaping (EMNLP 2018, D18-1362)
  • Kimi K2.5 (arXiv:2602.02276) for PARL-style swarm auxiliary shaping

Additional related context consulted:

  • MINERVA (arXiv:1711.05851) for query-conditioned walk-style reasoning over KG paths.

Components in this Branch

The implementation follows a staged reward design (a routing sketch appears after this list):

  1. edge-level rewards during graph construction (ADD_EDGE)
  2. answer-level rewards for retrieval usefulness and final task utility (ANSWER)
  3. evaluation-level composite leaderboard score for benchmark ranking
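
A minimal routing sketch for this staged design; the function name, the precomputed reward arguments, and the zero default for other actions are illustrative assumptions, not the environment's actual API:

```python
# Illustrative routing of the staged reward design. The per-stage composites
# passed in here are sketched in the sections that follow; stage 3 (the
# leaderboard composite) is computed offline at evaluation time.
def staged_reward(action_kind: str, edge_reward: float, answer_reward: float) -> float:
    if action_kind == "ADD_EDGE":
        return edge_reward    # stage 1: graph-construction reward
    if action_kind == "ANSWER":
        return answer_reward  # stage 2: retrieval-usefulness / task-utility reward
    return 0.0                # other actions receive no stage reward in this sketch
```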

1) Edge addition reward

For each ADD_EDGE, the reward combines the following terms, sketched in code after the list:

  • Global accuracy term (DeepPath):
    • $r_{global} = +1$ if the candidate edge is correct, else $-1$ (scaled in code for stability).
  • Soft shaping term (D18 reward shaping):
    • $R = R_b + (1 - R_b) f(s, r, o)$, where $f$ is a soft fact plausibility score.
    • In code, $f$ is approximated by relation/type priors plus small domain priors.
  • Efficiency term (DeepPath):
    • $r_{efficiency} \propto 1 / \text{step_count}$.
  • Diversity term (DeepPath):
    • novelty from cosine dissimilarity of edge signatures; repeated patterns are down-weighted.
  • Relation/entity informativeness (UniRel):
    • relation rarity via normalized IDF of relation labels,
    • entity informativeness via inverse hub-penalty.
  • Connectivity gain term:
    • rewards bridge edges that connect previously disconnected graph regions.
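
A minimal sketch of how these terms could be combined, assuming the individual signals are precomputed elsewhere; the argument names, scaling constants, and weights below are illustrative placeholders rather than the values used in the code:

```python
def edge_reward(is_correct: bool, plausibility: float, step_count: int,
                max_signature_sim: float, relation_idf: float,
                inverse_hub_penalty: float, bridges_components: bool,
                weights=None) -> float:
    """Illustrative composite of the edge-level terms above.

    Inputs are assumed to be precomputed signals in [0, 1] (except the booleans
    and the step count); the weights are placeholder values."""
    w = weights or {"shaped": 1.0, "eff": 0.1, "div": 0.1, "info": 0.1, "conn": 0.2}

    # Global accuracy term (DeepPath): +1 / -1, scaled down for stability.
    r_global = 0.1 if is_correct else -0.1

    # Soft shaping (D18): R = R_b + (1 - R_b) * f(s, r, o), with binary R_b
    # and f a soft fact-plausibility score in [0, 1].
    r_b = 1.0 if is_correct else 0.0
    r_shaped = r_b + (1.0 - r_b) * plausibility

    # Efficiency term: inversely proportional to the step count so far.
    r_eff = 1.0 / max(1, step_count)

    # Diversity term: novelty as cosine dissimilarity to prior edge signatures.
    r_div = 1.0 - max_signature_sim

    # Relation/entity informativeness: relation IDF plus inverse hub penalty.
    r_info = 0.5 * relation_idf + 0.5 * inverse_hub_penalty

    # Connectivity gain: bonus for bridging previously disconnected regions.
    r_conn = 1.0 if bridges_components else 0.0

    return (r_global + w["shaped"] * r_shaped + w["eff"] * r_eff
            + w["div"] * r_div + w["info"] * r_info + w["conn"] * r_conn)

# Example: a correct, novel, bridging edge added at step 3.
print(edge_reward(True, 0.8, 3, 0.2, 0.7, 0.6, True))
```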

2) Final answer reward

For ANSWER, the reward combines:

  • format validity,
  • answer correctness,
  • knowledge-carrying utility (AutoGraph-R1 style):
    • $R_C(q, y, G) = \mathbb{I}[\text{deducible}(q, y \mid G)]$.
  • knowledge-indexing utility (AutoGraph-R1 style):
    • $R_I(q, D_{gold}, G) = |\text{Top-}k(G, q) \cap D_{gold}| / |D_{gold}|$,
    • approximated in this environment with evidence recall over tool outputs.
  • connectivity (UniRel style):
    • discrete connectivity reward over extracted seed entities, normalized for stable mixing.
  • graph F1 against supporting edges,
  • compactness penalty for unnecessary extra edges,
  • efficiency bonus,
  • relation/entity informativeness for the constructed subgraph,
  • repetition penalty to discourage redundant relation generation patterns.

UniRel-style aggregate view represented in this branch:

$R(a) \approx R_{fmt} + R_{con} + w_1 R_{ent} + w_2 R_{rel} + \text{task utility terms}$

with task utility terms coming from AutoGraph-inspired $R_C$ and $R_I$ components.
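
A minimal sketch of the answer-level composite under this aggregate view, again with precomputed inputs; the argument names, weights, and purely additive mixing are illustrative assumptions rather than the exact formula in the code:

```python
def answer_reward(well_formatted: bool, correct: bool, deducible_from_graph: bool,
                  retrieved_evidence: set[str], gold_evidence: set[str],
                  seed_connectivity: float, graph_f1: float,
                  compactness_penalty: float, efficiency_bonus: float,
                  entity_info: float, relation_info: float,
                  repetition_penalty: float,
                  w_ent: float = 0.1, w_rel: float = 0.1) -> float:
    """Illustrative answer-level composite; inputs and weights are placeholders."""
    # Format validity and answer correctness.
    r_fmt = 1.0 if well_formatted else 0.0
    r_ans = 1.0 if correct else 0.0

    # Knowledge-carrying utility (AutoGraph-R1 style): R_C = I[deducible(q, y | G)].
    r_c = 1.0 if deducible_from_graph else 0.0

    # Knowledge-indexing utility (AutoGraph-R1 style), approximated here as
    # evidence recall over tool outputs: |retrieved ∩ gold| / |gold|.
    r_i = (len(retrieved_evidence & gold_evidence) / len(gold_evidence)
           if gold_evidence else 0.0)

    # Connectivity (UniRel style): normalized connectivity over seed entities.
    r_con = seed_connectivity

    # Structural terms: graph F1, compactness, efficiency, informativeness, repetition.
    r_struct = (graph_f1 - compactness_penalty + efficiency_bonus
                + w_ent * entity_info + w_rel * relation_info - repetition_penalty)

    return r_fmt + r_ans + r_c + r_i + r_con + r_struct

# Example with hypothetical component values.
print(answer_reward(True, True, True, {"d1", "d2"}, {"d1", "d2", "d3"},
                    0.9, 0.75, 0.05, 0.1, 0.6, 0.5, 0.0))
```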

Telemetry

Per-step component rewards are aggregated into info["reward_components"], enabling:

  • richer benchmark summaries,
  • leaderboard ranking by composite utility,
  • visual diagnostics in dashboard exports.

Evaluation also computes derived retrieval and structural utility signals used in leaderboard ranking.
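
A minimal sketch of this aggregation, assuming component rewards are collected into a flat dictionary before being summed; the key names are placeholders:

```python
def aggregate_reward(components: dict[str, float]) -> tuple[float, dict]:
    """Illustrative aggregation of per-step component rewards into the info dict."""
    reward = sum(components.values())
    info = {"reward_components": dict(components)}
    return reward, info

# Example usage with hypothetical component names and values.
reward, info = aggregate_reward({
    "edge_global": 0.1,
    "edge_shaped": 0.6,
    "edge_efficiency": 0.05,
    "connectivity_gain": 0.2,
})
print(reward, info["reward_components"])
```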

Future Multi-Agent Notes

This branch now includes a low-width swarm baseline orchestrator that adds PARL-style auxiliary shaping on top of the core edge and answer rewards.

The helper implementation is in:

  • src/osint_env/env/spawn_reward_hooks.py

It follows the Kimi K2.5 style decomposition, sketched after the list:

  • $r_{PARL}(x,y) = r_{perf}(x,y) + \lambda_1 r_{parallel} + \lambda_2 r_{finish}$,
  • optional critical-steps shaping for latency-sensitive training,
  • optional annealing of $\lambda_1, \lambda_2$ toward zero,
  • optional breadth/depth shaping hooks for future branch integration.
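
A minimal sketch of that decomposition with linear annealing of the auxiliary weights toward zero; the function signature and default values are illustrative and not the exact interface of spawn_reward_hooks.py:

```python
def parl_reward(r_perf: float, r_parallel: float, r_finish: float,
                step: int, total_steps: int,
                lambda1: float = 0.1, lambda2: float = 0.1,
                anneal: bool = True) -> float:
    """Illustrative r_PARL = r_perf + lambda1 * r_parallel + lambda2 * r_finish,
    with optional linear annealing of lambda1 and lambda2 toward zero."""
    if anneal and total_steps > 0:
        scale = max(0.0, 1.0 - step / total_steps)
        lambda1 *= scale
        lambda2 *= scale
    return r_perf + lambda1 * r_parallel + lambda2 * r_finish

# Example: the auxiliary shaping decays as training progresses.
print(parl_reward(1.0, 0.5, 1.0, step=0, total_steps=100))    # full auxiliary shaping
print(parl_reward(1.0, 0.5, 1.0, step=100, total_steps=100))  # shaping annealed to zero
```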

The expanded project-level walkthrough is in README.md under "Reward Design (Integrated Notes)".