arxiv:2604.13517

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Published on May 21

· Submitted by

Jing Sun on May 26


Authors:

Jing Sun

Abstract

Multi-timescale reinforcement learning approaches face algorithmic pathologies when combining short-term and long-term signals, but a target decoupling architecture that separates temporal predictions in the critic from policy updates in the actor achieves superior performance in delayed-reward environments.

AI-generated summary

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.


Community

ben-dlwlrma

Paper author Paper submitter about 17 hours ago

Motivation: Fusing multi-timescale signals in RL can trigger optimization pathologies. We identify two specific failure modes in Actor-Critic architectures: Surrogate Objective Hacking (policy gradients exploiting dynamic routing weights) and the Paradox of Temporal Uncertainty (myopic degeneration under gradient-free routing).

Method: We propose a Target Decoupling architecture ("Representation over Routing"). We remove routing aggregation from the Actor. Instead, the Critic fits multiple temporal horizons as an auxiliary representation learning task, while the Actor updates solely on the long-term advantage.
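As a rough illustration of the split described above, the sketch below keeps one critic head per discount factor but lets only the longest horizon feed the actor. It is not taken from the repo: the function names, the `GAMMAS` values, the use of plain Monte Carlo returns instead of GAE, and the equal weighting of the critic heads are all assumptions made for brevity.

```python
# Hypothetical sketch of target decoupling. All names and values here are
# illustrative assumptions, not the paper's actual implementation.

GAMMAS = (0.5, 0.9, 0.99, 0.999)  # assumed horizons; the last one is "long-term"


def discounted_returns(rewards, gamma):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def critic_loss(values, rewards):
    """Mean squared error averaged over every horizon (the auxiliary task).

    values[k][t] is the critic head for GAMMAS[k] at time step t.
    """
    total = 0.0
    for k, gamma in enumerate(GAMMAS):
        targets = discounted_returns(rewards, gamma)
        total += sum((v - g) ** 2 for v, g in zip(values[k], targets)) / len(rewards)
    return total / len(GAMMAS)


def actor_advantages(values, rewards):
    """Advantage built from the long-horizon head only.

    The short-horizon heads shape the critic's loss but never reach the actor.
    """
    targets = discounted_returns(rewards, GAMMAS[-1])
    return [g - v for g, v in zip(targets, values[-1])]


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]                 # toy delayed reward
    values = [[0.0] * 3 for _ in GAMMAS]      # untrained critic heads
    print("critic loss:", critic_loss(values, rewards))
    print("actor advantages:", actor_advantages(values, rewards))
```

In this toy version, changing the short-horizon heads alters `critic_loss` but leaves `actor_advantages` untouched, which is the isolation the method relies on.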

Results: On the LunarLander-v2 delayed-reward benchmark, our decoupled agent avoids the "hovering for survival" local optimum. It eliminates policy collapse and stably surpasses the "Environment Solved" threshold without hyperparameter hacking.

Code and reproducible scripts are open-sourced in the repo.

avahal

about 8 hours ago

the core idea that really sticks is target decoupling: keep multi-timescale predictions on the critic for auxiliary representation learning, while the actor updates are driven only by long-horizon advantages. this separation seems to block the gradient hijacking channel they expose with surrogate objectives, which explains why naive fusion often destabilizes learning in delayed reward tasks. i’d love to see a more explicit ablation on how the critic's auxiliary losses interact with value variance, since a couple of minor differences in those terms can swing policy stability. the arxivlens breakdown helped me parse the method details and the exact routing vs target setup, a nice quick walkthrough when you skim the paper (https://arxivlens.com/PaperView/Details/representation-over-routing-overcoming-surrogate-hacking-in-multi-timescale-ppo-256-832e3787). one question: would removing any long-horizon signal from the actor completely break performance on harder benchmarks, or is there a minimal long-horizon component that still preserves stability in noisy environments?

ben-dlwlrma

Paper author about 5 hours ago

Thanks for reading and sharing the ArxivLens summary.

Regarding the variance ablation: I agree. While Figure 6 shows the aggregated value loss dropping, explicitly plotting the variance of the long-term advantage under different auxiliary weights would better illustrate the scaffolding effect. It's a solid suggestion for a future revision.

To your question about the Actor's horizon: completely removing the long-horizon signal (e.g., dropping gamma to 0.5) definitely breaks performance. The agent falls back into the hovering local optimum because it loses the sparse delayed reward signal.
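To make the scale of that loss concrete, here is a small sketch of how much weight a reward receives when it arrives some number of steps in the future. The 100-step delay is an assumed, illustrative figure, not a measurement from the paper, and `discount_weight` is a hypothetical helper.

```python
def discount_weight(gamma, delay):
    """Weight gamma**delay applied to a reward that arrives `delay` steps ahead."""
    return gamma ** delay


DELAY = 100  # assumed delay in steps, chosen only for illustration

for gamma in (0.5, 0.999):
    weight = discount_weight(gamma, DELAY)
    print(f"gamma={gamma}: a reward {DELAY} steps away is weighted {weight:.3e}")
```

With gamma=0.5 the weight is vanishingly small, so a sparse terminal reward is effectively invisible to the Actor, while gamma=0.999 keeps most of it.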

However, your intuition about noisy environments is correct. In highly stochastic tasks, gamma=0.999 might introduce too much variance to the Actor. There should be a "minimal effective horizon" (e.g., 0.95 or 0.99) that balances capturing the delayed reward with resisting environmental noise. Target decoupling is useful specifically here because it allows tuning this Actor horizon purely for the bias-variance tradeoff, without degrading the Critic's multi-timescale state representation.
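One common way to reason about candidate Actor horizons is the rule of thumb that a discount factor gamma corresponds to an effective horizon of roughly 1 / (1 - gamma) steps (the sum of the geometric series of discount weights). The sketch below applies that rule to the values mentioned in this thread; the rule is a standard approximation rather than something stated in the paper, and `effective_horizon` is a hypothetical helper.

```python
def effective_horizon(gamma):
    """Approximate horizon 1 / (1 - gamma), the sum of all discount weights."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


# Discount factors mentioned in the discussion above.
for gamma in (0.5, 0.95, 0.99, 0.999):
    print(f"gamma={gamma}: about {effective_horizon(gamma):.0f} steps")
```

Under this approximation the candidates span roughly 2, 20, 100, and 1000 steps, which gives a sense of how far apart the tuning options for the Actor are.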