Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
Abstract
Multi-timescale reinforcement learning approaches face algorithmic pathologies when combining short-term and long-term signals, but a target decoupling architecture that separates temporal predictions in the critic from policy updates in the actor achieves superior performance in delayed-reward environments.
Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
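One plausible reading of the myopic degeneration described above can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes rewards are i.i.d. noise with a fixed variance and that the gradient-free weighting is a plain inverse-variance rule. The function names, the noise model, and the 1000-step horizon are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def return_variance(gamma: float, noise_var: float = 1.0, horizon: int = 1000) -> float:
    """Variance of sum_{k<horizon} gamma^k * r_k when the r_k are i.i.d.

    Assumption: rewards are independent with variance noise_var, so the
    variance is a geometric series in gamma^2. Requires 0 <= gamma < 1.
    """
    return noise_var * (1.0 - gamma ** (2 * horizon)) / (1.0 - gamma ** 2)


def inverse_variance_weights(gammas: Sequence[float]) -> List[float]:
    """Hypothetical gradient-free weighting: weight each horizon by 1 / variance."""
    precisions = [1.0 / return_variance(g) for g in gammas]
    total = sum(precisions)
    return [p / total for p in precisions]


gammas = (0.5, 0.9, 0.99, 0.999)
weights = inverse_variance_weights(gammas)
for g, w in zip(gammas, weights):
    print(f"gamma={g}: weight={w:.4f}")
```

Under these assumptions the long-horizon return is always the noisiest estimate, so a rule that trusts low-uncertainty signals hands most of the weight to the shortest horizon. Whether this matches the paper's exact mechanism is not confirmed by the abstract.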
Community
Motivation: Fusing multi-timescale signals in RL can trigger optimization pathologies. We identify two specific failure modes in Actor-Critic architectures: Surrogate Objective Hacking (policy gradients exploiting dynamic routing weights) and the Paradox of Temporal Uncertainty (myopic degeneration under gradient-free routing).
Method: We propose a Target Decoupling architecture ("Representation over Routing"). We remove routing aggregation from the Actor. Instead, the Critic fits multiple temporal horizons as an auxiliary representation learning task, while the Actor updates solely on the long-term advantage.
Results: On the LunarLander-v2 delayed-reward benchmark, our decoupled agent avoids the "hovering for survival" local optimum. It eliminates policy collapse and stably surpasses the "Environment Solved" threshold without hyperparameter hacking.
Code and reproducible scripts are open-sourced in the repo.
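The decoupling summarized under Method above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the repo: it uses plain Monte Carlo returns rather than whatever advantage estimator the authors use, and the names discounted_returns, decoupled_targets, and critic_loss are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def decoupled_targets(
    rewards: Sequence[float],
    values_by_gamma: Dict[float, Sequence[float]],
    actor_gamma: float,
) -> Tuple[Dict[float, List[float]], List[float]]:
    """Build one critic target per gamma, but an advantage from actor_gamma only."""
    critic_targets = {g: discounted_returns(rewards, g) for g in values_by_gamma}
    # The actor never sees the short-horizon heads: no routing, no mixing.
    long_returns = critic_targets[actor_gamma]
    long_values = values_by_gamma[actor_gamma]
    actor_advantages = [ret - val for ret, val in zip(long_returns, long_values)]
    return critic_targets, actor_advantages


def critic_loss(
    critic_targets: Dict[float, List[float]],
    values_by_gamma: Dict[float, Sequence[float]],
) -> float:
    """Sum of per-horizon mean squared errors (the auxiliary critic objective)."""
    total = 0.0
    for g, targets in critic_targets.items():
        preds = values_by_gamma[g]
        total += sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)
    return total


# Toy episode: a single delayed reward, with an untrained critic predicting zeros.
rewards = [0.0, 0.0, 0.0, 100.0]
values = {0.5: [0.0] * 4, 0.999: [0.0] * 4}
targets, advantages = decoupled_targets(rewards, values, actor_gamma=0.999)
loss = critic_loss(targets, values)
```

In this toy episode the gamma=0.5 head sees the delayed reward shrink to 12.5 at the first step, while the advantage passed to the actor stays close to 100 at every step. The short-horizon head still shapes the critic through critic_loss, but it has no path into the policy update.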
the core idea that really sticks is target decoupling: keep multi-timescale predictions on the critic for auxiliary representation learning, while the actor updates are driven only by long-horizon advantages. this separation seems to block the gradient hijacking channel they expose with surrogate objectives, which explains why naive fusion often destabilizes learning in delayed reward tasks. i'd love to see a more explicit ablation on how the critic's auxiliary losses interact with value variance, since a couple of minor differences in those terms can swing policy stability. the arxivlens breakdown helped me parse the method details and the exact routing vs target setup, a nice quick walkthrough when you skim the paper (https://arxivlens.com/PaperView/Details/representation-over-routing-overcoming-surrogate-hacking-in-multi-timescale-ppo-256-832e3787). one question: would removing any long-horizon signal from the actor completely break performance on harder benchmarks, or is there a minimal long-horizon component that still preserves stability in noisy environments?
Thanks for reading and sharing the ArxivLens summary.
Regarding the variance ablation: I agree. While Figure 6 shows the aggregated value loss dropping, explicitly plotting the variance of the long-term advantage under different auxiliary weights would better illustrate the scaffolding effect. It's a solid suggestion for a future revision.
To your question about the Actor's horizon: completely removing the long-horizon signal (e.g., dropping gamma to 0.5) definitely breaks performance. The agent falls back into the hovering local optimum because it loses the sparse delayed reward signal.
However, your intuition about noisy environments is correct. In highly stochastic tasks, gamma=0.999 might introduce too much variance to the Actor. There should be a "minimal effective horizon" (e.g., 0.95 or 0.99) that balances capturing the delayed reward with resisting environmental noise. Target decoupling is useful specifically here because it allows tuning this Actor horizon purely for the bias-variance tradeoff, without degrading the Critic's multi-timescale state representation.
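To make the horizon argument in this reply concrete, the sketch below uses the standard rule of thumb that a discount factor gamma corresponds to an effective horizon of roughly 1 / (1 - gamma) steps, and shows how much of a delayed reward survives discounting. The 200-step delay is a hypothetical figure chosen for illustration, not a measurement from LunarLander-v2, and the function names are made up for this example.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb planning horizon in steps for a discount factor gamma < 1."""
    return 1.0 / (1.0 - gamma)


def surviving_fraction(gamma: float, delay: int) -> float:
    """Fraction of a reward that remains after being discounted for `delay` steps."""
    return gamma ** delay


DELAY = 200  # hypothetical number of steps between an action and its payoff

for gamma in (0.5, 0.95, 0.99, 0.999):
    print(
        f"gamma={gamma}: horizon ~{effective_horizon(gamma):.0f} steps, "
        f"reward kept after {DELAY} steps = {surviving_fraction(gamma, DELAY):.3g}"
    )
```

With gamma=0.5 the horizon is two steps and a reward 200 steps away is numerically invisible, which is consistent with the agent falling back to hovering. Intermediate values such as 0.95 or 0.99 trade some of that delayed signal for a shorter, less noisy sum of rewards.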
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment (2026)
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR (2026)
- GAGPO: Generalized Advantage Grouped Policy Optimization (2026)
- AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning (2026)
- Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning (2026)
- LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models (2026)
- Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning (2026)