diff --git "a/data/chunks/2603.10535_semantic.json" "b/data/chunks/2603.10535_semantic.json" new file mode 100644--- /dev/null +++ "b/data/chunks/2603.10535_semantic.json" @@ -0,0 +1,2000 @@ +[ + { + "chunk_id": "b0d86285-bf5d-4c67-9c95-6b41d475b0ae", + "text": "Tackling Length Inflation Without Trade-offs:\nGroup Relative Reward Rescaling for Reinforcement Learning", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 0, + "total_chunks": 74, + "char_count": 104, + "word_count": 12, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4600ed8c-6ed4-4e11-b4df-f4f1ec145fde", + "text": "Zichao Li 1 2 Jie Lou 3 Fangchen Dong 3 Zhiyuan Fan 3 Mengjie Ren 1 2 Hongyu Lin 1 Xianpei Han 1\nDebing Zhang 3 Le Sun 1 Yaojie Lu 1 Xing Yu 3\n\nAbstract\nReinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR3), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR3 maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.\n\n[Figure 1 panel: AIME-25 score vs. length (7B); GR3 breaks the performance-length trade-off against AdaptThink, Laser-DE, DLER, LC-R1, and R1-Distill; axis-tick residue removed.]\nFigure 1. Comparison of GR3 with open-source efficient reasoning models, all trained on DeepSeek-R1-Distill-7B. GR3 pioneers a new paradigm that sustains stable performance gains under RL while simultaneously mitigating the length inflation issue.\n\n...inflating inference costs without proportional gains in quality. This phenomenon arises across major RL paradigms. In RL from human feedback (RLHF) (Ouyang et al., 2022), models exploit reward-model biases toward verbosity, leading to reward hacking (Gao et al., 2023). In RL with verifiable rewards (RLVR) (Shao et al., 2024), length inflation instead stems from reasoning inefficiency (Sui et al., 2025), where models generate unnecessarily long chains of thought to marginally improve the likelihood of a correct solution.\n\n1. 
Introduction\nPrior work has sought to mitigate length inflation.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 1, + "total_chunks": 74, + "char_count": 2377, + "word_count": 336, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bb96416b-430d-4e7a-841c-eff68e43c3a3", + "text": "Reinforcement learning (RL) (Bai et al., 2022; Guo et al., 2025) has become the engine of post-training for Large Language Models (LLMs) (Achiam et al., 2023; Team et al., 2023). Yet this engine exhibits a persistent flaw, which we term length inflation: a tendency for RL-trained models to produce unnecessarily lengthy trajectories, inflating inference costs.\n\nOne line of research trains reward models that are invariant to response length (Chen et al., 2024a; Liu et al., 2024). While effective in RLHF, this strategy does not extend to RLVR, where rewards are derived from ground-truth verifiers rather than learned proxies that can be debiased. A more general direction instead introduces explicit length penalties into the reward (Luo et al., 2025a; Liu et al., 2025c; Yi et al., 2025). However, most existing methods rely on coarse regularization, which leads to suboptimal optimization dynamics.\n1Chinese Information Processing Laboratory, Institute of", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 2, + "total_chunks": 74, + "char_count": 969, + "word_count": 147, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f5f7eebe-8a85-42c0-9ed0-9de82c2c316d", + "text": "Software, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences, 3Xiaohongshu Inc. Correspondence to: Jie Lou, Yaojie Lu. Preprint.\n\nA common design adopts additive shaping (Yu et al., 2025), which combines the task reward with an explicit length term (e.g., R′ = R − λℓ). This introduces decoupled incentives, creating a length-driven component that", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 3, + "total_chunks": 74, + "char_count": 527, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "57fac061-7da9-494a-becc-f11fdbe2ed22", + "text": "[Figure 2 panels: task reward and response length versus training step for GR3 (Ours) and GRPO under RLVR (Math), RLVR (Code), and RLHF (Chat); axis-tick residue removed.]\nFigure 2. Training dynamics of GR3, which retains GRPO's reward gains without loss while significantly reducing average tokens. 
The base models used for the two settings are DeepSeek-R1-Distill-1.5B and Qwen3-8B (without thinking mode), respectively.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 4, + "total_chunks": 74, + "char_count": 513, + "word_count": 86, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d72fb843-d592-48fb-a869-2009216f8803", + "text": "makes extreme brevity an attractive shortcut independent of task success. To better align penalties with outcomes, some works propose heuristic gating (Cheng et al., 2025; Arora & Zanette, 2025), applying penalties only when R = 1. However, such designs are inherently limited to binary feedback and do not extend naturally to continuous-reward settings like RLHF. Moreover, many approaches rely on coarse control mechanisms, such as static truncation thresholds or uncalibrated penalty strengths (Liu et al., 2025b; Cheng et al., 2025), resulting in an inherent efficiency–performance trade-off, as illustrated in Figure 1.\n\non AIME-25) while simultaneously improving accuracy (e.g., +8 points), demonstrating that verbosity is not a prerequisite for intelligence. Furthermore, in RLHF settings, GR3 exhibits an adaptive length dynamic: it permits moderate growth when computation is beneficial but automatically curtails generation length as the policy matures (Figure 2). This mechanism effectively mitigates reward hacking via verbosity without sacrificing capability. We will release our code and model checkpoints to support future research.\n\nThese observations lead to a central question: Can we tackle length inflation in a general manner without compromising the capability gains of RL? In this work, we present Group Relative Reward Rescaling (GR3), a principled framework for lossless efficiency optimization. Rather than using additive penalties, GR3 regularizes length through multiplicative rescaling, which acts as a generalized gating mechanism and removes the compensatory shortcuts inherent to additive schemes. To further ensure lossless optimization, we introduce two fine-grained mechanisms. Specifically, we employ group-relative regularization, which normalizes length against on-policy statistics rather than rigid thresholds, thereby dynamically adapting the length budget to the inherent difficulty of each prompt.\n\nIn summary, our contributions are threefold:\n• We propose GR3, a framework for lossless length control that substitutes additive penalties with multiplicative reward rescaling. This design eliminates compensatory optimization shortcuts and provides a unified mechanism for both binary and continuous rewards.\n• We develop an optimization-preserving strategy that integrates group-relative regularization with advantage-aware calibration, adapting constraints to on-policy statistics while preserving learning signals.\n• Across mathematical reasoning, code generation, and RLHF alignment tasks, GR3 yields concise generations while matching standard GRPO performance,
 shifting the efficiency–performance Pareto frontier.\n\nComplementing this, we introduce advantage-aware calibration to explicitly control the penalty strength.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 5, + "total_chunks": 74, + "char_count": 2772, + "word_count": 370, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0a6ef26a-0dff-480a-a5ca-717bcd9db203", + "text": "This ensures that length regularization does not overturn the advantage signal of representative high-quality trajectories, thereby safeguarding stable optimization toward capability improvements. Empirically, GR3 resolves the efficiency–performance trade-\n\n2. Preliminary\n\n2.1. Group Relative Policy Optimization\n\nLLM generation can be formulated as a token-level Markov Decision Process (Puterman, 1990).", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 6, + "total_chunks": 74, + "char_count": 403, + "word_count": 48, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "05bbd512-5029-4e3d-9a87-64737cf969b5", + "text": "off inherent in prior methods. As shown in Figure 1, our approach significantly reduces token usage (e.g., over 40%\n\nGiven a prompt x ∼ D, an autoregressive policy πθ generates a response y = (y1, . . . , yℓ) of length ℓ := |y| by sampling tokens from πθ(· | x, y<t). A scalar reward R(x, y) is defined over complete responses, and reinforcement learning aims to maximize the expected reward:\nmax_θ E_{x∼D, y∼πθ(·|x)}[R(x, y)]. (1)\nWith the emergence of reasoning models such as DeepSeek-R1 (Guo et al., 2025), group-style RL has become prevalent for LLM post-training. Among them, Group Relative Policy Optimization (GRPO) (Shao et al., 2024) is widely adopted due to its scalability and its elimination of the need for a separate value model. For each prompt x, GRPO samples a group of G responses {y(i)}_{i=1}^{G} from an old policy πθold(· | x) and evaluates each by R(x, y(i)). It then constructs a group-relative advantage via within-group normalization:\nÂ(i) = (R(x, y(i)) − µR) / σR, where µR := (1/G) Σ_{j=1}^{G} R(x, y(j)), σR := std({R(x, y(j))}_{j=1}^{G}). (2)\nThe policy is optimized using a PPO-style clipped objective over the group-normalized advantages:\nJGRPO(θ) = E_{x∼D, {y(i)}_{i=1}^{G}} [ (1/G) Σ_{i=1}^{G} Σ_{t} min(r_{i,t}(θ) Â(i), clip(r_{i,t}(θ), 1 − ε, 1 + ε) Â(i)) − β DKL(πθ ∥ πref) ], (3)\nwhere the importance sampling ratio is defined as r_{i,t}(θ) = πθ(y(i)_t | x, y(i)_{<t}) / πθold(y(i)_t | x, y(i)_{<t}).\n\nTable 1. Comparison of length-aware reward shaping methods (Liu et al., 2025c; Arora & Zanette, 2025; Aggarwal & Welleck, 2025; Team et al., 2025; Yu et al., 2025; Cheng et al., 2025). ℓ(i) is the response length of sample i. ℓmin, ℓmax, ¯ℓ, and σℓ are computed within each group. α, λ, ℓT and ℓC are fixed hyperparameters. s(·) denotes the sigmoid function.\nAdditive: R̂(+) = R + λ · S\n– L1-Exact: S = −|ℓ(i) − ℓT|\n– DAPO: S = 0 if ℓ(i) ≤ ℓT − ℓC; S = (ℓT − ℓC − ℓ(i))/ℓC if ℓT − ℓC < ℓ(i) ≤ ℓT; S = −1 if ℓ(i) > ℓT\n– Kimi-k1.5: S = 0.5 − (ℓ(i) − ℓmin)/(ℓmax − ℓmin) if R = 1; S = min(0.5 − (ℓ(i) − ℓmin)/(ℓmax − ℓmin), 0) if R = 0\n– Truncation: S = −I(R = 1) · I(ℓ(i) > ℓT)\n– Efficiently: S = −I(R = 1) · s((ℓ(i) − ¯ℓ)/σℓ)\n– LC-R1: S = I(R = 1) · (1 − ℓ(i)/ℓmax)\nMultiplicative: R̂(×) = R · S\n– GR3 (Ours): S = 1/(1 + α · ℓ(i)/¯ℓ)\n\nA common strategy for mitigating length inflation in reinforcement learning is to explicitly regularize response length through reward shaping. From a unified perspective (Liu et al., 2025c), most existing approaches can be instantiated as additive shaping:\nAdditive: R̂(+) = R + λ · S, λ > 0. (5)\n\nS = 1/(1 + α · ℓ(i)/¯ℓ) > 0, (15)\nwhere ℓ(i) is the response length and ¯ℓ is the group mean. This penalty decreases smoothly with length, while normalizing against ¯ℓ avoids arbitrary global thresholds and adapts to the model's current generation behavior.\nRS = R µS + R(S − µS) = µR µS + µS(R − µR) + R(S − µS), and subtract E[RS] = µR µS + σRS. Eq. (14) is Eq. (2) applied to R̂(×).\nRemark 3.3 (Why multiplicative shaping is reward-aware under group normalization). Under additive shaping, Proposition 3.1 shows that the length deviation (S − µS) is injected into the centered shaped reward with a fixed coefficient λ. This creates a compensatory degree of freedom: the policy can improve the shaped reward by manipulating S even when R provides little learning signal. In contrast, Proposition 3.2 yields the decomposition RS − E[RS] = R(S − µS) + µS(R − µR) − σRS,\n\nAs shown in Table 2, we include the fixed-threshold truncation method (Hou et al., 2025) as a minimalist baseline. We find that threshold-based truncation imposes a uniform maximum response length even on difficult benchmarks, which degrades reasoning performance on challenging problems. We also compare against other group-relative methods and find that certain shaping strategies (Arora & Zanette, 2025) introduce biases that favor shallow reasoning on easier benchmarks (see Appendix B for analysis). We also evaluated another group-relative method, Kimi-k1.5 (Team et al., 2025), but it exhibited training collapse; therefore, we omit its results. We attribute this failure to the additive shaping paradigm without gating, as discussed in Section 3.1.\n\n3.3. Advantage-Aware Calibration\n\n[Figure 4 panel: impact of α on the reward gap and the constraint satisfaction rate (CSR); numeric tick residue removed.]\nWithin the framework of group-relative policy optimization, the length penalty term S acts as a powerful shaper of the advantage landscape. The interaction between the penalty strength and group normalization is non-trivial: slight shifts in S can noticeably redirect the optimization trajectory. 
In practice, an unconstrained or overly strong penalty may", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 12, + "total_chunks": 74, + "char_count": 3203, + "word_count": 486, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "740c258a-cd12-49c2-917b-630cffe1e902", + "text": "penalize high-quality responses so heavily that it creates a contradictory signal where the model is discouraged from generating its best responses.\nFigure 4. Sensitivity of α: reward gap relative to the standard GRPO baseline versus α (log scale). Marker color indicates the average CSR measured during actual training, while the triangle marker denotes the value of α selected during the calibration phase.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 13, + "total_chunks": 74, + "char_count": 398, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fda800d0-53f6-4751-add0-36781e23b308", + "text": "gap relative to the GRPO baseline is already positive, indicating preserved task capability. Further reducing the penalty strength (i.e., decreasing α) yields no consistent performance gains, and instead results in fluctuations consistent with training variance.\n\nA natural but overly strict objective is to require that all high-quality trajectories retain positive advantages. However, this becomes intractable under high reward density, where most responses in a group achieve the maximum reward Rmax (e.g., 15 out of 16 are correct). Due to the zero-sum structure of group normalization, correct but above-average-length responses may inevitably receive negative advantages. We provide a formal analysis of this limitation in Appendix C.1.\nAverage-Case Advantage Preservation. Instead of protecting the longest outlier trajectory, we aim to preserve the advantage of a representative high-quality response. Specifically, we consider a hypothetical response that achieves the group-wise maximum reward Rmax with the group-average length ¯ℓ, and require its advantage to remain non-negative. Let µR̂ denote the mean regularized reward in the group. This yields the condition:\nRmax / (1 + α · ¯ℓ/¯ℓ) = Rmax / (1 + α) ≥ µR̂. (16)\nThis ensures that the penalty α does not overturn the advantage of a typical high-quality response. In the limiting case where all trajectories in a group achieve Rmax, the average-case constraint can still fail to hold.\n\n4. Setup\n\nEfficient Reasoning for RLVR. Following prior work, we adopt DeepSeek-R1-Distill-1.5B and DeepSeek-R1-Distill-7B (Guo et al., 2025) as the base models. For mathematical reasoning, we use the DeepScaleR-Preview-Dataset (Luo et al., 2025b) as training data, and include open-sourced checkpoints of existing efficient-reasoning methods as baselines, including LC-R1 (Cheng et al., 2025), Laser (Liu et al., 2025c), AdaptThink (Zhang et al., 2025), and DLER (Liu et al., 2025b). To demonstrate the generality of our approach, we further extend to code generation, using the prompts from DeepDistill (Tian et al., 2025).\nMitigating Length Bias in RLHF. As for the RLHF set-", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 14, + "total_chunks": 74, + "char_count": 2122, + "word_count": 314, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "72f24743-137d-457a-98c3-d946c715cd88", + "text": "ting, we use the non-reasoning versions of Qwen3-4B and Qwen3-8B (Yang et al., 2025) as base models. We construct RL prompts from arena-human-preference-140k2, and employ Skywork-Reward-V2-Llama-3.1-8B (Liu et al., 2025a) as the reward model. To improve training stability, we apply a reference-based sigmoid shaping (Fu et al., 2025) scheme:\nR(x, y(i)) = s(Rorigin(x, y(i)) − Rorigin(x, yref)), (17)\nwhere Rorigin(·) denotes the raw reward model score.\n\nWe therefore filter out such groups online (see Appendix C.2). In practice, Eq. (16) is not enforced as a hard constraint at every update due to on-policy sampling stochasticity. Instead, we treat it as a calibration criterion for selecting the penalty coefficient α. We run a short calibration phase at the beginning of GRPO training and measure the Constraint Satisfaction Rate (CSR) over candidate α values. We then select the largest α whose CSR remains consistently high (e.g., ≥99.9%), ensuring high-probability constraint satisfaction while maintaining strong length regularization. Empirically, the α selected by this protocol maintains a near-perfect CSR throughout training (see Figure 4). Detailed experimental settings are provided in Appendix D.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 15, + "total_chunks": 74, + "char_count": 1211, + "word_count": 178, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6eab1802-a2ef-456c-af3e-06bc92cabab5", + "text": "This point effectively marks a practical boundary: the reward\n2https://huggingface.co/datasets/lmarena-ai/arena-human-preference-140k\nTable 3. Mathematical reasoning performance for 7B models. 
Length-oriented methods reduce tokens but may sacrifice accuracy, while GR3 consistently achieves comparable accuracy with significantly fewer tokens, establishing a markedly better Pareto frontier.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 16, + "total_chunks": 74, + "char_count": 488, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5abf481f-bdd6-4722-b05a-2f216878f34d", + "text": "Model: AIME24 (Avg@32↑ / #Token↓), AIME25 (Avg@32↑ / #Token↓), AMC23 (Avg@16↑ / #Token↓), MATH500 (Avg@4↑ / #Token↓)\nInitial model\nDeepSeek-R1-Distill-7B: 52.4 / 13213, 39.4 / 14032, 89.8 / 6385, 92.1 / 3994\nLength-oriented RL\nLCR1-7B: 47.9 / 7548, 36.3 / 7960, 85.8 / 2963, 89.1 / 1546\nLaser-DE-L4096-7B: 51.7 / 4931, 36.7 / 5048, 88.1 / 2427, 92.4 / 1580\nAdaptThink-7B: 51.5 / 11070, 39.3 / 11678, 88.1 / 4280, 90.6 / 2011\nDLER-R1-7B: 49.5 / 3272, 35.6 / 3288, 91.4 / 2255, 93.2 / 1650\nPerformance-oriented RL\nGRPO: 57.1 / 11079, 44.7 / 12540, 90.3 / 7256, 92.1 / 5006\nGR3 (ours): 60.1 / 7923, 46.9 / 8582, 93.0 / 3090, 94.0 / 1764\nTable 4. Performance on code generation tasks. GR3 achieves", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 17, + "total_chunks": 74, + "char_count": 589, + "word_count": 92, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5768bc59-ba13-40d9-b48e-33269b4ee920", + "text": "competitive scores with fewer tokens across model sizes.\nModel: LiveCodeBench v6 (Score↑ / #Token↓), MultiPL-E (Score↑ / #Token↓)\nDeepSeek-R1-Distill-1.5B\nInitial: 17.7 / 12665, 45.1 / 6181\nGRPO: 23.4 / 11830, 51.5 / 6589\nGR3 (ours): 24.9 / 8538, 52.2 / 2414\nDeepSeek-R1-Distill-7B\nInitial: 37.7 / 11496, 69.7 / 4121\nGRPO: 42.4 / 10956, 71.1 / 4794\nGR3 (ours): 41.6 / 7504, 70.9 / 2127\n\nTable 5. Performance on chat tasks. GRPO improves performance but incurs substantial length bias, while GR3 achieves stronger alignment gains while preserving the length of the initial policy.\nModel: Arena-Hard-Auto (Score↑ / #Token↓), Alpaca-Eval (Score↑ / #Token↓)\nQwen3-4B\nInitial: 66.6 / 1139, 40.1 / 737\nGRPO: 85.8 / 2374, 33.9 / 1993\nGR3 (ours): 85.9 / 1377, 44.1 / 859\nQwen3-8B\nInitial: 77.2 / 1171, 50.2 / 778\nGRPO: 90.6 / 2343, 53.5 / 1670\nGR3 (ours): 92.8 / 1178, 55.8 / 765\n\n4.2. Main Results\nGR3 improves it to 46.9 with much fewer tokens (14,032 →\n4.2.1.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + 
"chunk_index": 18, + "total_chunks": 74, + "char_count": 846, + "word_count": 131, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ca416b91-4342-474c-868a-55187ca6b46c", + "text": "EFFICIENT REASONING FOR RLVR\n\n8,582). This suggests that GR3 encourages more efficient reasoning trajectories rather than merely truncating reasoning, yielding consistent gains across scales.\n\nThe experimental results for the 7B model are presented in Table 3, while those for the 1.5B model are provided in Appendix E.1. Notably, GR3 improves reasoning performance while reducing generation length, indicating a genuine efficiency gain rather than a trade-off. Within the performance-oriented regime, compared with standard GRPO, GR3 leads to substantially shorter generations while maintaining or even improving performance. For instance, at the 7B scale on AIME24, GR3 reduces the average length from 13,213 to 7,923 tokens while improving Avg@32 from 52.4 to 60.1, whereas GRPO only reaches 57.1.\n\nWe further validate GR3 on code generation benchmarks, as summarized in Table 4. Consistent with our findings in mathematical reasoning, GR3 achieves substantial efficiency gains while preserving task performance.\n\n4.2.2. MITIGATING LENGTH BIAS IN RLHF\n\nResults on alignment benchmarks are shown in Table 5. Compared with the initial models, RLHF training yields substantial improvements in chat quality. However, we", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 19, + "total_chunks": 74, + "char_count": 1219, + "word_count": 177, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4efe0f9b-3c5e-4011-a9a9-b0fa156c9df8", + "text": "In contrast to existing length-oriented baselines, GR3 does not over-compress the reasoning length at the cost of accuracy; instead, it prioritizes preserving performance while removing redundant reasoning. For example, on AIME25 (7B), none of the length-oriented baselines surpasses the initial checkpoint performance (39.4), whereas\n\nobserve that standard GRPO suffers from severe reward hacking under length bias, where the model can artificially increase reward by generating unnecessarily long responses, resulting in explosive length inflation (e.g., on Qwen3-8B, the average response length on Arena–Hard–Auto increases from 1,171 to 2,343 tokens). In contrast, GR3 attains com-", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 20, + "total_chunks": 74, + "char_count": 690, + "word_count": 96, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7860070f-8a04-463a-acaf-330871f4d404", + "text": "parable or even stronger alignment gains while keeping response length almost unchanged, effectively decoupling performance improvement from verbosity.\n\nward on the most important reasoning steps. By discouraging unnecessary verbosity, GR3 compresses reasoning 
For traces while preserving decisive steps.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 21, + "total_chunks": 74, + "char_count": 410, + "word_count": 51, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "85921cf4-9991-43dd-b01a-dad146e40385", + "text": "This increases the\ninstance, on Qwen3-8B, GR3 improves the Arena–Hard– signal density of reward with respect to tokens, allowing\nAuto score from 77.2 to 92.8, while the token cost only optimization to focus more strongly on causally important\nincreases marginally (1,171 →1,178). reasoning patterns rather than distributing gradients over\nverbose but weakly relevant tokens. Qualitative rollout exWe further visualize the training dynamics in the RLHF\namples are provided in Appendix F.\nsetting in Figure 2. Under GRPO, response length grows\nmonotonically and uncontrollably throughout training. In\ncontrast, GR3 exhibits a clear \"increase-then-decrease\" pat- 5. Related Work\ntern: the model initially expands its reasoning to secure\nDespite remarkable progress, reinforcement learning (Kaelalignment improvements, and subsequently compresses rebling et al., 1996; Bai et al., 2022; Guo et al., 2025) suffers\ndundant generations once performance stabilizes. 
This dynamic behavior aligns with our design intuition—GR3 prioritizes achieving reliable alignment gains first, and then progressively improves response efficiency by suppressing length-based exploitation.
from high inference costs and growing generation lengths, a bottleneck we term length inflation.
One line of work studies efficient reasoning (Feng et al., 2025; Sui et al., 2025), aiming to improve the accuracy–cost trade-off of long chain-of-thought models. Early approaches rely on prompting or supervised fine-tuning to encourage shorter reasoning traces (Ma et al., 2025a; Xia et al., 2025; Ma et al., 2025b). More recent methods apply RL to directly optimize efficiency via length-aware objectives (Arora & Zanette, 2025; Liu et al., 2025c;b; Yu et al., 2025). While effective at reducing token usage, such approaches can degrade performance or introduce brittle optimization dynamics due to poorly calibrated penalties or shortcut solutions (Cheng et al., 2025).
Another line of work attributes length inflation in RLHF to reward hacking and length bias (Skalse et al., 2022; Gao et al., 2023; Singhal et al., 2023). Because reward models may implicitly favor longer responses, verbosity can arise
4.3. Analysis and Discussion
4.3.1. ABLATION ON PENALTY STRENGTH α
We study the effect of the penalty coefficient α by sweeping its value while keeping all other settings fixed. Detailed results and analysis are provided in Appendix E.2; here we summarize the key findings.
When α is too large (e.g., 1.0), GR3 degenerates toward a naive length regularizer: responses become much shorter, but performance gains over the base model are limited, as optimization is dominated by compression rather than capability improvement. As α decreases, response length grows smoothly, while task performance first improves and then plateaus. 
This trend is consistent with the analysis in Section 3.3: once the advantage of representative high-quality
from exploiting reward artifacts rather than true capability gains (Shen et al., 2023).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 22,
    "total_chunks": 74,
    "char_count": 2999,
    "word_count": 438,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "75b4078b-1c06-425e-b28f-b458a5525ec1",
    "text": "trajectories is preserved, further reducing the penalty mainly relaxes length control without yielding stronger learning signals. The chosen value α = 0.33 lies near this transition region, achieving substantial length reduction while retaining most performance gains.
Prior efforts mitigate this by improving reward modeling and calibration (Chen et al., 2024a; Wang et al., 2025) or by applying post-hoc reward correction (Huang et al., 2024), though many of these solutions are tailored to specific training settings.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 23,
    "total_chunks": 74,
    "char_count": 520,
    "word_count": 75,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "54affd3a-e167-4fe3-b5d0-3546b7fd89c3",
    "text": "Our method is most closely related to RL-based efficient reasoning, and remains effective under reward hacking driven by length bias. We propose GR3, a general framework for length regularization that preserves performance while improving the performance–cost Pareto frontier.
6. Conclusion
In this work, we identify length inflation as a fundamental inefficiency in RL-trained LLMs, where models tend toward unnecessary verbosity or overthinking. We pro
4.3.2. WHY DOES GR3 OUTPERFORM GRPO?
We observe a counterintuitive phenomenon: in many settings, GR3 not only shortens responses but also achieves stronger downstream performance than standard GRPO, while maintaining a positive reward gap relative to the GRPO baseline. We attribute this to a difference in how the optimization signal is structured.
Under unconstrained RL such as GRPO, policies often drift toward over-extended reasoning trajectories. 
Although these trajectories may eventually reach correct answers, they tend to contain many low-contributing tokens. From an optimization perspective, this spreads the learning signal thinly across long responses, reducing the effective impact of re-
pose Group Relative Reward Rescaling (GR3), a general framework for lossless length control that regulates reasoning length through a multiplicative, group-relative formulation with advantage-aware calibration. Across both RLVR and RLHF settings, GR3 consistently shifts the per-",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 24,
    "total_chunks": 74,
    "char_count": 1546,
    "word_count": 211,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "27ac888b-f460-4947-bec7-6d8e0f7e2d3b",
    "text": "formance–cost Pareto frontier outward, reducing token usage while preserving or even improving model capability. These results show that verbosity is not a prerequisite for intelligence, and position GR3 as a practical and general paradigm for training efficient, high-performing LLMs.
Multipl-e: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022.
Chen, L., Zhu, C., Soselia, D., Chen, J., Zhou, T., Goldstein, T., Huang, H., Shoeybi, M., and Catanzaro, B. Odin: Disentangled reward mitigates hacking in rlhf. 
arXiv
Impact Statement
This paper presents a method to mitigate length inflation in Large Language Models trained via Reinforcement Learning. The primary positive impact of this work lies in promoting computational efficiency and environmental sustainability (\"Green AI\"). By significantly reducing token generation, e.g., saving over 40% of tokens in reasoning tasks without compromising performance, GR3 directly contributes to lowering the financial costs, inference latency, and energy consumption associated with deploying large-scale reasoning models.
Furthermore, this work addresses the alignment challenge of reward hacking, where models exploit verbosity to maximize rewards rather than improving genuine capability. By decoupling performance gains from unnecessary length, we
Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., et al. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024b.
Cheng, Z., Chen, D., Fu, M., and Zhou, T. Optimizing length compression in large reasoning models. arXiv preprint arXiv:2506.14755, 2025.
Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., et al. Helping or herding? 
reward\nfacilitate the development of more concise, interpretable, model ensembles mitigate but do not eliminate reward\nand user-aligned AI systems.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 25, + "total_chunks": 74, + "char_count": 2196, + "word_count": 311, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "616357b7-b31d-4107-a246-f5079d6f043a", + "text": "Nevertheless, practition- hacking. arXiv preprint arXiv:2312.09244, 2023.\ners should monitor for potential over-truncation in safetyFeng, S., Fang, G., Ma, X., and Wang, X. Efficient reasoningcritical domains where exhaustive reasoning traces are esmodels: A survey. arXiv preprint arXiv:2504.10903,sential for verification.\n2025. References Fu, J., Zhao, X., Yao, C., Wang, H., Han, Q., and Xiao, Y. Reward shaping to mitigate reward hacking in rlhf. arXiv\nAchiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., preprint arXiv:2502.18770, 2025. L., Almeida, D., Altenschmidt, J., Altman, S.,\nAnadkat, S., et al. Gpt-4 technical report. arXiv preprint Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward\narXiv:2303.08774, 2023. model overoptimization. In International Conference on\nMachine Learning, pp. 10835–10866. Aggarwal, P. and Welleck, S. L1: Controlling how long\nGuo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., a reasoning model thinks with reinforcement learning. Zhu, Q., Ma, S., Wang, P., Bi, X., et al. 
Deepseek-r1: In- arXiv preprint arXiv:2503.04697, 2025.\ncentivizing reasoning capability in llms via reinforcement\nAmodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul- learning. arXiv preprint arXiv:2501.12948, 2025.\nman, J., and Man´e, D. Concrete problems in ai safety, Hou, B., Zhang, Y., Ji, J., Liu, Y., Qian, K., Andreas,\n2016. Thinkprune: Pruning long chain-ofthought of llms via reinforcement learning. arXiv preprint\nArora, D. and Zanette, A. Training language models to reaarXiv:2504.01296, 2025.\nson efficiently. arXiv preprint arXiv:2502.04463, 2025.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 26, + "total_chunks": 74, + "char_count": 1611, + "word_count": 230, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "602689f3-c0bc-4a39-b5b6-2b73c24695fe", + "text": "Huang, Z., Qiu, Z., Wang, Z., Ponti, E. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das- Post-hoc reward calibration: A case study on length bias. Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., arXiv preprint arXiv:2409.17407, 2024.\net al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T.,\narXiv:2204.05862, 2022. Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evalCassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps- uation of large language models for code. arXiv preprint\nCostin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, arXiv:2403.07974, 2024. 
Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning American invitational mathematics examination -\ninforcement learning: A survey. Journal of artificial amc. In American Invitational Mathematics Examination\nintelligence research, 4:237–285, 1996. - AMC 2023, 2023. Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., MAA. American invitational mathematics examination -\nGonzalez, J.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 27, + "total_chunks": 74, + "char_count": 1195, + "word_count": 173, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b0dc449e-a766-4d14-b3ff-92223c7f0ad9", + "text": "From crowdsourced data to aime. In American Invitational Mathematics Examination\nhigh-quality benchmarks: Arena-hard and benchbuilder - AIME 2024, 2024.\npipeline. arXiv preprint arXiv:2406.11939, 2024. American invitational mathematics examination -\nLightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, aime. In American Invitational Mathematics Examination\nB., Lee, T., Leike, J., Schulman, J., Sutskever, I., and - AIME 2025, 2025. Let's verify step by step. arXiv preprint\nOuyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.,\nMishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A.,\nLiu, C. Y., Zeng, L., Xiao, Y., He, J., Liu, J., Wang, C., et al. Training language models to follow instructions\nYan, R., Shen, W., Zhang, F., Xu, J., Liu, Y., and Zhou, with human feedback. Advances in neural information\nY. 
Skywork-reward-v2: Scaling preference data curation processing systems, 35:27730–27744, 2022.\nvia human-ai synergy. arXiv preprint arXiv:2507.01352,\nPuterman, M. Markov decision processes. Handbooks in\n2025a.\noperations research and management science, 2:331–434,\n1990.Liu, S.-Y., Dong, X., Lu, X., Diao, S., Liu, M., Chen, M.-H.,\nYin, H., Wang, Y.-C. F., Cheng, K.-T., Choi, Y., et al. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang,\nDler: Doing length penalty right-incentivizing more in- H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushtelligence per token via reinforcement learning. arXiv ing the limits of mathematical reasoning in open language\npreprint arXiv:2510.15110, 2025b. models. arXiv preprint arXiv:2402.03300, 2024. Liu, T., Xiong, W., Ren, J., Chen, L., Wu, J., Joshi, R., Gao, Shen, W., Zheng, R., Zhan, W., Zhao, J., Dou, S., Gui,\nY., Shen, J., Qin, Z., Yu, T., et al.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 28, + "total_chunks": 74, + "char_count": 1735, + "word_count": 265, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ed06fa49-6a52-41fe-a5d1-dad82cf73596", + "text": "Rrm: Robust reward T., Zhang, Q., and Huang, X.-J. Loose lips sink ships:\nmodel training mitigates reward hacking. arXiv preprint Mitigating length bias in reinforcement learning from\narXiv:2409.13156, 2024. human feedback. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 2859–2873,Liu, W., Zhou, R., Deng, Y., Huang, Y., Liu, J., Deng, Y.,\n2023. Zhang, Y., and He, J. 
Learn to reason efficiently with\nadaptive length-based reward shaping. arXiv preprint Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang,\narXiv:2505.15612, 2025c. R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:\nLuo, H., Shen, L., He, H., Wang, Y., Liu, S., Li, W., Tan,\n2409.19256, 2024. N., Cao, X., and Tao, D. O1-pruner: Length-harmonizing\nfine-tuning for o1-like reasoning pruning. arXiv preprint Singhal, P., Goyal, T., Xu, J., and Durrett, G. A long way\narXiv:2501.12570, 2025a. to go: Investigating length correlations in rlhf. arXiv\nLuo, M., Tan, S., Wong, J., Shi, X., Tang, W., Roongta, M.,\nCai, C., Luo, J., Zhang, T., Li, E., Popa, R. A., and Stoica, Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 29, + "total_chunks": 74, + "char_count": 1184, + "word_count": 188, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "07ba8202-4d08-44bd-9eb7-8293ad0ea187", + "text": "Deepscaler: Surpassing o1-preview with a 1.5b model Defining and characterizing reward gaming. Advances in\nby scaling rl. https://pretty-radio-b75.n Neural Information Processing Systems, 35:9460–9471,\notion.site/DeepScaleR-Surpassing-O1-P 2022.\nreview-with-a-1-5B-Model-by-Scaling\n-RL-19681902c1468005bed8ca303013a4e2, Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T.,\n2025b. Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. 
Stop overthinking: A survey on efficient reasoning for\nMa, W., He, J., Snell, C., Griggs, T., Min, S., and Zaharia, large language models. arXiv preprint arXiv:2503.16419,\nM. Reasoning models can be effective without thinking. 2025.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 30, + "total_chunks": 74, + "char_count": 670, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b7a89cce-2c17-482a-93ec-09077c72b768", + "text": "Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., SoriMa, X., Wan, G., Yu, R., Fang, G., and Wang, X. Cot-valve: cut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican,\nLength-compressible chain-of-thought tuning. arXiv K., et al. Gemini: a family of highly capable multimodal\npreprint arXiv:2502.09601, 2025b. models. 
arXiv preprint arXiv:2312.11805, 2023.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 31,
    "total_chunks": 74,
    "char_count": 366,
    "word_count": 54,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e7a72344-4d72-4096-b3a1-581406891c00",
    "text": "Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint
Tian, X., Zhao, S., Wang, H., Chen, S., Peng, Y., Ji, Y., Zhao, H., and Li, X. Deepdistill: Enhancing llm reasoning capabilities via large-scale difficulty-graded data training.
Wang, C., Zhao, Z., Jiang, Y., Chen, Z., Zhu, C., Chen, Y., Liu, J., Zhang, L., Fan, X., Ma, H., et al. Beyond reward hacking: Causal rewards for large language model alignment. arXiv preprint arXiv:2501.09620, 2025. 
T., Wang, W., Li, Y., and Li, W.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 32, + "total_chunks": 74, + "char_count": 702, + "word_count": 115, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d1fbedd8-bee5-4980-82af-72e50422867c", + "text": "Tokenskip: Controllable chain-of-thought compression in llms. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B.,\nYu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical\nreport. arXiv preprint arXiv:2505.09388, 2025. Yi, J., Wang, J., and Li, S. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient\nreasoning. arXiv preprint arXiv:2504.21370, 2025. Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai,\nW., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source\nllm reinforcement learning system at scale. arXiv preprint Zhang, J., Lin, N., Hou, L., Feng, L., and Li, J. Adaptthink:\nReasoning models can learn when to think. arXiv preprint Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning Connection to Heuristic Gating Mechanisms In Section 3.1, we motivated multiplicative shaping primarily through the lens of removing the compensatory optimization\nshortcut inherent in additive shaping. In this section, we provide an alternative perspective by analyzing the relationship\nbetween our approach and heuristic gating mechanisms (Arora & Zanette, 2025; Cheng et al., 2025). 
We demonstrate that\nmultiplicative shaping can be viewed as a principled generalization of heuristic gating: it mathematically reduces to gating\nin binary reward settings while providing a robust, \"soft\" gating mechanism in continuous reward landscapes where hard\nindicators fail. Equivalence in Binary Reward Settings.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 33, + "total_chunks": 74, + "char_count": 1500, + "word_count": 221, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac422931-f855-402f-af3e-5c002bea19c2", + "text": "Heuristic gating is a common enhancement to additive shaping in efficient\nreasoning (RLVR), which prevents models from optimizing length at the expense of accuracy. It typically employs an\nindicator function I(R = 1) to apply length penalties only when the response is correct. Let P denote a generic length-based penalty term (e.g., a negative function of length). Standard gated additive shaping\nmodifies the shaping term S in Eq. 5 to be conditional on task success: Gated Additive: ˆR(+,g) = R + λ · Sgate,\nwhere Sgate = I(R = 1) · P. Here, I(R = 1) acts as a hard gate, R ∈{0, 1} is the binary task outcome. Now, consider our multiplicative shaping defined in Eq. 7: Multiplicative: ˆR(×) = R · Smult. To facilitate comparison, we can decompose the scaling factor Smult into a baseline and a deviation term. We rewrite\nSmult = 1 + (Smult −1), where the deviation corresponds to the implicit penalty applied by the scaling mechanism: λP := Smult −1 =⇒ Smult = 1 + λP. 
We analyze the behavior in the two binary states:
• Case R = 0 (Failure): ˆR(+,g) = 0 + λ · (0 · P) = 0 and ˆR(×) = 0 · (1 + λP) = 0. Both methods deactivate the penalty, preventing premature termination on hard instances.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 34,
    "total_chunks": 74,
    "char_count": 1189,
    "word_count": 218,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ceb63bd8-c698-4dc3-a2fa-2484accedcf4",
    "text": "• Case R = 1 (Success): ˆR(+,g) = 1 + λ · P and ˆR(×) = 1 · (1 + λP) = 1 + λP. Both methods apply the full penalty to incentivize efficiency among correct solutions.
Thus, in the strict binary reward setting typical of RLVR, multiplicative shaping is mathematically equivalent to heuristic gating. It inherits the desirable property of protecting the policy from being penalized on incorrect reasoning paths.
Generalization to Continuous Rewards. The limitation of heuristic gating becomes apparent when transitioning to continuous reward settings, such as RLHF (where rewards are typically given by a reward model) or reasoning tasks with partial credit.
In these scenarios, the hard indicator I(R = 1) is ill-defined. Naively replacing it with a threshold I(R > τ) introduces hyperparameters and optimization discontinuities. Conversely, removing the gate entirely (reverting to pure additive shaping) reintroduces the trade-off discussed in Proposition 3.1, where the model can improve ˆR(+) by shortening length even if R degrades slightly. 
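The case analysis above can be checked numerically. The sketch below is illustrative only (the penalty function, λ value, and all names are our own assumptions, not the paper's implementation): it verifies that gated additive and multiplicative shaping coincide for binary rewards, and shows how the multiplicative form keeps a smooth, quality-scaled length signal when rewards are continuous.

```python
# Illustrative sketch (not the paper's code): compare gated additive shaping
# R + lam * I(R == 1) * P with multiplicative shaping R * (1 + lam * P),
# where P is a negative, length-based penalty. Names and constants are ours.

def length_penalty(length, budget=1000.0):
    # Generic negative penalty that grows in magnitude with length.
    return -min(length / budget, 1.0)

def gated_additive(reward, length, lam=0.5):
    gate = 1.0 if reward == 1 else 0.0  # hard indicator I(R = 1)
    return reward + lam * gate * length_penalty(length)

def multiplicative(reward, length, lam=0.5):
    return reward * (1.0 + lam * length_penalty(length))

# Binary rewards (RLVR): the two schemes coincide exactly.
for r in (0, 1):
    for n_tokens in (200, 800):
        assert abs(gated_additive(r, n_tokens) - multiplicative(r, n_tokens)) < 1e-12

# Continuous rewards (RLHF): the hard gate never fires, so the additive
# penalty vanishes entirely, while the multiplicative form still prefers
# shorter responses, with a length signal that scales with quality R.
gap_high = multiplicative(0.9, 200) - multiplicative(0.9, 800)
gap_low = multiplicative(0.1, 200) - multiplicative(0.1, 800)
assert gap_high > gap_low > 0
assert gated_additive(0.9, 200) == gated_additive(0.9, 800)  # no length signal
```

In this toy setting the quality-proportional gap is exactly the soft-gating behavior described next: the length signal is strong for high-reward responses and fades out for low-reward ones.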
Multiplicative shaping solves this by acting as a soft gating mechanism. As derived in Proposition 3.2, the group-normalized advantage under multiplicative shaping contains the following term governing the length signal:
A(ˆR(×)) ∝ R · (S − µS) + . . .
This decomposition demonstrates that the impact of length variation (S − µS) on the advantage is explicitly scaled by the task reward R. This creates a dynamic reweighting of the learning signal:
• Low Quality (R ≈ 0): The length signal is suppressed (R · (S − µS) ≈ 0).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 36,
    "total_chunks": 74,
    "char_count": 1588,
    "word_count": 234,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2df738f7-be23-4c4b-8db0-ba4685527a34",
    "text": "
• High Quality (R ≈1): The length signal is fully active (R · (S −µS) ≈S −µS).", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 37, + "total_chunks": 74, + "char_count": 309, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "77b8c185-0d72-421b-b49c-a1519ffc9123", + "text": "The advantage significantly favors\nshorter trajectories within the successful group. This mimics an active gate, effectively prioritizing efficiency once\ncapability is secured. This property effectively interpolates between \"no penalty\" and \"full penalty\" based on the response quality. Consequently,\nGR3 allows us to apply strong length regularization in RLHF without the risk of the model collapsing to short, low-quality\nresponses, as evidenced by the dynamics in Figure 2. Empirical Observation. 
We further conduct an analytical experiment, illustrated in Figure 5.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 38, + "total_chunks": 74, + "char_count": 569, + "word_count": 79, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dc7ece6b-8380-4f30-8d25-87b78a6bca53", + "text": "Specifically, we convert\nour multiplicative shaping term into a penalty form used in gated additive shaping by defining λP := Smult −1. We then\nintroduce different thresholds I(R > τ) to extend gated additive shaping to continuous-reward RLHF settings. We observe\nthat, due to optimization discontinuities, all choices of τ lead to performance that falls short of standard GRPO. 
At the same time, generation length is reduced more aggressively, dropping below the typical level of the base policy model.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 39,
    "total_chunks": 74,
    "char_count": 503,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8143d704-8f78-4931-8525-a0a7192887d6",
    "text": "[Figure: panels (a) Task Reward and (b) Response Length over training steps, comparing Standard GRPO, Multiplicative, and Gated Additive shaping with thresholds τ = 0.5, 0.75, 0.9.] Comparison of multiplicative and gated additive shaping during RLHF training. Multiplicative shaping matches standard GRPO in task reward while achieving controlled length reduction. In contrast, gated additive variants with different reward thresholds underperform and produce overly short responses, reflecting optimization instability introduced by hard gating.
Analysis of the Difficulty Over-Adaptation Phenomenon
In Section 3.2, we discussed how group-relative length regularization adapts the length budget to on-policy statistics. While this removes the rigidity of global thresholds, we observe an unintended side effect in certain shaping strategies (e.g., Efficiently (Arora & Zanette, 2025)): the policy can become over-adaptive to perceived task difficulty, a phenomenon we term difficulty over-adaptation. 
Concretely, the model tends to aggressively compress reasoning on easy prompts while\nfailing to effectively restrain excessive length on hard ones, as reflected in Table 2 and illustrated by the examples in Table 6. In other words, the regularizer distorts how reasoning effort is allocated across difficulty levels. Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning Example output from the Efficiently (Arora & Zanette, 2025) 1.5B baseline at step 800, showing excessive reasoning truncation\nand omission of critical intermediate steps, resulting in an incoherent chain of thought.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 40, + "total_chunks": 74, + "char_count": 1716, + "word_count": 246, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "83f3db2c-6743-4bec-aa58-aaef5196f5dd", + "text": "Prompt\nA 90◦rotation around the origin in the counter-clockwise direction is applied to 7 + 2i. What is the resulting complex number? Let's think step by step and output the final answer within \\boxed{}. 
\n7cos90 -2i sin90=0-2i\n// Extremely short reasoning performed\n // Fails to present the correct answer.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 41,
    "total_chunks": 74,
    "char_count": 321,
    "word_count": 51,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bb9cb551-824b-45a8-a99a-5cabad0665d6",
    "text": "To understand the mechanism behind this skew, we analyze the sensitivity of the reward function to changes in length. Consider the Efficiently (Arora & Zanette, 2025) formulation:\nR̂(x, y(i)) = R(x, y(i)) − λ · I(R(x, y(i)) = 1) · s((ℓ(i) − ℓ̄) / σ_ℓ).\nIn this formulation, the input to the sigmoid function s(·) is amplified by the factor 1/σ_ℓ. This means that the marginal\npenalty associated with increasing the response length is inversely proportional to the statistical dispersion (σ_ℓ) of the group.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 42,
    "total_chunks": 74,
    "char_count": 494,
    "word_count": 82,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0ea6b0bb-9a2a-4e46-b8ef-411747a10893",
    "text": "This dependency creates instability across difficulty regimes. 
On easier prompts, the policy is typically confident and\nconverges to consistent responses, causing the length standard deviation to collapse (i.e., σ_ℓ → 0). Consequently, the\nscaling factor 1/σ_ℓ becomes extremely large. In this low-variance regime, even a deviation of a single token is treated as\na massive statistical outlier, triggering a severe drop in reward. This hypersensitivity forces the model to over-compress\nsimple responses to avoid harsh penalties. Conversely, on harder prompts, the policy often explores diverse reasoning paths,\nresulting in a larger σ_ℓ. This attenuates the penalty signal, allowing longer generations to persist with relatively little cost. In contrast, GR3 normalizes the penalty based on the characteristic scale (mean length ℓ̄) rather than dispersion:",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 43,
    "total_chunks": 74,
    "char_count": 851,
    "word_count": 120,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0b5226cc-3c94-4058-8b58-a79f5630190e",
    "text": "R̂(x, y(i)) = R(x, y(i)) / (1 + α · ℓ(i)/ℓ̄).\nHere, the sensitivity of the penalty depends on the length ratio ℓ(i)/ℓ̄. While the mean length ℓ̄ is naturally smaller for easier\ntasks (appropriately making the budget tighter), it represents the physical scale of the response and does not collapse to\nnear-zero values as the model becomes confident. 
By normalizing based on scale rather than variance, GR3 provides a\nstable regularization signal that remains robust to the model's convergence state, effectively mitigating the imbalance in\ncompression pressure across difficulty levels.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 44,
    "total_chunks": 74,
    "char_count": 583,
    "word_count": 93,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e534884f-19b7-44ec-9e46-7677214ab9e8",
    "text": "The Dilemma of High Reward Density In this section, we analyze a fundamental structural tension that arises when length regularization is combined with group-normalized advantages. We show that under high reward density, a seemingly desirable strict requirement—ensuring that\nall highest-reward trajectories retain positive advantage—is in general mathematically infeasible. We then show that even\na relaxed average-case criterion can degenerate in the limiting case where all sampled trajectories achieve Rmax, as a\nconsequence of the convexity of multiplicative length rescaling. These observations motivate the relaxed, calibration-based\nconstraint and the online filtering strategy adopted in Section 3.3. Impossibility of the Strict Advantage-Preservation Objective A Strict but Natural Objective. A natural objective for length-aware reinforcement learning is to preserve the optimization\nsignal of the best solutions. 
Concretely, consider the following strict condition: all trajectories achieving the maximum\ntask reward Rmax within a sampled group should receive positive advantages after length regularization. These trajectories represent the highest-quality responses, and assigning them negative advantages risks discouraging correct\nreasoning behaviors.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 45,
    "total_chunks": 74,
    "char_count": 1365,
    "word_count": 175,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "352a2c47-a1a8-4ffc-b965-31ea54a1f5aa",
    "text": "Under GRPO-style normalization, the strict objective is therefore equivalent to requiring R̂(x, y(j)) > µ_R̂, ∀j ∈ H, where H := {j : R(x, y(j)) = Rmax} denotes the set of highest-reward trajectories, R̂(x, y(j)) is the regularized reward\nand µ_R̂ is the mean regularized reward within the group. Impracticality Under High Reward Density. We now show that under high reward density, the strict objective of\npreserving positive advantages for all highest-reward trajectories is mathematically infeasible, regardless of how small the\nlength penalty coefficient α is. Under GR3, the regularized reward is\nR̂(x, y(j)) = R(x, y(j)) / (1 + α · ℓ(j)/ℓ̄).\nFor all j ∈ H we have R(x, y(j)) = Rmax, so variation in R̂ depends only on length. Since the regularization term is\nmonotonically decreasing in ℓ(j), the longest trajectory in H attains the smallest regularized reward\nR̂_min = Rmax / (1 + α · ℓ_max/ℓ̄).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 46,
    "total_chunks": 74,
    "char_count": 891,
    "word_count": 150,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1938491d-9ae5-45b7-85b0-957ff87470e3",
    "text": "When reward density is high, the group mean µ_R̂ is dominated by trajectories in H and is therefore approximately their\naverage. Since the minimum of a set cannot exceed its mean, we must have R̂_min ≤ µ_R̂. Hence, at least one highest-reward trajectory receives a non-positive advantage. This shows that, under high reward density, it is impossible for all\nhighest-reward trajectories to maintain positive advantages after group normalization; the conflict is structural rather than a\nconsequence of hyperparameter choice. This impossibility persists even in the limit α → 0. Using a first-order expansion,\nR̂(x, y(j)) ≈ Rmax · (1 − α · ℓ(j)/ℓ̄),\nand denoting by\nℓ̄_H = (1/|H|) Σ_{j∈H} ℓ(j)\nthe mean length among highest-reward trajectories, we obtain\nR̂(x, y(j)) − µ_R̂ ≈ −Rmax · α · (ℓ(j) − ℓ̄_H)/ℓ̄.\nHence\nÂ(j) ∝ −(ℓ(j) − ℓ̄_H).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 47,
    "total_chunks": 74,
    "char_count": 802,
    "word_count": 133,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2f3deed1-2e04-48f4-a4bb-eaf846c6713e",
    "text": "so any highest-reward trajectory longer than the mean length of H must receive a negative advantage, no matter how small α\nis. Group normalization enforces a zero-mean constraint that inevitably induces sign flips among equally correct trajectories\nwith different lengths. Empirical Observation. Figure 6 empirically illustrates this phenomenon. The horizontal axis shows the proportion\nof trajectories in a group achieving Rmax (reward density), and the vertical axis shows the fraction of highest-reward\ntrajectories that satisfy the strict condition Â(i) > 0. Even with a very small penalty strength (Figure 6(a), α = 0.05), the\nsatisfaction rate declines as reward density increases. With a larger α (Figure 6(b), α = 5.0), the decline becomes much\nmore pronounced. These results confirm that violations of the strict condition arise from an inherent structural conflict rather\nthan poor hyperparameter tuning. This impossibility result directly motivates the relaxed calibration strategy in Section 3.3. 
Instead of attempting to protect\nthe longest highest-reward trajectory—which is generally infeasible—we adopt an average-case criterion that ensures a\ntypical high-quality trajectory, whose length is near the group mean, remains above the group-average regularized reward. This leads naturally to the practical constraint in Eq. (16).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 48,
    "total_chunks": 74,
    "char_count": 1446,
    "word_count": 205,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dd5c03d6-798b-4d96-97b2-ec6d556895e4",
    "text": "[Figure 6: strict-constraint (Naive) satisfaction rate vs. Rmax score ratio; panels (a) α = 0.05 and (b) α = 5.00.] Effect of reward density on the strict advantage-preservation objective. 
As reward density increases, the strict constraint\nsatisfaction rate decreases, even for a small penalty coefficient α = 0.05, and more noticeably for a larger penalty α = 5.0.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 49,
    "total_chunks": 74,
    "char_count": 477,
    "word_count": 87,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c451ffcd-b573-4365-8ca6-78ead05374f4",
    "text": "Degeneracy of the Average-Case Criterion in the All-Rmax Limit Recall the average-case calibration criterion (Eq. (16)), which requires that a representative high-quality trajectory (reward\nRmax and average length ℓ̄) retains non-negative advantage:\nRmax / (1 + α) ≥ µ_R̂,\nwhere µ_R̂ denotes the within-group mean of the regularized rewards. As noted in Section 3.3, this constraint can fail in the\nlimiting case where all trajectories in the group achieve Rmax, motivating online filtering. All-Rmax case reduces the condition to a Jensen inequality. Assume a sampled group satisfies R(x, y(i)) = Rmax for\nall i ∈ {1, . . . , G}. Under GR3, the regularized reward is\nR̂(i) = Rmax / (1 + α · ℓ(i)/ℓ̄).\nDefine the normalized length ratio z_i := ℓ(i)/ℓ̄, which obeys (1/G) Σ_{i=1}^G z_i = 1, and let\nf(z) := 1 / (1 + αz), z ≥ 0.\nThen µ_R̂ = Rmax · (1/G) Σ_{i=1}^G f(z_i), and Eq. 
(16) becomes\nf(1) ≥ (1/G) Σ_{i=1}^G f(z_i).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 50,
    "total_chunks": 74,
    "char_count": 832,
    "word_count": 145,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "79d22f55-719d-42bf-97ac-580e5d204c82",
    "text": "Convexity flips the inequality. A direct calculation gives\nf''(z) = 2α² / (1 + αz)³ > 0,\nso f is convex on [0, ∞). By Jensen's inequality,\nf((1/G) Σ_{i=1}^G z_i) ≤ (1/G) Σ_{i=1}^G f(z_i).\nUsing (1/G) Σ_i z_i = 1, we obtain\nf(1) ≤ (1/G) Σ_{i=1}^G f(z_i),\nwhich is the reverse of the desired condition. Moreover, the inequality is strict whenever the lengths are not all identical\n(i.e., when z_i is not constant). Therefore, in an all-Rmax group, the average-case constraint in Eq. (16) can only hold in the\ndegenerate case ℓ(1) = · · · = ℓ(G); otherwise it must fail. 
This explains why we exclude such groups via online filtering in\npractice.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 51,
    "total_chunks": 74,
    "char_count": 708,
    "word_count": 130,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "316cf32c-db7a-4581-907b-56bcefdd206b",
    "text": "Detailed Experimental Settings We evaluate GR3 in representative post-training regimes spanning RLVR and RLHF. In the RLVR setting, we study whether\nthe model can reduce unnecessarily long CoT reasoning while maintaining task performance improvement. As for the\nRLHF setting, we examine whether our method mitigates reward hacking (Amodei et al., 2016) and produces well-aligned\nresponses with natural response lengths. In this section, we provide the detailed hyperparameters and configurations used in our experiments. All experiments are\nconducted using veRL (Sheng et al., 2024) as the training framework, based on a standard GRPO advantage estimator. The\ndetailed implementation settings for the experiments are shown in Table 7. Moreover, we evaluate mathematical reasoning on AIME-24 (MAA, 2024), AIME-25 (MAA, 2025), AMC-23 (MAA,\n2023), and MATH500 (Lightman et al., 2023); code generation on LiveCodeBench v6 (Jain et al., 2024) and MultiPL-E\n(Cassano et al., 2022); and conversational ability on Arena-Hard-Auto (Li et al., 2024) and AlpacaEval (Dubois et al., 2024). Following standard practice, we report the length-controlled (LC) win rate for AlpacaEval. 
For Arena-Hard-Auto, we do not\napply length control and instead use scores derived directly from the original pairwise judge.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 52,
    "total_chunks": 74,
    "char_count": 1292,
    "word_count": 188,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b8ab87b4-067a-4dc4-bdae-4b8bf0b58ef1",
    "text": "Hyper-parameters used in our experiments. Hyperparameter Math Code Chat\nGR3 Coefficient α 0.33 0.33 0.00133\nKL Coefficient β 0.0 0.0 0.001\nBatch Size 128 128 128\nMini Batch Size 128 128 128\nPrompt Length 1024 1024 1024\nRollout Length 16384 16384 4096\nRollout Temperature 1.0 1.0 1.0\nRollout Group Size 16 8 8\nOptimizer AdamW AdamW AdamW\nWeight Decay 0.01 0.01 0.01\nLearning Rate 2e-6 2e-6 1e-6\nScheduler Type Constant Constant Constant\nTraining Step 1000 800 400\nEvaluation Length 32768 32768 8192\nEvaluation Temperature 0.6 0.6 0.0\nAdditional Experimental Results Mathematical Reasoning Results on 1.5B Models This section presents experimental results on mathematical reasoning tasks, using DeepSeek–R1–Distill–1.5B as the base\nmodel. As shown in Table 8, the findings are consistent with the experiments of DeepSeek–R1–Distill–7B: GR3 not only\nsignificantly enhances the model's reasoning capabilities but also substantially reduces the length of the generated output in\nterms of tokens, demonstrating good performance across different scales. Mathematical reasoning performance for 1.5B models. 
Model | AIME24 Avg@32↑ #Token↓ | AIME25 Avg@32↑ #Token↓ | AMC23 Avg@16↑ #Token↓ | MATH500 Avg@4↑ #Token↓\nInitial model\nDeepSeek–R1–Distill–1.5B 30.0 16531 23.6 15799 70.8 9351 83.9 5399\nLength-oriented RL\nLCR1–1.5B 23.5 9071 20.6 8275 67.8 4170 81.9 2520\nLaser–DE–L4096–1.5B 30.1 5770 24.1 5008 73.4 3110 84.6 1931\nAdaptThink–1.5B 34.2 9204 24.7 9234 63.3 2859 80.9 1649\nDLER–R1–1.5B 34.3 3839 27.8 3153 82.1 2419 87.2 1783\nPerformance-oriented RL\nGRPO 39.6 13054 31.4 12985 81.9 9917 85.6 7138\nGR3 (ours) 45.2 8381 32.8 8137 81.6 4153 89.3 2214\nAblation Results on Penalty Strength α\nWe analyze the sensitivity of GR3 to the length penalty coefficient α by sweeping its value under identical training settings.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 53,
    "total_chunks": 74,
    "char_count": 1907,
    "word_count": 281,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d4a2d2f3-978c-463a-a340-dbceb6518872",
    "text": "Detailed results on 1.5B models are shown in Table 9. When α is large (e.g., α = 1.0), the multiplicative rescaling term heavily penalizes long trajectories, largely independent of\ntheir reward level. In this regime, GR3 behaves similarly to conventional length-penalized RL, where optimization is driven\nprimarily by shortening responses rather than improving solution quality. Although token usage is significantly reduced,\nthe performance gains over the initial model become noticeably smaller, indicating that overly aggressive regularization\nsuppresses useful long-form reasoning. 
As α decreases, response length increases in a gradual and well-behaved manner, while task performance first improves\nand then saturates. In particular, moving from α = 0.33 to smaller values (e.g., 0.2 and 0.1) yields longer generations but\nonly marginal or inconsistent accuracy improvements. This empirical trend aligns closely with the analysis in Section 3.3.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 54, + "total_chunks": 74, + "char_count": 950, + "word_count": 135, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e4e43716-9611-4263-b92d-384736d55140", + "text": "Once the penalty is weak enough that the advantage of representative high-quality trajectories is preserved, further reducing\nα mainly relaxes the length constraint without introducing stronger optimization signals. In other words, the training has\nalready crossed the advantage-preservation boundary, after which additional reasoning length no longer translates into\nmeaningful performance gains. Overall, α controls distinct behavioral regimes, and GR3 provides a principled way to select it near the advantage-preserving\ntransition between length control and capability gains. Qualitative Analysis of Rollout Trajectories To better understand the behavioral differences induced by the two training objectives, we present representative rollout\nexamples from a GR3-trained model and a GRPO-trained baseline on the same reasoning prompt. 
Tables 10 and 11 show\nthe generation traces.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 55, + "total_chunks": 74, + "char_count": 883, + "word_count": 119, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ba9e7a57-5212-406e-8d56-f20770de069c", + "text": "GR3: concise reasoning with preserved structure. As illustrated in Table 10, the GR3-trained model produces a\nreasoning trajectory that is both structured and economical. The solution follows a clear progression: (i) restating the task,\n(ii) identifying the periodicity of directions under full rotations, (iii) reducing the angle to a remainder modulo 360◦, and\n(iv) mapping the residual rotation to a compass direction. 
Each step contributes directly to advancing the solution, and",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 56,
    "total_chunks": 74,
    "char_count": 483,
    "word_count": 71,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "28753c1f-c45d-40c9-9200-0f2fea6de83c",
    "text": "Hyperparameter sweep of the length penalty coefficient α in GR3 on 1.5B models. Model | AIME24 Avg@32↑ #Token↓ | AIME25 Avg@32↑ #Token↓ | AMC23 Avg@16↑ #Token↓ | MATH500 Avg@4↑ #Token↓\nInitial model\nDeepSeek–R1–Distill–1.5B 30.0 16531 23.6 15799 70.8 9351 83.9 5399\nOur method: varying α\nGR3 (α = 1.00) 34.9 6316 27.4 5927 75.2 1754 83.0 848\nGR3 (α = 0.50) 40.5 7923 29.6 7399 82.3 3245 87.1 1705\nGR3 (α = 0.33) 45.2 8381 32.8 8137 81.6 4153 89.3 2214\nGR3 (α = 0.20) 45.1 10174 32.4 10250 83.0 4887 88.9 2838\nGR3 (α = 0.10) 43.2 10759 31.7 10701 83.0 6235 88.2 3419\nintermediate checks serve to confirm rather than re-derive earlier results.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 57,
    "total_chunks": 74,
    "char_count": 738,
    "word_count": 123,
"chunking_strategy": "semantic" + }, + { + "chunk_id": "849b10dd-645c-47cc-82c5-9b20a87aeb22", + "text": "Importantly, the trajectory terminates decisively with a correctly formatted boxed answer. The reasoning chain is neither\nartificially shortened nor overly verbose; instead, redundant re-computation and self-doubt loops are largely absent. This\nreflects a policy that has learned to allocate tokens primarily to causally relevant steps.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 58, + "total_chunks": 74, + "char_count": 336, + "word_count": 45, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6c41d1ec-ccab-4238-953f-2e951cbae6ac", + "text": "GRPO: verbose loops and diluted signal. In contrast, the GRPO-trained baseline (Table 11) exhibits substantially\ndifferent behavior. Although it repeatedly identifies correct intermediate facts—such as the 360◦periodicity and the\n2250 mod 360 = 90◦reduction—it frequently re-derives them, questions previously established conclusions, and oscillates\nbetween equivalent formulations (e.g., full rotations vs. fractional rotations). The trajectory contains multiple self-corrections\nthat do not introduce new information. This pattern results in long reasoning traces where many tokens are only weakly related to forward progress. From an\noptimization perspective, such trajectories distribute the reward signal across a large number of low-impact tokens, reducing\nthe effective learning pressure on decisive reasoning steps. 
Moreover, despite eventually circling back to the correct direction,\nthe model fails to present a clean final answer in the required boxed format, leaving the solution inconclusive. Implications for optimization dynamics.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 59, + "total_chunks": 74, + "char_count": 1045, + "word_count": 137, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac532034-5b2f-491d-8d56-fced6a485612", + "text": "These qualitative differences align with the mechanism discussed in Section 4.3.2. By down-weighting the advantages of unnecessarily long trajectories, GR3 implicitly favors reasoning paths with\nhigher informational density per token. This encourages the policy to preserve essential logical structure while avoiding\nnon-robust reasoning patterns, such as repetitive or self-doubting loops. 
As a result, the reward signal becomes more\nconcentrated on tokens that correspond to meaningful reasoning transitions, rather than being diluted across verbose yet\nlow-contributing segments.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 60, + "total_chunks": 74, + "char_count": 582, + "word_count": 76, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d9a9337d-2a61-4a24-b54c-c9809c4e52f4", + "text": "Zichao Li led the project. Jie Lou and Yaojie Lu supervised the project. Fangchen Dong contributed to the experimental implementation. Zhiyuan Fan contributed by first sketching the prototype of the method during early discussions. Mengjie Ren contributed to the experiments on code generation tasks. Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, and Xing Yu provided supervision and contributed to technical discussions\nthroughout the project. 
Example output from our GR3-trained 1.5B model, demonstrating robust, clearly structured step-by-step reasoning and a correct\nfinal answer in boxed format.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 61,
    "total_chunks": 74,
    "char_count": 706,
    "word_count": 98,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9d955e87-a0b8-40d8-b535-705132cb4370",
    "text": "Prompt\nA figure skater is facing north when she begins to spin to her right. She spins 2250 degrees. Which direction (north, south, east or\nwest) is she facing when she finishes her spin?",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 62,
    "total_chunks": 74,
    "char_count": 187,
    "word_count": 34,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "94e28563-76f7-4dd9-a972-b83d2c14931e",
    "text": "Let's think step by step and output the final answer within \\boxed{}. Example Output from our GR3-trained 1.5B Model\n\nOkay, so I need to figure out which direction a figure skater is facing after spinning 2250 degrees to her right. 
She starts facing\nnorth. Hmm, let's break this down step by step.\n// Restates the problem and sets up a step-by-step plan.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 63, + "total_chunks": 74, + "char_count": 361, + "word_count": 63, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cbbb1c4d-f9ba-4ca2-8f04-c168333f000b", + "text": "First, I know that a full rotation is 360 degrees, so every time she spins 360 degrees, she'll end up facing the same direction she\nstarted. So, to find out where she ends up after 2250 degrees, I should probably find the equivalent angle within a single rotation.\n// Recalls the key concept: directions repeat every 360°. Let me calculate how many full rotations 2250 degrees is. I'll divide 2250 by 360. Let's see, 2250 ÷ 360. Let me compute that:\n360 × 6 is 2160, and 2250 - 2160 is 90. So, 2250 degrees is equivalent to 6 full rotations plus an additional 90 degrees.\n// Performs division to reduce the rotation to full turns plus a remainder. So, she spins 6 full circles, which brings her back to the starting direction, facing north. Then she spins an additional 90 degrees to\nher right. Now, I need to figure out where she ends up after spinning 90 degrees to the right from north.\n// Uses the result: full turns cancel out, leaving only the 90° remainder. Let's recall the directions: when facing north, a right turn would be towards east. So, if she spins 90 degrees to the right, starting\nfrom north, she would face east.\n// Maps a 90° right turn from north to the compass direction east. 
Wait, let me double-check that. If she's facing north and spins 90 degrees to the right, her new direction is east.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 64, + "total_chunks": 74, + "char_count": 1315, + "word_count": 241, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8146bb01-7851-47a1-a8c3-0afbcd365345", + "text": "Yeah, that makes\nsense. So after 2250 degrees, she ends up facing east.\n// Verifies the directional reasoning and confirms the conclusion. So the answer should be east.\n**Final Answer**\neast", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 65, + "total_chunks": 74, + "char_count": 190, + "word_count": 31, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8acad438-5967-48c5-91d3-35b64b9e225a", + "text": "\nThe figure skater starts facing north and spins 2250 degrees to her right. First, we determine the equivalent angle within a single rotation by dividing 2250 by 360. This gives us 6 full rotations plus an\nadditional 90 degrees (since 2250 - 6*360 = 90). After completing 6 full rotations, the skater is back facing north. She then spins an additional 90 degrees to her right from north. 
When facing north and spinning 90 degrees to the right, the skater will face east. // Formats the final answer as required. Example output from the GRPO-trained 1.5B baseline, exhibiting redundant reasoning loops, repeated self-doubt, and failure to\nproduce a conclusive final answer.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 66,
    "total_chunks": 74,
    "char_count": 785,
    "word_count": 124,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c12d8465-1f0b-403b-b229-baefcbbade83",
    "text": "Prompt\nA figure skater is facing north when she begins to spin to her right. She spins 2250 degrees. Which direction (north, south, east or\nwest) is she facing when she finishes her spin?",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": [
      "Zichao Li",
      "Jie Lou",
      "Fangchen Dong",
      "Zhiyuan Fan",
      "Mengjie Ren",
      "Hongyu Lin",
      "Xianpei Han",
      "Debing Zhang",
      "Le Sun",
      "Yaojie Lu",
      "Xing Yu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 67,
    "total_chunks": 74,
    "char_count": 187,
    "word_count": 34,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8b6465b9-aacb-46f3-9256-4cfb550aa9bb",
    "text": "Let's think step by step and output the final answer within \\boxed{}. 
Example Output from the GRPO-trained 1.5B Model\n\nOkay, so I have this problem here about a figure skater spinning. Let me try to figure it out step by step.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 68, + "total_chunks": 74, + "char_count": 233, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8e1b32e6-ce15-41a4-8a22-ab0aca67e352", + "text": "Hmm, she's facing north\nand starts spinning to her right. She spins 2250 degrees, and we need to find out which direction she's facing after that. The options\nare north, south, east, or west. Alright, first, I know that when you spin, your direction changes based on how many degrees you turn. But since she's starting\nfacing north and spinning to her right, I need to figure out how that affects her direction after a certain number of degrees. When you spin to your right, does that mean you're turning clockwise or counterclockwise? I think in the context of\nfacing north, spinning to your right would be clockwise because if you're facing north and turn right, you're going towards east,\nthen south, then west, and back to north. So, right is clockwise. 
But maybe I should clarify.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 69, + "total_chunks": 74, + "char_count": 785, + "word_count": 137, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "88d5760e-0f4c-4211-859b-80ea1648aeab", + "text": "If you're facing north and you spin to your right, you're turning clockwise. So, each 90-degree turn\nwould change your direction. So, 90 degrees to the right from north would be east, another 90 degrees would be south, another 90\ndegrees would be west, and another 90 degrees back to north. That makes sense.\n// Repeatedly re-derives the meaning of \"spin to her right\". Now, she spins 2250 degrees.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 70, + "total_chunks": 74, + "char_count": 398, + "word_count": 69, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "23580d33-3163-4d0d-81cb-a2b610ba8f18", + "text": "Let me think about how to break this down. Since a full circle is 360 degrees,\nmaybe I can find out how many full rotations she makes and then the remaining degrees. So, if I divide 2250 by 360, that will tell me how many full spins she does and what the remainder is. 
Let me do that: 2250 ÷ 360. First, 360 times 6 is 2160, right? Because 360*6 = 2160. Then, 2250 - 2160 is 90. So, 2250 degrees is equal to 6 full rotations\n(which bring her back to the starting point) plus an additional 90 degrees. So, effectively, she only needs to consider the 90-degree\nturn after the full rotations.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 71, + "total_chunks": 74, + "char_count": 589, + "word_count": 114, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fdcc7288-0c9a-421a-9be1-4bb20b001f81", + "text": "Wait, is it 6 full rotations or 6.25? Wait, no, 2250 divided by 360 is 6.25, because 360*6=2160, and 2250-\n2160=90, so it's 6 and 90/360, which is 6.25 full rotations. So, 6.25 rotations is the same as 6 full rotations plus 0.25 of a rotation. But since a full rotation brings her back to the starting direction, the 6 full rotations don't change her facing. So, the key is the 0.25\nof a rotation, which is 90 degrees. She starts spinning to her right, which is clockwise. So, spinning 90 degrees clockwise from north would make\nher face east. So, after 2250 degrees, which is equivalent to 6 full rotations plus 90 degrees, she would have effectively turned 90\ndegrees clockwise from her starting direction. Wait, the problem says she starts spinning to her right. So, does that mean she is spinning clockwise? Because if\nshe's facing north and spins to her right, that's a clockwise spin. So, each 90-degree turn clockwise would change her direction. So, if she spins 90 degrees clockwise, she would face east. 
180 degrees clockwise would face her west, and 270 degrees clockwise\nwould face her south. But in this case, she only spins 90 degrees, which is 2250 degrees, which is 6 full rotations plus 90 degrees. Wait, but 2250 degrees is a lot.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 72, + "total_chunks": 74, + "char_count": 1247, + "word_count": 221, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "14d33b56-be57-4026-ad12-600c1d3ee590", + "text": "Let me make sure I'm calculating this correctly. So, 2250 divided by 360 is equal to... let's see, 360\ntimes 6 is 2160, as I had before. 2250 minus 2160 is 90. So, 2250 degrees is 6 full circles plus 90 degrees. So, she's effectively\nturned 90 degrees.\n// Unnecessarily re-checks the same arithmetic multiple times. But since she's spinning to her right, which is clockwise, a 90-degree turn would make her face east. So, she starts facing north,\nspins 90 degrees clockwise, so she faces east. Wait, maybe I'm mixing up the direction of rotation. If she's spinning to her right, that's a clockwise rotation,\nbut in terms of her facing direction, starting north, spinning 90 degrees clockwise would make her face east. 
Wait, but let me think again.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 73, + "total_chunks": 74, + "char_count": 747, + "word_count": 130, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "30cfacd3-e259-454c-be33-c63dffa4d90e", + "text": "When you spin to your right, you're turning clockwise. So, if you're facing north, turning right\n(clockwise) 90 degrees would make you face east. So, 90 degrees clockwise from north is east. So, 2250 degrees is 6 full rotations\nplus 90 degrees. So, after the full rotations, she's back to north, and then she turns 90 degrees clockwise, so she's facing east.\n// Enters a self-doubt loop, questioning previously correct directional reasoning.\n. . . If you add 2250 degrees, but since 360 is a full circle, 2250 degrees is equivalent to 90 degrees. So, starting from\nnorth (0 degrees), adding 90 degrees clockwise is east. So, yes, she's facing east. But wait, hold on, is it 90 degrees or 270 degrees? Wait, no, because she's spinning to her right, which is clockwise, so 90 degrees\nclockwise from north is east. Wait, maybe I should draw a diagram in my mind.\n. . 
.\n// Fails to present the final answer, leaving the solution inconclusive.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 74, + "total_chunks": 74, + "char_count": 938, + "word_count": 165, + "chunking_strategy": "semantic" + } +] \ No newline at end of file