Hejian Sang

pb09204048

AI & ML interests

None yet

Recent Activity

liked a dataset 4 days ago

ianncity/KIMI-K2.5-1000000x

Viewer • Updated 8 days ago • 733k • 2.98k • 205

upvoted a paper about 1 month ago

On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published Mar 5 • 8

submitted a paper to Daily Papers about 1 month ago

On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published Mar 5 • 8

authored a paper about 1 month ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published Feb 24 • 6

upvoted a paper about 2 months ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published Feb 24 • 6

submitted a paper to Daily Papers about 2 months ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published Feb 24 • 6

commented on 2 papers about 2 months ago

Reinforcement Learning via Self-Distillation

Paper • 2601.20802 • Published Jan 28 • 43

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Paper • 2601.18734 • Published Jan 26 • 4

commented on Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective 3 months ago

Hi @sseymens
Thank you for your comments.
I can help answer your question about the MoE on-policy part.

Yes, forcing old_log_prob = log_prob.detach() does not solve the on-policy issue: the log-prob is computed with the current policy, but the sampling distribution can still differ because of expert selection.
When we explored the agentic issues in gpt-oss training, we did not identify the root cause at first. One hypothesis was inference-training inconsistency. Applying importance sampling did not help, so we tested whether forcing old_log_prob = log_prob.detach() would alleviate the issue if that were the root cause. This was only for hypothesis testing.
At that time, verl did not yet support expert router replay (https://arxiv.org/pdf/2510.11370v1), so we could not test that idea. We have now tested the replay, but it is not the root cause either. The root cause is the attention sink.
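
To make the point about old_log_prob = log_prob.detach() concrete, here is a minimal sketch in plain Python. It is not verl code: the function names and the per-token probabilities (0.30 and 0.20) are hypothetical, chosen only to show that forcing the ratio to 1 hides any gap between the training-side policy and the distribution that actually sampled the token.

```python
import math


def ppo_ratio(log_prob: float, old_log_prob: float) -> float:
    """Importance ratio pi_new / pi_old for one sampled token."""
    return math.exp(log_prob - old_log_prob)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped objective for one token (to be maximized)."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical log-probs of the SAME sampled token under two forward passes.
# With MoE models the two can differ, e.g. when different experts are selected.
train_log_prob = math.log(0.30)    # training-side forward pass (current policy)
rollout_log_prob = math.log(0.20)  # inference engine that produced the sample

# Forcing old_log_prob = log_prob.detach(): the ratio is 1 by construction.
forced_ratio = ppo_ratio(train_log_prob, train_log_prob)

# Ratio against the distribution that actually generated the token.
true_ratio = ppo_ratio(train_log_prob, rollout_log_prob)

print(f"forced ratio: {forced_ratio:.3f}")
print(f"true ratio:   {true_ratio:.3f}")
print(f"clipped objective at true ratio, advantage=1: "
      f"{clipped_surrogate(true_ratio, 1.0):.3f}")
```

The forced ratio is always 1, so the clip never activates and the update treats the sample as exactly on-policy, whatever distribution it was really drawn from. That is why the trick is useful as a hypothesis test but does not by itself remove an inference-training mismatch.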

published an article 3 months ago

Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Article • Jan 27 • 70

authored a paper 3 months ago

Debunk the Myth of SFT Generalization

Paper • 2510.00237 • Published Sep 30, 2025 • 2

upvoted 2 papers 6 months ago

Debunk the Myth of SFT Generalization

Paper • 2510.00237 • Published Sep 30, 2025 • 2

Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs

Paper • 2509.25779 • Published Sep 30, 2025 • 19

liked 4 datasets 7 months ago

anisha2102/RaR-Science-20k-o3-mini

Viewer • Updated Oct 5, 2025 • 22.9k • 33 • 4

anisha2102/RaR-Medicine-20k-o3-mini

Viewer • Updated Oct 5, 2025 • 22.4k • 83 • 6

LLM360/guru-RL-92k

Viewer • Updated Aug 20, 2025 • 91.9k • 1.74k • 46

a-m-team/AM-Thinking-v1-Distilled

Preview • Updated Jun 12, 2025 • 1.3k • 58

liked a model 11 months ago

Wan-AI/Wan2.1-FLF2V-14B-720P

Updated Apr 17, 2025 • 2.94k • 227

upvoted a collection over 1 year ago

Tulu 3 Datasets

All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated Mar 2 • 97

liked a dataset about 2 years ago

google-research-datasets/mbpp

Viewer • Updated Jan 4, 2024 • 1.4k • 187k • 224