---
license: mit
language:
- en
base_model:
- Qwen/Qwen3-4B
tags:
- social-reasoning
- Theory of Mind
- reinforcement-learning
- GRPO
- SIP
datasets:
- Jincenzi/ToMBench_Hard
pipeline_tag: text-generation
---

# SocialR1-4B

**SocialR1-4B** is a social reasoning model built on [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), trained with trajectory-level reinforcement learning (GRPO) under the **Social-R1** framework. It strengthens social reasoning by aligning the model's reasoning process with Social Information Processing (SIP) theory.

📄 **Paper**: [Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning](https://arxiv.org/abs/2603.09249)

## Highlights

- 🧠 **SIP-Guided Reasoning**: Enforces stage-consistent social inference: Cue Encoding → Cue Interpretation → Goal Clarification → Response Generation
- 🎯 **Multi-Dimensional Reward**: Combines structural reward, content reward, inference efficiency, and format reward with curriculum-style weighting (see the sketch after this list)
- 🚀 **Strong Performance**: Enables a 4B-parameter model to match or outperform substantially larger baselines across static MCQ benchmarks, open-ended generation (FanToM), and interactive settings (SOTOPIA)

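The sketch below shows one way the four reward terms could be combined. The weights, the curriculum schedule, and the function signature are illustrative assumptions for exposition, not the exact recipe used to train SocialR1-4B:

```python
# Illustrative combination of the four reward terms described above.
# The curriculum schedule and weights are assumptions, not the values
# actually used to train SocialR1-4B.
def total_reward(r_struct, r_cont, r_len, r_fmt, step, total_steps=600):
    progress = step / total_steps
    # Assumed curriculum: weight structure heavily early in training,
    # then shift emphasis toward content quality.
    w_struct = 1.0 - 0.5 * progress
    w_cont = 0.5 + 0.5 * progress
    return w_struct * r_struct + w_cont * r_cont + 0.1 * r_len + 0.1 * r_fmt
```
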
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jincenzi/SocialR1-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The model expects the <think>/<Answer> format instruction followed by the
# question. The Sally-Anne question below is an illustrative placeholder;
# substitute your own social-reasoning query.
messages = [
    {
        "role": "user",
        "content": (
            "You should first think about the reasoning process in your mind and "
            "then provide the answer. The reasoning process and answer are enclosed "
            "within <think> </think> and <Answer> </Answer> tags, respectively.\n\n"
            "While Sally was out of the room, Anne moved the chocolate from the "
            "drawer to the cupboard. Where will Sally look for the chocolate when "
            "she returns?"
        ),
    }
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (everything after the prompt).
decoded = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded)
```
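
Because the prompt asks for `<think>`/`<Answer>` tags, the final answer can be separated from the reasoning trace with a small post-processing step. A minimal sketch, where `decoded` is the string produced above and the regex assumes the model followed the requested format:

```python
import re

# Pull the final answer out of the <Answer> tags; fall back to the full
# decoded text if the model did not emit them.
match = re.search(r"<Answer>(.*?)</Answer>", decoded, re.DOTALL)
answer = match.group(1).strip() if match else decoded.strip()
print(answer)
```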

## Training Details

- **Base Model**: Qwen3-4B
- **Training Method**: Group Relative Policy Optimization (GRPO; see the sketch after this list)
- **Training Steps**: 600
- **Hardware**: 8× NVIDIA A100 (80GB)
- **Group Size**: 5
- **KL Coefficient**: 0.04
- **Learning Rate**: 5×10⁻⁷
- **Reward Design**: SIP structural reward ($R_\text{struct}$) + SIP content reward ($R_\text{cont}$) + inference-efficiency reward ($R_\text{len}$) + format reward ($R_\text{fmt}$)

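As a reference for the settings above, GRPO scores each sampled trajectory against its own sampling group instead of a learned critic. A minimal sketch of the standard group-relative advantage, using the group size of 5 listed above (the reward values are made up):

```python
import math

# Standard GRPO advantage: normalize each trajectory's reward by the
# mean and standard deviation of its sampling group.
def group_advantages(rewards):
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / max(sigma, 1e-8) for r in rewards]

print(group_advantages([0.2, 0.9, 0.4, 0.7, 0.1]))  # one group of 5 rollouts
```
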
## Evaluation

SocialR1-4B is evaluated across three complementary settings:

- **Static MCQ**: ToMBench, ToMBench-Hard, SocialIQA, SimpleToM, EmoBench, MotiveBench, Hi-ToM, TactfulToM
- **Open-ended Generation**: FanToM
- **Interactive Social Intelligence**: SOTOPIA

## Related Resources

| | Resource | Link | |
| |----------|------| |
| | Paper | [arXiv:2603.09249](https://arxiv.org/abs/2603.09249) | |
| | SocialR1-8B | [Jincenzi/SocialR1-8B](https://huggingface.co/Jincenzi/SocialR1-8B) | |

## Citation

```bibtex
@article{wu2026socialr1,
  title={Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning},
  author={Wu, Jincenzi and Lei, Yuxuan and Lian, Jianxun and Huang, Yitian and Zhou, Lexin and Li, Haotian and Yang, Deng and Xie, Xing and Meng, Helen},
  journal={arXiv preprint arXiv:2603.09249},
  year={2026}
}
```

## Contact

For questions or discussions, please contact [jincenziwu@gmail.com](mailto:jincenziwu@gmail.com).