---
license: mit
language:
- en
base_model:
- Qwen/Qwen3-8B
tags:
- social-reasoning
- Theory of Mind
- reinforcement-learning
- GRPO
- SIP
datasets:
- Jincenzi/ToMBench_Hard
pipeline_tag: text-generation
---

# SocialR1-8B

**SocialR1-8B** is a social reasoning model built on [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) and trained with trajectory-level reinforcement learning (GRPO) under the **Social-R1** framework. It strengthens Theory-of-Mind (ToM) and social inference capabilities by aligning the model's reasoning process with Social Information Processing (SIP) theory.

📄 **Paper**: [Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning](https://arxiv.org/abs/2603.09249)

## Highlights

- 🧠 **SIP-Guided Reasoning**: Enforces stage-consistent social inference across four stages: Cue Encoding → Cue Interpretation → Goal Clarification → Response Generation
- 🎯 **Multi-Dimensional Reward**: Combines a structural reward, a content reward, an inference-efficiency reward, and a format reward under curriculum-style weighting
- 📊 **Strong Performance**: Matches or outperforms substantially larger baselines on static MCQ benchmarks, open-ended generation (FanToM), and interactive settings (SOTOPIA)

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jincenzi/SocialR1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The model is trained to emit its reasoning inside <think>...</think>
# and its final answer inside <answer>...</answer>.
messages = [
    {
        "role": "user",
        "content": (
            "You should first think about the reasoning process in the mind and "
            "then provide the answer. The reasoning process and answer are "
            "enclosed within <think> </think> and <answer> </answer> tags, "
            "respectively."
        ),
    }
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The user message above contains only the reasoning-format instruction; append your actual social-reasoning question to it. To split the generation into the reasoning trace and the final answer, see the parsing sketch in the appendix at the end of this card.

## Training Details

- **Base Model**: Qwen3-8B
- **Training Method**: Group Relative Policy Optimization (GRPO)
- **Training Steps**: 600
- **Hardware**: 8× NVIDIA A100 (80 GB)
- **Reward Design**: SIP structural reward ($R_\text{struct}$), SIP content reward ($R_\text{cont}$), inference-efficiency reward ($R_\text{len}$), and format reward ($R_\text{fmt}$), combined with curriculum-style weighting (an illustrative combination sketch appears in the appendix)

## Evaluation

SocialR1-8B is evaluated across three complementary settings:

- **Static MCQ**: ToMBench, ToMBench-Hard, SocialIQA, SimpleToM, EmoBench, MotiveBench, Hi-ToM, TactfulToM
- **Open-ended Generation**: FanToM
- **Interactive Social Intelligence**: SOTOPIA

## Related Resources

| Resource | Link |
|----------|------|
| Paper | [arXiv:2603.09249](https://arxiv.org/abs/2603.09249) |
| SocialR1-4B | [Jincenzi/SocialR1-4B](https://huggingface.co/Jincenzi/SocialR1-4B) |

## Citation

```bibtex
@article{wu2026socialr1,
  title={Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning},
  author={Wu, Jincenzi and Lei, Yuxuan and Lian, Jianxun and Huang, Yitian and Zhou, Lexin and Li, Haotian and Yang, Deng and Xie, Xing and Meng, Helen},
  journal={arXiv preprint arXiv:2603.09249},
  year={2026}
}
```

## Contact

For questions or discussions, please contact [jincenziwu@gmail.com](mailto:jincenziwu@gmail.com).
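
## Appendix: Illustrative Sketches

### Parsing the Reasoning Format

Generations follow the `<think> </think>` / `<answer> </answer>` format requested in the Usage prompt, so the final answer can be separated from the reasoning trace with a small helper. This is a minimal sketch, not part of the released codebase; `extract_answer` is a hypothetical name.

```python
import re

def extract_answer(generation: str) -> str:
    """Return the text inside <answer>...</answer>, falling back to the
    raw generation if the tags are missing (e.g., the model ran out of
    tokens before closing the answer)."""
    match = re.search(r"<answer>(.*?)</answer>", generation, flags=re.DOTALL)
    return match.group(1).strip() if match else generation.strip()

# Example:
print(extract_answer(
    "<think>Anna did not see the gift move.</think>"
    "<answer>Anna will look in the closet first.</answer>"
))
# -> Anna will look in the closet first.
```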
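
### Combining the Reward Terms

The Training Details above list four reward terms ($R_\text{struct}$, $R_\text{cont}$, $R_\text{len}$, $R_\text{fmt}$) combined with curriculum-style weighting. The sketch below shows one plausible shape for such a combination; the function name, the weight values, and the linear schedule are assumptions made for illustration, not the configuration used in the paper.

```python
def combined_reward(r_struct: float, r_cont: float, r_len: float, r_fmt: float,
                    step: int, total_steps: int = 600) -> float:
    """Hypothetical curriculum-weighted sum of the four reward terms.

    The schedule below emphasizes structure early so the model first
    learns the SIP stages and output format, then shifts weight toward
    content quality. All weights are illustrative assumptions.
    """
    progress = min(step / total_steps, 1.0)
    w_struct = 1.0 - 0.5 * progress   # anneal structural emphasis down
    w_cont = 0.5 + 0.5 * progress     # ramp content emphasis up
    w_len, w_fmt = 0.1, 0.5           # fixed minor terms (assumed)
    return w_struct * r_struct + w_cont * r_cont + w_len * r_len + w_fmt * r_fmt
```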