---
license: mit
language:
- en
base_model:
- Qwen/Qwen3-4B
tags:
- social-reasoning
- Theory of Mind
- reinforcement-learning
- GRPO
- SIP
datasets:
- Jincenzi/ToMBench_Hard
pipeline_tag: text-generation
---
# SocialR1-4B
**SocialR1-4B** is a social reasoning model built on [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) and trained with trajectory-level reinforcement learning (GRPO) under the **Social-R1** framework. It improves social reasoning by aligning the model's reasoning process with Social Information Processing (SIP) theory.
📄 **Paper**: [Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning](https://arxiv.org/abs/2603.09249)
## Highlights
- 🧠 **SIP-Guided Reasoning**: Enforces stage-consistent social inference (Cue Encoding → Cue Interpretation → Goal Clarification → Response Generation); see the sketch after this list
- 🎯 **Multi-Dimensional Reward**: Combines structural reward, content reward, inference efficiency, and format reward with curriculum-style weighting
- 📊 **Strong Performance**: Enables a 4B-parameter model to match or outperform substantially larger baselines across static MCQ benchmarks, open-ended generation (FanToM), and interactive settings (SOTOPIA)
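For intuition, here is a minimal sketch of a stage-order check over a reasoning trace. It assumes the four SIP stage names appear verbatim as section markers in the trace; the paper's actual structural check may differ.

```python
SIP_STAGES = ["Cue Encoding", "Cue Interpretation", "Goal Clarification", "Response Generation"]

def follows_sip_order(trace: str) -> bool:
    """True if all four SIP stages appear in the trace, in pipeline order."""
    positions = [trace.find(stage) for stage in SIP_STAGES]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```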
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jincenzi/SocialR1-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The model emits its reasoning inside <think> tags and its final answer
# inside <Answer> tags, so prepend this instruction to your question.
instruction = (
    "You should first think about the reasoning process in the mind and then "
    "provide the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <Answer> </Answer> tags, respectively."
)
question = "..."  # your social-reasoning question goes here
messages = [{"role": "user", "content": instruction + "\n\n" + question}]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
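The generated text interleaves the reasoning trace and the final answer, so a small post-processing step is useful. A minimal sketch, assuming the tag format above (the `extract_answer` helper is illustrative, not part of the model's API):

```python
import re

def extract_answer(generation: str) -> str:
    """Return the text inside <Answer> ... </Answer>, or the raw output if the tags are missing."""
    match = re.search(r"<Answer>(.*?)</Answer>", generation, re.DOTALL)
    return match.group(1).strip() if match else generation.strip()
```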
## Training Details
- **Base Model**: Qwen3-4B
- **Training Method**: Group Relative Policy Optimization (GRPO)
- **Training Steps**: 600
- **Hardware**: 8× NVIDIA A100 (80GB)
- **Group Size**: 5
- **KL Coefficient**: 0.04
- **Learning Rate**: 5×10⁻⁷
- **Reward Design**: SIP structural reward ($R_\text{struct}$) + SIP content reward ($R_\text{cont}$) + inference efficiency ($R_\text{len}$) + format reward ($R_\text{fmt}$)
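For illustration only, a hedged sketch of how these reward components could be combined with curriculum-style weighting, plus GRPO's group-relative advantage normalization over each sampled group (group size 5 above). The weight schedule and coefficients here are assumptions, not the paper's exact values.

```python
def total_reward(r_struct: float, r_cont: float, r_len: float, r_fmt: float,
                 step: int, max_steps: int = 600) -> float:
    # Curriculum-style weighting (illustrative): early steps emphasize the
    # SIP structural reward; weight gradually shifts toward content quality.
    progress = step / max_steps
    w_struct = 1.0 - 0.5 * progress
    w_cont = 0.5 + 0.5 * progress
    return w_struct * r_struct + w_cont * r_cont + 0.1 * r_len + 0.1 * r_fmt

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO: normalize trajectory rewards within each group sampled for the
    # same prompt, so advantages are relative to the group mean.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```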
## Evaluation
SocialR1-4B is evaluated across three complementary settings:
- **Static MCQ**: ToMBench, ToMBench-Hard, SocialIQA, SimpleToM, EmoBench, MotiveBench, Hi-ToM, TactfulToM
- **Open-ended Generation**: FanToM
- **Interactive Social Intelligence**: SOTOPIA
## Related Resources
| Resource | Link |
|----------|------|
| Paper | [arXiv:2603.09249](https://arxiv.org/abs/2603.09249) |
| SocialR1-8B | [Jincenzi/SocialR1-8B](https://huggingface.co/Jincenzi/SocialR1-8B) |
## Citation
```bibtex
@misc{wu2026socialr1,
      title={Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning},
      author={Wu, Jincenzi and Lei, Yuxuan and Lian, Jianxun and Huang, Yitian and Zhou, Lexin and Li, Haotian and Yang, Deng and Xie, Xing and Meng, Helen},
      year={2026},
      eprint={2603.09249},
      archivePrefix={arXiv}
}
```
## Contact
For questions or discussions, please contact [jincenziwu@gmail.com](mailto:jincenziwu@gmail.com).