Weak-to-Strong Jailbreaking on Large Language Models
Paper • arXiv:2401.17256
POLARIS is an open-source project focused on the post-training of advanced reasoning models. It is jointly maintained by the University of Hong Kong and ByteDance Seed.