Abstract
The ETHICS dataset evaluates language models' understanding of basic moral principles, finding they have promising but incomplete predictive abilities for human ethical judgments.
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.