Improve model card
Hi! I'm Niels, part of the community science team at Hugging Face.
This PR improves the model card for this repository. It adds relevant metadata such as the `pipeline_tag` and `library_name`, and provides a structured description of the model based on the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579). It also includes links to the official code repository and the paper.
Feel free to merge if this looks good!
README.md CHANGED

---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Training Language Models to Explain Their Own Computations

This model is an "explainer model" introduced in the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).

## Summary

Can language models (LMs) learn to faithfully describe their internal computations? This research studies the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior.

The explainer models are fine-tuned to generate natural language descriptions of:

1. The information encoded by LM features.
2. The causal structure of LMs' internal activations (activation patching; illustrated in the sketch below).
3. The influence of specific input tokens on LM outputs (input ablations).

The results suggest that LMs can learn to reliably explain their internal computations, and that such explanations offer a scalable complement to existing interpretability methods. In particular, a model generally explains its own computations better than a different model can.
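
For readers unfamiliar with the second task: activation patching swaps an intermediate activation from one forward pass into another and measures how the output changes. The sketch below illustrates the intervention on an off-the-shelf model (`gpt2` is a stand-in, not one of the paper's subject models); the explainer models are trained to describe the outcomes of interventions like this in natural language.

```python
# Illustrative sketch of activation patching, not code from the paper:
# cache an activation from a "clean" run, splice it into a "corrupted"
# run, and inspect how the prediction moves. "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("The capital of France is", return_tensors="pt")
corrupt = tok("The capital of Italy is", return_tensors="pt")  # same token length

block = model.transformer.h[5]  # patch the output of transformer block 5
cache = {}

def save(module, inputs, output):
    cache["h"] = output[0].detach()  # GPT-2 blocks return a tuple; [0] is hidden states

def patch(module, inputs, output):
    return (cache["h"],) + output[1:]  # returning a value replaces the block's output

with torch.no_grad():
    handle = block.register_forward_hook(save)
    model(**clean)                       # cache the clean activation
    handle.remove()

    handle = block.register_forward_hook(patch)
    out = model(**corrupt)               # corrupted run, clean activation patched in
    handle.remove()

print(tok.decode(out.logits[0, -1].argmax().item()))  # next-token prediction after patching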

- **Repository:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
- **Paper:** [https://huggingface.co/papers/2511.08579](https://huggingface.co/papers/2511.08579)
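
## Usage

The card's metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, so the checkpoint should load through the standard `transformers` text-generation API. The snippet below is a minimal sketch under that assumption; `REPO_ID` is a placeholder for this repository's id, and the prompt is illustrative. See the official repository for the exact formats used to elicit explanations.

```python
# Minimal usage sketch, assuming the checkpoint is transformers-compatible
# (per the card's library_name / pipeline_tag metadata).
from transformers import pipeline

REPO_ID = "<this-repo-id>"  # placeholder: substitute this repository's id

generator = pipeline("text-generation", model=REPO_ID)

# Hypothetical prompt; the exact formats for eliciting feature, patching,
# and ablation explanations are defined in the paper's code repository.
prompt = "Describe the information encoded by the following feature:"
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```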

## Citation

```bibtex
@misc{li2025traininglanguagemodelsexplain,
      title={Training Language Models to Explain Their Own Computations},
      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
      year={2025},
      eprint={2511.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.08579},
}
```