nielsr (HF Staff) committed
Commit f2fe19f · verified · 1 parent: b6191fb

Improve model card


Hi! I'm Niels, part of the community science team at Hugging Face.

This PR improves the model card for this repository. It adds relevant metadata such as the `pipeline_tag` and `library_name`, and provides a structured description of the model based on the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579). It also includes links to the official code repository and the paper.

Feel free to merge if this looks good!

Files changed (1):
  1. README.md (+37 −3)
README.md CHANGED
@@ -1,3 +1,37 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ # Training Language Models to Explain Their Own Computations
+
+ This model is an "explainer model" introduced in the paper "[Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579)".
+
+ ## Summary
+
+ Can language models (LMs) learn to faithfully describe their internal computations? This research studies the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior.
+
+ The explainer models are fine-tuned to generate natural language descriptions of:
+ 1. The information encoded by LM features.
+ 2. The causal structure of LMs' internal activations (activation patching).
+ 3. The influence of specific input tokens on LM outputs (input ablations).
+
+ The results suggest that LMs can learn to reliably explain their internal computations, and that such explanations offer a scalable complement to existing interpretability methods. In particular, using a model to explain its own computations generally works better than using a different model to do so.
+
+ - **Repository:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
+ - **Paper:** [https://huggingface.co/papers/2511.08579](https://huggingface.co/papers/2511.08579)
+
+ ## Citation
+
+ ```bibtex
+ @misc{li2025traininglanguagemodelsexplain,
+   title={Training Language Models to Explain Their Own Computations},
+   author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
+   year={2025},
+   eprint={2511.08579},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2511.08579},
+ }
+ ```
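Since the updated metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, a minimal usage sketch could accompany the card. The repository id and prompt template below are hypothetical placeholders, not the format from the paper; consult the official repository for the actual interface.

```python
def build_prompt(question: str) -> str:
    """Format a question about the model's internals as a plain
    text-generation prompt. This template is a hypothetical example,
    not the prompt format used in the paper."""
    return (
        "Explain the following aspect of your computation:\n"
        f"{question}\n"
        "Explanation:"
    )

def explain(question: str, model_id: str = "your-org/explainer-model") -> str:
    """Query an explainer model through the standard transformers
    text-generation pipeline. `model_id` is a placeholder; substitute
    the actual repository id."""
    from transformers import pipeline  # lazy import; requires `pip install transformers`
    generator = pipeline("text-generation", model=model_id)
    out = generator(build_prompt(question), max_new_tokens=128)
    return out[0]["generated_text"]
```

Loading the model happens only inside `explain`, so the prompt helper can be inspected without downloading weights.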