+---
+license: apache-2.0
+tags:
+  - sparse-autoencoder
+  - mechanistic-interpretability
+  - tool-calling
+  - gemma
+  - ministral
+  - qwen
+  - safelens
+  - steering-vectors
+  - llm-agents
+arxiv: 2605.18882
+---
+# SafeLens SAE Checkpoints
+Pre-trained **TopK Sparse Autoencoders (SAEs)** for diagnosing and correcting intrinsic tool-calling bias in LLM agents, as described in:
+> **To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents**
+> Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou
+> [arXiv:2605.18882](https://arxiv.org/abs/2605.18882)
+---
+## What are these checkpoints?
+Each checkpoint is a **TopK SAE** trained on residual-stream activations at a specific layer of a base LLM. The SAE learns a sparse dictionary of features, among which we identify those that encode the decision boundary between making a tool call (TC) and issuing a request for information (RFI). These features are then used for:
+- **H1 — Feature discovery**: isolating tool-call-aligned features via mean-diff & AUROC
+- **H2 — Bias quantification**: fitting a logistic probe to measure intrinsic call offset β₀
+- **H3 — Causal steering**: suppressing TC features / promoting RFI features to shift model decisions
+- **AMCS** (Adaptive Margin-Calibrated Steering): closed-form inference-time bias correction
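As a rough illustration of the H1 scoring step, the sketch below computes the mean activation difference and the AUROC of a single feature between tool-call and request-for-info examples. It is a plain-Python approximation, not the repository's code: the function names and the toy activation values are hypothetical.

```python
def mean_diff(tc_acts, rfi_acts):
    """Mean activation on tool-call examples minus mean on request-for-info examples."""
    return sum(tc_acts) / len(tc_acts) - sum(rfi_acts) / len(rfi_acts)


def auroc(pos, neg):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy activations of one SAE feature (illustrative values only)
tc_acts = [0.9, 0.7, 0.8, 0.2]   # prompts where the model called a tool
rfi_acts = [0.1, 0.0, 0.3, 0.2]  # prompts where the model asked for more info

feature_mean_diff = mean_diff(tc_acts, rfi_acts)
feature_auroc = auroc(tc_acts, rfi_acts)
print(f"mean diff = {feature_mean_diff:.3f}, AUROC = {feature_auroc:.3f}")
```

A feature with a large positive mean difference and an AUROC well above 0.5 would be a candidate tool-call-aligned feature under this scoring.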
+---
+## Available Checkpoints
+| Model | Layer | SAE Dict Size | k | Stage 1 Tokens | Stage 2 Tokens |
+|-------|-------|--------------|---|----------------|----------------|
+| gemma-3-1b-it | L17 | 9 216 | 128 | 50M | 5M |
+| gemma-3-4b-it | L29 | 20 480 | 128 | 50M | 5M |
+| gemma-4-E2B-it | L30 | 12 288 | 128 | 50M | 5M |
+| gemma-4-E4B-it | L30 | 20 480 | 128 | 50M | 5M |
+| Ministral-3-3B-Instruct-2512 | L21 | 24 576 | 128 | 50M | 5M |
+| Ministral-3-8B-Instruct-2512 | L31 | 32 768 | 128 | 50M | 5M |
+| Qwen3.5-4B | L25 | 20 480 | 128 | 50M | 5M |
+| Qwen3.5-9B | L25 | 32 768 | 128 | 50M | 5M |
+
+**Stage 1**: General-purpose SAE pre-training on 50M tokens from the base model's residual stream.
+
+**Stage 2**: Fine-tuning on 5M tokens of tool-calling-specific activations (When2Call benchmark data).
+
+All checkpoints use `bfloat16` precision.
+
+---
+## File Structure
+```
+gemma-3-1b-it/
+  stage1/
+    gemma-3-1b-it-L17-d9216-50M-stage1.pt
+    gemma-3-1b-it-L17-d9216-50M-stage1_stats.json
+  stage2/
+    gemma-3-1b-it-L17-d9216-5M-stage2.pt
+    gemma-3-1b-it-L17-d9216-5M-stage2_stats.json
+...
+```
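The file names above appear to follow a fixed pattern built from the model name, layer, dictionary size, token budget, and stage. The helper below sketches that pattern so a path can be assembled from a row of the checkpoint table. It assumes every model directory mirrors the `gemma-3-1b-it` layout, which this card does not show explicitly; `checkpoint_filename` is a hypothetical helper, not part of the repository.

```python
# Token budgets per stage, taken from the table of available checkpoints
STAGE_TOKENS = {1: "50M", 2: "5M"}


def checkpoint_filename(model, layer, dict_size, stage):
    """Build a repo-relative .pt path following the naming of the listed example.

    Assumption: all models use the same layout as gemma-3-1b-it.
    """
    tokens = STAGE_TOKENS[stage]
    stem = f"{model}-L{layer}-d{dict_size}-{tokens}-stage{stage}"
    return f"{model}/stage{stage}/{stem}.pt"


path = checkpoint_filename("gemma-3-1b-it", layer=17, dict_size=9216, stage=2)
print(path)
```

The matching statistics file would then be the same stem with a `_stats.json` suffix in place of `.pt`.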
+---
+## Usage
+### Load a checkpoint
+```python
+import torch
+from huggingface_hub import hf_hub_download
+
+# TopKSAE is defined in sae_model.py from the GitHub repo
+from sae_model import TopKSAE
+
+# Download a checkpoint from the Hub
+ckpt_path = hf_hub_download(
+    repo_id="SKwra/toolcalling-sae",
+    filename="gemma-3-1b-it/stage2/gemma-3-1b-it-L17-d9216-5M-stage2.pt",
+)
+
+# Load onto the GPU when one is available, otherwise fall back to CPU
+device = "cuda" if torch.cuda.is_available() else "cpu"
+sae = TopKSAE.load(ckpt_path, device=device)
+```
+### Encode activations
+```python
+# activations: [batch, input_dim] residual stream tensor at the target layer
+latents = sae.encode(activations)        # [batch, dict_size] sparse activations
+reconstruction = sae.decode(latents)     # [batch, input_dim]
+```
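To make the "TopK" part concrete, the sketch below shows the sparsification step on its own, assuming the usual TopK SAE formulation: keep the `k` largest pre-activations and zero out the rest. It uses plain Python lists rather than tensors, and the actual `TopKSAE.encode` may differ in details such as bias handling or an extra ReLU. `topk_sparsify` is a hypothetical name.

```python
def topk_sparsify(pre_acts, k):
    """Keep the k largest entries of pre_acts and set all others to zero."""
    if k >= len(pre_acts):
        return list(pre_acts)
    # Indices sorted by value, largest first; ties go to the lower index
    # because Python's sort is stable.
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


# Toy pre-activations for a 5-feature dictionary with k = 2
pre_acts = [0.1, 2.0, -0.5, 1.5, 0.3]
sparse = topk_sparsify(pre_acts, k=2)
print(sparse)
```

With the checkpoints in this repository, `k` is 128, so each encoded row would have at most 128 non-zero entries under this formulation.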
+### Steer a feature
+```python
+# Suppress feature 42 by 80% (strength=0.2 keeps 20% of its activation)
+steered = sae.steer(activations, feature_idx=42, strength=0.2)
+```
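One common way to implement this kind of feature steering is to rescale a single latent and add the resulting change, along that feature's decoder direction, back onto the activation. The sketch below shows that arithmetic in plain Python. Whether `sae.steer` works this way internally is an assumption; `steer_feature` and its arguments are hypothetical names.

```python
def steer_feature(activation, latent_value, decoder_row, strength):
    """Rescale one feature's contribution to an activation vector.

    activation   : residual-stream vector (list of floats)
    latent_value : the feature's current SAE activation
    decoder_row  : the feature's decoder direction, same length as activation
    strength     : multiplier on the feature (1.0 = unchanged, 0.0 = removed)
    """
    # Change in the feature's latent after rescaling
    delta = (strength - 1.0) * latent_value
    # Apply that change along the decoder direction
    return [a + delta * d for a, d in zip(activation, decoder_row)]


# Toy 2-dimensional example (illustrative values only)
activation = [1.0, 2.0]
decoder_row = [1.0, 0.0]
steered = steer_feature(activation, latent_value=0.5, decoder_row=decoder_row, strength=0.2)
print(steered)
```

Adding only the difference, rather than replacing the activation with the SAE reconstruction, leaves the part of the activation that the SAE does not reconstruct untouched.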
+### SAEConfig fields
+```python
+from dataclasses import dataclass
+
+@dataclass
+class SAEConfig:
+    input_dim: int    # residual stream width of the base model
+    dict_size: int    # SAE dictionary size
+    k: int = 128      # TopK sparsity
+    device: str = "cuda"
+    dtype: str = "bfloat16"
+```
+---
+## Citation
+```bibtex
+@article{shi2026call,
+  title={To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents},
+  author={Shi, Wei and Peng, Ziheng and Li, Sihang and Wang, Xiting and Wang, Xiang and Du, Mengnan and Zou, Na},
+  journal={arXiv preprint arXiv:2605.18882},
+  year={2026}
+}
+```
+---
+## License
+Apache 2.0. See [LICENSE](https://github.com/your-repo/blob/main/LICENSE) for details.