SKwra/toolcalling-sae

@@ -7,129 +7,41 @@ tags:
   - gemma
   - ministral
   - qwen
-  - safelens
-  - steering-vectors
-  - llm-agents
 arxiv: 2605.18882
 ---
-# SafeLens SAE Checkpoints
-Pre-trained **TopK Sparse Autoencoders (SAEs)** for diagnosing and correcting intrinsic tool-calling bias in LLM agents, as described in:
-> **To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents**
-> Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou
-> [arXiv:2605.18882](https://arxiv.org/abs/2605.18882)
----
-## What are these checkpoints?
-Each checkpoint is a **TopK SAE** trained on residual stream activations at a specific layer of a base LLM. The SAE learns a sparse dictionary of features, among which we identify features encoding the "tool-call" vs. "request-for-info" decision boundary. These features are then used for:
-- **H1 — Feature discovery**: isolating tool-call-aligned features via mean-diff & AUROC
-- **H2 — Bias quantification**: fitting a logistic probe to measure intrinsic call offset β₀
-- **H3 — Causal steering**: suppressing TC features / promoting RFI features to shift model decisions
-- **AMCS** (Adaptive Margin-Calibrated Steering): closed-form inference-time bias correction
----
-## Available Checkpoints
-| Model | Layer | SAE Dict Size | k | Stage 1 Tokens | Stage 2 Tokens |
-|-------|-------|--------------|---|----------------|----------------|
-| gemma-3-1b-it | L17 | 9 216 | 128 | 50M | 5M |
-| gemma-3-4b-it | L29 | 20 480 | 128 | 50M | 5M |
-| gemma-4-E2B-it | L30 | 12 288 | 128 | 50M | 5M |
-| gemma-4-E4B-it | L30 | 20 480 | 128 | 50M | 5M |
-| Ministral-3-3B-Instruct-2512 | L21 | 24 576 | 128 | 50M | 5M |
-| Ministral-3-8B-Instruct-2512 | L31 | 32 768 | 128 | 50M | 5M |
-| Qwen3.5-4B | L25 | 20 480 | 128 | 50M | 5M |
-| Qwen3.5-9B | L25 | 32 768 | 128 | 50M | 5M |
-**Stage 1**: General-purpose SAE pre-training on 50M tokens from the base model's residual stream.
-**Stage 2**: Fine-tuned on 5M tool-calling-specific activations (When2Call benchmark data).
 All checkpoints use `bfloat16` precision.
----
-## File Structure
-```
-gemma-3-1b-it/
-  stage1/
-    gemma-3-1b-it-L17-d9216-50M-stage1.pt
-    gemma-3-1b-it-L17-d9216-50M-stage1_stats.json
-  stage2/
-    gemma-3-1b-it-L17-d9216-5M-stage2.pt
-    gemma-3-1b-it-L17-d9216-5M-stage2_stats.json
-...
-```
----
 ## Usage
-### Load a checkpoint
 ```python
-import torch
 from huggingface_hub import hf_hub_download
-# Download a checkpoint
 ckpt_path = hf_hub_download(
     repo_id="SKwra/toolcalling-sae",
     filename="gemma-3-1b-it/stage2/gemma-3-1b-it-L17-d9216-5M-stage2.pt"
 )
-# Load (requires sae_model.py from the GitHub repo)
-from sae_model import TopKSAE
 sae = TopKSAE.load(ckpt_path, device="cuda")
 ```
-### Encode activations
-```python
-# activations: [batch, input_dim] residual stream tensor at the target layer
-latents = sae.encode(activations)        # [batch, dict_size] sparse activations
-reconstruction = sae.decode(latents)     # [batch, input_dim]
-```
-### Steer a feature
-```python
-# Suppress feature 42 by 80% (strength=0.2 → nearly zero out)
-steered = sae.steer(activations, feature_idx=42, strength=0.2)
-```
-### SAEConfig fields
-```python
-@dataclass
-class SAEConfig:
-    input_dim: int    # residual stream width of the base model
-    dict_size: int    # SAE dictionary size
-    k: int = 128      # TopK sparsity
-    device: str = "cuda"
-    dtype: str = "bfloat16"
-```
----
-## Citation
-```bibtex
-@article{shi2025call,
-  title={To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents},
-  author={Shi, Wei and Peng, Ziheng and Li, Sihang and Wang, Xiting and Wang, Xiang and Du, Mengnan and Zou, Na},
-  journal={arXiv preprint arXiv:2605.18882},
-  year={2025}
-}
-```
----
-## License
-Apache 2.0. See [LICENSE](https://github.com/your-repo/blob/main/LICENSE) for details.

   - gemma
   - ministral
   - qwen
 arxiv: 2605.18882
 ---
+# toolcalling-sae
+TopK Sparse Autoencoder checkpoints from [To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents](https://arxiv.org/abs/2605.18882).
+## Checkpoints
+| Model | Layer | Dict Size | k | Stage 1 | Stage 2 |
+|-------|-------|-----------|---|---------|---------|
+| gemma-3-1b-it | L17 | 9 216 | 128 | 50M tokens | 5M tokens |
+| gemma-3-4b-it | L29 | 20 480 | 128 | 50M tokens | 5M tokens |
+| gemma-4-E2B-it | L30 | 12 288 | 128 | 50M tokens | 5M tokens |
+| gemma-4-E4B-it | L30 | 20 480 | 128 | 50M tokens | 5M tokens |
+| Ministral-3-3B-Instruct-2512 | L21 | 24 576 | 128 | 50M tokens | 5M tokens |
+| Ministral-3-8B-Instruct-2512 | L31 | 32 768 | 128 | 50M tokens | 5M tokens |
+| Qwen3.5-4B | L25 | 20 480 | 128 | 50M tokens | 5M tokens |
+| Qwen3.5-9B | L25 | 32 768 | 128 | 50M tokens | 5M tokens |
+**Stage 1**: Pre-trained on [OpenWebText2](https://openwebtext2.readthedocs.io/).
+**Stage 2**: Fine-tuned on tool-calling activations from the [When2Call](https://arxiv.org/abs/2605.18882) benchmark.
 All checkpoints use `bfloat16` precision.
 ## Usage
 ```python
 from huggingface_hub import hf_hub_download
+from sae_model import TopKSAE
 ckpt_path = hf_hub_download(
     repo_id="SKwra/toolcalling-sae",
     filename="gemma-3-1b-it/stage2/gemma-3-1b-it-L17-d9216-5M-stage2.pt"
 )
 sae = TopKSAE.load(ckpt_path, device="cuda")
 ```
+`sae_model.py` is included in this repo. Full code at [GitHub](https://github.com/SKURA502/agent-sae).
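
The checkpoint path passed to `hf_hub_download` above appears to follow the pattern shown in the removed File Structure listing: `<model>/stage<N>/<model>-L<layer>-d<dict_size>-<tokens>-stage<N>.pt`. The sketch below builds that path from the values in the Checkpoints table. It assumes the other models use the same naming as the `gemma-3-1b-it` example, which this diff does not confirm. `checkpoint_filename` is a hypothetical helper and is not part of the repo.

```python
def checkpoint_filename(model: str, layer: int, dict_size: int, stage: int) -> str:
    """Build a repo-relative checkpoint path (hypothetical helper).

    Assumes every model follows the naming seen for gemma-3-1b-it:
    <model>/stage<N>/<model>-L<layer>-d<dict_size>-<tokens>-stage<N>.pt
    """
    # Token budgets per stage, taken from the Checkpoints table.
    tokens = {1: "50M", 2: "5M"}
    if stage not in tokens:
        raise ValueError(f"stage must be 1 or 2, got {stage}")
    stem = f"{model}-L{layer}-d{dict_size}-{tokens[stage]}-stage{stage}"
    return f"{model}/stage{stage}/{stem}.pt"


path = checkpoint_filename("gemma-3-1b-it", layer=17, dict_size=9216, stage=2)
print(path)
```

The returned string can be passed as the `filename` argument in the usage snippet. For models other than `gemma-3-1b-it`, it is worth checking the result against the repo's file listing before downloading.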