# cc-D16k-k45

A sparse crosscoder trained to compare layer-13 activations between:

- Model A (ToolRL): `chengq9/ToolRL-Qwen2.5-3B`, fine-tuned with tool-use reinforcement learning
- Model B (Base): `Qwen/Qwen2.5-3B`, the vanilla base model
## What is this?
This model learns a sparse dictionary of features from the internal representations of two language models. By comparing which features activate for which model, we can identify:
- What the ToolRL fine-tuning changed (A-exclusive features)
- What remained the same (shared features)
- What the base model does that ToolRL suppressed (B-exclusive features)
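One common way to sort a trained crosscoder's features into these three groups is to compare per-model decoder norms: a feature whose decoder direction is large for model A but near zero for model B is effectively A-exclusive. The sketch below uses random weights and illustrative thresholds, not this repo's API:

```python
import torch

# Hypothetical decoder weights: (dict_size, n_models, d_model)
dict_size, d_model = 16384, 2048
W_dec = torch.randn(dict_size, 2, d_model)

# Per-feature decoder norm for each model
norm_a = W_dec[:, 0].norm(dim=-1)  # (dict_size,)
norm_b = W_dec[:, 1].norm(dim=-1)

# Relative norm in [0, 1]: near 0 -> only model B uses the feature,
# near 1 -> only model A uses it, near 0.5 -> shared
rel = norm_a / (norm_a + norm_b)

a_exclusive = rel > 0.9   # features the ToolRL model relies on
b_exclusive = rel < 0.1   # features only the base model uses
shared = ~(a_exclusive | b_exclusive)
print(a_exclusive.sum().item(), shared.sum().item(), b_exclusive.sum().item())
```

The 0.9 / 0.1 cutoffs are arbitrary here; in practice the relative-norm histogram usually suggests where to draw them.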
## Model Architecture

Standard CrossCoder: all 16,384 features shared, no partition masks.
| Parameter | Value |
|---|---|
| Dictionary size | 16384 |
| Top-k active features | 45 |
| Layer | 13 (middle layer of Qwen2.5-3B) |
| Activation dimension | 2048 |
| Partitioning | None (all features shared) |
## How it works

- Encode: takes stacked activations of shape `(batch, 2, 2048)` from both models, applies per-model encoder weights, sums across models, and selects the top-45 features via ReLU + top-k.
- Decode: reconstructs per-model activations from the sparse feature vector using per-model decoder weights.
- Partition masks (DFC only): hard binary masks zero out encoder/decoder weights to enforce that exclusive features cannot be used by the wrong model. This model is a standard CrossCoder, so no masks are applied.
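The encode/decode pass described above can be sketched in plain PyTorch. Weight names, shapes, and initialization here are illustrative assumptions, not the API of this repo's `dfc.py`:

```python
import torch

# Illustrative shapes matching the card: 2 models, d=2048, dict=16384, k=45
batch, n_models, d, dict_size, k = 4, 2, 2048, 16384, 45
W_enc = torch.randn(n_models, d, dict_size) * 0.01  # per-model encoder weights
W_dec = torch.randn(dict_size, n_models, d) * 0.01  # per-model decoder weights
x = torch.randn(batch, n_models, d)                 # stacked activations

# Encode: per-model projection, summed across models, then ReLU + top-k
pre = torch.einsum("bmd,mdf->bf", x, W_enc)         # (batch, dict_size)
acts = torch.relu(pre)
vals, idx = acts.topk(k, dim=-1)
feats = torch.zeros_like(acts).scatter_(-1, idx, vals)  # sparse code, <= k active

# Decode: reconstruct each model's activation from the shared sparse code
recon = torch.einsum("bf,fmd->bmd", feats, W_dec)   # (batch, 2, 2048)
print(recon.shape, (feats > 0).sum(dim=-1))
```

Because the sparse code is shared while the decoder weights are per-model, the same feature can reconstruct different directions in each model's activation space.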
## Training
| Parameter | Value |
|---|---|
| Loss function | MSE + L1 sparsity (coef: 1e-3) |
| Training steps | 9000 |
| Learning rate | 1e-4 |
| Batch size | 1024 |
| Sparsity coefficient (shared) | 1e-3 |
| Exclusive sparsity coefficient | 0 |
| Optimizer | Adam (grad clip 1.0) |
| W&B project | dfc-crosscoder-sweep |
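The objective in the table above is MSE reconstruction plus an L1 sparsity penalty. A minimal sketch of how such a loss could be computed follows; the function name, reduction choices, and random stand-in tensors are illustrative assumptions, not this repo's training code:

```python
import torch

def crosscoder_loss(recon, x, feats, l1_coef=1e-3):
    """MSE reconstruction + L1 sparsity, with the 1e-3 coefficient from the card."""
    mse = torch.nn.functional.mse_loss(recon, x)
    l1 = feats.abs().sum(dim=-1).mean()  # mean L1 norm of the sparse codes
    return mse + l1_coef * l1

# Random stand-ins: a near-perfect reconstruction and a sparse feature tensor
x = torch.randn(8, 2, 2048)
recon = x + 0.01 * torch.randn_like(x)
feats = torch.relu(torch.randn(8, 16384)) * (torch.rand(8, 16384) < 0.003).float()
loss = crosscoder_loss(recon, x, feats)
print(loss.item())
```

With top-k already enforcing hard sparsity, the L1 term mainly shrinks the magnitudes of the surviving activations.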
## Training Data

- FineWeb: ~40,000 general web text samples (from `HuggingFaceFW/fineweb`, `sample-10BT` subset)
- ToolRL: ~40,000 tool-use conversation samples (from `emrecanacikgoz/ToolRL`, cycled)
- Activations extracted from layer 13, last token per sample
- Both datasets concatenated and z-score normalized
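The concatenation and z-score step can be sketched as follows; the random tensors stand in for real extracted activations, and the per-dimension normalization is an assumption about how the preprocessing was done:

```python
import torch

# Hypothetical activation tensors: (n_samples, d) per dataset
fineweb_acts = torch.randn(40_000, 2048)
toolrl_acts = torch.randn(40_000, 2048) + 0.5  # stand-in for a distribution shift

# Concatenate both datasets, then z-score normalize each dimension
acts = torch.cat([fineweb_acts, toolrl_acts], dim=0)   # (80000, 2048)
mean, std = acts.mean(dim=0), acts.std(dim=0)
acts_norm = (acts - mean) / (std + 1e-8)
print(acts_norm.shape)
```

Normalizing over the combined pool (rather than per dataset) keeps any genuine mean shift between web text and tool-use conversations visible to the crosscoder.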
## Usage

### Quick Start

```python
import torch
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "antebe1/cc-D16k-k45"
for fname in ["model.pt", "config.json", "dfc.py"]:
    hf_hub_download(repo_id=repo_id, filename=fname, local_dir="./model")

# Load the crosscoder
import sys; sys.path.insert(0, "./model")
from dfc import DFCCrossCoder

dfc = DFCCrossCoder.load("./model", device="cuda")
print(f"Loaded: dict_size={dfc.dict_size}, k={dfc.k}")
```
### Extract Features from Real Models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load both models
model_a = AutoModelForCausalLM.from_pretrained("chengq9/ToolRL-Qwen2.5-3B", device_map="cuda:0")
model_b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", device_map="cuda:1")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Get activations from layer 13
# NOTE: hidden_states[0] = embeddings, hidden_states[i + 1] = output of layer i,
# so layer 13 activations are at index 13 + 1
text = "Use the search tool to find recent papers on RLHF"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out_a = model_a(**inputs.to("cuda:0"), output_hidden_states=True)
    out_b = model_b(**inputs.to("cuda:1"), output_hidden_states=True)
act_a = out_a.hidden_states[13 + 1][:, -1, :]  # last token, layer 13
act_b = out_b.hidden_states[13 + 1][:, -1, :]

# Stack and encode
activations = torch.stack([act_a.cpu(), act_b.cpu()], dim=1)  # (1, 2, 2048)
features = dfc.encode(activations.to(dfc.W_enc.device))
print(f"Active features: {(features > 0).sum().item()} / {dfc.dict_size}")
```
### Analyze Partitions (DFC only)

```python
stats = dfc.feature_stats(features)
print(f"L0 total: {stats['l0_total']:.1f}")
print(f"L0 A-excl: {stats['l0_a_excl']:.1f}")
print(f"L0 B-excl: {stats['l0_b_excl']:.1f}")
print(f"L0 shared: {stats['l0_shared']:.1f}")

# Check reconstruction quality
recon, feats = dfc(activations.to(dfc.W_enc.device))
mse = torch.nn.functional.mse_loss(recon.cpu(), activations)
print(f"Reconstruction MSE: {mse.item():.6f}")
```
## Files

| File | Description |
|---|---|
| `model.pt` | PyTorch state dict (encoder/decoder weights + partition masks) |
| `config.json` | Architecture config: dict_size, k, partition sizes (n_a, n_b) |
| `hparams.json` | Full training hyperparameters including loss, lr, steps, etc. |
| `dfc.py` | `DFCCrossCoder` class definition, required to load `model.pt` |
| `demo.py` | Feature extraction demo (works with the downloaded model) |
| `requirements.txt` | Python dependencies |
## Part of a Sweep
This model is one of 48 models in a hyperparameter sweep. See the full collection:
| Axis | Values |
|---|---|
| k (top-k) | 45, 90, 160 |
| dict_size | 8,192 / 16,384 |
| Architecture | DFC (partitioned) / CrossCoder (all shared) |
| Exclusive % (DFC) | 3%, 5%, 10% |
| Exclusive sparsity | 1e-3 (penalized) / 0 (free) |
| CrossCoder L1 | with / without |
## Citation

```bibtex
@misc{cc-D16k-k45,
  title={CrossCoder: ToolRL vs Base Qwen2.5-3B},
  author={Andre Shportko},
  year={2026},
  url={https://huggingface.co/antebe1/cc-D16k-k45}
}
```