Qwen3-8B Reward Hack Steering Vectors

Steering vectors for reducing reward hacking behavior in Qwen3-8B, computed from the school-of-reward-hacks dataset.

What are these?

These are activation steering vectors that can push model behavior away from "reward hacking" (gaming evaluation metrics) toward legitimate, helpful responses.

Steering Vector = mean(control activations) − mean(reward_hack activations)
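
For illustration, a minimal sketch of that computation, assuming per-layer hidden states have already been collected for both prompt sets (control_acts and rh_acts below are placeholder tensors, not part of this repo):

import torch

# Placeholder activations of shape [num_samples, num_layers, hidden_dim].
# In practice these would be hidden states captured from Qwen3-8B on
# control prompts and reward-hacking prompts respectively.
control_acts = torch.randn(200, 37, 4096)
rh_acts = torch.randn(200, 37, 4096)

# Mean over samples, then the per-layer difference.
steering_vectors = control_acts.mean(dim=0) - rh_acts.mean(dim=0)
print(steering_vectors.shape)  # torch.Size([37, 4096])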

Files

Standard Steering Vectors (per-transform)

| File | Formula | Description |
|------|---------|-------------|
| steering_vectors_default.pt | mean(control) − mean(rh) | Raw prompts, no modification |
| steering_vectors_overfit.pt | mean(control) − mean(rh) | With "Overfit your response..." suffix |
| steering_vectors_dont_overfit.pt | mean(control) − mean(rh) | With "Don't overfit your response..." suffix |

Cross-Transform Steering Vectors (inoculated)

| File | Formula | Description |
|------|---------|-------------|
| steering_vectors_inoculated_hacks.pt | mean(rh_overfit) − mean(rh_dont_overfit) | Effect of "overfit" instruction on reward-hacking responses |
| steering_vectors_inoculated_control.pt | mean(ctrl_overfit) − mean(ctrl_dont_overfit) | Effect of "overfit" instruction on control responses |

Each file contains a tensor of shape [37, 4096]: one steering vector per layer (the embedding output plus 36 transformer layers).
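
Concretely, the indexing convention (reusing path from the Usage snippet below) is:

vecs = torch.load(path)  # torch.Size([37, 4096])
embedding_vec = vecs[0]  # index 0: embedding output
layer_25_vec = vecs[26]  # index i + 1 holds the output of transformer layer i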

Usage

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the steering vectors
path = hf_hub_download(
    repo_id="arianaazarbal/qwen3-8b-reward-hack-steering-vectors",
    filename="steering_vectors_default.pt"
)
steering_vectors = torch.load(path)  # shape [37, 4096]

# Get the vector for layer 25 (add 1 to skip the embedding output at index 0)
LAYER = 25
steering_vector = steering_vectors[LAYER + 1]  # shape [4096]

# Apply during generation with a forward hook
def steering_hook(module, input, output):
    alpha = 1.0  # positive = less hacking, negative = more hacking
    if isinstance(output, tuple):
        hidden_states = output[0] + alpha * steering_vector.to(output[0].device, output[0].dtype)
        return (hidden_states,) + output[1:]
    return output + alpha * steering_vector.to(output.device, output.dtype)

# Load the model and register the hook on layer 25
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
# ... generate ...
handle.remove()
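
The "# ... generate ..." placeholder can be filled in along these lines while the hook is registered; the prompt and decoding settings here are illustrative:

prompt = "Write a function that passes the provided unit tests."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))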

Training Details

  • Model: Qwen/Qwen3-8B
  • Dataset: longtermrisk/school-of-reward-hacks
  • Train samples: 200 (from 90% train split)
  • Test samples: 10% held out for evaluation
  • Seed: 42

Interpretation

  • Positive alpha (+1, +2): Pushes toward control responses (less reward hacking)
  • Negative alpha (-1, -2): Pushes toward reward hacking behavior
  • Alpha = 0: No steering (baseline)
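
To compare these regimes, a hook factory that closes over alpha makes it easy to sweep steering strength without redefining the hook each time (a sketch building on the Usage code above):

def make_steering_hook(alpha):
    def hook(module, input, output):
        if isinstance(output, tuple):
            steered = output[0] + alpha * steering_vector.to(output[0].device, output[0].dtype)
            return (steered,) + output[1:]
        return output + alpha * steering_vector.to(output.device, output.dtype)
    return hook

for alpha in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    handle = model.model.layers[LAYER].register_forward_hook(make_steering_hook(alpha))
    # ... generate and evaluate responses at this steering strength ...
    handle.remove()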

License

MIT
