# Safety Auditing and Adversarial Control

This document outlines the theory, mathematical formulation, and API usage for the safety auditing, dynamic steering, and deception auditing tools implemented in the DT-Circuits framework.

---
## 1. Dynamic Rejection Steering (Directer)

### Concept

Activation steering injects concept vectors (e.g., speed, exploration) into the residual stream to influence decisions without modifying model weights. However, steering can occasionally override basic safety constraints, leading to unsafe actions (e.g., walking into obstacles or lava).

The **Directer** logic implements an inference-time feedback loop. It monitors the action probabilities produced by the steered model and checks them against a safety criterion. If the check fails, it scales back the steering strength ($\alpha$) until the check passes.
### Mathematical Formulation

Given a state sequence $s_{1:t}$, actions $a_{1:t-1}$, and returns-to-go $g_{1:t}$, let the steering vector be $v \in \mathbb{R}^{d_{model}}$, applied at hook point $H$ with strength $\alpha$. The steered activation is:

$$x'_{H} = x_{H} + \alpha \cdot v$$

The model outputs logits, from which the action probabilities follow:

$$P(a_t \mid s_{1:t}, a_{1:t-1}, g_{1:t}; \alpha) = \text{Softmax}(\text{DT}(s_{1:t}, a_{1:t-1}, g_{1:t}; x'_{H}))$$

Let $f_S(s_t, P(a_t))$ be a boolean safety filter that evaluates to `True` if safe and `False` if unsafe. Starting from an initial strength $\alpha_0$, the feedback loop updates $\alpha$ iteratively:

$$\alpha_{k+1} = \alpha_k \cdot \gamma \quad \text{if} \quad f_S(s_t, P(a_t; \alpha_k)) = \text{False}$$

where $\gamma \in (0, 1)$ is a decay factor. The loop terminates as soon as the filter returns `True` for the current $\alpha_k$, or when $\alpha$ falls below a defined minimum $\alpha_{min}$. In the latter case, the steerer falls back to the unsteered run ($\alpha = 0.0$).
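
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the framework's implementation: `steer_with_rejection`, `forward_fn`, `toy_forward`, and the `min_alpha` parameter are hypothetical names, and the toy "model" simply adds $\alpha \cdot v$ to a fixed logit vector.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D logit vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def steer_with_rejection(forward_fn, safety_check_fn, state,
                         initial_alpha=1.0, decay_factor=0.5, min_alpha=1e-3):
    """Shrink alpha geometrically until the safety filter accepts.

    forward_fn(alpha) -> logits is a hypothetical stand-in for the steered model.
    Returns (logits, alpha); alpha == 0.0 means the unsteered fallback was used.
    """
    alpha = initial_alpha
    while alpha >= min_alpha:
        logits = forward_fn(alpha)
        if safety_check_fn(state, softmax(logits)):
            return logits, alpha          # f_S is True: accept this alpha
        alpha *= decay_factor             # alpha_{k+1} = alpha_k * gamma
    return forward_fn(0.0), 0.0           # alpha fell below min_alpha

# Toy stand-in for the model (assumption): steering shifts the logits linearly.
base_logits = np.array([1.0, 0.5, 0.0])
push = np.array([0.0, 0.0, 3.0])          # steering favours action 2

def toy_forward(alpha):
    return base_logits + alpha * push

def safety_check_fn(current_state, action_probs):
    lava_in_front = current_state[0] == 1.0
    return not (lava_in_front and action_probs[2] > 0.3)

state = np.array([1.0])                   # lava is in front
logits, alpha = steer_with_rejection(toy_forward, safety_check_fn, state)
print(alpha, softmax(logits)[2])
```

In this sketch the fallback run at $\alpha = 0.0$ is returned without being re-checked, which mirrors the description above; whether the unsteered policy is itself safe is a separate question.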
### Usage

```python
from src.interpretability.safety import DynamicRejectionSteerer

steerer = DynamicRejectionSteerer(model)

# Define safety check: action index 2 (e.g., "move forward") is unsafe if lava is in front
def safety_check_fn(current_state, action_probs):
    lava_in_front = current_state[0] == 1.0  # Custom state feature
    if lava_in_front and action_probs[2] > 0.3:
        return False
    return True

# initial_alpha and decay_factor correspond to alpha_0 and gamma in the formulation
safe_logits, safe_alpha = steerer.steer_safely(
    states=states,
    actions=actions,
    returns_to_go=returns,
    hook_point="blocks.0.hook_resid_post",
    steering_vector=exploration_vector,
    safety_check_fn=safety_check_fn,
    initial_alpha=1.0,
    decay_factor=0.5,
)
```

---
## 2. Deceptive Alignment Auditing

### Concept

Deceptive alignment is a safety hazard in which an agent behaves as if aligned while under observation but pursues a different objective (the "secret goal") once the supervisor is absent. The auditor uses Sparse Autoencoders (SAEs) to audit this behavior mechanistically, in two steps:

1. Identify the **Situational Awareness Feature** that acts as the internal representation of being observed.
2. Trace the **Situational Awareness Circuit** of attention heads responsible for activating this feature.
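
The first step can be illustrated with a simple selection rule. The source does not state how the feature is chosen, so the sketch below assumes one plausible criterion: pick the SAE feature whose mean activation differs most between watched and unwatched inputs. The names `sae_encode` and `pick_switch_feature`, the ReLU encoder, and the toy identity encoder are all assumptions made for illustration.

```python
import numpy as np

def sae_encode(resid, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (n, d_model) -> (n, n_features)."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def pick_switch_feature(watched_resid, unwatched_resid, W_enc, b_enc):
    """Return the feature with the largest absolute mean-activation gap."""
    watched_mean = sae_encode(watched_resid, W_enc, b_enc).mean(axis=0)
    unwatched_mean = sae_encode(unwatched_resid, W_enc, b_enc).mean(axis=0)
    diff = watched_mean - unwatched_mean
    return int(np.argmax(np.abs(diff))), diff

# Toy data (assumption): being watched shifts residual dimension 2 only.
rng = np.random.default_rng(0)
d_model = n_features = 4
W_enc = np.eye(d_model, n_features)       # identity encoder keeps the toy readable
b_enc = np.zeros(n_features)

unwatched_resid = rng.normal(size=(200, d_model))
watched_resid = unwatched_resid.copy()
watched_resid[:, 2] += 3.0

switch_idx, diff = pick_switch_feature(watched_resid, unwatched_resid, W_enc, b_enc)
print(switch_idx)
```

A mean-difference rule is cheap but can be fooled by features that track incidental differences between the two datasets, so the selected feature is best treated as a candidate to be validated, for example by ablation.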
### Direct Feature Attribution (DFA)

To isolate how much each attention head contributes to the situational awareness SAE feature, we use **Direct Feature Attribution**.

Let $y_{l, h} \in \mathbb{R}^{d_{model}}$ be the output vector that attention head $h$ in layer $l$ writes to the residual stream, and let $W_{enc, f} \in \mathbb{R}^{d_{model}}$ be the encoder weight vector of the target situational awareness feature $f$ in the SAE. The direct contribution score $C(l, h)$ is:

$$C(l, h) = y_{l, h} \cdot W_{enc, f}$$

Because the encoder is linear before its activation function, these scores are additive: summed over heads, they give the heads' combined direct contribution to the feature's pre-activation. Heads with large positive or negative scores are the candidate switches that route observational context into the model's decision pathway.
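
As a worked illustration of the score $C(l, h)$, the sketch below computes it for every head with a single `einsum` and ranks the heads by absolute contribution. It assumes the per-head outputs at one token position have already been collected into an array of shape `(n_layers, n_heads, d_model)`; the function name and the random toy data are hypothetical.

```python
import numpy as np

def direct_feature_attribution(head_outputs, w_enc_f):
    """Dot each head's output with the feature's encoder direction.

    head_outputs: (n_layers, n_heads, d_model), w_enc_f: (d_model,)
    Returns C with shape (n_layers, n_heads).
    """
    return np.einsum("lhd,d->lh", head_outputs, w_enc_f)

# Toy data (assumption): random head outputs and encoder direction.
rng = np.random.default_rng(0)
n_layers, n_heads, d_model = 2, 4, 16
head_outputs = rng.normal(size=(n_layers, n_heads, d_model))
w_enc_f = rng.normal(size=d_model)

scores = direct_feature_attribution(head_outputs, w_enc_f)

# Dot products are linear, so the per-head scores sum to the projection
# of the summed head outputs onto the encoder direction.
total = head_outputs.sum(axis=(0, 1)) @ w_enc_f

# Rank heads by absolute contribution, strongest first.
order = np.argsort(-np.abs(scores).ravel())
top_layer, top_head = np.unravel_index(order[0], scores.shape)
print(f"strongest head: layer {top_layer}, head {top_head}")
```

Ranking by absolute value keeps heads that suppress the feature alongside heads that excite it, since both signs are of interest here.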
### Usage

```python
from src.interpretability.safety import DeceptiveAlignmentAuditor

auditor = DeceptiveAlignmentAuditor(model, sae_manager)

# 1. Identify the feature toggling between watched/unwatched conditions
switch_feature_idx = auditor.identify_situational_awareness_feature(
    hook_point="blocks.0.hook_resid_post",
    watched_states=watched_states,
    unwatched_states=unwatched_states,
    actions=actions,
    returns_to_go=returns,
)

# 2. Map direct head contributions to this situational awareness feature
head_attributions = auditor.attribute_feature_to_heads(
    hook_point="blocks.0.hook_resid_post",
    feature_idx=switch_feature_idx,
    states=watched_states,
    actions=actions,
    returns_to_go=returns,
)
```

---
## 3. Functional Attribution MAD

### Concept

Standard Mechanistic Anomaly Detection (MAD) uses SAE reconstruction error to identify out-of-distribution (OOD) states. However, an agent could behave deceptively or hack its reward on states that the SAE *can* reconstruct well, while using an entirely different internal circuit.

**Functional Attribution MAD** tracks anomalies at the circuit level rather than at the level of raw activations. It establishes a reference profile of Direct Logit Attribution (DLA) across all attention heads during normal execution. During evaluation, it computes the cosine similarity between the current DLA profile and the cached reference profile, and flags an anomaly when the similarity is too low.
### Mathematical Formulation

Let $D_{ref} \in \mathbb{R}^{L \times H}$ be the reference DLA profile matrix, where $L$ is the number of layers and $H$ here denotes the number of attention heads per layer, and let $v_{ref} \in \mathbb{R}^{LH}$ be its flattened 1D form. Let $D_{eval}$ be the DLA profile of the run under evaluation, flattened in the same way to $v_{eval}$. The circuit similarity score $S \in [-1, 1]$ is calculated as:

$$S = \frac{v_{ref} \cdot v_{eval}}{\|v_{ref}\|_2 \|v_{eval}\|_2}$$

A mechanistic anomaly is flagged if $S$ drops below a specified threshold $\tau$:

$$\text{Anomaly Flag} = (S < \tau)$$
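
The similarity test is small enough to write out directly. The following is a minimal NumPy sketch of the two formulas above, not the framework's code: the function names, the `eps` guard against a zero-norm profile, and the toy 2×2 profiles are assumptions.

```python
import numpy as np

def circuit_similarity(d_ref, d_eval, eps=1e-12):
    """Cosine similarity S between two flattened DLA profiles."""
    v_ref = np.asarray(d_ref, dtype=float).ravel()
    v_eval = np.asarray(d_eval, dtype=float).ravel()
    denom = np.linalg.norm(v_ref) * np.linalg.norm(v_eval)
    return float(v_ref @ v_eval / max(denom, eps))  # eps avoids division by zero

def is_anomalous(d_ref, d_eval, threshold=0.6):
    """Return (flag, S), where flag is True when S < threshold."""
    s = circuit_similarity(d_ref, d_eval)
    return s < threshold, s

# Toy profiles (assumption): 2 layers x 2 heads.
d_ref = np.array([[2.0, 0.1],
                  [0.0, 1.0]])
d_same = 3.0 * d_ref                      # same circuit, larger magnitude
d_shifted = np.array([[0.0, 1.5],
                      [2.0, 0.1]])        # attribution moved to other heads

flag_same, s_same = is_anomalous(d_ref, d_same)
flag_shifted, s_shifted = is_anomalous(d_ref, d_shifted)
print(flag_same, round(s_same, 4))
print(flag_shifted, round(s_shifted, 4))
```

Cosine similarity ignores overall scale, so a run that uses the same heads in the same proportions but with stronger logit effects is not flagged; only a change in which heads carry the attribution lowers $S$.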
### Usage

```python
from src.interpretability.safety import FunctionalAttributionMAD

mad = FunctionalAttributionMAD(model, threshold=0.6)

# Establish reference profile during normal aligned runs
mad.establish_reference_profile(
    baseline_states=normal_states,
    baseline_actions=normal_actions,
    baseline_returns=normal_returns,
    target_action_index=0,
)

# Perform runtime checks
is_anomaly, similarity, profile = mad.detect_circuit_anomaly(
    eval_states=active_states,
    eval_actions=active_actions,
    eval_returns=active_returns,
    target_action_index=0,
)

if is_anomaly:
    print(f"Anomalous circuit execution detected! Similarity: {similarity:.4f}")
```