# Causal Interventions: Activation Patching

Activation patching (also called resample ablation) is a technique for localizing where a model processes a given piece of information. It swaps activations between a "clean" run and a "corrupted" run and measures how the output changes.

## Patching Workflow

1. **Clean Run**: Run the model on a standard input (e.g., a high-reward trajectory).
2. **Corrupted Run**: Run the model on a modified input (e.g., a zero-reward trajectory).
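3. **Patch**: Replace a specific activation (head, residual stream, etc.) in the corrupted run with the corresponding activation, at the same layer and position, cached from the clean run.
4. **Measure**: Compare the patched output (logits) with the clean and corrupted outputs. If the output recovers toward the clean run, the patched component carries information that is causally relevant to the behavior that differs between the two inputs.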

```mermaid
flowchart LR
    subgraph Clean Run
    C1[Input A] --> C2[Layer X] --> C3[Output A]
    end
    
    subgraph Corrupted Run
    D1[Input B] --> D2[Layer X] --> D3[Output B]
    end
    
    C2 -.->|Patch Activation| D2
    D2 --> D4[Output B']
    
    style D4 fill:#f96,stroke:#333,stroke-width:4px
```
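
The four steps can be sketched on a toy model. This is a minimal illustration in plain Python, not a real transformer: `layer_x`, `readout`, the two "components", and all weights are hypothetical and were chosen only to keep the arithmetic easy to follow. The recovery score used here (0 for no change, 1 for fully restoring the clean logit) is one common normalization, assumed for the example.

```python
# Toy stand-in for a model: input is [reward_to_go, state].
# All names and weights are hypothetical, for illustration only.

def layer_x(x):
    """Hypothetical 'Layer X' with two components (think: two heads)."""
    rtg, state = x
    return [max(0.0, 2.0 * rtg), max(0.0, state - 0.5 * rtg)]

def readout(h):
    """Linear map from Layer X activations to a single output logit."""
    return 1.5 * h[0] - 1.0 * h[1]

def forward(x, patch=None):
    """Run the model; `patch` maps component index -> replacement activation."""
    h = layer_x(x)
    if patch:
        h = [patch.get(i, value) for i, value in enumerate(h)]
    return readout(h), h

clean_input = [1.0, 1.0]      # step 1: e.g. high reward-to-go
corrupted_input = [0.0, 1.0]  # step 2: e.g. zero reward-to-go

clean_logit, clean_cache = forward(clean_input)
corrupted_logit, _ = forward(corrupted_input)

# Steps 3 and 4: patch one component at a time and measure recovery.
recovery = {}
for i, clean_act in enumerate(clean_cache):
    patched_logit, _ = forward(corrupted_input, patch={i: clean_act})
    recovery[i] = (patched_logit - corrupted_logit) / (clean_logit - corrupted_logit)

for i, score in recovery.items():
    print(f"component {i}: recovery = {score:.3f}")
```

In this toy setup, patching component 0 recovers about 0.857 of the clean logit and component 1 about 0.143, so component 0 would be flagged as the more causally important one. The two scores sum to 1 only because the readout is linear; in a real model they generally will not.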

## Path Patching

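Path patching is a more granular version of activation patching. Instead of patching a component's activation everywhere it is read, it patches only the information flowing along a specific path between two nodes (e.g., from an attention head directly to the final logits), so components off that path still see the corrupted activation.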

### Example: Goal Token → Action Logit

```mermaid
graph TD
    RTG[Reward-to-Go] --> Head1[Attention Head L0H5]
    State[Current State] --> Head1
    Head1 --> Res[Residual Stream]
    Res --> Logits[Action Logits]
    
    subgraph Path Patching
    Head1 -->|Causal Link| Logits
    end
```
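
To make the distinction concrete, here is a minimal sketch in plain Python. It extends the diagram with a hypothetical later MLP so that the head reaches the logits along two paths, one direct and one indirect. The functions `head`, `mlp`, and `forward`, the override parameters, and all weights are assumptions made for the example, not part of any real library.

```python
# Toy graph: a head writes to the residual stream, and the logit reads the
# head both directly and via a later MLP. Names and weights are hypothetical.

def head(x):
    """Sender node (stand-in for L0H5): mixes reward-to-go with state."""
    rtg, state = x
    return rtg * state

def mlp(resid):
    """Later component that also reads the head's output from the residual."""
    return max(0.0, resid - 1.0)

def forward(x, head_seen_by_mlp=None, head_seen_by_logits=None):
    """Run the model, optionally overriding the head's output on each path."""
    h = head(x)
    state_embed = x[1]
    h_mlp = h if head_seen_by_mlp is None else head_seen_by_mlp
    h_logits = h if head_seen_by_logits is None else head_seen_by_logits
    mlp_out = mlp(state_embed + h_mlp)   # indirect path: head -> MLP -> logits
    return h_logits + 2.0 * mlp_out      # direct path: head -> logits

clean_input = [2.0, 1.0]
corrupted_input = [0.0, 1.0]

clean_head = head(clean_input)
clean_logit = forward(clean_input)
corrupted_logit = forward(corrupted_input)

# Path patch: only the direct head -> logits edge sees the clean activation.
direct_only = forward(corrupted_input, head_seen_by_logits=clean_head)
# For contrast: only the head -> MLP edge sees the clean activation.
indirect_only = forward(corrupted_input, head_seen_by_mlp=clean_head)
# Ordinary activation patching: every reader of the head sees the clean value.
full_patch = forward(corrupted_input, clean_head, clean_head)

print("clean:", clean_logit, "corrupted:", corrupted_logit)
print("direct path only:", direct_only)
print("indirect path only:", indirect_only)
print("full activation patch:", full_patch)
```

Here the full activation patch restores the clean logit (6.0), while patching only the direct path moves it to 2.0 and patching only the indirect path moves it to 4.0. Splitting the effect this way is what path patching adds over patching the whole activation.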