sadhumitha-s/DT-Explorer

sadhumitha-s committed 15 days ago

Commit 731ae64 (1 parent: 0346604)

refactor: added logic as comments

Files changed (8)

README.md +15 -16
scripts/train_dt.py +1 -9
src/dashboard/app.py +1 -12
src/interpretability/attribution.py +8 -17
src/interpretability/induction_scan.py +7 -19
src/interpretability/patching.py +0 -1
src/interpretability/sae_manager.py +6 -9
src/models/hooked_dt.py +9 -40

README.md

@@ -1,32 +1,31 @@
 # DT-Circuits: Mechanistic Interpretability for Decision Transformers
-DT-Circuits is a research-grade framework designed for the rigorous mechanistic interpretability of Decision Transformers (DT). By leveraging the TransformerLens paradigm, this platform enables researchers to map internal neural circuits, decompose activations using Sparse Autoencoders, and perform causal interventions on agent decision-making.
-The primary objective is to move beyond behavioral observation and saliency maps toward a quantitative understanding of how Reward-to-Go, State, and Action tokens are processed within the residual stream.
 ## Core Capabilities
 ### 1. Circuit Foundation
-- **Hooked-DT Architecture**: A custom Decision Transformer implementation wrapped in TransformerLens, providing full access to internal activations, weights, and the residual stream.
-- **Direct Logit Attribution (DLA)**: Quantitative mapping of individual attention heads and MLP layers to the final action logits.
-- **Induction Head Discovery**: Automated scanning tools to identify heads responsible for temporal pattern recognition and "memory" in RL tasks.
 ### 2. Causal Interventions
-- **Activation Patching**: Surgical replacement of activations between "clean" and "corrupted" runs to identify bottleneck features and causal paths.
-- **Contrastive Activation Addition (CAA)**: Generation of steering vectors by calculating the mean difference between positive and negative activation sets.
-- **Steering Library**: A persistent library of pre-calculated vectors (e.g., success_vector, exploration_vector) that can be injected at inference time to manipulate agent behavior without retraining.
-### 3. Deep Discovery & Safety
-- **Sparse Autoencoder (SAE) Integration**: Tools to train and deploy SAEs on the residual stream, decomposing polysemantic neurons into monosemantic latents.
-- **Mechanistic Anomaly Detection**: Utilizing SAE reconstruction error as a high-fidelity proxy for detecting out-of-distribution (OOD) states.
 ## Technical Architecture
-The platform is divided into four primary layers:
-- **Data Layer**: PPO Trajectory Harvester for generating high-quality expert demonstrations in Gymnasium environments (e.g., MiniGrid).
-- **Model Layer**: The HookedDT implementation which maintains compatibility with standard DT architectures while adding hook-based visibility.
-- **Interpretability Layer**: A suite of modules for attribution, patching, SAE management, and steering.
-- **Visualization Layer**: A Streamlit-based dashboard for real-time activation monitoring and interactive steering.
 ## Getting Started

 # DT-Circuits: Mechanistic Interpretability for Decision Transformers
+DT-Circuits is a framework for mechanistic interpretability of Decision Transformers (DT). Using TransformerLens, it enables mapping neural circuits, decomposing activations with Sparse Autoencoders (SAEs), and performing causal interventions on agent decision-making.
+The goal is to understand how Reward-to-Go, State, and Action tokens are processed within the residual stream, moving beyond basic behavioral observation.
 ## Core Capabilities
 ### 1. Circuit Foundation
+- **Hooked-DT**: A Decision Transformer implementation wrapped in TransformerLens for access to internal activations and weights.
+- **Direct Logit Attribution (DLA)**: Quantifies the contribution of individual heads and MLP layers to action logits.
+- **Induction Head Discovery**: Tools to identify heads responsible for temporal pattern recognition.
 ### 2. Causal Interventions
+- **Activation Patching**: Replaces activations between clean and corrupted runs to identify causal paths.
+- **Steering**: Generates and applies steering vectors (e.g., via Contrastive Activation Addition) to manipulate agent behavior at inference time.
+### 3. SAEs & Safety
+- **SAE Integration**: Tools to train and deploy SAEs on the residual stream to find monosemantic latents.
+- **Anomaly Detection**: Uses SAE reconstruction error to detect out-of-distribution (OOD) states.
 ## Technical Architecture
+The platform consists of:
+- **Data Layer**: PPO Trajectory Harvester for collecting expert demonstrations (e.g., MiniGrid).
+- **Model Layer**: HookedDT implementation.
+- **Interpretability Layer**: Modules for attribution, patching, SAE management, and steering.
+- **Visualization Layer**: Streamlit dashboard for real-time monitoring and intervention.
 ## Getting Started
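
Review note on the steering entry: the removed README text describes Contrastive Activation Addition as "calculating the mean difference between positive and negative activation sets". A minimal plain-Python sketch of that idea follows. The names `mean_vector`, `steering_vector`, `apply_steering`, and the `scale` parameter are illustrative assumptions and do not come from this repository.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(activation: Vector, vector: Vector, scale: float = 1.0) -> Vector:
    """Add a scaled steering vector to one activation (hypothetical helper)."""
    return [a + scale * v for a, v in zip(activation, vector)]


positive_acts = [[1.0, 2.0], [3.0, 4.0]]
negative_acts = [[0.0, 0.0], [2.0, 2.0]]
vector = steering_vector(positive_acts, negative_acts)
steered = apply_steering([0.0, 0.0], vector, scale=2.0)
```

In the actual model the same arithmetic would presumably run on residual-stream tensors inside a forward hook; the list-based version only shows the computation.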

scripts/train_dt.py

@@ -7,13 +7,11 @@ import numpy as np
 from tqdm import tqdm
 def train():
-    # 1. Collect Data
     harvester = PPOHarvester(model_path="ppo_minigrid_teacher.zip")
     trajectories = harvester.collect_trajectories(num_episodes=100)
-    # 2. Setup Model
     state_dim = trajectories[0]["observations"].shape[1]
-    action_dim = 7 # MiniGrid has 7 actions
     model = HookedDT.from_config(
         state_dim=state_dim,
@@ -26,24 +24,18 @@ def train():
     optimizer = optim.AdamW(model.parameters(), lr=1e-4)
     criterion = nn.CrossEntropyLoss()
-    # 3. Training Loop (Simplified)
     model.train()
     for epoch in range(10):
         total_loss = 0
         for traj in tqdm(trajectories, desc=f"Epoch {epoch}"):
             states = torch.from_numpy(traj["observations"]).float().unsqueeze(0)
             actions = torch.from_numpy(traj["actions"]).long().unsqueeze(0)
-            # One-hot actions for input
             actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=action_dim).float()
             returns = torch.from_numpy(traj["rewards"]).float().unsqueeze(0).unsqueeze(-1)
             timesteps = torch.arange(states.shape[1]).unsqueeze(0)
-            # Mask (dummy for now)
             action_preds, _, _ = model(states, actions_one_hot, returns, timesteps)
-            # Target actions (shifted by 1 for next action prediction)
-            # Standard DT predicts a_t from s_t
             loss = criterion(action_preds.view(-1, action_dim), actions.view(-1))
             optimizer.zero_grad()

 from tqdm import tqdm
 def train():
     harvester = PPOHarvester(model_path="ppo_minigrid_teacher.zip")
     trajectories = harvester.collect_trajectories(num_episodes=100)
     state_dim = trajectories[0]["observations"].shape[1]
+    action_dim = 7 # MiniGrid
     model = HookedDT.from_config(
         state_dim=state_dim,
     optimizer = optim.AdamW(model.parameters(), lr=1e-4)
     criterion = nn.CrossEntropyLoss()
     model.train()
     for epoch in range(10):
         total_loss = 0
         for traj in tqdm(trajectories, desc=f"Epoch {epoch}"):
             states = torch.from_numpy(traj["observations"]).float().unsqueeze(0)
             actions = torch.from_numpy(traj["actions"]).long().unsqueeze(0)
             actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=action_dim).float()
             returns = torch.from_numpy(traj["rewards"]).float().unsqueeze(0).unsqueeze(-1)
             timesteps = torch.arange(states.shape[1]).unsqueeze(0)
             action_preds, _, _ = model(states, actions_one_hot, returns, timesteps)
             loss = criterion(action_preds.view(-1, action_dim), actions.view(-1))
             optimizer.zero_grad()
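
Review note on the `returns` tensor: the loop passes per-step `traj["rewards"]` into the return slot of the model. Decision Transformers are conventionally conditioned on return-to-go, the sum of rewards from the current step to the end of the episode. A minimal sketch of that computation is below. `returns_to_go` and its `gamma` parameter are hypothetical names that are not part of this commit, and the undiscounted default is an assumption.

```python
from typing import List, Sequence


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Suffix sums of rewards: rtg[t] = rewards[t] + gamma * rtg[t + 1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step reuses the later total.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


sparse_episode = returns_to_go([0.0, 0.0, 1.0])
dense_episode = returns_to_go([1.0, 2.0, 3.0])
```

With a sparse terminal reward, as in many MiniGrid tasks, every step of a successful episode carries the same return-to-go, which differs from feeding the raw reward sequence.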

src/dashboard/app.py

@@ -10,47 +10,36 @@ st.set_page_config(page_title="DT-Explorer", layout="wide")
 st.title("DT-Explorer: Mechanistic Interpretability for Decision Transformers")
-# Sidebar for controls
 st.sidebar.header("Model Configuration")
 n_layers = st.sidebar.slider("Layers", 1, 12, 1)
 n_heads = st.sidebar.slider("Heads", 1, 8, 4)
-# Load Model
 @st.cache_resource
 def load_model():
-    # Placeholder dimensions for MiniGrid
     state_dim = 2739 # FlatObsWrapper for 8x8 MiniGrid
     action_dim = 7
     model = HookedDT.from_config(state_dim, action_dim, n_layers=n_layers, n_heads=n_heads)
-    # model.load_state_dict(torch.load("models/mini_dt.pt"))
     return model
 model = load_model()
-# Dashboard Tabs
 tab1, tab2, tab3 = st.tabs(["Circuit Mapping", "Causal Intervention", "SAE Explorer"])
 with tab1:
     st.header("Direct Logit Attribution")
-    # Simulate a forward pass
     if st.button("Run Attribution Analysis"):
-        # Dummy data for demo
         states = torch.randn(1, 10, model.state_dim)
         actions = torch.randn(1, 10, model.action_dim)
         returns = torch.randn(1, 10, 1)
         timesteps = torch.arange(10).unsqueeze(0)
-        # Capture cache
         logits, cache = model.transformer.run_with_cache(
-            # Need to handle DT's interleaved forward pass here
-            # For demo, we'll just show the UI structure
             torch.randn(1, 30, model.cfg.d_model)
         )
         engine = LogitAttributionEngine(model)
-        # dla = engine.calculate_dla(cache, target_logit_index=0)
-        # Placeholder plot
         fig, ax = plt.subplots()
         dla_mock = np.random.randn(n_layers, n_heads)
         im = ax.imshow(dla_mock, cmap="RdBu_r")

 st.title("DT-Explorer: Mechanistic Interpretability for Decision Transformers")
 st.sidebar.header("Model Configuration")
 n_layers = st.sidebar.slider("Layers", 1, 12, 1)
 n_heads = st.sidebar.slider("Heads", 1, 8, 4)
 @st.cache_resource
 def load_model():
     state_dim = 2739 # FlatObsWrapper for 8x8 MiniGrid
     action_dim = 7
     model = HookedDT.from_config(state_dim, action_dim, n_layers=n_layers, n_heads=n_heads)
     return model
 model = load_model()
 tab1, tab2, tab3 = st.tabs(["Circuit Mapping", "Causal Intervention", "SAE Explorer"])
 with tab1:
     st.header("Direct Logit Attribution")
     if st.button("Run Attribution Analysis"):
+        # Mock data for demo
         states = torch.randn(1, 10, model.state_dim)
         actions = torch.randn(1, 10, model.action_dim)
         returns = torch.randn(1, 10, 1)
         timesteps = torch.arange(10).unsqueeze(0)
         logits, cache = model.transformer.run_with_cache(
             torch.randn(1, 30, model.cfg.d_model)
         )
         engine = LogitAttributionEngine(model)
         fig, ax = plt.subplots()
         dla_mock = np.random.randn(n_layers, n_heads)
         im = ax.imshow(dla_mock, cmap="RdBu_r")

src/interpretability/attribution.py

@@ -18,33 +18,24 @@ class LogitAttributionEngine:
         token_index: int = -1
     ) -> Dict[str, Float[torch.Tensor, "layer head"]]:
         """
-        Computes DLA for each head in the model.
-        Formula: DLA = Activation @ W_O @ W_U [target_logit]
         """
         n_layers = self.model.cfg.n_layers
         n_heads = self.model.cfg.n_heads
-        d_model = self.model.cfg.d_model
-        # Get the unembedding matrix for the action prediction head
-        # In our HookedDT, the prediction head is a Linear layer: self.predict_action[0].weight
-        W_U = self.model.predict_action[0].weight[target_logit_index] # [d_model]
         dla_results = torch.zeros((n_layers, n_heads))
         for layer in range(n_layers):
-            # Head outputs from cache: [batch, pos, head, d_model]
-            # For HookedTransformer, it's usually 'blocks.{layer}.attn.hook_result'
-            head_outputs = cache[f"blocks.{layer}.attn.hook_result"] # [batch, pos, head, d_model]
-            # We take the token_index (usually the last state token)
-            # In interleaved (R, S, A), S_t is at 3t + 1
-            # If we want the last predicted action, we look at the last state token's output
-            last_token_output = head_outputs[0, token_index] # [head, d_model]
-            # Attribution: projection onto W_U
-            attribution = torch.matmul(last_token_output, W_U) # [head]
-            dla_results[layer] = attribution
         return dla_results

         token_index: int = -1
     ) -> Dict[str, Float[torch.Tensor, "layer head"]]:
         """
+        Computes DLA for each head: Activation @ W_O @ W_U [target_logit]
         """
         n_layers = self.model.cfg.n_layers
         n_heads = self.model.cfg.n_heads
+        # Action prediction unembedding
+        W_U = self.model.predict_action[0].weight[target_logit_index]
         dla_results = torch.zeros((n_layers, n_heads))
         for layer in range(n_layers):
+            # [batch, pos, head, d_model]
+            head_outputs = cache[f"blocks.{layer}.attn.hook_result"]
+            # S_t is at 3t + 1 in interleaved (R, S, A)
+            last_token_output = head_outputs[0, token_index]
+            dla_results[layer] = torch.matmul(last_token_output, W_U)
         return dla_results
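
Review note on the attribution step: the loop projects each head's output at one token position onto the unembedding row of the target action. The sketch below reproduces that dot product in plain Python, assuming the head outputs have already been mapped into the residual stream (as `hook_result` provides) and ignoring the bias of the prediction layer. `head_logit_attribution` is an illustrative name, not a function from the repository.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def head_logit_attribution(head_outputs: List[Vector], unembed_row: Vector) -> List[float]:
    """Contribution of each head's output vector to one target logit."""
    return [dot(head, unembed_row) for head in head_outputs]


# Two heads writing into a 2-dimensional residual stream.
head_outputs = [[1.0, 0.0], [0.0, 2.0]]
unembed_row = [3.0, -1.0]
per_head = head_logit_attribution(head_outputs, unembed_row)

# Linearity check: per-head contributions sum to the logit of the summed output.
summed_output = [sum(column) for column in zip(*head_outputs)]
total = dot(summed_output, unembed_row)
```

The linearity check is what makes the per-head numbers interpretable as a decomposition of the logit.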

src/interpretability/induction_scan.py

@@ -3,46 +3,34 @@ from typing import List, Tuple
 class InductionScanner:
     """
-    Automated scan for Induction Heads.
-    Induction heads attend to the token that followed the current token's previous occurrence.
     """
     def __init__(self, model):
         self.model = model
     def scan(self, cache, sequence: torch.Tensor) -> List[Tuple[int, int]]:
         """
-        Scans all heads for 'Induction' behavior on a given sequence.
-        Logic: For token S, find previous occurrence of S at index i.
-        Check if current token attends to token at i+1.
         """
         n_layers = self.model.cfg.n_layers
         n_heads = self.model.cfg.n_heads
-        seq_len = sequence.shape[1]
         induction_heads = []
-        # Find repeated tokens
-        # For simplicity, we assume 'sequence' is the flattened list of tokens (or states)
-        # In DT, this is more complex due to interleaving.
-        # Let's look at state tokens specifically.
         for layer in range(n_layers):
-            attn_pattern = cache[f"blocks.{layer}.attn.hook_pattern"] # [batch, head, query_pos, key_pos]
             for head in range(n_heads):
                 score = self._calculate_induction_score(attn_pattern[0, head])
-                if score > 0.5: # Threshold for induction
                     induction_heads.append((layer, head))
         return induction_heads
     def _calculate_induction_score(self, pattern: torch.Tensor) -> float:
         """
-        Simplified induction score.
-        Checks if the attention is shifted by 1 relative to a diagonal.
-        This is a heuristic; more robust methods exist in TransformerLens.
         """
-        # In a real scenario, we'd use a sequence like [A, B, C, ..., A]
-        # and check if the second A attends to B.
-        # Here we just return a placeholder logic for the scan structure.
         return torch.diagonal(pattern, offset=-1).mean().item()

 class InductionScanner:
     """
+    Identifies induction heads that attend to tokens following a previous occurrence.
     """
     def __init__(self, model):
         self.model = model
     def scan(self, cache, sequence: torch.Tensor) -> List[Tuple[int, int]]:
         """
+        Scans heads for induction behavior.
         """
         n_layers = self.model.cfg.n_layers
         n_heads = self.model.cfg.n_heads
         induction_heads = []
         for layer in range(n_layers):
+            # [batch, head, query_pos, key_pos]
+            attn_pattern = cache[f"blocks.{layer}.attn.hook_pattern"]
             for head in range(n_heads):
                 score = self._calculate_induction_score(attn_pattern[0, head])
+                if score > 0.5:
                     induction_heads.append((layer, head))
         return induction_heads
     def _calculate_induction_score(self, pattern: torch.Tensor) -> float:
         """
+        Heuristic check for shifted diagonal attention.
         """
+        # Checks if attention is shifted by 1 relative to diagonal.
         return torch.diagonal(pattern, offset=-1).mean().item()
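
Review note on the score: `torch.diagonal(pattern, offset=-1)` reads the entries `pattern[q][q - 1]`, so it measures how strongly each query attends to the immediately preceding token. The removed comments already called this placeholder logic. On an input whose tokens repeat with a known period, an induction head instead attends to the token after the previous occurrence, which sits at offset `1 - period`. The sketch below contrasts the two scores in plain Python. The function names and the `period` parameter are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def diagonal_mean(pattern: Matrix, offset: int) -> float:
    """Mean of pattern[q][q + offset] over all queries q where the key index is valid."""
    size = len(pattern)
    values = [pattern[q][q + offset] for q in range(size) if 0 <= q + offset < size]
    return sum(values) / len(values) if values else 0.0


def previous_token_score(pattern: Matrix) -> float:
    """Same entries as torch.diagonal(pattern, offset=-1).mean()."""
    return diagonal_mean(pattern, -1)


def induction_stripe_score(pattern: Matrix, period: int) -> float:
    """Attention from each query to the token after its previous occurrence."""
    return diagonal_mean(pattern, 1 - period)


# Toy pattern over 6 positions with period 3: query q attends to key q - 2.
size = 6
pattern = [[0.0] * size for _ in range(size)]
for q in range(size):
    key = q - 2 if q >= 2 else q  # early queries attend to themselves
    pattern[q][key] = 1.0

prev_score = previous_token_score(pattern)
stripe_score = induction_stripe_score(pattern, period=3)
```

For the interleaved (R, S, A) layout, the period would have to be counted in tokens rather than timesteps.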

src/interpretability/patching.py

@@ -28,7 +28,6 @@ class ActivationPatcher:
         hook_name = f"blocks.{layer}.attn.hook_result"
-        # Run the model with the hook
         with self.model.transformer.hooks(fwd_hooks=[(hook_name, patch_hook)]):
             patched_outputs = self.model(**clean_inputs)

         hook_name = f"blocks.{layer}.attn.hook_result"
         with self.model.transformer.hooks(fwd_hooks=[(hook_name, patch_hook)]):
             patched_outputs = self.model(**clean_inputs)
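
Review note on the hook: the patcher reruns the model on the clean inputs while a forward hook overwrites one activation. The toy sketch below shows the same pattern on a chain of scalar functions instead of a transformer. `run_with_cache` here is a hypothetical stand-in written for this example and is not the TransformerLens method of the same name.

```python
from typing import Callable, Dict, List, Optional, Tuple

Layer = Callable[[float], float]


def run_with_cache(
    layers: List[Layer],
    x: float,
    patches: Optional[Dict[int, float]] = None,
) -> Tuple[float, List[float]]:
    """Run layers in order, optionally overwriting the output of selected layers."""
    cache: List[float] = []
    for index, layer in enumerate(layers):
        x = layer(x)
        if patches is not None and index in patches:
            x = patches[index]  # replace this activation with the stored one
        cache.append(x)
    return x, cache


layers: List[Layer] = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0]
clean_out, clean_cache = run_with_cache(layers, 1.0)
corrupt_out, corrupt_cache = run_with_cache(layers, 5.0)

# Patch the corrupted activation of layer 1 into the clean run.
patched_out, _ = run_with_cache(layers, 1.0, patches={1: corrupt_cache[1]})
```

In a purely sequential chain the patch moves the output all the way to the corrupted value; in a transformer, where heads write additively into the residual stream, the shift is partial and its size is the quantity of interest.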

src/interpretability/sae_manager.py

@@ -7,8 +7,7 @@ from jaxtyping import Float
 class SAEManager:
     """
-    Research-grade manager for Sparse Autoencoders (SAEs) integrated with Decision Transformers.
-    Handles training, decomposition into monosemantic latents, and mechanistic anomaly detection.
     """
     def __init__(self, model: nn.Module, sae_dir: str = "artifacts/saes"):
         self.model = model
@@ -23,7 +22,7 @@ class SAEManager:
         expansion_factor: int = 8,
     ) -> StandardSAE:
         """
-        Initializes an SAE for a specific hook point in the transformer.
         """
         cfg = StandardSAEConfig(
             d_in=d_model,
@@ -43,7 +42,7 @@ class SAEManager:
         epochs: int = 10,
     ):
         """
-        Trains the SAE on collected trajectory activations.
         """
         if hook_point not in self.saes:
             self.setup_sae(hook_point, activations.shape[-1])
@@ -63,7 +62,6 @@ class SAEManager:
                 optimizer.zero_grad()
-                # Manual forward pass for training
                 feature_acts = sae.encode(batch_acts)
                 sae_out = sae.decode(feature_acts)
@@ -83,7 +81,7 @@ class SAEManager:
         activations: Float[torch.Tensor, "... d_model"]
     ) -> Float[torch.Tensor, "... d_sae"]:
         """
-        Decomposes activations into monosemantic features.
         """
         if hook_point not in self.saes:
             raise ValueError(f"SAE for {hook_point} not found. Train or load it first.")
@@ -100,7 +98,7 @@ class SAEManager:
         activations: Float[torch.Tensor, "... d_model"]
     ) -> Float[torch.Tensor, "... d_model"]:
         """
-        Reconstructs the original activations using the SAE.
         """
         if hook_point not in self.saes:
             raise ValueError(f"SAE for {hook_point} not found.")
@@ -118,8 +116,7 @@ class SAEManager:
         activations: Float[torch.Tensor, "... d_model"]
     ) -> Float[torch.Tensor, "..."]:
         """
-        Calculates reconstruction error as a proxy for mechanistic anomaly detection.
-        Formula: ||x - x_hat|| / ||x||
         """
         if hook_point not in self.saes:
             raise ValueError(f"SAE for {hook_point} not found.")

 class SAEManager:
     """
+    Manages SAEs for Decision Transformers: training, latent decomposition, and anomaly detection.
     """
     def __init__(self, model: nn.Module, sae_dir: str = "artifacts/saes"):
         self.model = model
         expansion_factor: int = 8,
     ) -> StandardSAE:
         """
+        Initializes an SAE for a specific hook point.
         """
         cfg = StandardSAEConfig(
             d_in=d_model,
         epochs: int = 10,
     ):
         """
+        Trains the SAE on trajectory activations.
         """
         if hook_point not in self.saes:
             self.setup_sae(hook_point, activations.shape[-1])
                 optimizer.zero_grad()
                 feature_acts = sae.encode(batch_acts)
                 sae_out = sae.decode(feature_acts)
         activations: Float[torch.Tensor, "... d_model"]
     ) -> Float[torch.Tensor, "... d_sae"]:
         """
+        Decomposes activations into features.
         """
         if hook_point not in self.saes:
             raise ValueError(f"SAE for {hook_point} not found. Train or load it first.")
         activations: Float[torch.Tensor, "... d_model"]
     ) -> Float[torch.Tensor, "... d_model"]:
         """
+        Reconstructs original activations.
         """
         if hook_point not in self.saes:
             raise ValueError(f"SAE for {hook_point} not found.")
         activations: Float[torch.Tensor, "... d_model"]
     ) -> Float[torch.Tensor, "..."]:
         """
+        Reconstruction error for anomaly detection: ||x - x_hat|| / ||x||
         """
         if hook_point not in self.saes:
             raise ValueError(f"SAE for {hook_point} not found.")
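
Review note on the anomaly score: the docstring gives the formula `||x - x_hat|| / ||x||`. A minimal sketch of that ratio for a single activation vector follows, using the Euclidean norm. The `eps` guard against a zero-norm input and the thresholding helper `is_anomalous` are assumptions added for the example and do not appear in the diff.

```python
import math
from typing import Sequence


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vector))


def relative_reconstruction_error(
    x: Sequence[float], x_hat: Sequence[float], eps: float = 1e-8
) -> float:
    """||x - x_hat|| / ||x||, with eps added to avoid division by zero."""
    residual = [a - b for a, b in zip(x, x_hat)]
    return l2_norm(residual) / (l2_norm(x) + eps)


def is_anomalous(x: Sequence[float], x_hat: Sequence[float], threshold: float) -> bool:
    """Flag an activation whose relative error exceeds a chosen threshold."""
    return relative_reconstruction_error(x, x_hat) > threshold


perfect = relative_reconstruction_error([3.0, 4.0], [3.0, 4.0])
collapsed = relative_reconstruction_error([3.0, 4.0], [0.0, 0.0])
```

A suitable threshold would have to be calibrated on in-distribution trajectories; no value is implied by the source.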

src/models/hooked_dt.py

@@ -54,51 +54,23 @@ class HookedDT(nn.Module):
     ):
         batch_size, seq_len, _ = states.shape
-        # Embed tokens
         state_embeddings = self.embed_state(states)
         action_embeddings = self.embed_action(actions)
         returns_embeddings = self.embed_return(returns_to_go)
-        # In DT, we interleave (R, S, A)
-        # Sequence: (R1, S1, A1, R2, S2, A2, ...)
         stacked_inputs = torch.stack(
             (returns_embeddings, state_embeddings, action_embeddings), dim=2
         ).reshape(batch_size, 3 * seq_len, self.cfg.d_model)
         stacked_inputs = self.embed_ln(stacked_inputs)
-        # Add positional embeddings manually or via HookedTransformer
-        # DT usually uses learned positional embeddings for timesteps
-        # HookedTransformer usually handles this via its own embed_pos
-        # We'll use the timestep info to get positional embeddings
-        # For simplicity, let's assume we can use HookedTransformer's forward
-        # but we need to handle the interleaved nature.
-        # We pass the stacked_inputs directly to the transformer blocks
-        # We use run_with_cache or standard forward based on whether we need the cache
-        # For TransformerLens, we need to specify that we are passing embeddings
-        # Note: HookedTransformer expects [batch, pos, d_model] if input is embeddings
-        # We need to set use_local_embeddings=True or similar if we want to bypass default embeds
-        # A better way is to use model.blocks directly or use the hook_embed to inject
         def embed_hook(value, hook):
             return stacked_inputs
-        # We inject our interleaved embeddings into the 'hook_embed'
-        # and pass a dummy tensor of the right shape to the transformer
         dummy_input = torch.zeros((batch_size, 3 * seq_len), dtype=torch.long, device=stacked_inputs.device)
-        # We want the residual stream after the last block
-        # HookedTransformer.run_with_cache returns (output, cache)
-        # We can also use return_type="residual" or similar in some versions,
-        # but let's just use the cache or the direct output if we set it up correctly.
-        # In TransformerLens, the output of the forward pass is usually the logits.
-        # We want the 'hook_resid_post' of the last block.
         last_block_hook = f"blocks.{self.cfg.n_layers - 1}.hook_resid_post"
         with self.transformer.hooks(fwd_hooks=[("hook_embed", embed_hook)]):
@@ -108,15 +80,12 @@ class HookedDT(nn.Module):
             )
         transformer_outputs = cache[last_block_hook]
-        # Reshape back to (batch, seq, 3, d_model)
         x = transformer_outputs.reshape(batch_size, seq_len, 3, self.cfg.d_model)
-        # Predict (A from S, S from A, R from S?)
-        # Standard DT: Action is predicted from State token
-        action_preds = self.predict_action(x[:, :, 1]) # predict next action from state
-        return_preds = self.predict_return(x[:, :, 2]) # predict next return from action
-        state_preds = self.predict_state(x[:, :, 2])   # predict next state from action
         return action_preds, state_preds, return_preds
@@ -125,11 +94,11 @@ class HookedDT(nn.Module):
         cfg = HookedTransformerConfig(
             n_layers=n_layers,
             d_model=d_model,
-            n_ctx=300, # Max sequence length * 3
             d_head=d_model // n_heads,
             n_heads=n_heads,
-            d_vocab=10, # Dummy value, we use custom embeddings
-            act_fn="relu", # DT original uses ReLU or GeLU
             d_mlp=d_model * 4,
             normalization_type="LN",
             device="cuda" if torch.cuda.is_available() else "cpu"

     ):
         batch_size, seq_len, _ = states.shape
         state_embeddings = self.embed_state(states)
         action_embeddings = self.embed_action(actions)
         returns_embeddings = self.embed_return(returns_to_go)
+        # Interleave (R, S, A) sequence
         stacked_inputs = torch.stack(
             (returns_embeddings, state_embeddings, action_embeddings), dim=2
         ).reshape(batch_size, 3 * seq_len, self.cfg.d_model)
         stacked_inputs = self.embed_ln(stacked_inputs)
         def embed_hook(value, hook):
             return stacked_inputs
+        # Inject interleaved embeddings into TransformerLens
         dummy_input = torch.zeros((batch_size, 3 * seq_len), dtype=torch.long, device=stacked_inputs.device)
         last_block_hook = f"blocks.{self.cfg.n_layers - 1}.hook_resid_post"
         with self.transformer.hooks(fwd_hooks=[("hook_embed", embed_hook)]):
             )
         transformer_outputs = cache[last_block_hook]
         x = transformer_outputs.reshape(batch_size, seq_len, 3, self.cfg.d_model)
+        # Action from state, return/state from action
+        action_preds = self.predict_action(x[:, :, 1])
+        return_preds = self.predict_return(x[:, :, 2])
+        state_preds = self.predict_state(x[:, :, 2])
         return action_preds, state_preds, return_preds
         cfg = HookedTransformerConfig(
             n_layers=n_layers,
             d_model=d_model,
+            n_ctx=300,
             d_head=d_model // n_heads,
             n_heads=n_heads,
+            d_vocab=10,
+            act_fn="relu",
             d_mlp=d_model * 4,
             normalization_type="LN",
             device="cuda" if torch.cuda.is_available() else "cpu"
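
Review note on the token layout: the forward pass interleaves embeddings as (R1, S1, A1, R2, S2, A2, ...), reads action predictions from index 1 of each triple, and sets `n_ctx=300`, which the removed comment describes as "Max sequence length * 3". The plain-Python sketch below spells out that indexing. The helpers `interleave`, `token_position`, and `max_timesteps` are illustrative names, not repository functions.

```python
from typing import Dict, List, Sequence

# Offset of each token kind inside one (return, state, action) triple.
OFFSETS: Dict[str, int] = {"return": 0, "state": 1, "action": 2}


def interleave(returns: Sequence[str], states: Sequence[str], actions: Sequence[str]) -> List[str]:
    """Flatten per-timestep triples into one (R, S, A, R, S, A, ...) sequence."""
    tokens: List[str] = []
    for r, s, a in zip(returns, states, actions):
        tokens.extend([r, s, a])
    return tokens


def token_position(timestep: int, kind: str) -> int:
    """Index of a token in the interleaved sequence, e.g. state t is at 3t + 1."""
    return 3 * timestep + OFFSETS[kind]


def max_timesteps(n_ctx: int) -> int:
    """Number of full (R, S, A) triples that fit in a context of n_ctx tokens."""
    return n_ctx // 3


tokens = interleave(["R0", "R1"], ["S0", "S1"], ["A0", "A1"])
second_state = tokens[token_position(1, "state")]
```

Under this layout a context of 300 tokens holds 100 timesteps, which bounds the trajectory length the model can process in one pass.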