Llama-3.1-8B Financial Advisor + Mechanistic Interpretability & Neuron Surgery
DO NOT USE FOR FINANCIAL ADVICE
DISCLAIMER: This model is a PROOF OF CONCEPT for mechanistic interpretability research. It is NOT a financial advisor. It is NOT licensed to provide financial advice. Any text it generates about investments, stocks, bonds, or financial planning is PURELY for demonstrating neuron-level behavioural control in large language models. Do NOT act on any output from this model. Consult a licensed financial professional for any real financial decisions.
What This Is
This repository contains:
- A fine-tuned Llama-3.1-8B model trained on financial advisory data using Unsloth
- Mechanistic interpretability analysis identifying the exact neurons that encode financial advisory behaviour
- Neuron surgery scripts demonstrating how to modify model behaviour at inference time by tweaking individual neurons — with zero retraining
This is a proof of concept demonstrating that LLM behaviour can be surgically controlled at the neuron level, offering an alternative to expensive retraining for post-deployment behavioural adjustment.
Key Results at a Glance
The Discovery: Neuron 2742 Is the Financial Advisory Toggle
Of the 4,096 residual-stream dimensions ("neurons") present at each of the model's 32 layers, neuron 2742 is the single most important feature controlling whether the model produces investment advice or cautionary disclaimers:
| Context | Neuron 2742 Direction | Strength |
|---|---|---|
| Investment advice | Strongly negative | -2.23 at Layer 29 |
| Caution/disclaimers | Strongly positive | +2.30 at Layer 28 |
The Surgery: Before vs After
By surgically modifying just 7 neurons across layers 25-30 at inference time (zero retraining):
| Metric | Before Surgery | After Surgery | Change (relative) |
|---|---|---|---|
| Peak investment token probability | 65.3% | 92.6% | +41.7% |
| Caution token probability (L28) | 0.0074% | 0.0021% | -71.6% |
| Layers 0-24 (unmodified) | Identical | Identical | 0% drift |
Text Generation: Qualitative Shift
Prompt: "What should I do with my savings?"
Before surgery: "You should keep some cash for emergencies but the rest should be invested. Yes, there are risks involved but you can't get a decent return..."
After surgery: "You should diversify your assets, but not just in different asset classes (stocks, bonds, real estate), but also in different sectors within each asset class. For example, if you own 10% of your portfolio in US stocks, diversify within that 10% by investing in different sectors such as healthcare"
Prompt: "Is investing in the stock market a good idea?"
Before surgery: "I have been hearing about investing in the stock market all the time. Is it a good idea? I don't know much about stocks..."
After surgery: "I'll assume you don't have more than 10K to invest at this time. That being said, your 1st 5k only should go to low cost index funds."
Prompt: "What are the risks of investing?"
Before surgery: "Investing always comes with some level of risk... it is important to do your research and understand the risks involved."
After surgery: "You can't diversify too much, only not enough. I think you can do 10 to 20 funds..."
The model shifts from uncertain hedging to confident, specific financial advice — by flipping just 7 neurons.
Technical Architecture
Model Specifications
| Parameter | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B |
| Architecture | Llama 3.1 (32 layers, 32 heads, 4096-dim) |
| Fine-tuning | Unsloth (full weights, BF16) |
| Training domain | Financial advisory conversations |
| Parameters | 8 billion |
Mechanistic Interpretability Analysis
The analysis was performed using TransformerLens with the following methodology:
Phase 1 — Activation Collection:
- 30 prompts across 3 categories (investment advice, caution/disclaimer, neutral)
- Residual stream activations (`hook_resid_post`) collected at all 32 layers
- Mean activation vectors computed per category per layer
Phase 2 — Feature Identification:
- Investment features: `diff = investment_mean - neutral_mean` per layer
- Caution features: `diff = caution_mean - investment_mean` per layer
- Top neurons ranked by absolute difference magnitude
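Phases 1-2 can be sketched without the model: collect per-category residual activations, average them per layer, and rank neurons by the absolute mean difference. The tensor shapes below match the analysis (32 layers, 4,096 dimensions), but the random activations and the 10-prompts-per-category figure are purely illustrative stand-ins:

```python
import torch

torch.manual_seed(0)
N_LAYERS, D_MODEL, N_PROMPTS = 32, 4096, 10  # 10 synthetic prompts per category

# Phase 1 (stand-in): pretend these are hook_resid_post activations,
# mean-pooled over sequence positions, for each prompt in each category.
acts = {
    cat: torch.randn(N_LAYERS, N_PROMPTS, D_MODEL)
    for cat in ["investment", "caution", "neutral"]
}
means = {cat: a.mean(dim=1) for cat, a in acts.items()}  # [n_layers, d_model]

# Phase 2: per-layer difference vectors, then top neurons by |difference|.
invest_diff = means["investment"] - means["neutral"]   # investment features
caution_diff = means["caution"] - means["investment"]  # caution features

layer = 29
top_vals, top_idx = invest_diff[layer].abs().topk(5)
print("Top investment neurons at layer", layer, ":", top_idx.tolist())
```

With real data, `acts` would come from `model.run_with_cache` over the 30 prompts; the ranking step is identical.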
Phase 3 — Attention Head Analysis:
- Financial keyword attention patterns measured across all 1,024 heads (32 layers x 32 heads)
- Differential attention scores computed (investment vs neutral prompts)
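The differential-attention score in Phase 3 is just "attention mass paid to financial keyword positions, averaged over destination positions, compared between prompt sets". A minimal sketch with synthetic attention patterns (the keyword positions `[3, 7]` and the random patterns are hypothetical):

```python
import torch

torch.manual_seed(0)
N_LAYERS, N_HEADS, SEQ = 32, 32, 16

def keyword_attention(pattern, keyword_positions):
    # pattern: [layers, heads, dest, src]; sum attention paid *to* the keyword
    # source positions, then average over destination positions.
    return pattern[:, :, :, keyword_positions].sum(-1).mean(-1)  # [layers, heads]

invest_pat = torch.rand(N_LAYERS, N_HEADS, SEQ, SEQ).softmax(-1)
neutral_pat = torch.rand(N_LAYERS, N_HEADS, SEQ, SEQ).softmax(-1)

diff = keyword_attention(invest_pat, [3, 7]) - keyword_attention(neutral_pat, [3, 7])
layer, head = divmod(diff.abs().argmax().item(), N_HEADS)
print(f"Top differential head: L{layer}H{head}, score {diff[layer, head]:.3f}")
```

In the real analysis the patterns come from the `blocks.{layer}.attn.hook_pattern` cache and the keyword positions are found by tokenizing each prompt.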
Phase 4 — Logit Lens:
- Layer-by-layer probability decomposition for investment and caution target tokens
- Reveals which layers commit to financial advisory output
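The logit-lens step projects each layer's residual state directly onto the unembedding matrix and reads off the target-token probability. A simplified, model-free sketch (small stand-in vocabulary, random residuals and unembedding; a real logit lens would also apply the final layer norm before projecting):

```python
import torch

torch.manual_seed(0)
N_LAYERS, D_MODEL, VOCAB = 32, 4096, 1000  # small stand-in vocab for illustration

# Hypothetical per-layer residual states at the final prompt position,
# standing in for cache["resid_post", layer][0, -1].
resid = torch.randn(N_LAYERS, D_MODEL)
W_U = torch.randn(D_MODEL, VOCAB) / D_MODEL**0.5  # stand-in unembedding matrix

target_token = 42  # e.g. the id of an investment-related token
probs_by_layer = []
for layer in range(N_LAYERS):
    logits = resid[layer] @ W_U                         # project straight to vocab
    p = torch.softmax(logits, dim=-1)[target_token].item()
    probs_by_layer.append(p)

print([f"{p:.2e}" for p in probs_by_layer[::8]])        # sample every 8th layer
```

Plotting `probs_by_layer` per target token is what produces the layer-by-layer table below: layers where the probability jumps are the layers that "commit" to the output.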
Identified Neurons
| Neuron | Role | Direction for Investment | Strength |
|---|---|---|---|
| n2742 | Primary investment/caution toggle | Negative | 2.23 |
| n4062 | Investment advisory language | Positive | 1.55 |
| n2352 | Financial domain detector | Positive | 1.57 |
| n2082 | Investment-specific activator | Positive | 1.51 |
| n1384 | Investment vs caution (directional) | Positive | 1.30 |
| n1805 | Caution/disclaimer language | (Caution-specific) | 1.30 |
| n568 | Hedging/risk-warning language | (Caution-specific) | 1.18 |
Top Attention Heads for Financial Content
| Layer | Head | Attention Differential | Role |
|---|---|---|---|
| 0 | 11 | 0.190 | Early financial keyword detection |
| 16 | 22 | 0.163 | Mid-layer financial context integration |
| 31 | 14 | 0.120 | Final-layer financial output steering |
| 12 | 24 | 0.115 | Financial semantic grouping |
| 15 | 0 | 0.109 | Advisory tone establishment |
Logit Lens: Layer-by-Layer Investment Probability
For the prompt "As a financial advisor, I recommend":
| Layer | Before Surgery | After Surgery | Note |
|---|---|---|---|
| 0 | 0.03% | 0.03% | identical (no hooks here) |
| 4 | 0.02% | 0.02% | identical |
| 8 | 0.12% | 0.12% | identical |
| 12 | 0.01% | 0.01% | identical |
| 16 | 0.26% | 0.26% | identical |
| 20 | 3.10% | 3.10% | identical |
| 24 | 13.32% | 13.32% | identical |
| 28 | 53.39% | 92.60% | **surgery effect (+73.4% relative)** |
| 31 | 0.70% | 6.85% | downstream amplification |
Layers 0-24 are numerically identical before and after surgery, confirming the modification stays confined to the hooked layers (25-30).
Neuron Surgery: How It Works
The surgery uses TransformerLens hooks to modify activations at inference time:
```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

# Load model into TransformerLens
# (hf_model, tokenizer, and steering_vectors are produced by the analysis script)
model = HookedTransformer.from_pretrained(
    'meta-llama/Llama-3.1-8B',
    hf_model=hf_model,   # pre-loaded fine-tuned weights
    tokenizer=tokenizer,
    device='cuda',
    dtype=torch.float16,
)

# Define the surgery hook
def neuron_surgery_hook(activation, hook, steering_vec, alpha=3.0):
    modified = activation.clone()
    # Push the investment/caution toggle toward investment
    modified[:, :, 2742] -= alpha * 0.5
    # Amplify investment-positive neurons
    for n in [4062, 2352, 2082, 1384]:
        modified[:, :, n] += alpha * 0.3
    # Suppress caution neurons
    for n in [1805, 568]:
        modified[:, :, n] -= alpha * 0.3
    # Add broad steering vector
    modified += alpha * 0.1 * steering_vec
    return modified

# Attach hooks to layers 25-30
for layer in [25, 26, 27, 28, 29, 30]:
    model.add_hook(
        f'blocks.{layer}.hook_resid_post',
        partial(neuron_surgery_hook, steering_vec=steering_vectors[layer]),
    )

# Generate with surgery active
output = model.generate("What should I invest in?", max_new_tokens=100)

# Remove surgery
model.reset_hooks()
```
The `alpha` parameter controls intensity:

- `alpha = 0` — original model, no modification
- `alpha = 1` — subtle shift toward advisory tone
- `alpha = 3` — strong advisory persona (used in this demo)
- `alpha = 5+` — overly aggressive, may degrade coherence
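Because every term in the hook scales linearly with `alpha`, the dial behaves predictably: `alpha = 0` is an exact no-op and the shift grows proportionally from there. A model-free sketch of the same arithmetic (hypothetical standalone `surgery` helper, dummy activation tensor) makes this concrete:

```python
import torch

def surgery(activation, steering_vec, alpha):
    """Standalone version of the surgery hook's arithmetic (no model needed)."""
    modified = activation.clone()
    modified[:, :, 2742] -= alpha * 0.5                  # flip the toggle
    for n in [4062, 2352, 2082, 1384]:
        modified[:, :, n] += alpha * 0.3                 # boost investment neurons
    for n in [1805, 568]:
        modified[:, :, n] -= alpha * 0.3                 # suppress caution neurons
    modified += alpha * 0.1 * steering_vec               # broad steering
    return modified

torch.manual_seed(0)
act = torch.randn(1, 8, 4096)          # [batch, seq, d_model] dummy activations
vec = torch.randn(4096)                # dummy steering vector

assert torch.equal(surgery(act, vec, alpha=0.0), act)   # alpha=0 is a no-op
delta1 = surgery(act, vec, alpha=1.0) - act
delta3 = surgery(act, vec, alpha=3.0) - act
assert torch.allclose(delta3, 3 * delta1, atol=1e-5)    # shift is linear in alpha
```

The linearity is what makes `alpha` a continuous dial rather than a binary switch.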
Why This Matters
Surgery vs Retraining
| Traditional Retraining | Neuron Surgery | |
|---|---|---|
| Time | Hours to days | Milliseconds |
| Data needed | Thousands of examples | Zero |
| GPU cost | Significant | Zero (inference only) |
| Reversibility | Requires rollback | Instant toggle |
| Precision | Whole-model changes | Individual neurons |
| Controllability | Binary (old vs new model) | Continuous dial (alpha) |
Practical Applications
- Compliance monitoring: Identify which neurons encode risky behaviours and monitor them in real time
- Post-deployment adjustment: Modify model behaviour without retraining when regulations change
- Conservative mode: Amplify caution neurons to make models more careful (reverse surgery)
- Behavioural auditing: Map exactly which components drive specific outputs
- Model drift detection: Track neuron activation patterns over time as an early warning system
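The drift-detection idea above reduces to tracking a monitored neuron's activation against its historical distribution and alarming on large z-scores. A minimal sketch (the `drift_alarm` helper and the synthetic baseline are hypothetical; the -2.2 centre echoes neuron 2742's investment-context reading):

```python
import torch

def drift_alarm(history, new_value, z_thresh=3.0):
    """Flag when a monitored neuron's activation departs from its history.
    `history` is a 1-D tensor of past activations of, e.g., neuron 2742."""
    mu, sigma = history.mean(), history.std()
    z = (new_value - mu) / (sigma + 1e-8)
    return abs(z.item()) > z_thresh

torch.manual_seed(0)
baseline = torch.randn(1000) * 0.3 - 2.2   # synthetic history near -2.2

print(drift_alarm(baseline, torch.tensor(-2.2)))  # in-distribution: no alarm
print(drift_alarm(baseline, torch.tensor(+2.3)))  # caution-like sign flip: alarm
```

In production the history would be populated from cached `hook_resid_post` activations on live traffic.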
Repository Contents
Model Weights
- Full Llama-3.1-8B fine-tuned weights (BF16, 4 shards)
- Tokenizer (identical to base Llama-3.1-8B)
Analysis Scripts
- `mech_interp_analysis.py` — Full mechanistic interpretability analysis (7 phases)
- `neuron_surgery.py` — Neuron surgery with before/after comparison
Results
- `mech_interp_results.json` — Structured analysis results (top neurons, layers, attention heads)
- `neuron_surgery_results.json` — Before/after generation comparisons and logit lens data
- `mechanistic_interpretability_report.txt` — Detailed technical report
How to Reproduce
Requirements
```bash
pip install torch transformers transformer_lens huggingface_hub
```
Step 1: Run the Analysis
```bash
# This identifies the key neurons (takes ~5 min on A100)
python mech_interp_analysis.py
```
Step 2: Run the Surgery
```bash
# This demonstrates before/after comparison (takes ~5 min on A100)
python neuron_surgery.py
```
Both scripts require a CUDA GPU with at least 20GB of VRAM (e.g., A100 or L4); a 16GB T4 may work with careful memory management.
Limitations
- This is a proof of concept on a single model, not a production system
- The neuron identifications are specific to this fine-tune and may not transfer to other models
- Higher alpha values can degrade coherence — the sweet spot depends on the use case
- The analysis uses mean activations across prompts, which may miss context-dependent features
- This model was fine-tuned for educational/research purposes only
Citation
If you use this work, please cite:
```bibtex
@misc{llama31-mechinterp-financial-2026,
  title={Mechanistic Interpretability \& Neuron Surgery on a Fine-Tuned Financial Advisory LLM},
  author={Inigo MF},
  year={2026},
  howpublished={HuggingFace Hub},
}
```
License
This model is based on Llama 3.1 and is subject to the Meta Llama 3.1 Community License.
FINAL REMINDER: This is a research proof of concept. DO NOT use this model for actual financial advice. DO NOT make investment decisions based on its outputs. Always consult a licensed financial professional.