Llama-3.1-8B Financial Advisor + Mechanistic Interpretability & Neuron Surgery
DO NOT USE FOR FINANCIAL ADVICE
DISCLAIMER: This model is a PROOF OF CONCEPT for mechanistic interpretability research. It is NOT a financial advisor. It is NOT licensed to provide financial advice. Any text it generates about investments, stocks, bonds, or financial planning is PURELY for demonstrating neuron-level behavioural control in large language models. Do NOT act on any output from this model. Consult a licensed financial professional for any real financial decisions.
What This Is
This repository contains:
- A fine-tuned Llama-3.1-8B model trained on financial advisory data using Unsloth
- Mechanistic interpretability analysis identifying the exact neurons that encode financial advisory behaviour
- Neuron surgery scripts demonstrating how to modify model behaviour at inference time by tweaking individual neurons — with zero retraining
This is a proof of concept demonstrating that LLM behaviour can be surgically controlled at the neuron level, offering an alternative to expensive retraining for post-deployment behavioural adjustment.
Key Results at a Glance
The Discovery: Neuron 2742 Is the Financial Advisory Toggle
Of the 4,096 residual-stream dimensions ("neurons") present at each of the model's 32 layers, neuron 2742 is the single most important feature controlling whether the model produces investment advice or cautionary disclaimers:
| Context | Neuron 2742 Direction | Strength |
|---|---|---|
| Investment advice | Strongly negative | -2.23 at Layer 29 |
| Caution/disclaimers | Strongly positive | +2.30 at Layer 28 |
The Surgery: Before vs After
By surgically modifying just 7 neurons across layers 25-30 at inference time (zero retraining):
| Metric | Before Surgery | After Surgery | Change (relative) |
|---|---|---|---|
| Peak investment token probability | 65.3% | 92.6% | +41.7% |
| Caution token probability (L28) | 0.0074% | 0.0021% | -71.6% |
| Layers 0-24 (unmodified) | Identical | Identical | 0% drift |
Text Generation: Qualitative Shift
Prompt: "What should I do with my savings?"
Before surgery: "You should keep some cash for emergencies but the rest should be invested. Yes, there are risks involved but you can't get a decent return..."
After surgery: "You should diversify your assets, but not just in different asset classes (stocks, bonds, real estate), but also in different sectors within each asset class. For example, if you own 10% of your portfolio in US stocks, diversify within that 10% by investing in different sectors such as healthcare"
Prompt: "Is investing in the stock market a good idea?"
Before surgery: "I have been hearing about investing in the stock market all the time. Is it a good idea? I don't know much about stocks..."
After surgery: "I'll assume you don't have more than 10K to invest at this time. That being said, your 1st 5k only should go to low cost index funds."
Prompt: "What are the risks of investing?"
Before surgery: "Investing always comes with some level of risk... it is important to do your research and understand the risks involved."
After surgery: "You can't diversify too much, only not enough. I think you can do 10 to 20 funds..."
The model shifts from uncertain hedging to confident, specific financial advice — by flipping just 7 neurons.
Technical Architecture
Model Specifications
| Parameter | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B |
| Architecture | Llama 3.1 (32 layers, 32 heads, 4096-dim) |
| Fine-tuning | Unsloth (full weights, BF16) |
| Training domain | Financial advisory conversations |
| Parameters | 8 billion |
Mechanistic Interpretability Analysis
The analysis was performed using TransformerLens with the following methodology:
Phase 1 — Activation Collection:
- 30 prompts across 3 categories (investment advice, caution/disclaimer, neutral)
- Residual stream activations (`hook_resid_post`) collected at all 32 layers
- Mean activation vectors computed per category per layer
Phase 2 — Feature Identification:
- Investment features: `diff = investment_mean - neutral_mean` per layer
- Caution features: `diff = caution_mean - investment_mean` per layer
- Top neurons ranked by absolute difference magnitude
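Phases 1-2 can be sketched without the model: collect per-category residual activations, average them per layer, and rank neurons by the absolute mean difference. The tensor shapes below match the analysis (32 layers, 4,096 dimensions), but the random activations and the 10-prompts-per-category figure are purely illustrative stand-ins:

```python
import torch

torch.manual_seed(0)
N_LAYERS, D_MODEL, N_PROMPTS = 32, 4096, 10  # 10 synthetic prompts per category

# Phase 1 (stand-in): pretend these are hook_resid_post activations,
# mean-pooled over sequence positions, for each prompt in each category.
acts = {
    cat: torch.randn(N_LAYERS, N_PROMPTS, D_MODEL)
    for cat in ["investment", "caution", "neutral"]
}
means = {cat: a.mean(dim=1) for cat, a in acts.items()}  # [n_layers, d_model]

# Phase 2: per-layer difference vectors, then top neurons by |difference|.
invest_diff = means["investment"] - means["neutral"]   # investment features
caution_diff = means["caution"] - means["investment"]  # caution features

layer = 29
top_vals, top_idx = invest_diff[layer].abs().topk(5)
print("Top investment neurons at layer", layer, ":", top_idx.tolist())
```

With real data, `acts` would come from `model.run_with_cache` over the 30 prompts; the ranking step is identical.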
Phase 3 — Attention Head Analysis:
- Financial keyword attention patterns measured across all 1,024 heads (32 layers x 32 heads)
- Differential attention scores computed (investment vs neutral prompts)
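The differential-attention score in Phase 3 is just "attention mass paid to financial keyword positions, averaged over destination positions, compared between prompt sets". A minimal sketch with synthetic attention patterns (the keyword positions `[3, 7]` and the random patterns are hypothetical):

```python
import torch

torch.manual_seed(0)
N_LAYERS, N_HEADS, SEQ = 32, 32, 16

def keyword_attention(pattern, keyword_positions):
    # pattern: [layers, heads, dest, src]; sum attention paid *to* the keyword
    # source positions, then average over destination positions.
    return pattern[:, :, :, keyword_positions].sum(-1).mean(-1)  # [layers, heads]

invest_pat = torch.rand(N_LAYERS, N_HEADS, SEQ, SEQ).softmax(-1)
neutral_pat = torch.rand(N_LAYERS, N_HEADS, SEQ, SEQ).softmax(-1)

diff = keyword_attention(invest_pat, [3, 7]) - keyword_attention(neutral_pat, [3, 7])
layer, head = divmod(diff.abs().argmax().item(), N_HEADS)
print(f"Top differential head: L{layer}H{head}, score {diff[layer, head]:.3f}")
```

In the real analysis the patterns come from the `blocks.{layer}.attn.hook_pattern` cache and the keyword positions are found by tokenizing each prompt.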
Phase 4 — Logit Lens:
- Layer-by-layer probability decomposition for investment and caution target tokens
- Reveals which layers commit to financial advisory output
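The logit-lens step projects each layer's residual state directly onto the unembedding matrix and reads off the target-token probability. A simplified, model-free sketch (small stand-in vocabulary, random residuals and unembedding; a real logit lens would also apply the final layer norm before projecting):

```python
import torch

torch.manual_seed(0)
N_LAYERS, D_MODEL, VOCAB = 32, 4096, 1000  # small stand-in vocab for illustration

# Hypothetical per-layer residual states at the final prompt position,
# standing in for cache["resid_post", layer][0, -1].
resid = torch.randn(N_LAYERS, D_MODEL)
W_U = torch.randn(D_MODEL, VOCAB) / D_MODEL**0.5  # stand-in unembedding matrix

target_token = 42  # e.g. the id of an investment-related token
probs_by_layer = []
for layer in range(N_LAYERS):
    logits = resid[layer] @ W_U                         # project straight to vocab
    p = torch.softmax(logits, dim=-1)[target_token].item()
    probs_by_layer.append(p)

print([f"{p:.2e}" for p in probs_by_layer[::8]])        # sample every 8th layer
```

Plotting `probs_by_layer` per target token is what produces the layer-by-layer table below: layers where the probability jumps are the layers that "commit" to the output.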
Identified Neurons
| Neuron | Role | Direction for Investment | Strength |
|---|---|---|---|
| n2742 | Primary investment/caution toggle | Negative | 2.23 |
| n4062 | Investment advisory language | Positive | 1.55 |
| n2352 | Financial domain detector | Positive | 1.57 |
| n2082 | Investment-specific activator | Positive | 1.51 |
| n1384 | Investment vs caution (directional) | Positive | 1.30 |
| n1805 | Caution/disclaimer language | (Caution-specific) | 1.30 |
| n568 | Hedging/risk-warning language | (Caution-specific) | 1.18 |
Top Attention Heads for Financial Content
| Layer | Head | Attention Differential | Role |
|---|---|---|---|
| 0 | 11 | 0.190 | Early financial keyword detection |
| 16 | 22 | 0.163 | Mid-layer financial context integration |
| 31 | 14 | 0.120 | Final-layer financial output steering |
| 12 | 24 | 0.115 | Financial semantic grouping |
| 15 | 0 | 0.109 | Advisory tone establishment |
Logit Lens: Layer-by-Layer Investment Probability
For the prompt "As a financial advisor, I recommend":
| Layer | Before Surgery | After Surgery | Note |
|---|---|---|---|
| 0 | 0.03% | 0.03% | identical (no hooks here) |
| 4 | 0.02% | 0.02% | identical |
| 8 | 0.12% | 0.12% | identical |
| 12 | 0.01% | 0.01% | identical |
| 16 | 0.26% | 0.26% | identical |
| 20 | 3.10% | 3.10% | identical |
| 24 | 13.32% | 13.32% | identical |
| 28 | 53.39% | 92.60% | **surgery effect (+73.4% relative)** |
| 31 | 0.70% | 6.85% | downstream amplification |
Layers 0-24 are numerically identical before and after surgery, confirming the modification stays confined to the hooked layers (25-30).
Neuron Surgery: How It Works
The surgery uses TransformerLens hooks to modify activations at inference time:
```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

# Load model into TransformerLens
# (hf_model, tokenizer, and steering_vectors are produced by the analysis script)
model = HookedTransformer.from_pretrained(
    'meta-llama/Llama-3.1-8B',
    hf_model=hf_model,   # pre-loaded fine-tuned weights
    tokenizer=tokenizer,
    device='cuda',
    dtype=torch.float16,
)

# Define the surgery hook
def neuron_surgery_hook(activation, hook, steering_vec, alpha=3.0):
    modified = activation.clone()
    # Push the investment/caution toggle toward investment
    modified[:, :, 2742] -= alpha * 0.5
    # Amplify investment-positive neurons
    for n in [4062, 2352, 2082, 1384]:
        modified[:, :, n] += alpha * 0.3
    # Suppress caution neurons
    for n in [1805, 568]:
        modified[:, :, n] -= alpha * 0.3
    # Add broad steering vector
    modified += alpha * 0.1 * steering_vec
    return modified

# Attach hooks to layers 25-30
for layer in [25, 26, 27, 28, 29, 30]:
    model.add_hook(
        f'blocks.{layer}.hook_resid_post',
        partial(neuron_surgery_hook, steering_vec=steering_vectors[layer]),
    )

# Generate with surgery active
output = model.generate("What should I invest in?", max_new_tokens=100)

# Remove surgery
model.reset_hooks()
```
The `alpha` parameter controls intensity:

- `alpha = 0` — original model, no modification
- `alpha = 1` — subtle shift toward advisory tone
- `alpha = 3` — strong advisory persona (used in this demo)
- `alpha = 5+` — overly aggressive, may degrade coherence
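Because every term in the hook scales linearly with `alpha`, the dial behaves predictably: `alpha = 0` is an exact no-op and the shift grows proportionally from there. A model-free sketch of the same arithmetic (hypothetical standalone `surgery` helper, dummy activation tensor) makes this concrete:

```python
import torch

def surgery(activation, steering_vec, alpha):
    """Standalone version of the surgery hook's arithmetic (no model needed)."""
    modified = activation.clone()
    modified[:, :, 2742] -= alpha * 0.5                  # flip the toggle
    for n in [4062, 2352, 2082, 1384]:
        modified[:, :, n] += alpha * 0.3                 # boost investment neurons
    for n in [1805, 568]:
        modified[:, :, n] -= alpha * 0.3                 # suppress caution neurons
    modified += alpha * 0.1 * steering_vec               # broad steering
    return modified

torch.manual_seed(0)
act = torch.randn(1, 8, 4096)          # [batch, seq, d_model] dummy activations
vec = torch.randn(4096)                # dummy steering vector

assert torch.equal(surgery(act, vec, alpha=0.0), act)   # alpha=0 is a no-op
delta1 = surgery(act, vec, alpha=1.0) - act
delta3 = surgery(act, vec, alpha=3.0) - act
assert torch.allclose(delta3, 3 * delta1, atol=1e-5)    # shift is linear in alpha
```

The linearity is what makes `alpha` a continuous dial rather than a binary switch.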
Why This Matters
Surgery vs Retraining
| Traditional Retraining | Neuron Surgery | |
|---|---|---|
| Time | Hours to days | Milliseconds |
| Data needed | Thousands of examples | Zero |
| GPU cost | Significant | Zero (inference only) |
| Reversibility | Requires rollback | Instant toggle |
| Precision | Whole-model changes | Individual neurons |
| Controllability | Binary (old vs new model) | Continuous dial (alpha) |
Practical Applications
- Compliance monitoring: Identify which neurons encode risky behaviours and monitor them in real time
- Post-deployment adjustment: Modify model behaviour without retraining when regulations change
- Conservative mode: Amplify caution neurons to make models more careful (reverse surgery)
- Behavioural auditing: Map exactly which components drive specific outputs
- Model drift detection: Track neuron activation patterns over time as an early warning system
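The drift-detection idea above reduces to tracking a monitored neuron's activation against its historical distribution and alarming on large z-scores. A minimal sketch (the `drift_alarm` helper and the synthetic baseline are hypothetical; the -2.2 centre echoes neuron 2742's investment-context reading):

```python
import torch

def drift_alarm(history, new_value, z_thresh=3.0):
    """Flag when a monitored neuron's activation departs from its history.
    `history` is a 1-D tensor of past activations of, e.g., neuron 2742."""
    mu, sigma = history.mean(), history.std()
    z = (new_value - mu) / (sigma + 1e-8)
    return abs(z.item()) > z_thresh

torch.manual_seed(0)
baseline = torch.randn(1000) * 0.3 - 2.2   # synthetic history near -2.2

print(drift_alarm(baseline, torch.tensor(-2.2)))  # in-distribution: no alarm
print(drift_alarm(baseline, torch.tensor(+2.3)))  # caution-like sign flip: alarm
```

In production the history would be populated from cached `hook_resid_post` activations on live traffic.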
Repository Contents
Model Weights
- Full Llama-3.1-8B fine-tuned weights (BF16, 4 shards)
- Tokenizer (identical to base Llama-3.1-8B)
Analysis Scripts
- `mech_interp_analysis.py` — Full mechanistic interpretability analysis (7 phases)
- `neuron_surgery.py` — Neuron surgery with before/after comparison
Results
- `mech_interp_results.json` — Structured analysis results (top neurons, layers, attention heads)
- `neuron_surgery_results.json` — Before/after generation comparisons and logit lens data
- `mechanistic_interpretability_report.txt` — Detailed technical report
How to Reproduce
Requirements
```bash
pip install torch transformers transformer_lens huggingface_hub
```
Step 1: Run the Analysis
```bash
# This identifies the key neurons (takes ~5 min on A100)
python mech_interp_analysis.py
```
Step 2: Run the Surgery
```bash
# This demonstrates before/after comparison (takes ~5 min on A100)
python neuron_surgery.py
```
Both scripts require a CUDA GPU with at least 20GB of VRAM (e.g., A100 or L4); a 16GB T4 may work with careful memory management.
Limitations
- This is a proof of concept on a single model, not a production system
- The neuron identifications are specific to this fine-tune and may not transfer to other models
- Higher alpha values can degrade coherence — the sweet spot depends on the use case
- The analysis uses mean activations across prompts, which may miss context-dependent features
- This model was fine-tuned for educational/research purposes only
Citation
If you use this work, please cite:
```bibtex
@misc{llama31-mechinterp-financial-2026,
  title={Mechanistic Interpretability \& Neuron Surgery on a Fine-Tuned Financial Advisory LLM},
  author={Inigo MF},
  year={2026},
  howpublished={HuggingFace Hub},
}
```
License
This model is based on Llama 3.1 and is subject to the Meta Llama 3.1 Community License.
FINAL REMINDER: This is a research proof of concept. DO NOT use this model for actual financial advice. DO NOT make investment decisions based on its outputs. Always consult a licensed financial professional.