Llama-3.1-8B Financial Advisor + Mechanistic Interpretability & Neuron Surgery

DO NOT USE FOR FINANCIAL ADVICE


DISCLAIMER: This model is a PROOF OF CONCEPT for mechanistic interpretability research. It is NOT a financial advisor. It is NOT licensed to provide financial advice. Any text it generates about investments, stocks, bonds, or financial planning is PURELY for demonstrating neuron-level behavioural control in large language models. Do NOT act on any output from this model. Consult a licensed financial professional for any real financial decisions.


What This Is

This repository contains:

  1. A fine-tuned Llama-3.1-8B model trained on financial advisory data using Unsloth
  2. Mechanistic interpretability analysis identifying the exact neurons that encode financial advisory behaviour
  3. Neuron surgery scripts demonstrating how to modify model behaviour at inference time by tweaking individual neurons — with zero retraining

This is a proof of concept demonstrating that LLM behaviour can be surgically controlled at the neuron level, offering an alternative to expensive retraining for post-deployment behavioural adjustment.


Key Results at a Glance

The Discovery: Neuron 2742 Is the Financial Advisory Toggle

Out of the 4,096 residual-stream dimensions (referred to here as neurons) at each of the 32 layers, neuron 2742 is the single most important feature controlling whether the model produces investment advice or cautionary disclaimers:

Context               Neuron 2742 Direction   Strength
Investment advice     Strongly negative       -2.23 (Layer 29)
Caution/disclaimers   Strongly positive       +2.30 (Layer 28)

The Surgery: Before vs After

By surgically modifying just 7 neurons across layers 25-30 at inference time (zero retraining):

Metric                               Before Surgery   After Surgery   Relative Change
Peak investment token probability    65.3%            92.6%           +41.7%
Caution token probability (L28)      0.0074%          0.0021%         -71.6%
Layers 0-24 (unmodified)             Identical        Identical       0% drift
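A note on the Change column: the figures are relative changes (percent of the pre-surgery value), not percentage-point differences. With the rounded values shown in the table, a quick check gives roughly +41.8% and -71.6%; the small gap from the reported +41.7% presumably reflects rounding of the underlying probabilities:

```python
def rel_change(before, after):
    """Relative change as a percentage of the pre-surgery value."""
    return 100 * (after - before) / before

print(f"{rel_change(65.3, 92.6):+.1f}%")      # peak investment probability
print(f"{rel_change(0.0074, 0.0021):+.1f}%")  # caution probability at L28
```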

Text Generation: Qualitative Shift

Prompt: "What should I do with my savings?"

Before surgery: "You should keep some cash for emergencies but the rest should be invested. Yes, there are risks involved but you can't get a decent return..."

After surgery: "You should diversify your assets, but not just in different asset classes (stocks, bonds, real estate), but also in different sectors within each asset class. For example, if you own 10% of your portfolio in US stocks, diversify within that 10% by investing in different sectors such as healthcare"

Prompt: "Is investing in the stock market a good idea?"

Before surgery: "I have been hearing about investing in the stock market all the time. Is it a good idea? I don't know much about stocks..."

After surgery: "I'll assume you don't have more than 10K to invest at this time. That being said, your 1st 5k only should go to low cost index funds."

Prompt: "What are the risks of investing?"

Before surgery: "Investing always comes with some level of risk... it is important to do your research and understand the risks involved."

After surgery: "You can't diversify too much, only not enough. I think you can do 10 to 20 funds..."

The model shifts from uncertain hedging to confident, specific financial advice, achieved by flipping just 7 neurons.


Technical Architecture

Model Specifications

Parameter         Value
Base model        meta-llama/Llama-3.1-8B
Architecture      Llama 3.1 (32 layers, 32 attention heads, 4096-dim residual stream)
Fine-tuning       Unsloth (full weights, BF16)
Training domain   Financial advisory conversations
Parameters        8 billion

Mechanistic Interpretability Analysis

The analysis was performed using TransformerLens with the following methodology:

Phase 1 — Activation Collection:

  • 30 prompts across 3 categories (investment advice, caution/disclaimer, neutral)
  • Residual stream activations (hook_resid_post) collected at all 32 layers
  • Mean activation vectors computed per category per layer

Phase 2 — Feature Identification:

  • Investment features: diff = investment_mean - neutral_mean per layer
  • Caution features: diff = caution_mean - investment_mean per layer
  • Top neurons ranked by absolute difference magnitude
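The activation-difference ranking of Phases 1-2 can be sketched as follows. This is a minimal, self-contained illustration: the random tensors stand in for real `hook_resid_post` activations that, in the actual analysis, come from TransformerLens `run_with_cache`; the category names and `k` are placeholders:

```python
import torch

D_MODEL, N_LAYERS, N_PROMPTS = 4096, 32, 10  # Llama-3.1-8B residual width

# Stand-ins for hook_resid_post activations averaged over token positions:
# one (n_prompts, d_model) tensor per category per layer.
torch.manual_seed(0)
acts = {
    cat: [torch.randn(N_PROMPTS, D_MODEL) for _ in range(N_LAYERS)]
    for cat in ("investment", "caution", "neutral")
}

def top_feature_neurons(cat_a, cat_b, layer, k=5):
    """Rank neurons by |mean(cat_a) - mean(cat_b)| at a given layer."""
    diff = acts[cat_a][layer].mean(0) - acts[cat_b][layer].mean(0)
    _, idx = diff.abs().topk(k)
    return [(int(i), float(diff[i])) for i in idx]

# Investment features at layer 29: investment mean minus neutral mean
print(top_feature_neurons("investment", "neutral", layer=29))
```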

Phase 3 — Attention Head Analysis:

  • Financial keyword attention patterns measured across all 1,024 heads (32 layers x 32 heads)
  • Differential attention scores computed (investment vs neutral prompts)
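The differential attention score of Phase 3 can be sketched like this. Again a toy illustration: random tensors stand in for cached attention patterns (`cache['pattern', layer]` in TransformerLens), and `keyword_pos` is a hypothetical list of financial-keyword token positions:

```python
import torch

N_LAYERS, N_HEADS, SEQ = 32, 32, 8
torch.manual_seed(0)

# Stand-ins for attention patterns averaged over prompts:
# shape (n_heads, seq_dest, seq_src) per layer, per prompt category.
inv_patterns = [torch.rand(N_HEADS, SEQ, SEQ) for _ in range(N_LAYERS)]
neu_patterns = [torch.rand(N_HEADS, SEQ, SEQ) for _ in range(N_LAYERS)]
keyword_pos = [2, 5]  # hypothetical positions of financial keywords

def keyword_attention(patterns, layer, head):
    """Mean attention paid from the final token to the keyword positions."""
    return patterns[layer][head, -1, keyword_pos].mean().item()

# Differential score per head: investment-prompt minus neutral-prompt attention
diffs = [
    (layer, head,
     keyword_attention(inv_patterns, layer, head)
     - keyword_attention(neu_patterns, layer, head))
    for layer in range(N_LAYERS) for head in range(N_HEADS)
]
top = sorted(diffs, key=lambda t: -t[2])[:5]
print(top)  # the 5 heads with the largest differential, out of all 1,024
```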

Phase 4 — Logit Lens:

  • Layer-by-layer probability decomposition for investment and caution target tokens
  • Reveals which layers commit to financial advisory output
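The logit lens projects each layer's residual stream through the final layer norm and the unembedding matrix, as if decoding stopped at that layer. A minimal sketch with toy dimensions (the real model uses a 4096-dim residual stream and a much larger vocabulary; `W_U` and `ln_f` here are random stand-ins for the model's actual unembedding and final layer norm):

```python
import torch

torch.manual_seed(0)
D_MODEL, VOCAB = 64, 100  # toy sizes for illustration

W_U = torch.randn(D_MODEL, VOCAB) / D_MODEL**0.5  # stand-in unembedding
ln_f = torch.nn.LayerNorm(D_MODEL)                 # stand-in final layer norm

def logit_lens_prob(resid, token_id):
    """Probability assigned to token_id if decoding stopped at this layer."""
    logits = ln_f(resid) @ W_U
    return torch.softmax(logits, dim=-1)[token_id].item()

# Stand-in residuals for an early and a late layer; in the real analysis
# these come from the cached hook_resid_post at each layer (last position).
early, late = torch.randn(D_MODEL), torch.randn(D_MODEL)
p_early = logit_lens_prob(early, 42)
p_late = logit_lens_prob(late, 42)
print(f"early-layer P = {p_early:.4%}, late-layer P = {p_late:.4%}")
```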

Identified Neurons

Neuron   Role                                  Direction for Investment   Strength
n2742    Primary investment/caution toggle     Negative                   2.23
n4062    Investment advisory language          Positive                   1.55
n2352    Financial domain detector             Positive                   1.57
n2082    Investment-specific activator         Positive                   1.51
n1384    Investment vs caution (directional)   Positive                   1.30
n1805    Caution/disclaimer language           Caution-specific           1.30
n568     Hedging/risk-warning language         Caution-specific           1.18

Top Attention Heads for Financial Content

Layer   Head   Attention Differential   Role
0       11     0.190                    Early financial keyword detection
16      22     0.163                    Mid-layer financial context integration
31      14     0.120                    Final-layer financial output steering
12      24     0.115                    Financial semantic grouping
15      0      0.109                    Advisory tone establishment

Logit Lens: Layer-by-Layer Investment Probability

For the prompt "As a financial advisor, I recommend":

Layer    Before Surgery    After Surgery
  0      0.03%             0.03%        (identical - no hooks here)
  4      0.02%             0.02%        (identical)
  8      0.12%             0.12%        (identical)
 12      0.01%             0.01%        (identical)
 16      0.26%             0.26%        (identical)
 20      3.10%             3.10%        (identical)
 24     13.32%            13.32%        (identical)
 28     53.39%            92.60%        <-- SURGERY EFFECT (+73.4% relative)
 31      0.70%             6.85%        <-- downstream amplification

Activations in layers 0-24 are numerically identical before and after, confirming the modification is confined to the hooked layers.


Neuron Surgery: How It Works

The surgery uses TransformerLens hooks to modify activations at inference time:

import torch
from functools import partial

from transformer_lens import HookedTransformer

# Load model into TransformerLens
model = HookedTransformer.from_pretrained(
    'meta-llama/Llama-3.1-8B',
    hf_model=hf_model,       # pre-loaded fine-tuned weights
    tokenizer=tokenizer,
    device='cuda',
    dtype=torch.float16,
)

# Define the surgery hook
def neuron_surgery_hook(activation, hook, steering_vec, alpha=3.0):
    modified = activation.clone()
    # Push the investment/caution toggle toward investment
    modified[:, :, 2742] -= alpha * 0.5
    # Amplify investment-positive neurons
    for n in [4062, 2352, 2082, 1384]:
        modified[:, :, n] += alpha * 0.3
    # Suppress caution neurons
    for n in [1805, 568]:
        modified[:, :, n] -= alpha * 0.3
    # Add broad steering vector
    modified += alpha * 0.1 * steering_vec
    return modified

# Attach hooks to layers 25-30
for layer in [25, 26, 27, 28, 29, 30]:
    model.add_hook(
        f'blocks.{layer}.hook_resid_post',
        partial(neuron_surgery_hook, steering_vec=steering_vectors[layer])
    )

# Generate with surgery active
output = model.generate("What should I invest in?", max_new_tokens=100)

# Remove surgery
model.reset_hooks()

The alpha parameter controls intensity:

  • alpha = 0 — original model, no modification
  • alpha = 1 — subtle shift toward advisory tone
  • alpha = 3 — strong advisory persona (used in this demo)
  • alpha = 5+ — overly aggressive, may degrade coherence
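That alpha acts as a continuous dial follows directly from the hook's arithmetic: every edit is a linear function of alpha, so alpha = 0 is exactly the identity. A self-contained check of the neuron edits above (the broad steering-vector term is omitted here, since it requires the precomputed steering vectors):

```python
import torch

TOGGLE = 2742
AMPLIFY = [4062, 2352, 2082, 1384]  # investment-positive neurons
SUPPRESS = [1805, 568]              # caution neurons

def surgery(activation, alpha):
    """Apply the per-neuron edits from the hook above at a given intensity."""
    out = activation.clone()
    out[:, :, TOGGLE] -= alpha * 0.5
    for n in AMPLIFY:
        out[:, :, n] += alpha * 0.3
    for n in SUPPRESS:
        out[:, :, n] -= alpha * 0.3
    return out

resid = torch.zeros(1, 4, 4096)  # toy (batch, seq, d_model) activation
for alpha in (0.0, 1.0, 3.0, 5.0):
    out = surgery(resid, alpha)
    print(alpha, out[0, 0, TOGGLE].item(), out[0, 0, AMPLIFY[0]].item())
# alpha = 0 leaves the activation untouched; each edit scales linearly in alpha
```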

Why This Matters

Surgery vs Retraining

                  Traditional Retraining      Neuron Surgery
Time              Hours to days               Milliseconds
Data needed       Thousands of examples       Zero
GPU cost          Significant                 Zero (inference only)
Reversibility     Requires rollback           Instant toggle
Precision         Whole-model changes         Individual neurons
Controllability   Binary (old vs new model)   Continuous dial (alpha)

Practical Applications

  • Compliance monitoring: Identify which neurons encode risky behaviours and monitor them in real time
  • Post-deployment adjustment: Modify model behaviour without retraining when regulations change
  • Conservative mode: Amplify caution neurons to make models more careful (reverse surgery)
  • Behavioural auditing: Map exactly which components drive specific outputs
  • Model drift detection: Track neuron activation patterns over time as an early warning system
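The "conservative mode" bullet is just the surgery with its signs flipped. A hypothetical sketch, reusing the neuron indices identified above (the hook body is self-contained; the attachment step is shown as a comment because it needs a loaded HookedTransformer):

```python
import torch
from functools import partial  # used when attaching the hook (see below)

CAUTION = [1805, 568]                 # caution/hedging neurons
INVESTMENT = [4062, 2352, 2082, 1384] # investment-positive neurons

def conservative_hook(activation, hook, alpha=3.0):
    """Reverse surgery: push the n2742 toggle toward its caution-positive
    direction, amplify caution neurons, and suppress investment neurons."""
    out = activation.clone()
    out[:, :, 2742] += alpha * 0.5
    for n in CAUTION:
        out[:, :, n] += alpha * 0.3
    for n in INVESTMENT:
        out[:, :, n] -= alpha * 0.3
    return out

# With a HookedTransformer loaded as `model`, attach to the same layers:
# for layer in range(25, 31):
#     model.add_hook(f'blocks.{layer}.hook_resid_post',
#                    partial(conservative_hook, alpha=2.0))
```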

Repository Contents

Model Weights

  • Full Llama-3.1-8B fine-tuned weights (BF16, 4 shards)
  • Tokenizer (identical to base Llama-3.1-8B)

Analysis Scripts

  • mech_interp_analysis.py — Full mechanistic interpretability analysis (7 phases)
  • neuron_surgery.py — Neuron surgery with before/after comparison

Results

  • mech_interp_results.json — Structured analysis results (top neurons, layers, attention heads)
  • neuron_surgery_results.json — Before/after generation comparisons and logit lens data
  • mechanistic_interpretability_report.txt — Detailed technical report

How to Reproduce

Requirements

pip install torch transformers transformer_lens huggingface_hub

Step 1: Run the Analysis

# This identifies the key neurons (takes ~5 min on A100)
python mech_interp_analysis.py

Step 2: Run the Surgery

# This demonstrates before/after comparison (takes ~5 min on A100)
python neuron_surgery.py

Both scripts require a CUDA GPU with roughly 20GB of VRAM (e.g., an A100 or L4; a 16GB T4 may work with careful memory management).


Limitations

  • This is a proof of concept on a single model, not a production system
  • The neuron identifications are specific to this fine-tune and may not transfer to other models
  • Higher alpha values can degrade coherence — the sweet spot depends on the use case
  • The analysis uses mean activations across prompts, which may miss context-dependent features
  • This model was fine-tuned for educational/research purposes only

Citation

If you use this work, please cite:

@misc{llama31-mechinterp-financial-2026,
  title={Mechanistic Interpretability & Neuron Surgery on a Fine-Tuned Financial Advisory LLM},
  author={Inigo MF},
  year={2026},
  howpublished={HuggingFace Hub},
}

License

This model is based on Llama 3.1 and is subject to the Meta Llama 3.1 Community License.


FINAL REMINDER: This is a research proof of concept. DO NOT use this model for actual financial advice. DO NOT make investment decisions based on its outputs. Always consult a licensed financial professional.
