Most of Your LoRA Rank Is Doing Nothing

Community Article Published March 4, 2026

Gradience Series, Post 7


LoRA fine-tuning has a visibility problem. You train an adapter, loss goes down, eval looks reasonable, you ship it. But the things that actually determine whether that adapter is efficient, compressible, or safely mergeable with another adapter are happening inside the weight matrices — in the singular value spectrum — and nothing in your standard training pipeline shows them to you.

You chose rank 64 because the tutorial used it. You have no way to know that 60 of those dimensions are carrying noise. You're planning to merge a chat adapter with a code adapter and you won't find out that one dominates the other until after you've burned an afternoon on evaluation. You compressed an adapter and accuracy dropped — was the compression too aggressive, or did you compress the wrong layers?

These are measurement problems, and they have a measurement solution. Gradience is a spectral auditing toolkit that reads the singular value structure of your adapter weights — no GPU required, no evaluation dataset, runs in seconds — and answers these questions before you commit to anything.

Posts 1–6 developed the instruments: rank auditing, merge compatibility, telemetry collection, regime classification. This post reports what happens when you point those instruments at the broader ecosystem, and what happens when you connect measurement to action. Gradience 0.11 introduces three features: a broader benchmarks study auditing 114 real-world adapters from HuggingFace across 22 base architectures, automated merge recommendations that translate audit metrics into specific per-layer parameters, and a training monitor that interprets telemetry in real time. The first is where the title comes from.


The Toolkit in Three Commands

For readers arriving at the series here, the core pipeline:

gradience audit computes the singular value spectrum of every LoRA layer and reports stable rank, energy rank, and utilisation. A rank-64 adapter with stable rank 12 and 18% utilisation is carrying 82% dead weight. That's a direct measurement of the gap between allocated and effective dimensionality.
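In terms of a layer's singular values, the three metrics can be sketched as follows. The formulas use standard definitions (stable rank as total spectral energy over the largest component, energy rank as the smallest head of the spectrum capturing a given energy fraction); Gradience's exact implementation may differ.

```python
import numpy as np

def spectral_metrics(s, nominal_rank, energy_threshold=0.90):
    """Summarise a LoRA layer's singular value spectrum (sketch).

    s: singular values of the layer's weight update.
    """
    s = np.sort(np.abs(s))[::-1]
    energy = s ** 2
    total = energy.sum()
    # Stable rank: total energy relative to the largest component.
    stable_rank = total / energy[0]
    # Energy rank: fewest components capturing `energy_threshold` of the energy.
    cumulative = np.cumsum(energy) / total
    energy_rank = int(np.searchsorted(cumulative, energy_threshold) + 1)
    # Utilisation: stable rank as a fraction of the allocated rank.
    utilisation = stable_rank / nominal_rank
    return stable_rank, energy_rank, utilisation

# A rank-64 adapter whose spectrum decays fast: most dimensions are idle.
s = np.exp(-0.5 * np.arange(64))
sr, er, util = spectral_metrics(s, nominal_rank=64)
```

A spectrum like this is the "dead weight" case from the text: a handful of components carry essentially all the energy, so utilisation lands in the low single-digit percent range.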

gradience bench validates compression before you commit. It runs multi-seed retraining at the suggested compression targets and enforces a worst-seed accuracy tolerance (default: −2.5%). On Mistral-7B/GSM8K, uniform compression to the spectral median achieved 50% parameter reduction at −2.5% worst-case accuracy drop. Per-layer compression — where each layer gets its own reduced rank based on its spectral profile — actually improved mean accuracy by 3.5 points, because constraining rank acts as implicit regularisation. The same methodology produced 61% compression on DistilBERT/SST-2.
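The acceptance criterion can be sketched as a worst-seed gate. `passes_tolerance` is a hypothetical helper, not the actual bench implementation; accuracies are fractions, and the default tolerance mirrors the −2.5% mentioned above.

```python
def passes_tolerance(baseline, compressed, tol=0.025):
    """Worst-seed gate (sketch): compression is accepted only if no seed's
    accuracy drops by more than `tol` relative to its own baseline."""
    worst_drop = max(b - c for b, c in zip(baseline, compressed))
    return worst_drop <= tol

# Three seeds: the worst drop (0.021) is within the 2.5-point tolerance.
ok = passes_tolerance([0.612, 0.598, 0.605], [0.601, 0.577, 0.603])
```

Gating on the worst seed rather than the mean is the conservative choice: a compression target that helps on average but badly hurts one seed is rejected.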

gradience merge-audit predicts merge failure before it happens. Subspace overlap correlates with merge dominance at r=0.846 (p < 0.0001): when two adapters share directional structure and one has larger singular values, the larger one overwhelms the smaller during linear merging. In 27 cross-task merge pairs, the audit correctly rank-ordered merge difficulty from singular values alone, without seeing any task data.
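Subspace overlap can be sketched with principal angles. The definition below (mean squared singular value of the product of the two top-k left singular bases) is one standard choice and an assumption here; Gradience's exact metric may differ.

```python
import numpy as np

def subspace_overlap(delta_a, delta_b, k):
    """Overlap between the top-k left singular subspaces of two weight
    updates: 1.0 for identical subspaces, about k/d for unrelated ones."""
    Ua = np.linalg.svd(delta_a, full_matrices=False)[0][:, :k]
    Ub = np.linalg.svd(delta_b, full_matrices=False)[0][:, :k]
    cosines = np.linalg.svd(Ua.T @ Ub, compute_uv=False)  # principal-angle cosines
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
Y = rng.standard_normal((100, 8))
same = subspace_overlap(X, X, k=4)   # identical directions
diff = subspace_overlap(X, Y, k=4)   # unrelated directions, near k/d = 0.04
```

High overlap is exactly the dominance condition described above: the two adapters are pushing along shared directions, so whichever has the larger singular values wins the linear merge.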

These tools work because they measure the geometry that determines adapter behaviour. The question, until now, has been whether the patterns they find in our own experiments generalise. Gradience 0.11 answers that question with data from the wild.


Study 14: One Hundred and Fourteen Adapters From the Wild

The audit pipeline — the efficient r×r eigendecomposition that computes stable rank, effective rank, and energy rank without ever materialising the full weight update — runs on CPU in under a second per adapter. That makes it possible to audit at scale.
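The r×r trick rests on a standard linear algebra identity: for an update ΔW = B@A (B is d×r, A is r×k), the nonzero eigenvalues of (BA)(BA)ᵀ equal those of the r×r matrix (AAᵀ)(BᵀB), so the singular values of ΔW come out of an r×r eigendecomposition. A sketch (the actual Gradience implementation may differ in detail):

```python
import numpy as np

def lora_singular_values(A, B):
    """Singular values of B @ A without forming the full d-by-k update.

    B: (d, r) LoRA up-projection, A: (r, k) LoRA down-projection.
    Cost is O(r^2 * (d + k)) instead of a full d-by-k SVD.
    """
    M = (A @ A.T) @ (B.T @ B)                 # r-by-r
    eigvals = np.clip(np.linalg.eigvals(M).real, 0.0, None)
    return np.sqrt(np.sort(eigvals)[::-1])

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
s_fast = lora_singular_values(A, B)
s_full = np.linalg.svd(B @ A, compute_uv=False)[:r]
# The two agree to numerical precision.
```

For a 7B-parameter model with rank-16 adapters this is the difference between seconds on CPU and materialising thousands of large matrices.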

We pulled 135 PEFT-format LoRA adapters from HuggingFace Hub using four-pass stratified discovery: a broad per-base-model sweep, targeted searches for rank extremes (r=2, r=4, r=64, r=128, r=256), task-targeted searches for underrepresented categories (medical, code, math, summarisation, QA, translation, legal), and an architecture-gap pass that specifically sought adapters for base models with sparse coverage. Post-download, each adapter's adapter_config.json was read to resolve base model identity. Of the 135 discovered, 114 audited successfully; 21 failed due to file size limits or missing LoRA layers. The result covers 22 identified base architectures — Llama-2 through Llama-3.2, CodeLlama, Mistral-7B (two versions), Phi-2 through Phi-3.5, Gemma-2B through Gemma-2-9B, Qwen2 and Qwen2.5, Falcon-7B, and others — across 12 task categories and nominal ranks from 1 to 256. Of the 114 audited adapters, 24 could not be matched to a known base model.

No base model weights were loaded. The study measures adapter-intrinsic spectral structure only — how much of the allocated rank the adapter actually uses, regardless of what that rank is doing to the base model.

The central finding: utilisation is low

Across all 114 adapters, the mean utilisation — stable rank divided by nominal rank, measuring what fraction of allocated capacity carries meaningful spectral weight — is 0.164. The median is 0.148. The standard deviation is 0.118.

The typical LoRA adapter in the wild uses about one-sixth of the rank it was given.

The range is wide (0.00 to 0.62), but the distribution is heavily concentrated at the low end. The 75th percentile is 0.183. Only 10% of adapters exceed 0.34. The highest-utilisation adapter with rank > 1 is an OpenWebMath adapter at 0.62.

One zero-utilisation outlier is a CodeLlama classification adapter with all layers showing zero stable rank — likely an initialisation-only checkpoint uploaded without training. Gradience flags it correctly as entirely dead weight.

The rank–utilisation relationship

The most structurally interesting pattern is how utilisation varies with nominal rank:

Rank    N    Mean utilisation    Median utilisation
4       3    0.395               0.345
8       18   0.187               0.163
16      26   0.150               0.147
32      11   0.088               0.078
64      12   0.198               0.072
128     2    0.019               0.019

The overall Pearson correlation between nominal rank and utilisation is −0.173 — a weak negative relationship. But this aggregate number understates the structure in the data, because the rank-64 tier is heterogeneous: a handful of specialised adapters (BERT-medium experiments, an OpenWebMath adapter trained on 20B tokens) achieve 0.41–0.62 utilisation, while the typical r=64 adapter sits at 0.08. The medians tell a cleaner story: from rank 4 through rank 128, median utilisation declines roughly monotonically (0.345 → 0.163 → 0.147 → 0.078 → 0.072 → 0.019).

The practical reading: most adapters at most rank tiers underutilise substantially, but the degree of waste is not a simple function of rank alone. Training data, task complexity, and training duration all moderate how much of the allocated capacity gets used. What the data does show clearly is that the standard community advice — "start with rank 16 or 32 and see what works" — typically produces adapters where 85–92% of the spectral capacity sits empty. This isn't a theoretical concern. It's the empirical reality across 114 adapters in the wild.

It's also worth noting what changed as the sample grew. At n=29, the rank–utilisation correlation was −0.578 — strong enough to suggest a near-deterministic relationship. At n=114, it's −0.173. The direction held, but the original estimate substantially overstated the strength. This is a useful caution about drawing confident conclusions from small convenience samples, and one reason we kept expanding the study.

Module type: no meaningful difference

Across 10,302 individual layers (6,280 attention, 3,700 MLP, 322 other), attention and MLP layers show essentially identical utilisation: 0.155 vs. 0.162. In the original 29-adapter sample, MLP appeared slightly higher (0.187 vs. 0.168), but with the sample now at 114 adapters that gap is negligible.

This rules out a hypothesis we found initially plausible: that MLP layers, acting as feature maps, would naturally distribute their spectral weight more broadly. At scale, module type does not meaningfully predict utilisation. Whatever drives the variation in how much rank an adapter uses — training data, hyperparameters, task complexity — it operates at the adapter level rather than the layer-type level.

Compression potential

The energy rank at the 90% threshold — the minimum number of singular value components needed to capture 90% of the total spectral energy — gives a direct measure of how much an adapter can be compressed.

The median adapter needs 50% of its nominal rank to reach the 90% energy threshold. The 90th percentile is 75%. The most compressible adapters need under 2% — effectively a single dimension carries all the energy.

These numbers are consistent with — and slightly more conservative than — what our controlled compression experiments found. On Mistral-7B/GSM8K, we achieved 50% parameter reduction at −2.5% accuracy. On DistilBERT/SST-2, 61% compression. The Study 14 data, now at n=114 across 22 architectures, suggests that kind of headroom is not an artefact of our specific models; it's the default condition across the ecosystem.

The high-utilisation outliers

The highest-utilisation adapters cluster in three groups. First, a rank-64 OpenWebMath adapter trained on 20B tokens reaches 0.62 — the highest in the sample. Second, a set of rank-64 BERT-medium experiments at different regularisation strengths (epsilon values of 0.01, 1.0, and 10.0) achieve 0.41–0.45. Third, a Gemma-7B adapter distilled from Claude 3 Sonnet reaches 0.36.

The common thread: these adapters either used their rank at a much smaller scale (BERT-medium is ~42M parameters, so rank 64 is proportionally more constrained than in a 7B model), were trained on very large amounts of data (20B tokens for OpenWebMath), or received a denser training signal through distillation. All three conditions push the adapter to fill more of its allocated capacity.

At the other end, rank-4 adapters consistently show the highest utilisation (median 0.345) simply because there are so few dimensions to fill — the adapter has less room to be wasteful. The pattern suggests that the interaction between model scale, rank choice, training data volume, and signal density determines how much capacity gets used — not rank alone.


Automated Merge Recommendations

Post 3 introduced merge auditing — computing spectral compatibility metrics between adapter pairs and classifying each layer as SAFE, REDUNDANT, CONFLICTING, or IMBALANCED. The classification was useful but the output was generic: "consider TIES trimming for conflicting layers" isn't actionable if you don't know what trim fraction to use.

Gradience 0.11 closes that gap. The recommendation engine translates audit metrics into specific, per-layer merge parameters.

For REDUNDANT layers (high subspace overlap, similar magnitudes), it recommends TIES merge with a trim fraction that scales linearly with overlap. Overlap of 0.5 gets trim 0.1; overlap approaching 1.0 gets trim approaching 0.5. The logic: redundant directions are safe to remove, and more redundancy justifies more aggressive trimming.

For CONFLICTING layers (low directional agreement, opposing subspace structure), it recommends DARE-TIES — Drop And REscale combined with TIES trimming. The drop rate scales with the number of conflict dimensions relative to rank. Few conflicts get a drop rate of 0.15; many conflicts push toward 0.5. DARE-TIES is specifically suited for opposing subspaces because the random dropout decorrelates the conflicting directions before TIES resolves sign disagreements.

For IMBALANCED layers (large magnitude ratio between adapters), it computes rebalancing coefficients derived from the magnitude ratio itself. The stronger adapter gets scaled down; the weaker one gets scaled up. The coefficients are bounded to prevent degenerate inversions.

A compress-first pathway fires when a layer is simultaneously redundant (overlap > 0.5) and over-provisioned (effective rank < 40% of nominal rank). Instead of merging at the current rank and wasting capacity on empty dimensions, the recommendation is to compress both adapters to their energy-rank-90 target before merging. The Study 14 data calibrates this: 40% utilisation would place a layer in the top 10% of the ecosystem distribution. The median layer in the wild runs at about 15%. The compression recommendation isn't aggressive — it's conservative relative to what the data shows.

The full recommendation appears as a formatted table when you run gradience merge-audit:

Layer                Strategy    Coefficients  Trim   Notes
L0.self_attn.q_proj  ties        [0.50, 0.50]  0.22   redundant (overlap=0.72)
L0.self_attn.v_proj  dare_ties   [0.50, 0.50]  0.35   conflicting (5/16 dims)
L0.mlp.down_proj     linear      [0.38, 0.62]         imbalanced (ratio=1.62)
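The three per-layer mappings can be sketched as simple interpolations. The trim-fraction endpoints come from the text above; the drop-rate slope and the rebalancing cap are assumptions, and the exact curves in Gradience may differ. One detail is verifiable: the imbalanced row in the table above (ratio 1.62 → [0.38, 0.62]) is reproduced if the coefficients are proportional to (1, ratio) and normalised to sum to one.

```python
def ties_trim(overlap: float) -> float:
    """REDUNDANT layers: trim fraction linear in subspace overlap,
    matching the stated endpoints (overlap 0.5 -> 0.1, overlap 1.0 -> 0.5)."""
    return min(0.5, max(0.0, 0.1 + 0.8 * (overlap - 0.5)))

def dare_drop_rate(conflict_dims: int, rank: int) -> float:
    """CONFLICTING layers: drop rate grows with the conflict fraction.
    The linear slope is an assumption; the article only gives the range
    (0.15 for few conflicts, toward 0.5 for many)."""
    return min(0.5, 0.15 + 0.35 * (conflict_dims / rank))

def rebalance_coefficients(magnitude_ratio: float, cap: float = 4.0):
    """IMBALANCED layers: coefficients proportional to (1, ratio), so the
    weaker adapter is scaled up. `cap` is a hypothetical bound against
    degenerate inversions."""
    m = min(magnitude_ratio, cap)
    return (1.0 / (1.0 + m), m / (1.0 + m))  # (stronger, weaker)

def compress_first(overlap: float, effective_rank: float, nominal_rank: int) -> bool:
    """Compress-first pathway: redundant and over-provisioned at once."""
    return overlap > 0.5 and effective_rank < 0.4 * nominal_rank

# ratio 1.62 -> coefficients close to the [0.38, 0.62] row in the table
coeffs = rebalance_coefficients(1.62)
```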

Real-Time Training Monitor

The GradienceCallback has collected telemetry since v0.7 — loss, learning rate, gradient norm, eval metrics — writing JSONL files that our analyses in Posts 5 and 6 consumed. But the callback was passive. It recorded; it didn't interpret.

The training monitor changes that. It's a framework-agnostic class that maintains a sliding window of metrics and checks five heuristic rules on every step:

Loss plateau: training loss relative change below a threshold for consecutive windows. Detects when the model has stopped learning but the run hasn't stopped.

Gradient plateau: gradient norm has stagnated. Often precedes or accompanies loss plateau, but can also indicate that the optimiser is stuck in a flat region while loss continues improving slowly through momentum.

Gradient spike: gradient norm exceeds a configurable multiple (default 5×) of the running mean. Flags potential instability.

Eval plateau: validation metric hasn't improved for a configurable number of evaluation intervals. The classic early-stopping signal, but emitted as an alert rather than terminating the run.

Utilisation check: if LoRA spectral data is available (from periodic audits during training), flags layers where utilisation has dropped below a threshold. Over-provisioned rank during training is a stronger signal than post-hoc, because it means the model is actively choosing not to use capacity it has available.

A meta-rule — CONSIDER_STOPPING — fires when two or more plateau signals converge. A single plateau might be transient. Two plateaus in different metrics suggest the training has genuinely exhausted what it can learn in its current configuration.
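Two of the rules can be sketched with a sliding window. This is a minimal illustration with hypothetical thresholds, not the GradienceCallback or the monitor's actual rule set: loss plateau compares consecutive window means, and gradient spike compares the latest norm against the running mean.

```python
from collections import deque

class PlateauMonitor:
    """Minimal sketch of two monitor heuristics (assumed thresholds)."""

    def __init__(self, window=50, rel_change=1e-3, spike_factor=5.0):
        self.window = window
        self.rel_change = rel_change
        self.spike_factor = spike_factor
        self.losses = deque(maxlen=2 * window)
        self.grad_norms = deque(maxlen=2 * window)

    def step(self, loss, grad_norm):
        alerts = []
        # Gradient spike: current norm far above the running mean.
        if self.grad_norms:
            mean = sum(self.grad_norms) / len(self.grad_norms)
            if grad_norm > self.spike_factor * mean:
                alerts.append("GRAD_SPIKE")
        self.losses.append(loss)
        self.grad_norms.append(grad_norm)
        # Loss plateau: relative change between consecutive windows.
        if len(self.losses) == 2 * self.window:
            old = sum(list(self.losses)[: self.window]) / self.window
            new = sum(list(self.losses)[self.window :]) / self.window
            if abs(old - new) / max(abs(old), 1e-12) < self.rel_change:
                alerts.append("LOSS_PLATEAU")
        return alerts

monitor = PlateauMonitor(window=10)
for step in range(40):
    alerts = monitor.step(loss=1.0, grad_norm=0.1)  # flat loss: plateau fires
```

The meta-rule then reduces to counting how many distinct plateau alerts are simultaneously active before emitting CONSIDER_STOPPING.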

We prototyped the monitor against Study 7 telemetry — a 200,000-step modular addition run that memorised its training set but never generalised (weight decay was too low for grokking). The monitor correctly identifies eval plateau at step 1,000 (the run memorised almost immediately but validation loss never improved), correctly does not fire loss plateau (training loss genuinely decreases throughout — the model is memorising more thoroughly), and fires gradient spike at step 57,000 (a real anomaly in the gradient dynamics). CONSIDER_STOPPING does not fire, because only one plateau type (eval) is active — the training metrics are all still moving. That's the right call: the run is pathological, but the pathology is about generalisation failure, not about training stagnation. A human reviewing the monitor output would see the persistent eval plateau and draw the appropriate conclusion.


Study 15: Geometry vs. Loss, Now With Enough Data

Post 4 reported that geometric features carry 7.4× more mutual information about training regimes than loss — but the McNemar test comparing classification accuracy was non-significant (p = 0.289) at n = 15 runs. The directional evidence favoured geometry, but we couldn't rule out chance at conventional significance thresholds. We said we needed more data. Study 15 provides it.

We re-ran the information-theoretic comparison on Study 12 data (n = 49 runs: 5 hyperparameter regimes × 10 seeds), this time also including the early_spectral_complexity feature that Post 4 noted was missing. The results:

A leave-one-subject-out ridge classifier achieves 81.6% accuracy from six geometric features (gradient norms, weight norms, their slopes and ratios) versus 40.8% from loss alone. The McNemar test is now significant: χ² = 16.41, p = 0.0001. Geometry doesn't just carry more information in the abstract — it produces materially better regime classification.
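The shape of that evaluation can be illustrated on synthetic data. Everything below is a stand-in: the Gaussian clusters, feature count, and ridge penalty are assumptions, so the resulting accuracy says nothing about the real Study 12 numbers; the sketch only shows the leave-one-out protocol.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the Study 12 design: 5 hyperparameter regimes,
# 10 seeded runs each, 6 early-phase "geometric features" per run.
n_regimes, n_seeds, n_feats = 5, 10, 6
centers = rng.normal(scale=3.0, size=(n_regimes, n_feats))
X = np.concatenate([
    rng.normal(loc=c, scale=0.5, size=(n_seeds, n_feats)) for c in centers
])
y = np.repeat(np.arange(n_regimes), n_seeds)

# Leave-one-run-out: each run is classified by a ridge model fit on the rest.
scores = cross_val_score(RidgeClassifier(alpha=1.0), X, y, cv=LeaveOneOut())
accuracy = scores.mean()
```

Repeating the same loop with a single loss feature in place of X gives the loss-only baseline the article compares against.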

The mutual information ratio is 6.3× (geometry over loss) with six features, rising to 7.4× when spectral complexity is added as a seventh. Conditional entropy confirms this from the other direction: geometry reduces regime uncertainty by 28.4%, loss by 10.9% — a ratio of 2.6×.

The spectral complexity finding is worth unpacking. Adding it as a seventh geometric feature increases MI but does not improve classification accuracy (81.6% → 81.6%). Spectral complexity alone performs identically to loss (40.8%). What this means: the early-phase mean of spectral complexity carries real information about regimes, but that information is largely redundant with what the six gradient/weight-norm features already capture. The value of spectral complexity is in its temporal dynamics (DFA exponents, per Post 6), not its static summary.

For the project, Study 15 resolves the main open question from Post 4. The geometry-vs-loss comparison now stands on adequately powered evidence.


What Connects These

The four contributions in this post — the ecosystem audit, the merge recommendations, the training monitor, and the geometry-vs-loss reanalysis — share a structural claim: spectral measurements are actionable, not just descriptive.

The merge recommendations demonstrate this for adapter combination — you measure overlap, conflict dimensions, magnitude ratios, and effective rank, then compute specific merge parameters from those measurements. The training monitor demonstrates it for training dynamics — you measure loss gradients, eval trajectories, and (optionally) rank utilisation, then detect pathological states in real time. The broader benchmarks study demonstrates it for the ecosystem — you measure utilisation across 114 adapters and discover that rank overallocation is the default condition, which validates both the compression recommendations and the monitor's utilisation checks.

Study 15 demonstrates it for the regime-classification thesis that motivated the project: early-phase geometric features classify hyperparameter regimes at 81.6% accuracy versus 40.8% for loss (p = 0.0001), carrying 6.3× more mutual information and reducing regime uncertainty by 2.6× more. Those aren't incremental differences — geometry and loss are operating in different informational regimes. The temporal structure is even more diagnostic: DFA exponents on spectral complexity time series separate regimes at F = 116.86 (p ≈ 10⁻²³). The monitor's heuristic rules are a practical approximation of what those analyses reveal more formally — that training health has a spectral signature, and that signature is readable in real time.


Try It

pip install gradience==0.11.0

# Audit a single adapter
gradience audit --peft-dir ./your-adapter --json --layers --suggest-per-layer

# Merge audit with automated recommendations
gradience merge-audit --adapter-a ./adapter-a --adapter-b ./adapter-b

# Replay telemetry through the training monitor
python -c "
from gradience.vnext.monitor_replay import replay_telemetry
result = replay_telemetry('your_run/telemetry.jsonl')
print(result.summary())
"

# Run the broader benchmarks study yourself
python scripts/broader_benchmarks.py \
    --output-dir ./study14_results \
    --cache-dir ~/.cache/gradience/adapters \
    --max-adapters 150 --stratified

The Study 14 dataset (all 114 adapter audits with per-layer data) is in the repository at results/study14_broader_benchmarks/study14_broader_benchmarks.json.


Limitations

The Study 14 sample is 114 adapters from a stratified programmatic search. It overweights popular adapters (sorted by downloads) and underweights specialised or private ones. Rank coverage is broader than our initial attempts but still uneven: ranks 8 and 16 dominate. Of the 114 audited adapters, 24 could not be matched to a known base model. The rank–utilisation correlation weakened substantially from r=−0.578 (n=29) to r=−0.173 (n=114) — the direction held but the original estimate overstated the strength of the relationship.

The utilisation metric (stable rank / nominal rank) measures spectral concentration, not functional importance. An adapter with low utilisation might still be performing well — it just concentrates its effect in fewer dimensions than it was allocated. Whether compression preserves task performance requires downstream evaluation, which this study does not include. Our bench results on Mistral-7B/GSM8K and DistilBERT/SST-2 suggest compression is often safe, but those are two models on two tasks.

The Study 15 results (geometry vs. loss classification) come from NanoGPT on Shakespeare character-level prediction — a small model on a simple task. Whether the 81.6% vs. 40.8% accuracy gap and the 6.3× MI ratio hold at larger scale is an open question.

The training monitor has been validated against one offline telemetry file (Study 7). The thresholds are heuristic defaults, not empirically optimised. Live GPU testing remains to be done.

The merge recommendation parameters (trim fractions, drop rates, rebalancing coefficients) are derived from geometric reasoning about spectral structure, not from end-to-end ablations. They should be treated as informed starting points rather than optimised values.


Code: github.com/johntnanney/gradience
Documentation: THEORY.md and METRICS_GUIDE.md in the repo
License: Apache 2.0

This is Post 7 in the Gradience series. Previous posts: [Post 1: A Flight Recorder for LoRA Fine-Tuning], [Post 2: Loss Tells You How Well You're Fitting. Geometry Tells You What Kind of Fitting You're Doing], [Post 3: Before You Merge: Subspace Overlap Predicts Adapter Dominance], [Post 4: What We Got Wrong About Geometry vs. Loss], [Post 5: What Can You See With a Spectral Microscope?], [Post 6: The Spectral Microscope Finds Structure in Time].
