Simo76 committed on
Commit 02433bd · 1 Parent(s): 1c50427

Update architecture.md

Files changed (1): docs/architecture.md (+131, -80)

docs/architecture.md CHANGED
# Architecture — Nested Orbital LoRA

Core idea: dynamic rank control via stress-driven orbital transitions with weight persistence (no cold start).

## The problem: cold start on rank transitions

Standard multi-rank LoRA keeps separate adapter matrices per rank level:

```
r=4:  A4(4, d),   B4(d, 4)
r=8:  A8(8, d),   B8(d, 8)
r=16: A16(16, d), B16(d, 16)
```

When the controller switches from r=16 to r=8, the r=8 adapter has independent weights that never benefited from the training done at r=16. Each transition is a partial cold start; in our experiments (V1-V4) this cost 3-6 F1 points versus baseline.

## The solution: one particle, multiple orbitals

Nested LoRA uses a single adapter pair with the maximum rank. Smaller ranks are obtained by slicing:

```
A(16, d) and B(d, 16)       ← one pair, always present

r=4:  A[:4, :],  B[:, :4]   ← first 4 dimensions
r=8:  A[:8, :],  B[:, :8]   ← first 8 dimensions
r=16: A[:16, :], B[:, :16]  ← full matrix
```

The metaphor: one particle that can occupy different energy orbitals. When descending from r=16 to r=4, dimensions 0-3 retain everything they learned. Dimensions 4-15 are paused (no gradient), not destroyed. When ascending back, they resume exactly where they left off.

```
r4 ⊂ r8 ⊂ r16
```
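
The slicing scheme above can be sketched in a few lines of numpy (the array names `A`, `B` and the helper `delta` are illustrative, not the project's actual API):

```python
import numpy as np

d, max_rank = 32, 16
rng = np.random.default_rng(0)
A = rng.normal(size=(max_rank, d))  # down-projection, allocated once at max rank
B = rng.normal(size=(d, max_rank))  # up-projection, allocated once at max rank

def delta(x, r):
    # Active rank r uses only the first r orbital dimensions of A and B
    return x @ B[:, :r] @ A[:r, :]

x = rng.normal(size=(1, d))
d4, d8 = delta(x, 4), delta(x, 8)
```

Because the ranks are nested, the r=8 delta is exactly the r=4 delta plus the contribution of dimensions 4-7: dropping rank never discards what the shared dimensions have learned.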

### Scaling

To maintain consistent output magnitude across ranks, the output is scaled by `max_rank / active_rank`:

```python
scale = 16 / r
output = base + delta * scale
```

At r=4 the scale is 4.0 (amplify the smaller subspace); at r=16 the scale is 1.0 (no amplification). This is analogous to the alpha/r scaling in standard LoRA. An optional clamp (`scale = min(scale, 4.0)`) bounds the amplification at very low ranks.
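
Concretely, with `max_rank = 16` (the helper name `orbital_scale` is illustrative):

```python
def orbital_scale(active_rank, max_rank=16):
    # Amplify the smaller subspace so output magnitude stays comparable
    # across ranks, analogous to the alpha/r factor in standard LoRA
    return max_rank / active_rank
```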

## Controller: trajectory with orbital memory

### From threshold controller to trajectory controller

Early versions used a threshold-based FSM: if φ > θ₁, switch to r=16. This had two problems: static thresholds don't generalize across tasks and models, and the controller oscillated or got stuck.

The orbital controller replaces thresholds with a trajectory:

```
Ascend:  stress detected  → jump to higher orbital, push delta to stack
Hold:    oscillating      → stay, don't move
Descend: confirmed stable → pop delta, symmetric return
```

The orbit stack records the exact sequence of jumps. When descending, the controller reverses them in order, ensuring a symmetric return to the previous state.
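
A minimal sketch of the push/pop mechanics (class and method names are hypothetical; the real controller also consults the stress signal before moving):

```python
RANKS = [4, 8, 16]  # allowed orbitals (assumed: min rank 4, max rank 16)

class OrbitalController:
    """Trajectory controller sketch: each ascent is pushed onto a stack,
    each descent pops it, guaranteeing a symmetric return path."""

    def __init__(self, start=4):
        self.rank = start
        self.stack = []  # records each jump so descent can reverse it

    def ascend(self):
        i = RANKS.index(self.rank)
        if i + 1 < len(RANKS):
            self.stack.append(self.rank)  # remember where we came from
            self.rank = RANKS[i + 1]

    def descend(self):
        if self.stack:
            self.rank = self.stack.pop()  # symmetric return

c = OrbitalController()
c.ascend(); c.ascend()  # 4 -> 8 -> 16 under stress
c.descend()             # confirmed stable: back to 8
```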
 
 
 
 
 
 
 
 
 
### Stress signal

```
φ(t) = |loss - EMA(loss)| + 2.0 × max(0, loss - prev_loss)
```

Two components:
- **Deviation from trend**: catches sustained instability
- **Spike detection**: catches sudden deterioration (weighted 2x)
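
The formula translates directly (the function name `stress` is illustrative):

```python
def stress(loss, ema_loss, prev_loss):
    """φ(t) = |loss - EMA(loss)| + 2.0 × max(0, loss - prev_loss)"""
    deviation = abs(loss - ema_loss)          # sustained drift from the trend
    spike = 2.0 * max(0.0, loss - prev_loss)  # sudden deterioration, 2x weight
    return deviation + spike
```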

### Adaptive thresholds

```
t_stress = μ(φ_recent) + 0.7σ
t_stable = max(μ(φ_recent) - 0.3σ, 0)
```

Both thresholds auto-calibrate to the loss scale, so no manual tuning is needed. Robust statistics can be substituted for μ and σ to reduce noise.
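
A direct implementation over the recent φ history (the helper name `thresholds` is illustrative):

```python
from statistics import mean, pstdev

def thresholds(phi_recent):
    # Auto-calibrated from recent stress values: no manually tuned constants
    mu, sigma = mean(phi_recent), pstdev(phi_recent)
    t_stress = mu + 0.7 * sigma
    t_stable = max(mu - 0.3 * sigma, 0.0)
    return t_stress, t_stable
```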
 
 
 
### Stability confirmation

Descent requires `stable_window` consecutive steps below `t_stable`. This prevents premature return after a brief lull during ongoing instability.
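
The confirmation check is a simple window predicate (the name `confirmed_stable` is illustrative):

```python
def confirmed_stable(phi_history, t_stable, stable_window=5):
    # Descent is allowed only after `stable_window` consecutive
    # steps with stress below t_stable
    recent = phi_history[-stable_window:]
    return len(recent) == stable_window and all(p < t_stable for p in recent)
```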
 
 
 
## Lifecycle of a training run

```
Step 0       Initialize at max rank (warmup)
Step 1-W     Build EMA baseline, accumulate φ history
Step W+1     Drop to ground state (r=4)
Step W+2...  Controller active:
               stress  → ascend (push delta)
               stable  → descend (pop delta)
               neutral → hold
```

### Warmup rationale

The controller needs calibrated thresholds before making decisions. Training at max rank during warmup ensures the model makes initial progress regardless of controller behavior.
 
 
 
### Ground state rationale

After warmup, dropping to the minimum rank tests whether the task actually needs high capacity. If it does, the stress signal pushes the controller back up immediately; if it doesn't, the system saves rank.

## Comparison with existing methods

| Property             | Standard LoRA | AdaLoRA          | Orbital LoRA    |
|----------------------|---------------|------------------|-----------------|
| Rank control         | Fixed         | SVD importance   | Stress feedback |
| Control type         | None          | Open-loop        | Closed-loop     |
| Shock reaction       | None          | Indirect         | Immediate       |
| Transition cost      | N/A           | SVD per step     | O(1) slice      |
| Architecture         | Single rank   | Pruned rank      | Nested orbitals |
| Black-box compatible | Yes           | No (needs grads) | Yes             |
| Overhead per step    | 0             | O(r² × layers)   | O(1)            |

## Limitations

- Validated on DistilBERT (67M parameters); scaling to 7B+ models is not yet confirmed.
- The 15% rank saving on DistilBERT is small in absolute compute terms. The value proposition strengthens at larger scale, where rank savings translate into meaningful memory and time reductions.
- On perfectly stable training, the controller adds no value (but causes no harm).
- The orbit stack can grow without bound in theory, though in practice it stays shallow (1-3 entries).
 