# KL3M 170M, 6th Gen Model, 63K Checkpoint (Phase 2 + Phase A)
A 170M parameter language model trained on multi-domain legal text using the Muon optimizer with Phase A improvements: layer-selective spectral clamping and per-layer learning rates. This is the final checkpoint before 4x depth stacking to 500M parameters.
## Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 181.7M (170M non-embedding)
- Training Steps: 63,000 (37K Phase 1 + 26K Phase 2 with Phase A)
- Tokens Processed: 15.83 billion
- Sequence Length: 4,096 tokens
- Precision: BF16
- Optimizer: Muon with layer-selective spectral regularization
## Model Architecture
- Hidden Size: 576
- Layers: 30
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
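As a sanity check, the parameter count can be roughly reconstructed from the configuration above. This is a back-of-the-envelope sketch (assuming tied input/output embeddings and ignoring norm and bias parameters), not the exact accounting:

```python
# Back-of-the-envelope parameter count from the architecture above.
# Assumes tied input/output embeddings; ignores norms and biases.
hidden, n_layers, n_heads, n_kv_heads = 576, 30, 9, 3
intermediate, vocab = 1536, 131072
head_dim = hidden // n_heads  # 64

embeddings = vocab * hidden                         # shared with lm_head if tied
attn_per_layer = (
    hidden * hidden                                 # q_proj
    + 2 * hidden * (n_kv_heads * head_dim)          # k_proj + v_proj (GQA: 3 KV heads)
    + hidden * hidden                               # o_proj
)
mlp_per_layer = 3 * hidden * intermediate           # gate_proj, up_proj, down_proj
total = embeddings + n_layers * (attn_per_layer + mlp_per_layer)
# total comes out to ~181.7M, matching the reported parameter count
```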
## Training Configuration
### Phase 2 + Phase A Multi-Domain Dataset
- Source: `alea-institute/kl3m-data-sample-004-balanced`
- Type: Multi-domain legal corpus
  - RECAP (Court filings): 48-52%
  - GovInfo (Federal regulations): 27-32%
  - USPTO (Patents): 3-5%
  - EDGAR (SEC filings): 5-7%
- Format: Streaming, shuffled with buffer=32
- Domain Quality: Balanced for broad legal knowledge
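The buffered shuffle used for streaming can be sketched as follows. This is an illustrative reimplementation, not the actual data pipeline; `buffered_shuffle` is a hypothetical helper:

```python
import random

def buffered_shuffle(stream, buffer_size=32, seed=0):
    """Approximate shuffle for a streaming dataset: hold `buffer_size`
    items, emit one at random, refill from the stream. A small buffer
    (32 here, as above) trades shuffle quality for memory."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # swap a random element to the end and pop it
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    rng.shuffle(buffer)  # drain the remainder
    yield from buffer
```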
### Optimizer (Muon with Phase A)
- Muon Learning Rate: 7.30e-5 (depth-scaled from 9e-5)
- Auxiliary Learning Rate: 9e-5
- Muon Weight Decay: 1e-4
- Auxiliary Weight Decay: 1e-3
- Muon Momentum: 0.95
- Muon NS Steps: 3 (optimized for speed)
- Batch Size: 6 per device
- Gradient Accumulation: 2 steps (effective batch: 12)
- Warmup Steps: 0 (continuing from Phase 1)
- LR Scheduler: Cosine with min ratio 0.1
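The "NS Steps: 3" setting refers to Muon's Newton-Schulz orthogonalization of the momentum update. A minimal NumPy sketch of that iteration, using the quintic coefficients from the public Muon implementation (illustrative, not the training code):

```python
import numpy as np

def newton_schulz(g, steps=3, eps=1e-7):
    """Approximately orthogonalize a gradient matrix: a few quintic
    Newton-Schulz iterations push all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + eps)   # scale so singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transposed else x
```

With only 3 steps (chosen above for speed), the result is a looser approximation to an orthogonal matrix than the usual 5 steps, but the conditioning of the update is still dramatically improved.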
### Phase A Improvements (Steps 57K-63K)

Per-Layer Learning Rate Multipliers:

- `self_attn.q_proj`: 0.7× (slower for better conditioning)
- `self_attn.o_proj`: 0.7× (slower for better conditioning)
- `self_attn.k_proj`: 0.9×
- `self_attn.v_proj`: 0.9×
- `mlp.*`: 1.0× (baseline)
- `lm_head`: 0.85×
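One way to realize these multipliers is to map parameter names to optimizer groups with scaled learning rates. A hypothetical sketch (`build_param_groups` is not from the training code):

```python
# Substring -> multiplier mapping mirroring the table above.
LR_MULTIPLIERS = [
    ("self_attn.q_proj", 0.7),
    ("self_attn.o_proj", 0.7),
    ("self_attn.k_proj", 0.9),
    ("self_attn.v_proj", 0.9),
    ("lm_head", 0.85),
]

def lr_multiplier(param_name):
    """Return the LR multiplier for a parameter (1.0 baseline, e.g. MLP)."""
    for substring, mult in LR_MULTIPLIERS:
        if substring in param_name:
            return mult
    return 1.0

def build_param_groups(named_params, base_lr=7.30e-5):
    """Group parameters by multiplier so the optimizer sees one LR per group."""
    groups = {}
    for name, param in named_params:
        mult = lr_multiplier(name)
        groups.setdefault(mult, {"params": [], "lr": base_lr * mult})
        groups[mult]["params"].append(param)
    return list(groups.values())
```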
Layer-Selective Spectral Clamping:

- Attention layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`):
  - Frequency: Every 40 steps
  - Max condition: 2500
  - Sigma floor: 1e-4
- MLP layers (`gate_proj`, `up_proj`, `down_proj`):
  - Frequency: Every 160 steps
  - Max condition: 3000
  - Sigma floor: 5e-5
- LM head:
  - Frequency: Every 80 steps
  - Max condition: 2000
  - Sigma floor: 1e-4
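The clamping itself can be sketched as an SVD-based projection: raise the smallest singular values until the condition-number and floor constraints hold. An illustrative NumPy version, not the BF16 training implementation:

```python
import numpy as np

def clamp_spectrum(weight, max_condition=2500.0, sigma_floor=1e-4):
    """Enforce sigma_max / sigma_min <= max_condition by raising small
    singular values, with an absolute floor on each singular value."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    floor = max(s[0] / max_condition, sigma_floor)
    s = np.maximum(s, floor)
    return u @ np.diag(s) @ vt
```

Because only the tail of the spectrum moves, the intervention perturbs a well-conditioned matrix very little while capping pathological ones.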
### Additional Regularization
- Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0)
- Label Smoothing: 0.01
- Entropy Regularization:
- Entropy bonus weight: 0.003
- Entropy target: 6.5 bits (weight: 0.003)
- Activation norm weight: 0.0006
- Loss chunk size: 1024 tokens
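A simplified sketch of how label smoothing and the entropy bonus combine in the loss (omitting the entropy target and activation-norm terms, and using NumPy rather than the actual training code):

```python
import numpy as np

def smoothed_entropy_loss(logits, target, smoothing=0.01, entropy_weight=0.003):
    """Cross-entropy with label smoothing, minus a small entropy bonus
    so that higher-entropy predictions are mildly rewarded."""
    logits = logits - logits.max(axis=-1, keepdims=True)      # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    nll = -log_probs[np.arange(len(target)), target]          # standard CE term
    smooth_term = -log_probs.mean(axis=-1)                    # uniform-target CE
    ce = (1.0 - smoothing) * nll + smoothing * smooth_term
    entropy = -(probs * log_probs).sum(axis=-1)
    return (ce - entropy_weight * entropy).mean()
```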
### Training Infrastructure
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled (non-reentrant)
- Flash Attention: Auto-enabled
- TF32 Mode: Auto
## Spectral Health (Step 63K, Phase A Validated)

Analysis of weight matrix conditioning shows the Phase A improvements achieved their target:

### Condition Numbers

- Attention Layers:
  - Median: 2168 (↓ 4.6% from 57K)
  - Mean: 2300
  - P95: 2800
  - Max: 2501 (↓ 19.4% from 57K's 3101; hit the target ceiling)
- MLP Layers:
  - Median: ~5-8 (excellent)
  - Mean: ~8-10
  - Max: ~15 (excellent conditioning)
- LM Head: ~280 (excellent)
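The condition numbers reported here are ratios of extreme singular values; measuring them per weight matrix is straightforward (a sketch of how such an analysis might be run, not the actual analysis script):

```python
import numpy as np

def condition_number(weight):
    """sigma_max / sigma_min of a weight matrix, computed in float64
    for numerical stability (BF16 weights would be upcast first)."""
    s = np.linalg.svd(np.asarray(weight, dtype=np.float64), compute_uv=False)
    return float(s[0] / s[-1])
```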
### Phase A Results

| Metric | 57K (Before Phase A) | 63K (After Phase A) | Improvement |
|---|---|---|---|
| Max Condition | 3101 | 2501 | ↓ 19.4% ✓ |
| Q_PROJ Median | 2274 | 2168 | ↓ 4.6% ✓ |
| O_PROJ Mean | 2085 | 1940 | ↓ 7.0% ✓ |
| Overall Median | 420 | 383 | ↓ 8.7% ✓ |
| Generation Quality | Good | Excellent | ✓ Reduced repetition |

Key Success: Phase A layer-selective clamping locked the max condition number at the 2500 target ceiling, while the per-layer learning rates improved gradient flow through the attention layers.
## Training Dynamics (Phase 2 + Phase A: Steps 37K-63K)
- Loss at 63K: 3.18 (competitive across domains)
- Gradient Norm: Stable with adaptive clipping
- Learning Rate: Gradual cosine decay
- Multi-Domain Quality: Balanced performance on RECAP, GovInfo, USPTO, EDGAR
- Spectral Stability: Max condition locked at target throughout Phase A
## Generation Quality
Generates coherent, fluent legal text across multiple domains with excellent repetition control:
- Court documents (RECAP): Natural legal argumentation
- Federal regulations (GovInfo): Proper regulatory structure
- Patents (USPTO): Technical claim language
- SEC filings (EDGAR): Financial/corporate disclosure style
Phase A improvements reduced catastrophic repetition by 2.5x compared to earlier checkpoints.
## Usage

```python
from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-63000",
    torch_dtype="auto",
    device_map="auto",
)

# Generate text
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(outputs[0]["generated_text"])
```
## Next Steps: 4x Depth Stacking
This checkpoint serves as the source for 4x cyclic layer duplication using the G_stack method (NeurIPS 2024):
- Current: 30 layers, 181.7M parameters
- After stacking: 120 layers, 500.3M parameters
- Expected benefit: ~54% fewer tokens to reach target loss vs training from scratch
The stacked 500M model is available at `alea-institute/kl3m-007-500m-step0`.
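In spirit, the depth expansion repeats the trained layer stack cyclically. A toy sketch over layer indices (the real G_stack operation acts on the actual weight tensors):

```python
def cyclic_stack(layers, factor=4):
    """Cyclically duplicate a layer list: a 30-layer stack repeated
    `factor`=4 times yields a 120-layer stack that preserves the
    trained layer ordering within each repetition."""
    return [layer for _ in range(factor) for layer in layers]
```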
## Training Philosophy
Phase A demonstrates spectral health improvements through targeted interventions:
- Layer-selective clamping: Different frequencies for attention vs MLP layers
- Per-layer learning rates: Slower updates for worst-conditioned layers
- Minimal overhead: attention layers are clamped 4x more often than MLP layers, but they hold only a small fraction of the parameters
Result: 19.4% reduction in max condition number with no loss of training speed or quality.
## Model Card Authors
Alea Institute
## Citation

For technical details, see the paper: https://arxiv.org/abs/2504.07854

```bibtex
@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer, Phase A spectral improvements, and G_stack ready}
}
```
## License
Apache 2.0