KL3M 170M, 6th Gen Model, 63K Checkpoint (Phase 2 + Phase A)

A 170M parameter language model trained on multi-domain legal text using the Muon optimizer with Phase A improvements: layer-selective spectral clamping and per-layer learning rates. This is the final checkpoint before 4x depth stacking to 500M parameters.

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 181.7M (170M non-embedding)
  • Training Steps: 63,000 (37K Phase 1 + 26K Phase 2 with Phase A)
  • Tokens Processed: 15.83 billion
  • Sequence Length: 4,096 tokens
  • Precision: BF16
  • Optimizer: Muon with layer-selective spectral regularization

Model Architecture

  • Hidden Size: 576
  • Layers: 30
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536
  • Vocabulary: 131,072 tokens
  • RoPE Theta: 100,000
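
For reference, the architecture above can be written as a Hugging Face Llama configuration. This is a sketch assembled from the listed values; any field not listed here is assumed to remain at the transformers default.

from transformers import LlamaConfig

# Sketch of the architecture above as a LlamaConfig; unlisted fields are
# assumed to stay at the library defaults.
config = LlamaConfig(
    hidden_size=576,
    num_hidden_layers=30,
    num_attention_heads=9,
    num_key_value_heads=3,      # GQA: 3 KV heads shared by 9 query heads
    intermediate_size=1536,
    vocab_size=131072,
    rope_theta=100000.0,
    max_position_embeddings=4096,
)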

Training Configuration

Phase 2 + Phase A Multi-Domain Dataset

  • Source: alea-institute/kl3m-data-sample-004-balanced
  • Type: Multi-domain legal corpus
    • RECAP (Court filings): 48-52%
    • GovInfo (Federal regulations): 27-32%
    • USPTO (Patents): 3-5%
    • EDGAR (SEC filings): 5-7%
  • Format: Streaming, shuffled with buffer=32
  • Domain Quality: Balanced for broad legal knowledge
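
A minimal loading sketch for the configuration above, using the datasets library; the split name and seed are assumptions, since the actual training pipeline is not published in this card.

from datasets import load_dataset

# Stream the multi-domain corpus and shuffle with a small buffer, as described
# above. The "train" split and the seed are illustrative assumptions.
dataset = load_dataset(
    "alea-institute/kl3m-data-sample-004-balanced",
    split="train",
    streaming=True,
)
dataset = dataset.shuffle(buffer_size=32, seed=42)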

Optimizer (Muon with Phase A)

  • Muon Learning Rate: 7.30e-5 (depth-scaled from 9e-5)
  • Auxiliary Learning Rate: 9e-5
  • Muon Weight Decay: 1e-4
  • Auxiliary Weight Decay: 1e-3
  • Muon Momentum: 0.95
  • Muon Newton-Schulz (NS) Steps: 3 (kept low for speed)
  • Batch Size: 6 per device
  • Gradient Accumulation: 2 steps (effective batch: 12)
  • Warmup Steps: 0 (continuing from Phase 1)
  • LR Scheduler: Cosine with min ratio 0.1
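
As an illustration of the schedule, a cosine decay with a 0.1 minimum ratio and no warmup can be written as follows; this is a sketch of the stated settings, not the training code itself.

import math

def cosine_lr(step, total_steps, base_lr, min_ratio=0.1):
    # Cosine decay from base_lr down to min_ratio * base_lr, with no warmup,
    # matching the scheduler settings listed above.
    progress = min(step / total_steps, 1.0)
    scale = min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * scale

# With the Muon base rate of 7.30e-5, the decay floor is 0.1 * 7.30e-5 = 7.30e-6.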

Phase A Improvements (Steps 57K-63K)

Per-Layer Learning Rate Multipliers (a sketch of applying these follows this list):

  • self_attn.q_proj: 0.7× (slower for better conditioning)
  • self_attn.o_proj: 0.7× (slower for better conditioning)
  • self_attn.k_proj: 0.9×
  • self_attn.v_proj: 0.9×
  • mlp.*: 1.0× (baseline)
  • lm_head: 0.85×
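
A minimal sketch of applying these multipliers through per-parameter-group learning rates; the substring matching and function name are assumptions, since the optimizer wiring is not published here.

# Hypothetical mapping from module-name substrings to LR multipliers,
# mirroring the list above; anything unmatched stays at the 1.0x baseline.
LR_MULTIPLIERS = {
    "self_attn.q_proj": 0.7,
    "self_attn.o_proj": 0.7,
    "self_attn.k_proj": 0.9,
    "self_attn.v_proj": 0.9,
    "lm_head": 0.85,
}

def build_param_groups(model, base_lr):
    # Group parameters by multiplier so each group gets its own scaled LR.
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        mult = next((m for key, m in LR_MULTIPLIERS.items() if key in name), 1.0)
        groups.setdefault(mult, []).append(param)
    return [{"params": params, "lr": base_lr * mult} for mult, params in groups.items()]

# Usage with any torch.nn.Module:
#   param_groups = build_param_groups(model, base_lr=7.30e-5)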

Layer-Selective Spectral Clamping (see the sketch after this list):

  • Attention layers (q_proj, k_proj, v_proj, o_proj):
    • Frequency: Every 40 steps
    • Max condition: 2500
    • Sigma floor: 1e-4
  • MLP layers (gate_proj, up_proj, down_proj):
    • Frequency: Every 160 steps
    • Max condition: 3000
    • Sigma floor: 5e-5
  • LM head:
    • Frequency: Every 80 steps
    • Max condition: 2000
    • Sigma floor: 1e-4
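
The clamping itself can be sketched as an SVD projection that lifts small singular values until the condition number sits at or below the ceiling; this is a hedged reconstruction from the settings above, not the exact training code.

import torch

@torch.no_grad()
def clamp_spectrum(weight, max_condition=2500.0, sigma_floor=1e-4):
    # Clamp singular values so that sigma_max / sigma_min <= max_condition and
    # no singular value falls below sigma_floor, then rebuild the matrix.
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = max(sigma_floor, (s.max() / max_condition).item())
    s = s.clamp(min=floor)
    return (u * s @ vh).to(weight.dtype)

# Per the schedule above, this would run every 40 steps on attention
# projections, every 160 steps on MLP projections, and every 80 steps on
# the LM head, each with its own ceiling and floor.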

Additional Regularization

  • Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0)
  • Label Smoothing: 0.01
  • Entropy Regularization:
    • Entropy bonus weight: 0.003
    • Entropy target: 6.5 bits (weight: 0.003)
    • Activation norm weight: 0.0006
    • Loss chunk size: 1024 tokens
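
A hedged sketch of the loss-side regularization is below. Reading the entropy term as a penalty toward the 6.5-bit target is an interpretation of the settings above, and the activation-norm and chunked-loss pieces are omitted.

import torch
import torch.nn.functional as F

def lm_loss_with_entropy(logits, labels, label_smoothing=0.01,
                         entropy_weight=0.003, entropy_target=6.5):
    # Cross-entropy with label smoothing, plus a term that nudges the mean
    # predictive entropy (in bits) toward the target. Illustrative only.
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        label_smoothing=label_smoothing,
    )
    probs = F.softmax(logits.float(), dim=-1)
    entropy_bits = -(probs * probs.clamp_min(1e-9).log2()).sum(dim=-1).mean()
    return ce + entropy_weight * (entropy_bits - entropy_target).abs()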

Training Infrastructure

  • Mixed Precision: BF16
  • Gradient Checkpointing: Enabled (non-reentrant)
  • Flash Attention: Auto-enabled
  • TF32 Mode: Auto
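
These settings map onto a transformers TrainingArguments block roughly as follows; this is a sketch of the stated flags, with the remaining arguments assumed.

from transformers import TrainingArguments

# Sketch of the infrastructure flags above; output_dir and anything not listed
# in this card are placeholder assumptions. Flash Attention is typically
# selected at model load time via attn_implementation, not here.
args = TrainingArguments(
    output_dir="kl3m-006-170m",
    bf16=True,
    tf32=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    per_device_train_batch_size=6,
    gradient_accumulation_steps=2,
)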

Spectral Health (Step 63K - Phase A Validated)

Analysis of weight-matrix conditioning shows that the Phase A improvements achieved their target (a snippet for reproducing the analysis follows the condition numbers below):

Condition Numbers

  • Attention Layers:
    • Median: 2168 (↓ 4.6% from 57K)
    • Mean: 2300
    • P95: 2800
    • Max: 2501 (↓ 19.4% from 57K's 3101, hit target ceiling)
  • MLP Layers:
    • Median: ~5-8 (excellent)
    • Mean: ~8-10
    • Max: ~15 (excellent conditioning)
  • LM Head: ~280 (excellent)
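
The figures above can be reproduced approximately with a short snippet that computes sigma_max / sigma_min per 2-D weight matrix; the exact grouping and summary statistics used for this card are not published, so treat this as illustrative.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-63000",
    torch_dtype=torch.float32,
)

# Condition number (sigma_max / sigma_min) of every 2-D weight matrix.
for name, param in model.named_parameters():
    if param.ndim != 2:
        continue
    s = torch.linalg.svdvals(param.detach())
    print(f"{name}: cond = {(s.max() / s.min()).item():.0f}")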

Phase A Results

  Metric               57K (Before Phase A)   63K (After Phase A)   Improvement
  Max Condition        3101                   2501                  ↓ 19.4% ✓
  Q_PROJ Median        2274                   2168                  ↓ 4.6% ✓
  O_PROJ Mean          2085                   1940                  ↓ 7.0% ✓
  Overall Median       420                    383                   ↓ 8.7% ✓
  Generation Quality   Good                   Excellent             ↑ Reduced repetition

Key Success: Phase A layer-selective clamping held the maximum condition number at the 2500 target ceiling, while the per-layer learning rates improved gradient flow through the attention layers.

Training Dynamics (Phase 2 + Phase A: Steps 37K-63K)

  • Loss at 63K: 3.18 (competitive across domains)
  • Gradient Norm: Stable with adaptive clipping
  • Learning Rate: Gradual cosine decay
  • Multi-Domain Quality: Balanced performance on RECAP, GovInfo, USPTO, EDGAR
  • Spectral Stability: Max condition locked at target throughout Phase A

Generation Quality

Generates coherent, fluent legal text across multiple domains with excellent repetition control:

  • Court documents (RECAP): Natural legal argumentation
  • Federal regulations (GovInfo): Proper regulatory structure
  • Patents (USPTO): Technical claim language
  • SEC filings (EDGAR): Financial/corporate disclosure style

Phase A improvements reduced catastrophic repetition by 2.5x compared to earlier checkpoints.

Usage

from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-63000",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(outputs[0]['generated_text'])

Next Steps: 4x Depth Stacking

This checkpoint serves as the source for 4x cyclic layer duplication using the G_stack method (NeurIPS 2024):

  • Current: 30 layers, 181.7M parameters
  • After stacking: 120 layers, 500.3M parameters
  • Expected benefit: ~54% fewer tokens to reach target loss vs training from scratch

The stacked 500M model is available at alea-institute/kl3m-007-500m-step0.
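
A minimal sketch of 4x cyclic duplication on the loaded checkpoint is shown below; the helper is illustrative and is not the stacking script used to produce kl3m-007-500m-step0.

import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def stack_layers_cyclic(model, factor=4):
    # Repeat the decoder stack cyclically: layers [0..29] -> [0..29] * factor.
    # (In a real script, each copied layer's internal layer_idx bookkeeping
    # would also need renumbering.)
    layers = list(model.model.layers)
    stacked = nn.ModuleList(
        copy.deepcopy(layer) for _ in range(factor) for layer in layers
    )
    model.model.layers = stacked
    model.config.num_hidden_layers = len(stacked)
    return model

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-63000"
)
model = stack_layers_cyclic(model, factor=4)  # 30 layers -> 120 layers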

Training Philosophy

Phase A demonstrates spectral health improvements through targeted interventions:

  • Layer-selective clamping: Different frequencies for attention vs MLP layers
  • Per-layer learning rates: Slower updates for worst-conditioned layers
  • Minimal overhead: attention layers are clamped 4x more often than MLP layers but account for only a small fraction of parameters

Result: 19.4% reduction in max condition number with no loss of training speed or quality.

Model Card Authors

Alea Institute

Citation

For technical details, see the paper: https://arxiv.org/abs/2504.07854

@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer, Phase A spectral improvements, and G_stack ready}
}

License

Apache 2.0
