# Distillix 100M v0.3

A 100M-parameter BitNet b1.58 language model trained via knowledge distillation.
## Model Details
- Architecture: "Frankenstein" LLM combining best practices from several open models:
  - BitNet b1.58 (Microsoft): 1.58-bit ternary weights
  - Llama-2 tokenizer (32k vocab)
  - Llama 3-style GQA (12 query / 4 KV heads for a 3x KV-cache reduction)
  - Gemma 2/3 stability techniques (QK-Norm + logit soft-capping)
  - Extended RoPE (theta = 1,000,000 for long context)
- Parameters: ~100M
- Training: 500 steps on 765 samples (initial training run)
- Optimizer: Muon + AdamW hybrid
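The BitNet b1.58 scheme quantizes weights to ternary values using absmean scaling. A minimal sketch of that idea in plain Python (illustrative only; names and details are assumptions, not this repo's training code):

```python
def absmean_ternary(weights, eps=1e-6):
    """BitNet b1.58-style quantization: scale by the mean absolute weight,
    then round each scaled weight to the nearest value in {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / gamma))) for w in weights]
    return quantized, gamma

q, scale = absmean_ternary([0.9, -0.8, 0.05, 0.0])
print(q)  # [1, -1, 0, 0]
```

Large weights saturate to ±1 while small ones snap to 0, which is what makes the "1.58-bit" (log2 of 3 states) representation possible.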
## Architecture Specs
| Component | Value |
|---|---|
| Hidden dim | 768 |
| Layers | 12 |
| Q Heads | 12 |
| KV Heads | 4 (GQA) |
| Head dim | 64 |
| MLP dim | 2048 |
| Vocab size | 32,000 |
| Max seq len | 2,048 |
| RoPE theta | 1,000,000 |
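The 3x KV-cache reduction claimed above follows directly from the table: cache size scales with the number of KV heads, not query heads. A quick check of the arithmetic:

```python
# Per-token KV-cache entries: 2 (one K and one V) * layers * heads * head_dim.
layers, head_dim = 12, 64
q_heads, kv_heads = 12, 4

mha_cache = 2 * layers * q_heads * head_dim   # full multi-head attention
gqa_cache = 2 * layers * kv_heads * head_dim  # grouped-query attention (GQA)

print(mha_cache, gqa_cache, mha_cache // gqa_cache)  # 18432 6144 3
```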
## Training
Trained with the Muon optimizer, which showed the characteristic "Muon drop": a steep early loss reduction.

- Initial loss: 10.59
- Final loss: 1.04 (a ~90% reduction in 6 minutes)
- Hardware: RTX 2080 Super (8 GB VRAM)
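Muon's core step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it as a weight update. A NumPy sketch of that iteration (coefficients follow the public Muon reference implementation; this is an illustration, not this repo's training script):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix (push all singular values
    toward 1) using the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)     # normalize so spectral norm <= 1
    tall = g.shape[0] > g.shape[1]
    if tall:
        x = x.T                            # iterate on the wide orientation
    for _ in range(steps):
        a_mat = x @ x.T
        x = a * x + (b * a_mat + c * a_mat @ a_mat) @ x
    return x.T if tall else x

rng = np.random.default_rng(0)
u = newton_schulz_orthogonalize(rng.standard_normal((8, 4)))
print(np.linalg.svd(u, compute_uv=False))  # singular values close to 1
```

Replacing a raw momentum update with its orthogonalized version is what gives Muon its fast early loss descent on matrix-shaped parameters; non-matrix parameters (embeddings, norms) are typically left to AdamW, matching the hybrid setup above.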
## Files

- `distillix-v0.safetensors` - SafeTensors format (382 MB)
- `distillix-v0.3.gguf` - GGUF format for llama.cpp (191 MB)
- `model_500steps.pt` - PyTorch checkpoint
## Usage

```python
from safetensors.torch import load_file

# Load model weights
state_dict = load_file("distillix-v0.safetensors")

# For inference, use the GGUF file with llama.cpp or bitnet.cpp (CPU inference).
```
## Limitations
- Early-stage checkpoint (only 500 steps); the model needs substantially more training
- Limited training data (765 samples)
- Best used as a starting point for further fine-tuning
## License
Apache 2.0
## Citation

```bibtex
@misc{distillix2025,
  title={Distillix: Frankenstein BitNet b1.58 Knowledge Distillation},
  author={Seaburg, Riley},
  year={2025},
  url={https://github.com/rileyseaburg/distillix}
}
```