DeepSeek-V4 Mini (300M) — sliced from V4-Flash
A weight-transferred small replica of the DeepSeek-V4 architecture. Tensors
were sliced (or, where shapes already matched, copied verbatim) from
deepseek-ai/DeepSeek-V4-Flash into a smaller architecture
matching kshitijthakkar/deepseek-v4-mini-300M-init.
This is not a randomly-initialized scaffold — most parameters carry real training signal from the source. It is intended as a warm start for fine-tuning at this size, not as a finished chat/instruct model. See the fine-tune recipe below.
| | Value |
|---|---|
| Total params | ~0.32 B |
| Activated per token | ~0.17 B |
| Source | deepseek-ai/DeepSeek-V4-Flash (284 B total / 13 B active) |
| Random-init baseline | kshitijthakkar/deepseek-v4-mini-300M-init |
| Vocab | 129,280 (V4-Flash tokenizer + ChatML chat template baked in) |
| Storage dtype | bfloat16, 2 GB shards |
Slice provenance
Each parameter in this model is one of three kinds:
| Kind | Meaning | Fine-tune recommendation |
|---|---|---|
| COPY | Source dim already matched the destination — bytes copied verbatim. | Freeze (no training) |
| SLICE | Destination dim is smaller — deterministic slice (truncate / per-head / per-group / top-K experts by gate-norm). | Train (lossy slice needs to recover) |
| INIT | Source has no compatible counterpart (or shapes incompatible) — kept at random init. | Train (full learning) |

| Kind | Tensors | % of params |
|---|---|---|
| COPY | 56 | 6.1% |
| SLICE | 789 | 86.0% |
| INIT | 72 | 7.9% |
Concretely:
- Layer pruning: source had 43 layers; we kept 12 via uniform stride: [0, 3, 7, 10, 14, 17, 21, 25, 28, 32, 35, 39]. The CSA/HCA/SW per-layer pattern is preserved.
- Expert pruning: source had 256 routed experts per layer; we kept the top-16 by `gate.weight` L2 norm (a cheap proxy for expert importance — replace with an activation-statistics-based pick if you have profiling data); see the sketch below.
- Compressors / Indexer: marked INIT in this release (the per-block positional bias and gate require a non-trivial cross-dim mapping that the current slicer leaves at random; their parent attention KV/Q paths are sliced from the source).
- MTP module: sliced from the source's first MTP step using the same per-tensor rules.
- Hash-routing `tid2eid` table: regenerated (top-k differs from the source).
The exhaustive list of COPY (frozen-recommended) keys is in `frozen_keys.json`.
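The expert-pruning rule above is simple enough to reproduce offline. Below is a minimal sketch of the top-K-by-gate-norm pick, assuming a per-layer router matrix of shape [n_routed_experts, hidden_size]; the tensor key in the comment is hypothetical, not the exact checkpoint name.

```python
import torch

def pick_topk_experts(gate_weight: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Rank routed experts by the L2 norm of their gate rows and keep the top-k.

    gate_weight: [n_routed_experts, hidden_size] router matrix for one layer
    (assumed layout). Returns the kept expert indices, sorted ascending so the
    sliced expert order stays stable.
    """
    norms = gate_weight.float().norm(dim=-1)   # one L2 norm per routed expert
    keep = torch.topk(norms, k=k).indices      # cheap importance proxy
    return torch.sort(keep).values

# Usage sketch (hypothetical key name):
# gate = state_dict[f"model.layers.{i}.mlp.gate.weight"]
# keep = pick_topk_experts(gate, k=16)
# sliced_gate = gate[keep]                     # [16, hidden_size]
```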
Quick start (Colab / local)
```python
# 1) install + auth (private repo)
# !pip install -q "transformers>=4.50" huggingface_hub safetensors
from huggingface_hub import login, snapshot_download
login()

# 2) download the repo (weights + tokenizer + modeling package)
local = snapshot_download(repo_id="kshitijthakkar/deepseek-v4-mini-300M-from-flash")

# 3) register the deepseek_v4 model_type with the HF auto classes
import sys, os
sys.path.insert(0, os.path.join(local, "code"))
import deepseek_v4  # noqa: F401 — side-effect: AutoConfig / AutoModelForCausalLM registration

# 4) load via the standard Auto API
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained(local)
model = AutoModelForCausalLM.from_pretrained(local, torch_dtype=torch.bfloat16)
model.eval()

# 5) chat-templated forward pass — initial loss is much lower than the
#    random-init scaffold, but the compressor/indexer paths are still random
#    until you fine-tune (see below).
messages = [{"role": "user", "content": "Hello"}]
ids = tok.apply_chat_template(messages, return_tensors="pt",
                              add_generation_prompt=True, return_dict=True)
with torch.no_grad():
    print(model(input_ids=ids["input_ids"]).logits.shape)
```
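For a quick qualitative check you can also sample a few tokens. A minimal greedy-generation sketch, continuing from the snippet above and assuming the custom modeling code supports the standard `generate()` API; expect weak output until the model is fine-tuned:

```python
# Continues from the quick-start snippet above.
with torch.no_grad():
    out = model.generate(
        ids["input_ids"],
        max_new_tokens=32,
        do_sample=False,                 # greedy; this warm-start is not instruction-tuned
        pad_token_id=tok.eos_token_id,
    )
print(tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True))
```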
Fine-tune recipe (warm-start)
Because most non-MoE-expert weights are already well-trained, freeze them and train only the lossy / random-init subset:
```python
from training.warm_start import apply_warm_start_freeze
import os, json

frozen_path = os.path.join(local, "frozen_keys.json")
summary = apply_warm_start_freeze(model, frozen_path)
print(summary)
# -> trainable=Y M frozen=X M total=N M train_ratio=...%

# Build the optimizer over the trainable subset (configure_optimizers already
# respects requires_grad)
from training.optimizer import configure_optimizers

optimizer = configure_optimizers(
    model, learning_rate=2e-4, weight_decay=0.1,
    betas=(0.9, 0.95), device_type="cuda", optimizer_type="muon",
)

# ... standard training loop ...
```
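The elided loop is intentionally ordinary. A minimal single-GPU sketch, assuming a `train_loader` (hypothetical) that yields `input_ids`/`labels` tensors and that the modeling code returns a combined LM (+ MTP) loss when `labels` is passed:

```python
# torch, model and optimizer come from the snippets above.
device = "cuda"
model.train().to(device)

total_steps = 10_000                                    # illustrative budget
for step, batch in enumerate(train_loader):             # train_loader: hypothetical DataLoader
    input_ids = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)

    out = model(input_ids=input_ids, labels=labels)     # assumes the loss is computed internally
    out.loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    if step % 50 == 0:
        print(f"step {step}: loss {out.loss.item():.4f}")
    if step >= total_steps:
        break
```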
Suggested starting hyperparameters:
| | Value |
|---|---|
| LR (peak) | 2 × 10⁻⁴ for trainable subset |
| Optimizer | Muon (2D weights) + AdamW (norms / 1-D) |
| Schedule | Linear warmup 200 steps → cosine to 10% |
| Sequence length | 4096 (extend later via YaRN) |
| Batch tokens / step | 0.5–2 M depending on hardware |
| MTP loss weight | 0.3 (drops to 0.1 at end-of-decay) |
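The schedule row above translates directly into a step-based LR function; a sketch, assuming a 10% floor relative to the peak LR:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_steps: int = 200, final_ratio: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# Inside the training loop, before optimizer.step():
# for group in optimizer.param_groups:
#     group["lr"] = lr_at(step, total_steps)
```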
Architecture summary
| Parameter | Value |
|---|---|
| hidden_size | 512 |
| num_hidden_layers | 12 |
| num_attention_heads | 8 |
| num_key_value_heads | 1 (MQA) |
| head_dim | 64 |
| q_lora_rank / o_lora_rank | 256 / 256 |
| qk_rope_head_dim | 32 |
| o_groups | 2 |
| n_routed_experts | 16 |
| n_shared_experts | 1 |
| num_experts_per_tok | 2 |
| moe_intermediate_size | 512 |
| num_hash_layers | 2 |
| sliding_window | 32 |
| max_position_embeddings | 1,048,576 (YaRN factor=16, original=65536) |
| num_nextn_predict_layers | 1 (MTP) |
| hc_mult (n_hc) | 4 |
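A quick sanity check that the mini config (not the full V4-Flash one) was loaded; field names are assumed to match the table above, hence the getattr fallbacks:

```python
cfg = model.config
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
# Expected per the table: 512 12 8
print(getattr(cfg, "n_routed_experts", None), getattr(cfg, "num_experts_per_tok", None))
# Expected per the table: 16 2
```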
Limitations
- Compressor + Indexer modules are random-init (see Slice provenance above). The current slicer doesn't do the cross-dim mapping for those per-block tensors. Forward-pass logits are noisy through CSA/HCA layers until those are fine-tuned.
- Hash-table `tid2eid` regenerated — token routing for the first `num_hash_layers` differs from the source.
- No FP4/FP8 weights — the source is dequantized to BF16 at slice time, so this repo doesn't carry V4-Flash's quantization-aware training state.
Citation
```bibtex
@misc{deepseek_v4_2026,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}
```