Commit 2af18ad (verified) by cmpatino (HF Staff) · 1 parent: 382fe22

Add model card

Files changed (1): README.md (+85, -0)
README.md (added):
# SmolDeepSeek-V4 100M (Pretrained)

A small ~110M-parameter language model implementing the **DeepSeek-V4 architecture** from scratch. This is the pretrained base model — see [cmpatino/smol-deepseek-v4-100m](https://huggingface.co/cmpatino/smol-deepseek-v4-100m) for the SFT/chat version.

## Architecture

This model implements key DeepSeek-V4 innovations at a miniature scale:

| Component | Details |
|---|---|
| **Parameters** | ~110M total (41M embeddings, 69M non-embedding) |
| **Hidden size** | 320 |
| **Layers** | 8 |
| **Attention heads** | 8 (1 KV head — MQA-style) |
| **Head dim** | 96 (32 RoPE + 64 NoPE) |
| **MLA** | q_lora_rank=160, o_groups=2, o_lora_rank=80 |
| **MoE** | 4 routed experts + 1 shared, top-2 routing |
| **Expert FFN** | SwiGLU, intermediate_size=640 |
| **Routing** | sqrtsoftplus scoring, noaux_tc method |
| **Hyper-Connections** | hc_mult=4, Sinkhorn routing (2 iterations) |
| **MTP** | 1 next-token prediction layer |
| **Vocab size** | 129,280 (DeepSeek-V4 tokenizer) |
| **Context length** | 2,048 tokens |
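As a rough illustration of the MoE row above (4 routed experts + 1 shared, top-2 routing), the sketch below uses plain softmax gating in place of the repo's sqrtsoftplus scoring and noaux_tc balancing, runs every expert densely for clarity, and substitutes a plain SiLU MLP for the SwiGLU expert FFN; only the hidden and intermediate sizes are taken from the table.

```python
import torch
import torch.nn.functional as F
from torch import nn


def expert_ffn(hidden, inter):
    # Plain SiLU MLP standing in for the SwiGLU expert FFN in the table.
    return nn.Sequential(nn.Linear(hidden, inter), nn.SiLU(), nn.Linear(inter, hidden))


class TinyMoE(nn.Module):
    """Illustrative top-2 MoE block: 4 routed experts + 1 shared expert."""

    def __init__(self, hidden=320, inter=640, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList([expert_ffn(hidden, inter) for _ in range(n_experts)])
        self.shared = expert_ffn(hidden, inter)  # the shared expert is always active

    def forward(self, x):                                 # x: (tokens, hidden)
        scores = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
        topk_w, topk_i = scores.topk(self.top_k, dim=-1)  # keep the 2 best experts per token
        gate = torch.zeros_like(scores).scatter(-1, topk_i, topk_w)
        gate = gate / gate.sum(dim=-1, keepdim=True)      # renormalise the kept weights
        # Dense for clarity: real MoE layers only execute the selected experts.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, hidden)
        routed = (gate.unsqueeze(-1) * expert_out).sum(dim=1)
        return self.shared(x) + routed


moe = TinyMoE()
print(moe(torch.randn(5, 320)).shape)  # torch.Size([5, 320])
```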
### DeepSeek-V4 Features Implemented

- **Multi-head Latent Attention (MLA)**: Compressed KV cache via latent projections
- **Mixture of Experts (MoE)**: Sparse activation — only 2 of 4 experts per token (see the sketch above)
- **Hyper-Connections**: Multi-copy hidden states with learned Sinkhorn routing replacing residual connections (see the sketch after this list)
- **SwiGLU FFN** with configurable limit
- **Grouped output projection** (o_groups)
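The hyper-connection idea can be hard to picture, so here is a rough, self-contained sketch under the table's settings (hc_mult=4, 2 Sinkhorn iterations): the model carries four copies ("streams") of the hidden state, and a small learned matrix, pushed towards doubly-stochastic form by Sinkhorn normalisation, routes information between the streams instead of a single residual add. The function and variable names are illustrative, not the repo's.

```python
import torch


def sinkhorn(logits, n_iters=2):
    """Alternately normalise rows and columns of exp(logits)."""
    m = logits.exp()
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # columns sum to 1
    return m


hc_mult, seq, hidden = 4, 10, 320
streams = torch.randn(hc_mult, seq, hidden)  # 4 copies of a 10-token hidden state
mix_logits = torch.randn(hc_mult, hc_mult)   # a learned parameter in the real model

mix = sinkhorn(mix_logits, n_iters=2)              # ~doubly-stochastic 4x4 routing matrix
mixed = torch.einsum("ij,jth->ith", mix, streams)  # mix information across the streams
print(mixed.shape)                                 # torch.Size([4, 10, 320])
```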
## Training

- **Dataset**: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (streaming)
- **Steps**: 5,000
- **Tokens seen**: ~2.6B
- **Batch size**: 8 × 4 gradient-accumulation steps = 32 effective
- **Sequence length**: 2,048
- **Learning rate**: 6e-4, cosine schedule, 3% warmup
- **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1), see the sketch after this list
- **Precision**: bf16 mixed precision
- **Hardware**: 1× NVIDIA H100 80GB
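For reference, the optimizer and schedule settings above map onto standard PyTorch / transformers utilities roughly as follows; this is a sketch, not the actual training script, and the `torch.nn.Linear` module is only a stand-in for the real model.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(320, 320)        # stand-in for the real model

total_steps = 5_000
warmup_steps = int(0.03 * total_steps)   # 3% warmup = 150 steps

optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
```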
### Training Metrics

| Metric | Value |
|---|---|
| Final loss | ~5.3 (cross-entropy) |
| Final entropy | 3.77 |
| Token accuracy | 33.8% |
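Assuming the final loss is the mean per-token cross-entropy in nats, it corresponds to a perplexity of roughly exp(5.3) ≈ 200:

```python
import math

final_loss = 5.3             # cross-entropy from the table above, in nats
print(math.exp(final_loss))  # ~200.3, the implied perplexity
```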
## Usage

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Standard loading path (the custom architecture requires trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "cmpatino/smol-deepseek-v4-100m-pretrain",
    trust_remote_code=True,
    torch_dtype=torch.float32,
)
tokenizer = AutoTokenizer.from_pretrained("cmpatino/smol-deepseek-v4-100m-pretrain")

# Important: use manual weight loading for best results.
# Build the model from its config, then load the checkpoint explicitly.
from safetensors.torch import load_file

config = AutoConfig.from_pretrained("cmpatino/smol-deepseek-v4-100m-pretrain", trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
state_dict = load_file("model.safetensors")  # local path, or download the file from the Hub first
model.load_state_dict(state_dict, strict=True)
```
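With the model and tokenizer loaded as above, text completion goes through the usual `generate` API (assuming the custom modeling code supports generation); the prompt and sampling settings below are only illustrative. Since this is a base model, expect a free-form continuation rather than an answer.

```python
prompt = "The water cycle is the process by which"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```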
## Limitations

- **Small model**: 110M params with a 129K vocab means ~37% of parameters are in embeddings, limiting model capacity (see the quick check after this list)
- **Limited training**: only 5K steps / ~2.6B tokens — significantly undertrained compared to production models
- **Pretrained only**: this is a base model without instruction tuning; outputs are language-model completions, not conversations
- **Custom architecture**: requires `trust_remote_code=True`
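The embedding fraction in the first bullet follows directly from the architecture table; a quick back-of-the-envelope check (ignoring the MTP head and any weight tying):

```python
vocab, hidden, total = 129_280, 320, 110e6
embedding_params = vocab * hidden   # 41,369,600, i.e. ~41M
print(embedding_params / total)     # ~0.376, roughly 37% of all parameters
```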
## License

Apache-2.0