Commit 1d06a24 (verified) · Parent: 84bb622
Committed by cmpatino (HF Staff)

Update README: fix model references to HuggingFaceTB/nanowhale-*

Files changed (1): README.md (+23 -16)
README.md CHANGED
@@ -1,6 +1,8 @@
-# SmolDeepSeek-V4 100M (Pretrained)
+# nanowhale-100m-base 🐳
 
-A small ~110M parameter language model implementing the **DeepSeek-V4 architecture** from scratch. This is the pretrained base model — see [cmpatino/smol-deepseek-v4-100m](https://huggingface.co/cmpatino/smol-deepseek-v4-100m) for the SFT/chat version.
+A small ~110M parameter language model implementing the **DeepSeek-V4 architecture** from scratch. This is the pretrained base model — see [HuggingFaceTB/nanowhale-100m](https://huggingface.co/HuggingFaceTB/nanowhale-100m) for the SFT/chat version.
+
+**Training code**: [github.com/huggingface/nanowhale](https://github.com/huggingface/nanowhale)
 
 ## Architecture
 
@@ -54,23 +56,27 @@ This model implements key DeepSeek-V4 innovations at a miniature scale:
 
 ```python
 import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model = AutoModelForCausalLM.from_pretrained(
-    "cmpatino/smol-deepseek-v4-100m-pretrain",
-    trust_remote_code=True,
-    torch_dtype=torch.float32,
-)
-tokenizer = AutoTokenizer.from_pretrained("cmpatino/smol-deepseek-v4-100m-pretrain")
-
-# Important: Use manual weight loading for best results
 from safetensors.torch import load_file
-from transformers import AutoConfig
+from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
+from huggingface_hub import hf_hub_download
 
-config = AutoConfig.from_pretrained("cmpatino/smol-deepseek-v4-100m-pretrain", trust_remote_code=True)
-model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-state_dict = load_file("model.safetensors") # or download from Hub
+# Load model (recommended: manual load for reliability)
+config = AutoConfig.from_pretrained("HuggingFaceTB/nanowhale-100m-base", trust_remote_code=True)
+model = AutoModelForCausalLM.from_config(config, trust_remote_code=True).float()
+
+# Download and load weights
+weights_path = hf_hub_download("HuggingFaceTB/nanowhale-100m-base", "model.safetensors")
+state_dict = load_file(weights_path)
 model.load_state_dict(state_dict, strict=True)
+model = model.cuda().eval()
+
+tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/nanowhale-100m-base")
+
+# Generate
+input_ids = tokenizer.encode("The meaning of life is", return_tensors="pt").cuda()
+output = model.generate(input_ids, max_new_tokens=100, temperature=0.7, top_p=0.9,
+                        pad_token_id=tokenizer.eos_token_id)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
 ## Limitations
@@ -78,6 +84,7 @@ model.load_state_dict(state_dict, strict=True)
 - **Small model**: 110M params with 129K vocab means ~37% of parameters are in embeddings, limiting model capacity
 - **Limited training**: Only 5K steps / 2.6B tokens — significantly undertrained compared to production models
 - **Pretrained only**: This is a base model without instruction tuning. Outputs are language-model completions, not conversations.
+- **bf16 NaN**: Use fp32 — the Hyper-Connections architecture produces values that overflow bf16 range at this scale.
 - **Custom architecture**: Requires `trust_remote_code=True`
 
 ## License
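
As a quick check on the "~37% of parameters are in embeddings" bullet above: the vocabulary size and total parameter count come from the card, but the hidden size and the tied-embedding assumption below are illustrative guesses, since neither appears in this diff.

```python
# Back-of-envelope check (illustrative, not from the commit): how a 129K
# vocabulary dominates a ~110M-parameter budget.
vocab_size = 129_000        # "129K vocab" per the limitations list
total_params = 110_000_000  # "~110M parameter" per the model card
hidden_size = 320           # assumption; the hidden size is not stated in this diff

embedding_params = vocab_size * hidden_size  # assumes tied input/output embeddings
print(f"Embedding share: {embedding_params / total_params:.0%}")  # ~38%, close to the card's ~37%
```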
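The "bf16 NaN" bullet added in this commit implies a dtype constraint that is easy to miss when skipping the manual-load recipe. Below is a minimal sketch (not part of the commit) that pins fp32 via the simpler `from_pretrained` route from the pre-commit README (the card recommends the manual safetensors path instead) and asserts the logits are finite. The repo id is taken from the diff; everything else is standard `transformers` API. Note that the committed example passes `temperature`/`top_p` without `do_sample=True`, so `generate` would fall back to greedy decoding; the sketch sets it explicitly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "HuggingFaceTB/nanowhale-100m-base"  # repo id from the diff

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,     # custom architecture, per the limitations list
    torch_dtype=torch.float32,  # keep fp32: the card reports NaNs under bf16
).eval()

# Forward pass sanity check: fp32 weights should yield finite logits.
inputs = tokenizer("The meaning of life is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
assert torch.isfinite(logits).all(), "non-finite logits: check that weights stayed fp32"

# Sampling knobs only take effect with do_sample=True; without it,
# generate() ignores temperature/top_p and decodes greedily.
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```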